Streaming WebSocket API

Real-time, low-latency speech recognition over a WebSocket connection. Designed for interactive applications such as voice assistants, live captioning, and real-time transcription.

Endpoint: /api/v1/speech-services/stream

Requires Authentication - Scopes: speech:sessions:write

Rate Limited - This endpoint enforces stricter rate limits

Concurrency Limited - This endpoint has a limit on simultaneous active sessions

Bandwidth Throttled - This endpoint enforces stricter bandwidth limitations

Authentication

Authentication is performed within the session.start JSON message, not during the WebSocket upgrade handshake. This allows browser-based clients, which cannot set custom headers during the upgrade, to use the API.

The authentication object in the session.start message must contain one of:

token - A JWT Bearer token (the same token used for REST API calls).
apiKey - An API key value (e.g., kk-...).

Authentication Object Example

{
  "authentication": {
    "token": "eyJhbGciOiJSUzI1NiIs..."
  }
}

If authentication fails, the server responds with an error message (code UNAUTHENTICATED) and closes the connection.
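
As a minimal sketch of the rule above, the helper below builds the authentication object from either credential. The name `buildAuthentication` is hypothetical, and rejecting a call that supplies both credentials is an assumption; the API only states that one of them must be present.

```javascript
// Hypothetical helper: builds the `authentication` object for session.start.
// Assumption: exactly one of `token` or `apiKey` is supplied.
function buildAuthentication({ token, apiKey } = {}) {
  const hasToken = typeof token === 'string' && token.length > 0;
  const hasApiKey = typeof apiKey === 'string' && apiKey.length > 0;
  if (hasToken === hasApiKey) {
    throw new Error('Provide exactly one of "token" or "apiKey".');
  }
  return hasToken ? { token } : { apiKey };
}
```

The returned object can be placed directly in the `authentication` field of session.start.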

Protocol Lifecycle

The communication follows a strictly ordered sequence of text (JSON) and binary frames:

1. Handshake: Client establishes the WebSocket connection.
2. Session Initialization: Client sends a session.start JSON message.
3. Session Confirmation: Server validates credentials and responds with a session.info JSON message.
4. Audio Streaming: Client sends binary frames containing raw PCM audio data.
5. Real-time Results: Server sends result.partial and result.final JSON messages as audio is processed.
6. Termination: Client signals the end of audio with a session.end JSON message.
7. Completion: Server sends a session.completed JSON message and closes the connection.
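
One way to enforce this ordering on the client is a small state tracker. The sketch below is illustrative only: the phase names, `nextPhase`, and `canSendAudio` are hypothetical, and treating any error message as terminal is an assumption.

```javascript
// Hypothetical client-side tracker for the protocol phases listed above.
// The phase names are illustrative and are not part of the API.
const TRANSITIONS = {
  connecting: { open: 'awaitingInfo' },          // socket opened, session.start sent
  awaitingInfo: { 'session.info': 'streaming' }, // server confirmed the session
  streaming: { 'session.end': 'draining' },      // client finished sending audio
  draining: { 'session.completed': 'closed' },   // server processed remaining audio
};

function nextPhase(phase, event) {
  // Assumption: an error message ends the session from any phase.
  if (event === 'error') return 'closed';
  // Results do not change the phase; they arrive while streaming or draining.
  if (event === 'result.partial' || event === 'result.final') {
    if (phase === 'streaming' || phase === 'draining') return phase;
  }
  const next = TRANSITIONS[phase] && TRANSITIONS[phase][event];
  if (!next) throw new Error(`Unexpected "${event}" while ${phase}`);
  return next;
}

// Audio frames are only valid between session.info and session.end.
function canSendAudio(phase) {
  return phase === 'streaming';
}
```

Feeding each sent or received message type through `nextPhase` surfaces out-of-order messages early, and `canSendAudio` guards against sending binary frames before session.info arrives.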

Client Messages

Messages sent from the client to the server must be valid JSON text frames, except for audio data, which is sent as binary frames.

`session.start`

The first message sent by the client to initialize the session.

Field	Type	Required	Description
`type`	`string`	Yes	Must be `"session.start"`.
`model`	`string`	Yes	Model alias (e.g., `general`, `medical`).
`authentication`	`object`	Yes	Authentication credentials.
`config`	`object`	No	Session configuration.

Example: session.start

{
  "type": "session.start",
  "model": "general",
  "authentication": {
    "apiKey": "kk-abc123..."
  },
  "config": {
    "language": "he-IL",
    "enableEndpointDetection": true,
    "recordAudio": true,
    "metadata": {
      "department": "radiology",
      "externalId": "case-4821"
    }
  }
}
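
The required fields in the table above can be checked before the message is sent. The following is a sketch under that reading; `buildSessionStart` is a hypothetical helper, not part of any SDK.

```javascript
// Hypothetical builder that enforces the required fields of session.start.
function buildSessionStart({ model, authentication, config } = {}) {
  if (typeof model !== 'string' || model === '') {
    throw new Error('"model" is required');
  }
  if (authentication === null || typeof authentication !== 'object') {
    throw new Error('"authentication" is required');
  }
  const message = { type: 'session.start', model, authentication };
  // `config` is optional, so only include it when the caller provides one.
  if (config !== undefined) message.config = config;
  // The result is a JSON string, ready to be sent as a text frame.
  return JSON.stringify(message);
}
```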

`session.end`

Signals that the client has finished streaming audio. After sending this, the client should not send any more audio frames.

Field	Type	Required	Description
`type`	`string`	Yes	Must be `"session.end"`.

Example: session.end

{
  "type": "session.end"
}

Audio Frames (Binary)

After receiving a session.info message, the client should send the audio data as binary frames.

Format: Raw PCM 16-bit Mono.
Sample Rate: Typically 16 kHz (must match the rate the model expects).
Frame Size: 100 ms to 250 ms of audio per frame is recommended.
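
To make the frame sizes concrete, the sketch below computes the byte size of a frame and prepares audio for sending. It assumes 16 kHz, 16-bit, mono audio in the platform's native byte order (little-endian on common hardware); the byte order is not stated by the API, and all function names are hypothetical.

```javascript
// Assumptions: 16 kHz sample rate, 16-bit samples, one channel.
const SAMPLE_RATE = 16000;
const BYTES_PER_SAMPLE = 2;

// Bytes needed for `ms` milliseconds of audio.
function frameBytes(ms) {
  return Math.round((SAMPLE_RATE * ms) / 1000) * BYTES_PER_SAMPLE;
}

// Converts Float32 samples in [-1, 1] (the Web Audio format) to 16-bit PCM.
function floatTo16BitPcm(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Splits PCM samples into frames of `ms` milliseconds; the last may be shorter.
function splitIntoFrames(pcm, ms) {
  const samplesPerFrame = frameBytes(ms) / BYTES_PER_SAMPLE;
  const frames = [];
  for (let i = 0; i < pcm.length; i += samplesPerFrame) {
    frames.push(pcm.subarray(i, i + samplesPerFrame));
  }
  return frames;
}
```

Under these assumptions a 100 ms frame is 3,200 bytes and a 250 ms frame is 8,000 bytes. Each element returned by `splitIntoFrames` is a typed-array view that can be passed to `socket.send()` as one binary frame.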

Server Messages

Messages sent from the server to the client are JSON text frames.

`session.info`

Sent once after a successful session.start.

Field	Type	Description
`type`	`string`	Always `"session.info"`.
`sessionId`	`string (UUID)`	Unique identifier for the streaming session.
`model`	`string`	The resolved model used for the session.

`result.partial`

Contains interim transcription results as the user is speaking. These results may change as more context is gathered.

Field	Type	Description
`type`	`string`	Always `"result.partial"`.
`text`	`string`	The current transcript of the ongoing segment.
`startTime`	`double`	Start time of the segment in seconds.
`endTime`	`double`	Current end time of the partial segment in seconds.

`result.final`

Sent when the server determines a segment of speech is complete (e.g., after a pause).

Field	Type	Description
`type`	`string`	Always `"result.final"`.
`text`	`string`	The final transcript of the segment.
`startTime`	`double`	Start time of the segment in seconds.
`endTime`	`double`	End time of the segment in seconds.
`confidence`	`double`	Confidence score (0.0 to 1.0).
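
Because partial results may change, a client typically keeps finalized segments separate from the segment still in progress. The accumulator below is a minimal sketch of that idea; `createTranscript` is hypothetical, and it assumes that a result.final supersedes the pending partial.

```javascript
// Hypothetical accumulator: finals are appended, while the latest partial is
// kept separately and discarded once its segment is finalized.
function createTranscript() {
  const finals = [];
  let partial = '';
  return {
    apply(msg) {
      if (msg.type === 'result.partial') {
        partial = msg.text; // each partial replaces the previous one
      } else if (msg.type === 'result.final') {
        finals.push(msg.text);
        partial = '';
      }
    },
    // Finalized segments followed by the in-progress segment, if any.
    text() {
      return [...finals, partial].filter(Boolean).join(' ');
    },
  };
}
```

Calling `apply` from the message handler and rendering `text()` after each message keeps a live caption stable while the current segment is still being revised.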

`session.completed`

Sent after the client sends session.end and the server finishes processing all remaining audio.

Field	Type	Description
`type`	`string`	Always `"session.completed"`.
`durationSeconds`	`double`	Total duration of the processed audio.
`wordsCount`	`integer`	Total number of words transcribed.

`error`

Sent when a protocol or processing error occurs. The server usually closes the connection after sending an error.

Field	Type	Description
`type`	`string`	Always `"error"`.
`code`	`string`	Machine-readable error code (e.g., `INVALID_CONFIG`, `QUOTA_EXCEEDED`).
`message`	`string`	Human-readable description of the error.
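
Client code may find it convenient to turn an error message into a regular exception object that keeps the machine-readable code. This is a sketch only; `SpeechStreamError` and `toStreamError` are hypothetical names.

```javascript
// Hypothetical error type carrying the machine-readable code.
class SpeechStreamError extends Error {
  constructor(code, message) {
    super(message);
    this.name = 'SpeechStreamError';
    this.code = code;
  }
}

// Returns a SpeechStreamError for an `error` message, or null otherwise.
function toStreamError(msg) {
  if (!msg || msg.type !== 'error') return null;
  return new SpeechStreamError(msg.code, msg.message);
}
```

Callers can then branch on `err.code` (for example `UNAUTHENTICATED` or `QUOTA_EXCEEDED`) instead of parsing the human-readable text.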

Data Models

`SpeechServiceStreamingConfig`

Configuration parameters for the streaming session.

Field	Type	Default	Description
`language`	`string`	-	Target language code (e.g., `en-US`, `he-IL`).
`enableEndpointDetection`	`boolean`	`true`	Automatically detect silence and finalize segments.
`recordAudio`	`boolean`	`false`	Whether to store a recording of the session for later playback.
`metadata`	`object`	-	Custom key-value pairs to associate with the session.
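
The defaults in the table can be mirrored on the client to see the effective configuration of a session. This is a sketch: `CONFIG_DEFAULTS` and `withDefaults` are hypothetical, and it assumes the server fills in omitted fields exactly as listed.

```javascript
// Defaults taken from the table above; `language` and `metadata` have none.
const CONFIG_DEFAULTS = { enableEndpointDetection: true, recordAudio: false };

// Returns the configuration as it would look once defaults are applied.
// Values supplied by the caller take precedence over the defaults.
function withDefaults(config = {}) {
  return { ...CONFIG_DEFAULTS, ...config };
}
```

For example, `withDefaults({ language: 'en-US', recordAudio: true })` keeps endpoint detection enabled and turns recording on.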

Example Usage

JavaScript (Browser)

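const socket = new WebSocket('wss://koldan.dixilang.com/api/v1/speech-services/stream');

socket.onopen = () => {
  // 1. Start the session; authentication travels inside this message
  socket.send(JSON.stringify({
    type: 'session.start',
    model: 'general',
    authentication: { apiKey: 'kk-...' },
    config: { language: 'he-IL' }
  }));
};

socket.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  if (msg.type === 'session.info') {
    console.log('Session started:', msg.sessionId);
    // 2. Now start sending binary audio data, then send session.end
    // sendAudioData(socket);
  } else if (msg.type === 'result.partial') {
    console.log('Interim:', msg.text);
  } else if (msg.type === 'result.final') {
    console.log('Final:', msg.text);
  } else if (msg.type === 'session.completed') {
    console.log('Completed:', msg.durationSeconds, 'seconds,', msg.wordsCount, 'words');
  } else if (msg.type === 'error') {
    console.error(`Error ${msg.code}: ${msg.message}`);
  }
};

socket.onclose = (event) => {
  console.log('Connection closed:', event.code);
};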
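
The `sendAudioData` placeholder in the example could be filled in along the following lines. This is a sketch for audio that is already in memory, not the official client: the extra `pcm` and `samplesPerFrame` parameters are assumptions, and `pcm` is expected to be an Int16Array of 16 kHz mono samples.

```javascript
// Hypothetical implementation of the `sendAudioData` placeholder.
// `socket` is any object with a WebSocket-style send() method.
function sendAudioData(socket, pcm, samplesPerFrame = 1600) {
  // 1600 samples at 16 kHz is 100 ms of audio per binary frame.
  for (let i = 0; i < pcm.length; i += samplesPerFrame) {
    socket.send(pcm.subarray(i, i + samplesPerFrame));
  }
  // 3. Tell the server that no more audio will follow.
  socket.send(JSON.stringify({ type: 'session.end' }));
}
```

With live microphone input, frames would instead be sent as they are captured, and session.end only once capture stops. Because the endpoint is bandwidth throttled, a client sending pre-recorded audio may also want to pace its frames, for example by checking `socket.bufferedAmount` before each send.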

Streaming Fundamentals: Learn about the concepts, lifecycle, and best practices of streaming.
Streaming Sessions API: Manage and retrieve results from completed streaming sessions via REST.
Transcription API: REST API for pre-recorded file transcriptions.
