Streaming WebSocket API

Real-time, low-latency speech recognition over a WebSocket connection. Designed for interactive applications such as voice assistants, live captioning, and real-time transcription.


Endpoint: /api/v1/speech-services/stream

Requires Authentication - Scopes: speech:sessions:write

Rate Limited - This endpoint enforces stricter rate limits

Concurrency Limited - This endpoint has a limit on simultaneous active sessions

Bandwidth Throttled - This endpoint enforces stricter bandwidth limitations


Authentication

Authentication is performed within the session.start JSON message, not during the WebSocket upgrade handshake. This allows browser-based clients, which cannot set custom headers during the upgrade, to use the API.

The authentication object in the session.start message must contain one of:

  • token - A JWT Bearer token (the same token used for REST API calls).
  • apiKey - An API key value (e.g., kk-...).
Authentication Object Example
{
  "authentication": {
    "token": "eyJhbGciOiJSUzI1NiIs..."
  }
}

If authentication fails, the server responds with an error message (code UNAUTHENTICATED) and closes the connection.


Protocol Lifecycle

The communication follows a strictly ordered sequence of text (JSON) and binary frames:

  1. Handshake: Client establishes the WebSocket connection.
  2. Session Initialization: Client sends a session.start JSON message.
  3. Session Confirmation: Server validates credentials and responds with a session.info JSON message.
  4. Audio Streaming: Client sends binary frames containing raw PCM audio data.
  5. Real-time Results: Server sends result.partial and result.final JSON messages as audio is processed.
  6. Termination: Client signals the end of audio with a session.end JSON message.
  7. Completion: Server sends a session.completed JSON message and closes the connection.

Client Messages

All client messages are JSON text frames, with the exception of audio data, which is sent as binary frames.

session.start

The first message sent by the client to initialize the session.

  • type (string, required): Must be "session.start".
  • model (string, required): Model alias (e.g., general, medical).
  • authentication (object, required): Authentication credentials.
  • config (object, optional): Session configuration.
Example: session.start
{
  "type": "session.start",
  "model": "general",
  "authentication": {
    "apiKey": "kk-abc123..."
  },
  "config": {
    "language": "he-IL",
    "enableEndpointDetection": true,
    "recordAudio": true,
    "metadata": {
      "department": "radiology",
      "externalId": "case-4821"
    }
  }
}

session.end

Signals that the client has finished streaming audio. After sending this, the client should not send any more audio frames.

  • type (string, required): Must be "session.end".
Example: session.end
{
  "type": "session.end"
}
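Combining session.end with the session.completed message described below, a graceful shutdown can be sketched as a small promise-based helper. This is an illustrative pattern, not part of the API; endSession is a hypothetical name, and the message types are the ones defined on this page:

```javascript
// Hypothetical helper: signal end of audio and wait for the server to finish.
// Assumes `socket` follows this page's protocol (session.end -> session.completed).
function endSession(socket, { timeoutMs = 10000 } = {}) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error('Timed out waiting for session.completed')),
      timeoutMs
    );
    const prior = socket.onmessage;
    socket.onmessage = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === 'session.completed') {
        clearTimeout(timer);
        resolve(msg); // carries durationSeconds and wordsCount
      } else if (msg.type === 'error') {
        clearTimeout(timer);
        reject(new Error(`${msg.code}: ${msg.message}`));
      } else if (prior) {
        prior(event); // let the earlier handler see remaining results
      }
    };
    // Signal end of audio; no more binary frames may be sent after this.
    socket.send(JSON.stringify({ type: 'session.end' }));
  });
}
```

The timeout guards against a server that never delivers session.completed, e.g. after a dropped connection.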

Audio Frames (Binary)

After receiving a session.info message, the client should send the audio data as binary frames.

  • Format: Raw PCM, 16-bit, mono.
  • Sample Rate: Typically 16 kHz (matching the model's expectation).
  • Frame Size: 100-250 ms of audio per frame is recommended.
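In the browser, audio typically arrives as Float32 samples from the Web Audio API and must be converted to 16-bit PCM before being sent as binary frames. A minimal sketch, where floatTo16BitPCM and bytesPerFrame are illustrative helpers rather than part of the API:

```javascript
// Convert Float32 samples in [-1, 1] (e.g., from the Web Audio API)
// to little-endian 16-bit signed PCM, matching the raw PCM 16-bit mono format.
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp to [-1, 1]
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  return buffer;
}

// Size of one binary frame for a given duration: samples * 2 bytes per sample.
function bytesPerFrame(durationMs, sampleRate = 16000) {
  return Math.round((durationMs / 1000) * sampleRate) * 2;
}

// A 100 ms frame at 16 kHz mono is 3200 bytes:
// socket.send(floatTo16BitPCM(chunk));
```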

Server Messages

Messages sent from the server to the client are JSON text frames.

session.info

Sent once after a successful session.start.

  • type (string): Must be "session.info".
  • sessionId (string, UUID): Unique identifier for the streaming session.
  • model (string): The resolved model used for the session.

result.partial

Contains interim transcription results as the user is speaking. These results may change as more context is gathered.

  • type (string): Must be "result.partial".
  • text (string): The current transcript of the ongoing segment.
  • startTime (double): Start time of the segment in seconds.
  • endTime (double): Current end time of the partial segment in seconds.

result.final

Sent when the server determines a segment of speech is complete (e.g., after a pause).

  • type (string): Must be "result.final".
  • text (string): The final transcript of the segment.
  • startTime (double): Start time of the segment in seconds.
  • endTime (double): End time of the segment in seconds.
  • confidence (double): Confidence score (0.0 to 1.0).
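For live captioning, a common pattern is to keep the text committed by result.final messages and overlay the latest result.partial on top, since partials may change as more context arrives. A small sketch under that assumption; applyResult and renderCaption are hypothetical helpers:

```javascript
// Track transcript state: finalized segments plus the latest interim text.
// A result.partial overwrites the interim; a result.final commits the segment.
function applyResult(state, msg) {
  if (msg.type === 'result.partial') {
    return { ...state, interim: msg.text };
  }
  if (msg.type === 'result.final') {
    return { committed: [...state.committed, msg.text], interim: '' };
  }
  return state; // ignore other message types
}

// The full caption to display at any moment.
function renderCaption(state) {
  return [...state.committed, state.interim].filter(Boolean).join(' ');
}

let state = { committed: [], interim: '' };
state = applyResult(state, { type: 'result.partial', text: 'hello wor' });
state = applyResult(state, { type: 'result.partial', text: 'hello world' });
state = applyResult(state, { type: 'result.final', text: 'hello world.' });
// renderCaption(state) now returns the finalized text only.
```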

session.completed

Sent after the client sends session.end and the server finishes processing all remaining audio.

  • type (string): Must be "session.completed".
  • durationSeconds (double): Total duration of the processed audio.
  • wordsCount (integer): Total number of words transcribed.

error

Sent when a protocol or processing error occurs. The server usually closes the connection after sending an error.

  • type (string): Must be "error".
  • code (string): Machine-readable error code (e.g., INVALID_CONFIG, QUOTA_EXCEEDED).
  • message (string): Human-readable description of the error.
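Because the server usually closes the connection after sending an error, clients can treat it as terminal for the session. One way to centralize message handling is a small dispatcher; handleServerMessage and the handlers object are illustrative, while the message types are the ones listed above:

```javascript
// Dispatch a server text frame to per-type handlers. Unknown types are
// ignored so that newer server message types don't break older clients.
function handleServerMessage(raw, handlers) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    throw new Error('Server sent a non-JSON text frame');
  }
  const handler = handlers[msg.type];
  if (handler) handler(msg);
  return msg;
}

// Example wiring (handler names are placeholders):
// socket.onmessage = (event) => handleServerMessage(event.data, {
//   'result.partial': (m) => updateInterim(m.text),
//   'result.final': (m) => commitSegment(m.text),
//   'error': (m) => console.error(`${m.code}: ${m.message}`), // connection will close
// });
```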

Data Models

SpeechServiceStreamingConfig

Configuration parameters for the streaming session.

  • language (string, default: none): Target language code (e.g., en-US, he-IL).
  • enableEndpointDetection (boolean, default: true): Automatically detect silence and finalize segments.
  • recordAudio (boolean, default: false): Whether to store a recording of the session for later playback.
  • metadata (object, default: none): Custom key-value pairs to associate with the session.
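The defaults above are applied server-side when a field is omitted, so a client can leave config out entirely unless it needs to override something. A sketch of assembling the session.start message this way; buildSessionStart is a hypothetical helper:

```javascript
// Build a session.start message; the `config` key is omitted entirely when no
// options are given, letting the server apply its documented defaults
// (enableEndpointDetection: true, recordAudio: false).
function buildSessionStart(model, authentication, config = {}) {
  const msg = { type: 'session.start', model, authentication };
  if (Object.keys(config).length > 0) msg.config = config;
  return msg;
}

// buildSessionStart('general', { apiKey: 'kk-...' }, { language: 'he-IL' })
```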

Example Usage

JavaScript (Browser)

const socket = new WebSocket('wss://koldan.dixilang.com/api/v1/speech-services/stream');

socket.onopen = () => {
  // 1. Start the session
  socket.send(JSON.stringify({
    type: 'session.start',
    model: 'general',
    authentication: { apiKey: 'kk-...' },
    config: { language: 'he-IL' }
  }));
};

socket.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  if (msg.type === 'session.info') {
    console.log('Session started:', msg.sessionId);
    // 2. Now start sending binary audio data
    // sendAudioData(socket);
  } else if (msg.type === 'result.partial') {
    console.log('Interim:', msg.text);
  } else if (msg.type === 'result.final') {
    console.log('Final:', msg.text);
  } else if (msg.type === 'session.completed') {
    console.log('Done:', msg.durationSeconds, 's,', msg.wordsCount, 'words');
  } else if (msg.type === 'error') {
    console.error('Error:', msg.code, msg.message);
  }
};