
Streaming WebSocket API

Real-time, low-latency speech recognition over a WebSocket connection. Designed for interactive applications such as voice assistants, live captioning, and real-time transcription.


Endpoint: /api/v1/speech-services/stream

Requires Authentication - Scopes: speech:sessions:write

Rate Limited - This endpoint enforces stricter rate limits

Concurrency Limited - This endpoint has a limit on simultaneous active sessions

Bandwidth Throttled - This endpoint enforces stricter bandwidth limitations


Authentication

Authentication is performed within the session.start JSON message, not during the WebSocket upgrade handshake. This lets browser-based clients, which cannot set custom headers during the upgrade, use the API.

The authentication object in the session.start message must contain one of:

  • token - A JWT Bearer token (the same token used for REST API calls).
  • apiKey - An API key value (e.g., kk-...).

Authentication Object Example
{
  "authentication": {
    "token": "eyJhbGciOiJSUzI1NiIs..."
  }
}

If authentication fails, the server responds with an error message (code UNAUTHENTICATED) and closes the connection.
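
As an illustration of the rule above, here is a minimal sketch of a client-side helper that builds the authentication object. The name buildAuthentication is hypothetical, and rejecting a call that supplies both credentials is an assumption: the text only says the object must contain one of the two fields.

```javascript
// Hypothetical helper: builds the `authentication` object for session.start.
// Assumption: exactly one credential is supplied (the docs only say "one of").
function buildAuthentication({ token, apiKey } = {}) {
  const hasToken = typeof token === 'string' && token.length > 0;
  const hasApiKey = typeof apiKey === 'string' && apiKey.length > 0;

  // Reject both "neither" and "both" so mistakes surface before connecting.
  if (hasToken === hasApiKey) {
    throw new Error('Provide either token or apiKey (exactly one).');
  }
  return hasToken ? { token } : { apiKey };
}
```

The returned object can be placed directly under the authentication key of the session.start message.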


Protocol Lifecycle

The communication follows a strictly ordered sequence of text (JSON) and binary frames:

  1. Handshake: Client establishes the WebSocket connection.
  2. Session Initialization: Client sends a session.start JSON message.
  3. Session Confirmation: Server validates credentials and responds with a session.info JSON message.
  4. Audio Streaming: Client sends binary frames containing raw PCM audio data.
  5. Real-time Results: Server sends result.partial and result.final JSON messages as audio is processed.
  6. Termination: Client signals the end of audio with a session.end JSON message.
  7. Completion: Server sends a session.completed JSON message and closes the connection.
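
The ordering above can be sketched as a small client-side state machine. The state and event names below are illustrative and not part of the API. The sketch assumes that results may still arrive between session.end and session.completed (the server finishes processing remaining audio first), and it does not model error messages.

```javascript
// Illustrative lifecycle table: state -> { event -> next state }.
// "send:" events are client actions, "recv:" events are server messages.
const TRANSITIONS = {
  connecting: { open: 'connected' },
  connected: { 'send:session.start': 'starting' },
  starting: { 'recv:session.info': 'streaming' },
  streaming: {
    'send:audio': 'streaming',
    'recv:result.partial': 'streaming',
    'recv:result.final': 'streaming',
    'send:session.end': 'ending',
  },
  // Assumption: trailing results can arrive while the server drains audio.
  ending: {
    'recv:result.partial': 'ending',
    'recv:result.final': 'ending',
    'recv:session.completed': 'completed',
  },
  completed: {},
};

// Returns the next state, or throws if the event is out of order.
function nextState(state, event) {
  const allowed = TRANSITIONS[state];
  if (!allowed || !Object.prototype.hasOwnProperty.call(allowed, event)) {
    throw new Error(`Event "${event}" is not allowed in state "${state}"`);
  }
  return allowed[event];
}
```

A client could call nextState before every send and after every received message, so that sending audio before session.info, for example, fails locally rather than producing a protocol error.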

Client Messages

Messages sent from the client to the server must be valid JSON text frames, except for audio data, which is sent as binary frames.

session.start

The first message sent by the client to initialize the session.

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "session.start". |
| model | string | Yes | Model alias (e.g., general, medical). |
| authentication | object | Yes | Authentication credentials. |
| config | object | No | Session configuration. |

Example: session.start
{
  "type": "session.start",
  "model": "general",
  "authentication": {
    "apiKey": "kk-abc123..."
  },
  "config": {
    "language": "he-IL",
    "enableEndpointDetection": true,
    "recordAudio": true,
    "diarization": {
      "enabled": true,
      "maxSpeakers": 4
    },
    "metadata": {
      "department": "radiology",
      "externalId": "case-4821"
    }
  }
}

session.end

Signals that the client has finished streaming audio. After sending this, the client should not send any more audio frames.

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "session.end". |

Example: session.end
{
  "type": "session.end"
}

Audio Frames (Binary)

After receiving a session.info message, the client should send the audio data as binary frames.

  • Format: Raw 16-bit mono PCM.
  • Sample Rate: Typically 16 kHz (matching the model's expectation).
  • Frame Size: 100 ms to 250 ms of audio per frame is recommended.
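
To make the frame-size guidance concrete: at 16 kHz, 16-bit mono audio occupies 2 bytes per sample, so a 100 ms frame is 3,200 bytes and a 250 ms frame is 8,000 bytes. The sketch below shows one way to prepare frames in a browser, where audio APIs typically deliver Float32 samples. The function names are hypothetical, and little-endian byte order is an assumption, since the byte order is not stated here.

```javascript
// Bytes in one frame of 16-bit mono PCM at the given rate and duration.
function frameSizeBytes(sampleRateHz, frameMs) {
  const bytesPerSample = 2; // 16-bit, one channel
  return Math.round((sampleRateHz * frameMs) / 1000) * bytesPerSample;
}

// Converts Float32 samples in [-1, 1] to 16-bit PCM.
// Assumption: little-endian byte order (not specified by the docs).
function floatTo16BitPcm(samples) {
  const view = new DataView(new ArrayBuffer(samples.length * 2));
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return view.buffer;
}

// Splits a PCM ArrayBuffer into frames; the last frame may be shorter.
function splitIntoFrames(pcmBuffer, frameBytes) {
  const frames = [];
  for (let offset = 0; offset < pcmBuffer.byteLength; offset += frameBytes) {
    frames.push(pcmBuffer.slice(offset, offset + frameBytes));
  }
  return frames;
}
```

Each resulting ArrayBuffer can be passed to socket.send(), which transmits it as a binary frame.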

Server Messages

Messages sent from the server to the client are JSON text frames.

session.info

Sent once after a successful session.start.

| Field | Type | Description |
|---|---|---|
| type | string | Always "session.info". |
| sessionId | string (UUID) | Unique identifier for the streaming session. |
| resolvedModel | string | The resolved model used for the session. |
| usedFallback | boolean | Whether model alias resolution fell back to a default model. |

result.partial

Contains interim transcription results as the user is speaking. These results may change as more context is gathered.

| Field | Type | Description |
|---|---|---|
| type | string | Always "result.partial". |
| transcript | string | The current transcript of the ongoing segment. |
| words | WordTiming[] | Word timings for the current partial result, including optional speakerTag when diarization is active. |

result.final

Sent when the server determines a segment of speech is complete (e.g., after a pause).

| Field | Type | Description |
|---|---|---|
| type | string | Always "result.final". |
| segmentIndex | integer | Stable zero-based segment index. If the same index is sent again, clients should treat it as a revision of an earlier final result rather than a new segment. |
| transcript | string | The final transcript of the segment. |
| confidence | double | Confidence score (0.0 to 1.0). |
| words | WordTiming[] | Word timings for the final result, including optional speakerTag when diarization is active. |
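
Because a repeated segmentIndex revises an earlier result, clients are better off keying final results by index than appending them to a list. The following is a minimal sketch of that idea. The TranscriptStore class is hypothetical, and clearing the interim text only when a new segment is finalized is an assumption about how partial and final results interleave.

```javascript
// Hypothetical store that merges partial and final results into display text.
class TranscriptStore {
  constructor() {
    this.finals = new Map(); // segmentIndex -> transcript
    this.partial = '';       // interim text for the ongoing segment
  }

  apply(msg) {
    if (msg.type === 'result.partial') {
      this.partial = msg.transcript;
    } else if (msg.type === 'result.final') {
      const isRevision = this.finals.has(msg.segmentIndex);
      this.finals.set(msg.segmentIndex, msg.transcript); // insert or overwrite
      // Assumption: a new final supersedes the current partial; a revision
      // of an older segment leaves the ongoing partial untouched.
      if (!isRevision) this.partial = '';
    }
  }

  // Finalized segments in index order, followed by any interim text.
  text() {
    const parts = [...this.finals.keys()]
      .sort((a, b) => a - b)
      .map((index) => this.finals.get(index));
    if (this.partial) parts.push(this.partial);
    return parts.join(' ');
  }
}
```

Calling store.apply(msg) for every received message and rendering store.text() keeps the display consistent even when an earlier segment is revised.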

session.completed

Sent after the client sends session.end and the server finishes processing all remaining audio.

| Field | Type | Description |
|---|---|---|
| type | string | Always "session.completed". |
| sessionId | string (UUID) | Unique identifier of the completed session. |
| totalSegments | integer | Total number of finalized segments. |
| totalWords | integer | Total number of words transcribed. |

error

Sent when a protocol or processing error occurs. The server usually closes the connection after sending an error.

| Field | Type | Description |
|---|---|---|
| type | string | Always "error". |
| code | string | Machine-readable error code (e.g., INVALID_CONFIG, QUOTA_EXCEEDED). |
| message | string | Human-readable description of the error. |
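
One way to surface these messages in client code is to convert them into Error objects that keep the machine-readable code. This is a sketch only. SpeechStreamError and errorFromMessage are hypothetical names, not part of any published SDK.

```javascript
// Hypothetical error type that preserves the server's machine-readable code.
class SpeechStreamError extends Error {
  constructor(code, message) {
    super(message);
    this.name = 'SpeechStreamError';
    this.code = code; // e.g. 'UNAUTHENTICATED', 'INVALID_CONFIG'
  }
}

// Returns a SpeechStreamError for error messages, or null for anything else.
function errorFromMessage(msg) {
  if (!msg || msg.type !== 'error') return null;
  return new SpeechStreamError(msg.code, msg.message);
}
```

Since the server usually closes the connection after an error, a client would typically record the error first and then handle the close event, branching on err.code rather than parsing the human-readable message.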

Data Models

SpeechServiceStreamingConfig

Configuration parameters for the streaming session.

| Field | Type | Default | Description |
|---|---|---|---|
| language | string | - | Target language code (e.g., en-US, he-IL). |
| enableEndpointDetection | boolean | true | Automatically detect silence and finalize segments. |
| recordAudio | boolean | false | Whether to store a recording of the session for later playback. |
| diarization | object | - | Optional online speaker diarization settings. |
| metadata | object | - | Custom key-value pairs to associate with the session. |

diarization fields:

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Whether to enable online speaker diarization for the session. |
| maxSpeakers | integer | - | Optional maximum expected number of speakers. Must be greater than 0 when provided. |
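
The field types and the maxSpeakers constraint above can be checked on the client before sending session.start. The sketch below is a hypothetical pre-flight check, assuming only the rules listed in these tables. The server's own validation (reported as INVALID_CONFIG) may be stricter or differ in detail.

```javascript
// Hypothetical client-side check; returns a list of problems (empty if none).
function validateStreamingConfig(config = {}) {
  const problems = [];

  // Records a problem when a field is present but has the wrong type.
  const expectType = (obj, key, type, path) => {
    if (obj[key] !== undefined && typeof obj[key] !== type) {
      problems.push(`${path} must be a ${type}`);
    }
  };

  expectType(config, 'language', 'string', 'language');
  expectType(config, 'enableEndpointDetection', 'boolean', 'enableEndpointDetection');
  expectType(config, 'recordAudio', 'boolean', 'recordAudio');
  expectType(config, 'metadata', 'object', 'metadata');

  const diarization = config.diarization;
  if (diarization !== undefined) {
    if (typeof diarization !== 'object' || diarization === null) {
      problems.push('diarization must be an object');
    } else {
      expectType(diarization, 'enabled', 'boolean', 'diarization.enabled');
      const max = diarization.maxSpeakers;
      if (max !== undefined && !(Number.isInteger(max) && max > 0)) {
        problems.push('diarization.maxSpeakers must be an integer greater than 0');
      }
    }
  }
  return problems;
}
```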

WordTiming fields:

| Field | Type | Description |
|---|---|---|
| word | string | The recognized word token. |
| startSeconds | double | Word start time in seconds. |
| endSeconds | double | Word end time in seconds. |
| confidence | double | Word-level confidence score. |
| speakerTag | integer | Optional speaker tag assigned by online diarization. |
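
When diarization is active, the speakerTag on each word can be used to render a transcript as speaker turns. The following is a minimal sketch. The groupBySpeaker function is hypothetical and simply merges consecutive words that share a tag. Words without a tag (diarization inactive) end up in a single turn whose speakerTag is undefined.

```javascript
// Hypothetical helper: merges consecutive WordTiming entries with the same
// speakerTag into turns of the form { speakerTag, text, startSeconds, endSeconds }.
function groupBySpeaker(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speakerTag === w.speakerTag) {
      // Same speaker continues: extend the current turn.
      last.text += ' ' + w.word;
      last.endSeconds = w.endSeconds;
    } else {
      // Speaker changed (or first word): start a new turn.
      turns.push({
        speakerTag: w.speakerTag,
        text: w.word,
        startSeconds: w.startSeconds,
        endSeconds: w.endSeconds,
      });
    }
  }
  return turns;
}
```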

Example Usage

JavaScript (Browser)

const socket = new WebSocket('wss://koldan.dixilang.com/api/v1/speech-services/stream');

socket.onopen = () => {
  // 1. Start the session
  socket.send(JSON.stringify({
    type: 'session.start',
    model: 'general',
    authentication: { apiKey: 'kk-...' },
    config: { language: 'he-IL' }
  }));
};

socket.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  if (msg.type === 'session.info') {
    console.log('Session started:', msg.sessionId);
    // 2. Now start sending binary audio data
    // sendAudioData(socket); 
  } else if (msg.type === 'result.partial') {
    console.log('Interim:', msg.transcript);
  } else if (msg.type === 'result.final') {
    console.log('Final:', msg.transcript, msg.words);
  }
};