Streaming WebSocket API
Real-time, low-latency speech recognition over a WebSocket connection. Designed for interactive applications such as voice assistants, live captioning, and real-time transcription.
Endpoint: /api/v1/speech-services/stream
Requires Authentication - Scopes: speech:sessions:write
Rate Limited - This endpoint enforces stricter rate limits
Concurrency Limited - This endpoint has a limit on simultaneous active sessions
Bandwidth Throttled - This endpoint enforces stricter bandwidth limitations
Authentication
Authentication is performed within the session.start JSON message - not during the WebSocket upgrade handshake. This allows browser-based clients that cannot set custom headers during the upgrade to use the API.
The authentication object in the session.start message must contain one of:
- `token` - A JWT Bearer token (the same token used for REST API calls).
- `apiKey` - An API key value (e.g., `kk-...`).
If authentication fails, the server responds with an error message (code UNAUTHENTICATED) and closes the connection.
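The two accepted forms of the `authentication` object look like this; the credential values below are placeholders, not real tokens:

```json
{ "authentication": { "token": "eyJhbGciOi..." } }
```

```json
{ "authentication": { "apiKey": "kk-abc123..." } }
```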
Protocol Lifecycle
The communication follows a strictly ordered sequence of text (JSON) and binary frames:
- Handshake: Client establishes the WebSocket connection.
- Session Initialization: Client sends a `session.start` JSON message.
- Session Confirmation: Server validates credentials and responds with a `session.info` JSON message.
- Audio Streaming: Client sends binary frames containing raw PCM audio data.
- Real-time Results: Server sends `result.partial` and `result.final` JSON messages as audio is processed.
- Termination: Client signals the end of audio with a `session.end` JSON message.
- Completion: Server sends a `session.completed` JSON message and closes the connection.
Client Messages
Apart from audio data, which is sent as binary frames, all messages from the client to the server must be valid JSON text frames.
session.start
The first message sent by the client to initialize the session.
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `"session.start"`. |
| `model` | string | Yes | Model alias (e.g., `general`, `medical`). |
| `authentication` | object | Yes | Authentication credentials. |
| `config` | object | No | Session configuration. |
```json
{
  "type": "session.start",
  "model": "general",
  "authentication": {
    "apiKey": "kk-abc123..."
  },
  "config": {
    "language": "he-IL",
    "enableEndpointDetection": true,
    "recordAudio": true,
    "metadata": {
      "department": "radiology",
      "externalId": "case-4821"
    }
  }
}
```
session.end
Signals that the client has finished streaming audio. After sending this, the client should not send any more audio frames.
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `"session.end"`. |
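Since `type` is the only field, the complete message is minimal:

```json
{ "type": "session.end" }
```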
Audio Frames (Binary)
After receiving a session.info message, the client should send the audio data as binary frames.
- Format: Raw PCM 16-bit Mono.
- Sample Rate: Typically 16kHz (matching the model's expectation).
- Frame Size: Recommended 100ms - 250ms of audio per frame.
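A common audio source in the browser is the Web Audio API, which produces 32-bit float samples in the range [-1.0, 1.0]. A minimal sketch of converting those samples to the expected 16-bit PCM format and sizing frames; the function names here are illustrative, not part of this API:

```javascript
// Convert Web Audio float samples ([-1.0, 1.0]) to 16-bit signed PCM,
// the format expected by the binary audio frames.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp out-of-range values, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return pcm;
}

// At 16 kHz mono, a 100 ms frame is 1600 samples (3200 bytes).
function samplesPerFrame(sampleRate, frameMs) {
  return Math.round(sampleRate * (frameMs / 1000));
}
```

Each converted chunk can then be sent as a binary frame, e.g. `socket.send(floatTo16BitPCM(chunk).buffer)`.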
Server Messages
Messages sent from the server to the client are JSON text frames.
session.info
Sent once after a successful session.start.
| Field | Type | Description |
|---|---|---|
| `type` | string | Must be `"session.info"`. |
| `sessionId` | string (UUID) | Unique identifier for the streaming session. |
| `model` | string | The resolved model used for the session. |
result.partial
Contains interim transcription results as the user is speaking. These results may change as more context is gathered.
| Field | Type | Description |
|---|---|---|
| `type` | string | Must be `"result.partial"`. |
| `text` | string | The current transcript of the ongoing segment. |
| `startTime` | double | Start time of the segment in seconds. |
| `endTime` | double | Current end time of the partial segment in seconds. |
result.final
Sent when the server determines a segment of speech is complete (e.g., after a pause).
| Field | Type | Description |
|---|---|---|
| `type` | string | Must be `"result.final"`. |
| `text` | string | The final transcript of the segment. |
| `startTime` | double | Start time of the segment in seconds. |
| `endTime` | double | End time of the segment in seconds. |
| `confidence` | double | Confidence score (0.0 to 1.0). |
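Because partial results may change, a common UI pattern is to overwrite the interim text on each `result.partial` and append only on `result.final`. A small state helper sketching this pattern; the helper name is illustrative, not part of this API:

```javascript
// Accumulates final segments and tracks the latest interim text.
function createTranscript() {
  const finals = [];
  let interim = '';
  return {
    // Feed each parsed server message to the transcript.
    handle(msg) {
      if (msg.type === 'result.partial') {
        interim = msg.text;      // overwrite: partials may change
      } else if (msg.type === 'result.final') {
        finals.push(msg.text);   // append: finals are stable
        interim = '';
      }
    },
    // Current display text: stable segments plus the live interim tail.
    text() {
      return [...finals, interim].filter(Boolean).join(' ');
    }
  };
}
```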
session.completed
Sent after the client sends session.end and the server finishes processing all remaining audio.
| Field | Type | Description |
|---|---|---|
| `type` | string | Must be `"session.completed"`. |
| `durationSeconds` | double | Total duration of the processed audio. |
| `wordsCount` | integer | Total number of words transcribed. |
error
Sent when a protocol or processing error occurs. The server usually closes the connection after sending an error.
| Field | Type | Description |
|---|---|---|
| `type` | string | Must be `"error"`. |
| `code` | string | Machine-readable error code (e.g., `INVALID_CONFIG`, `QUOTA_EXCEEDED`). |
| `message` | string | Human-readable description of the error. |
Data Models
SpeechServiceStreamingConfig
Configuration parameters for the streaming session.
| Field | Type | Default | Description |
|---|---|---|---|
| `language` | string | - | Target language code (e.g., `en-US`, `he-IL`). |
| `enableEndpointDetection` | boolean | `true` | Automatically detect silence and finalize segments. |
| `recordAudio` | boolean | `false` | Whether to store a recording of the session for later playback. |
| `metadata` | object | - | Custom key-value pairs to associate with the session. |
Example Usage
JavaScript (Browser)
```javascript
const socket = new WebSocket('wss://koldan.dixilang.com/api/v1/speech-services/stream');

socket.onopen = () => {
  // 1. Start the session
  socket.send(JSON.stringify({
    type: 'session.start',
    model: 'general',
    authentication: { apiKey: 'kk-...' },
    config: { language: 'he-IL' }
  }));
};

socket.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'session.info') {
    console.log('Session started:', msg.sessionId);
    // 2. Now start sending binary audio data
    // sendAudioData(socket);
  } else if (msg.type === 'result.partial') {
    console.log('Interim:', msg.text);
  } else if (msg.type === 'result.final') {
    console.log('Final:', msg.text);
  }
};
```
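To end a session cleanly, the client stops sending audio, sends `session.end`, and waits for `session.completed` before treating the transcript as complete. A sketch of that termination step; the helper name is illustrative, and it works with any socket-like object exposing `send` and `addEventListener`:

```javascript
// Sends session.end, then invokes onComplete once session.completed arrives.
function finishSession(socket, onComplete) {
  socket.addEventListener('message', (event) => {
    const msg = JSON.parse(event.data);
    if (msg.type === 'session.completed') {
      onComplete(msg); // carries msg.durationSeconds and msg.wordsCount
    }
  });
  socket.send(JSON.stringify({ type: 'session.end' }));
}
```

The server closes the connection after `session.completed`, so no explicit `socket.close()` is required, though the client may still call it as a safeguard.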
Related Documentation
- Streaming Fundamentals: Learn about the concepts, lifecycle, and best practices of streaming.
- Streaming Sessions API: Manage and retrieve results from completed streaming sessions via REST.
- Transcription API: REST API for pre-recorded file transcriptions.