Streaming and WebSocket Fundamentals

Koldan provides a high-performance, low-latency WebSocket API for real-time speech-to-text processing. Unlike the REST API, which is designed for pre-recorded files, the Streaming API allows you to process audio as it is being captured, making it ideal for live applications.


Why Use Streaming?

While the REST API is optimized for throughput and batch processing, the Streaming API is built for interactivity. Use streaming when you need:

  • Live Captioning: Displaying text to users in real-time as they speak.
  • Voice Assistants: Building conversational interfaces that respond immediately to user input.
  • Real-time Monitoring: Analyzing live audio feeds for specific keywords or events.
  • Low Latency: Minimizing the time between speech occurring and the transcript appearing.

How it Works

The Koldan Streaming API uses the WebSocket protocol to establish a persistent, full-duplex communication channel between your application and the Koldan servers.

The Lifecycle of a Session

A typical streaming session follows a predictable lifecycle:

  1. Connection: Your client establishes a standard WebSocket connection to the Koldan streaming endpoint.
  2. Authentication & Initialization: You send a "start" message containing your credentials (API Key or JWT) and configuration (e.g., the language and model you want to use).
  3. Streaming Audio: You send binary audio data in small chunks. Koldan processes these chunks immediately.
  4. Receiving Results: Koldan sends back transcription results as they are generated. You will receive two types of results:
    • Partial Results: Quick, intermediate guesses that may change as more audio is processed.
    • Final Results: Stable, high-accuracy transcripts for completed phrases or sentences.
  5. Termination: You send an "end" message to signal that you are finished speaking, and Koldan sends a final confirmation before closing the connection.
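The control messages in steps 2 and 5 can be sketched as small JSON builders. The exact field names (`type`, `api_key`, `config`, and so on) are assumptions for illustration, not the real schema; consult the Koldan API reference for the actual message format.

```python
import json

# Hypothetical control-message builders for the lifecycle above.
# Field names are illustrative assumptions, not the documented schema.

def build_start_message(api_key: str, language: str, model: str) -> str:
    """Step 2: sent as a text frame right after the WebSocket opens."""
    return json.dumps({
        "type": "start",
        "api_key": api_key,
        "config": {"language": language, "model": model},
    })

def build_end_message() -> str:
    """Step 5: signals that no more audio will be sent."""
    return json.dumps({"type": "end"})
```

In a real session you would send the start message immediately after connecting, stream binary audio frames (step 3) while reading results (step 4), then send the end message and wait for the server's final confirmation before closing.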

Key Concepts

Full-Duplex Communication

Unlike standard HTTP requests where the client sends a request and waits for a response, WebSockets allow both the client and the server to send messages independently at any time. This is what enables real-time feedback.
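A minimal way to picture full-duplex behaviour is two independent tasks, one sending and one receiving, with neither blocking on the other. The sketch below simulates this with `asyncio` and an in-memory queue standing in for the socket (the queue simply echoes what the sender puts on it); a real client would use a WebSocket library instead.

```python
import asyncio

# Sketch of full-duplex communication: the send and receive loops run
# as independent tasks. The queue stands in for the WebSocket stream.

async def send_audio(outgoing: asyncio.Queue) -> None:
    for chunk_id in range(3):
        await outgoing.put(f"audio-chunk-{chunk_id}")
    await outgoing.put(None)  # end-of-stream marker

async def receive_results(incoming: asyncio.Queue, results: list) -> None:
    while (msg := await incoming.get()) is not None:
        results.append(msg)

async def session() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    # Both directions run concurrently -- this is the full-duplex part.
    await asyncio.gather(
        send_audio(queue),
        receive_results(queue, results),  # echo loop, for the sketch only
    )
    return results

print(asyncio.run(session()))  # ['audio-chunk-0', 'audio-chunk-1', 'audio-chunk-2']
```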

Binary vs. Text Frames

The Streaming API uses two types of WebSocket frames:

  • Text Frames: Used for control messages (starting/ending sessions) and receiving transcription results (JSON).
  • Binary Frames: Used for sending the raw audio data (PCM) to the server.
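In most WebSocket client libraries (the Python `websockets` package, for example), text frames are delivered as `str` and binary frames as `bytes`, so the received type tells you which kind of frame you are handling. A small dispatch sketch:

```python
import json

# Dispatch on frame type: binary frames carry raw PCM audio,
# text frames carry JSON control messages or transcription results.

def handle_frame(frame) -> str:
    if isinstance(frame, bytes):
        # Binary frame: raw audio (normally only the client sends these).
        return f"audio ({len(frame)} bytes)"
    # Text frame: parse it as JSON.
    message = json.loads(frame)
    return f"message of type {message.get('type', 'unknown')!r}"
```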

Partial vs. Final Results

To provide the best user experience, Koldan returns results as soon as possible:

  • Partial results are optimized for speed. They appear quickly but might be corrected by the AI as it hears more context.
  • Final results are emitted once the AI is confident about a segment of speech. These are the "canonical" transcripts of the session.
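A common way to handle this in a UI is to append final results permanently while letting the latest partial result merely preview the in-progress phrase. The message shape (`{"type": ..., "text": ...}`) below is an assumption for illustration:

```python
# Sketch: assemble a display transcript from partial and final results.
# Finals are appended permanently; each new partial overwrites the last.

def apply_result(finals: list, latest_partial: str, msg: dict):
    if msg["type"] == "final":
        finals.append(msg["text"])
        return finals, ""              # the final supersedes the partial
    return finals, msg["text"]         # overwrite the previous partial

def render_transcript(finals: list, latest_partial: str) -> str:
    return " ".join(finals + ([latest_partial] if latest_partial else []))
```

For example, feeding the partials `"hel"` and `"hello wor"` followed by the final `"hello world"` leaves only the final text in the rendered transcript.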


Best Practices for Real-time Apps

To get the most out of Koldan's streaming capabilities, consider the following:

  • Consistent Audio Buffering: Send audio in regular, small chunks (typically 20ms to 100ms of audio per frame). Irregular or overly large chunks can lead to "jittery" transcription updates.
  • Handle Network Stability: WebSocket connections can be sensitive to network interruptions. Implement reconnection logic and handle connection-closed events gracefully in your application.
  • Use the Right Model: Different models have different latency profiles. For live use cases, ensure you are using a model optimized for real-time performance.
  • Secure Your Credentials: Although the connection is over wss:// (secure WebSocket), never hardcode API keys in client-side code (like browsers). Always use a secure backend to manage authentication or use short-lived tokens.
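For the buffering advice above, the byte size of a chunk follows directly from the audio format. A small helper, assuming 16 kHz 16-bit mono PCM (a common streaming format; adjust the parameters to match your actual capture settings):

```python
# Convert a chunk duration in milliseconds to a byte count for raw PCM.
# Defaults assume 16 kHz, 16-bit (2 bytes per sample), mono audio.

def chunk_size_bytes(duration_ms: int, sample_rate: int = 16_000,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    samples = sample_rate * duration_ms // 1000
    return samples * bytes_per_sample * channels

# 50 ms of 16 kHz 16-bit mono audio -> 1600 bytes per frame.
```

Sending fixed-size frames on a steady timer (e.g. one 50 ms chunk every 50 ms) keeps transcription updates smooth.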