Files and Transcriptions

Koldan's speech services revolve around two core resources that form a processing pipeline:

Files - media files (audio or video) that you upload or import into the platform.
Transcriptions - speech-to-text jobs that process a file and produce a text transcript.

Files

A file is the starting point of any speech processing workflow. It represents a media file (audio or video) stored in Koldan's object storage.

Uploading and Importing

There are two ways to get a file into Koldan:

Method	How It Works
Direct upload	Upload the binary file via multipart form data. The file is stored immediately and is ready for transcription.
URI import	Provide a remote URI (e.g., an S3 presigned URL or HTTPS link). Koldan downloads the file asynchronously in the background.

URI-imported files go through an ingestion process before they can be used:

Ingestion Status	Description
`PENDING`	Import request accepted, download not yet started
`DOWNLOADING`	File is being downloaded from the source URI
`COMPLETED`	File downloaded successfully and ready for use
`FAILED`	Download failed (unreachable URI, timeout, invalid format, etc.)

Directly uploaded files skip ingestion entirely - they have no ingestion status.
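
The ingestion flow for URI imports can be sketched as a simple polling loop. This is a minimal sketch, not an official client: `fetch_status` is a hypothetical callable you would supply (for example, a wrapper around whatever request returns the file's ingestion status), and the interval and timeout values are arbitrary assumptions.

```python
import time
from typing import Callable, Optional


def wait_for_ingestion(
    fetch_status: Callable[[], Optional[str]],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the file is usable, False if the import failed.

    fetch_status is a hypothetical callable supplied by the caller. It should
    return the current ingestion status, or None for a directly uploaded file
    (which has no ingestion status and is ready immediately).
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status is None or status == "COMPLETED":
            return True
        if status == "FAILED":
            return False
        # PENDING or DOWNLOADING: keep waiting until the timeout is reached.
        if waited >= timeout:
            raise TimeoutError(f"ingestion still {status} after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real delays.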

File Properties

Every file carries the following information:

Property	Description
Name	A human-readable display name (e.g., `Weekly Standup`)
Filename	The original filename from upload (e.g., `meeting-2026-03-15.wav`)
Description	Optional free-text description of the file content
Path	Virtual directory path for organization (see Virtual Paths)
Size	File size in bytes
Duration	Media duration in seconds (extracted after upload)
SHA-256	Checksum of the file content for deduplication and integrity verification
Metadata	Arbitrary key-value JSON object for custom data (e.g., `{"department": "engineering", "project": "alpha"}`)
Tags	User-defined labels for categorization (see Tags)
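
The SHA-256 property can be used to confirm that a local copy matches what was stored. The sketch below assumes the checksum is reported as a hexadecimal string, which is not stated above; the function names are illustrative.

```python
import hashlib
from typing import BinaryIO, Iterable, Iterator


def read_chunks(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield fixed-size chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def sha256_hex(chunks: Iterable[bytes]) -> str:
    """Compute the hex SHA-256 digest incrementally, without loading the whole file."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def matches_reported_checksum(chunks: Iterable[bytes], reported: str) -> bool:
    """Compare local content against a reported checksum (assumed to be hex)."""
    return sha256_hex(chunks) == reported.strip().lower()
```

For a file on disk, you would open it in binary mode and pass `read_chunks(handle)` as the `chunks` argument.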

Virtual Paths

Files are organized into virtual directories using path strings, similar to folders in a file system. Stored paths always start and end with `/`.

Example Path	Meaning
`/`	Root directory (default for all files)
`/meetings/`	A "meetings" folder
`/meetings/2026/march/`	Nested folder structure

Virtual paths are purely organizational - they don't affect storage or processing. You can:

List files at a path - returns only direct children of that directory.
List recursively - returns all descendants under a path.
Move files - update a file's path to reorganize without re-uploading.
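
The difference between a direct listing and a recursive listing can be illustrated locally. This sketch models files as a plain dictionary of file name to virtual path; it is an assumption-laden illustration of the semantics described above, not how Koldan implements listing.

```python
from typing import Dict, List


def list_files(paths_by_name: Dict[str, str], path: str, recursive: bool = False) -> List[str]:
    """List file names at a virtual path.

    Non-recursive: only files whose path equals the directory (direct children).
    Recursive: every file whose path starts with the directory path. Because
    paths end with "/", "/meetings/" does not accidentally match "/meetups/".
    """
    if recursive:
        matches = (name for name, p in paths_by_name.items() if p.startswith(path))
    else:
        matches = (name for name, p in paths_by_name.items() if p == path)
    return sorted(matches)


def move_file(paths_by_name: Dict[str, str], name: str, new_path: str) -> None:
    """Moving a file only changes its path string; the content is untouched."""
    paths_by_name[name] = new_path
```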

Path rules

Stored paths always start and end with `/`. Paths you submit are normalized automatically: a missing leading or trailing slash is added, and consecutive slashes are collapsed.
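
The normalization rules can be sketched in a few lines. This is a local illustration of the stated rules only; any further validation the platform applies (allowed characters, length limits) is not covered here.

```python
import re


def normalize_path(raw: str) -> str:
    """Collapse repeated slashes and ensure a leading and trailing slash."""
    path = re.sub(r"/+", "/", raw)
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path += "/"
    return path
```

Under these rules, `meetings` and `//meetings//` both normalize to `/meetings/`, and an empty string normalizes to the root directory `/`.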

Listening Audio

When you upload a media file, Koldan can generate a listening audio derivative - a compressed MP3 version optimized for playback in web browsers and client applications. Listening audio serves several purposes:

Browser compatibility - the original file may be in a format that browsers can't play natively (e.g., WAV, FLAC, or multi-channel audio).
Reduced size - the MP3 derivative is significantly smaller than the original, making streaming faster and more efficient.
Storage savings - once the listening audio is generated, you can discard the original file content to free up storage quota while still being able to listen to the recording in the web interface.

Listening Audio Status	Description
`PENDING`	Generation requested, not yet started
`PROCESSING`	Audio conversion is in progress
`COMPLETED`	MP3 ready for streaming
`FAILED`	Conversion failed

Listening audio generation can be:

Automatic - enabled by default on upload (configurable via the `generateListeningAudio` parameter).
On-demand - triggered later via a dedicated endpoint on an existing file.

The listening audio has its own independent lifecycle - it can be deleted and purged separately from the original file. Streaming is done via an endpoint that supports HTTP range requests for seeking within the audio.
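
Seeking with HTTP range requests relies on the standard `Range` request header and `Content-Range` response header. The helpers below sketch how a client might build and read those headers; the streaming endpoint URL is not specified here, so the sketch deliberately stops short of making a request.

```python
import re
from typing import Dict, Optional, Tuple


def range_header(start: int, end: Optional[int] = None) -> Dict[str, str]:
    """Build a Range header for bytes start..end (inclusive), or start..EOF."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    last = "" if end is None else str(end)
    return {"Range": f"bytes={start}-{last}"}


_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Parse 'bytes start-end/total' from a 206 response; total is None if '*'.

    The unsatisfied-range form ('bytes */total', sent with 416) is not handled.
    """
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    start, end, total = match.groups()
    return int(start), int(end), None if total == "*" else int(total)
```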

Content Lifecycle

Files move through four content lifecycle stages:

stateDiagram-v2
    [*] --> Active
    Active --> ContentDiscarded: Discard content
    Active --> Deleted: Delete
    Deleted --> Purged: Purge / Retention expires
    ContentDiscarded --> Deleted: Delete
    Deleted --> Active: Restore

Stage	What It Means
Active	File and its content are fully available
Content discarded	The binary content has been removed from storage, but the file metadata (name, tags, path, duration, etc.) is preserved. Useful when you want to keep the record without storing the media. Transcriptions and summaries remain available.
Deleted	The file is marked as deleted and hidden from default listings, but can still be found by filtering for deleted files. A scheduled purge date is set based on your retention policy.
Purged	The binary content is permanently removed from storage. The database record is retained for audit purposes, but the file can no longer be downloaded.

Purge is irreversible

Once a file is purged - whether by explicit request or automatic retention - the binary content cannot be recovered. Only the metadata record remains.
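
The lifecycle diagram can be read as a small state machine. The sketch below encodes only the transitions shown in the diagram; the action names (`discard_content`, `delete`, `restore`, `purge`) are illustrative labels, not API operation names.

```python
from enum import Enum
from typing import Dict, Tuple


class Stage(Enum):
    ACTIVE = "Active"
    CONTENT_DISCARDED = "ContentDiscarded"
    DELETED = "Deleted"
    PURGED = "Purged"


# (current stage, action) -> next stage, copied from the state diagram.
TRANSITIONS: Dict[Tuple[Stage, str], Stage] = {
    (Stage.ACTIVE, "discard_content"): Stage.CONTENT_DISCARDED,
    (Stage.ACTIVE, "delete"): Stage.DELETED,
    (Stage.CONTENT_DISCARDED, "delete"): Stage.DELETED,
    (Stage.DELETED, "restore"): Stage.ACTIVE,
    (Stage.DELETED, "purge"): Stage.PURGED,
}


def apply_action(stage: Stage, action: str) -> Stage:
    """Return the next stage, or raise if the diagram has no such transition."""
    try:
        return TRANSITIONS[(stage, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a file in stage {stage.value}") from None
```

Purged has no outgoing transitions in the table, which mirrors the fact that a purge cannot be undone.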

Sharing

Files can be shared with other users in your tenant. Koldan supports two sharing mechanisms:

Mechanism	Description
Direct share	Share a file with a specific user by their UUID, granting either `VIEW` (read-only) or `EDIT` (read + write) permission.
Internal publish	Set a file's publish mode to `INTERNAL` to make it visible to all authenticated users within the same tenant (read-only).

Shared files - along with their transcriptions, summaries, and listening audio - become accessible to the recipient according to the granted permission level. Users can discover files shared with them via a dedicated endpoint and then drill into subresources (transcriptions, summaries) by passing the file ID.

File owners retain full control: they can manage shares, change the publish mode, and revoke access at any time.
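
How the two mechanisms might combine into an effective permission can be sketched as follows. This is an illustrative model only: the `FileAccess` structure, the `"OWNER"` label, and the precedence of `EDIT` over an internal publish are assumptions, and the authoritative rules are in the permission matrix of the File Sharing REST API guide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FileAccess:
    """Hypothetical local model of a file's sharing state."""
    owner_id: str
    tenant_id: str
    publish_mode: Optional[str] = None  # "INTERNAL" when published within the tenant
    shares: Dict[str, str] = field(default_factory=dict)  # user UUID -> "VIEW" or "EDIT"


def effective_permission(file: FileAccess, user_id: str, user_tenant_id: str) -> Optional[str]:
    """Return "OWNER", "EDIT", "VIEW", or None (no access)."""
    if user_tenant_id != file.tenant_id:
        return None  # sharing is limited to users in the same tenant
    if user_id == file.owner_id:
        return "OWNER"
    direct = file.shares.get(user_id)
    if direct == "EDIT":
        return "EDIT"
    if direct == "VIEW" or file.publish_mode == "INTERNAL":
        return "VIEW"  # internal publish grants read-only access
    return None
```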

For full API details, endpoint reference, and permission matrix, see the File Sharing REST API guide.

Transcriptions

A transcription job takes a file and produces a text transcript using Koldan's speech recognition engine.

Creating a Transcription

There are two approaches:

Approach	Description
Two-step	Upload a file first, then create a transcription job referencing the file's ID. Useful when you want to transcribe the same file multiple times with different options.
Upload and transcribe	A single combined endpoint that uploads the file and starts a transcription job in one request. The most common approach for simple workflows.

Transcription Options

When creating a transcription, you can configure:

Language - specify the source language or let the engine auto-detect it.
Model - choose which speech model to use.
Punctuation - add punctuation marks to the transcript (see below).
Capitalization - apply correct casing to the transcript (see below).
Diarization - enable speaker identification to label who said what (see below).
Webhook - receive a notification at a URL when the job completes or fails.
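
The options above might be assembled into a request body along these lines. Every field name in this sketch (`fileId`, `language`, `model`, `punctuation`, `capitalization`, `diarization`, `webhookUrl`) is a hypothetical placeholder; consult the Transcriptions API reference for the actual request schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional
from urllib.parse import urlparse


@dataclass
class TranscriptionOptions:
    """Illustrative container for the options listed above (names are assumptions)."""
    language: Optional[str] = None  # None means: let the engine auto-detect
    model: Optional[str] = None
    punctuation: bool = False
    capitalization: bool = False
    diarization: bool = False
    webhook_url: Optional[str] = None

    def to_payload(self, file_id: str) -> Dict[str, Any]:
        """Build a JSON-serializable body, omitting options that were not set."""
        if self.webhook_url is not None:
            parsed = urlparse(self.webhook_url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                raise ValueError("webhook_url must be an absolute http(s) URL")
        payload: Dict[str, Any] = {
            "fileId": file_id,
            "punctuation": self.punctuation,
            "capitalization": self.capitalization,
            "diarization": self.diarization,
        }
        optional = {
            "language": self.language,
            "model": self.model,
            "webhookUrl": self.webhook_url,
        }
        payload.update({key: value for key, value in optional.items() if value is not None})
        return payload
```

Keeping the options in one object fits the two-step approach, where the same file ID is reused with different option sets.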

Punctuation and Capitalization

Language-specific - availability depends on the languages installed on your server.

Raw speech recognition output is typically all-lowercase with no punctuation marks. Koldan offers optional post-processing that enriches the transcript with proper punctuation and capitalization:

Option	What It Does
Punctuation	Inserts punctuation marks (`.`, `,`, `?`, `!`) at appropriate positions in the text.
Capitalization	Applies correct casing to words - sentence starts, proper nouns, acronyms, etc.

Both options are independent toggles - you can enable one, both, or neither when creating a transcription.

When enabled, the enrichment is applied automatically after speech recognition completes. The results appear at two levels:

Segment text - the `text` field of each segment is rebuilt with punctuation and capitalization applied, so you get natural-looking sentences out of the box.
Word-level detail - each word in the `words` array carries a `punctuation` field (the mark following the word, e.g., `","` or `"."`) and a `capitalization` field (the corrected form, e.g., `"Hello"`, `"NASA"`). Each field is `null` when the respective service was not applied.
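
To make the relationship between the two levels concrete, the sketch below rebuilds a segment's text from word-level detail. The `punctuation` and `capitalization` fields follow the description above; the `word` key holding the raw recognized token is an assumption about the response shape.

```python
from typing import Dict, List, Optional


def render_segment(words: List[Dict[str, Optional[str]]]) -> str:
    """Join words, preferring the corrected casing and appending any trailing mark."""
    parts = []
    for entry in words:
        # Fall back to the raw token when capitalization was not applied (null).
        token = entry.get("capitalization") or entry["word"]
        # A null punctuation field means no mark follows this word.
        mark = entry.get("punctuation") or ""
        parts.append(f"{token}{mark}")
    return " ".join(parts)
```

With both services applied, three words such as `hello`, `nasa`, `called` could render as `Hello, NASA called.`; with neither applied, the same words render as `hello nasa called`.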

Availability

Punctuation and capitalization require a dedicated processing service configured on the server. If the service is not deployed, these options are silently ignored. Capitalization support also varies by language, and some languages may not support it at all.

Diarization

Diarization identifies and labels different speakers in the audio. Koldan supports two diarization modes:

Speaker diarization - automatically detects the number of speakers and assigns labels (Speaker 1, Speaker 2, etc.). You can optionally hint the expected number of speakers.
Channel diarization - maps audio channels to speakers (e.g., left channel = Agent, right channel = Customer). Useful for call center recordings with separate channels per participant.
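
Once segments carry speaker labels, a common next step is to merge consecutive segments from the same speaker into turns. The sketch below assumes each segment is a dictionary with a `speaker` key (an assumed field name) and a `text` key; the optional `labels` mapping renames generic labels to roles such as Agent or Customer.

```python
from typing import Dict, List, Optional


def to_turns(
    segments: List[Dict[str, str]],
    labels: Optional[Dict[str, str]] = None,
) -> List[Dict[str, str]]:
    """Merge adjacent segments by the same speaker into a single turn."""
    labels = labels or {}
    turns: List[Dict[str, str]] = []
    for segment in segments:
        # Rename the speaker if a friendlier label was supplied.
        speaker = labels.get(segment["speaker"], segment["speaker"])
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + segment["text"]
        else:
            turns.append({"speaker": speaker, "text": segment["text"]})
    return turns
```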

Job Lifecycle

Every transcription job follows a status progression:

stateDiagram-v2
    [*] --> PENDING
    PENDING --> IN_PROGRESS: Processing starts
    IN_PROGRESS --> COMPLETED: Success
    IN_PROGRESS --> FAILED: Error
    PENDING --> CANCELLED: User cancels
    IN_PROGRESS --> CANCELLED: User cancels

Status	Description
`PENDING`	Job is queued and waiting for an available processing slot
`IN_PROGRESS`	Speech recognition is actively processing the audio
`COMPLETED`	Transcription finished successfully - results are available
`FAILED`	Processing failed - check the error code and error details for the cause
`CANCELLED`	Job was cancelled by the user before completion
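
The status progression can be encoded as a transition table, which is handy for deciding when to stop polling or whether a cancel request still makes sense. This is a sketch derived from the diagram above; the helper names are illustrative.

```python
from typing import Dict, FrozenSet, List

# Allowed transitions, copied from the state diagram.
JOB_TRANSITIONS: Dict[str, FrozenSet[str]] = {
    "PENDING": frozenset({"IN_PROGRESS", "CANCELLED"}),
    "IN_PROGRESS": frozenset({"COMPLETED", "FAILED", "CANCELLED"}),
    "COMPLETED": frozenset(),
    "FAILED": frozenset(),
    "CANCELLED": frozenset(),
}


def is_terminal(status: str) -> bool:
    """A status is terminal when the diagram shows no way out of it."""
    return not JOB_TRANSITIONS[status]


def can_cancel(status: str) -> bool:
    """Only PENDING and IN_PROGRESS jobs can move to CANCELLED."""
    return "CANCELLED" in JOB_TRANSITIONS[status]


def is_valid_history(history: List[str]) -> bool:
    """Check that distinct observed statuses start at PENDING and follow the diagram."""
    if not history or history[0] != "PENDING":
        return False
    return all(nxt in JOB_TRANSITIONS[cur] for cur, nxt in zip(history, history[1:]))
```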

Transcription Content Lifecycle

Transcriptions follow the same delete and purge pattern as files:

Stage	What It Means
Active	Job and results are fully available
Deleted	Hidden from default listings. A purge date is scheduled per your retention policy.
Purged	Result data is permanently removed from storage. The job record remains for audit, but results return `410 Gone`.

See Also

Data Retention, Quotas, and Rate Limits - how retention policies and quotas affect files, transcriptions, and summaries
Subscriptions - how subscription plans can override quotas and rate limits
Transcriptions API - full API reference for transcription endpoints
Summaries API - full API reference for summary endpoints
Prompt Templates API - managing prompt templates for summaries
Speech Models - how models work, model types, and capabilities
Languages API - available languages for transcription
File Sharing - sharing files with other users and publishing within your tenant