Files and Transcriptions
Koldan's speech services revolve around two core resources that form a processing pipeline:
- Files - media files (audio or video) that you upload or import into the platform.
- Transcriptions - speech-to-text jobs that process a file and produce a text transcript.
Files
A file is the starting point of any speech processing workflow. It represents a media file (audio or video) stored in Koldan's object storage.
Uploading and Importing
There are two ways to get a file into Koldan:
| Method | How It Works |
|---|---|
| Direct upload | Upload the binary file via multipart form data. The file is stored immediately and is ready for transcription. |
| URI import | Provide a remote URI (e.g., an S3 presigned URL or HTTPS link). Koldan downloads the file asynchronously in the background. |
URI-imported files go through an ingestion process before they can be used:
| Ingestion Status | Description |
|---|---|
| PENDING | Import request accepted, download not yet started |
| DOWNLOADING | File is being downloaded from the source URI |
| COMPLETED | File downloaded successfully and ready for use |
| FAILED | Download failed (unreachable URI, timeout, invalid format, etc.) |
Directly uploaded files skip ingestion entirely - they have no ingestion status.
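The ingestion statuses above lend themselves to a simple polling loop. The following is a minimal sketch, not Koldan's client API: `get_status` is a hypothetical callable that you would implement to fetch the file's current ingestion status, and it is assumed to return `None` for directly uploaded files.

```python
import time
from typing import Callable, Optional


def wait_for_ingestion(
    get_status: Callable[[], Optional[str]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the file is usable, False if the import failed.

    `get_status` is a hypothetical callable returning the current ingestion
    status string, or None for a directly uploaded file (assumption).
    """
    waited = 0.0
    while True:
        status = get_status()
        if status is None or status == "COMPLETED":
            return True
        if status == "FAILED":
            return False
        # PENDING or DOWNLOADING: keep waiting until the timeout is reached.
        if waited >= timeout:
            raise TimeoutError(f"ingestion still {status} after {timeout}s")
        sleep(interval)
        waited += interval


# Usage with a canned sequence standing in for real API responses.
responses = iter(["PENDING", "DOWNLOADING", "COMPLETED"])
print(wait_for_ingestion(lambda: next(responses), sleep=lambda _: None))  # True
```

Injecting `sleep` keeps the loop easy to test without real delays.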
File Properties
Every file carries the following information:
| Property | Description |
|---|---|
| Name | A human-readable display name (e.g., Weekly Standup) |
| Filename | The original filename from upload (e.g., meeting-2026-03-15.wav) |
| Description | Optional free-text description of the file content |
| Path | Virtual directory path for organization (see Virtual Paths) |
| Size | File size in bytes |
| Duration | Media duration in seconds (extracted after upload) |
| SHA-256 | Checksum of the file content for deduplication and integrity verification |
| Metadata | Arbitrary key-value JSON object for custom data (e.g., {"department": "engineering", "project": "alpha"}) |
| Tags | User-defined labels for categorization (see Tags) |
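To check a local copy against the Size and SHA-256 properties, you can compute both values yourself. This sketch uses only Python's standard `hashlib`; that Koldan reports the checksum as a lowercase hex string is an assumption, so adjust the comparison if the API uses another encoding.

```python
import hashlib


def local_fingerprint(path: str, chunk_size: int = 1 << 20) -> dict:
    """Compute the size in bytes and the hex SHA-256 of a local file.

    The file is read in chunks so that large media files do not need to
    fit in memory.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"size": size, "sha256": digest.hexdigest()}
```

Comparing the returned values with the file's Size and SHA-256 properties after upload is one way to confirm that the content arrived intact.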
Virtual Paths
Files are organized into virtual directories using path strings - similar to folders in a file system. Paths always start and end with /.
| Example Path | Meaning |
|---|---|
| / | Root directory (default for all files) |
| /meetings/ | A "meetings" folder |
| /meetings/2026/march/ | Nested folder structure |
Virtual paths are purely organizational - they don't affect storage or processing. You can:
- List files at a path - returns only direct children of that directory.
- List recursively - returns all descendants under a path.
- Move files - update a file's path to reorganize without re-uploading.
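The difference between a direct listing and a recursive listing can be illustrated on an in-memory list. This is only a sketch of the semantics described above, not how Koldan implements listing; the `name` and `path` dictionary keys are illustrative.

```python
from typing import Dict, List


def list_files(files: List[Dict[str, str]], path: str, recursive: bool = False) -> List[str]:
    """Return the names of files located at `path`.

    Non-recursive: only files whose path equals `path` (direct children).
    Recursive: every file whose path starts with `path` (all descendants).
    Because paths end with "/", "/meetings/" never matches "/meetings-archive/".
    """
    if recursive:
        return [f["name"] for f in files if f["path"].startswith(path)]
    return [f["name"] for f in files if f["path"] == path]


files = [
    {"name": "intro.wav", "path": "/"},
    {"name": "standup.wav", "path": "/meetings/"},
    {"name": "review.wav", "path": "/meetings/2026/march/"},
]
print(list_files(files, "/meetings/"))                  # ['standup.wav']
print(list_files(files, "/meetings/", recursive=True))  # ['standup.wav', 'review.wav']
```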
Path rules
Stored paths always start and end with /. Input paths are normalized automatically: a missing leading or trailing slash is added, and consecutive slashes are collapsed into one.
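The normalization rule can be reproduced client-side, for example to predict where a file will end up. This is a minimal sketch of the rule as stated here; the server may apply additional checks that this page does not describe.

```python
import re


def normalize_path(path: str) -> str:
    """Add missing leading/trailing slashes and collapse repeated slashes."""
    # Wrapping in slashes first guarantees both ends; collapsing then
    # removes any duplicates that the wrapping introduced.
    return re.sub(r"/+", "/", f"/{path}/")


print(normalize_path("meetings/2026"))       # /meetings/2026/
print(normalize_path("//meetings///2026/"))  # /meetings/2026/
print(normalize_path(""))                    # /
```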
Tags
Tags are user-defined labels that help categorize and filter files. Each tag has:
- A display name - the human-readable label (e.g., Meeting Notes).
- A canonical name - a normalized form used for matching, where the name is lowercased and spaces/hyphens are replaced with underscores (e.g., meeting_notes).
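The canonical form can be approximated as follows. This sketch implements only the two rules stated above (lowercasing, then replacing spaces and hyphens with underscores); whether Koldan also trims whitespace or collapses repeated separators is not specified here, so the function does neither.

```python
import re


def canonical_tag_name(display_name: str) -> str:
    """Lowercase the name and replace each space or hyphen with an underscore."""
    return re.sub(r"[ -]", "_", display_name.lower())


print(canonical_tag_name("Meeting Notes"))    # meeting_notes
print(canonical_tag_name("Follow-Up Items"))  # follow_up_items
```

Two display names that differ only in case or separator therefore map to the same canonical name.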
Tags are scoped per user within a tenant. You can:
- Assign tags during upload - pass tag names when uploading or importing a file.
- Update tags - add or remove tags on an existing file at any time.
- Filter by tags - list files that match specific tags, using all (AND) or any (OR) matching.
- Search and validate - look up existing tags or validate new tag names before use.
Tags are created automatically when first used - if you assign a tag name that doesn't exist yet, Koldan creates it for you.
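The two filter modes correspond to set operations on canonical tag names. The sketch below shows the matching logic only; the `mode` values `"all"` and `"any"` follow the wording above, while the function name and signature are illustrative.

```python
from typing import Iterable


def matches_tags(file_tags: Iterable[str], wanted: Iterable[str], mode: str = "all") -> bool:
    """Check whether a file's tags satisfy a tag filter.

    mode="all" (AND): every wanted tag must be present on the file.
    mode="any" (OR): at least one wanted tag must be present.
    """
    file_set, wanted_set = set(file_tags), set(wanted)
    if mode == "all":
        return wanted_set <= file_set
    if mode == "any":
        return bool(wanted_set & file_set)
    raise ValueError(f"unknown mode: {mode!r}")


tags = ["meeting_notes", "engineering"]
print(matches_tags(tags, ["meeting_notes", "finance"], mode="all"))  # False
print(matches_tags(tags, ["meeting_notes", "finance"], mode="any"))  # True
```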
Listening Audio
When you upload a media file, Koldan can generate a listening audio derivative - a compressed MP3 version optimized for playback in web browsers and client applications. Listening audio serves several purposes:
- Browser compatibility - the original file may be in a format that browsers can't play natively (e.g., WAV, FLAC, or multi-channel audio).
- Reduced size - the MP3 derivative is significantly smaller than the original, making streaming faster and more efficient.
- Storage savings - once the listening audio is generated, you can discard the original file content to free up storage quota while still being able to listen to the recording in the web interface.
| Listening Audio Status | Description |
|---|---|
| PENDING | Generation requested, not yet started |
| PROCESSING | Audio conversion is in progress |
| COMPLETED | MP3 ready for streaming |
| FAILED | Conversion failed |
Listening audio generation can be:
- Automatic - enabled by default on upload (configurable via the generateListeningAudio parameter).
- On-demand - triggered later via a dedicated endpoint on an existing file.
The listening audio has its own independent lifecycle - it can be deleted and purged separately from the original file. Streaming is done via an endpoint that supports HTTP range requests for seeking within the audio.
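Seeking relies on the standard HTTP `Range` request header and `Content-Range` response header. The helpers below sketch how a client might build and read them; they follow the generic HTTP byte-range syntax and do not assume anything about the Koldan streaming endpoint itself, whose URL is not given here.

```python
import re
from typing import Dict, Optional, Tuple


def range_header(start: int, end: Optional[int] = None) -> Dict[str, str]:
    """Build a Range header for bytes start..end (inclusive), or start..EOF."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    last = "" if end is None else str(end)
    return {"Range": f"bytes={start}-{last}"}


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Parse 'bytes START-END/TOTAL'; TOTAL is None when reported as '*'."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value)
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    start, end, total = match.groups()
    return int(start), int(end), None if total == "*" else int(total)


print(range_header(0, 1023))                       # {'Range': 'bytes=0-1023'}
print(range_header(500))                           # {'Range': 'bytes=500-'}
print(parse_content_range("bytes 0-1023/146515"))  # (0, 1023, 146515)
```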
Content Lifecycle
Files follow a four-stage content lifecycle:
stateDiagram-v2
[*] --> Active
Active --> ContentDiscarded: Discard content
Active --> Deleted: Delete
Deleted --> Purged: Purge / Retention expires
ContentDiscarded --> Deleted: Delete
Deleted --> Active: Restore
| Stage | What It Means |
|---|---|
| Active | File and its content are fully available |
| Content discarded | The binary content has been removed from storage, but the file metadata (name, tags, path, duration, etc.) is preserved. Useful when you want to keep the record without storing the media. Transcriptions and summaries remain available. |
| Deleted | The file is marked as deleted and hidden from default listings, but can still be found by filtering for deleted files. A scheduled purge date is set based on your retention policy. |
| Purged | The binary content is permanently removed from storage. The database record is retained for audit purposes, but the file can no longer be downloaded. |
Purge is irreversible
Once a file is purged - whether by explicit request or automatic retention - the binary content cannot be recovered. Only the metadata record remains.
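The lifecycle diagram can be encoded as a small transition table, which is handy for validating actions client-side before calling the API. The stage names come from the diagram; the action names (`discard_content`, `delete`, `purge`, `restore`) are illustrative labels, not API identifiers.

```python
# Transition table mirroring the state diagram above.
FILE_TRANSITIONS = {
    "Active": {"discard_content": "ContentDiscarded", "delete": "Deleted"},
    "ContentDiscarded": {"delete": "Deleted"},
    "Deleted": {"purge": "Purged", "restore": "Active"},
    "Purged": {},  # terminal: purge is irreversible
}


def next_stage(stage: str, action: str) -> str:
    """Return the stage reached by applying `action`, or raise ValueError."""
    try:
        return FILE_TRANSITIONS[stage][action]
    except KeyError:
        raise ValueError(f"cannot apply {action!r} to a file in stage {stage!r}") from None


print(next_stage("Active", "delete"))    # Deleted
print(next_stage("Deleted", "restore"))  # Active
```

Attempting `next_stage("Purged", "restore")` raises `ValueError`, reflecting that a purged file cannot be brought back.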
File Sharing & Publishing
Files can be shared with other users in your tenant. Koldan supports two sharing mechanisms:
| Mechanism | Description |
|---|---|
| Direct share | Share a file with a specific user by their UUID, granting either VIEW (read-only) or EDIT (read + write) permission. |
| Internal publish | Set a file's publish mode to INTERNAL to make it visible to all authenticated users within the same tenant (read-only). |
Shared files - along with their transcriptions, summaries, and listening audio - become accessible to the recipient according to the granted permission level. Users can discover files shared with them via a dedicated endpoint and then drill into subresources (transcriptions, summaries) by passing the file ID.
File owners retain full control: they can manage shares, change the publish mode, and revoke access at any time.
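How the two mechanisms might combine into an effective permission is sketched below. Everything about the data shape is an assumption made for illustration: the `owner_id`, `tenant_id`, `shares`, and `publish_mode` keys are hypothetical, and so is the rule that the highest applicable permission wins. The permission matrix in the File Sharing REST API guide is the authoritative source.

```python
from typing import Optional


def effective_permission(user_id: str, user_tenant: str, file: dict) -> Optional[str]:
    """Return "OWNER", "EDIT", "VIEW", or None for no access.

    `file` uses hypothetical keys: owner_id, tenant_id, shares (a mapping of
    user ID to "VIEW"/"EDIT"), and publish_mode.
    """
    if file["owner_id"] == user_id:
        return "OWNER"
    share = file.get("shares", {}).get(user_id)
    if share == "EDIT":
        return "EDIT"
    published = file.get("publish_mode") == "INTERNAL" and file["tenant_id"] == user_tenant
    if share == "VIEW" or published:
        return "VIEW"  # an internal publish grants read-only access
    return None


doc = {
    "owner_id": "u1",
    "tenant_id": "t1",
    "shares": {"u2": "EDIT"},
    "publish_mode": "INTERNAL",
}
print(effective_permission("u2", "t1", doc))  # EDIT
print(effective_permission("u3", "t1", doc))  # VIEW
print(effective_permission("u4", "t2", doc))  # None
```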
For full API details, endpoint reference, and permission matrix, see the File Sharing REST API guide.
Transcriptions
A transcription job takes a file and produces a text transcript using Koldan's speech recognition engine.
Creating a Transcription
There are two approaches:
| Approach | Description |
|---|---|
| Two-step | Upload a file first, then create a transcription job referencing the file's ID. Useful when you want to transcribe the same file multiple times with different options. |
| Upload and transcribe | A single combined endpoint that uploads the file and starts a transcription job in one request. The most common approach for simple workflows. |
Transcription Options
When creating a transcription, you can configure:
- Language - specify the source language or let the engine auto-detect it.
- Model - choose which speech model to use.
- Punctuation - add punctuation marks to the transcript (see below).
- Capitalization - apply correct casing to the transcript (see below).
- Diarization - enable speaker identification to label who said what (see below).
- Webhook - receive a notification at a URL when the job completes or fails.
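As an illustration, the options above could be collected into a request payload like this. All field names (`language`, `model`, `punctuation`, `capitalization`, `diarization`, `webhookUrl`) are placeholders chosen for this sketch, as is the idea that omitting `language` requests auto-detection; consult the Transcriptions API reference for the real parameter names.

```python
from typing import Any, Dict, Optional


def build_transcription_options(
    language: Optional[str] = None,
    model: Optional[str] = None,
    punctuation: bool = False,
    capitalization: bool = False,
    diarization: Optional[str] = None,
    webhook_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a payload with hypothetical field names; unset options are omitted."""
    options: Dict[str, Any] = {
        "punctuation": punctuation,
        "capitalization": capitalization,
    }
    if language is not None:
        options["language"] = language  # omitted => auto-detect (assumption)
    if model is not None:
        options["model"] = model
    if diarization is not None:
        options["diarization"] = diarization
    if webhook_url is not None:
        options["webhookUrl"] = webhook_url
    return options


print(build_transcription_options(language="en", punctuation=True))
# {'punctuation': True, 'capitalization': False, 'language': 'en'}
```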
Punctuation and Capitalization
Language-specific - availability depends on the languages installed on your server.
Raw speech recognition output is typically all-lowercase with no punctuation marks. Koldan offers optional post-processing that enriches the transcript with proper punctuation and capitalization:
| Option | What It Does |
|---|---|
| Punctuation | Inserts punctuation marks (., ,, ?, !) at appropriate positions in the text. |
| Capitalization | Applies correct casing to words - sentence starts, proper nouns, acronyms, etc. |
Both options are independent toggles - you can enable one, both, or neither when creating a transcription.
When enabled, the enrichment is applied automatically after speech recognition completes. The results appear at two levels:
- Segment text - the text field of each segment is rebuilt with punctuation and capitalization applied, so you get natural-looking sentences out of the box.
- Word-level detail - each word in the words array carries a punctuation field (the mark following the word, e.g., "," or ".") and a capitalization field (the corrected form, e.g., "Hello", "NASA"). These are null when the respective service was not applied.
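The relationship between the word-level fields and the segment text can be sketched as follows. The `punctuation` and `capitalization` keys come from the description above; the `word` key holding the raw recognized token is an assumption, and joining words with single spaces is a simplification that may not suit every language.

```python
from typing import Dict, List, Optional


def rebuild_text(words: List[Dict[str, Optional[str]]]) -> str:
    """Rebuild segment text from word entries.

    Uses the corrected form when `capitalization` is set and appends the
    `punctuation` mark when present; null fields fall back to the raw word.
    """
    parts = []
    for entry in words:
        token = entry.get("capitalization") or entry["word"]  # "word" key is assumed
        mark = entry.get("punctuation") or ""
        parts.append(token + mark)
    return " ".join(parts)


words = [
    {"word": "hello", "capitalization": "Hello", "punctuation": ","},
    {"word": "nasa", "capitalization": "NASA", "punctuation": "."},
]
print(rebuild_text(words))  # Hello, NASA.
```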
Availability
Punctuation and capitalization require a dedicated processing service configured on the server. If the service is not deployed, these options are silently ignored. Capitalization support may also vary by language - some languages may not support it.
Diarization
Diarization identifies and labels different speakers in the audio. Koldan supports multiple diarization modes:
- Speaker diarization - automatically detects the number of speakers and assigns labels (Speaker 1, Speaker 2, etc.). You can optionally hint the expected number of speakers.
- Channel diarization - maps audio channels to speakers (e.g., left channel = Agent, right channel = Customer). Useful for call center recordings with separate channels per participant.
Job Lifecycle
Every transcription job follows a status progression:
stateDiagram-v2
[*] --> PENDING
PENDING --> IN_PROGRESS: Processing starts
IN_PROGRESS --> COMPLETED: Success
IN_PROGRESS --> FAILED: Error
PENDING --> CANCELLED: User cancels
IN_PROGRESS --> CANCELLED: User cancels
| Status | Description |
|---|---|
| PENDING | Job is queued and waiting for an available processing slot |
| IN_PROGRESS | Speech recognition is actively processing the audio |
| COMPLETED | Transcription finished successfully - results are available |
| FAILED | Processing failed - check the error code and error details for the cause |
| CANCELLED | Job was cancelled by the user before completion |
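A client that polls for job status mainly needs to know which statuses are terminal. The table below mirrors the state diagram; the helper names are illustrative and not part of any Koldan SDK.

```python
# Allowed status transitions, taken from the job lifecycle diagram.
JOB_TRANSITIONS = {
    "PENDING": {"IN_PROGRESS", "CANCELLED"},
    "IN_PROGRESS": {"COMPLETED", "FAILED", "CANCELLED"},
    "COMPLETED": set(),
    "FAILED": set(),
    "CANCELLED": set(),
}


def is_terminal(status: str) -> bool:
    """A status is terminal when no further transition leaves it."""
    return not JOB_TRANSITIONS[status]


def can_transition(current: str, target: str) -> bool:
    """Check whether the diagram allows moving from `current` to `target`."""
    return target in JOB_TRANSITIONS[current]


print(is_terminal("IN_PROGRESS"))              # False
print(is_terminal("CANCELLED"))                # True
print(can_transition("PENDING", "COMPLETED"))  # False
```

A polling loop can stop as soon as `is_terminal` returns `True`, then branch on whether the final status is COMPLETED, FAILED, or CANCELLED.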
Transcription Content Lifecycle
Transcriptions follow the same delete and purge pattern as files:
| Stage | What It Means |
|---|---|
| Active | Job and results are fully available |
| Deleted | Hidden from default listings. A purge date is scheduled per your retention policy. |
| Purged | Result data is permanently removed from storage. The job record remains for audit, but results return 410 Gone. |
Related Pages
- Data Retention, Quotas, and Rate Limits - how retention policies and quotas affect files, transcriptions, and summaries
- Subscriptions - how subscription plans can override quotas and rate limits
- Transcriptions API - full API reference for transcription endpoints
- Summaries API - full API reference for summary endpoints
- Prompt Templates API - managing prompt templates for summaries
- Speech Models - how models work, model types, and capabilities
- Languages API - available languages for transcription
- File Sharing - sharing files with other users and publishing within your tenant