
Files and Transcriptions

Koldan's speech services revolve around two core resources that form a processing pipeline:

  1. Files - media files (audio or video) that you upload or import into the platform.
  2. Transcriptions - speech-to-text jobs that process a file and produce a text transcript.

Files

A file is the starting point of any speech processing workflow. It represents a media file (audio or video) stored in Koldan's object storage.

Uploading and Importing

There are two ways to get a file into Koldan:

| Method | How It Works |
| --- | --- |
| Direct upload | Upload the binary file via multipart form data. The file is stored immediately and is ready for transcription. |
| URI import | Provide a remote URI (e.g., an S3 presigned URL or HTTPS link). Koldan downloads the file asynchronously in the background. |

URI-imported files go through an ingestion process before they can be used:

| Ingestion Status | Description |
| --- | --- |
| PENDING | Import request accepted, download not yet started |
| DOWNLOADING | File is being downloaded from the source URI |
| COMPLETED | File downloaded successfully and ready for use |
| FAILED | Download failed (unreachable URI, timeout, invalid format, etc.) |

Directly uploaded files skip ingestion entirely - they have no ingestion status.
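The readiness rule above can be sketched as a pair of small checks. This is only an illustration: the function names are hypothetical, and representing "no ingestion status" as `None` is an assumption about how a client might model a directly uploaded file.

```python
from typing import Optional


def is_ready_for_transcription(ingestion_status: Optional[str]) -> bool:
    """Return True when a file can be used for transcription.

    None models a directly uploaded file, which has no ingestion status
    (assumed client-side representation).
    """
    return ingestion_status is None or ingestion_status == "COMPLETED"


def should_keep_polling(ingestion_status: Optional[str]) -> bool:
    """Return True while a URI import is still queued or downloading."""
    return ingestion_status in {"PENDING", "DOWNLOADING"}
```

A client importing by URI would typically re-check the file until `should_keep_polling` returns False, then branch on COMPLETED versus FAILED.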

File Properties

Every file carries the following information:

| Property | Description |
| --- | --- |
| Name | A human-readable display name (e.g., Weekly Standup) |
| Filename | The original filename from upload (e.g., meeting-2026-03-15.wav) |
| Description | Optional free-text description of the file content |
| Path | Virtual directory path for organization (see Virtual Paths) |
| Size | File size in bytes |
| Duration | Media duration in seconds (extracted after upload) |
| SHA-256 | Checksum of the file content for deduplication and integrity verification |
| Metadata | Arbitrary key-value JSON object for custom data (e.g., {"department": "engineering", "project": "alpha"}) |
| Tags | User-defined labels for categorization (see Tags) |

Virtual Paths

Files are organized into virtual directories using path strings - similar to folders in a file system. Paths always start and end with /.

| Example Path | Meaning |
| --- | --- |
| / | Root directory (default for all files) |
| /meetings/ | A "meetings" folder |
| /meetings/2026/march/ | Nested folder structure |

Virtual paths are purely organizational - they don't affect storage or processing. You can:

  • List files at a path - returns only direct children of that directory.
  • List recursively - returns all descendants under a path.
  • Move files - update a file's path to reorganize without re-uploading.

Path rules

Paths must start and end with /. They are normalized automatically - trailing/leading slashes are added if missing, and consecutive slashes are collapsed.
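A minimal sketch of the normalization rule, plus the difference between a direct listing and a recursive one. The function names are hypothetical, the server's actual implementation may differ, and treating a recursive listing as including files directly at the path is an assumption.

```python
import re


def normalize_path(path: str) -> str:
    """Collapse repeated slashes and ensure a leading and trailing slash."""
    collapsed = re.sub(r"/+", "/", path)
    if not collapsed.startswith("/"):
        collapsed = "/" + collapsed
    if not collapsed.endswith("/"):
        collapsed += "/"
    return collapsed


def is_direct_child(file_path: str, directory: str) -> bool:
    """A file is a direct child when its path equals the directory."""
    return normalize_path(file_path) == normalize_path(directory)


def is_descendant(file_path: str, directory: str) -> bool:
    """A file is a descendant when its path starts with the directory.

    The trailing slash keeps /meetings-old/ from matching /meetings/.
    """
    return normalize_path(file_path).startswith(normalize_path(directory))
```

For example, `normalize_path("meetings//2026")` yields `/meetings/2026/`, and an empty string normalizes to the root directory `/`.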

Tags

Tags are user-defined labels that help categorize and filter files. Each tag has:

  • A display name - the human-readable label (e.g., Meeting Notes).
  • A canonical name - a normalized form used for matching, where the name is lowercased and spaces/hyphens are replaced with underscores (e.g., meeting_notes).
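The canonical-name rule can be sketched in one line. The function name is hypothetical, and the source does not say how other characters or surrounding whitespace are handled, so this sketch applies only the two stated steps.

```python
import re


def canonical_tag_name(display_name: str) -> str:
    """Lowercase the name and replace each space or hyphen with an underscore."""
    return re.sub(r"[ -]", "_", display_name.lower())
```

With this rule, `Meeting Notes` and `meeting-notes` both map to `meeting_notes`, so they would match the same tag.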

Tags are scoped per user within a tenant. You can:

  • Assign tags during upload - pass tag names when uploading or importing a file.
  • Update tags - add or remove tags on an existing file at any time.
  • Filter by tags - list files that match specific tags, using all (AND) or any (OR) matching.
  • Search and validate - look up existing tags or validate new tag names before use.

Tags are created automatically when first used - if you assign a tag name that doesn't exist yet, Koldan creates it for you.
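The two filter modes correspond to set operations. The sketch below assumes tags are compared by canonical name; the function name and the `mode` values are hypothetical stand-ins for whatever the API actually accepts.

```python
from typing import Iterable


def matches_tags(file_tags: Iterable[str], wanted: Iterable[str], mode: str = "all") -> bool:
    """Return True when a file's tags satisfy the filter.

    mode="all" requires every wanted tag (AND);
    mode="any" requires at least one wanted tag (OR).
    """
    have = set(file_tags)
    want = set(wanted)
    if mode == "all":
        return want <= have
    if mode == "any":
        return bool(want & have)
    raise ValueError(f"unknown mode: {mode!r}")
```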

Listening Audio

When you upload a media file, Koldan can generate a listening audio derivative - a compressed MP3 version optimized for playback in web browsers and client applications. Listening audio serves several purposes:

  • Browser compatibility - the original file may be in a format that browsers can't play natively (e.g., WAV, FLAC, or multi-channel audio).
  • Reduced size - the MP3 derivative is significantly smaller than the original, making streaming faster and more efficient.
  • Storage savings - once the listening audio is generated, you can discard the original file content to free up storage quota while still being able to listen to the recording in the web interface.

| Listening Audio Status | Description |
| --- | --- |
| PENDING | Generation requested, not yet started |
| PROCESSING | Audio conversion is in progress |
| COMPLETED | MP3 ready for streaming |
| FAILED | Conversion failed |

Listening audio generation can be:

  • Automatic - enabled by default on upload (configurable via the generateListeningAudio parameter).
  • On-demand - triggered later via a dedicated endpoint on an existing file.

The listening audio has its own independent lifecycle - it can be deleted and purged separately from the original file. Streaming is done via an endpoint that supports HTTP range requests for seeking within the audio.
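Seeking relies on the standard HTTP `Range` header, where byte offsets are inclusive. The helper below is a hypothetical sketch that only builds the header; the streaming endpoint's URL and authentication are not shown because they are not specified here.

```python
from typing import Dict, Optional


def range_header(start: int, end: Optional[int] = None) -> Dict[str, str]:
    """Build a Range header for bytes start..end (inclusive).

    Omitting end requests everything from start to the end of the audio.
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    value = f"bytes={start}-" if end is None else f"bytes={start}-{end}"
    return {"Range": value}
```

A server that honors the range normally answers with 206 Partial Content and a `Content-Range` header describing the returned slice.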

Content Lifecycle

Files follow a four-stage content lifecycle:

stateDiagram-v2
    [*] --> Active
    Active --> ContentDiscarded: Discard content
    Active --> Deleted: Delete
    Deleted --> Purged: Purge / Retention expires
    ContentDiscarded --> Deleted: Delete
    Deleted --> Active: Restore

| Stage | What It Means |
| --- | --- |
| Active | File and its content are fully available |
| Content discarded | The binary content has been removed from storage, but the file metadata (name, tags, path, duration, etc.) is preserved. Useful when you want to keep the record without storing the media. Transcriptions and summaries remain available. |
| Deleted | The file is marked as deleted and hidden from default listings, but can still be found by filtering for deleted files. A scheduled purge date is set based on your retention policy. |
| Purged | The binary content is permanently removed from storage. The database record is retained for audit purposes, but the file can no longer be downloaded. |

Purge is irreversible

Once a file is purged - whether by explicit request or automatic retention - the binary content cannot be recovered. Only the metadata record remains.
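The state diagram for files can be written down as a transition table, which is handy for client-side validation before calling the API. The names below mirror the diagram; the helper itself is a hypothetical sketch, not part of Koldan.

```python
# Allowed file lifecycle transitions, copied from the state diagram.
FILE_TRANSITIONS = {
    "Active": {"ContentDiscarded", "Deleted"},
    "ContentDiscarded": {"Deleted"},
    "Deleted": {"Purged", "Active"},  # Deleted -> Active is a restore
    "Purged": set(),                  # terminal: purge is irreversible
}


def can_transition(current: str, target: str) -> bool:
    """Return True when the diagram allows moving from current to target."""
    return target in FILE_TRANSITIONS.get(current, set())
```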

File Sharing & Publishing

Files can be shared with other users in your tenant. Koldan supports two sharing mechanisms:

| Mechanism | Description |
| --- | --- |
| Direct share | Share a file with a specific user by their UUID, granting either VIEW (read-only) or EDIT (read + write) permission. |
| Internal publish | Set a file's publish mode to INTERNAL to make it visible to all authenticated users within the same tenant (read-only). |

Shared files - along with their transcriptions, summaries, and listening audio - become accessible to the recipient according to the granted permission level. Users can discover files shared with them via a dedicated endpoint and then drill into subresources (transcriptions, summaries) by passing the file ID.

File owners retain full control: they can manage shares, change the publish mode, and revoke access at any time.
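One way to reason about how the two mechanisms combine is to compute an effective permission per user. The sketch below is an assumption-heavy illustration: the `FileAccess` shape, the "OWNER" label, and the rule that the more permissive grant wins are all hypothetical. The File Sharing REST API guide remains the authority on the real permission matrix.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FileAccess:
    owner_id: str
    tenant_id: str
    publish_mode: Optional[str] = None  # "INTERNAL" or None (assumed representation)
    shares: Dict[str, str] = field(default_factory=dict)  # user id -> "VIEW" or "EDIT"


def effective_permission(f: FileAccess, user_id: str, user_tenant: str) -> Optional[str]:
    """Return "OWNER", "EDIT", "VIEW", or None for no access."""
    if user_tenant != f.tenant_id:
        return None  # sharing and publishing are tenant-scoped
    if user_id == f.owner_id:
        return "OWNER"
    share = f.shares.get(user_id)
    if share == "EDIT":
        return "EDIT"
    if share == "VIEW" or f.publish_mode == "INTERNAL":
        return "VIEW"  # internal publish is read-only
    return None
```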

For full API details, endpoint reference, and permission matrix, see the File Sharing REST API guide.


Transcriptions

A transcription job takes a file and produces a text transcript using Koldan's speech recognition engine.

Creating a Transcription

There are two approaches:

| Approach | Description |
| --- | --- |
| Two-step | Upload a file first, then create a transcription job referencing the file's ID. Useful when you want to transcribe the same file multiple times with different options. |
| Upload and transcribe | A single combined endpoint that uploads the file and starts a transcription job in one request. The most common approach for simple workflows. |

Transcription Options

When creating a transcription, you can configure:

  • Language - specify the source language or let the engine auto-detect it.
  • Model - choose which speech model to use.
  • Punctuation - add punctuation marks to the transcript (see below).
  • Capitalization - apply correct casing to the transcript (see below).
  • Diarization - enable speaker identification to label who said what (see below).
  • Webhook - receive a notification at a URL when the job completes or fails.

Punctuation and Capitalization

Language-specific - availability depends on the languages installed on your server.

Raw speech recognition output is typically all-lowercase with no punctuation marks. Koldan offers optional post-processing that enriches the transcript with proper punctuation and capitalization:

| Option | What It Does |
| --- | --- |
| Punctuation | Inserts punctuation marks (., ,, ?, !) at appropriate positions in the text. |
| Capitalization | Applies correct casing to words - sentence starts, proper nouns, acronyms, etc. |

Both options are independent toggles - you can enable one, both, or neither when creating a transcription.

When enabled, the enrichment is applied automatically after speech recognition completes. The results appear at two levels:

  • Segment text - the text field of each segment is rebuilt with punctuation and capitalization applied, so you get natural-looking sentences out of the box.
  • Word-level detail - each word in the words array carries a punctuation field (the mark following the word, e.g., "," or ".") and a capitalization field (the corrected form, e.g., "Hello", "NASA"). These are null when the respective service was not applied.
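To show how the word-level fields relate to the segment text, here is a sketch that rebuilds a sentence from a `words` array. The `punctuation` and `capitalization` field names come from the description above; the `word` field for the raw token and the space-joined output are assumptions, and the server's own rebuilding logic may differ.

```python
from typing import List, Optional, TypedDict


class Word(TypedDict):
    word: str                      # raw recognized token (assumed field name)
    punctuation: Optional[str]     # mark following the word, or None
    capitalization: Optional[str]  # corrected form, or None


def rebuild_segment_text(words: List[Word]) -> str:
    """Join words, preferring the corrected casing and appending punctuation."""
    parts = []
    for w in words:
        token = w["capitalization"] or w["word"]
        parts.append(token + (w["punctuation"] or ""))
    return " ".join(parts)
```

When both fields are null for every word, the result is simply the raw lowercase tokens joined by spaces.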

Availability

Punctuation and capitalization require a dedicated processing service configured on the server. If the service is not deployed, these options are silently ignored. Capitalization support may also vary by language - some languages may not support it.

Diarization

Diarization identifies and labels different speakers in the audio. Koldan supports multiple diarization modes:

  • Speaker diarization - automatically detects the number of speakers and assigns labels (Speaker 1, Speaker 2, etc.). You can optionally hint the expected number of speakers.
  • Channel diarization - maps audio channels to speakers (e.g., left channel = Agent, right channel = Customer). Useful for call center recordings with separate channels per participant.

Job Lifecycle

Every transcription job follows a status progression:

stateDiagram-v2
    [*] --> PENDING
    PENDING --> IN_PROGRESS: Processing starts
    IN_PROGRESS --> COMPLETED: Success
    IN_PROGRESS --> FAILED: Error
    PENDING --> CANCELLED: User cancels
    IN_PROGRESS --> CANCELLED: User cancels

| Status | Description |
| --- | --- |
| PENDING | Job is queued and waiting for an available processing slot |
| IN_PROGRESS | Speech recognition is actively processing the audio |
| COMPLETED | Transcription finished successfully - results are available |
| FAILED | Processing failed - check the error code and error details for the cause |
| CANCELLED | Job was cancelled by the user before completion |
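For clients that poll instead of using a webhook, the status progression suggests a simple loop that stops at any terminal status. This is a sketch under stated assumptions: `fetch_status` is a hypothetical callable that returns the job's current status string (for example by calling the API), and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

# Allowed job transitions, copied from the state diagram.
JOB_TRANSITIONS = {
    "PENDING": {"IN_PROGRESS", "CANCELLED"},
    "IN_PROGRESS": {"COMPLETED", "FAILED", "CANCELLED"},
    "COMPLETED": set(),
    "FAILED": set(),
    "CANCELLED": set(),
}
# A status is terminal when nothing follows it.
TERMINAL = {status for status, nxt in JOB_TRANSITIONS.items() if not nxt}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches COMPLETED, FAILED, or CANCELLED."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        sleep(poll_interval)
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays.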

Transcription Content Lifecycle

Transcriptions follow the same delete and purge pattern as files:

| Stage | What It Means |
| --- | --- |
| Active | Job and results are fully available |
| Deleted | Hidden from default listings. A purge date is scheduled per your retention policy. |
| Purged | Result data is permanently removed from storage. The job record remains for audit, but results return 410 Gone. |
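Client code can treat 410 Gone as a distinct, non-retryable outcome rather than a generic error. The sketch below is hypothetical: only the 410 response for purged results is stated above, and the assumption that a successful fetch returns 200 is mine.

```python
def interpret_result_status(http_status: int) -> str:
    """Classify the HTTP status of a request for transcription results."""
    if http_status == 200:  # assumed success code
        return "available"
    if http_status == 410:  # results were purged; retrying will not help
        return "purged"
    return "error"
```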