Speech-to-Text API Guide

Overview

The Audio API provides two main endpoints:

📝 transcriptions: Convert audio to text

🔄 translations: Translate audio to English

Supported Formats

📁 File size: Maximum 25 MB

🎵 Supported formats: mp3, mp4, mpeg, mpg, m4a, wav, webm
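A pre-flight check can catch an unsupported or oversized file before it is uploaded. The sketch below is illustrative only: the names `check_upload` and `check_file` are hypothetical, the extension list is copied from this guide, and 25 MB is assumed to mean 25 × 1024 × 1024 bytes.

```python
import os

# Limits copied from this guide; the byte interpretation of "25 MB" is an assumption.
MAX_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpg", "m4a", "wav", "webm"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'none'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems


def check_file(path: str) -> list[str]:
    """Convenience wrapper that reads the file size from disk."""
    return check_upload(path, os.path.getsize(path))
```

The check only looks at the file extension, not the actual container or codec, so a passing result is a hint rather than a guarantee.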

Usage

1. Transcription

Converts audio to text in the language spoken in the recording.
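A minimal sketch of a transcription call, assuming the service is reached through the OpenAI Python SDK (`openai` package) or a compatible client. The model name `whisper-1`, the helper names, and the file name are assumptions, not something this guide specifies.

```python
def transcribe(client, path: str, model: str = "whisper-1") -> str:
    """Send one audio file to the transcriptions endpoint and return the text."""
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


def make_client():
    """Build a client; assumes the `openai` package is installed and
    the OPENAI_API_KEY environment variable is set."""
    from openai import OpenAI  # imported lazily so transcribe() has no hard dependency
    return OpenAI()


# Usage (hypothetical file name):
# print(transcribe(make_client(), "meeting.mp3"))
```

Passing the client in as an argument keeps the helper easy to test with a stub and lets the same function work against any compatible endpoint.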

2. Translation

Converts audio in any supported language to English text.
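The translation call has the same shape as transcription but goes to the translations endpoint. This is a sketch under the same assumptions (OpenAI Python SDK or a compatible client, model name `whisper-1`, hypothetical helper and file names).

```python
def translate_to_english(client, path: str, model: str = "whisper-1") -> str:
    """Send one audio file to the translations endpoint and return English text."""
    with open(path, "rb") as audio_file:
        result = client.audio.translations.create(model=model, file=audio_file)
    return result.text


# Usage (hypothetical file name; assumes the `openai` package and an API key):
# from openai import OpenAI
# print(translate_to_english(OpenAI(), "interview_ja.m4a"))
```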

3. Timestamp Feature
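This guide names the feature without describing it. As a hedged sketch: in the OpenAI Python SDK, timestamps are requested on the transcriptions endpoint with `response_format="verbose_json"` and `timestamp_granularities`. Whether this service behaves identically is an assumption, as are the model name and the helper names below.

```python
def transcribe_with_word_times(client, path: str, model: str = "whisper-1"):
    """Request verbose JSON and return (word, start_seconds, end_seconds) tuples."""
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
    return [(w.word, w.start, w.end) for w in result.words]


def format_time(seconds: float) -> str:
    """Render a time in seconds as MM:SS.mmm for display."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    return f"{minutes:02d}:{ms / 1000:06.3f}"
```

Passing `["segment"]` instead of `["word"]` would ask for coarser, sentence-like spans; in that case the response is read from `result.segments` rather than `result.words`.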

4. Handling Large Files

Split files larger than 25 MB into smaller chunks with PyDub, then submit each chunk separately.
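One way to do the split is sketched below, assuming `pydub` and ffmpeg are installed. The 10-minute chunk length, the output naming, and both helper names are illustrative choices; the chunk length that stays under 25 MB depends on the bitrate of the audio.

```python
def chunk_ranges(total_ms: int, chunk_ms: int) -> list[tuple[int, int]]:
    """Return (start, end) millisecond ranges that cover the whole recording."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


def split_audio(path: str, chunk_ms: int = 10 * 60 * 1000,
                out_prefix: str = "chunk") -> list[str]:
    """Write consecutive chunks as MP3 files and return their paths."""
    from pydub import AudioSegment  # lazy import; pydub relies on ffmpeg for most formats

    audio = AudioSegment.from_file(path)
    out_paths = []
    # len(audio) is the duration in milliseconds, and slicing is in milliseconds too.
    for index, (start, end) in enumerate(chunk_ranges(len(audio), chunk_ms)):
        out_path = f"{out_prefix}_{index:03d}.mp3"
        audio[start:end].export(out_path, format="mp3")
        out_paths.append(out_path)
    return out_paths
```

Fixed-length cuts can land in the middle of a word or sentence, which may cost some context at chunk boundaries. Cutting at pauses, where possible, tends to give cleaner results.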

Optimization Recommendations

Prompt Usage Tips

🔍 Correct specific vocabulary recognition

📜 Maintain contextual coherence

✍️ Control punctuation output

🗣️ Preserve filler words

📝 Control output text style (e.g., Simplified vs. Traditional Chinese)
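The tips above all work through one mechanism: a short text prompt sent along with the audio. A minimal sketch, assuming the OpenAI Python SDK's `prompt` parameter on the transcriptions endpoint; the model name, helper name, and example prompt strings are illustrative.

```python
def transcribe_with_prompt(client, path: str, prompt: str,
                           model: str = "whisper-1") -> str:
    """Transcribe one file, passing a text prompt to steer vocabulary and style."""
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            prompt=prompt,
        )
    return result.text


# Illustrative prompts, one per kind of tip (wording is an assumption):
VOCABULARY_PROMPT = "Glossary: Kubernetes, PyDub, WebM, gRPC."
PUNCTUATION_PROMPT = "Hello, welcome to my lecture. Today, we will cover three topics."
FILLER_PROMPT = "Umm, let me think, like, hmm... Okay, here's what I'm, like, thinking."
```

For contextual coherence across chunks of a split file, a common approach is to pass the transcript of the previous chunk as the prompt for the next one. The prompt is a hint about style and spelling, not an instruction the model is guaranteed to follow.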

Supported Languages

Supports 98 languages, including:

Major Asian languages: Chinese, Japanese, Korean, etc.

European languages: English, French, German, etc.

Other regional languages: Arabic, Hindi, etc.

Note: Only languages with word error rate (WER) below 50% are listed. Other languages are supported but may have lower quality.
