name: whisper
description: OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [openai-whisper, transformers, torch]
metadata:
  hermes:
    tags: [Whisper, Speech Recognition, ASR, Multimodal, Multilingual, OpenAI, Speech-To-Text, Transcription, Translation, Audio Processing]
Whisper - Robust Speech Recognition
OpenAI's multilingual speech recognition model.
When to use Whisper
Use when you need:
- Speech-to-text transcription (99 languages)
- Podcast/video transcription
- Meeting notes automation
- Translation to English
- Noisy audio transcription
- Multilingual audio processing
Metrics:
- 72,900+ GitHub stars
- 99 languages supported
- Trained on 680,000 hours of audio
- MIT License
Use alternatives instead:
- AssemblyAI: Managed API, speaker diarization
- Deepgram: Real-time streaming ASR
- Google Speech-to-Text: Cloud-based
Quick start
Installation
# Requires Python 3.8-3.11 and ffmpeg
# DO NOT use pip directly in hermes-agent venv — venv has no pip module
# Use uv instead (available at ~/.local/bin/uv or system-wide):
uv venv /home/lxgxdx/whisper-venv --python 3.11
uv pip install --python /home/lxgxdx/whisper-venv/bin/python openai-whisper
# Activate before use:
source /home/lxgxdx/whisper-venv/bin/activate
# Requires ffmpeg:
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# Windows: choco install ffmpeg
Pitfall — venv pip module missing: If python3 -m pip fails with No module named pip, the venv was created without pip. Recreate with uv venv above. Do NOT try to pip install pip or modify the venv — just recreate it.
Pitfall — model download on first load: whisper.load_model() downloads the model (~461MB for small) on first run. This can take 1-2 minutes on slow connections — this is normal, not a hang.
Basic transcription
import whisper
# Load model
model = whisper.load_model("base")
# Transcribe
result = model.transcribe("audio.mp3")
# Print text
print(result["text"])
# Access segments
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
Model sizes
# Available models
models = ["tiny", "base", "small", "medium", "large", "turbo"]
# Load specific model
model = whisper.load_model("turbo") # Fast, close to large quality; not trained for translation
| Model | Parameters | English-only | Multilingual | Speed | VRAM |
|---|---|---|---|---|---|
| tiny | 39M | ✓ | ✓ | ~32x | ~1 GB |
| base | 74M | ✓ | ✓ | ~16x | ~1 GB |
| small | 244M | ✓ | ✓ | ~6x | ~2 GB |
| medium | 769M | ✓ | ✓ | ~2x | ~5 GB |
| large | 1550M | ✗ | ✓ | 1x | ~10 GB |
| turbo | 809M | ✗ | ✓ | ~8x | ~6 GB |
Recommendation: Use turbo for best speed/quality, base for prototyping
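The table above can be turned into a small selection helper. The following is a minimal sketch, not part of the whisper API: the VRAM figures are the approximate values from the table, and the preference order (largest model first) is an assumption of this example.

```python
# Approximate VRAM needs in GB, copied from the model table above.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large": 10}

# Assumed preference order for this sketch: largest model first.
PREFERENCE = ["large", "turbo", "medium", "small", "base", "tiny"]


def pick_model(available_vram_gb: float) -> str:
    """Return the first model in PREFERENCE whose approximate VRAM need fits."""
    for name in PREFERENCE:
        if VRAM_GB[name] <= available_vram_gb:
            return name
    raise ValueError("less than ~1 GB of VRAM available; consider CPU inference")
```

The returned name can be passed straight to whisper.load_model(). If speed matters more than peak accuracy, moving turbo to the front of PREFERENCE matches the recommendation above.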
Transcription options
Chinese government meeting transcription
Government meetings involve multiple speakers, formal vocabulary, and policy terminology.
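Supplying a domain-specific initial_prompt biases decoding toward the expected vocabulary, which noticeably improves recognition of names and policy terms:
result = model.transcribe(
    "meeting.wav",
    language="zh",
    # Write the prompt in the language of the audio. English gloss: "This is a
    # recording of a government departmental affairs meeting attended by several
    # leading officials, discussing topics related to united front work."
    initial_prompt=(
        "这是一段政府部务会会议录音,与会人员包括多位领导干部,"
        "讨论统战工作相关议题。"
    ),
)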
Chinese meeting transcription workflow:
- Pre-check audio duration with: ffprobe -v quiet -show_entries format=duration -of csv=p=0 file.wav
- Files <1 min may be truncated or incomplete; check the file size before processing
- Long files (>30 min) may degrade; consider splitting into 30-minute parts with: ffmpeg -i in.wav -ss 0 -t 1800 part1.wav -ss 1800 -t 1800 part2.wav
- Use the small model for clean Chinese audio (good quality/speed balance on CPU); low-quality recordings need medium
- Output raw text first, then use an LLM to structure it into meeting-minutes format
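For recordings of arbitrary length, the split points can be computed rather than typed by hand. This is a minimal sketch under stated assumptions: plan_chunks and ffmpeg_split_commands are hypothetical helper names, the 30-minute default follows the >30 min guideline above, and the part1.wav, part2.wav naming is illustrative.

```python
def plan_chunks(duration_s: float, chunk_s: float = 1800.0):
    """Return (start, length) pairs, in seconds, that cover the whole file."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        # The last chunk is shortened so it does not run past the end.
        chunks.append((start, min(chunk_s, duration_s - start)))
        start += chunk_s
    return chunks


def ffmpeg_split_commands(src: str, duration_s: float, chunk_s: float = 1800.0):
    """Build one ffmpeg argument list per chunk (nothing is executed here)."""
    commands = []
    for index, (start, length) in enumerate(plan_chunks(duration_s, chunk_s), start=1):
        commands.append(
            ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-t", str(length), f"part{index}.wav"]
        )
    return commands
```

Each argument list can be passed to subprocess.run(cmd, check=True). Fixed-time cuts can land mid-sentence, so a word at a boundary may be split between two parts.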
Task selection
# Transcription (default)
result = model.transcribe("audio.mp3", task="transcribe")
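# Translation to English
# Note: turbo is not trained for translation and returns the source language;
# load another multilingual model (e.g. medium) for this task
result = model.transcribe("spanish.mp3", task="translate")
# Input: Spanish audio → Output: English text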
Initial prompt
# Improve accuracy with context
result = model.transcribe(
"audio.mp3",
initial_prompt="This is a technical podcast about machine learning and AI."
)
# Helps with:
# - Technical terms
# - Proper nouns
# - Domain-specific vocabulary
Quick file inspection before transcribing
# Check duration — files <5 seconds are truncated, skip them
ffprobe -v quiet -show_entries format=duration -of csv=p=0 file.wav
# Check audio quality
ffprobe -v quiet -show_entries stream=channels,sample_rate,codec_name -of csv=p=0 file.wav
# Check volume levels
ffmpeg -i file.wav -af volumedetect -f null /dev/null 2>&1 | grep -E "max_volume|mean_volume"
Heuristics:
- <5 seconds → truncated/invalid, skip
- ADPCM codec (adpcm_ms) → low quality, expect poor accuracy with small model
- mean_volume < -25 dB → too quiet, preprocess with volume boost
- audio check is essential before starting a long transcription; a 60-minute transcription failing due to a bad file wastes an hour
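The heuristics above can be collected into one pre-flight check. This is a minimal sketch: parse_mean_volume and audio_warnings are hypothetical helper names, the thresholds are the ones listed above, and the caller is assumed to have already captured the ffprobe values and the ffmpeg volumedetect stderr text.

```python
import re


def parse_mean_volume(ffmpeg_stderr: str):
    """Extract mean_volume (dB) from volumedetect output, or None if absent."""
    match = re.search(r"mean_volume:\s*(-?\d+(?:\.\d+)?)\s*dB", ffmpeg_stderr)
    return float(match.group(1)) if match else None


def audio_warnings(duration_s: float, codec: str, mean_volume_db):
    """Return a list of warnings based on the heuristics listed above."""
    warnings = []
    if duration_s < 5:
        warnings.append("truncated")
    if codec.startswith("adpcm"):
        warnings.append("low-quality codec")
    if mean_volume_db is not None and mean_volume_db < -25:
        warnings.append("too quiet")
    return warnings
```

An empty list means none of the listed heuristics fired; it does not guarantee a good transcription.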
Timestamps
# Word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
Temperature fallback
# Retry with different temperatures if confidence low
result = model.transcribe(
"audio.mp3",
temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
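To illustrate what "confidence low" means here, the sketch below is a simplified reconstruction of the fallback idea, not the library's code. A decode is retried at the next temperature when its text is too repetitive (high zlib compression ratio) or its average log-probability is too low. The thresholds mirror transcribe()'s compression_ratio_threshold=2.4 and logprob_threshold=-1.0 defaults; the decode callable is a hypothetical stand-in for the real decoder, and the library's additional no-speech handling is omitted.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def needs_fallback(text, avg_logprob, compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """True when a decode looks repetitive or low-confidence."""
    if compression_ratio(text) > compression_ratio_threshold:
        return True
    return avg_logprob < logprob_threshold


def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Call decode(t) -> (text, avg_logprob) until a result passes the checks.

    `decode` is a hypothetical callable used only for this illustration.
    Returns (temperature_used, (text, avg_logprob)); the last attempt is
    returned even if it still fails the checks.
    """
    if not temperatures:
        raise ValueError("at least one temperature is required")
    for temperature in temperatures:
        result = decode(temperature)
        if not needs_fallback(*result):
            break
    return temperature, result
```

Passing a single float such as temperature=0.0 to transcribe() disables the retries, which is faster but keeps whatever the first decode produced.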
Command line usage
# Basic transcription
whisper audio.mp3
# Specify model
whisper audio.mp3 --model turbo
# Output formats
whisper audio.mp3 --output_format txt # Plain text
whisper audio.mp3 --output_format srt # Subtitles
whisper audio.mp3 --output_format vtt # WebVTT
whisper audio.mp3 --output_format json # JSON with timestamps
# Language
whisper audio.mp3 --language Spanish
# Translation (turbo is not trained for translation, so pick another multilingual model)
whisper spanish.mp3 --model medium --task translate
Batch processing
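import os
import whisper

# Load the model once and reuse it for every file
model = whisper.load_model("base")
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
for audio_file in audio_files:
    print(f"Transcribing {audio_file}...")
    result = model.transcribe(audio_file)
    # Save the transcript next to the audio file, swapping only the extension
    output_file = os.path.splitext(audio_file)[0] + ".txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])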
CPU-optimized transcription (faster-whisper)
For CPU-only transcription (no GPU), prefer faster-whisper over openai-whisper. It is a CTranslate2-based reimplementation that supports int8 quantization and runs up to 4× faster with comparable accuracy.
uv pip install faster-whisper
from faster_whisper import WhisperModel
# CPU optimized: int8 gives up to ~4× speedup over openai-whisper float32
model = WhisperModel("medium", device="cpu", compute_type="int8")
# segments is a generator; decoding runs as you iterate over it
segments, info = model.transcribe("audio.wav", language="zh")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Recommended models by use case (CPU only)
| Scenario | Model | faster-whisper compute_type | Notes |
|---|---|---|---|
| Quick test / short clip | tiny or base | int8 | ~instant |
| English podcast/lecture | small | int8 | Good balance |
| Chinese meeting (low quality) | medium | int8 | Minimum for government/formal context |
| Chinese meeting (clean audio) | small | int8 | Acceptable with initial_prompt |
| Max accuracy (any language) | large-v3 | int8 | Slow but best |
Critical finding — Chinese government meeting audio: Low-bitrate WAV files (ADPCM, 32kHz mono) produce extremely poor results with small model — names, policy terms, and numbers are consistently misrecognized. medium int8 is the practical minimum for usable quality. Even with medium, expect ~70-80% accuracy; use an LLM pass to correct terminology afterward.
Audio preprocessing (required for low-quality recordings)
Before transcribing low-bitrate or noisy audio, always normalize and resample:
ffmpeg -y -i original.wav \
-af "highpass=f=200,lowpass=f=8000,volume=1.5,alimiter=limit=0.95" \
-ar 16000 -ac 1 -acodec pcm_s16le \
output_norm.wav
The filter chain: highpass removes rumble (f=200), lowpass removes high-frequency noise (f=8000), volume boosts quiet speech, alimiter prevents clipping. Output: 16kHz mono PCM — Whisper's optimal input.
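When many files need the same treatment, the command can be assembled from Python. This is a minimal sketch: build_normalize_command and normalize are hypothetical helper names, the filter chain is exactly the one shown above, and ffmpeg is assumed to be on PATH.

```python
import subprocess


def build_normalize_command(src: str, dst: str, gain: float = 1.5):
    """Return the ffmpeg argument list for the normalization step above."""
    filters = f"highpass=f=200,lowpass=f=8000,volume={gain},alimiter=limit=0.95"
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", filters,
        "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le",
        dst,
    ]


def normalize(src: str, dst: str, gain: float = 1.5) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_normalize_command(src, dst, gain), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.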
Real-time transcription
# faster-whisper yields segments lazily, so text can be printed while decoding continues.
# This is incremental output from a file, not live microphone streaming.
# uv pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cuda", compute_type="float16")
# segments is a generator; transcription runs as you iterate
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
GPU acceleration
import whisper
# Automatically uses GPU if available
model = whisper.load_model("turbo")
# Force CPU
model = whisper.load_model("turbo", device="cpu")
# Force GPU
model = whisper.load_model("turbo", device="cuda")
# 10-20× faster on GPU
Integration with other tools
Subtitle generation
# Generate SRT subtitles
whisper video.mp4 --output_format srt --language English
# Output: video.srt
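When transcribing from Python rather than the CLI, the same subtitle format can be produced from result["segments"]. This is a minimal sketch: srt_timestamp and segments_to_srt are hypothetical helper names, and each segment is assumed to carry the start, end, and text keys shown under Basic transcription.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

Writing segments_to_srt(result["segments"]) to a .srt file with UTF-8 encoding gives a subtitle file most players accept.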
With LangChain
import whisper
from langchain_core.documents import Document
# Transcribe locally, then wrap the text in a LangChain Document
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
docs = [Document(page_content=result["text"], metadata={"source": "audio.mp3"})]
# Use transcription in RAG
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
Extract audio from video
# Use ffmpeg to extract audio
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav
# Then transcribe
whisper audio.wav
Best practices
- Use turbo model - Best speed/quality for English
- Specify language - Faster than auto-detect
- Add initial prompt - Improves technical terms
- Use GPU - 10-20× faster
- Batch process - Load the model once and reuse it across files
- Convert to WAV - Better compatibility
- Split long audio - <30 min chunks
- Check language support - Quality varies by language
- Use faster-whisper - 4× faster than openai-whisper
- Monitor VRAM - Scale model size to hardware
Performance
| Model | Real-time factor (CPU) | Real-time factor (GPU) |
|---|---|---|
| tiny | ~0.32 | ~0.01 |
| base | ~0.16 | ~0.01 |
| turbo | ~0.08 | ~0.01 |
| large | ~1.0 | ~0.05 |
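Real-time factor = processing time ÷ audio duration, so lower is faster: 0.1 means 10× faster than real-time. These figures are approximate and hardware-dependent. The CPU column does not follow the relative speeds in the Model sizes table (tiny should be the fastest), so benchmark on your own hardware before relying on it.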
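Because these numbers vary by machine, it can help to measure the real-time factor directly. This is a minimal sketch: real_time_factor and timed are hypothetical helper names, and the audio duration is assumed to come from ffprobe as shown under Quick file inspection.

```python
import time


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, result, elapsed = timed(model.transcribe, "audio.mp3") followed by real_time_factor(elapsed, duration) gives a figure comparable to the table. The first call may also include one-off warm-up cost, so a second run is usually more representative.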
Language support
Top-supported languages:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Japanese (ja)
- Korean (ko)
- Chinese (zh)
Full list: 99 languages total
Limitations
- Hallucinations - May repeat or invent text
- Long-form accuracy - Degrades on >30 min audio
- Speaker identification - No diarization
- Accents - Quality varies
- Background noise - Can affect accuracy
- Real-time latency - Not suitable for live captioning
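Some hallucinated output can be flagged after the fact using the per-segment statistics that openai-whisper returns. This is a heuristic sketch, not a guaranteed detector: suspicious_segments is a hypothetical helper name, and the default thresholds are borrowed from transcribe()'s own defaults as an assumption.

```python
def suspicious_segments(
    segments,
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
    compression_ratio_threshold=2.4,
):
    """Return (segment_id, reasons) for segments that look hallucinated."""
    flagged = []
    previous_text = None
    for seg in segments:
        text = seg["text"].strip()
        reasons = []
        # The same text twice in a row is a common repetition failure.
        if text and text == previous_text:
            reasons.append("repeated")
        # Highly compressible text is internally repetitive.
        if seg.get("compression_ratio", 0.0) > compression_ratio_threshold:
            reasons.append("repetitive")
        # Confident "no speech" plus low log-probability suggests invented text.
        if (
            seg.get("no_speech_prob", 0.0) > no_speech_threshold
            and seg.get("avg_logprob", 0.0) < logprob_threshold
        ):
            reasons.append("likely silence")
        if reasons:
            flagged.append((seg.get("id"), reasons))
        previous_text = text
    return flagged
```

Flagged segments are candidates for manual review or re-transcription of that time range. Segments that pass can still contain errors.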
Resources
- GitHub: https://github.com/openai/whisper ⭐ 72,900+
- Paper: https://arxiv.org/abs/2212.04356
- Model Card: https://github.com/openai/whisper/blob/main/model-card.md
- Colab: Available in repo
- License: MIT