whisper

star 119

OpenAI Whisper for speech recognition and transcription — local inference, multiple model sizes, language detection, and subtitle generation.

AlexAI-MCP By AlexAI-MCP schedule Updated 4/7/2026

name: whisper description: OpenAI Whisper for speech recognition and transcription — local inference, multiple model sizes, language detection, and subtitle generation. version: 1.0.0 author: hermes-CCC (ported from Hermes Agent by NousResearch) license: MIT metadata: hermes: tags: [MLOps, Whisper, Speech-Recognition, Transcription, Audio, ASR] related_skills: []


Whisper

Purpose

  • Use this skill for local speech-to-text transcription and subtitle generation.
  • Prefer it when privacy, offline processing, or batch audio workflows matter.
  • Whisper works well for meetings, podcasts, interviews, voice notes, and extracted video audio.
  • It supports multilingual transcription and language detection.

Install

  • Standard Whisper package:
pip install openai-whisper
  • Faster inference alternative:
pip install faster-whisper
  • faster-whisper is often the better default for production batch jobs because it is typically faster at similar accuracy.

Command Line Usage

  • Basic CLI transcription:
whisper audio.mp3 --model medium --language en
  • Use GPU explicitly:
whisper audio.mp3 --model medium --language en --device cuda
  • Generate subtitles in SRT format:
whisper audio.mp3 --model medium --language en --output_format srt
  • The CLI is a good fit for one-off transcription, shell scripts, and media preprocessing jobs.

Python API

import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.mp3")

print(result["text"])
print(result["language"])
print(result["segments"][:2])
  • Standard load pattern:
  • import whisper
  • model = whisper.load_model('medium')

What transcribe() Returns

  • text: the full transcript

  • segments: timestamped segment-level outputs

  • language: detected or selected language code

  • Example shape:

{
    "text": "Full transcript text",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 4.5, "text": "Hello everyone"},
    ],
}

Model Sizes

  • tiny

  • base

  • small

  • medium

  • large-v3

  • The tradeoff is simple:

  • smaller models are faster and cheaper

  • larger models are slower but more accurate

Model Selection Guidance

  • tiny: fast experiments and low-resource CPU runs
  • base: simple automation on clean audio
  • small: balanced for lightweight production tasks
  • medium: common quality default for serious transcription
  • large-v3: best accuracy when latency and VRAM are acceptable

Language Selection

  • Set the language when you know it:
whisper audio.mp3 --model medium --language en
  • Explicit language hints usually improve speed and stability.
  • If the language is unknown, let the model detect it.

Language Detection

  • Whisper can estimate the spoken language from audio features.
  • Example pattern:
import whisper

model = whisper.load_model("medium")
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)
language = max(probs, key=probs.get)
print(language)
  • model.detect_language(audio) is the key workflow concept, though in practice you pass the processed spectrogram tensor.

Subtitle Generation

  • Generate .srt subtitles from the CLI:
whisper audio.mp3 --model medium --language en --output_format srt
  • Subtitle outputs are useful for:
  • video captions
  • podcast transcripts
  • lecture indexing
  • searchable archives

Batch Processing Multiple Files

  • Simple shell loop:
Get-ChildItem *.mp3 | ForEach-Object {
  whisper $_.FullName --model medium --language en --output_format srt
}
  • Python batch example:
from pathlib import Path

import whisper

model = whisper.load_model("medium")

for path in Path("audio").glob("*.mp3"):
    result = model.transcribe(str(path))
    out_path = path.with_suffix(".txt")
    out_path.write_text(result["text"], encoding="utf-8")
  • Batch processing is a common pattern for meeting folders, call archives, and downloaded media collections.

GPU vs CPU

  • GPU example from the CLI:
whisper audio.mp3 --model medium --device cuda
  • GPU is strongly preferred for:

  • medium

  • large-v3

  • multi-file batch jobs

  • CPU is acceptable for:

  • tiny

  • base

  • occasional short clips

Common Use Cases

  • YouTube audio transcription after extracting audio from video
  • meeting notes from Zoom or Teams recordings
  • podcast transcription for search and republishing
  • lecture indexing and subtitle generation
  • multilingual voice note transcription

faster-whisper

  • Install with:
pip install faster-whisper
  • It is often around 4x faster while maintaining comparable accuracy.
  • It is a strong choice for MLOps pipelines where throughput matters.

faster-whisper Example

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(info.language, info.language_probability)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
  • faster-whisper is particularly useful for server-side or queue-based transcription systems.

Output Management

  • Save plain text for downstream NLP.
  • Save SRT for subtitles and media players.
  • Save structured segments for search, speaker labeling pipelines, or analytics.

Segment-Level Workflows

  • Segment timestamps make it easy to:
  • jump to precise moments in media
  • chunk transcripts for embeddings
  • build searchable meeting or lecture interfaces
  • align captions with edited video

Quality Tips

  • Use the cleanest source audio available.
  • Downmix weird multi-channel recordings if channels are corrupted or imbalanced.
  • Remove long leading silence when possible.
  • Pick a larger model for noisy audio, accents, or technical vocabulary.

Common Failure Modes

  • Slow runtime:

  • switch to faster-whisper

  • move from CPU to GPU

  • use a smaller model

  • Bad language choice:

  • set --language en or another known language explicitly

  • inspect the detected language before large batch runs

  • Poor transcript quality:

  • upgrade from base or small to medium or large-v3

  • improve the source audio

  • split very long recordings into manageable chunks

  • Memory issues:

  • use a smaller model

  • run on GPU with enough VRAM

  • use faster-whisper with a suitable compute type

Recommended Workflow

  • Start with medium for general English transcription.
  • Move to large-v3 when quality matters more than speed.
  • Use faster-whisper for production batch jobs.
  • Save both transcript text and structured segments.

When To Use This Skill

  • You need local transcription rather than a hosted ASR API.
  • You want subtitle files for recorded media.
  • You need multilingual speech recognition with no cloud dependency.
  • You are building meeting, media, or archival transcription pipelines.

Quick Reference

  • Install: pip install openai-whisper
  • Faster option: pip install faster-whisper
  • CLI: whisper audio.mp3 --model medium --language en
  • GPU: --device cuda
  • SRT subtitles: --output_format srt
  • Python load: model = whisper.load_model("medium")
  • Result fields: text, segments, language
Install via CLI
npx skills add https://github.com/AlexAI-MCP/hermes-CCC --skill whisper
Repository Details
star Stars 119
call_split Forks 23
navigation Branch main
article Path SKILL.md
More from Creator