whisper-preprocess


Audio preprocessing and transcription pipeline using ffmpeg + OpenAI Whisper. Use this skill whenever the user wants to transcribe audio or video files (lectures, meetings, podcasts, interviews), improve audio quality before transcription, remove silence from recordings, or mentions Whisper, speech-to-text, or transcricao. Also trigger when the user has MKV, MP4, WAV, M4A, or other media files they want converted to text. This skill handles the full pipeline: audio extraction, silence removal, voice enhancement, segmentation, and Whisper transcription. Works 100% offline.

By j0ruge. Updated 5/30/2026.

name: whisper-preprocess
version: 1.0.0

Whisper Preprocess Pipeline

You are setting up and running an audio preprocessing + transcription pipeline. This pipeline solves four critical Whisper problems:

  1. Hallucinations during silence — Whisper invents text like "Obrigado", "E ai", "Subscribe" during quiet segments
  2. Hallucination cascade in long audio — after the first hallucination, Whisper repeats it for the rest of the file
  3. Poor transcription from noisy audio — background noise reduces accuracy
  4. Silent failures on Windows due to encoding — the whisper CLI crashes when printing accented/CJK characters in a cp1252 console, producing empty .txt files with no clear error

The pipeline runs 100% offline once the Whisper model is cached locally; it only needs ffmpeg, Python, and the downloaded model.

Critical lessons learned (from real-world testing)

These are hard-won insights from iterating on a 2h13min lecture recording and a multilingual (PT+JA) meeting. Follow them carefully:

1. NEVER use afftdn (FFT noise reduction)

The afftdn filter creates tonal artifacts ("musical noise") that sound like constant beeping/whistling. This completely destroys the audio for Whisper: every segment will transcribe as garbage. The same applies to lowpass=f=8000, which can interact badly with other filters.

Safe enhancement chain: highpass=f=100, acompressor, loudnorm — nothing else.

2. Remove silence BEFORE enhancement

The loudnorm filter normalizes everything to -16 LUFS, including silence. If you run silenceremove after loudnorm, it finds nothing to remove because the "silence" has been boosted. Always: extract → remove silence → enhance.
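The extract → remove silence → enhance order can be sketched as three ffmpeg invocations. This is a minimal sketch, not the script's actual code: the function name, the intermediate file names, and the loudnorm TP and LRA values are assumptions; only the stage order, the 16 kHz mono extraction, the -30dB threshold, and the highpass, acompressor, loudnorm chain come from this document.

```python
from pathlib import Path

# Filters this document says to keep out of the transcription path.
BANNED_FILTERS = ("afftdn", "lowpass")

# Assumed parameter values; the text only fixes the filter names, -30dB and -16 LUFS.
SILENCE_FILTER = "silenceremove=stop_periods=-1:stop_duration=2.0:stop_threshold=-30dB"
ENHANCE_FILTER = "highpass=f=100,acompressor,loudnorm=I=-16:TP=-1.5:LRA=11"


def plan_transcription_path(src, workdir,
                            silence_filter=SILENCE_FILTER,
                            enhance_filter=ENHANCE_FILTER):
    """Return ffmpeg argv lists in the order: extract, remove silence, enhance."""
    for name in BANNED_FILTERS:
        if name in silence_filter or name in enhance_filter:
            raise ValueError(f"{name} must not appear in the transcription path")
    work = Path(workdir)
    raw = str(work / "01_raw.wav")             # hypothetical file names
    trimmed = str(work / "02_nosilence.wav")
    enhanced = str(work / "03_enhanced.wav")
    return [
        # 1. Extract: drop video, downmix to mono, resample to 16 kHz.
        ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", raw],
        # 2. Remove silence on the RAW audio, before any gain changes.
        ["ffmpeg", "-y", "-i", raw, "-af", silence_filter, trimmed],
        # 3. Enhance last, so loudnorm cannot boost silence first.
        ["ffmpeg", "-y", "-i", trimmed, "-af", enhance_filter, enhanced],
    ]
```

Each list can be passed to subprocess.run in sequence; swapping steps 2 and 3 reproduces the problem described above.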

3. Silence threshold should be -30dB, not -40dB

Most recordings have background noise above -40dB. At -40dB threshold, silenceremove finds zero silence. Use -30dB as default. For noisier recordings, try -25dB.
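A small helper makes the threshold easy to vary between recordings. This is a sketch under assumptions: the function and argument names are hypothetical, and stop_periods=-1 (trim silence throughout the file rather than only at the end) is my choice, not something the text states. The remaining option values are the ones this document lists for the transcription path.

```python
def silenceremove_filter(threshold_db=-30.0, min_silence_s=2.0, keep_s=0.5):
    """Build an ffmpeg silenceremove filter string for the transcription path."""
    if threshold_db >= 0:
        raise ValueError("threshold must be a negative dB value")
    opts = {
        "stop_periods": -1,                        # assumed: trim silence across the whole file
        "stop_duration": min_silence_s,            # minimum silence length to remove (s)
        "stop_threshold": f"{threshold_db:g}dB",   # quieter than this counts as silence
        "stop_silence": keep_s,                    # silence kept at each cut (s)
        "detection": "rms",
        "window": 0.025,
    }
    return "silenceremove=" + ":".join(f"{k}={v}" for k, v in opts.items())


print(silenceremove_filter())        # default -30dB
print(silenceremove_filter(-25.0))   # noisier recording
```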

4. Segment long audio (mandatory for >10 minutes)

Whisper hallucinates on long audio files. A 2h13min file produced only ~19 lines of real content before devolving into repeated "Obrigado" or "E ai" for the remaining 2 hours. Segmenting into 10-minute chunks solved this completely — each segment produced ~6-7K characters of accurate transcription.
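To make the chunking concrete, the sketch below computes segment boundaries for a given duration. The function name is hypothetical, and the real script may instead rely on ffmpeg's own segmenting; this only illustrates the arithmetic.

```python
def segment_bounds(total_seconds, segment_minutes=10.0):
    """Return (start, end) pairs in seconds covering the whole file."""
    if total_seconds <= 0 or segment_minutes <= 0:
        raise ValueError("duration and segment length must be positive")
    step = segment_minutes * 60.0
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + step, total_seconds)  # last chunk may be shorter
        bounds.append((start, end))
        start = end
    return bounds


# A 2h13min recording (7980 s) in 10-minute chunks.
chunks = segment_bounds(7980)
print(len(chunks), chunks[-1])
```

For the 2h13min example this yields 13 full chunks plus a final 3-minute one, so a hallucination in any one chunk cannot spread beyond it.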

5. Use --condition_on_previous_text False

Without this, once Whisper starts hallucinating, it conditions on its own hallucinated output and the error cascades. With False, each segment starts fresh. This doesn't prevent hallucinations entirely (segmentation does that), but limits the damage.
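As an illustration, a per-segment whisper CLI call with this setting could be assembled as below. The function name and defaults are assumptions; the flags themselves (--condition_on_previous_text, --temperature, --initial_prompt, --output_format, --output_dir) are options of the openai-whisper command line.

```python
def whisper_command(segment_path, language="Portuguese", model="turbo",
                    device="cuda", output_dir=".", initial_prompt=None):
    """Build the argv for transcribing one segment with the whisper CLI."""
    cmd = [
        "whisper", segment_path,
        "--model", model,
        "--language", language,
        "--device", device,
        "--temperature", "0",
        # Each segment starts fresh instead of conditioning on earlier output.
        "--condition_on_previous_text", "False",
        "--output_format", "all",
        "--output_dir", output_dir,
    ]
    if initial_prompt:
        cmd += ["--initial_prompt", initial_prompt]
    return cmd


print(" ".join(whisper_command("seg_000.wav")))
```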

6. Compress output audio to OPUS (listening copy, decoupled — see lesson #9)

Enhanced WAV files are enormous (2.6 GB for 2h). The listenable *_enhanced.opus is now encoded at 48 kHz, 64 kbps, with -application audio (wideband, fuller voice). The old -application voip setting, which is tuned for telephony-style speech, was dropped. Bitrate and sample rate are configurable via --listen-bitrate / --listen-sr.

9. NEVER feed the listening copy through silence-removal + dynamic AGC (anti-"picotamento")

Symptom: the voice in *_enhanced.opus is choppy / cuts between syllables ("picotando"), worst for a low-volume or impaired (dysarthric, slow) speaker.

Cause: the listenable file used to inherit the whole transcription chain — silenceremove (cuts quiet speech) + acompressor release=50ms (pumps every syllable gap) + single-pass loudnorm (dynamic AGC that "breathes" on pauses). Three gain-modulation/chopping sources stacked.

Fix (now built in): the listening copy is decoupled from the transcription path and generated by build_listenable() from the original file:

  • wideband (48 kHz), no silence removal by default → continuous audio, the listener follows every pause and nothing is clipped;
  • stable gain only — slow-release compressor (release=400ms) + fixed makeup + true-peak alimiter with lookahead. No dynamic AGC. Verified A/B on a real recording: a 2-pass loudnorm linear=true reverted to "Dynamic" on clipping sources, so it is NOT used here.
  • limiter ceiling has headroom (--listen-limit 0.6) for Opus inter-sample overshoot → final true-peak stays below 0 dBFS (the old recipe actually clipped at +0.7 dBFS).
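The decoupled listening chain can be sketched as a single ffmpeg command built from the original file. This is not the actual build_listenable() implementation: the function name listenable_command, the dB-to-linear conversion for acompressor's makeup option, and disabling alimiter's auto-level (level=0) so the ceiling is not normalized back up are my assumptions. The numeric defaults are the ones listed in this document.

```python
def db_to_linear(db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10 ** (db / 20)


def listenable_command(src, dst, sample_rate=48000, bitrate="64k",
                       highpass_hz=80, comp_threshold_db=-22.0,
                       comp_ratio=3.0, makeup_db=5.0, limit=0.6):
    """Build an ffmpeg argv for the listening copy (no silenceremove, no loudnorm)."""
    chain = ",".join([
        f"highpass=f={highpass_hz}",
        # Slow release (400 ms) so the gain does not pump between syllables.
        f"acompressor=threshold={comp_threshold_db:g}dB:ratio={comp_ratio:g}"
        f":release=400:makeup={db_to_linear(makeup_db):.3f}",
        # Assumed: level=0 keeps the limiter ceiling as headroom for Opus.
        f"alimiter=limit={limit:g}:level=0",
    ])
    return [
        "ffmpeg", "-y", "-i", src, "-vn",
        "-af", chain,
        "-ar", str(sample_rate),
        "-c:a", "libopus", "-b:a", bitrate, "-application", "audio",
        dst,
    ]


print(" ".join(listenable_command("call.m4a", "call_enhanced.opus")))
```

Raising makeup_db here plays the role of --listen-makeup for a quiet speaker; the compressor and limiter settings stay fixed, so the gain remains stable over pauses.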

The transcription path is unchanged in spirit (16 kHz, silenceremove at -30dB, then enhance) — only made gentler: detection=rms, window=0.025, stop_silence=0.5, stop_duration=2.0, which also helps Whisper on slow speakers. Lessons #1 (no afftdn) and #3 (-30dB) still hold; denoise on the listening copy is opt-in and off by default (--denoise).

7. UTF-8 everywhere on Windows (handled automatically)

On Windows the default console encoding is cp1252. When the whisper CLI prints transcription progress containing accented or CJK characters, it crashes with UnicodeEncodeError, and the reader thread of Python's subprocess.run separately crashes with UnicodeDecodeError on bytes > 0x7F. The result: all segments come back empty with the misleading log line "Segmento vazio ou falhou" ("Segment empty or failed"). The pipeline now sets PYTHONIOENCODING=utf-8, reconfigures sys.stdout/stderr, and forces encoding="utf-8", errors="replace" on every subprocess.run, so no user action is needed. If you ever invoke whisper standalone on Windows, set $env:PYTHONIOENCODING="utf-8" first.
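A minimal sketch of that handling, assuming hypothetical helper names run_utf8 and reconfigure_console (the script's own function names are not given in this document):

```python
import os
import subprocess
import sys


def reconfigure_console():
    """Make this process's own stdout/stderr tolerate non-cp1252 characters."""
    for stream in (sys.stdout, sys.stderr):
        if hasattr(stream, "reconfigure"):
            stream.reconfigure(encoding="utf-8", errors="replace")


def run_utf8(cmd):
    """Run a child process with UTF-8 forced on both ends of the pipe."""
    env = dict(os.environ, PYTHONIOENCODING="utf-8")  # child writes UTF-8
    return subprocess.run(
        cmd,
        env=env,
        capture_output=True,
        encoding="utf-8",    # parent decodes UTF-8
        errors="replace",    # never raise on a stray byte
    )


# Any Python child stands in for the whisper CLI in this demonstration.
result = run_utf8([sys.executable, "-c", "print('transcri\\u00e7\\u00e3o')"])
print(result.returncode, len(result.stdout.strip()))
```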

8. For multilingual recordings, run two passes and merge

A single --language choice loses everything spoken in another script. Run a primary pass in the dominant language, a secondary pass in the second language, and merge the SRTs choosing the secondary version where the script ratio is high. See "Multilingual recordings" below — the script does this for you with --secondary-language.

The Pipeline (6 stages)

Input (MKV/MP4/etc.)
  |
  v
[1] Extract audio → WAV 16kHz mono
  |
  v
[2] Remove silence (transcription path, on RAW 16kHz, before enhancement)
  |  silenceremove threshold=-30dB, duration=2.0s, padding=0.5s, detection=rms
  v
[3] Enhance audio (transcription path)
  |  highpass=100Hz → acompressor → loudnorm EBU R128
  |  NO afftdn, NO lowpass (they cause artifacts)
  v
[4] Segment into ~10min chunks → cached by input hash
  |
  v
[5] Transcribe each segment with Whisper
  |  temperature=0, condition_on_previous_text=False
  |  optional 2nd pass in --secondary-language + merge
  |
  |  (PARALLEL, decoupled) build_listenable() from the ORIGINAL file →
  |  48kHz, NO silence removal, stable gain (slow compressor + true-peak limiter,
  |  no dynamic AGC) → continuous, no "picotamento" — see lesson #9
  v
[6] Compress listening copy to OPUS (48k, application=audio) + cleanup
  |
  v
Output: *_transcricao_<lang>.txt/.srt, *_transcricao_merged.txt, *_enhanced.opus

Prerequisites

ffmpeg -version        # audio processing
whisper --help         # transcription (pip install -U openai-whisper)
python --version       # 3.9+

Check required ffmpeg filters:

ffmpeg -filters 2>/dev/null | grep -E "silenceremove|loudnorm|highpass|acompressor"

How to use

Copy scripts/whisper_preprocess.py to the user's working directory, then:

# Basic — full pipeline with auto-cleanup
python whisper_preprocess.py "lecture.mkv"

# With context for proper nouns and technical terms
python whisper_preprocess.py "lecture.mkv" --initial-prompt "Seminario de Teologia, professor Carlos"

# Test preprocessing only (listen before transcribing)
python whisper_preprocess.py "lecture.mkv" --only-preprocess

# Keep intermediate files for debugging / segment reuse
python whisper_preprocess.py "lecture.mkv" --keep-intermediate

# CPU only (no GPU)
python whisper_preprocess.py "lecture.mkv" --device cpu

# Better quality model (needs ~10GB VRAM)
python whisper_preprocess.py "lecture.mkv" --model large-v2

# Noisier recording — more aggressive silence detection
python whisper_preprocess.py "lecture.mkv" --silence-threshold="-25dB"

# Quiet / impaired (slow, low-volume) speaker — louder, fuller listening copy
python whisper_preprocess.py "call.m4a" --listen-makeup 7 --listen-comp-threshold="-24dB"

# Longer segments for very clean audio
python whisper_preprocess.py "lecture.mkv" --segment-minutes 15

# Multilingual (e.g., PT + JA meeting with automatic merge)
python whisper_preprocess.py "meeting.mkv" \
  --language Portuguese --secondary-language Japanese \
  --initial-prompt "Reuniao tecnica sobre UPS, participantes Kusaba, Murata, Heitor" \
  --secondary-initial-prompt "Kusaba Murata UPS battery substation"

# Offline mode (model already downloaded, no internet)
python whisper_preprocess.py "lecture.mkv" --offline

# Force reprocessing, ignoring the segment cache
python whisper_preprocess.py "lecture.mkv" --force-reprocess

Important: when passing negative thresholds, use = syntax: --silence-threshold="-30dB" (not --silence-threshold "-30dB", which argparse interprets as a flag).
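The behavior is easy to reproduce with a two-argument parser. The parser below is a stand-in for illustration, not the script's real argument definition:

```python
import argparse


def build_parser():
    """A reduced stand-in for the script's parser."""
    parser = argparse.ArgumentParser(prog="whisper_preprocess.py")
    parser.add_argument("input")
    parser.add_argument("--silence-threshold", default="-30dB")
    return parser


parser = build_parser()

# The = form keeps the value attached to the option.
args = parser.parse_args(["lecture.mkv", "--silence-threshold=-25dB"])
print(args.silence_threshold)

# The space-separated form fails: "-25dB" starts with "-" and is not a plain
# negative number, so argparse treats it as another option.
try:
    parser.parse_args(["lecture.mkv", "--silence-threshold", "-25dB"])
except SystemExit as exc:
    print("argparse exited with code", exc.code)
```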

Multilingual recordings

Use --secondary-language when the audio mixes two languages (e.g., a Portuguese meeting with Japanese speakers, or English narration with Spanish quotes). The script will:

  1. Preprocess once
  2. Transcribe all segments in --language (primary)
  3. Transcribe the same segments in --secondary-language
  4. Merge the two SRTs into a unified *_transcricao_merged.txt. The merge prefers the primary version, but for cues where the secondary script (kanji/hiragana, cyrillic, etc.) appears prominently in both passes (sign of foreign speech mistranscribed phonetically), it substitutes the secondary version prefixed with a tag like [JA].

Outputs in this mode:

  • *_transcricao_pt.txt / .srt — primary pass
  • *_transcricao_ja.txt / .srt — secondary pass
  • *_transcricao_merged.txt — unified, with [JA] markers

The merge logic also lives in scripts/merge_passes.py and can be called standalone:

python merge_passes.py --primary meeting_pt.srt --secondary meeting_ja.srt \
  --output merged.txt --secondary-script japanese --marker-tag JA

Supported scripts: japanese, chinese, korean, cyrillic, arabic, greek, hebrew, thai. For anything else pass a regex via --secondary-script-regex.
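The per-cue decision can be sketched as a script-ratio test. This is a simplified illustration, not the logic in merge_passes.py: the character ranges standing in for "japanese", the 0.5 threshold, the function names, and the choice to measure only the secondary cue are all assumptions.

```python
import re

# Hiragana, katakana, and CJK unified ideographs (assumed stand-in for "japanese").
JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def script_ratio(text, script):
    """Fraction of non-space characters that belong to the given script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if script.match(c)) / len(chars)


def pick_cue(primary, secondary, script=JAPANESE, tag="JA", threshold=0.5):
    """Prefer the primary cue unless the secondary one is mostly in the target script."""
    if script_ratio(secondary, script) >= threshold:
        return f"[{tag}] {secondary}"
    return primary


print(pick_cue("arigato gozaimasu", "ありがとうございます"))
print(pick_cue("bom dia a todos", "bom dia a todos"))
```

A custom pattern passed as the script argument mirrors what --secondary-script-regex offers for scripts outside the built-in list.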

Offline use

When --offline is passed (or env var WHISPER_OFFLINE=1 is set), the script:

  • Sets HF_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1 so Whisper won't touch the network
  • Pre-validates that the model is in local cache; fails fast with a clear path if it isn't

Whisper caches models in:

  • ~/.cache/whisper/<model>.pt (classic models: tiny, base, small, medium, large-v1, large-v2, large-v3)
  • ~/.cache/huggingface/hub/models--openai--whisper-large-v3-turbo/snapshots/ (turbo)
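A fail-fast check along these lines might look like the sketch below. The function names are hypothetical, and only the classic-model location from the list above is checked; the turbo snapshot directory is left out.

```python
import os
from pathlib import Path
from typing import MutableMapping, Optional


def enable_offline(env: Optional[MutableMapping] = None) -> MutableMapping:
    """Set the offline switches named above on the given environment mapping."""
    env = os.environ if env is None else env
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return env


def require_cached_model(model: str, cache_dir: Optional[Path] = None) -> Path:
    """Return the cached checkpoint path, or fail with the path that was expected."""
    cache_dir = cache_dir or Path.home() / ".cache" / "whisper"
    path = cache_dir / f"{model}.pt"
    if not path.is_file():
        raise FileNotFoundError(f"Whisper model not cached, expected {path}")
    return path
```

Calling require_cached_model("large-v2") before transcription surfaces a missing model immediately, with the expected path in the message, instead of failing later on a blocked download.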

To pre-download a model while online:

python -c "import whisper; whisper.load_model('turbo')"
# or
python -c "import whisper; whisper.load_model('large-v2')"

You can also pass --model-dir DIR to use a portable copy of the models.

Segment cache (reuse across runs)

Intermediate processing is cached per-input. The script computes a short hash from the input's path + size + mtime and uses it to name the intermediate directory (e.g., meeting_intermediate_a1b2c3d4/). On subsequent runs with the same input (e.g., changing only --secondary-initial-prompt), the extract → silence → enhance → segment pipeline is skipped and the existing segments are reused.
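The cache key can be illustrated as follows. The hash algorithm (SHA-1), the 8-character length, and the function name are assumptions; the text only says the hash is short and derived from path, size, and mtime.

```python
import hashlib
from pathlib import Path


def intermediate_dir(input_path):
    """Name the per-input cache directory from path + size + mtime."""
    src = Path(input_path).resolve()
    stat = src.stat()
    key = f"{src}|{stat.st_size}|{stat.st_mtime_ns}"
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]  # assumed algorithm/length
    return src.with_name(f"{src.stem}_intermediate_{digest}")
```

Because size and mtime are part of the key, editing or replacing the input file produces a different directory name, so stale segments are not reused by accident.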

To force a clean reprocess, pass --force-reprocess.

Combined with --keep-intermediate, this lets you iterate cheaply on prompts or experiment with alternate secondary languages without redoing the heavy ffmpeg work.

All parameters

| Parameter | Default | Description |
|---|---|---|
| input | (required) | Input file (MKV, MP4, WAV, M4A, etc.) |
| --language | Portuguese | Primary transcription language |
| --secondary-language | (none) | Secondary language for 2nd pass + auto merge |
| --model | turbo | Whisper model: tiny, base, small, medium, large-v2, turbo |
| --model-dir | (none) | Pass-through to whisper's --model_dir |
| --device | cuda | Device: cuda (GPU) or cpu |
| --initial-prompt | (none) | Context for primary pass (names, acronyms) |
| --secondary-initial-prompt | (none) | Context for secondary pass |
| --silence-duration | 2.0 | Minimum silence to remove (seconds) |
| --silence-threshold | -30dB | Volume threshold for silence detection |
| --silence-keep | 0.5 | Silence kept at each cut (stop_silence, seconds) |
| --segment-minutes | 10 | Duration of each transcription segment (minutes) |
| --listen-sr | 48000 | Sample rate of the listening *_enhanced.opus (48000/24000/16000) |
| --listen-bitrate | 64k | Opus bitrate of the listening copy |
| --listen-makeup | 5.0 | Listening compressor makeup gain (dB); raise for a very quiet speaker |
| --listen-limit | 0.6 | Listening limiter ceiling 0..1 (lower = more Opus headroom) |
| --listen-highpass | 80 | High-pass (Hz) for the listening copy |
| --listen-comp-threshold | -22dB | Listening compressor threshold |
| --listen-comp-ratio | 3.0 | Listening compressor ratio |
| --trim-listen / --no-trim-listen | off | Also (gently) remove silence from the listening copy |
| --denoise / --no-denoise | off | Denoise listening copy (afftdn, or arnndn with --denoise-model) |
| --denoise-strength | 12 | afftdn noise reduction in dB (6..24) |
| --denoise-model | (none) | Path to a .rnnn model → uses arnndn instead of afftdn |
| --skip-enhance | false | Skip audio enhancement |
| --skip-silence | false | Skip silence removal |
| --keep-intermediate | false | Keep intermediate audio files |
| --only-preprocess | false | Only preprocess, don't transcribe |
| --offline | false | Force local model; fail if not cached (also via env WHISPER_OFFLINE=1) |
| --force-reprocess | false | Ignore segment cache and reprocess from scratch |

Output files

| File | Description |
|---|---|
| *_transcricao_<lang>.txt | Plain text transcription (primary pass) |
| *_transcricao_<lang>.srt | SubRip subtitles for primary pass |
| *_transcricao_<lang2>.txt / .srt | Secondary pass (only with --secondary-language) |
| *_transcricao_merged.txt | Merged transcript with [TAG] markers (only with --secondary-language) |
| *_enhanced.opus | Compressed enhanced audio (~54 MB for 2h) |

Intermediate files are cached under <input>_intermediate_<hash>/ and removed at the end unless --keep-intermediate is used (or until --force-reprocess discards the cache).

Whisper model recommendations

| Model | VRAM | Speed | Quality | Notes |
|---|---|---|---|---|
| turbo | ~6GB | Fast | Good | Recommended default |
| large-v2 | ~10GB | Slow | Best | Fewer hallucinations than v3 |
| medium | ~5GB | Medium | OK | Fallback if VRAM limited |
| small | ~2GB | Fast | Basic | Quick drafts only |

Prefer large-v2 over large-v3 — v3 has more hallucination issues for non-English languages.

Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| "Obrigado" / "E ai" repeated | Hallucination in silence | Reduce --segment-minutes to 5 |
| 0% silence removed | Threshold too low | Use --silence-threshold="-25dB" |
| Beeping/whistling in audio | afftdn filter | Already removed from default pipeline (denoise is opt-in) |
| Voice choppy / cutting between syllables ("picotando") in *_enhanced.opus | Old: silence-removal + fast compressor + dynamic loudnorm AGC stacked on the listening copy | Fixed (lesson #9): listening copy decoupled, stable gain, no silence cut. For a very quiet speaker raise --listen-makeup (e.g. 7) |
| Listening copy still too quiet | Low-volume speaker, default makeup | Increase --listen-makeup (6–9) and/or lower --listen-comp-threshold |
| Argparse error with threshold | -30dB parsed as flag | Use --silence-threshold="-30dB" |
| Huge intermediate files | WAV format | Default behavior now auto-cleans |
| "Segmento vazio" with no real error | Old issue (Windows cp1252) | Fixed: script now logs whisper's stderr when exit≠0 or output is empty |
| "Modelo nao encontrado" in --offline | Model not pre-cached | Run python -c "import whisper; whisper.load_model('turbo')" once online |
| Mixed-language audio losing JA/RU/AR speech | Single-pass --language only sees one script | Add --secondary-language for an auto 2-pass + merge |

Install via CLI

npx skills add https://github.com/j0ruge/skills_commands_manager --skill whisper-preprocess