whisper-preprocess

name: whisper-preprocess
metadata:
  version: 1.0.0
description: "Audio preprocessing and transcription pipeline using ffmpeg + OpenAI Whisper. Use this skill whenever the user wants to transcribe audio or video files (lectures, meetings, podcasts, interviews), improve audio quality before transcription, remove silence from recordings, or mentions Whisper, speech-to-text, or transcricao. Also trigger when the user has MKV, MP4, WAV, M4A, or other media files they want converted to text. This skill handles the full pipeline: audio extraction, silence removal, voice enhancement, segmentation, and Whisper transcription. Works 100% offline."

Whisper Preprocess Pipeline

You are setting up and running an audio preprocessing + transcription pipeline. This pipeline solves four critical Whisper problems:

Hallucinations during silence — Whisper invents text like "Obrigado", "E ai", "Subscribe" during quiet segments
Hallucination cascade in long audio — after the first hallucination, Whisper repeats it for the rest of the file
Poor transcription from noisy audio — background noise reduces accuracy
Silent failures on Windows due to encoding — the whisper CLI crashes when printing accented/CJK characters in a cp1252 console, producing empty .txt files with no clear error

The pipeline runs 100% offline once the Whisper model is cached locally; it needs only ffmpeg, Python, and the downloaded model.

Critical lessons learned (from real-world testing)

These are hard-won insights from iterating on a 2h13min lecture recording and a multilingual (PT+JA) meeting. Follow them carefully:

1. NEVER use `afftdn` (FFT noise reduction)

The afftdn filter creates tonal artifacts ("musical noise") that sound like constant beeping or whistling. This destroys the audio for Whisper: every segment transcribes as garbage. Avoid lowpass=f=8000 as well, because it can interact badly with other filters.

Safe enhancement chain: highpass=f=100, acompressor, loudnorm — nothing else.

2. Remove silence BEFORE enhancement

The loudnorm filter normalizes everything to -16 LUFS, including silence. If you run silenceremove after loudnorm, it finds nothing to remove because the "silence" has been boosted. Always: extract → remove silence → enhance.

3. Silence threshold should be -30dB, not -40dB

Most recordings have background noise above -40dB. At -40dB threshold, silenceremove finds zero silence. Use -30dB as default. For noisier recordings, try -25dB.
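Lessons 1 to 3 can be sketched as three ffmpeg invocations built in Python. This is a minimal sketch, not the script's actual code: the function names, the intermediate file names, and the choice of `stop_periods=-1` are assumptions made for illustration. The thresholds and durations are the defaults quoted in this document.

```python
from pathlib import Path

# Safe enhancement chain from lesson 1: no afftdn, no lowpass.
ENHANCE_CHAIN = "highpass=f=100,acompressor,loudnorm=I=-16"


def silence_filter(threshold="-30dB", min_silence=2.0, keep=0.5):
    """Build a silenceremove filter string (hypothetical helper)."""
    return (
        "silenceremove="
        "stop_periods=-1:"                # negative value: also cut silence mid-file
        f"stop_duration={min_silence}:"   # only silences at least this long are cut
        f"stop_threshold={threshold}:"    # lesson 3: -30dB, not -40dB
        f"stop_silence={keep}:"           # silence left in place at each cut
        "detection=rms:window=0.025"
    )


def transcription_commands(src, work_dir):
    """Return the ffmpeg commands in the order extract, remove silence, enhance."""
    work = Path(work_dir)
    raw = work / "raw.wav"            # assumed intermediate names
    trimmed = work / "trimmed.wav"
    enhanced = work / "enhanced.wav"
    return [
        # 1. Extract 16 kHz mono audio, dropping any video stream.
        ["ffmpeg", "-y", "-i", str(src), "-vn", "-ac", "1", "-ar", "16000", str(raw)],
        # 2. Remove silence on the raw audio, before any loudness normalization.
        ["ffmpeg", "-y", "-i", str(raw), "-af", silence_filter(), str(trimmed)],
        # 3. Enhance. loudnorm resamples internally, so the rate is set again.
        ["ffmpeg", "-y", "-i", str(trimmed), "-af", ENHANCE_CHAIN, "-ar", "16000", str(enhanced)],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Keeping the three steps as separate commands makes the ordering from lesson 2 explicit and lets you listen to each intermediate file.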

4. Segment long audio (mandatory for >10 minutes)

Whisper hallucinates on long audio files. A 2h13min file produced only ~19 lines of real content before devolving into repeated "Obrigado" or "E ai" for the remaining 2 hours. Segmenting into 10-minute chunks solved this completely — each segment produced ~6-7K characters of accurate transcription.
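As a rough illustration of the segmentation step, the sketch below computes how many chunks a recording yields and builds a command for ffmpeg's segment muxer. The helper names and the output pattern are hypothetical; the script may split the audio differently.

```python
import math


def segment_count(total_seconds, segment_minutes=10):
    """Number of chunks needed to cover the whole recording."""
    return math.ceil(total_seconds / (segment_minutes * 60))


def segment_command(src, out_pattern, segment_minutes=10):
    """ffmpeg command that splits src into fixed-length chunks without re-encoding."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-f", "segment",
        "-segment_time", str(segment_minutes * 60),
        "-c", "copy",
        str(out_pattern),  # e.g. "segments/seg_%03d.wav" (assumed naming)
    ]


# The 2h13min lecture mentioned above, cut into 10-minute chunks:
print(segment_count(2 * 3600 + 13 * 60))  # 14
```

The last chunk is shorter than the others (13 minutes of remainder spread over the 14th file), which is harmless for Whisper.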

5. Use `--condition_on_previous_text False`

Without this, once Whisper starts hallucinating, it conditions on its own hallucinated output and the error cascades. With False, each segment starts fresh. This doesn't prevent hallucinations entirely (segmentation does that), but limits the damage.
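A per-segment Whisper invocation with these settings might be assembled as follows. The function name and defaults are assumptions for illustration; the flags themselves are options of the openai-whisper command-line tool.

```python
def whisper_command(segment, language="Portuguese", model="turbo",
                    device="cuda", initial_prompt=None, output_dir="."):
    """Build a whisper CLI call for one segment (hypothetical helper)."""
    cmd = [
        "whisper", str(segment),
        "--model", model,
        "--language", language,
        "--device", device,
        "--temperature", "0",
        # Each segment starts fresh instead of conditioning on earlier output.
        "--condition_on_previous_text", "False",
        "--output_format", "all",
        "--output_dir", str(output_dir),
    ]
    if initial_prompt:
        cmd += ["--initial_prompt", initial_prompt]
    return cmd
```

Calling `whisper_command("seg_000.wav", initial_prompt="Seminario de Teologia")` returns a list suitable for `subprocess.run`.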

6. Compress output audio to OPUS (listening copy, decoupled — see lesson #9)

Enhanced WAV files are enormous (2.6 GB for 2h). The listenable *_enhanced.opus is now encoded at 48 kHz, 64 kbps, -application audio (wideband, fuller voice). The old -application voip is narrowband telephony tuning and was dropped. Bitrate/sample-rate are configurable via --listen-bitrate / --listen-sr.

9. NEVER feed the listening copy through silence-removal + dynamic AGC (anti-"picotamento")

Symptom: the voice in *_enhanced.opus is choppy / cuts between syllables ("picotando"), worst for a low-volume or impaired (dysarthric, slow) speaker.

Cause: the listenable file used to inherit the whole transcription chain — silenceremove (cuts quiet speech) + acompressor release=50ms (pumps every syllable gap) + single-pass loudnorm (dynamic AGC that "breathes" on pauses). Three gain-modulation/chopping sources stacked.

Fix (now built in): the listening copy is decoupled from the transcription path and generated by build_listenable() from the original file:

wideband (48 kHz), no silence removal by default → continuous audio, the listener follows every pause and nothing is clipped;
stable gain only — slow-release compressor (release=400ms) + fixed makeup + true-peak alimiter with lookahead. No dynamic AGC. Verified A/B on a real recording: a 2-pass loudnorm linear=true reverted to "Dynamic" on clipping sources, so it is NOT used here.
limiter ceiling has headroom (--listen-limit 0.6) for Opus inter-sample overshoot → final true-peak stays below 0 dBFS (the old recipe actually clipped at +0.7 dBFS).
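The internals of build_listenable() are not shown in this document, so the following is only a guess at what a stable-gain chain of this kind could look like. The helper names are hypothetical. Two assumptions worth flagging: makeup gain is converted from dB to the linear factor that ffmpeg's acompressor expects, and alimiter's automatic output leveling is switched off so that the ceiling actually leaves headroom.

```python
def listen_filter(highpass_hz=80, comp_threshold="-22dB", comp_ratio=3.0,
                  makeup_db=5.0, limit=0.6):
    """Filter chain for the listening copy: no silenceremove, no loudnorm."""
    makeup_linear = 10 ** (makeup_db / 20)  # dB to linear gain factor
    return ",".join([
        f"highpass=f={highpass_hz}",
        # Slow release (400 ms) so the gain does not pump between syllables.
        f"acompressor=threshold={comp_threshold}:ratio={comp_ratio}"
        f":release=400:makeup={makeup_linear:.3f}",
        # level=0 disables auto-leveling, which would otherwise undo the ceiling.
        f"alimiter=limit={limit}:level=0",
    ])


def listen_command(src, dst, sample_rate=48000, bitrate="64k"):
    """Encode the listening copy straight from the original file."""
    return [
        "ffmpeg", "-y", "-i", str(src), "-vn",
        "-af", listen_filter(),
        "-ar", str(sample_rate),
        "-c:a", "libopus", "-b:a", bitrate, "-application", "audio",
        str(dst),
    ]
```

With the default 5 dB of makeup the compressor receives `makeup=1.778`. Raising `makeup_db` to 7 gives about 2.239, matching the advice for very quiet speakers.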

The transcription path is unchanged in spirit (16 kHz, silenceremove at -30dB, then enhance) — only made gentler: detection=rms, window=0.025, stop_silence=0.5, stop_duration=2.0, which also helps Whisper on slow speakers. Lessons #1 (no afftdn) and #3 (-30dB) still hold; denoise on the listening copy is opt-in and off by default (--denoise).

7. UTF-8 everywhere on Windows (handled automatically)

On Windows the default console encoding is cp1252. When whisper CLI prints transcription progress containing accented or CJK characters it crashes with UnicodeEncodeError, and the Python subprocess.run reader thread separately crashes with UnicodeDecodeError on bytes > 0x7F. Result: all segments come back empty with the misleading log line "Segmento vazio ou falhou". The pipeline now sets PYTHONIOENCODING=utf-8, reconfigures sys.stdout/stderr, and forces encoding="utf-8", errors="replace" on every subprocess.run — no user action needed. If you ever invoke whisper standalone on Windows, set $env:PYTHONIOENCODING="utf-8" first.
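The three measures described above can be reproduced in a few lines. This is a sketch of the general technique rather than the script's exact code, and both function names are made up for the example.

```python
import os
import subprocess
import sys


def make_console_utf8():
    """Reconfigure this process's own stdout/stderr so printing never crashes."""
    for stream in (sys.stdout, sys.stderr):
        if hasattr(stream, "reconfigure"):  # absent on some replaced streams
            stream.reconfigure(encoding="utf-8", errors="replace")


def run_utf8(cmd):
    """Run a child process with UTF-8 forced on both sides of the pipe."""
    # The child (e.g. the whisper CLI) writes UTF-8 instead of cp1252.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    return subprocess.run(
        cmd,
        capture_output=True,
        env=env,
        encoding="utf-8",   # the parent decodes what the child wrote
        errors="replace",   # a stray byte becomes U+FFFD instead of an exception
    )
```

A quick check that works on any platform is `run_utf8([sys.executable, "-c", "print('transcri\\u00e7\\u00e3o')"]).stdout`, which should contain the accented word intact.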

8. For multilingual recordings, run two passes and merge

A single --language choice loses everything spoken in another script. Run a primary pass in the dominant language, a secondary pass in the second language, and merge the SRTs choosing the secondary version where the script ratio is high. See "Multilingual recordings" below — the script does this for you with --secondary-language.

The Pipeline (6 stages)

Input (MKV/MP4/etc.)
  |
  v
[1] Extract audio → WAV 16kHz mono
  |
  v
[2] Remove silence (transcription path, on RAW 16kHz, before enhancement)
  |  silenceremove threshold=-30dB, duration=2.0s, padding=0.5s, detection=rms
  v
[3] Enhance audio (transcription path)
  |  highpass=100Hz → acompressor → loudnorm EBU R128
  |  NO afftdn, NO lowpass (they cause artifacts)
  v
[4] Segment into ~10min chunks → cached by input hash
  |
  v
[5] Transcribe each segment with Whisper
  |  temperature=0, condition_on_previous_text=False
  |  optional 2nd pass in --secondary-language + merge
  |
  |  (PARALLEL, decoupled) build_listenable() from the ORIGINAL file →
  |  48kHz, NO silence removal, stable gain (slow compressor + true-peak limiter,
  |  no dynamic AGC) → continuous, no "picotamento" — see lesson #9
  v
[6] Compress listening copy to OPUS (48k, application=audio) + cleanup
  |
  v
Output: *_transcricao_<lang>.txt/.srt, *_transcricao_merged.txt, *_enhanced.opus

Prerequisites

ffmpeg -version        # audio processing
whisper --help         # transcription (pip install -U openai-whisper)
python --version       # 3.9+

Check required ffmpeg filters:

ffmpeg -filters 2>/dev/null | grep -E "silenceremove|loudnorm|highpass|acompressor"
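On systems without grep (a plain Windows console, for example) the same check can be done from Python. The sketch below parses the text printed by `ffmpeg -filters`. The helper names are hypothetical, and the parsing assumes the usual row layout of flags, filter name, then an input/output column such as `A->A`.

```python
import subprocess

REQUIRED_FILTERS = ("silenceremove", "loudnorm", "highpass", "acompressor")


def ffmpeg_filter_listing():
    """Return the raw output of `ffmpeg -filters` (requires ffmpeg on PATH)."""
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-filters"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def missing_filters(listing, required=REQUIRED_FILTERS):
    """List the required filters that do not appear in the listing."""
    available = set()
    for line in listing.splitlines():
        parts = line.split()
        # Filter rows look like: " T.. highpass   A->A   description"
        if len(parts) >= 3 and "->" in parts[2]:
            available.add(parts[1])
    return [name for name in required if name not in available]
```

An empty result from `missing_filters(ffmpeg_filter_listing())` means the build has everything the pipeline needs.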

How to use

Copy scripts/whisper_preprocess.py to the user's working directory, then:

# Basic — full pipeline with auto-cleanup
python whisper_preprocess.py "lecture.mkv"

# With context for proper nouns and technical terms
python whisper_preprocess.py "lecture.mkv" --initial-prompt "Seminario de Teologia, professor Carlos"

# Test preprocessing only (listen before transcribing)
python whisper_preprocess.py "lecture.mkv" --only-preprocess

# Keep intermediate files for debugging / segment reuse
python whisper_preprocess.py "lecture.mkv" --keep-intermediate

# CPU only (no GPU)
python whisper_preprocess.py "lecture.mkv" --device cpu

# Better quality model (needs ~10GB VRAM)
python whisper_preprocess.py "lecture.mkv" --model large-v2

# Noisier recording — more aggressive silence detection
python whisper_preprocess.py "lecture.mkv" --silence-threshold="-25dB"

# Quiet / impaired (slow, low-volume) speaker — louder, fuller listening copy
python whisper_preprocess.py "call.m4a" --listen-makeup 7 --listen-comp-threshold="-24dB"

# Longer segments for very clean audio
python whisper_preprocess.py "lecture.mkv" --segment-minutes 15

# Multilingual (e.g., a PT + JA meeting with automatic merge)
python whisper_preprocess.py "meeting.mkv" \
  --language Portuguese --secondary-language Japanese \
  --initial-prompt "Reuniao tecnica sobre UPS, participantes Kusaba, Murata, Heitor" \
  --secondary-initial-prompt "Kusaba Murata UPS battery substation"

# Offline mode (model already downloaded, no internet)
python whisper_preprocess.py "lecture.mkv" --offline

# Force reprocessing, ignoring the segment cache
python whisper_preprocess.py "lecture.mkv" --force-reprocess

Important: when passing negative thresholds, use = syntax: --silence-threshold="-30dB" (not --silence-threshold "-30dB", which argparse interprets as a flag).
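The behavior is easy to reproduce with a two-argument parser. This is a stripped-down stand-in for the script's real parser, which has many more options.

```python
import argparse


def build_parser():
    """Minimal parser with just the input file and the silence threshold."""
    parser = argparse.ArgumentParser(prog="whisper_preprocess.py")
    parser.add_argument("input")
    parser.add_argument("--silence-threshold", default="-30dB")
    return parser


parser = build_parser()

# Works: the value is attached to the option with "=".
args = parser.parse_args(["lecture.mkv", "--silence-threshold=-25dB"])
print(args.silence_threshold)  # -25dB

# Fails: "-25dB" starts with "-" and is not a plain negative number, so
# argparse treats it as another option and exits with "expected one argument".
# parser.parse_args(["lecture.mkv", "--silence-threshold", "-25dB"])
```

A purely numeric value such as `-25` would be accepted in either form, which is why the problem only shows up with the `dB` suffix.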

Multilingual recordings

Use --secondary-language when the audio mixes two languages (e.g., a Portuguese meeting with Japanese speakers, or English narration with Spanish quotes). The script will:

Preprocess once
Transcribe all segments in --language (primary)
Transcribe the same segments in --secondary-language
Merge the two SRTs into a unified *_transcricao_merged.txt. The merge prefers the primary version, but for cues where the secondary script (kanji/hiragana, cyrillic, etc.) appears prominently in both passes (sign of foreign speech mistranscribed phonetically), it substitutes the secondary version prefixed with a tag like [JA].

Outputs in this mode:

*_transcricao_pt.txt / .srt — primary pass
*_transcricao_ja.txt / .srt — secondary pass
*_transcricao_merged.txt — unified, with [JA] markers

The merge logic also lives in scripts/merge_passes.py and can be called standalone:

python merge_passes.py --primary meeting_pt.srt --secondary meeting_ja.srt \
  --output merged.txt --secondary-script japanese --marker-tag JA

Supported scripts: japanese, chinese, korean, cyrillic, arabic, greek, hebrew, thai. For anything else pass a regex via --secondary-script-regex.
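To make the idea of a script ratio concrete, here is a minimal sketch of how a single cue could be chosen. It is not the code in merge_passes.py: the character ranges, the 0.3 threshold, and the function names are all assumptions, and the pairing of cues by timestamp is left out.

```python
import re

# Assumed character ranges; a custom regex can be substituted for other scripts.
SCRIPT_PATTERNS = {
    "japanese": r"[\u3040-\u30ff\u4e00-\u9fff]",  # hiragana, katakana, kanji
    "cyrillic": r"[\u0400-\u04ff]",
}


def script_ratio(text, pattern):
    """Fraction of non-space characters that belong to the given script."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if re.match(pattern, ch))
    return hits / len(chars)


def pick_cue(primary, secondary, pattern, tag, threshold=0.3):
    """Prefer the primary cue unless the script is prominent in both passes."""
    lowest = min(script_ratio(primary, pattern), script_ratio(secondary, pattern))
    if lowest >= threshold:
        return f"[{tag}] {secondary}"
    return primary
```

Requiring the script to appear in both passes avoids replacing ordinary Portuguese speech with whatever the Japanese pass produced for it.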

Offline use

When --offline is passed (or env var WHISPER_OFFLINE=1 is set), the script:

Sets HF_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1 so Whisper won't touch the network
Pre-validates that the model is in local cache; fails fast with a clear path if it isn't

Whisper caches models in:

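~/.cache/whisper/<model>.pt (checkpoints downloaded by openai-whisper: tiny, base, small, medium, large-v1, large-v2, large-v3; whisper.load_model('turbo') stores its checkpoint here too, as large-v3-turbo.pt)
~/.cache/huggingface/hub/models--openai--whisper-large-v3-turbo/snapshots/ (turbo, only when it was fetched through the Hugging Face Hub)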

To pre-download a model while online:

python -c "import whisper; whisper.load_model('turbo')"
# or
python -c "import whisper; whisper.load_model('large-v2')"

You can also pass --model-dir DIR to use a portable copy of the models.
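The offline behavior described in this section amounts to setting two environment variables and checking for a file before starting. The sketch below assumes the checkpoint is named `<model>.pt` inside the cache directory; model aliases whose file name differs are not handled. Both function names are hypothetical.

```python
import os
from pathlib import Path


def enable_offline(env=None):
    """Set the variables that keep Hugging Face tooling off the network."""
    env = os.environ if env is None else env
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return env


def require_cached_model(model, cache_dir=None):
    """Fail fast, naming the expected path, when the checkpoint is missing."""
    cache = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "whisper"
    checkpoint = cache / f"{model}.pt"
    if not checkpoint.is_file():
        raise FileNotFoundError(f"Whisper model not cached: expected {checkpoint}")
    return checkpoint
```

Passing the same directory as `cache_dir` and as `--model-dir` keeps the check and the actual model load pointed at the same portable copy.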

Segment cache (reuse across runs)

Intermediate processing is cached per-input. The script computes a short hash from the input's path + size + mtime and uses it to name the intermediate directory (e.g., meeting_intermediate_a1b2c3d4/). On subsequent runs with the same input (e.g., changing only --secondary-initial-prompt), the extract → silence → enhance → segment pipeline is skipped and the existing segments are reused.

To force a clean reprocess, pass --force-reprocess.

Combined with --keep-intermediate, this lets you iterate cheaply on prompts or experiment with alternate secondary languages without redoing the heavy ffmpeg work.
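A cache key of this kind can be derived as in the sketch below. The choice of SHA-1, the eight-character length, and the function names are assumptions; the document only states that path, size, and mtime feed a short hash.

```python
import hashlib
from pathlib import Path


def cache_key(input_path, length=8):
    """Short hex digest of the file's absolute path, size, and mtime."""
    path = Path(input_path).resolve()
    stat = path.stat()
    fingerprint = f"{path}|{stat.st_size}|{stat.st_mtime_ns}"
    return hashlib.sha1(fingerprint.encode("utf-8")).hexdigest()[:length]


def intermediate_dir(input_path):
    """Directory next to the input, e.g. meeting_intermediate_a1b2c3d4."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}_intermediate_{cache_key(path)}")
```

Because the file contents are never read, computing the key is instant even for multi-gigabyte recordings. The trade-off is that moving or touching the file invalidates the cache.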

All parameters

Parameter	Default	Description
`input`	(required)	Input file (MKV, MP4, WAV, M4A, etc.)
`--language`	Portuguese	Primary transcription language
`--secondary-language`	(none)	Secondary language for 2nd pass + auto merge
`--model`	turbo	Whisper model: tiny, base, small, medium, large-v2, turbo
`--model-dir`	(none)	Pass-through to whisper's `--model_dir`
`--device`	cuda	Device: cuda (GPU) or cpu
`--initial-prompt`	(none)	Context for primary pass (names, acronyms)
`--secondary-initial-prompt`	(none)	Context for secondary pass
`--silence-duration`	2.0	Minimum silence to remove (seconds)
`--silence-threshold`	-30dB	Volume threshold for silence detection
`--silence-keep`	0.5	Silence kept at each cut (stop_silence, seconds)
`--segment-minutes`	10	Duration of each transcription segment
`--listen-sr`	48000	Sample rate of the listening `*_enhanced.opus` (48000/24000/16000)
`--listen-bitrate`	64k	Opus bitrate of the listening copy
`--listen-makeup`	5.0	Listening compressor makeup gain (dB) — raise for a very quiet speaker
`--listen-limit`	0.6	Listening limiter ceiling 0..1 (lower = more Opus headroom)
`--listen-highpass`	80	High-pass (Hz) for the listening copy
`--listen-comp-threshold`	-22dB	Listening compressor threshold
`--listen-comp-ratio`	3.0	Listening compressor ratio
`--trim-listen` / `--no-trim-listen`	off	Also (gently) remove silence from the listening copy
`--denoise` / `--no-denoise`	off	Denoise listening copy (afftdn, or arnndn with `--denoise-model`)
`--denoise-strength`	12	afftdn noise reduction in dB (6..24)
`--denoise-model`	(none)	Path to a `.rnnn` model → uses arnndn instead of afftdn
`--skip-enhance`	false	Skip audio enhancement
`--skip-silence`	false	Skip silence removal
`--keep-intermediate`	false	Keep intermediate audio files
`--only-preprocess`	false	Only preprocess, don't transcribe
`--offline`	false	Force local model; fail if not cached (also via env `WHISPER_OFFLINE=1`)
`--force-reprocess`	false	Ignore segment cache and reprocess from scratch

Output files

File	Description
`*_transcricao_<lang>.txt`	Plain text transcription (primary pass)
`*_transcricao_<lang>.srt`	SubRip subtitles for primary pass
`*_transcricao_<lang2>.txt` / `.srt`	Secondary pass (only with `--secondary-language`)
`*_transcricao_merged.txt`	Merged transcript with `[TAG]` markers (only with `--secondary-language`)
`*_enhanced.opus`	Compressed enhanced audio (~54 MB for 2h)

Intermediate files are cached under <input>_intermediate_<hash>/ and removed at the end unless --keep-intermediate is used (or until --force-reprocess discards the cache).

Whisper model recommendations

Model	VRAM	Speed	Quality	Notes
turbo	~6GB	Fast	Good	Recommended default
large-v2	~10GB	Slow	Best	Fewer hallucinations than v3
medium	~5GB	Medium	OK	Fallback if VRAM limited
small	~2GB	Fast	Basic	Quick drafts only

Prefer large-v2 over large-v3 — v3 has more hallucination issues for non-English languages.

Troubleshooting

Problem	Cause	Fix
"Obrigado" / "E ai" repeated	Hallucination in silence	Reduce `--segment-minutes` to 5
0% silence removed	Threshold too low	Use `--silence-threshold="-25dB"`
Beeping/whistling in audio	`afftdn` filter	Already removed from default pipeline (denoise is opt-in)
Voice choppy / cutting between syllables ("picotando") in `*_enhanced.opus`	Old: silence-removal + fast compressor + dynamic loudnorm AGC stacked on the listening copy	Fixed (lesson #9): listening copy decoupled, stable gain, no silence cut. For a very quiet speaker raise `--listen-makeup` (e.g. 7)
Listening copy still too quiet	Low-volume speaker, default makeup	Increase `--listen-makeup` (6–9) and/or lower `--listen-comp-threshold`
Argparse error with threshold	`-30dB` parsed as flag	Use `--silence-threshold="-30dB"`
Huge intermediate files	WAV format	Default behavior now auto-cleans
"Segmento vazio" with no real error	Old issue (Windows cp1252)	Fixed — script now logs whisper's stderr when exit≠0 or output is empty
"Modelo nao encontrado" in `--offline`	Model not pre-cached	Run `python -c "import whisper; whisper.load_model('turbo')"` once online
Mixed-language audio losing JA/RU/AR speech	Single-pass `--language` only sees one script	Add `--secondary-language` for an auto 2-pass + merge