name: transcribe
description: Transcribe a video/audio file to a JSON transcript with word-level timestamps using local whisper.cpp + a GGML model. Use when you have a media file and need text + word timing for downstream subtitle burning or segment ranking.
transcribe
Local whisper.cpp transcription. No API calls.
Inputs
- input: path to a video or audio file
- out (optional): output JSON path (defaults to <input>.transcript.json)
- language (optional): ISO code, default en
Output
JSON shaped as:
{
  "source": "<input path>",
  "language": "en",
  "words": [
    {"t0": 0.42, "t1": 0.81, "w": "hello"},
    ...
  ],
  "segments": [
    {"t0": 0.42, "t1": 4.10, "text": "Hello, welcome to the show."},
    ...
  ]
}
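For downstream consumers such as subtitle burning, a minimal Python sketch of reading this shape might look like the following. It assumes t0 and t1 are seconds (the source does not state the unit); the helper name words_between and the sample values are hypothetical and only illustrate the layout.

```python
import json
from typing import Any


def words_between(transcript: dict[str, Any], start: float, end: float) -> list[str]:
    """Return words whose [t0, t1] span overlaps the window [start, end].

    Assumes t0/t1 are seconds; the unit is an assumption, not stated in the source.
    """
    return [
        word["w"]
        for word in transcript["words"]
        if word["t1"] > start and word["t0"] < end
    ]


# Inline transcript in the documented shape; the values are illustrative only.
example = json.loads("""
{
  "source": "clip.mp4",
  "language": "en",
  "words": [
    {"t0": 0.42, "t1": 0.81, "w": "hello"},
    {"t0": 0.90, "t1": 1.30, "w": "welcome"}
  ],
  "segments": [
    {"t0": 0.42, "t1": 1.30, "text": "Hello, welcome"}
  ]
}
""")

print(words_between(example, 0.0, 0.85))  # prints ['hello']
```

In practice the transcript would be loaded from the output path with json.load rather than from an inline string.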
How
- Read WHISPER_BIN and WHISPER_MODEL from .env.
- If input is video, extract 16kHz mono WAV via ffmpeg -i <in> -ac 1 -ar 16000 -f wav -.
- Pipe to whisper-cli --model "$WHISPER_MODEL" --output-json-full --no-prints -l <lang>.
- Parse whisper-cli's JSON; flatten tokens into words[], group by segment into segments[].
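The flattening step could be sketched in Python roughly as below. This is not the script's actual implementation. It assumes whisper-cli's JSON holds word-level entries in a top-level "transcription" array, each with "text" and millisecond "offsets" ("from"/"to"); those field names are an assumption and should be checked against real output. The function name flatten_words is hypothetical.

```python
from typing import Any


def flatten_words(whisper_json: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert word-level whisper entries into {"t0", "t1", "w"} records.

    Assumed layout (not confirmed by the source): entries live under
    "transcription", each with "text" and "offsets" in milliseconds.
    """
    words = []
    for entry in whisper_json.get("transcription", []):
        text = entry.get("text", "").strip()
        if not text:
            continue  # skip empty or whitespace-only entries
        offsets = entry["offsets"]
        words.append({
            "t0": offsets["from"] / 1000.0,  # milliseconds -> seconds
            "t1": offsets["to"] / 1000.0,
            "w": text,
        })
    return words


# Hand-written sample in the assumed layout, not real whisper-cli output.
sample = {
    "transcription": [
        {"offsets": {"from": 420, "to": 810}, "text": " hello"},
        {"offsets": {"from": 810, "to": 810}, "text": " "},
        {"offsets": {"from": 900, "to": 1300}, "text": " welcome"},
    ]
}

print(flatten_words(sample))
```

Stripping the text drops the leading space that the sample entries carry, and whitespace-only entries are skipped so they do not become empty words.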
Run
.claude/skills/transcribe/transcribe.sh <input> [out.json] [lang]
Idempotent: skips work if the output file is newer than the input. Uses --max-len 1 --split-on-word to get word-level segments, then groups the words into sentence segments, closing a segment at a word ending in ., !, or ?, or after 18 words.
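The freshness check and the sentence grouping described above can be sketched in Python as follows. This is an illustration under assumptions, not the contents of transcribe.sh: the function names are hypothetical, "newer" is read as a strictly later modification time, and a segment's text is taken to be its words joined with single spaces.

```python
import os

SENTENCE_END = (".", "!", "?")
MAX_WORDS = 18  # word cap per segment, from the rule above


def is_up_to_date(input_path: str, out_path: str) -> bool:
    """True when out_path exists and was modified after input_path."""
    return (
        os.path.exists(out_path)
        and os.path.getmtime(out_path) > os.path.getmtime(input_path)
    )


def _close(chunk: list[dict]) -> dict:
    """Build one segment record from a non-empty run of word records."""
    return {
        "t0": chunk[0]["t0"],
        "t1": chunk[-1]["t1"],
        "text": " ".join(word["w"] for word in chunk),
    }


def group_segments(words: list[dict]) -> list[dict]:
    """Group word records into segments on sentence punctuation or the word cap."""
    segments, current = [], []
    for word in words:
        current.append(word)
        if word["w"].endswith(SENTENCE_END) or len(current) >= MAX_WORDS:
            segments.append(_close(current))
            current = []
    if current:  # flush trailing words that never hit a boundary
        segments.append(_close(current))
    return segments


demo = [
    {"t0": 0.0, "t1": 0.5, "w": "Hello,"},
    {"t0": 0.5, "t1": 1.0, "w": "world."},
    {"t0": 1.0, "t1": 1.5, "w": "Bye"},
]
print([segment["text"] for segment in group_segments(demo)])  # prints ['Hello, world.', 'Bye']
```

A caller would check is_up_to_date(input_path, out_path) first and return early when it is true, which is what makes repeated runs cheap.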