name: dk-transcribe-summarize
description: >
Transcribe audio files or YouTube videos via subtitles or OpenRouter's
multimodal API, generate 100-word and 400-word summaries using cloud
models (OpenRouter) with local LLM fallback (LM Studio), and output
the results as PDF, HTML, and Markdown files. Supports hybrid
subtitle-first transcription, custom prompts, and automatic fallback
across multiple models.
Triggers on speech-to-text, audio transcription, YouTube download,
or creating transcript documents with summaries.
license: MIT
compatibility: >
Requires: ffmpeg, uv, Node.js (for yt-dlp JS challenge solving),
Brave browser with YouTube login cookies,
LM Studio (lms CLI) with a loaded model (fallback summarization only),
OPENROUTER_API_KEY env var for transcription and cloud summarization.
Python deps (auto-resolved by uv): requests, yt-dlp, fpdf2, mistune,
webvtt-py, langdetect, mlx-whisper.
metadata:
author: dushyantkhosla
model-default: gemma-4-e4b-it
output-dir: /Users/dush/Code/transcribed//
formats: pdf, html, md
Audio Transcribe & Summarize (PDF, HTML & Markdown)
Overview
Modular Python CLI tool (scripts/) that:
- For YouTube URLs, tries subtitles first (hybrid mode) — downloads auto-captions, validates quality (WPM, repeated phrases, language detection), and uses them if good; falls back to audio transcription otherwise
- Downloads audio from YouTube (or accepts a local mp3/m4a/wav file) when subtitles unavailable or
--method audio - Transcribes audio via OpenRouter's multimodal API (
xiaomi/mimo-v2.5) using theinput_audiocontent type - Automatically chunks long audio (>10 min) into 10-min segments compressed to 16 kbps mono MP3 for reliable cloud transcription
- Falls back through transcription models (MiMo → Gemini Flash Lite → Gemini Flash → mlx-whisper local), each with 1 retry (each attempt has 5 internal retries with linear backoff)
- Generates ~100-word and ~400-word summaries — cloud-first via OpenRouter (same model cascade), falling back to local LLM via LM Studio (default:
gemma-4-e4b-it) only if all cloud models fail or API key is unset - Supports custom prompts (
--prompt) for single-summary mode (action items, key takeaways, etc.) - Verifies summaries are free of chain-of-thought contamination; retries with a strict prompt if detected
- Writes three formats — PDF, HTML, and Markdown — to
transcribed/<YYYY-MM-DD>/. PDF and HTML contain summaries only; Markdown includes the full transcript - Output filenames use
{channel}-{title}format when channel metadata is available
When to Use
- Transcribing audio files (mp3, m4a, wav) to text
- Downloading and transcribing YouTube videos
- Generating summary documents from audio content
- Creating PDF/HTML/Markdown reports with transcripts and summaries
- Working offline for summarization (local LLM, no API cost)
Do NOT use for:
- Real-time transcription (batch/offline only)
- Videos longer than ~2 hours (API payload limits)
- Speaker diarization (no speaker labels)
Quick Reference
| Step | Command |
|---|---|
| Run (default) | cd scripts && uv run transcribe_pdf.py |
| Run with URL | cd scripts && uv run transcribe_pdf.py "https://youtube.com/..." |
| Subtitles only | cd scripts && uv run transcribe_pdf.py --method subtitles |
| Audio only | cd scripts && uv run transcribe_pdf.py --method audio |
| Custom prompt | cd scripts && uv run transcribe_pdf.py "..." --prompt "Extract action items" |
| Input | Paste local path or YouTube URL at prompt (if no CLI arg) |
| Output | transcribed/<YYYY-MM-DD>/<channel>-<title>.{pdf,html,md} |
| API key | OPENROUTER_API_KEY env var (for transcription and cloud summarization) |
| Local model | Override with LOCAL_MODEL_NAME env var (default: gemma-4-e4b-it) |
| Prerequisites | ffmpeg, uv, lms (LM Studio CLI), Node.js |
Script
Reusable tool: scripts/transcribe_pdf.py
cd /path/to/skill
uv run scripts/transcribe_pdf.py
Coding Standards
All Python scripts MUST use PEP 723 inline metadata
Every executable Python script in this skill must:
- Start with
#!/usr/bin/env -S uv run --scriptshebang - Include a
# /// scriptblock withrequires-pythonanddependencieslist - Be runnable via
uv run <script.py>without any pre-existing virtual environment
This ensures zero configuration — uv creates an isolated environment automatically.
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "requests>=2.32",
# "yt-dlp",
# "fpdf2>=2.8",
# "mistune>=3",
# "webvtt-py>=0.5",
# "langdetect>=1.0",
# "mlx-whisper>=0.4",
# ]
# ///
Non-executable modules (imported by scripts) do not need the shebang or script block.
Dependencies (PEP 723 inline metadata in transcribe_pdf.py):
requests>=2.32— HTTP client for OpenRouter and LM Studio APIsyt-dlp— YouTube audio download and subtitle extractionfpdf2>=2.8— PDF generationmistune>=3— Markdown-to-HTML conversion for HTML outputwebvtt-py>=0.5— VTT subtitle parsinglangdetect>=1.0— Language detection for subtitle quality checksmlx-whisper>=0.4— Local transcription fallback (Apple Silicon only)
How It Works
Phase 1: Input & Validation
- validate_environment() — checks
OPENROUTER_API_KEY,ffmpeg,uv, andlmsare available - parse_args() or prompt_user() — accepts URL/file path,
--method(hybrid/subtitles/audio), and optional--prompt
Phase 2: Transcription (Hybrid Pipeline)
For YouTube URLs with --method hybrid (default) or --method subtitles:
- get_subtitles() — downloads auto-generated English subtitles via
yt-dlp(VTT format) - check_subtitle_quality() — validates subtitle quality:
- Word-per-minute rate (min 50 WPM)
- Repeated phrase detection (flags garbled captions)
- Language detection (must be English)
- Minimum word count for long videos (500+ words for >10min)
- If quality is "good" → parse_vtt() extracts text, skip to Phase 3
- If quality is poor or subtitles unavailable → fall through to audio transcription
For --method audio or when subtitles fail:
- download_audio() — downloads best audio via
yt-dlpwith:- Brave browser cookies (for YouTube auth)
- JS runtime (Node.js) + EJS remote component (for n-challenge solving)
- FFmpeg post-processing to .m4a
- transcribe_with_fallback() — tries models in order, each with 1 retry (each attempt has 5 internal retries with linear backoff):
- MiMo (
xiaomi/mimo-v2.5) — cheapest, default - Gemini Flash Lite (
google/gemini-2.5-flash-lite) — mid-tier - Gemini Flash (
google/gemini-2.5-flash) — most capable - mlx-whisper (local, Apple Silicon) — zero cost, offline
- Audio >10 min is auto-chunked into 10-min segments (16 kbps mono MP3)
- If all models fail → raises
TranscriptionFailedwith detailed error
- MiMo (
Phase 3: Summarization (Cloud-First)
- summarize() → summarize_with_fallback() — generates summaries (cloud-first, local fallback):
- Tries OpenRouter models in order: MiMo → Gemini Flash Lite → Gemini Flash
- Each model has 1 retry with 5 internal retries per attempt
- If all cloud models fail or
OPENROUTER_API_KEYis unset → falls back to LM Studio - ensure_lmstudio_ready() — starts LM Studio server if needed, loads model (
gemma-4-e4b-it) - local_chat() — calls LM Studio's OpenAI-compatible API
- Special handling: Qwen models get
enable_thinking: falseto reduce CoT contamination
- verify_summary() — checks each summary for CoT contamination:
- Scans for trigger phrases:
thinking process,analyze the,drafting,iterative,step-by-step,let me think,here's my reasoning - If detected → retries with strict prompt ("Output ONLY the summary text. No preamble, no analysis...")
- If still contaminated → strips
<thinking>,<reasoning>XML tags and numbered analysis lists via_strip_thinking()
- Scans for trigger phrases:
Phase 4: Output & Cleanup
- write_pdf(), write_html(), write_markdown() — writes three formats to
transcribed/<YYYY-MM-DD>/:- Filename:
{channel}-{title}.{ext}(or just{title}.{ext}if no channel) - PDF and HTML: title page + 100-word summary + 400-word summary (no transcript)
- Markdown: metadata table + source info + summaries + full transcript
- Filename:
- For custom prompts (
--prompt):- summarize_custom() — generates single summary with custom instruction
- write_custom_output() — writes PDF/HTML/Markdown with single summary section
- Warns if transcript >5000 words (context window risk)
- Cleanup — unload_lmstudio_model() frees GPU memory, removes temp directories
Verification
After each summary is generated, verify_summary() runs a three-stage pipeline:
- Check — scans for CoT trigger phrases:
thinking process,analyze the,drafting,iterative,step-by-step,let me think,here's my reasoning - Retry — if contamination detected, re-generates with a strict prompt: "Output ONLY the summary text. No preamble, no analysis, no thinking, no numbered steps, no meta-commentary."
- Strip — if retry is still contaminated, runs
_strip_thinking()as a safety net:- Removes
<thinking>...</thinking>and<reasoning>...</reasoning>XML blocks - Removes
Thinking:/Reasoning:/Thought:line headers - Removes numbered analysis list items (
1.,2., etc.)
- Removes
External Dependencies
The script is not fully self-contained. Here is every external dependency:
Runtime & CLI Tools
| Dependency | Why | Install |
|---|---|---|
uv |
Runs the Python script (PEP 723 inline metadata) | curl -LsSf https://astral.sh/uv/install.sh | sh |
ffmpeg |
Audio compression + yt-dlp post-processing | brew install ffmpeg / apt install ffmpeg |
lms (LM Studio CLI) |
Start/stop/load/unload local LLM | Download LM Studio from https://lmstudio.ai |
node |
JavaScript runtime for yt-dlp n-challenge solving (already installed) | Pre-installed via nvm |
Python Packages (auto-resolved by uv run --script)
| Package | Version | Used For |
|---|---|---|
requests |
≥2.32 | HTTP client for OpenRouter + LM Studio APIs |
yt-dlp |
latest | YouTube audio download, subtitle extraction, cookie/JS support |
mistune |
≥3.0 | Markdown-to-HTML conversion for the HTML output file |
fpdf2 |
≥2.8 | PDF generation |
webvtt-py |
≥0.5 | VTT subtitle parsing |
langdetect |
≥1.0 | Language detection for subtitle quality checks |
mlx-whisper |
≥0.4 | Local transcription fallback (Apple Silicon only) |
Environment Variables
| Variable | Required | Purpose |
|---|---|---|
OPENROUTER_API_KEY |
✅ Yes | Authentication for OpenRouter transcription and cloud summarization APIs |
LOCAL_MODEL_NAME |
❌ Optional | Override default summarization model (default: gemma-4-e4b-it) |
Running Services
| Service | Port | Why |
|---|---|---|
| LM Studio server | localhost:1234 |
Serves the local LLM via OpenAI-compatible API |
| Brave browser (with YouTube cookies) | N/A | Provides authenticated YouTube session for yt-dlp |
Network Access
| Endpoint | Purpose |
|---|---|
https://openrouter.ai/api/v1/chat/completions |
Audio transcription + cloud summarization (MiMo, Gemini Flash Lite/Flash) |
https://www.youtube.com |
Video page + audio download |
https://github.com/yt-dlp/ejs/releases/... |
EJS challenge solver script (auto-downloaded) |
http://localhost:1234/v1/chat/completions |
Local summarization via LM Studio |
File System
| Path | Purpose |
|---|---|
/Users/dush/Code/transcribed/<YYYY-MM-DD>/ |
Output directory for PDF, HTML, Markdown (auto-created) |
| Temporary directories | yt-dlp download cache, compressed audio (cleaned up) |
Note: The only dependencies
uvresolves automatically are the seven Python packages listed above. Everything else — system tools, running services, browser cookies, API keys — must be present on the machine before running.
Configuration
| Setting | Default | Notes |
|---|---|---|
| Transcription model | xiaomi/mimo-v2.5 |
On OpenRouter. Change MODEL_NAME in config.py |
| Local summarization model | gemma-4-e4b-it |
Override via LOCAL_MODEL_NAME env var |
| LM Studio URL | http://localhost:1234/v1 |
Change LMSTUDIO_URL in config.py |
| Audio compression (single file) | 8 kbps mono MP3 | 16 kHz sample rate, compressed via ffmpeg |
| Audio compression (chunks) | 16 kbps mono MP3 | 10-min chunks for cloud API |
| Max chunk duration | 600s (10 min) | Change MAX_CHUNK_SECONDS in config.py |
| Max tokens (transcription) | 10,000 | Covers ~60-90 min of speech per chunk |
| Output directory | /Users/dush/Code/transcribed/<YYYY-MM-DD>/ |
Change OUTPUT_BASE in config.py |
| Transcription fallback | MiMo → Gemini Flash Lite → Gemini Flash → mlx-whisper | Change FALLBACK_MODELS in config.py |
| Summarization fallback | MiMo → Gemini Flash Lite → Gemini Flash → LM Studio | Change SUMMARIZATION_MODELS in config.py |
| Whisper model path | ~/.lmstudio/models/mlx-community/whisper-large-v3-turbo |
Change WHISPER_MODEL_PATH in config.py |
| Subtitle min WPM | 50 | Change SUBTITLE_MIN_WPM in config.py |
| Max prompt length | 2000 chars | Change MAX_PROMPT_LENGTH in config.py |
Common Mistakes
| Mistake | Fix |
|---|---|
OPENROUTER_API_KEY not set |
Export it in your shell env |
ffmpeg not found |
brew install ffmpeg or apt install ffmpeg |
lms not found |
Install LM Studio from https://lmstudio.ai |
| 400 error "exceeds 8 MB" | Video too long; auto-chunking splits into 10-min segments |
| yt-dlp JS runtime warning | Install Deno: brew install deno |
| Empty transcription | Check audio isn't silent or corrupted |
| Summary has thinking noise | Script auto-detects, retries with strict prompt, then strips; try a different model via LOCAL_MODEL_NAME |
| whisper fails / not found | mlx-whisper requires Apple Silicon. If on Intel/Linux, the script exits after cloud models with a clear error. |
| Subtitles garbled or sparse | Quality checks catch WPM <50, repeated phrases, wrong language. Use --method audio to skip subtitles. |
| Custom summary truncated | Transcript >5000 words may exceed local model context. Use default 100/400-word pipeline for long content. |