name: transcript-fetcher
description: >-
Use when the user wants a clean plain-text transcript of a video URL
(YouTube, Vimeo, X.com/Twitter incl. Broadcasts/Spaces) or a Skool
classroom lesson. X uses embedded captions when present and falls
back to ASR (MacWhisper/Whisper/whisper.cpp/cloud) automatically.
Skool cookies are optional (public communities work without; private
ones accept Netscape cookies.txt). Manual->auto language fallback,
rolling caption dedup, >> speaker turns preserved, JSON stat sidecar
plus optional .description.md sidecar.
tier: 2
version: 1.2
status: active
changelog: >-
v1.2 — Add X.com (Twitter) adapter: native status video + Broadcasts/
Spaces. Captions-first (subtitles/automatic_captions via yt-dlp) with
automatic ASR fallback through a pluggable backend chain (MacWhisper
mw -> Whisper CLI -> whisper.cpp -> opt-in OpenAI/compatible cloud).
ffmpeg is required for the ASR path on HLS sources (Broadcasts/Spaces) —
used to extract a clean audio-only m4a; the skill fails fast (exit 7)
when it is absent there. New stat field transcript_origin,
exit code 7 (MissingDependency), --debug stage logging, --asr-*
flags. Skill-local .env config (secrets-safe, 0600) +
install_components.py component installer. transcript-fetcher is now a
universal extraction layer (any future yt-dlp platform reuses the X
pipeline).
v1.1 — Add Vimeo + Skool adapters; opt-in --with-description /
--description-only producing a .description.md (YAML front-
matter + Markdown body) for YouTube & Skool. Skool delegates
embedded YouTube/Vimeo to those adapters automatically and uses
cookies.txt for private/paid communities; public ones work without.
v1.0 — Initial release — YouTube adapter, ru/ru-orig/en fallback
ladder, offline tests.
Transcript Fetcher
Purpose: Given a video URL, produce a clean plain-text transcript
ready for downstream summarization or analysis. Single responsibility:
fetch + clean. Composes well with summarizing-meetings (run this
first, then feed the resulting .txt to that skill).
1. Red Flags (Anti-Rationalization)
STOP and READ THIS if you are thinking:
- "I'll just paste the URL into the model and ask it to transcribe" -> WRONG. Models do not have audio access and will hallucinate. Always run
fetch.pyand read back the resulting.txt. - "Manual
rufailed, I'll just useenauto-translation, no warning needed" -> WRONG. Auto-translated English subtitles destroy idioms, names, and technical terms. Thequality_flag: english_auto_translationin the stat MUST be surfaced to the user. - "I'll skip writing the JSON stat sidecar, the .txt is enough" -> WRONG. The sidecar records WHICH track was picked. Without it, downstream cannot tell whether the transcript is high-quality manual subs or low-quality auto-translation.
- "The transcript has weird
>>markers, I'll strip them" -> WRONG. Those are speaker-turn boundaries. Removing them collapses multi-speaker meetings into a single voice and ruins downstream attribution. - "Auto-generated Russian (
ru-orig) is garbage, I'll prefereninstead" -> WRONG.ru-origis the actual Russian audio transcribed;ru(without-orig) is often an English auto-translation back to Russian.ru-orig>ru>en. - "I'll add
ffmpegto the pip deps just in case" -> WRONG. The caption path (WebVTT parsing) is pure Python and needs nothing extra;ffmpegis a soft-optional external system tool, never a pip dependency. It IS genuinely required for the X ASR path on HLS sources (Broadcasts/Spaces) — yt-dlp uses it to extract a clean audio-onlym4a, and the skill fails fast (exit 7) when it is absent there — but it is detected at runtime (install_components.py), not bundled. Do not pull heavy packages intorequirements.txt.
2. Capabilities
- Fetch YouTube and Vimeo captions via
yt-dlp(no audio download). - Fetch X.com / Twitter — native status video AND Broadcasts/Spaces.
The X provider is captions-first: it reuses embedded
subtitles/automatic_captionswhen present, and only when none exist does it download the smallest media and transcribe via ASR. Fully automatic — no mode switch. ASR runs through a pluggable backend chain: MacWhisper (mw) → Whisper CLI → whisper.cpp → opt-in OpenAI/compatible cloud.ffmpegis required for the X ASR path on HLS sources (Broadcasts/Spaces): with it the smallest media is extracted to a clean audio-onlym4a; without it the skill fails fast (exit 7) because yt-dlp's no-ffmpeg HLS output is not a valid container the ASR engine can open. (For non-HLS progressive media, or when embedded captions exist, ffmpeg is not needed.) - Fetch Skool lesson pages via a stdlib HTML scrape — public
communities work without auth; private/paid ones accept an optional
Netscape
cookies.txt. Then delegate embedded YouTube/Vimeo videos to those adapters; capture author-supplied transcript field when present. - Fall back through a configurable ladder (default for ru: manual ru -> auto ru-orig -> auto ru -> auto en).
- Clean WebVTT to plain text: strip timestamps, inline timing tags, and rolling-caption overlap; decode HTML entities.
- Preserve
>>speaker-turn markers as paragraph breaks. - Emit a JSON stat sidecar (chosen track, char count, speaker-turn count, quality flag, plus optional title/uploader/duration metadata). For X media it also records
transcript_origin(embedded-captions|macwhisper|whisper-cli|whisper-cpp|openai-api) so downstream skills know HOW the text was produced. - Optionally write
<out>.description.md(YAML frontmatter + Markdown body) when--with-descriptionis passed — gives you a ready-to-ingest description for RAG / Obsidian / human review. - Batch mode for processing multiple URLs from a text file.
- Source-agnostic architecture: each platform is one file under
scripts/sources/. The yt-dlp + ASR pipeline is shared (sources/_ytdlp_media.py+asr/), so a future TikTok/Twitch/Vimeo-ASR provider is one new file + one host entry — no pipeline changes. Zoom and podcast slots remain reserved. - Configurable + secrets-safe: a skill-local
.env(seescripts/.env.example) externalises every endpoint, model, and tool path. The cloud ASR endpoint works with any OpenAI-compatible server (Groq, self-hosted whisper). A.envholding an API key is refused unlesschmod 600(and not a symlink); the key is sent only in an HTTP header, never on argv or in logs.
3. Execution Mode
- Mode:
script-first - Why this mode: Fetching captions, parsing WebVTT, deduplicating rolling captions, and applying the fallback ladder are deterministic operations with > 5 lines of business logic. The CLI is the contract; SKILL.md is orchestration.
4. Script Contract
- Install (one-time): creates the venv + yt-dlp, then reports which optional ASR components are present:
bash skills/transcript-fetcher/scripts/install.sh # Optional ASR engines (for caption-less X media) — detect / install: ./scripts/.venv/bin/python scripts/install_components.py # status report ./scripts/.venv/bin/python scripts/install_components.py --install-whisper # pip openai-whisper into the venv ./scripts/.venv/bin/python scripts/install_components.py --system --run # brew/apt ffmpeg + whisper.cpp - Single URL:
Optional flags:cd skills/transcript-fetcher ./scripts/.venv/bin/python scripts/fetch.py <URL> --out <path/to/output.txt>--lang ru(default),--prefer manual|auto(defaultmanual),--with-description,--description-only,--cookies-file PATH,--json-errors,--debug(stage logging to stderr),--asr-allow-cloud(opt-in cloud ASR),--asr-model <id>,--asr-timeout-sec N,--max-duration-min N(X: transcribe only the first N minutes — clips the download),--auth-map PATH(per-host cookies),--cookies-from-browser BROWSER(X: load cookies from a local browser via yt-dlp). - X.com / Twitter (captions-first, automatic ASR fallback; no mode switch). A Broadcast/Space usually has no captions → ASR via the first available local backend (MacWhisper, etc.):
cd skills/transcript-fetcher ./scripts/.venv/bin/python scripts/fetch.py \ "https://x.com/i/broadcasts/<id>" \ --out broadcast.txt --with-description --debug # → broadcast.txt + .stat.json (source="x", transcript_origin="macwhisper", # chosen_track_kind="asr"). A status video WITH captions skips ASR # (transcript_origin="embedded-captions"). Use --cookies-file for # protected/age-gated media. - Skool lesson (cookies needed ONLY for private / paid communities — public ones work without):
./scripts/.venv/bin/python scripts/fetch.py \ "https://www.skool.com/<community>/classroom/<id>?md=<lesson-id>" \ --out lesson.txt --with-description \ --cookies-file ~/.config/skool-cookies.txt - Batch:
./scripts/.venv/bin/python scripts/fetch.py --batch urls.txt --out-dir transcripts/ - Inputs: A YouTube / Vimeo / Skool-lesson URL (or a file with one URL per line). Empty lines and
#comments in the batch file are ignored. - Outputs:
<out>.txt— clean plain text, UTF-8.<out>.txt.stat.json— sidecar with the chosen track, quality flag, plus optional title / uploader / upload_date / duration_sec / embed_source / embed_url metadata.<out>.description.md— only when--with-descriptionis passed; YAML frontmatter + Markdown body.- One JSON stat record per URL on stdout.
- Failure semantics: Non-zero exit. With
--json-errors, stderr carries a single JSON line{v, error, code, type, details?}. Exit codes:2usage error (incl. malformed Skool URL, cookies file path missing on disk),3no transcript producible (no caption track in the ladder AND — for X — ASR produced nothing / every available backend failed),4partial batch failure,5source-auth error (HTTP 401/403 — private Skool community needs cookies, X protected/suspended/age-gated media, or supplied cookies expired),6source rate-limit (HTTP 429),7missing dependency (yt-dlp absent, ffmpeg required-but-absent, or no ASR backend available for caption-less media —details.remediationcarries the hint),1unexpected. - Idempotency: Re-running overwrites the output file and sidecar. yt-dlp itself caches nothing the skill depends on; behaviour is reproducible given network availability.
- Dry-run support: Not currently exposed as a flag. Inspect the fallback ladder via
_build_ladderif needed.
5. Safety Boundaries
- Allowed scope: Reads from the network (yt-dlp HTTPS to YouTube/Vimeo; stdlib HTTPS to Skool). Writes ONLY to the user-specified
--out/--out-dirpath plus a.stat.jsonsidecar (and optionally a.description.mdsidecar) next to it. - Default exclusions: Never downloads the video itself (only
--skip-download+ subtitle tracks /--write-info-jsonfor metadata). Never writes to any path the user did not explicitly pass via--out/--out-dir. - Destructive actions: None. The script does not delete or modify any pre-existing files outside the chosen output paths. In batch mode, output path collisions are handled per
--on-collision={error,skip,suffix}(default:error). - Optional artifacts: The JSON stat sidecar is mandatory in single-URL mode (it is the audit trail for which track was used). The
.description.mdsidecar is written only when--with-descriptionis passed. - URL allowlist: Source dispatch is hostname-based against an explicit allowlist:
- YouTube —
youtu.be,youtube.com,m.youtube.com,music.youtube.com,youtube-nocookie.com(pluswww.variants). - Vimeo —
vimeo.com,www.vimeo.com,player.vimeo.com. - X / Twitter —
x.com,www.x.com,mobile.x.com,twitter.com,www.twitter.com,mobile.twitter.com(status…/status/<id>and…/i/broadcasts/<id>). - Skool —
skool.com,www.skool.com,app.skool.com; additionally URLs must match/<community>/classroom/<id>?md=<lesson-id>. Landing //about//calendarpages are rejected. URLs that merely contain a supported host as a substring elsewhere are rejected.
- YouTube —
- ASR backends (external, optional): For caption-less X media the skill shells out (argv arrays, never a shell string) to whichever local engine is present — MacWhisper
mw, Whisper CLI, or whisper.cpp. None is a pip dependency; all are probed at runtime. ffmpeg (also external) is required to turn an X Broadcast's HLS stream into a valid audio file; the skill fails fast with exit 7 (clear remediation, before any large download) when ffmpeg is absent for an HLS source.bash scripts/install.shreports which engines are available;scripts/install_components.pyguides/installs them (incl. ffmpeg). If no ASR backend is available (and cloud is not opted in), the run also fails cleanly with exit 7, never a traceback. - Cloud ASR egress (opt-in only): The OpenAI/compatible cloud backend is used only with
--asr-allow-cloud(orTRANSCRIPT_FETCHER_ASR_ALLOW_CLOUD=1) AND an API key present. When used, the audio leaves the machine to the configured endpoint — disclosed here and in the stat notes. Local backends are always tried first; cloud is the last resort. - Secrets: The API key is read from
OPENAI_API_KEY/TRANSCRIPT_FETCHER_OPENAI_API_KEYor a skill-local.env. A.envis loaded only at the CLI entry point and is refused if it is a symlink or notchmod 600(group/world-readable). The key is sent only in an HTTPAuthorizationheader — never on a command line, never logged..envis git-ignored; onlyscripts/.env.example(placeholders) is committed. - Per-host cookies (
~/.transcript-fetcher/): cookies for auth-walled media resolve (after an explicit--cookies-file) from a skill-local home folder — mirrors thehtmlskill's~/.html. Anauth-map.json(--auth-map/TRANSCRIPT_FETCHER_AUTH_MAP/~/.transcript-fetcher/auth-map.json) maps a host to its{cookies_file}, or the convention~/.transcript-fetcher/<host>-cookies.txtis used. Host match is label-boundary (a keyx.commatchesx.com/*.x.com, neverevil-x.com); auth-map and convention files are hardened (symlink-reject +0600). The resolved Netscape cookies.txt feeds yt-dlp's--cookies(and Skool's opener).--cookies-from-browser BROWSERloads cookies straight from a local browser via yt-dlp (opt-in — reads the browser's cookie store). - Temp-file hygiene: For X media, all intermediates (audio, VTT, info.json,
.part,.m3u8, fragments) live under one tempdir removed in afinallyblock even on error — nothing is left behind. - Auth credentials:
--cookies-file <path>accepts a Netscapecookies.txtand is ALWAYS OPTIONAL for every source. The file is read once at startup, never copied or re-emitted. For Skool, public communities (e.g.zero-one) serve lessons without auth; private / paid communities respond with HTTP 401/403 and the user then needs to supply cookies. YouTube/Vimeo optionally forward the file to yt-dlp's--cookiesfor age-gated or unlisted videos. The skill never blocks on missing cookies up-front — it tries the fetch and surfaces aSourceAuthError(exit 5) only if the source returns 401/403.
6. Validation Evidence
- Local verification:
All offline tests must pass without network. The end-to-end network test is gated behindcd skills/transcript-fetcher ./scripts/.venv/bin/python -m unittest discover -s scripts/testsTRANSCRIPT_FETCHER_E2E=1. - Skill-validator (structural):
python3 .claude/skills/skill-creator/scripts/validate_skill.py skills/transcript-fetcher - Skill-validator (security):
python3 .claude/skills/skill-validator/scripts/validate.py skills/transcript-fetcher - Expected evidence: Both validators exit 0; unittest reports
OK.
7. Instructions
Step 1: Verify environment
If scripts/.venv/ does not exist, run bash scripts/install.sh first.
The install script is idempotent — safe to re-run.
Step 2: Choose mode
| Input | Mode | Command form |
|---|---|---|
| Single URL | single |
fetch.py <URL> --out path.txt |
| List of URLs in a file | batch |
fetch.py --batch urls.txt --out-dir dir/ |
Step 3: Pick a fallback strategy
Default is --lang ru --prefer manual. This tries:
manual:ru— user-uploaded Russian subtitles (highest quality).auto:ru-orig— YouTube auto-captions of the original Russian audio (good).auto:ru— YouTube auto-translation TO Russian (often noisy if speech was in another language).auto:en— English auto-captions as last resort (will setquality_flag = english_auto_translation).
For non-Russian content, pass --lang en (or another ISO code). For
non-Russian languages the lang-orig step is skipped — it is a YouTube
quirk that mainly matters for non-English speech.
Step 4: Run the CLI
Capture stdout (it carries the JSON stat). Read the stat to confirm which track was used:
./scripts/.venv/bin/python scripts/fetch.py \
"https://youtu.be/NSVTpCfBMK8" \
--out /tmp/talk.txt
# stdout: {"source":"youtube","url":"...","chosen_track_kind":"auto","chosen_track_lang":"ru-orig", ...}
Step 5: Inspect quality
Open the generated <out>.txt.stat.json. If quality_flag is set
(currently only "english_auto_translation"), surface a warning to
the user before passing the transcript to a downstream summarizer:
⚠️ TRANSCRIPT QUALITY: only English auto-translation was available for this URL. Idioms, proper names, and technical terms may be distorted. Consider asking the user for a manual transcription.
Step 6: Hand off
The clean .txt is now ready for summarizing-meetings or any other
downstream consumer. Pass the path; do not paste the contents inline
(transcripts are often large).
8. Workflows
- [ ] Verify scripts/.venv/ exists (run install.sh otherwise)
- [ ] Decide single vs batch mode
- [ ] Run fetch.py with the chosen language and preference
- [ ] Read the JSON stat sidecar
- [ ] Surface quality_flag warning if set
- [ ] Hand .txt path to downstream skill
9. Best Practices & Anti-Patterns
| DO THIS | DO NOT DO THIS |
|---|---|
| Always read the stat sidecar after fetching | Trust the .txt without checking which track was used |
Prefer ru-orig over ru for Russian content |
Pick the first track that returns text |
| Pass batch URLs through a file | Loop the CLI shell-side with arbitrary URLs |
Surface quality_flag in any user-visible output |
Silently downgrade to English auto-translation |
Use --json-errors in CI/automation pipelines |
Parse free-form stderr |
Rationalization Table
| Agent Excuse | Reality / Counter-Argument |
|---|---|
"yt-dlp is on $PATH so I can just call it" |
The skill invokes python -m yt_dlp from the per-skill venv. The system yt-dlp may be a different version with different output. |
"The >> markers are clutter" |
They are paragraph breaks for speaker turns. Downstream summarizers attribute statements by them. |
| "Rolling-caption dedup is overkill, just keep the longest cue" | The dedup IS keeping the longest cue. Without it, the same sentence would appear 3-4 times. |
| "I'll add a Vimeo adapter inline in fetch.py" | Add it as scripts/sources/vimeo.py. Each source is its own file. |
10. Examples
See examples/:
example_input_url.txt— batch input format.example_output_plain.txt— what a cleaned transcript looks like (excerpt).example_output_stat.json— what the stat sidecar contains.
11. Resources
scripts/fetch.py— CLI entry point.scripts/sources/youtube.py— YouTube adapter (yt-dlp orchestration + fallback ladder + description path).scripts/sources/vimeo.py— minimal Vimeo adapter (yt-dlp).scripts/sources/x.py— X.com / Twitter adapter (captions-first → ASR; theXTranscriptProvider).scripts/sources/_ytdlp_media.py— shared yt-dlp plumbing (metadata probe, caption inspection, audio-minimal download, failure classifier) reused by X and any future yt-dlp source.scripts/sources/_log.py— debug-only stage logger (stderr, gated on--debug).scripts/sources/_auth.py—~/.transcript-fetcher/per-host cookie resolution (auth-map + convention, hardened; mirrors thehtmlskill's~/.html).scripts/asr/— pluggable ASR backend package:_base.py(theASRBackendinterface),macwhisper.py,whisper_cli.py,whisper_cpp.py,openai_api.py(opt-in cloud),__init__.py(priority registry + fallback chain).scripts/_config.py— skill-local.envloader (secrets-safe) + typed config accessors (endpoints/models/tool paths).scripts/.env.example— config/secret template (copy to.env,chmod 600).scripts/install_components.py— detect / guide / install the optional ASR components.scripts/sources/skool.py— Skool lesson adapter (cookies.txt + Next.js scrape + embed delegation).scripts/sources/_vtt_to_text.py— pure-Python WebVTT cleaner.scripts/sources/_stat.py— sharedTranscriptStat+ sidecar writer + error classes.scripts/sources/_description.py—.description.mdwriter (YAML frontmatter + Markdown body).scripts/sources/_cookies.py— Netscape cookies.txt loader + authenticated opener.scripts/sources/_prosemirror.py— ProseMirror/TipTap v2 JSON → Markdown for Skool lesson bodies.scripts/install.sh— venv bootstrap.scripts/requirements.txt— pinned deps (single source of truth for the yt-dlp version range).scripts/tests/— offline unit tests + opt-in E2E network test.scripts/tests/_sanitize_fixture.py— utility for scrubbing PII from Skool HTML snapshots before they become fixtures.references/youtube_caption_format.md— what>>,>, rolling captions, andru-origactually mean.references/fallback_policy.md— the language ladder and why it is in this order.references/supported_sources.md— current and planned source slots.references/skool_adapter.md— Skool auth flow, schema notes, embed delegation rules.references/description_metadata.md—.description.mdformat for YouTube and Skool.docs/Manuals/transcript-fetcher_manual.md— user-facing manual with quick reference, troubleshooting, and composition recipes.
12. Composition
- Composes well with
summarizing-meetings: runtranscript-fetcherfirst to get a clean.txt, then pass that file tosummarizing-meetingsfor a structured Markdown summary. The two skills are intentionally separate — fetching is a deterministic file operation; summarizing is a prompt-first reasoning task.