name: video-summary description: Summarize a video (URL or local path) by pulling captions first, falling back to audio-only transcription, and optionally extracting frames for visual summaries. Use when the user pastes a YouTube/Vimeo/X/etc URL, points at a local video file, or asks to summarize, watch, or extract takeaways from a video. Triggers on "summarize this video", "what's in this video", "watch this", URLs ending in /watch, /shorts, common video extensions (.mp4/.mov/.mkv/.webm). allowed-tools: - Bash(bash ~/.claude/skills/video-summary/bin/run.sh *) - Bash(bash ~/.claude/skills/video-summary/setup.sh *)
video-summary
Local-first video summarizer. For URLs, yt-dlp first pulls metadata and
captions without media download. If captions are missing or unusable, it
downloads audio only and transcribes on-device with whisperkit-cli. Full video
download and frame extraction happen only with --with-frames.
No external API, no Python venv. macOS / Apple Silicon only.
Step 0: setup check (silent on success)
Skip if you already ran the script once this session.
test -d ~/.cache/whisperkit/models/argmaxinc/whisperkit-coreml/openai_whisper-large-v3-v20240930 \
&& command -v ffmpeg >/dev/null \
&& command -v yt-dlp >/dev/null \
&& command -v whisperkit-cli >/dev/null \
&& command -v jq >/dev/null \
&& echo OK
If output is OK, proceed. Otherwise tell the user:
Setup is incomplete. Run
bash ~/.claude/skills/video-summary/setup.shonce. First-time model download is ~626MB (1-10 min depending on connection).
Do not run setup.sh yourself. The user reviews and confirms.
Step 1: parse the user input
Separate the video source from any specific question.
summarize https://youtu.be/abc-> source = URL, question = nonehttps://youtu.be/abc what's the main argument?-> source = URL, question = "what's the main argument?"~/Movies/talk.mp4-> source = path, question = none
Step 2: run the skill
bash ~/.claude/skills/video-summary/bin/run.sh "<source>"
run.sh orchestrates watch.sh (metadata/captions -> audio fallback ->
optional frames -> transcript) and summarize.sh (haiku CLI subprocess that
writes the structured summary). Frame image tokens are used only when
--with-frames is passed.
Optional flags forwarded to watch.sh:
--with-framesdownload full video and extract frames; off by default--max-frames Nsoft cap on frames; default 100. Applies only with--with-frames--height Hcaps full-video download resolution and frame scale (default 480p). Applies only with--with-frames--start T/--end Tfocus on a section--fps Foverride auto-fps (cap 2)--language LANGforce whisperkit language (en,pt,es, ...)--refreshwipe the whole work dir (download, frames, transcript, report, summary)--refresh-downloadre-download (cascades to frames + transcript + report + summary)--refresh-framesre-extract frames (cascades to report + summary)--refresh-transcriptre-transcribe (cascades to report + summary)
Unreadable frames
Frames are disabled by default. When --with-frames is used, frames default to
480p. Do NOT lower this default - past runs at 240p caused VLM OCR failures
(couldn't read code/slides/captions).
The summarizer subagent flags unreadable frames in its output via a
FRAMES_UNREADABLE: line, and run.sh echoes the same signal to stderr:
[video-summary] FRAMES_UNREADABLE: <path1>, <path2>, ...
[video-summary] re-run with --refresh --height 720 (or 1080) to escalate
When you see this signal, do NOT silently accept the summary. Tell the user which frames were unreadable and offer to re-run at higher resolution:
bash ~/.claude/skills/video-summary/bin/run.sh "<source>" --with-frames --refresh-frames --height 720
Higher tiers: 720 -> 1080. Network cost roughly 2-3x per step. The cached video
is invalidated by --refresh and re-downloaded at the new cap.
Local flag handled by run.sh:
--refresh-summaryignore the summary cache only (re-run the haiku pass; useful after editingvideo-summary-prompt.md)
Resume points
Pipeline is composable. Each stage is gated by its own artifact under
~/.cache/video-summary/<id>/:
| Stage | Artifact | Refresh flag |
|---|---|---|
| download | download.json + download/ |
--refresh-download |
| frames | frames.tsv + frames/ |
--refresh-frames |
| transcribe | transcript.json |
--refresh-transcript |
| summarize | summaries/summary-*.md |
--refresh-summary |
Cascade rule: each --refresh-* wipes its own stage and everything downstream.
Re-running on the same source replays cached stages instantly. Delete any
artifact directly (e.g. rm transcript.json) to redo that stage without passing
a flag.
Decision tree (pick the matching flag for the user's intent):
- "Re-summarize, transcript is fine" ->
--refresh-summary - You edited
video-summary-prompt.md->--refresh-summary - "Transcript is bad, re-transcribe and re-summarize" ->
--refresh-transcript - "Need visual context" ->
--with-frames - "Frames are unreadable" ->
--with-frames --refresh-frames --height 720 - "Re-download at higher resolution" ->
--with-frames --refresh-download --height 720 - "Wipe everything" ->
--refresh
Stdout of run.sh is the clean structured summary (TLDR / Verdict / Summary /
Key moments / Caveats / Pacing) plus ## Processing strategy and
## Run metrics blocks. The FRAMES_UNREADABLE: and WORK_DIR: subagent
lines are stripped before output.
The work dir path is in the footer of pre-summary-context.md earlier in the
pipeline. To recover it for follow-ups, parse Work dir: from that file or use
the cache root ~/.cache/video-summary/<id>/ where <id> is derived from the
source URL/path.
The legacy alternative if claude CLI is unavailable: use the Agent tool with
subagent_type: general-purpose and model: haiku, and pass the contents of
video-summary-prompt.md plus pre-summary-context.md as the prompt body.
Artifacts
After a complete run, ~/.cache/video-summary/<id>/ contains reusable stages
for both text-only and --with-frames runs:
download/video.mp4- source video. Present only with--with-frames.download/video.info.json- full yt-dlp metadata dump.download/video.<lang>.vtt- manual or auto captions, when available.download/audio.<ext>- audio-only fallback when captions are missing or unusable.download/thumbnail.<ext>- thumbnail image, when yt-dlp provides one.download.json- manifest with paths + title/channel/chapters/description.frames/frame_NNNN.jpg- extracted frames at the chosen height/fps. Present only with--with-frames.frames.tsv-<seconds>\t<absolute path>per frame. Use this to map a frame to its timestamp - do NOT estimate fromframe_NNNNindices.audio.wav- 16kHz mono PCM. Only present for whisper fallback input.transcript.json-[{start, end, text}]. Canonical transcript for the video. Read directly for transcript follow-ups.pre-summary-context.md- assembled markdown consumed by the haiku subagent. It includes frame paths only with--with-frames.summaries/summary-<signature>.md- final structured output for each option set.summary.md- copy of the latest emitted summary for convenience.download.metrics.json/frames.metrics.json/transcribe.metrics.json- per-stage{seconds, cached, ...}.metrics.json- merged per-stage view forrun.sh.claude_metrics.json- haiku subprocess cost / tokens / duration.frames_unreadable.txt- sidecar; present only when the subagent flagged unreadable frames on the last run.
Step 3: handle follow-ups
The user may ask follow-ups about the same video:
- "What was on screen at 2:15?" Re-run with
--with-framesif frames are absent. If frames exist, read the relevant<work_dir>/frames/frame_NNNN.jpgdirectly. Do not re-dispatch the subagent for one frame. - "Re-summarize focused on the demo section" Re-run
run.shwith--start/--end; pipeline cache hit on the same source reuses cached media. - "Check the transcript around MM:SS" Read
<work_dir>/transcript.jsondirectly.
If the user asks about a different video, run run.sh on the new source.
Working dirs are independent per video ID.
Step 4: cleanup
Cache lives at ~/.cache/video-summary/. Do not delete entries proactively -
they are the cache. The user can rm -rf ~/.cache/video-summary/<id> themselves
when they want to drop one.
Failure modes
- Setup incomplete -> point to
setup.sh. Do not run it yourself. - yt-dlp fails on a URL -> the script prints the install age. If
14 days, suggest
brew upgrade yt-dlpand ask the user to retry. - whisperkit-cli fails -> hard fail. Surface the error verbatim. Do not fall back to frames-only.
- No captions and no audio track -> ffmpeg audio extract fails; same hard-fail path.
- Subagent returns malformed output -> surface what it returned and ask the user whether to retry, fall back to direct (you read frames), or accept as-is.
Bundled scripts
bin/run.shcache-aware orchestrator (entry point)bin/watch.shpipeline (download -> transcribe -> optional frames), cache-by-idbin/summarize.shstdin report → claude -p haiku → stdout summarybin/download.shyt-dlp wrapper / local path resolverbin/extract-frames.pyffprobe + ffmpeg, tiered fps curvebin/transcribe.shwhisperkit-cli wrapper (the swap point)bin/transcribe.pyWebVTT parser (system python3, no venv)video-summary-prompt.mdhaiku subprocess prompt templatesetup.shone-time install + model pre-download