talking-head-video-editor - SKILL.md Agent Skill

name: talking-head-video-editor
description: Director's guide for editing talking-head footage into high-pacing 9:16 short-form social videos (Reels, Shorts, TikToks) when the input is an A-roll, supporting assets, and a transcript. Use this skill whenever the user wants to cut a talking-head clip into a vertical short, edit an interview into a reel, turn podcast footage into a TikTok, add B-roll/assets to an existing talking-head video, or otherwise produce short-form social from a person speaking on camera. Produces an EDL (edit decision list) for human review, then hands off to the hyperframes skill (required) for rendering. Use even when the user does not explicitly say "EDL" or "short-form"; if the input shape is talking-head + assets + transcript, this is the skill.
requires:
  - hyperframes

Talking-Head Video Editor

Prerequisite — load the /hyperframes skill before doing anything else. This skill is layered on top of hyperframes and assumes its full context is in scope. If the hyperframes skill has not been loaded yet, load it now (read ~/.claude/skills/hyperframes/SKILL.md and follow its loading instructions) before continuing.

Read BOTH of these files before touching any files:

learnings_from_rev_runs.md — editorial and workflow decisions: asset relevance rules, face-zone measurement, zone allocation (captions vs CTAs vs face), design direction habits, communication patterns.

hyperframes_learnings.md — hyperframes framework technical rules: video nesting constraints, required attributes on every <video>, caption inlining requirement, GSAP plugin availability, beat boundary FP safety, stock B-roll re-encoding, render mechanics, frame-check checklist.

Skimming both once at the start of every job is how you avoid repeating mistakes already corrected in prior runs.

You are the director. Hyperframes is the cinematographer.

This skill is domain knowledge about how high-pacing 9:16 short-form social edits feel when the raw input is a person talking on camera plus a folder of assets. It is not a build spec for hyperframes — hyperframes already knows how to compose, animate, and render. Your job is to decide what the cut should be and produce an EDL that captures intent. Hyperframes reads the EDL and figures out execution.

The shape of the work

Inputs (require all three):

A-roll — the talking-head video file.
Assets — a folder of images and/or video clips that illustrate what's being said.
Transcript — timestamped, ideally word-level or phrase-level. If missing, ask the user to generate one (hyperframes has a transcribe path) before continuing.

Output:

An EDL describing each beat of the edit, anchored to the transcript.
Human review (markdown table view) and iteration until approved.
Handoff to hyperframes with EDL + design direction. Captions get rendered on top of the cut as part of the final video.

The transcript is the spine

Every decision in the edit hangs off the transcript. The transcript tells you:

Where the beats are — phrase boundaries, pauses, emphasis points, punchlines.
What's being said when — so an asset can land as the speaker mentions it.
How long each beat lasts — so pacing isn't guesswork.

Segment the transcript into phrase-level beats, not word-level and not whole-second chunks. A beat is roughly "one idea the speaker is delivering." Beats are usually 1.5–4 seconds. Cuts and layout changes happen on beat boundaries. This is the single biggest thing that separates good short-form from amateur short-form.
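
A minimal sketch of the segmentation idea, assuming a word-level transcript shaped as a list of {"text", "start", "end"} dicts. That shape, the function name, and the pause threshold are illustrative assumptions, not the transcript format hyperframes produces. It closes a beat on punctuation or a pause once the beat is at least 1.5 seconds long, and forces a break at 4 seconds:

```python
from dataclasses import dataclass


@dataclass
class Beat:
    start: float
    end: float
    text: str


def segment_beats(words, min_len=1.5, max_len=4.0, pause=0.35):
    """Group word-level timestamps into phrase-level beats.

    `words` is a list of {"text", "start", "end"} dicts (hypothetical shape).
    A beat closes on punctuation or a pause once it is at least `min_len`
    seconds long, or unconditionally once it reaches `max_len` seconds.
    """
    beats, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        duration = word["end"] - current[0]["start"]
        nxt = words[i + 1] if i + 1 < len(words) else None
        gap = (nxt["start"] - word["end"]) if nxt else float("inf")
        ends_phrase = word["text"].rstrip().endswith((".", "?", "!", ",", ";", ":"))
        natural_break = (ends_phrase or gap >= pause) and duration >= min_len
        if nxt is None or natural_break or duration >= max_len:
            text = " ".join(w["text"] for w in current)
            beats.append(Beat(current[0]["start"], word["end"], text))
            current = []
    return beats


words = [
    {"text": "Most", "start": 0.0, "end": 0.3},
    {"text": "people", "start": 0.3, "end": 0.7},
    {"text": "get", "start": 0.7, "end": 0.9},
    {"text": "this", "start": 0.9, "end": 1.2},
    {"text": "wrong.", "start": 1.2, "end": 1.8},
    {"text": "Here's", "start": 2.3, "end": 2.6},
    {"text": "why.", "start": 2.6, "end": 3.0},
]
for beat in segment_beats(words):
    print(f"{beat.start:.2f}-{beat.end:.2f}  {beat.text}")
```

The last beat can come out shorter than min_len, and a heuristic like this only proposes boundaries. Read the result against the transcript and merge or split beats by ear.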

What "high-pacing" actually means

Not "fast cuts." Constant micro-variation. A scrolling viewer leaves the moment things feel static or predictable. So:

Something is always changing. A new layout, a new asset, a subtle motion, a graphic punch-in.
Variation matters more than speed. Five fast cuts that all look the same are still boring. One thoughtful layout swap on the right beat beats four mechanical cuts.
A held shot longer than ~3 seconds needs internal motion. Static + held = dead frame. Add a slow zoom, a parallax drift, a graphic appearing — something.
Cut on idea boundaries, not the clock. If a beat ends mid-thought, you'll feel it.

Layout examples

These are common patterns in good short-form. They are examples to think with, not a closed menu — hyperframes can compose other layouts when a beat calls for it.

Understanding "split" vs "overlay"

Split means A-roll is cropped and repositioned. The cropped portion must contain the face.

Use detect_face_zones.py to get exact face coordinates
Crop centered on the detected face bbox or center point
Example: face detected at bbox [350, 500, 380, 520] (x, y, width, height) with center [540, 760]
- Crop region might be y 280-1240 (960px tall, centered on y=760)
- Adjust crop dimensions based on actual face position and composition needs
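
The crop arithmetic can be sketched as below. The function name and the default 1080x1920 frame with a 960px crop are assumptions for illustration; detect_face_zones.py only supplies the face coordinates, and the crop window is your call:

```python
def face_crop_rows(center_y, frame_height=1920, crop_height=960):
    """Return (top, bottom) pixel rows of a crop centered on the face.

    The window is clamped so it never extends past the frame edges.
    """
    top = center_y - crop_height // 2
    top = max(0, min(top, frame_height - crop_height))
    return top, top + crop_height


print(face_crop_rows(760))    # (280, 1240): centered on the face at y=760
print(face_crop_rows(300))    # (0, 960): face near the top, window clamped
print(face_crop_rows(1800))   # (960, 1920): face near the bottom, window clamped
```

When the window is clamped the face is no longer vertically centered in the crop, so check the extracted frame before committing to a split layout on that beat.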

Overlay means A-roll stays full-frame (no crop), and the asset overlays on top in a safe zone (calculated from detected face bbox to avoid covering the face).

Layout patterns

A-roll full-frame. The whole 9:16 is the speaker, no assets visible. Use when the line is emotional, a punchline, a hard claim, or when the face is the information. Resist decorating every beat — let the face breathe on the moments that matter.
A-roll bottom + asset top, ~50/50 vertical split. A-roll is cropped (face-aware crop, ~960px height) and positioned in the bottom half. Asset fills the top half (~960px). Use when an asset directly illustrates the line being spoken (a chart, a screenshot, a still) and you need the asset large and readable while keeping the speaker visible.
Asset over A-roll (overlay). A-roll stays full-frame (no crop). Asset overlays as a card, chip, or graphic in a safe zone (above or below face). Can be small (icons, chips) or large (cards covering 40-50% of frame height). Use when showing the face matters — emotional moments, reactions, body language — and you want visual context without cropping away the speaker.
Stacked landscape strips — 2 or 3 wide assets stacked vertically. Use when multiple landscape clips/images belong to the same idea: three product shots, a before/middle/after sequence, three reaction clips. Fills 9:16 cleanly without letterboxing landscape footage. Can be overlaid on A-roll or replace it entirely depending on whether the face matters for that beat.
Asset full-frame. A-roll steps away entirely; the asset is the moment. Use for hero product shots, a clip the speaker is reacting to, or a visual that needs the whole frame to land.
Motion graphic over A-roll. A-roll stays full-frame; a designed motion graphic — from a simple label to a diagram, infographic, or asset-based composition — overlays in safe zones. Use when illustrating abstract concepts, emphasizing specific words/numbers, or adding context while keeping the speaker visible.

When choosing, the asset's aspect and content are hints: a wide screenshot pairs naturally with a split or overlay layout; a vertical phone screenshot might want full-frame or side-by-side; a hero product shot often wants full frame. These are starting points, not rules.

Motion in assets

Static assets read as dead frames. Subtle motion on assets is the default, not optional. The intent is "keep the brain engaged" — examples include slow zoom in or out, slow pan, parallax drift, a small rotation. Spec the intent in the EDL ("slow zoom in", "drift left") and let hyperframes pick the exact curve and amount.

A-roll itself is usually static — the viewer is reading the speaker's face, and camera moves on a talking head feel weird. Reserve dramatic motion for emphasis beats, not as a default.

Legibility and layering

Whatever ends up on screen has to be readable and unobstructed.

Assets are sized to be seen. If the viewer can't make out what an asset shows, it's not doing any work — it's noise. A screenshot of dense text, a chart with small labels, a phone UI — these need to be sized large enough to actually read on a phone screen. If a beat needs the asset readable but the layout makes it tiny, change the layout (go full-frame, drop the A-roll, simplify the asset).
Logical sizing. Asset size in the frame should match how much attention it deserves on that beat. The thing the viewer is meant to look at is the largest thing.
Don't overlap competing elements. Motion graphics, on-screen text, and captions all fight for the same eye. Don't stack a stat callout on top of a caption line. Don't drop a label where the caption strip lives. When two elements want the same screen real estate, decide which one matters more for that beat and let the other step out.

Captions

Captions are part of the deliverable — short-form social is watched on mute by default. The hyperframes skill handles caption rendering (see its references/captions.md); this skill's job is to make sure the EDL leaves room for them and to install the caption style the project will use.

Installing a caption style

Caption styles are hyperframes registry blocks — each one ships as its own component and must be installed into the project before it can be used. Install exactly one style per project (mixing styles inside a single short reads as inconsistent).

Default — if the user doesn't specify a style, install Editorial Emphasis. It's the most neutral and reads well on talking-head:

npx hyperframes add caption-editorial-emphasis

All available caption styles (run the one that matches the requested vibe):

npx hyperframes add caption-clip-wipe
npx hyperframes add caption-editorial-emphasis   # default
npx hyperframes add caption-emoji-pop
npx hyperframes add caption-glitch-rgb
npx hyperframes add caption-gradient-fill
npx hyperframes add caption-highlight
npx hyperframes add caption-kinetic-slam
npx hyperframes add caption-matrix-decode
npx hyperframes add caption-neon-accent
npx hyperframes add caption-neon-glow
npx hyperframes add caption-parallax-layers
npx hyperframes add caption-particle-burst
npx hyperframes add caption-pill-karaoke
npx hyperframes add caption-texture-lava
npx hyperframes add caption-weight-shift

Pick a style that matches the energy of the cut: emphatic/editorial pieces want editorial-emphasis or highlight; high-energy/punchy reels want kinetic-slam, emoji-pop, or particle-burst; techy/futuristic vibes want glitch-rgb, matrix-decode, or neon-glow; a karaoke-style word-by-word reveal wants pill-karaoke. When unsure, use the default (editorial-emphasis).

Caption zone in the EDL

Reserve a caption zone in every beat. Typically a band near the lower-middle of the frame; whichever zone you pick, keep it consistent across the video.
Don't place assets, motion graphics, or labels inside the caption zone. This is the most common overlap mistake.
Account for the zone when picking layouts. If a beat is "asset full-frame," the caption still has to land somewhere readable on top of it — make sure that's possible.
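
One way to catch caption-zone collisions while drafting is a plain rectangle-overlap check. This is a sketch: the (x, y, width, height) rectangles, the element names, and the example caption band are all hypothetical, not values hyperframes defines:

```python
def rects_overlap(a, b):
    """True when two (x, y, width, height) rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def caption_collisions(caption_zone, elements):
    """Names of on-screen elements that intrude into the caption zone."""
    return [name for name, rect in elements.items()
            if rects_overlap(caption_zone, rect)]


# Hypothetical caption band on a 1080x1920 frame.
caption_zone = (90, 1300, 900, 220)
beat_elements = {
    "stat_card": (140, 1400, 800, 300),   # dips into the caption band
    "logo_chip": (60, 120, 240, 90),      # top-left corner, clear of it
}
print(caption_collisions(caption_zone, beat_elements))   # ['stat_card']
```

Run the check for every beat with the same caption_zone, since the zone is meant to stay fixed across the video.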

Motion graphics — the third tool

Use a motion graphic whenever the idea the speaker is delivering has a shape that can be shown. The job is to make the abstract concrete — show a relationship, give a number a stage, draw a contrast, visualize a connection.

Motion graphics work with A-roll and assets, not as substitutes for them. An asset can be a layer inside a motion graphic — a photo with a designed overlay and animation is a motion graphic. A-roll with an animated diagram on top is a motion graphic. The composition can include all three simultaneously.

Form follows content. Match the graphic form to what the content actually is:

A contrast between two things → spatial split, balance, or two-column layout
A number or stat → large display type, animated counter, or gauge — not a chip
A relationship or connection → diagram, route visualization, drawn edge
A sequence or list → items that build on screen one by one as each is spoken
A product or place → asset as the base layer, with labels, specs, or motion composited on top

Spec only the intent in the EDL ("show how A connects to B", "emphasize the stat on this line"). Don't pre-decide the execution — let hyperframes find the form.

The EDL

The EDL is a brief, not a build sheet. It captures roughly what the video should look like at each beat. Each row corresponds to a transcript span and says:

time — start/end (anchored to transcript)
text — the transcript phrase
what's on screen — describe the composable layers for this beat: A-roll (full-frame or cropped), any assets (as base layer, card, or strip), and any motion graphic elements (overlaid or composited on top). A beat can have all three simultaneously.
layout — the rough composition (one of the patterns above, or a new one with a short description)
motion — what's moving and roughly how
intent — one short line: why this beat is cut this way ("punch in on the stat", "let her land the joke", "graph slides in under the line about growth")

Format: JSON. See references/edl-schema.md for the schema and a worked example.

For human review, render the EDL as a markdown table — never dump raw JSON at the user. Columns: time, text, on-screen, layout, motion, intent. Iterate with the user until they approve.
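
A sketch of the review-table rendering, assuming each beat is a dict with start_s, end_s, text, on_screen, layout, motion, and intent keys. Those key names are assumptions for illustration; the real field names live in references/edl-schema.md:

```python
COLUMNS = ["time", "text", "on-screen", "layout", "motion", "intent"]


def edl_to_markdown(beats):
    """Render a list of beat dicts as a markdown table for human review."""

    def cell(value):
        # Escape pipes and flatten newlines so one beat stays on one row.
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join([" --- "] * len(COLUMNS)) + "|",
    ]
    for beat in beats:
        row = [
            f"{beat['start_s']:.2f}-{beat['end_s']:.2f}",
            beat["text"],
            beat["on_screen"],
            beat["layout"],
            beat["motion"],
            beat["intent"],
        ]
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)


beats = [
    {
        "start_s": 0.0,
        "end_s": 1.8,
        "text": "Most people get this wrong.",
        "on_screen": "A-roll full-frame",
        "layout": "A-roll full-frame",
        "motion": "none",
        "intent": "let the hook land",
    },
]
print(edl_to_markdown(beats))
```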

Workflow

1. Initialize the edit folder.
```
bash scripts/setup_edit.sh <aroll.mp4> <transcript.json> <edit_folder_path>
```
Creates subdirectories, initializes hyperframes (avoids the my-video/ subfolder trap), and copies the A-roll and transcript as real files. Requires hyperframes on PATH.

1b. Probe the A-roll.

bash scripts/probe_aroll.sh <edit_folder>/aroll.mp4

Gets resolution, aspect, duration, fps. Situational awareness — no layout decisions follow directly.

1c. Assess the face zone. Use automated face detection to get exact coordinates instead of manual inspection.

Tool available: scripts/detect_face_zones.py (requires: pip install -r scripts/requirements.txt)

Returns JSON with face bounding boxes:

{
  "image": "frame.png",
  "resolution": [1080, 1920],
  "faces": [{"index": 0, "bbox": [x, y, width, height], "center": [cx, cy], "score": 0.95}]
}

How to use (you decide when/how):

# Extract frame at whatever timestamp makes sense
ffmpeg -y -ss 10 -i <edit_folder>/aroll.mp4 -frames:v 1 <edit_folder>/drafts/frame.png

# Detect faces - JSON to stdout
# NOTE: Use scripts/.venv for Python dependencies (root .venv may have broken opencv)
cd ~/.claude/skills/talking-head-video-editor/scripts
source .venv/bin/activate
# After the cd above, <edit_folder> must be an absolute path
python3 detect_face_zones.py <edit_folder>/drafts/frame.png

# Interpret coordinates and determine safe zones for overlays/captions/CTAs
# If speaker moves, sample multiple timestamps to see the range

For face-aware crops: Use bbox or center coordinates to crop around the face.
For overlays: Calculate safe zones from bbox to avoid covering the face.
Optional visualization: Add --visualize -o annotated.png to see bounding boxes drawn on the image.
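
Turning the detector's bbox into overlay safe zones can be sketched as follows. The helper names and the 40px margin are assumptions; only the [x, y, width, height] bbox layout comes from the JSON shown above. When the speaker moves, merge the boxes from several sampled frames first:

```python
def union_bbox(bboxes):
    """Smallest [x, y, width, height] box covering every sampled face box."""
    left = min(b[0] for b in bboxes)
    top = min(b[1] for b in bboxes)
    right = max(b[0] + b[2] for b in bboxes)
    bottom = max(b[1] + b[3] for b in bboxes)
    return [left, top, right - left, bottom - top]


def safe_zones(bbox, frame_height=1920, margin=40):
    """Vertical bands (top_row, bottom_row) above and below the face.

    `margin` keeps overlays from touching the hairline or chin.
    """
    _, y, _, h = bbox
    above = (0, max(0, y - margin))
    below = (min(frame_height, y + h + margin), frame_height)
    return {"above": above, "below": below}


# Face boxes from two sampled timestamps (second one is made up for the example).
samples = [[350, 500, 380, 520], [400, 540, 380, 520]]
face_range = union_bbox(samples)
print(face_range)               # [350, 500, 430, 560]
print(safe_zones(face_range))   # {'above': (0, 460), 'below': (1100, 1920)}
```

The bands only say where the face is not. Captions and CTAs still have to share whatever space they leave.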

2. Confirm inputs and establish design direction. Check that the A-roll, assets source folder, and transcript are all in place. If the user has color/font/vibe preferences, record them; otherwise defer to hyperframes' visual identity gate when rendering. Lock the palette before writing any HTML.
3. Prepare assets.
```
bash scripts/prepare_assets.sh <selected_assets_dir> <edit_folder_path>
```
Prerequisite: curate which assets belong in <selected_assets_dir> first — do not pass a raw download folder. The script re-encodes long-GOP videos (prevents frozen frames in renders), copies images unchanged, and writes <edit_folder>/assets/inventory.json.
4. Inventory the assets. Open assets/inventory.json. Fill in what_it_shows for each entry. Note candidate beats in the transcript where each asset could land.
5. Segment the transcript into beats. Phrase-level. This is the EDL's row count.
6. Draft the EDL. Walk beat by beat, deciding layout / asset / motion / intent. Keep the variation principle in mind: don't repeat the same look for 3+ consecutive beats.
7. Show the markdown table to the user. Iterate until approved.

7b. Validate EDL timing.

python3 scripts/validate_edl.py edl.json

Run immediately after approval, before touching any HTML. Uses Python Decimal (exact base-10 arithmetic) to compute FP-safe durations and writes edl_validated.json. Use safe_duration (not raw end_s - start_s) for all data-duration values in HTML — never compute these by hand.
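
To see why hand-computed durations are risky, here is a small illustration of the Decimal approach. It is a sketch of the idea, not the contents of validate_edl.py; the timing_problems helper and its contiguity check are assumptions added for the example:

```python
from decimal import Decimal


def safe_duration(start_s, end_s):
    """Exact base-10 duration; str() keeps 0.1 as 0.1, not its binary float."""
    return Decimal(str(end_s)) - Decimal(str(start_s))


def timing_problems(beats):
    """Report gaps or overlaps between consecutive beats (hypothetical check)."""
    problems = []
    for prev, cur in zip(beats, beats[1:]):
        gap = Decimal(str(cur["start_s"])) - Decimal(str(prev["end_s"]))
        if gap != 0:
            kind = "gap" if gap > 0 else "overlap"
            problems.append(
                f"{kind} of {abs(gap)}s before beat starting at {cur['start_s']}"
            )
    return problems


print(0.3 - 0.1)                  # 0.19999999999999998 (binary float noise)
print(safe_duration(0.1, 0.3))    # 0.2
print(safe_duration(2.1, 4.35))   # 2.25

beats = [
    {"start_s": 0.0, "end_s": 1.8},
    {"start_s": 1.8, "end_s": 4.35},
    {"start_s": 4.4, "end_s": 6.0},
]
print(timing_problems(beats))     # ['gap of 0.05s before beat starting at 4.4']
```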

8. Hand off to hyperframes. Read the hyperframes SKILL.md, pass it edl_validated.json plus the design direction, and let it build the composition. The EDL is the source of truth for structure and intent; hyperframes owns the rendering.
9. Render in two passes: draft, then final. Render a draft, watch it, extract frames at critical beats with an ad-hoc ffmpeg -y -ss <t> -i draft.mp4 -frames:v 1 -update 1 check.png, fix what is actually broken, then do the final render. See references/draft-render-sanity-check.md.
10. Captions. Pick a caption style (default: caption-editorial-emphasis) and install it into the project with npx hyperframes add caption-<style> (see the Captions section above for the full list). Captions are rendered on top of the cut by hyperframes; confirm the caption zone in the EDL doesn't collide with any beat's on-screen elements before approving.

Anti-patterns

These hurt short-form social specifically. Avoid them.

Same layout held for 5+ seconds. Even if the speaker is good, the visual pulse dies.
Assets stacked into every line. Viewer has nowhere to rest. Let the face breathe on emotional beats.
Cutting away from the face during the punchline or emotional peak. Trust the speaker.
Motion graphics used as wallpaper. They should add information, not fill silence.
Beats aligned to round seconds instead of transcript phrasing. It will feel mechanical.
Static assets with no motion. Reads as dead frame; viewer scrolls.
Designing the cut without reading the transcript end-to-end first. You'll miss the arc.
Assets too small to read. A chart or screenshot the viewer can't decipher is noise — size it to be legible on a phone.
Overlapping competing elements. A motion graphic landing on top of the caption line, a label dropped where captions live, two on-screen text blocks fighting for the same eye. Decide which element owns the beat and let the other step away.
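
A few of these can be linted mechanically before the table goes to review. The sketch below assumes beats with start_s, end_s, layout, and motion keys and a layout label "aroll-full" for a face-only beat. All of those names are hypothetical, and the checks cover only the anti-patterns that are measurable from timing and labels:

```python
def lint_edl(beats, max_hold_s=5.0, face_only_layout="aroll-full"):
    """Flag measurable anti-patterns in a draft EDL (field names assumed)."""
    warnings = []

    # Same layout held too long: measure each run of identical layouts.
    run_start = 0
    for i in range(1, len(beats) + 1):
        if i == len(beats) or beats[i]["layout"] != beats[run_start]["layout"]:
            held = beats[i - 1]["end_s"] - beats[run_start]["start_s"]
            if held >= max_hold_s:
                layout = beats[run_start]["layout"]
                warnings.append(
                    f"beats {run_start}-{i - 1}: layout '{layout}' held {held:.1f}s"
                )
            run_start = i

    # Static assets: any beat that shows an asset but specifies no motion.
    for i, beat in enumerate(beats):
        if beat["layout"] != face_only_layout and not beat.get("motion"):
            warnings.append(f"beat {i}: asset on screen with no motion")

    # Beats cut on the clock: every boundary lands on a whole second.
    if beats and all(float(b["start_s"]).is_integer() for b in beats):
        warnings.append("every beat starts on a round second; re-anchor to phrasing")

    return warnings


draft = [
    {"start_s": 0.0, "end_s": 2.4, "layout": "aroll-full", "motion": ""},
    {"start_s": 2.4, "end_s": 5.6, "layout": "aroll-full", "motion": ""},
    {"start_s": 5.6, "end_s": 8.1, "layout": "split", "motion": ""},
]
for warning in lint_edl(draft):
    print(warning)
```

The judgment calls (cutting away on a punchline, motion graphics as wallpaper) still need a human read of the transcript.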

What this skill does NOT do

It doesn't write hyperframes HTML, GSAP, or CSS. That's hyperframes' job — let it choose.
It doesn't pick exact colors, fonts, or motion curves. That's hyperframes' visual identity flow.
It doesn't render. After EDL approval, hand off.
It doesn't transcribe. If transcript is missing, ask the user to use hyperframes' transcribe path first.