remotion-talking-head-editor - SKILL.md Agent Skill

name: remotion-talking-head-editor description: Director's guide for editing talking-head footage into high-pacing 9:16 short-form social videos (Reels, Shorts, TikToks) using Remotion as the renderer, when the input is an A-roll, supporting assets, and a transcript. Use this skill whenever the user wants to cut a talking-head clip into a vertical short with Remotion, edit an interview into a reel rendered via Remotion, turn podcast footage into a TikTok built in Remotion/React, add B-roll/assets to an existing talking-head video using a Remotion pipeline, or otherwise produce short-form social from a person speaking on camera and the rendering stack is Remotion. Produces an EDL (edit decision list) for human review, then hands off to the remotion-best-practices skill (required) for composition authoring and rendering. Use even when the user does not explicitly say "EDL" or "short-form" — if the input shape is talking-head + assets + transcript and Remotion is in the picture, this is the skill. requires: - remotion-best-practices

Remotion Talking-Head Video Editor

Prerequisite — load the remotion-best-practices skill before doing anything else. This skill is layered on top of remotion-best-practices and assumes its full context is in scope. If it hasn't been loaded yet, load it now (read ~/.agents/skills/remotion-best-practices/SKILL.md and follow its loading instructions) before continuing. That skill carries every framework-specific technical rule — component shape, interpolation, asset embedding, fonts, captions, sequencing, timing — so this skill does not duplicate them.

Then read learnings_from_rev_runs.md before touching any files. It captures editorial and workflow decisions from prior runs: asset relevance rules, face-zone measurement, zone allocation (captions vs CTAs vs face), design direction habits, communication patterns. Skimming it once at the start of every job is how you avoid repeating mistakes already corrected.

Technical Remotion rules (the cinematographer's tools) live in remotion-best-practices. Editorial rules (the director's calls) live here.

You are the director. remotion-best-practices is the cinematographer.

This skill is domain knowledge about how high-pacing 9:16 short-form social edits feel when the raw input is a person talking on camera plus a folder of assets. It is not a build spec for Remotion components — remotion-best-practices already knows how to compose, animate, sequence, and render. Your job is to decide what the cut should be and produce an EDL that captures intent. The cinematographer reads the EDL and figures out execution.

The shape of the work

Inputs (require all three):

A-roll — the talking-head video file.
Assets — a folder of images and/or video clips that illustrate what's being said.
Transcript — timestamped, ideally word-level or phrase-level. If missing, ask the user to generate one (whisper / AssemblyAI / similar) before continuing.

Output:

An EDL describing each beat of the edit, anchored to the transcript.
Human review (markdown table view) and iteration until approved.
Handoff to remotion-best-practices with EDL + design direction. That skill scaffolds the Remotion project, writes the TSX composition, and renders the final video — including captions on top of the cut.

The transcript is the spine

Every decision in the edit hangs off the transcript. The transcript tells you:

Where the beats are — phrase boundaries, pauses, emphasis points, punchlines.
What's being said when — so an asset can land as the speaker mentions it.
How long each beat lasts — so pacing isn't guesswork.

Segment the transcript into phrase-level beats, not word-level and not whole-second chunks. A beat is roughly "one idea the speaker is delivering." Beats are usually 1.5–4 seconds. Cuts and layout changes happen on beat boundaries. This is the single biggest thing that separates good short-form from amateur short-form.

What "high-pacing" actually means

Not "fast cuts." Constant micro-variation. A scrolling viewer leaves the moment things feel static or predictable. So:

Something is always changing. A new layout, a new asset, a subtle motion, a graphic punch-in.
Variation matters more than speed. Five fast cuts that all look the same are still boring. One thoughtful layout swap on the right beat beats four mechanical cuts.
A held shot longer than ~3 seconds needs internal motion. Static + held = dead frame. Add a slow zoom, a parallax drift, a graphic appearing — something.
Cut on idea boundaries, not the clock. If a beat ends mid-thought, you'll feel it.

Layout examples

These are common patterns in good short-form. They are examples to think with, not a closed menu — the cinematographer can compose other layouts when a beat calls for it.

Understanding "split" vs "overlay"

Split means A-roll is cropped and repositioned. The cropped portion must contain the face.

Use detect_face_zones.py to get exact face coordinates
Crop centered on the detected face bbox or center point
Example: face detected at bbox [350, 500, 380, 520] with center [540, 760]
- Crop region might be y 480–1440 (960px tall, centered on y=760)
- Adjust crop dimensions based on actual face position and composition needs

Overlay means A-roll stays full-frame (no crop), and the asset overlays on top in a safe zone (calculated from detected face bbox to avoid covering the face).

Layout patterns

A-roll full-frame. The whole 9:16 is the speaker, no assets visible. Use when the line is emotional, a punchline, a hard claim, or when the face is the information. Resist decorating every beat — let the face breathe on the moments that matter.
A-roll bottom + asset top, ~50/50 vertical split. A-roll is cropped (face-aware crop, ~~960px height) and positioned in the bottom half. Asset fills the top half (~~960px). Use when an asset directly illustrates the line being spoken — a chart, a screenshot, a still — and you need the asset large and readable while keeping the speaker visible.
Asset over A-roll (overlay). A-roll stays full-frame (no crop). Asset overlays as a card, chip, or graphic in a safe zone (above or below face). Can be small (icons, chips) or large (cards covering 40–50% of frame height). Use when showing the face matters — emotional moments, reactions, body language — and you want visual context without cropping away the speaker.
Stacked landscape strips — 2 or 3 wide assets stacked vertically. Use when multiple landscape clips/images belong to the same idea: three product shots, a before/middle/after sequence, three reaction clips. Fills 9:16 cleanly without letterboxing landscape footage. Can be overlaid on A-roll or replace it entirely depending on whether the face matters for that beat.
Asset full-frame. A-roll steps away entirely; the asset is the moment. Use for hero product shots, a clip the speaker is reacting to, or a visual that needs the whole frame to land.
Motion graphic over A-roll. A-roll stays full-frame; a designed motion graphic — from a simple label to a diagram, infographic, or asset-based composition — overlays in safe zones. Use when illustrating abstract concepts, emphasizing specific words/numbers, or adding context while keeping the speaker visible.

When choosing, the asset's aspect and content are hints: a wide screenshot pairs naturally with a split or overlay layout; a vertical phone screenshot might want full-frame or side-by-side; a hero product shot often wants full frame. These are starting points, not rules.

Motion in assets

Static assets read as dead frames. Subtle motion on assets is the default, not optional. The intent is "keep the brain engaged" — examples include slow zoom in or out, slow pan, parallax drift, a small rotation. Spec the intent in the EDL ("slow zoom in", "drift left") and let the cinematographer pick the exact interpolation curve and amount.

A-roll itself is usually static — the viewer is reading the speaker's face, and camera moves on a talking head feel weird. Reserve dramatic motion for emphasis beats, not as a default.

Legibility and layering

Whatever ends up on screen has to be readable and unobstructed.

Assets are sized to be seen. If the viewer can't make out what an asset shows, it's not doing any work — it's noise. A screenshot of dense text, a chart with small labels, a phone UI — these need to be sized large enough to actually read on a phone screen. If a beat needs the asset readable but the layout makes it tiny, change the layout (go full-frame, drop the A-roll, simplify the asset).
Logical sizing. Asset size in the frame should match how much attention it deserves on that beat. The thing the viewer is meant to look at is the largest thing.
Don't overlap competing elements. Motion graphics, on-screen text, and captions all fight for the same eye. Don't stack a stat callout on top of a caption line. Don't drop a label where the caption strip lives. When two elements want the same screen real estate, decide which one matters more for that beat and let the other step out.

Captions

Captions are part of the deliverable — short-form social is watched on mute by default. The actual caption rendering, font handling, and word-level animation is handled by remotion-best-practices (see its rules/subtitles.md and rules/text-animations.md). This skill's job is to:

Reserve a caption zone in the EDL — typically a band near the lower-middle of the frame, derived from the detected face zone. Keep this zone consistent across the video.
Don't place assets, motion graphics, or labels inside the caption zone — this is the most common overlap mistake.
Pick a caption vibe — emphatic-editorial, kinetic-slam, karaoke word-by-word, neon, glitchy, etc. — and pass it to remotion-best-practices as part of the design direction. The cinematographer chooses the exact typography, color highlights, and animation form to match.
Account for the caption zone when picking layouts. If a beat is "asset full-frame," the caption still has to land somewhere readable on top of it — make sure that's possible.

Motion graphics — the third tool

Use a motion graphic whenever the idea the speaker is delivering has a shape that can be shown. The job is to make the abstract concrete — show a relationship, give a number a stage, draw a contrast, visualize a connection.

Motion graphics work with A-roll and assets, not as substitutes for them. An asset can be a layer inside a motion graphic — a photo with a designed overlay and animation is a motion graphic. A-roll with an animated diagram on top is a motion graphic. The composition can include all three simultaneously.

Form follows content. Match the graphic form to what the content actually is:

A contrast between two things → spatial split, balance, or two-column layout
A number or stat → large display type, animated counter, or gauge — not a chip
A relationship or connection → diagram, route visualization, drawn edge
A sequence or list → items that build on screen one by one as each is spoken
A product or place → asset as the base layer, with labels, specs, or motion composited on top

Spec only the intent in the EDL ("show how A connects to B", "emphasize the stat on this line"). Don't pre-decide the execution — let the cinematographer find the form.

The EDL

The EDL is a brief, not a build sheet. It captures roughly what the video should look like at each beat. Each row corresponds to a transcript span and says:

time — start/end (anchored to transcript)
text — the transcript phrase
what's on screen — describe the composable layers for this beat: A-roll (full-frame or cropped), any assets (as base layer, card, or strip), and any motion graphic elements (overlaid or composited on top). A beat can have all three simultaneously.
layout — the rough composition (one of the patterns above, or a new one with a short description)
motion — what's moving and roughly how
intent — one short line: why this beat is cut this way ("punch in on the stat", "let her land the joke", "graph slides in under the line about growth")

Format: JSON. See references/edl-schema.md for the schema and a worked example.

For human review, render the EDL as a markdown table — never dump raw JSON at the user. Columns: time, text, on-screen, layout, motion, intent. Iterate with the user until they approve.

Workflow

Initialize the edit folder.
```
bash scripts/setup_edit.sh <aroll.mp4> <transcript.json> <edit_folder_path>
```
Creates <edit_folder>/{assets,drafts,qa_frames}/ and copies A-roll and transcript as real files. The Remotion project itself (package.json, src/Root.tsx, etc.) is scaffolded later during the remotion-best-practices handoff — this script is intentionally lean.

1b. Probe the A-roll.

bash scripts/probe_aroll.sh <edit_folder>/aroll.mp4

Gets resolution, aspect, duration, fps. Situational awareness — no layout decisions follow directly, but the fps becomes the Remotion composition's fps.

1c. Assess the face zone. Use automated face detection to get exact coordinates instead of manual inspection.

Tool available: scripts/detect_face_zones.py (requires: pip install -r scripts/requirements.txt)

Returns JSON with face bounding boxes:

{
  "image": "frame.png",
  "resolution": [1080, 1920],
  "faces": [{"index": 0, "bbox": [x, y, width, height], "center": [cx, cy], "score": 0.95}]
}

How to use (you decide when/how):

# Extract frame at whatever timestamp makes sense
ffmpeg -y -ss 10 -i <edit_folder>/aroll.mp4 -frames:v 1 <edit_folder>/drafts/frame.png

# Detect faces - JSON to stdout
# NOTE: Use scripts/.venv for Python dependencies (root .venv may have broken opencv)
cd ~/.claude/skills/remotion-talking-head-editor/scripts
source .venv/bin/activate 2>/dev/null || python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
python3 detect_face_zones.py <edit_folder>/drafts/frame.png

# Interpret coordinates and determine safe zones for overlays/captions/CTAs
# If speaker moves, sample multiple timestamps to see the range

For face-aware crops: Use bbox or center coordinates to crop around the face (in Remotion, this becomes the style.transform or a clipped <OffthreadVideo> source rect). For overlays: Calculate safe zones from bbox to avoid covering the face. Optional visualization: Add --visualize -o annotated.png to see bounding boxes drawn on the image.

Confirm inputs and establish design direction. A-roll, assets source folder, transcript all in place. Either the user has color/font/vibe preferences, or ask the standard visual-identity questions (mood, light/dark, brand references) and propose. Lock the palette before any composition code is written — every overlay, motion graphic accent, and caption highlight depends on it.
Prepare assets.
```
bash scripts/prepare_assets.sh <selected_assets_dir> <edit_folder_path>
```
Prerequisite: curate which assets belong in <selected_assets_dir> first — do not pass a raw download folder. The script re-encodes long-GOP videos (stock footage from Pexels/Unsplash often ships with 6–15s keyframe intervals; Remotion's <OffthreadVideo> seeks to arbitrary timestamps and can freeze or show the wrong frame on long-GOP sources), copies images unchanged, and writes <edit_folder>/assets/inventory.json.
Inventory the assets. Open assets/inventory.json. Fill in what_it_shows for each entry. Note candidate beats in the transcript where each asset could land.
Segment the transcript into beats. Phrase-level. This is the EDL's row count.
Draft the EDL. Walk beat by beat, deciding layout / asset / motion / intent. Keep the variation principle in mind — don't repeat the same look for 3+ consecutive beats.
Show the markdown table to the user. Iterate until approved.

7b. Validate EDL timing.

python3 scripts/validate_edl.py edl.json

Run immediately after approval, before handing off. Uses Python Decimal (exact base-10 arithmetic) to compute FP-safe durations and writes edl_validated.json with a safe_duration field per beat. The cinematographer must use safe_duration for any <Sequence durationInFrames> math (Math.round(safe_duration * fps)) — never compute frame counts from end_s - start_s directly.

Hand off to remotion-best-practices. Load that skill (if not already in context) and pass it:
- edl_validated.json
- The design direction (palette hex values, type direction, caption vibe)
- Pointers to <edit_folder>/aroll.mp4, <edit_folder>/transcript.json, <edit_folder>/assets/
- The detected face-zone coordinates
- The target fps (from the A-roll probe) and output resolution (1080×1920 by default)
The cinematographer scaffolds the Remotion project inside <edit_folder> (npx create-video@latest or equivalent), authors the TSX composition, and owns rendering from here.
QA the build. Follow references/remotion-qa.md. The short version:
- Stage 1 — Remotion Studio (npm start) for interactive scrubbing during authoring. Frame-by-frame, free of any render cost.
- Stage 1.5 — single-frame still batches for verifying specific fixes without re-rendering the whole video. Write a tiny Node script using @remotion/bundler + @remotion/renderer that bundles once and rolls through a list of frames (~1s per frame after the first). Use this as the default feedback loop during iteration — it's an order of magnitude cheaper than a preview render.
- Stage 2 — low-res preview render as final gate before high-quality (npx remotion render --scale 0.5 --crf 30 ...). Catches pacing, sync, and codec issues stills cannot.
- Stage 3 — final high-quality render (npx remotion render --crf 18 ...) once the preview is clean.
Always run the overflow audit (see references/remotion-qa.md — "Overflow audit") before moving from Stage 1.5 to Stage 2. Off-screen overlays are the #1 silent bug in this style; the audit catches them in seconds and prevents wasted preview renders.
Captions. Pick the caption vibe at EDL approval (see "Captions" section above). The cinematographer implements it using whatever caption pattern fits — @remotion/captions, custom word-by-word components from remotion-best-practices/rules/subtitles.md, or hand-rolled. Confirm the caption zone in the EDL doesn't collide with any beat's on-screen elements before approving.

Anti-patterns

These hurt short-form social specifically. Avoid them.

Same layout held for 5+ seconds. Even if the speaker is good, the visual pulse dies.
Assets stacked into every line. Viewer has nowhere to rest. Let the face breathe on emotional beats.
Cutting away from the face during the punchline or emotional peak. Trust the speaker.
Motion graphics used as wallpaper. They should add information, not fill silence.
Beats aligned to round seconds instead of transcript phrasing. It will feel mechanical.
Static assets with no motion. Reads as dead frame; viewer scrolls.
Designing the cut without reading the transcript end-to-end first. You'll miss the arc.
Assets too small to read. A chart or screenshot the viewer can't decipher is noise — size it to be legible on a phone.
Overlapping competing elements. A motion graphic landing on top of the caption line, a label dropped where captions live, two on-screen text blocks fighting for the same eye. Decide which element owns the beat and let the other step away.
Skipping the Studio pass. Studio is free real-time feedback — using it during authoring catches most layout/zone issues before they ever reach a render.
Skipping the low-res preview render. Studio scrubbing doesn't surface pacing problems or audio/visual sync drift. The preview render does. Skipping it means finding these issues in the 10-minute final render.
position: absolute wrapper anchored with right: and no width. Its position: absolute children render off-screen because the wrapper has zero computed width. If anchoring to a frame edge, set explicit width/height on the wrapper or compute left = COMPOSITION_WIDTH - margin - childWidth. Audit every slide-in card for this — it's silent (no error) and only shows up in a still.
Re-rendering the whole preview to verify a single-beat fix. That's a 3-minute round-trip for a 10-second iteration. Use Stage 1.5 single-frame stills instead.

What this skill does NOT do

It doesn't write Remotion TSX, interpolation, or composition code. That's the cinematographer's job — let remotion-best-practices choose.
It doesn't pick exact colors, fonts, or interpolation curves. That's part of the visual-identity flow you run with the user before handoff; pass the locked palette and type direction as design direction, not as code.
It doesn't render. After EDL approval, hand off — the cinematographer scaffolds the project and renders.
It doesn't transcribe. If transcript is missing, ask the user to generate one first (Whisper, AssemblyAI, or remotion-best-practices may have a path for it).