ai-talking-heads - SKILL.md Agent Skill

name: ai-talking-heads
description: Guide agents through producing realistic longform AI talking-head/UGC videos using syllable-counted script chunking, image/video generation prompts, and Remotion for post-production assembly.

AI Talking Heads

Guide the user through producing realistic longform AI talking-head and UGC-style videos. The workflow covers four phases: script preparation (syllable-counted chunking), image generation (a photorealistic character), video generation (one clip per chunk), and post-production (assembly via the remotion skill).

When to Use

Use this skill when the user wants to:

Create an AI talking-head video (a person speaking to camera)
Produce UGC-style (user-generated content) video with an AI character
Make a longform AI video (>30 seconds) with consistent character
Generate video prompts for Kling, Veo, or similar tools
Understand the full pipeline from script to finished video

Prerequisites

Before starting, run scripts/check-prereqs.sh (in this skill directory) using Bash.

Interpreting results:

Check	If `ok`	If `missing` / `outdated`
node	Proceed	Install: `nvm install 20` or https://nodejs.org/ (needed for Remotion post-production)
npm	Proceed	Comes with Node; if missing, reinstall Node
ffmpeg	Proceed	macOS: `brew install ffmpeg`. Linux: `apt install ffmpeg`. Soft-fail: can continue without it, but audio processing and Remotion rendering need it
yt-dlp	Proceed	Optional. Only needed if downloading reference footage. Install: `pip install yt-dlp`

Remotion skill availability is reported in context.remotion_skill_available:

true → Phase 4 can use the remotion skill directly
false → Phase 4 will provide manual assembly guidance and direct links to remotion.dev

When to skip checks:

User explicitly says they have the tools
User only needs help with script chunking (Phases 1-3 need none of these tools, apart from ffmpeg for the audio normalization in Phase 3, Step 4)

Workflow Phase 1: Script Preparation

Goal: Split the user's script into chunks that target 55-60 syllables each (anything from 45 to 65 is accepted as ok in Step 3), where each chunk becomes one 10-second video clip.

Step 1: Get the Script

Ask the user for their full script text. It should be:

Written out in full sentences (not bullet points)
In the voice/tone they want for the final video
Numbers written as spoken words ("twenty twenty-four" not "2024")

If the user doesn't have a script yet, help them write one. Ask about: topic, target audience, key points, desired tone, and target duration.

Step 2: Run Syllable Counting

Pipe the script to scripts/count-syllables.sh:

echo "SCRIPT_TEXT_HERE" | scripts/count-syllables.sh

Or save to a file and pass as argument:

scripts/count-syllables.sh script.txt

Step 3: Interpret the Output

The script outputs JSON with chunks, syllable counts, and status flags.

Show the user:

Number of chunks (= number of video clips needed)
Estimated duration in seconds
Any chunks with status issues

Handle status flags:

Status	Meaning	Action
`ok`	45-65 syllables	No changes needed
`needs_filler`	< 45 syllables	Suggest a bridge sentence. See `references/script-chunking.md` for filler guidance
`over_limit`	> 65 syllables	Suggest splitting or trimming; otherwise the delivery will sound rushed
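The thresholds in the table, and the summary shown to the user, can be sketched in a few lines of Python. This is an illustration only: the real logic lives in scripts/count-syllables.sh, whose implementation and JSON field names are not shown here, so `chunk_status` and `summarize` are hypothetical helpers that work on plain syllable counts.

```python
CLIP_SECONDS = 10  # one 10-second clip per chunk (Phase 1 goal)


def chunk_status(syllables: int) -> str:
    """Map a syllable count to a status flag from the table above."""
    if syllables < 45:
        return "needs_filler"
    if syllables > 65:
        return "over_limit"
    return "ok"


def summarize(syllable_counts: list[int]) -> dict:
    """Summarize chunks: clip count, estimated duration, flagged chunks."""
    statuses = [chunk_status(n) for n in syllable_counts]
    return {
        "clips": len(syllable_counts),
        "estimated_seconds": len(syllable_counts) * CLIP_SECONDS,
        # 1-based numbers of the chunks whose status is not "ok"
        "flagged": [i for i, s in enumerate(statuses, start=1) if s != "ok"],
    }


print(summarize([58, 41, 60, 70]))
# {'clips': 4, 'estimated_seconds': 40, 'flagged': [2, 4]}
```

In the example, chunk 2 would need a filler sentence and chunk 4 would need trimming before moving on to Step 4.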

Step 4: Iterate

If any chunks need adjustment:

Propose specific filler sentences or cuts
Get user approval on changes
Re-run count-syllables.sh on the updated script
Repeat until all chunks are ok

Reference: references/script-chunking.md for the full method, worked examples, and edge cases.

Workflow Phase 2: Image Generation

Goal: Generate a photorealistic base character image that will be used as the reference frame for all video clips.

Step 1: Gather Character Details

Ask the user about their character:

Appearance: Age, ethnicity, gender, hair style/color
Clothing: Casual/professional, specific items
Setting: Where are they? Home office, kitchen, outdoors?
Mood/tone: Matches the script -- confident, friendly, authoritative?

Step 2: Build the Image Prompt

Construct a prompt using the template in references/prompt-templates.md. Key principles:

The Imperfection Principle: Real UGC footage has visible skin pores, natural oils, imperfect lighting, and casual framing. Your prompt must explicitly request these.

Essential prompt elements:

iPhone/mobile camera specs (ISO, aperture)
"Visible skin pores, natural skin oils"
"No airbrushing, no studio lighting"
Specific environment (not generic blur)
Chest-up or waist-up framing (not tight headshot)
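One way to make sure no essential element is forgotten is to assemble the prompt from named parts. The sketch below is not the template from references/prompt-templates.md (which is not reproduced here); `build_image_prompt` and its wording are hypothetical, and the camera specs are passed in rather than invented.

```python
def build_image_prompt(appearance: str, clothing: str, setting: str,
                       mood: str, camera_specs: str,
                       framing: str = "chest-up") -> str:
    """Assemble an image prompt that carries the essential realism elements."""
    if framing not in ("chest-up", "waist-up"):
        raise ValueError("use chest-up or waist-up framing, not a tight headshot")
    parts = [
        f"Casual {framing} phone photo of {appearance}, wearing {clothing}.",
        f"Setting: {setting}.",  # a specific environment, not a generic blur
        f"Mood: {mood}.",
        f"Camera: {camera_specs}.",
        "Visible skin pores, natural skin oils.",
        "No airbrushing, no studio lighting.",
    ]
    return " ".join(parts)


prompt = build_image_prompt(
    appearance="a woman in her thirties with shoulder-length brown hair",
    clothing="a grey hoodie",
    setting="a cluttered home office with a bookshelf behind her",
    mood="friendly and confident",
    camera_specs="<ISO and aperture values from the template>",
)
print(prompt)
```

The character details in the usage example are placeholders; in practice they come from the answers gathered in Step 1.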

Step 3: Generate and Review

Present the prompt to the user. They will generate the image using their preferred tool.

Recommended tool: Nano Banana Pro (via fal.ai). Alternatives: Flux, DALL-E 3, Midjourney v6.

Step 4: Upscale

If the generated image is below 1080x1920 resolution, upscale it with one of:

Recommended: Enhancor AI (preserves skin texture)
Alternative: Real-ESRGAN with realesrgan-x4plus model

The upscaled image becomes the reference frame for all video generation.
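The resolution check itself is a one-line comparison. The sketch below assumes "below 1080x1920" means that either dimension falls short of the target; `needs_upscale` is a hypothetical helper, and the width and height would come from whatever tool the user employs to inspect the image.

```python
MIN_WIDTH, MIN_HEIGHT = 1080, 1920  # 9:16 target from this step


def needs_upscale(width: int, height: int) -> bool:
    """Return True when either dimension is below the 1080x1920 target."""
    return width < MIN_WIDTH or height < MIN_HEIGHT


print(needs_upscale(768, 1344))   # True
print(needs_upscale(1080, 1920))  # False
```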

Reference: references/image-generation.md for the imperfection principle, tool details, and character consistency guidance.

Workflow Phase 3: Video Generation

Goal: Generate one 10-second video clip per script chunk, using the base image as reference frame.

Step 1: Choose Video Generation Tool

Ask the user which tool they have access to:

Tool	Clip Length	Best For
Kling 2.6 (recommended)	10s	Consistent character, standard talking heads
Veo 3.1	8s	Superior body language, natural motion
InfiniteTalk	Continuous	Pre-recorded audio, no clip assembly needed

Step 2: Build Video Prompts

For each chunk, construct a video prompt using references/prompt-templates.md. Each prompt must include:

Reference image attachment (the base character from Phase 2)
Subject description matching the base image
Tone matching the script (do NOT include the actual script text -- it can trigger content filters)
Timestamped action clusters -- 3 time blocks describing body language:
- [0-3s] Opening pose and initial gesture
- [3-6s] Emphasis gesture at the key point
- [6-10s] Settling gesture, transition-ready pose

Avoid specific gestures: Do NOT prompt for "holds up three fingers" or "counting on fingers" -- AI video generators often render the wrong number of fingers. Use vague descriptions like "emphatic hand gesture" instead. See references/prompt-templates.md for safe gesture alternatives.

Clip boundary continuity: End each clip with a neutral/settled pose that the next clip can start from. See references/video-generation.md for continuity techniques.
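The prompt rules above can be sketched as a small builder. This is an assumption-laden illustration, not the template from references/prompt-templates.md: `build_video_prompt` is hypothetical, the time blocks are the 10-second ones listed above, and the finger check is a crude keyword heuristic. The script text is deliberately not a parameter, and the reference image is attached in the video tool itself rather than in the text.

```python
RISKY_TERMS = ("finger",)  # heuristic: exact finger gestures tend to render wrong


def build_video_prompt(subject: str, tone: str, opening: str,
                       emphasis: str, settling: str) -> str:
    """Build a 10-second clip prompt with three timestamped action clusters."""
    actions = [("0-3s", opening), ("3-6s", emphasis), ("6-10s", settling)]
    for _, text in actions:
        if any(term in text.lower() for term in RISKY_TERMS):
            raise ValueError(f"avoid specific finger gestures: {text!r}")
    lines = [f"Subject: {subject}", f"Tone: {tone}"]
    lines += [f"[{span}] {text}" for span, text in actions]
    return "\n".join(lines)


print(build_video_prompt(
    subject="woman in a grey hoodie, home office, chest-up framing",
    tone="friendly, conversational",
    opening="relaxed posture, slight nod toward the camera",
    emphasis="emphatic hand gesture, leans in slightly",
    settling="hands lower, neutral settled pose facing the camera",
))
```

Ending the last block on a neutral pose is what lets the next clip pick up from a matching position.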

Step 3: Generate Clips

Present all video prompts to the user (one per chunk). They generate clips in order. After generation:

Check each clip for acceptable lip movement and body language
Regenerate any clips where the character looks significantly different from the base image
Note problem clips for transition coverage in post-production

Step 4: Audio Preparation

The video clips have no usable audio. The user needs voice audio:

Record their own voice reading each chunk, or
Generate AI voice using ElevenLabs, Resemble AI, or similar, or
Use text-to-speech as a quick option

Minimum audio cleanup: Run the voice track through Adobe Podcast Enhance Speech (free tier), then normalize its loudness with ffmpeg:

ffmpeg -i voice.wav -af loudnorm=I=-16:TP=-1.5:LRA=11 voice_normalized.wav
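When there is one voice file per chunk, the same command has to be repeated for each of them. The sketch below only builds the argument lists, reusing the filter settings from the command above; the `chunk_NN.wav` naming and the `loudnorm_command` helper are hypothetical.

```python
from pathlib import Path

LOUDNORM = "loudnorm=I=-16:TP=-1.5:LRA=11"  # same filter settings as above


def loudnorm_command(src: str) -> list[str]:
    """Build the ffmpeg argument list that normalizes one voice file."""
    path = Path(src)
    # voice.wav -> voice_normalized.wav, kept next to the source file
    dst = path.with_name(f"{path.stem}_normalized{path.suffix}")
    return ["ffmpeg", "-i", str(path), "-af", LOUDNORM, str(dst)]


for name in ["chunk_01.wav", "chunk_02.wav"]:
    print(" ".join(loudnorm_command(name)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once ffmpeg is installed.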

Reference: references/video-generation.md for tool comparison, action clusters, and the InfiniteTalk alternative. references/audio-cleanup.md for audio processing.

Step 5: Lip Sync (Optional but Recommended)

Important: Kling/Veo image-to-video does NOT sync lip movement to external audio. The generated video has generic "talking" mouth motion that won't match your audio track.

For professional results, run clips through a lip sync tool:

Wav2Lip (recommended) -- Open-source, can run locally or via Replicate API
SadTalker -- Alternative with better head motion
Skip this step if using captions that draw attention away from lips, or for rough cuts

See references/video-generation.md for detailed lip sync workflow.

Workflow Phase 4: Post-Production

Goal: Assemble all clips into a final video with transitions, captions, and audio.

If Remotion Skill Is Available

Reference the remotion skill for assembly. The key remotion references for this workflow:

Task	Remotion Reference
Clip sequencing	`references/sequencing.md` — Use `<Series>` or `<TransitionSeries>`
Transitions	`references/transitions.md` — Cross-fade (0.5s) recommended for talking heads
Captions	`references/display-captions.md` — TikTok-style word highlighting
Caption generation	`references/transcribe-captions.md` — Whisper transcription from audio
Audio mixing	`references/audio.md` — Voice + background music layering
Clip trimming	`references/trimming.md` — Remove generation artifacts at clip edges

Composition setup for talking-head vertical video:

Width: 1080, Height: 1920 (9:16)
FPS: 30
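Duration: sum of all clip durations, expressed in frames; when clips are joined with `<TransitionSeries>`, subtract the length of each transition, because adjacent clips overlap while a transition plays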
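The composition length is specified in frames, so the clip durations need converting. The sketch below shows the arithmetic under one assumption: each transition overlaps the two clips it joins, which is how Remotion documents `<TransitionSeries>`. With plain `<Series>` (no transitions), leave `transition_seconds` at 0. `composition_frames` is a hypothetical helper.

```python
FPS = 30  # from the composition setup above


def composition_frames(clip_seconds: list[float],
                       transition_seconds: float = 0.0) -> int:
    """Total composition length in frames for clips joined in sequence."""
    clip_frames = sum(round(s * FPS) for s in clip_seconds)
    transitions = max(len(clip_seconds) - 1, 0)  # one between each pair of clips
    return clip_frames - transitions * round(transition_seconds * FPS)


print(composition_frames([10, 10, 10]))       # 900
print(composition_frames([10, 10, 10], 0.5))  # 870
```

Three 10-second clips with two 0.5-second cross-fades therefore come to 870 frames (29 seconds), not 900.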

Jump cut coverage: Where clips transition poorly, use:

Cross-fade transitions (0.5s)
Brief text overlay or zoom effect
B-roll footage if available

If Remotion Skill Is Not Available

Guide the user to assemble manually:

Remotion directly: https://remotion.dev/docs — set up a project with npx create-video@latest
CapCut (free): https://www.capcut.com — drag-and-drop assembly with auto-captions
DaVinci Resolve (free): Professional NLE for manual editing

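ffmpeg concat (no transitions; `-c copy` requires all clips to share the same codec and resolution, and the shell sorts `clip_*.mp4` lexicographically, so zero-pad the clip numbers, e.g. clip_01.mp4):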

printf "file '%s'\n" clip_*.mp4 > filelist.txt
ffmpeg -f concat -safe 0 -i filelist.txt -c copy output.mp4
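If the clips are not zero-padded, a shell glob lists clip_10.mp4 before clip_2.mp4. The sketch below builds the same filelist.txt contents with a numeric-aware sort; `natural_key` and `concat_list` are hypothetical helpers, and file names containing single quotes are not handled.

```python
import re


def natural_key(name: str) -> list[tuple[int, str]]:
    """Sort key that orders embedded numbers numerically (2 before 10)."""
    return [(int(t), "") if t.isdigit() else (-1, t)
            for t in re.split(r"(\d+)", name)]


def concat_list(filenames: list[str]) -> str:
    """Render the contents of filelist.txt for ffmpeg's concat demuxer."""
    ordered = sorted(filenames, key=natural_key)
    return "".join(f"file '{name}'\n" for name in ordered)


print(concat_list(["clip_10.mp4", "clip_2.mp4", "clip_1.mp4"]), end="")
# file 'clip_1.mp4'
# file 'clip_2.mp4'
# file 'clip_10.mp4'
```

Writing the returned string to filelist.txt gives the input expected by the ffmpeg command above.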

Reference: references/post-production.md for the full remotion task map and fallback options.

Reference Index

Reference	Content
references/script-chunking.md	55-60 syllable method, worked example, filler guidance
references/image-generation.md	Nano Banana Pro, imperfection principle, character consistency
references/video-generation.md	Kling/Veo comparison, timestamped action clusters, reference frames
references/audio-cleanup.md	Adobe Podcast vs Resemble AI, ffmpeg normalization
references/post-production.md	Bridge to remotion skill, fallback assembly options
references/prompt-templates.md	Full copy-pasteable image and video prompt templates

After Execution

After completing the skill output, score it and log feedback using the hybrid pattern: a structural evaluation plus a qualitative check.

Step 1: Structural Score

Run scripts/evaluate.sh — pipe the full skill output to it:

echo "$OUTPUT" | scripts/evaluate.sh

The script checks chunk structure, syllable ranges, image/video prompt markers, and post-production references. Use the JSON result for the structural score.

Step 2: Qualitative Score

On the first run and every 3rd run thereafter, ask the user:

"Did the generated clips match your character consistently? Would you use these prompts without major edits?"

Map the response: yes = 5, mostly = 4, some issues = 3, no = 2.

On other runs, self-assess against these criteria: chunks in expected range, prompts include realism markers and action clusters, post-production plan references remotion. Use the same 1-5 scale.
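The scoring rules in this step can be sketched as follows. Two assumptions are baked in and may not match the author's intent: "the first run and every 3rd run thereafter" is read as runs 1, 4, 7, and so on, and the run number is taken to be something the agent tracks itself (for example via iteration_count). Both helper names are hypothetical.

```python
RESPONSE_SCORES = {"yes": 5, "mostly": 4, "some issues": 3, "no": 2}


def should_ask_user(run_number: int) -> bool:
    """True on run 1 and on every third run after it (1, 4, 7, ...)."""
    return run_number >= 1 and (run_number - 1) % 3 == 0


def qualitative_score(response: str) -> int:
    """Map the user's answer to the score given in this step."""
    return RESPONSE_SCORES[response.strip().lower()]


print([n for n in range(1, 11) if should_ask_user(n)])  # [1, 4, 7, 10]
print(qualitative_score("Mostly"))                      # 4
```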

Step 3: Log Feedback

Append one JSON line to FEEDBACK.jsonl in this skill directory:

{"ts":"<UTC ISO 8601>","skill":"ai-talking-heads","version":"<from CONFIG.yaml>","prompt":"<user's original request>","structural":<1-5>,"qualitative":<1-5>,"outcome":<average, rounded>,"note":"<brief note if not 4>","source":"user","schema_version":1}

Set "source":"user" when the qualitative score came from the user, or "source":"llm" when self-assessed.

Alternative: Run log-feedback.sh from the project root to log feedback interactively.

Step 4: Increment Iteration Count

Increment iteration_count under compaction in CONFIG.yaml.