name: edu-vid-gen description: Generate an educational explainer video for a given topic and school class. Audio-first pipeline with VO-driven clip planning, Gemini-powered validation after every clip, MoviePy compositor for transitions, ambient audio layers (--audio-layers), and checkpoint gates between phases. Supports human/abstract/no-character modes.
Edu Video Generator V2
Follow each phase exactly and in order. Heavy content lives in references/ — load only when needed.
Scripts & Auth Setup
Scripts live at: __PLUGIN_DIR__/scripts/
| Script | Purpose | Auth |
|---|---|---|
generate-audio-timeline.mjs |
Full VO + word timestamps -> timeline.json |
ELEVENLABS_API_KEY |
rebuild-timeline.mjs |
Validate + renumber timeline after edits | none |
slice-audio.mjs |
Slice full-vo.mp3 at clip boundaries |
ffmpeg |
generate-image.mjs |
Keyframe images via Gemini | GEMINI_API_KEY |
generate-image-vertex.py |
Keyframe images via Vertex AI | gcloud ADC |
generate-voiceover.mjs |
Single-segment TTS (legacy/standalone) | ELEVENLABS_API_KEY |
generate-video.py |
Video clips via Vertex AI Veo 3.1 | gcloud ADC |
generate-character-sheet.mjs |
Character pose/expression reference sheets | GEMINI_API_KEY |
generate-subtitle-video.py |
Karaoke subtitle overlay for review | ffmpeg + Pillow |
generate-ambient.mjs |
Generate ambient loop via ElevenLabs Sound Effects API | ELEVENLABS_API_KEY |
composite.py |
MoviePy compositor — transitions + VO overlay + ambient layer | ffmpeg + moviepy |
validate-clip.py |
Per-clip Gemini validation (sync, text, style) | GEMINI_API_KEY |
validate-final.py |
Final video Gemini validation (junctions, ship_ready) | GEMINI_API_KEY |
checkpoint.py |
Phase gate verification | none |
enhance-for-print.mjs |
Upscale, CMYK, contrast for print | ImageMagick 7 |
extend-image.mjs |
Extend images for text overlay space | GEMINI_API_KEY + ImageMagick |
extend-video.py |
Veo video extension chain | gcloud ADC |
generate-zoom.mjs |
Ken Burns zoom clip generation | ffmpeg |
stitch.mjs |
Legacy ffmpeg concatenation | ffmpeg |
First-time setup: Run /setup to install prerequisites and configure API keys.
Loading API keys: Before running any script that requires an API key, source the plugin's .env file:
# Load keys from plugin .env
set -a; source "__PLUGIN_DIR__/.env" 2>/dev/null; set +a
Phase 1 — Collect Inputs
Ask the user:
- Topic — educational concept (e.g. "Water Cycle", "Photosynthesis")
- Class — grade level (e.g. "Class 5", "7th grade")
- Narration language — Hindi / English / Hinglish / Other. Save as
NARRATION_LANG. - Chapter source (optional) — URL, PDF, or textbook text. If provided, ALL narration must derive from it.
- Visual style — Pixar, Clay, 2D Flat, Doodle, Watercolour, or Photorealistic
- Characters — Human (
CHARACTER_MODE=human), Abstract (abstract), or None (none)- Human: Veo content filter sensitivity depends on style. See style-safety matrix below.
- Abstract: image-to-video safe. Generate character sheets in Phase 2.1.
- None: image-to-video with start+end frames. Skip Phase 2.1.
Style-Safety Matrix for Human Characters (Veo content filter):
| Style | Image-to-video safe? | Notes |
|---|---|---|
| Clay/Claymation | YES — figurine/toy read bypasses filter | Use "clay figurine", "toy diorama", "fingerprint textures" |
| Pixar (toy-like) | YES — if bobblehead/plastic/figurine proportions | Use "toy-like", "plastic skin", "figurine proportions", oversized head |
| Pixar (realistic) | NO — human-proportioned faces trigger filter | Fall back to text-to-video |
| Watercolour | NO — even fully clothed, illustration-style children get blocked | Fall back to text-to-video |
| 2D Flat / Doodle | Usually safe — test first frame before batch | Very stylized = safer |
| Photorealistic | NO — always triggers filter for child characters | Text-to-video only |
Key rules for passing Veo's content filter with human characters:
- The more artificial/toy-like the character looks, the safer it is
- Always give characters full clothing (shorts + vest/shirt) — never just a loincloth
- Use "figurine", "toy", "plastic", "puppet" language in Veo prompts
- Replace "toddler/baby/child/boy/girl" with "small character", "cartoon figurine"
- If a style fails: don't retry same style — switch to toy-Pixar or clay
- Duration — 60s or 90s
- Aspect ratio — 16:9 or 9:16
- Ambient layer (
--audio-layers) — Auto-suggest a category based on topic/setting. User can accept, override, or disable (--no-ambient).- Bundled loops (zero cost, instant):
forest,rain,ocean,space,underwater,workshop,lab,garden. Stored in repoambient-loops/dir. - ElevenLabs generation (
--generate-ambient): For vibes not covered by bundled loops. Costs ~$0.04 per 30s clip. - Save as
AMBIENT_CATEGORY(ornone).
- Bundled loops (zero cost, instant):
Save variables: TOPIC, CLASS, NARRATION_LANG, CHAPTER_SOURCE, STYLE, CHARACTER_MODE, DURATION_SEC, ASPECT, AMBIENT_CATEGORY
Create output folder:
# Load env for OUTPUT_BASE_DIR (and API keys)
set -a; source "__PLUGIN_DIR__/.env" 2>/dev/null; set +a
BASE_DIR="${OUTPUT_BASE_DIR:-$PWD}"
SLUG=$(echo "{TOPIC}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | tr -cd 'a-z0-9-')
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT_DIR="${BASE_DIR}/${SLUG}-${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR"/{images,clips,clips-transition,audio,prompts,characters}
Phase 2 — Write Video Brief
If chapter source provided: Read it first. Use the textbook's exact definitions and terminology.
Script guidelines:
- Each keyframe = 8 seconds. Total =
DURATION_SEC / 8keyframes. - Each narration segment =
18-22 words (150 WPM for 8s). - Structure: Hook -> Core explanation -> Key facts -> Summary.
- Character descriptions must be identical verbatim in every prompt.
- Auto-insert audio tags for
eleven_v3(seereferences/audio-tags.mdfor tag reference).
For prompt construction details: Read references/prompting.md
Present brief as:
- Keyframe table: #, Timestamp, Scene Description, Narration, Visual Notes, Text, Transition, Sound Cue, Duration
- Scene wireframe diagrams (ASCII art showing composition)
Save to $OUTPUT_DIR/script.md. Ask for approval before proceeding.
Phase 2.1 — Character Sheets (if CHARACTER_MODE != none)
GEMINI_API_KEY="$GEMINI_API_KEY" node __PLUGIN_DIR__/scripts/generate-character-sheet.mjs \
--name "{NAME}" --description "{DESC}" --style "{STYLE}" --type both \
--output "{OUTPUT_DIR}/characters/{name}" --aspect "{ASPECT}"
Generates: poses sheet, expressions sheet, recreation prompt. Wait 35s between sheets. Review with user. Use pose sheet as --reference in all subsequent image prompts.
GATE:
python3 __PLUGIN_DIR__/scripts/checkpoint.py --phase 2 --output-dir "{OUTPUT_DIR}"
Phase 2.5 — Audio Timeline
Generate the entire narration at once with word-level timestamps. The clip count is locked at the end of this phase.
Step 2.5a — Choose voice + settings. Default: Anika (ecp3DWciuUyW7BYM7II1), eleven_v3, stability 0.5, speed 0.98.
Step 2.5b — Pronunciation prep (MANDATORY for Hindi/Hinglish):
- Read
references/audio-tags.md— load the Devanagari substitution table - Scan the full narration for ALL Hindi words with retroflex, aspirated, or nasal sounds
- Replace romanized Hindi with Devanagari inline (e.g.
kapde→कपड़े). Keep English words in Roman. - For English technical terms that ElevenLabs mispronounces, use a pronunciation dictionary:
- Create one via ElevenLabs dashboard or API (
POST /v1/pronunciation-dictionaries/add-from-rules) - Pass
--dict-id {ID}to the script - Do NOT use dictionaries for Hindi — they make pronunciation worse. Devanagari embedding is the fix.
- Create one via ElevenLabs dashboard or API (
- Present the processed narration text to the user for approval before generating.
Step 2.5c — Generate full VO + timeline:
ELEVENLABS_API_KEY="$ELEVENLABS_API_KEY" node __PLUGIN_DIR__/scripts/generate-audio-timeline.mjs \
--text "{FULL_NARRATION_WITH_DEVANAGARI}" \
--output-dir "{OUTPUT_DIR}/audio" \
--voice "ecp3DWciuUyW7BYM7II1" --model eleven_v3 \
--stability 0.5 --speed 0.98 --language hi \
--min-clip 5 --max-clip 8
# Optional: --dict-id {ID} --dict-version {VER} for English pronunciation fixes
# Optional: --text-normalization on|off|auto (default: auto)
Step 2.5d — If AMBIENT_CATEGORY != none, add ambient config to timeline.json:
{
"ambient": {
"category": "{AMBIENT_CATEGORY}",
"volume": 0.15,
"source": "bundled",
"path": "${PWD}/ambient-loops/{AMBIENT_CATEGORY}.mp3"
}
}
If bundled loop doesn't exist for the category, generate one:
ELEVENLABS_API_KEY="$ELEVENLABS_API_KEY" node __PLUGIN_DIR__/scripts/generate-ambient.mjs \
--prompt "{vibe description}, seamless loop" --duration 30 \
--output "{OUTPUT_DIR}/audio/ambient-generated.mp3"
Then set "source": "generated" and "path" to the generated file.
Step 2.5e — Present timeline as readable table. Ask for edits (merge, split, adjust).
Step 2.5f — Validate after edits:
node __PLUGIN_DIR__/scripts/rebuild-timeline.mjs --timeline "{OUTPUT_DIR}/audio/timeline.json"
Step 2.5g — Slice audio:
node __PLUGIN_DIR__/scripts/slice-audio.mjs \
--timeline "{OUTPUT_DIR}/audio/timeline.json" \
--audio "{OUTPUT_DIR}/audio/full-vo.mp3" \
--output-dir "{OUTPUT_DIR}/audio"
Step 2.5h — Flag VO overflow clips (VO > 8s) for AC+TC split in Phase 4.
GATE:
python3 __PLUGIN_DIR__/scripts/checkpoint.py --phase 2.5 --output-dir "{OUTPUT_DIR}"
Phase 2.7 — Strategy Selection
Present both options with recommendation:
Option A: AC + TC — Each clip gets its own keyframe. Best for distinct scene changes. Pre-plan transitions from VO gap analysis.
Option B: Veo Extend Chain — Chain extensions from first clip. Best for continuous narrative in one environment. Risk: character drift after 3+ extensions.
Save as VIDEO_STRATEGY = "ac_tc" or "extend".
If AC+TC: pre-plan transition types per junction (hard cut / crossfade / slow zoom / Veo TC) based on VO gaps from timeline.json.
Phase 3 — Generate Keyframe Images
If VIDEO_STRATEGY = "extend": Generate only frame-01, skip rest.
Image count = timeline.total_clips. Wait 35s between Gemini calls.
For prompt templates and rules: Read references/prompting.md
For each clip in timeline.json:
- Build image prompt — character description (verbatim) + scene + style descriptors + TEXT RULES anti-prompt. Remember BEFORE-state for action scenes.
- Save prompt to
prompts/frame-{NN}_prompt.md - Generate image — with
--referenceif character sheets exist - Quality gate — Claude vision review: aspect ratio, character consistency, scene continuity, VO-scene alignment, text contamination. Max 2 retries.
- Display and ask — if user approves + says keep going, skip confirmation on subsequent passing frames.
- Compress all after approval:
sips -Z 1280 frame-{NN}.jpg --out frame-{NN}-small.jpg --setProperty formatOptions 65
GATE:
python3 __PLUGIN_DIR__/scripts/checkpoint.py --phase 3 --output-dir "{OUTPUT_DIR}"
Phase 4 — Generate Video Clips
If VIDEO_STRATEGY = "extend": Use extend-video.py chain instead (generate initial 8s clip, then chain +7s extensions).
Model: veo-3.1-fast-generate-001. Process clips sequentially.
For Veo prompt structure and beat maps: Read references/prompting.md
Audio prompt rewriting when --audio-layers is active: When AMBIENT_CATEGORY != none, Veo clips must produce SFX only — no ambient/atmospheric sounds. For each clip's --audio-prompt:
- Extract action-specific SFX from the scene (footsteps, splashes, clicks, door creaks, etc.)
- Prefix with
[SFX only: {extracted SFX}] - Append:
No background music, no ambient noise, no atmospheric sounds - Strip any ambient/atmosphere descriptors from the original prompt
For each clip:
- Build word-synced beat map from
full-vo-timestamps.json— rebase to clip-relative time, add +1s anticipation buffer - Build timestamp-structured Veo prompt using
[MM:SS-MM:SS]format - Save prompts to
prompts/clip-{NN}_prompt.md - Generate clip — image-to-video (abstract/none) or text-to-video (human)
- MANDATORY: Run validate-clip.py after EACH clip (also enforced by hook):
If any score < 7: pause, alert operator, wait for decision (accept/regenerate/adjust).python3 __PLUGIN_DIR__/scripts/validate-clip.py \ --clip "{OUTPUT_DIR}/clips/clip-{NN}.mp4" \ --clip-num {NN} \ --timeline "{OUTPUT_DIR}/audio/timeline.json" \ --output-dir "{OUTPUT_DIR}" - Generate transition clips after all ACs — extract last frames, generate TCs at correct Veo duration [2,4s]
For validation details: Read references/validation.md
For error handling: Read references/api-errors.md
Wait 60-90s between Veo calls.
GATE:
python3 __PLUGIN_DIR__/scripts/checkpoint.py --phase 4 --output-dir "{OUTPUT_DIR}"
Phase 5 — Composite & Export
Use composite.py — NOT manual ffmpeg xfade chains. The compositor reads timeline.json, calculates all gaps and overflow automatically, applies transitions, and overlays the VO.
Step 5a — Run compositor:
python3 __PLUGIN_DIR__/scripts/composite.py \
--clips-dir "{OUTPUT_DIR}/clips" \
--timeline "{OUTPUT_DIR}/audio/timeline.json" \
--vo-audio "{OUTPUT_DIR}/audio/full-vo.mp3" \
--output "{OUTPUT_DIR}/final.mp4" \
--veo-tcs-dir "{OUTPUT_DIR}/clips-transition" \
--sfx-volume 0.35 \
--ambient "{AMBIENT_PATH}" --ambient-volume 0.15
Omit --ambient if AMBIENT_CATEGORY == none. The ambient path comes from timeline.json.ambient.path.
Audio stack in final video:
| Layer | Source | Volume |
|---|---|---|
| 1 — VO | ElevenLabs full-vo.mp3 | 100% |
| 2 — SFX | Veo 3.1 native audio per clip | 35% |
| 3 — Ambient | Looped to video length | 15% |
For transition details: Read references/transitions.md
Step 5b — Validate final video:
python3 __PLUGIN_DIR__/scripts/validate-final.py \
--video "{OUTPUT_DIR}/final.mp4" \
--timeline "{OUTPUT_DIR}/audio/timeline.json" \
--output-dir "{OUTPUT_DIR}"
If ship_ready = yes and average score >= 8: proceed. Otherwise: present scores, wait for human decision.
Step 5c — Generate metadata.json with all generation settings for reproducibility.
Step 5d — Launch Timeline Editor:
The editor lets the operator trim clips, adjust VO timing, add transitions, preview the video, and export to professional NLEs (Premiere Pro, DaVinci Resolve, After Effects).
node __PLUGIN_DIR__/editor/start.mjs \
--project "{OUTPUT_DIR}"
This starts a local Next.js app at http://localhost:3333 with a media server at port 3334. The browser opens automatically. The editor provides:
- Video preview with per-clip playback and timeline scrubbing
- Trim & reorder video clips and VO segments on a visual timeline
- Transition editor — apply crossfade, dissolve, wipe, slide, etc. per-clip or globally
- Save Timeline — writes
edited-timeline.jsonto the project folder - Render MP4 — re-renders from the edited timeline using ffmpeg
- Export Premiere XML — FCP7 XML format (works in Premiere Pro, DaVinci Resolve, Final Cut Pro) →
{OUTPUT_DIR}/export/project.xml - Export AE Script — After Effects ExtendScript (.jsx) →
{OUTPUT_DIR}/export/project.jsx
After the editor launches, also open the project folder:
open "{OUTPUT_DIR}"
Report: output folder, script, characters, prompts, voiceover, frames, clips, final video, metadata, editor URL.
GATE:
python3 __PLUGIN_DIR__/scripts/checkpoint.py --phase 5 --output-dir "{OUTPUT_DIR}"
Phase 6 — Post-Pipeline Options
Optional enhancements after video is complete:
NLE Export (from Timeline Editor):
If the operator made edits in the Timeline Editor (Step 5d), they can export directly from the editor UI. If the editor is not running, re-launch it:
node __PLUGIN_DIR__/editor/start.mjs \
--project "{OUTPUT_DIR}"
| Export | Button | Output | Compatible With |
|---|---|---|---|
| Premiere XML | "Export Premiere XML" | export/project.xml |
Premiere Pro, DaVinci Resolve, Final Cut Pro |
| After Effects | "Export AE Script" | export/project.jsx |
After Effects (File → Scripts → Run Script) |
| Re-render MP4 | "Render MP4" | final-edited.mp4 |
Standalone playback |
The XML export uses FCP7 XML format — the universal interchange format supported by all major NLEs. It preserves clip positions, trim points, transitions (as cross dissolves), and VO track placement with absolute file paths.
The AE script creates a composition with all clips and VO segments at their correct timeline positions, including opacity keyframes for cross-dissolve transitions.
Print-ready images:
node __PLUGIN_DIR__/scripts/enhance-for-print.mjs \
--dir "{OUTPUT_DIR}/images" --output-dir "{OUTPUT_DIR}/images/print" --dpi 300 --format tiff
For RGB proofs: add --skip-cmyk --format jpeg.
Image extensions for text overlays:
GEMINI_API_KEY="$GEMINI_API_KEY" node __PLUGIN_DIR__/scripts/extend-image.mjs \
--input "{OUTPUT_DIR}/images/frame-{NN}.jpg" \
--output "{OUTPUT_DIR}/images/frame-{NN}-extended.jpg" \
--direction "{direction}" --extend-by 30 --style "{style}"
Directions: left, right, top, bottom, corners, all. Styles: gradient, flat, blur. Wait 35s between calls.
Error Handling
For the full error table, face-blocking workarounds, rate limits, and cost reference: Read references/api-errors.md
Quick reference for the most common issues:
- Gemini 429 -> wait 60s, retry
- ElevenLabs 401 -> check
ELEVENLABS_API_KEY - Veo 403 ->
gcloud auth application-default login - Veo face-blocking (17301594) -> use text-to-video mode (omit
--image) - Veo word restriction (58061214) -> replace "girl/boy/child" with "character/animated figure"
- ffmpeg not found ->
brew install ffmpeg