name: explainer
description: >-
Turn a topic (or source document) into a visually dynamic HTML explainer deck
and a narrated vertical video, end-to-end, using only local/free tools (Kokoro
TTS, torchaudio forced alignment, Playwright, ffmpeg) plus this Claude session.
Use when the user wants to "make an explainer video", "turn this into a Short/
Reel/TikTok", "create an explainer deck", or "/explainer ". Supports
topic-only OR source-driven (ingest a PDF/URL and frame a real figure/screenshot);
aspects 9:16, 16:9, 4:5; fixed theme. Generation only — it writes a labeled output
dir + manifest.json; it does NOT post to social platforms.
/explainer
Generate an explainer deck + narrated 9:16 video from a topic. You (Claude) do the generation stages (research, scripting, deck authoring); a pure-Python pipeline does the deterministic media stages (narrate → align → render → mux). No paid APIs.
Architecture rule (do not violate)
You only author structured JSON (script.json, deck.json, optional meta.json) and
wiki nodes. You never write raw deck HTML — the deck engine renders from deck.json,
which preserves the determinism contract (PRD §8.6). The media pipeline makes zero LLM
calls; once the JSON exists, it renders unattended.
Environment
- Package: the cloned explainer-system repo (run explainer from there).
- Interpreter: a Python venv with Kokoro / torch / Playwright / ffmpeg installed.
- Console command (editable-installed into that venv): explainer (on PATH once the venv is active). If unavailable, use PYTHONPATH=<repo>/src python -m explainer.cli.
- Shell discipline (CLAUDE.md): the media command is synchronous. Run it in the foreground and let it finish (~50s for a ~20s video). No polling, no backgrounding.
Steps
1. Intake
Parse the topic and any flags (default aspect 9:16, fps 30, voice af_heart). Unless the
user said --yes, briefly confirm: the angle/hook, target length, and aspect.
One cheap confirmation, then proceed.
2. Scaffold
explainer scaffold "<slug>" --title "<title>" [--aspect 9:16] [--theme midnight] [--brand ACME] [--voice-source operator]
Themes (a family of looks — vary them across a channel, PRD §8.5): midnight (default,
cool dark), paper (light), sunset (warm dark), forest (green dark), mono (yellow on
near-black). Each carries a default motion personality.
This creates outputs/<date>_<slug>/project.json and prints the project dir. Use that dir
for everything below.
2b. Ingest source material (source-driven runs only)
If the user gave a PDF or URL, ingest it into sources/ + citations.json:
explainer ingest <project_dir> --pdf <path> [--pages "1-3,5"]
explainer ingest <project_dir> --url <url> [--full-page]
This extracts text and renders framed screenshots/figures. To feature a specific
figure (not a whole page), it's fine to render a tight clip with PyMuPDF
(page.get_pixmap(matrix=fitz.Matrix(3,3), clip=fitz.Rect(...))) into sources/.
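A minimal sketch of that tight-clip render, assuming PyMuPDF is installed in the venv. The function names, the 1-based page number, and the output path are illustrative choices, not part of the explainer CLI; only the get_pixmap call mirrors the line above.

```python
from pathlib import Path


def scaled_size(x0, y0, x1, y1, zoom=3.0):
    """Approximate pixel size of a clip rendered at the given zoom factor."""
    return round((x1 - x0) * zoom), round((y1 - y0) * zoom)


def render_clip(pdf_path, page_number, rect, out_path, zoom=3.0):
    """Render one rectangular region of a PDF page to a PNG.

    page_number is 1-based (an assumption of this sketch);
    rect is (x0, y0, x1, y1) in PDF points.
    """
    import fitz  # PyMuPDF, imported lazily so scaled_size works without it

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with fitz.open(pdf_path) as doc:
        page = doc[page_number - 1]
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(*rect))
        pix.save(str(out_path))
    return out_path
```

For example, render_clip("paper.pdf", 2, (72, 120, 540, 420), "sources/fig2.png") would write a roughly 1404×900 px image that a figure slide can reference as sources/fig2.png.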
Read the extracted text to ground the script; reference an image in a figure slide (below).
3. Research (+ wiki)
- First reuse prior knowledge: read wiki/INDEX.md and any relevant wiki/source-fact/* nodes so you don't re-research what's already captured.
- Then use WebSearch/WebFetch to gather and verify current facts. Prefer primary sources.
- Capture what you learn as wiki nodes (provenance compounds across videos):
explainer wiki source "<source title>" --root . --topic "<topic>" --ref "<url-or-path>"
explainer wiki fact "<short fact name>" --root . --topic "<topic>" \
  --body "<the atomized claim>" --source "<source slug>" --confidence high
- Every claim that ends up on screen or in narration should trace to a fact you can cite.
4. Author script.json + deck.json
Pick a hook archetype for slide 1 (bold claim · question · surprising stat · "you've been doing X wrong" · visual reveal). The first slide must front-load the payoff — no title-card throat-clearing.
script.json — narration per slide (the slide field is the slide id, matched in deck.json):
{ "segments": [
{ "id": 0, "slide": "s1", "text": "<hook line — spoken>" },
{ "id": 1, "slide": "s2", "text": "<...>" }
] }
Write acronyms naturally ("MCP", "AI", "GPT-4") — the pronunciation lexicon speaks them
as letters/words while captions still show the acronym. Add a <project>/lexicon.json
({"token": "spoken form"}) for any term the default lexicon misses. Spell out numbers you
want read a certain way (e.g. "ninety seven million").
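To make the lexicon idea concrete, here is a hypothetical sketch of how a {"token": "spoken form"} map could be applied to narration while the caption text stays untouched. The function name and the case-sensitive whole-token matching are assumptions of this sketch; the pipeline's actual lexicon logic may differ.

```python
import re


def spoken_form(text, lexicon):
    """Return the text as it would be sent to TTS; the caption keeps `text`."""
    if not lexicon:
        return text
    # Longest tokens first so "GPT-4" wins over a shorter overlapping token.
    tokens = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in tokens) + r")(?!\w)"
    )
    return pattern.sub(lambda m: lexicon[m.group(1)], text)


caption = "MCP beats GPT-4"
lexicon = {"MCP": "em see pee", "GPT-4": "gee pee tee four"}
narration = spoken_form(caption, lexicon)
```

The lookarounds keep a token from matching inside a longer word, so "MCPs" is left alone unless it has its own entry.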
deck.json — one slide per id. Pick the device that proves each point — the full
catalog (schemas + when-to-use) is docs/visual-devices.md.
Devices: narrative (hook · statement · build · reframe · highlight · punch ·
define · list · payoff · quote), data (stat · ring · statgrid · progress ·
pictograph · trend · delta · diagram · ranked · compare · timeline ·
waterfall · matrix · steps), source (figure), brand (cta, auto).
{ "title": "<deck title>", "slides": [
{ "id": "s1", "type": "hook", "kicker": "<label>", "headline": "<text>", "accent": ["word"] },
{ "id": "s2", "type": "stat", "kicker": "<label>", "value": "90%", "label": "<context>" },
{ "id": "s3", "type": "statgrid", "stats": [ { "value": "$2M", "label": "<a>" }, { "value": "3x", "label": "<b>" } ] },
{ "id": "s4", "type": "progress", "value": "73%", "label": "<context>" },
{ "id": "s5", "type": "diagram", "kicker": "<label>",
"bars": [ { "label": "<a>", "value": 0.9, "kind": "good" }, { "label": "<b>", "value": 0.3, "kind": "bad" } ] },
{ "id": "s6", "type": "compare", "left": { "title": "<a>", "value": "<x>", "kind": "bad" },
"right": { "title": "<b>", "value": "<y>", "kind": "good" } },
{ "id": "s7", "type": "steps", "steps": [ { "title": "<step 1>" }, { "title": "<step 2>" } ] },
{ "id": "s8", "type": "quote", "quote": "<verbatim line>", "attribution": "<who>" },
{ "id": "s9", "type": "figure", "kicker": "<source>", "image": "sources/<file>.png", "caption": "<desc>" },
{ "id": "s10", "type": "payoff", "headline": "<text>", "accent": ["word"], "subkicker": "<a · b · c>" }
] }
More devices (same slides array — see docs/visual-devices.md
for when-to-use + every field):
{ "id": "n1", "type": "build", "headline": "<text>", "accent2": ["word"] },
{ "id": "n2", "type": "reframe", "before": "It's not about", "strike": "luck", "after": "timing" },
{ "id": "n3", "type": "highlight", "headline": "<text>", "mark": ["key", "phrase"] },
{ "id": "n4", "type": "punch", "word": "Ship.", "kind": "good" },
{ "id": "n5", "type": "define", "term": "<term>", "definition": "<gloss>" },
{ "id": "n6", "type": "list", "kicker": "3 truths", "items": ["<a>", "<b>", "<c>"] },
{ "id": "d1", "type": "ring", "value": "73%", "label": "<context>", "kind": "bad" },
{ "id": "d2", "type": "pictograph", "filled": 9, "total": 10, "label": "<context>", "kind": "bad" },
{ "id": "d3", "type": "trend", "kicker": "MRR", "points": [2,3,5,8,13], "end_label": "$13k", "kind": "good" },
{ "id": "d4", "type": "delta", "from": "$10k", "to": "$40k", "change": "+300%", "kind": "good" },
{ "id": "d5", "type": "ranked", "bars": [ { "label": "<a>", "value": 0.9, "display": "90%", "kind": "bad" } ] },
{ "id": "d6", "type": "timeline", "events": [ { "date": "Jan", "label": "<a>" }, { "date": "Sep", "label": "<b>" } ] },
{ "id": "d7", "type": "waterfall", "start": { "label": "Q1", "value": 40 },
"steps": [ { "label": "Churn", "value": -12, "kind": "bad" } ], "end": { "label": "Q2", "value": 58 } },
{ "id": "d8", "type": "matrix", "x_axis": ["Low effort","High effort"], "y_axis": ["Low impact","High impact"],
"points": [ { "label": "Ship", "x": 0.2, "y": 0.85, "kind": "good" } ] }
stat/statgrid/delta count their numbers up; progress/diagram/ranked bars grow;
ring/trend arcs + lines draw on; steps/statgrid/list/timeline reveal in sequence —
all driven deterministically by renderAt(t). figure frames an ingested screenshot (white
card); image is relative to the project root. waterfall/matrix need ≥ 5s dwell — the
engine auto-fits their columns/labels to the aspect, but they carry the most to parse, so favor
≤ 4 steps/points with short labels (see the doc).
Vary your devices (anti-monotony). Across a deck use 5+ distinct device types, and
never repeat the same data device unless the data demands it — a deck that reuses one bar
chart reads as templated. Use quote for verbatim talk-time one-liners; use stat when a
single number carries the point.
Each slide may set "transition" (rise · fade · pop · slide) to override the
theme's default intro motion. Vary it across slides — don't repeat the same transition
on every slide (the §8.4 anti-repetition rule); repetition reads as "templated".
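The variety guidance above can be checked before rendering. The following is a rough sketch of such a lint, not something the tool ships: the function name, the warning wording, and capping the 5-type target at the slide count for short decks are all assumptions.

```python
DATA_DEVICES = {
    "stat", "ring", "statgrid", "progress", "pictograph", "trend", "delta",
    "diagram", "ranked", "compare", "timeline", "waterfall", "matrix", "steps",
}


def lint_variety(deck, min_types=5):
    """Return advisory warnings about device and transition repetition."""
    slides = deck.get("slides", [])
    types = [s.get("type") for s in slides]
    warnings = []

    # A short deck cannot reach 5 types, so cap the target at the slide count.
    target = min(min_types, len(slides))
    if len(set(types)) < target:
        warnings.append(
            f"only {len(set(types))} distinct device types (target {target})"
        )

    # Flag each repeat of a data device; narrative devices may repeat freely.
    seen = set()
    for t in types:
        if t in DATA_DEVICES and t in seen:
            warnings.append(f"data device '{t}' is repeated")
        seen.add(t)

    # Only flags the case where every slide explicitly sets the same transition.
    transitions = [s.get("transition") for s in slides]
    if len(slides) > 1 and all(transitions) and len(set(transitions)) == 1:
        warnings.append(f"every slide uses the '{transitions[0]}' transition")
    return warnings
```

A repeated data device is only a warning here because the rule allows repeats when the data demands it.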
Rules: every slide has motion by construction; accent/accent2 highlight words by the
theme colors; keep headlines tight (they auto-shrink past ~60 chars). Keep ids identical
across the two files. Aim for 4–6 slides for a ~20–40s Short.
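Since the slide field in script.json must match a slide id in deck.json, a quick cross-check before rendering can catch drift between the two files. This is an illustrative sketch; the function name and the return shape are not part of the tool.

```python
def id_mismatches(script, deck):
    """Compare slide ids referenced by script.json against those in deck.json."""
    script_ids = {seg["slide"] for seg in script.get("segments", [])}
    deck_ids = {slide["id"] for slide in deck.get("slides", [])}
    return {
        "missing_in_deck": sorted(script_ids - deck_ids),
        "missing_in_script": sorted(deck_ids - script_ids),
    }


script = {"segments": [{"id": 0, "slide": "s1", "text": "hook"},
                       {"id": 1, "slide": "s2", "text": "proof"}]}
deck = {"title": "demo", "slides": [{"id": "s1", "type": "hook"},
                                    {"id": "s3", "type": "stat"}]}
report = id_mismatches(script, deck)
```

Both lists should be empty before running the media pipeline.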
4a. talk-time READ — write the script in the operator's real voice
If the brand carries a talk_time block ({tag, library?}), don't write generic AI prose
— ground the script in the operator's own documented takes, stories, and quotes (a
private "talk-time library" the operator maintains; it is not bundled with this tool). This
is independent of who speaks (Kokoro or operator VO): it shapes the words. If the
brand has no talk_time block, skip this step and write the script normally.
Surface the relevant material (read-only — never writes, never fabricates):
explainer talktime --brand <SLUG> [--topics "keyword,keyword"]
The library path comes from the brand's talk_time.library, a --library flag, or the
EXPLAINER_TALKTIME_LIBRARY env var. It parses that library's INDEX.md, filters entries
by the brand tag (+ optional topic keywords), and prints candidate quotes / positions /
anecdotes / topics with absolute paths. Then:
- Read the candidate files you'll draw on (Read tool, the printed abspaths).
- Quote VERBATIM from
quotes.md— use the exact one-liners as spoken lines. - ADAPT freely from
positions/andanecdotes/— paraphrase the reasoning/story into tight script prose. - NEVER fabricate a take, stat, or story not in the library. If the library is thin on the topic, say so and write only what the source material supports (or narrow the angle).
- Watch the per-entry brand tags + any
⚠notes in INDEX.md (e.g. "overused — rotate", NDA cautions) — honor them.
Use --topics to narrow to the deck's subject; omit it to see everything tagged for the
brand. This applies to both the daily (Kokoro) and weekly (operator VO) tiers.
5. (optional) meta.json for the manifest
Author meta.json with a summary + per-platform captions so the downstream poster has what
it needs (this tool still does NOT post):
{ "summary": "<1-2 sentences>",
"per_platform": [
{ "platform": "tiktok", "caption": "<hook-first caption>", "hashtags": ["#ai","#rag"],
"link_placement": "none", "primary_asset": "video", "aspect": "9:16" }
],
"sources": ["<url>", "<url>"] }
6. Render (pure-Python, synchronous)
explainer media outputs/<date>_<slug>
Runs narrate → align → deck → render → mux → manifest → qa and writes results.json. If
a stage fails, it prints failed_stage; re-run a single stage with explainer <stage> <dir>.
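One way to turn a failed run into the single-stage re-run described above. This sketch assumes results.json sits in the project dir and exposes a top-level "failed_stage" key; neither is confirmed here, so check the real file before relying on it. Both helper names are hypothetical.

```python
import json
from pathlib import Path


def rerun_command(results, project_dir):
    """Build the `explainer <stage> <dir>` argv for a failed stage, else None."""
    stage = results.get("failed_stage")  # assumed key name
    if not stage:
        return None
    return ["explainer", stage, str(project_dir)]


def rerun_command_from_file(project_dir):
    """Read results.json (assumed to live in the project dir) and build the argv."""
    path = Path(project_dir) / "results.json"
    if not path.exists():
        return None
    return rerun_command(json.loads(path.read_text(encoding="utf-8")), project_dir)
```

The returned list can be passed to subprocess.run in the foreground, in line with the shell discipline above.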
The qa stage (motion/pacing) reports warnings in work/qa.json: visual dead air during
speech (held frames while narrating — add motion or split the shot), over-long shots, and
uniform cut rhythm. Read the warnings; if dead air is high, tighten pacing or split slides
and re-render. Warnings are advisory, not fatal.
Decks include a subtle drifting ambient glow by default (keeps motion alive between word
highlights → near-zero dead air). It roughly doubles render time (compositing the glow
layer); set "ambient": false in project.json for ~2× faster renders when speed matters.
7. Report
Tell the user the output dir and the key artifacts:
- deck/index.html (standalone, openable deck)
- video/explainer_9x16.mp4
- captions/captions.srt / .vtt
- manifest.json (ready_for_post, AI-disclosure, per-platform captions)
Spot-check one rendered frame in work/frames/ to confirm layout/legibility before declaring done.
Aspects, platforms & length
- --aspect 9:16|16:9|4:5|1:1, or render several at once with --aspects "9:16,1:1" (one project → one MP4 per aspect; layout is robust across aspects).
- --platform <tiktok|reels|shorts|threads|linkedin|youtube|square> sets the aspect + a safe-zone bottom inset (captions clear the platform's UI chrome) and, where relevant, a default min length (e.g. tiktok ⇒ 60s).
- --min-length <seconds>: if the rendered narration is shorter, the manifest gets a length_warning and ready_for_post:false. Meet it by deepening the script with a sourced beat (a new example / fact), never by padding (PRD §7), then re-render.
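The min-length rule can be pictured as a small manifest update. In this sketch the field names length_warning and ready_for_post come from the text above, while the function name and the warning's wording are assumptions.

```python
def apply_min_length(manifest, narration_seconds, min_length):
    """Return a copy of the manifest with the min-length rule applied."""
    out = dict(manifest)
    if min_length is not None and narration_seconds < min_length:
        out["length_warning"] = (
            f"narration is {narration_seconds:.1f}s, "
            f"below the {min_length:.0f}s minimum"
        )
        out["ready_for_post"] = False
    return out


short = apply_min_length({"ready_for_post": True}, 42.0, 60)
long_enough = apply_min_length({"ready_for_post": True}, 61.0, 60)
```

A short result is the cue to add a sourced beat to script.json and re-render, not to pad.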
8. Validate + hand off (boundary stops here)
- explainer validate <dir>: confirm the manifest is a complete, consistent handoff contract (videos exist, captions present, per-platform aspects rendered, disclosure set).
- explainer handoff <dir>: emit handoff.json, the per-platform blotato-ready post specs (absolute media_file, composed text, title for YouTube, ai_label). A poster (the blotato-crosspost skill) consumes it: upload media_file → create_post per entry. This tool never posts. The ai_disclosure block maps to the poster's AI toggle (e.g. TikTok's isAiGenerated); keep it set so publishes are compliant.
Optional: music bed
Set "music": "<path>" (and optionally "music_gain": 0.16) in project.json to mix a
low royalty-free bed under the narration (recommended for 9:16; off for 16:9/deck). No audio
ships with the tool — provide your own vetted, licensed track.
Branding & call-to-action (--brand <SLUG>)
Pass a brand slug to stamp the video with a brand and a CTA. Resolution is local-first
then global: ./brand/<SLUG>/ (the content project you run from) → $EXPLAINER_BRAND_DIR/
→ ~/.claude/explainer-brands/<SLUG>/. A brand folder holds brand.json + assets:
{ "name": "ACME Co",
"logo": "logo.png", // transparent PNG — small corner watermark on EVERY slide + larger on the CTA
"product": "product.png", // optional — e.g. a book cover, shown on the CTA slide
"watermark_corner": "bl", // bl | br
"accent": "#5b8cff", // optional — tints the theme accent to brand color
"lexicon": { "acme.example": "ACME dot example" }, // optional — brand-specific pronunciations
"talk_time": { "tag": "<brand-tag>", "library": "/abs/path/to/library" }, // optional — operator's private take-library (step 4a)
"cta": { "headline": "Get the thing.", "subkicker": "Out now",
"url": "acme.example",
"spoken": "Check out ACME — link in bio." } }
When --brand is set: assets are copied into the output dir (self-contained), the logo
watermarks every slide in the safe-zone corner, and a CTA end slide is auto-appended
(product + larger logo + headline/subkicker/url) with the cta.spoken line auto-narrated
and synced. You don't author the CTA slide/segment — the pipeline adds them from the brand
(author your own {"id":"cta","type":"cta"} slide / slide:"cta" segment only to override).
The url is on-screen text only — the tool still never links out or posts.
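The local-first-then-global brand lookup described at the top of this section can be sketched as an ordered candidate list. Assumptions in this sketch: the helper names are made up, and $EXPLAINER_BRAND_DIR is treated as a parent folder holding one subfolder per slug, which the text above does not state.

```python
import os
from pathlib import Path


def brand_candidates(slug, cwd=None, env=None, home=None):
    """List brand folders in lookup order: local, env override, global."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    env = os.environ if env is None else env
    home = Path(home) if home is not None else Path.home()

    candidates = [cwd / "brand" / slug]
    env_dir = env.get("EXPLAINER_BRAND_DIR")
    if env_dir:
        # Assumption: the env dir contains per-slug subfolders.
        candidates.append(Path(env_dir) / slug)
    candidates.append(home / ".claude" / "explainer-brands" / slug)
    return candidates


def resolve_brand(slug, is_dir=Path.is_dir, **kwargs):
    """Return the first candidate that exists, or None if none do."""
    for candidate in brand_candidates(slug, **kwargs):
        if is_dir(candidate):
            return candidate
    return None
```

The is_dir parameter exists only so the lookup order can be exercised without touching the filesystem.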
Rotating CTAs (--cta <variant>). A brand folder may hold a hand-editable
cta_library.json with named variants (e.g. book, newsletter), each with its own
headline/subkicker/url/spoken + optional product image. explainer scaffold … --brand FFW --cta newsletter picks one; with no --cta, the library default is used, falling back to the brand.json cta.
This lets the operator maintain rotating CTAs by hand without touching code, and a
routine just passes the variant it chose.
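The selection order (explicit variant, then library default, then the brand.json cta) might look like the sketch below. The cta_library.json shape used here, a "default" name plus a "variants" map, is purely an assumption for illustration; inspect the real file for its actual layout.

```python
def pick_cta(brand, library=None, variant=None):
    """Choose a CTA: explicit variant, then library default, then brand.json."""
    variants = (library or {}).get("variants", {})  # assumed layout
    if variant is not None:
        if variant not in variants:
            raise KeyError(f"unknown CTA variant: {variant}")
        return variants[variant]
    default_name = (library or {}).get("default")  # assumed key
    if default_name in variants:
        return variants[default_name]
    return brand.get("cta")


brand = {"name": "ACME Co", "cta": {"headline": "Get the thing."}}
library = {
    "default": "book",
    "variants": {
        "book": {"headline": "Read the book."},
        "newsletter": {"headline": "Join the list."},
    },
}
```

Raising on an unknown variant is a design choice of this sketch, so that a typo in --cta does not silently fall back to the default.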
Voiceover mode — the operator's real voice (--voice-source operator)
For higher-production pieces, the narration can be the operator reading the script, instead of Kokoro. The pipeline is audio-first, so a real recording aligns to the slides exactly like Kokoro does. Be a coach through this loop — keep it seamless, no external apps:
- Scaffold with --voice-source operator (everything else is the same).
- Author script.json + deck.json as usual. Offer a quick review checkpoint: let the operator tweak the script before recording (re-recording is the costly step).
- Record (integrated). Run explainer record <project_dir>, but launch it in the background so the conversation keeps moving. It opens a browser teleprompter that records the mic per segment (record / re-record / playback), saving straight into voiceover/. Tell the operator: "Teleprompter's open. Read each segment, re-record any you flub, then click Finish." The command returns when they click Finish; its result lists recorded/missing segment ids.
- If any segment is missing, tell them which and re-launch record (already-recorded segments stay ✓). Don't proceed until all are recorded.
- Render. Run explainer media <project_dir>. The operator-narrate path assembles the clips, runs them through the local audio-cleanup skill (podcast-grade, −14 LUFS), aligns, and renders.
- Report the finished video and ask "what do you think?"
The tool can also run fully end-to-end with Kokoro (--voice-source kokoro, default) — give it a
topic and it returns a finished video. Voiceover mode is the interactive, co-built tier.
Out of scope (current phase)
The following are deferred to later phases (see PRD): music beat-sync, operator --interview
voice capture, C2PA embedding (needs c2patool + a signing cert; disclosure is currently
carried in the manifest + poster AI toggle), automatic min-length deepening (you do it), and
layout variants within a template. Don't fake them.