name: manim-video-with-tts-ark-plan description: "【Ark Agent Plan 专用版本】Manim 数学/算法讲解视频完整流水线,使用火山引擎 TTS 中文旁白(与 Seedream/Seedance 共享认证)。Plan → TTS → Code → Render → Stitch → Deliver. 适用于:Manim 动画 + 中文配音、音画同步讲解视频、3Blue1Brown 风格教学视频。" version: 1.0.0 metadata: hermes: tags: [manim, animation, video, volcengine, tts, seedream, seedance, math, ark-plan] related_skills: [epub2podcast, feishu-seedance-video-pipeline, vocabulary-video-pipeline]
Manim + Volcengine TTS Showcase Pipeline
Trigger Conditions
Use this skill when the user requests:
- Manim animation with Chinese voiceover / narration
- Math/tech explainer video where audio timing must match visuals
- "用 Manim 做解释视频,并加上中文旁白"
- "确保语音和画面节奏匹配"
- 3Blue1Brown-style video with TTS integration
Pipeline Overview
PLAN → TTS → CODE → DRAFT → PRODUCTION → STITCH → MUX → DELIVER
| Step | Tool | Output |
|---|---|---|
| 1. Plan | Markdown | plan.md — narrative arc, scene breakdown, voiceover script, color palette |
| 2. TTS | Volcengine TTS | audio/s01.mp3 ~ sNN.mp3 + duration manifest |
| 3. Code | Manim CE | script.py — one class per scene, run_times synced to audio |
| 4. Draft | manim -ql |
480p15 preview clips for rhythm verification |
| 5. Production | manim -qh |
1080p60 final scene clips |
| 6. Stitch | ffmpeg concat | video_pro.mp4 — seamless scene assembly |
| 7. Mux | ffmpeg | final.mp4 — video + narration mixed |
| 8. Deliver | lark-cli + web | Feishu Drive link + HTTPS direct link |
Prerequisites
- Python 3.10+, Manim Community v0.20+
- ffmpeg
- ✅ Ark Agent Plan 统一认证:火山引擎 TTS 凭证(与 Seedream/Seedance 共享
VOLCENGINE_ACCESS_TOKEN,无需额外配置) - lark-cli for Feishu upload
- Web server directory
/var/www/hermes.aigc.green/media/
Step 1: Plan
Write plan.md before any code. Include:
- Narrative Arc: Core question → Aha moment → Emotional curve
- Scene Breakdown: One table row per scene with timing, visuals, animation type
- Voiceover Script: Chinese text per scene, with estimated character count
- Color Palette: Background, target curve, approximation, text, accent, grid
- Typography: Monospace font only (FreeMono/DejaVu Sans Mono), minimum size 18
Rule: Scene count = audio segment count. Each scene gets exactly one TTS clip.
Step 2: Generate TTS & Measure Durations
Generate one MP3 per scene using Volcengine TTS:
volc-tts --text '第N段旁白' --voice female --output ./audio/s0N.mp3
Then measure exact durations:
for f in audio/*.mp3; do
dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f")
echo "$f: ${dur}s"
done
Record these durations — they are the master clock for all Manim timing.
Step 3: Write Manim Code
Sync Strategy: Audio-Driven Timing
Since Manim does not natively support Volcengine TTS, use the manual sync approach:
# In each Scene.construct(), the sum of:
# all run_time values + all self.wait() values
# must equal the corresponding audio clip duration.
Cross-Scene Continuity
To avoid rebuilding axes/curves every scene (wasting precious audio time):
class S3_NextScene(Scene):
def construct(self):
self.camera.background_color = BG
axes = Axes(...)
sin_curve = axes.plot(...)
prev_curve = axes.plot(...)
# Carry over from previous scene — instant, costs 0s
self.add(axes, sin_curve, prev_curve)
# Only animate NEW elements
self.play(ReplacementTransform(prev_curve, next_curve), run_time=2.0)
self.wait(...)
# ...
Critical: The last frame of Scene N must match the first frame of Scene N+1 for seamless ffmpeg concatenation. If Scene N fades out an element, Scene N+1 must not self.add() it.
Duration Budgeting Template
For a scene with audio duration D:
D = Σ(run_times) + Σ(waits)
Recommended allocation:
- Title/text appearance: 0.8–1.2s
- Curve drawing/creation: 1.5–2.5s
- Transform/morph: 1.5–2.0s
- Formula writing: 0.8–1.0s
- Indicator/effect: 0.5–0.8s
- FadeOut cleanup: 0.3–0.5s
- Breathing waits: distribute remainder
Color Constants (High-Contrast Showcase)
BG = "#0A0A0F" # deep void black
SIN_COLOR = "#FF4D4D" # coral red — target curve
APPROX_COLOR= "#00F5FF" # electric cyan — approximation
TEXT_COLOR = "#FFFFFF" # pure white
ACCENT = "#FFD93D" # golden yellow
GRID_COLOR = "#2A2A3A" # dim purple-grey
MONO = "FreeMono" # or "DejaVu Sans Mono"
Step 4: Draft Render
manim -ql script.py Scene1 Scene2 Scene3 ...
Verify each clip duration matches audio:
ffprobe -v error -show_entries format=duration -of csv=p=0 media/videos/script/480p15/Scene1.mp4
If mismatch > 0.1s, adjust run_time or self.wait() in code and re-render.
Step 5: Production Render
manim -qh script.py Scene1 Scene2 Scene3 ...
Step 6: Stitch & Mux
# Video concat
cat > concat_video.txt << 'EOF'
file 'media/videos/script/1080p60/Scene1.mp4'
file 'media/videos/script/1080p60/Scene2.mp4'
...
EOF
ffmpeg -y -f concat -safe 0 -i concat_video.txt -c copy video_pro.mp4
# Audio concat
cat > concat_audio.txt << 'EOF'
file 'audio/s01.mp3'
file 'audio/s02.mp3'
...
EOF
ffmpeg -y -f concat -safe 0 -i concat_audio.txt -c copy audio_full.mp3
# Mux
ffmpeg -y -i video_pro.mp4 -i audio_full.mp3 -c:v copy -c:a aac -b:a 192k final.mp4
Step 7: Deliver
Web Server (Primary Link)
cp final.mp4 /var/www/hermes.aigc.green/media/<filename>.mp4
# URL: https://hermes.aigc.green/media/<filename>.mp4
Feishu Drive (Backup)
cd <project-dir>
lark-cli drive +upload --file ./final.mp4 --name "<Title>.mp4"
lark-cli drive metas batch_query --data '{"request_docs":[{"doc_token":"<TOKEN>","doc_type":"file"}],"with_url":true}'
Extract url from response for Feishu direct link.
Performance Targets
| Stage | Resolution | FPS | Typical Speed |
|---|---|---|---|
Draft (-ql) |
854×480 | 15 | 5–15s/scene |
Production (-qh) |
1920×1080 | 60 | 30–120s/scene |
Audience Adaptation: Two Dimensions
When adapting a technical explainer, distinguish two independent dimensions:
- Audience age: Adult → Kids (cognitive complexity)
- Locale: Global English → China-localized (language & cultural context)
These require separate rewrites. Do not try to make one version serve both purposes.
Dimension 1: Age Adaptation (Kids vs Adult)
When adapting for younger audiences (elementary/middle school), do not tweak—rewrite entirely. The kids version requires new plan, new script, new TTS, and new code.
| Aspect | Technical (Adult) | Kids/Student |
|---|---|---|
| Narrative hook | "泰勒展开" / "Taylor Expansion" | "计算器是怎么算出来的?" |
| Central metaphor | Mathematical approximation | Building blocks (积木) |
| Character naming | "目标曲线" / "逼近曲线" | 小红 & 小蓝 (give curves personalities) |
| Pace | Tight, efficient | +20-30% duration, longer pauses between concepts |
| Language | Technical terms (derivative, order, convergence) | Shape, slope, bend, "pretend", "looks like" |
| Voice | Neutral | Female voice (perceived as more approachable) |
| Colors | High-contrast electric | Softer, friendlier palette |
Kids-Optimized Color Palette:
BG = "#0D1117" # deep void (same)
SIN_COLOR = "#FF6B6B" # coral red — warmer, less aggressive
APPROX_COLOR= "#4ECDC4" # mint cyan — softer than electric
TEXT_COLOR = "#F0F6FC" # off-white, easier on eyes
ACCENT = "#FFE66D" # golden yellow — magic/block highlights
GRID_COLOR = "#30363D" # dim grey
MONO = "FreeMono"
Kids Script Guidelines:
- "展开" / "expansion" → "搭积木" (building blocks)
- "阶数" / "order" → "第几块积木" (which block)
- "导数" / "derivative" → "斜率" (slope) or "弯的程度" (how much it bends)
- "收敛" / "convergence" → "越来越像" (more and more alike)
- "假装" (pretend) — the straight line is "pretending" to be the curve
- "骗过眼睛" (fool the eye) — technical accuracy achieved through deception
- "双胞胎" (twins) — when two curves become indistinguishable
- "弯腰" (bend at the waist) — cubic term makes the line "bend"
Timing Adjustments for Kids:
# Adult version
self.wait(0.3) # between phrases
# Kids version
self.wait(0.5) # longer breathing room
Add suspense pauses before reveals: self.wait(1.0) before "Here's the secret..."
Dimension 2: China Localization (中英双语注释)
When the audience is Chinese students, apply the Bilingual Annotation Principle:
Every English mathematical term must appear with its Chinese equivalent on first use.
This applies to both narration and on-screen text.
Bilingual Annotation Rules
| Rule | Example | Bad | Good |
|---|---|---|---|
| Narration: Chinese first, then English in parentheses | sine | "sine" | "正弦(sine)" |
| Narration: Read formulas in Chinese, not symbols | y=x | "y equals x" | "y 等于 x(y=x)" |
| Narration: Describe operations in Chinese | x³/6 | "x cubed over six" | "x 的立方除以六" |
| On-screen labels: 100% Chinese | title | "Math Magic" | "数学的魔法" |
| On-screen labels: Chinese with English sub-label | formula label | "good fit!" | "很像!" |
| Math formulas: Keep standard notation on screen | y=x | Remove formula | Keep y=x on screen, but narration says "y 等于 x" |
Critical: The mathematical notation (MathTex) on screen should never be removed or translated. The formula y = x, x - x³/6, etc. must remain in standard mathematical notation. Only the narration and text labels are localized.
China-Localized On-Screen Text Mapping
| English Label | 中文标签 | Notes |
|---|---|---|
| Math Magic | 数学的魔法 | Title |
| How does a calculator know sine? | 计算器是怎么算出正弦的? | Hook question |
| 小红 | 小红 | Character name (keep simple) |
| Let's pretend with a straight line! | 用直线来假装! | Subtitle |
| Add a magic block! | 加一块魔法积木! | Subtitle |
| good fit! | 很像! | Region label |
| much better! | 更像了! | Region label |
| Another block! | 再加一块! | Subtitle |
| Here's the secret... | 秘密在这里... | Subtitle |
| shape, slope, bend | 形状、斜率、弯曲度 | Concept text |
| Infinite simple blocks | 无限个简单积木 | Closing text |
| building something complex & beautiful | 搭出复杂又美丽的东西 | Closing text |
Three-Version Iteration Pattern
In practice, producing a China-localized kids version requires three iterations:
| Version | Language | Labels | Audience |
|---|---|---|---|
| V1: Adult Technical | English terms, English labels | English | Global adult |
| V2: Generic Kids | English terms, mixed labels | English/Chinese | International kids |
| V3: China-localized Kids | Chinese + bilingual annotations | 100% Chinese | Chinese students |
Workflow for V3:
# 1. Create entirely new plan (do not reuse V1/V2 plan)
cat > plan-kids-cn.md << 'EOF'
## Audience: Chinese Elementary/Early Middle School
## Principle: Bilingual annotation (Chinese first, English in parens)
## Labels: 100% Chinese
## Voice: Female Chinese TTS
EOF
# 2. Write Chinese narration with bilingual annotations
# Example: "你有没有想过,计算器是怎么算出正弦(sine)的?"
# 3. Generate TTS with female voice
volc-tts --text '...' --voice female --output ./audio-kids-cn/s01.mp3
# 4. Create new script file with Chinese labels
# script-kids-cn.py — all Text() objects use Chinese strings
# 5. Render, stitch, deliver separately
Example: China-Localized Script Evolution
| Scene | V1 Adult (EN) | V2 Kids (Mixed) | V3 China Kids (CN+EN) |
|---|---|---|---|
| S1 | "sin x Maclaurin expansion" | "How does a calculator know sine?" | "你有没有想过,计算器是怎么算出**正弦(sine)**的?" |
| S3 | "First order gives tangent y=x" | "y=x fools the eye!" | "y 等于 x(y=x),居然在中心骗过了眼睛!" |
| S4 | "Introduce cubic correction -x³/6" | "Add a block: x³/6" | "数学家加了一块积木:减去 x 的立方除以六" |
| S6 | "Match derivatives at zero" | "Shape and slope match" | "形状、斜率、弯曲程度,都和小红一模一样!" |
Pitfall: Symbol Reading in Chinese TTS
Chinese TTS cannot naturally read mathematical symbols. Never write:
- ❌
volc-tts --text 'x³/6'— TTS will say random characters or fail - ❌
volc-tts --text 'y=x'— TTS may read as "y dengyu x" (awkward)
Always write the spoken Chinese description:
- ✅
volc-tts --text '减去 x 的立方除以六'— natural Chinese - ✅
volc-tts --text 'y 等于 x'— natural Chinese
The screen still shows x - x³/6 and y=x via MathTex. The narration describes them in Chinese. This dual-channel approach (visual: standard notation, audio: Chinese description) is the core of bilingual annotation.
Workflow for Kids Version
# 1. Save adult plan as baseline
cp plan.md plan-adult.md
# 2. Create entirely new plan
cat > plan-kids.md << 'EOF'
# Math Magic: How Calculators Know Sine
## Audience: Elementary/Early Middle School
## Metaphor: Building Blocks
## Characters: 小红 (target), 小蓝 (building)
EOF
# 3. Generate new TTS with female voice
volc-tts --text '...' --voice female --output ./audio-kids/s01.mp3
# 4. Create new script file (do not reuse adult script)
# script-kids.py with adjusted timings and softer colors
# 5. Render, stitch, deliver separately from adult version
Critical Pitfalls
- Font not found: Always use
FreeMonoorDejaVu Sans Mono.Menlois macOS-only. - SurroundingRectangle on point:
SurroundingRectanglerequires a Mobject, not a coordinate. UseRectangle(...).move_to(point)instead. - Frame quantization at 15fps: Draft (
-ql) can have up to ~0.07s timing drift per animation. Production (-qh) at 60fps reduces this to ~0.017s. - Audio shorter than video: If sum(video) > sum(audio), the final clip will have silent tail. Acceptable if < 0.5s. If larger, trim video with ffmpeg or adjust waits.
- Scene transition flash: If Scene N fades out elements that Scene N+1
self.add()s, ensure the last frame of N matches the first frame of N+1. Otherwise a "flash" occurs at the stitch point. - Polynomial explosion: Taylor polynomials diverge outside the convergence radius. Set
y_rangewide enough to show the "shrinking" effect visually, or clipx_rangeto safe bounds. - Kids version feels rushed: If adapting for kids, increase all waits by 50%. Children's processing speed is slower; they need time to connect the metaphor to the math.
Sync Quality Checklist
- Each scene's
run_time + waitsum matches audio duration within 0.1s - Production render durations verified with ffprobe
- No visual flash at scene stitch points
- Audio concat duration matches video concat duration within 0.5s
- Final mux outputs without
-shortesttruncation - Both Feishu + Web links tested and accessible