learn-eng-by-douyin - SKILL.md Agent Skill

name: learn-eng-by-douyin
description: "Translate Chinese short videos into English subtitles with English voiceover for English learning. TRIGGERS ON: any video URL from Douyin/TikTok/Bilibili/抖音 combined with translation-related words (翻译, translate, 英文, English, 字幕, subtitles, 配音, dubbing, 学英语, learn English). Key patterns: '翻译一下' + video link, '翻译这个视频', 'translate this video', video link + '翻译/translate'. Also triggers on: 抖音视频翻译, 配英文字幕, 英文配音, Chinese to English video translation, 把视频翻译成英文, 帮我翻译, v.douyin.com + 翻译. When user shares a Chinese short video platform link (v.douyin.com, douyin.com, tiktok.com, bilibili.com, b23.tv) and mentions ANY form of translation, this skill applies. Outputs: original video + translated video with burned-in English subtitles and English TTS audio track."

Learn English by Douyin

Overview

Transform Chinese short videos (Douyin, TikTok, Bilibili, etc.) into English learning materials by:

Transcribing the Chinese audio to text
Translating Chinese text to natural English (done by Agent/LLM)
Generating English TTS voiceover
Smart time expansion — automatically slows down video segments where TTS needs more time
Burning English subtitles and replacing the audio track

Trigger Conditions

When the user:

Shares a Douyin/TikTok/Bilibili link and says any translation-related word (翻译、translate、英文、English、字幕、subtitles)
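Sends a short request such as "翻译一下" / "翻译" / "translate" together with a Chinese video link
Says "翻译这个视频" / "把这个视频翻译成英文" / "帮我翻译" ("translate this video" / "translate this video into English" / "help me translate")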
Asks to "learn English from a Chinese video"
Wants "English subtitles" or "English dubbing" for a Chinese video
Says "translate this Douyin/TikTok video to English"
Provides a Chinese video and asks for English learning content
Uses keywords: 学英语、配英文字幕、英文配音、抖音视频翻译

URL patterns that indicate Chinese short videos:

v.douyin.com/*, douyin.com/*
tiktok.com/*
bilibili.com/*, b23.tv/*

Key rule: If the user shares a link from any of the above platforms and mentions translation in ANY form, use this skill — not video-summarizer.
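
The key rule above can be sketched as a small check: a link from one of the listed platforms plus at least one translation-related keyword. This is only an illustration of the rule, not how the Agent runtime actually matches skills; the names `VIDEO_HOSTS`, `KEYWORDS`, and `should_trigger` are hypothetical, and the keyword list is a subset of the ones given above.

```python
import re
from urllib.parse import urlparse

# Hosts taken from the URL patterns listed above.
VIDEO_HOSTS = ("douyin.com", "tiktok.com", "bilibili.com", "b23.tv")

# A subset of the translation-related keywords listed above (lowercased).
KEYWORDS = ("翻译", "translate", "英文", "english", "字幕", "subtitles",
            "配音", "dubbing", "学英语")

URL_RE = re.compile(r"https?://\S+")


def is_video_host(url: str) -> bool:
    """True if the URL's host is one of the platforms or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in VIDEO_HOSTS)


def should_trigger(message: str) -> bool:
    """Apply the key rule: platform link AND any translation keyword."""
    urls = URL_RE.findall(message)
    has_video_link = any(is_video_host(u) for u in urls)
    # Remove URLs first so a keyword that happens to appear inside a URL
    # does not count as a translation request.
    text = URL_RE.sub(" ", message).lower()
    has_keyword = any(k in text for k in KEYWORDS)
    return has_video_link and has_keyword
```

The URL pattern assumes the link is separated from the surrounding text by whitespace; a keyword glued directly onto the end of a link would be swallowed into the URL and missed.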

Output Structure

All output is saved to output/<video-title>/ in the workspace:

output/<video-title>/
├── original.mp4           # Original video (copied/downloaded)
└── translated.mp4         # English subtitles + English voiceover
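
Video titles often contain slashes, punctuation, or whitespace that do not work well as a directory name. The sketch below shows one possible way to derive the `<video-title>` path component; `safe_dir_name` is a hypothetical helper, not part of this skill's scripts, and the character set it filters is an assumption.

```python
import re
import unicodedata


def safe_dir_name(title: str, max_len: int = 80) -> str:
    """Turn a video title into a single, safe path component."""
    name = unicodedata.normalize("NFC", title)
    # Replace path separators and characters that are awkward on common
    # filesystems or in shells; Chinese characters are kept as-is.
    name = re.sub(r'[\\/:*?"<>|\x00-\x1f]', " ", name)
    # Collapse whitespace runs into single hyphens.
    name = re.sub(r"\s+", "-", name.strip())
    # Avoid hidden files and relative-path components such as "." or "..".
    name = name.strip(".-")
    return name[:max_len] or "untitled"
```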

Intermediate files (audio, SRT files, TTS clips) are created in a system temp directory (mktemp -d). Do NOT manually delete them — macOS automatically cleans /tmp periodically (~3 days). Keeping temp files allows quick iteration (re-translate, adjust TTS, etc.) without re-running expensive steps like transcription.

Execution Workflow

Step 1: Install Dependencies

bash "$SKILL_DIR/scripts/install_deps.sh"

This checks and installs: yt-dlp, ffmpeg, uv (Python package manager).

Step 2: Download or Locate Video

If the user provides a URL:

# Create output directory
mkdir -p "output/<video-title>"

# Download with yt-dlp
yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" \
  --merge-output-format mp4 \
  -o "output/<video-title>/original.mp4" \
  "<video-url>"

If the user provides a local file:

cp "<local-video-path>" "output/<video-title>/original.mp4"

Step 3: Transcribe Chinese Audio

First, extract audio from the video:

TMPDIR=$(mktemp -d)
ffmpeg -y -i "output/<video-title>/original.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$TMPDIR/audio.wav"

Then run the transcription script:

uv run "$SKILL_DIR/scripts/parallel_transcribe.py" \
  --input "$TMPDIR/audio.wav" \
  --output "$TMPDIR/chinese.srt" \
  --model small \
  --language zh

This outputs a Chinese SRT file with timestamps.
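
For reference, an SRT file is a sequence of blocks, each made of an index line, a timing line, and one or more text lines. The following is a minimal parsing sketch, useful when inspecting `chinese.srt` before translating; `Cue`, `to_seconds`, and `parse_srt` are illustrative names and are not taken from `parallel_transcribe.py`.

```python
import re
from dataclasses import dataclass

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def to_seconds(stamp: str) -> float:
    """Convert 'HH:MM:SS,mmm' into seconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; malformed blocks are skipped, not guessed."""
    content = content.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, _, end = lines[1].partition("-->")
        text = "\n".join(lines[2:]).strip()
        cues.append(Cue(int(lines[0].strip()), to_seconds(start),
                        to_seconds(end), text))
    return cues
```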

Step 4: Translate to English (Agent Task)

This step is performed by the Agent (you), NOT by a script.

Read the Chinese SRT file and translate each subtitle entry to English. Follow the guidelines in $SKILL_DIR/references/translation-guidelines.md.

Key rules:

Keep the SRT index numbers and timestamps exactly the same
Translate each Chinese text line to natural, spoken English
Keep translations concise to fit the time window
Use colloquial English suitable for language learners
Preserve the speaker's tone and intent
For internet slang/memes, find equivalent English expressions or explain briefly
Keep Chinese names in pinyin (e.g., 哪吒 → Nezha)

Example transformation:

Input (chinese.srt):
1
00:00:01,200 --> 00:00:03,800
大家好，今天我们来聊聊哪吒闹海的故事

Output (english.srt):
1
00:00:01,200 --> 00:00:03,800
Hey everyone, today let's talk about the story of Nezha Conquers the Sea

Save the translated SRT to $TMPDIR/english.srt.
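
Because the first rule is to keep index numbers and timestamps unchanged, it can help to verify that mechanically before moving on. The sketch below compares only the index and timing lines of the two files; `check_translation` is a hypothetical helper, not one of this skill's scripts.

```python
import re

# Matches an index line followed by a timing line.
CUE_HEAD = re.compile(
    r"^(\d+)[ \t]*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3})[ \t]*$",
    re.MULTILINE,
)


def cue_headers(srt_text: str) -> list[tuple[str, str]]:
    """Return (index, timing line) pairs in file order."""
    return CUE_HEAD.findall(srt_text.replace("\r\n", "\n"))


def check_translation(chinese_srt: str, english_srt: str) -> list[str]:
    """List structural differences; an empty list means the files line up."""
    problems = []
    zh, en = cue_headers(chinese_srt), cue_headers(english_srt)
    if len(zh) != len(en):
        problems.append(f"entry count differs: {len(zh)} vs {len(en)}")
    for (zh_index, zh_time), (en_index, en_time) in zip(zh, en):
        if zh_index != en_index:
            problems.append(f"index mismatch: {zh_index} vs {en_index}")
        elif zh_time != en_time:
            problems.append(f"entry {zh_index}: timestamps changed")
    return problems
```

Running it on the contents of `$TMPDIR/chinese.srt` and `$TMPDIR/english.srt` should return an empty list before Step 5 starts.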

Step 4b: Speaker Identification & Voice Map (Multi-Speaker Videos)

When the video has multiple speakers (e.g., adult + child, interviewer + guest), identify who speaks each line and create a voice map JSON file.

This step is performed by the Agent (you), NOT by a script.

Analyze the dialogue content and context to determine which speaker says each line
Assign appropriate edge-tts voices based on speaker characteristics
Save as $TMPDIR/voice_map.json

Voice map format:

{
  "default": "en-US-AndrewMultilingualNeural",
  "2": "en-US-AnaNeural",
  "3": "en-US-AnaNeural"
}

Keys are subtitle index numbers (as strings). The "default" key sets the fallback voice. Only list indices that differ from default.
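
The lookup described above can be sketched as follows. This is an illustration of the file format only; how `generate_tts.py` actually reads the map is not shown here, and the hard-coded fallback used when the "default" key is missing is an assumption.

```python
import json

# Assumed fallback if the map has no "default" key.
FALLBACK_VOICE = "en-US-AndrewMultilingualNeural"


def load_voice_map(text: str) -> dict[str, str]:
    """Parse voice map JSON and normalise keys and values to strings."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("voice map must be a JSON object")
    return {str(key): str(value) for key, value in data.items()}


def voice_for(index: int, voice_map: dict[str, str]) -> str:
    """Pick the voice for a subtitle index, falling back to the default."""
    # Keys are subtitle indices stored as strings, so convert before lookup.
    return voice_map.get(str(index), voice_map.get("default", FALLBACK_VOICE))
```

With the example map above, entries 2 and 3 resolve to `en-US-AnaNeural` and every other entry resolves to the default voice.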

Recommended edge-tts voices by character type:

| Character | Voice | Description |
|---|---|---|
| Adult male (neutral) | `en-US-AndrewMultilingualNeural` | Natural, conversational |
| Adult male (casual) | `en-US-GuyNeural` | Casual, warm |
| Adult female | `en-US-JennyNeural` | Friendly, clear |
| Adult female (warm) | `en-US-AvaMultilingualNeural` | Natural, warm |
| Child/young girl | `en-US-AnaNeural` | Young, cheerful |
| Narrator | `en-US-BrianMultilingualNeural` | Smooth, authoritative |

Skip this step if the video has only one speaker — Step 5 will use a single default voice.

Step 5: Generate English TTS Clips

Generate TTS audio for each subtitle entry:

# Single speaker (no voice map):
uv run "$SKILL_DIR/scripts/generate_tts.py" \
  --srt "$TMPDIR/english.srt" \
  --output-dir "$TMPDIR/tts_clips"

# Multi-speaker (with voice map from Step 4b):
uv run "$SKILL_DIR/scripts/generate_tts.py" \
  --srt "$TMPDIR/english.srt" \
  --output-dir "$TMPDIR/tts_clips" \
  --voice-map "$TMPDIR/voice_map.json"

Step 6: Smart Time Window Expansion

This is the key step for learner-friendly pacing. It compares each TTS clip's actual duration against the subtitle time window. If the TTS needs more time than the window allows, it expands that video segment by slowing it down slightly.

6a. Analyze and generate adjusted timeline:

# Get original video duration
DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "output/<video-title>/original.mp4")

python3 "$SKILL_DIR/scripts/smart_expand.py" \
  --srt "$TMPDIR/english.srt" \
  --tts-dir "$TMPDIR/tts_clips" \
  --output-srt "$TMPDIR/adjusted_english.srt" \
  --output-segments "$TMPDIR/segments.json" \
  --video-duration "$DURATION" \
  --buffer 0.3

This outputs:

adjusted_english.srt — SRT with shifted timestamps accounting for expansions
segments.json — per-segment expansion plan (which segments to slow down, by how much)
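
To make the idea concrete, here is a simplified sketch of the arithmetic behind such a plan. It is not the actual logic of `smart_expand.py`: it assumes that only the subtitle's own window is stretched, that the buffer is added only to segments that need expanding, and that all later timestamps shift by the time added so far. `Plan` and `plan_expansion` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    index: int
    factor: float     # 1.0 = unchanged, > 1.0 = slowed down
    new_start: float  # seconds on the expanded timeline
    new_end: float


def plan_expansion(cues, tts_durations, buffer=0.3):
    """cues: list of (index, start, end) in seconds.
    tts_durations: {index: clip length in seconds}."""
    plans = []
    shift = 0.0  # total time added by earlier expansions
    for index, start, end in cues:
        window = end - start
        tts = tts_durations[index]
        # Expand only when the clip does not fit in its window.
        extra = (tts + buffer - window) if tts > window else 0.0
        factor = (window + extra) / window if window > 0 else 1.0
        plans.append(Plan(index, factor, start + shift, end + shift + extra))
        shift += extra
    return plans
```

For example, a 2.2 s clip in a 2.0 s window needs 2.5 s with the 0.3 s buffer, which gives a factor of 1.25 and pushes every later subtitle back by 0.5 s.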

6b. Build expanded video:

python3 "$SKILL_DIR/scripts/expand_video.py" \
  --video "output/<video-title>/original.mp4" \
  --segments "$TMPDIR/segments.json" \
  --output "$TMPDIR/expanded_video.mp4" \
  --tmpdir "$TMPDIR"

This splits the video into segments, applies setpts slowdown (typically 1.05x–1.25x) on segments that need more time, and concatenates everything back together.
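
As a rough illustration of the slowdown, ffmpeg's `setpts` filter multiplies each frame's presentation timestamp, so `setpts=1.25*PTS` makes a segment last 1.25 times as long. The sketch below builds such a command for one segment; it assumes the factor in `segments.json` is this PTS multiplier, and `segment_command` is a hypothetical helper rather than code from `expand_video.py`.

```python
def setpts_filter(factor: float) -> str:
    """Build the ffmpeg video filter that stretches a segment by `factor`."""
    if factor < 1.0:
        raise ValueError("this workflow only slows segments down")
    return f"setpts={factor:.4f}*PTS"


def segment_command(src: str, start: float, end: float,
                    factor: float, dst: str) -> list[str]:
    """ffmpeg arguments to cut [start, end] from src and slow it down."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",  # cut on input timestamps
        "-i", src,
        "-an",                                       # audio is replaced later
        "-vf", setpts_filter(factor),
        dst,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`; encoder options are omitted here and would need to match the rest of the pipeline.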

Step 7: Assemble Audio and Burn Subtitles

Assemble TTS clips using the adjusted timeline:

# Use expanded video duration, not original
EXPANDED_DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$TMPDIR/expanded_video.mp4")

uv run "$SKILL_DIR/scripts/assemble_audio.py" \
  --srt "$TMPDIR/adjusted_english.srt" \
  --audio-dir "$TMPDIR/tts_clips" \
  --output "$TMPDIR/english_audio.wav" \
  --duration "$EXPANDED_DURATION"
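
Conceptually, assembling the track means starting from silence of the full video length and dropping each TTS clip in at its adjusted start time. The sketch below shows that idea on raw mono 16-bit samples using only the standard library; it is not the implementation of `assemble_audio.py`, and the sample rate and the `place_clips` name are assumptions.

```python
from array import array

SAMPLE_RATE = 24000  # assumed; use the rate the TTS clips actually have


def place_clips(clips, total_seconds, rate=SAMPLE_RATE):
    """clips: list of (start_seconds, array('h')) mono 16-bit clips.
    Returns one array('h') track of silence with the clips mixed in."""
    track = array("h", [0]) * int(round(total_seconds * rate))
    for start, samples in clips:
        offset = int(round(start * rate))
        for i, value in enumerate(samples):
            pos = offset + i
            if pos < 0:
                continue  # ignore samples before the start of the track
            if pos >= len(track):
                break     # clip runs past the end of the video
            # Mix and clamp to the 16-bit range in case clips overlap.
            track[pos] = max(-32768, min(32767, track[pos] + value))
    return track
```

The per-sample loop is slow for long videos and is meant only to show the placement logic.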

Burn subtitles and merge audio — note we use the expanded video as input:

bash "$SKILL_DIR/scripts/burn_and_merge.sh" \
  "$TMPDIR/expanded_video.mp4" \
  "$TMPDIR/adjusted_english.srt" \
  "$TMPDIR/english_audio.wav" \
  "output/<video-title>/translated.mp4"

Step 8: Deliver

Send the translated video to the user. Do NOT delete $TMPDIR — macOS handles /tmp cleanup automatically, and keeping intermediate files enables quick iteration.

Script Reference

| Script | Purpose | Key Args |
|---|---|---|
| `install_deps.sh` | Install yt-dlp, ffmpeg, uv | None |
| `parallel_transcribe.py` | Chinese speech → SRT | `--input`, `--output`, `--model`, `--language` |
| `generate_tts.py` | English SRT → TTS audio clips | `--srt`, `--output-dir`, `--voice-map` (optional) |
| `smart_expand.py` | Analyze TTS vs windows, generate adjusted SRT + segment plan | `--srt`, `--tts-dir`, `--output-srt`, `--output-segments`, `--video-duration`, `--buffer` |
| `expand_video.py` | Split video, slow down expanded segments, concatenate | `--video`, `--segments`, `--output`, `--tmpdir` |
| `assemble_audio.py` | TTS clips → full audio track | `--srt`, `--audio-dir`, `--output`, `--duration` |
| `burn_and_merge.sh` | Burn subtitles + replace audio | video, srt, audio, output (positional) |
| `burn_subtitles_cv.py` | Subtitle burning fallback (OpenCV+Pillow) | `--video`, `--srt`, `--output` |

Notes

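⚠️ Telegram video encoding: the Telegram client has an inline-preview bug with portrait (9:16) videos encoded as H.264 (libx264): the aspect ratio is distorted and the picture appears squashed. Use HEVC (libx265) with -tag:v hvc1 to ensure compatibility. All scripts already default to HEVC. If the target platform is not Telegram, H.264 also works, but keeping HEVC is recommended for generality.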
The translation step (Step 4) is the most critical for quality. Always follow references/translation-guidelines.md.
Smart expansion typically adds 1–3 seconds to a 30-second video. Each expanded segment gets a 0.3s buffer for breathing room. The slowdown factor is usually 1.05x–1.25x — subtle enough that viewers barely notice, but enough for clear English narration.
Do NOT manually clean up $TMPDIR. macOS clears /tmp automatically. Keeping intermediate files (chinese.srt, english.srt, TTS clips) enables quick iteration without re-running expensive transcription or TTS steps.
For very short videos (<10s), the transcription may produce minimal output. That's expected.
Multi-speaker videos: When a video has multiple speakers (adult + child, interviewer + guest, etc.), use Step 4b to create a voice map. Different edge-tts voices make the translation feel natural and distinguishable. Note that multi-voice TTS clips may have different durations than single-voice, so smart expansion may add more time (e.g., child voices like en-US-AnaNeural tend to speak slower than adult voices).
TTS uses sag (ElevenLabs) when available for higher quality, and falls back to the system say command on macOS.
Subtitle burning: burn_and_merge.sh uses ffmpeg libass first; if unavailable, automatically falls back to burn_subtitles_cv.py (OpenCV+Pillow frame-by-frame rendering). Both produce identical visual output.
All Python scripts use uv run with inline script metadata — no virtual environment setup needed. smart_expand.py and expand_video.py use only stdlib, so plain python3 is fine.
The $SKILL_DIR variable points to this skill's root directory.
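
As an illustration of the Telegram encoding note above, the sketch below builds ffmpeg arguments that encode with HEVC and the `hvc1` tag while swapping in the English audio track. It is a simplified stand-in, not the contents of `burn_and_merge.sh`: subtitle burning is left out, and `encoder_args`, `merge_command`, and the AAC audio choice are assumptions.

```python
def encoder_args(codec: str = "hevc") -> list[str]:
    """Video encoder options; HEVC with the hvc1 tag is the default."""
    if codec == "hevc":
        return ["-c:v", "libx265", "-tag:v", "hvc1"]
    if codec == "h264":
        return ["-c:v", "libx264"]  # fine when the target is not Telegram
    raise ValueError(f"unsupported codec: {codec!r}")


def merge_command(video: str, audio: str, output: str,
                  codec: str = "hevc") -> list[str]:
    """ffmpeg arguments to re-encode the video and replace its audio."""
    return [
        "ffmpeg", "-y",
        "-i", video, "-i", audio,
        "-map", "0:v:0", "-map", "1:a:0",  # video from input 0, audio from input 1
        *encoder_args(codec),
        "-c:a", "aac",                     # assumed audio codec
        "-shortest",
        output,
    ]
```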