name: learn-eng-by-douyin
description: "Translate Chinese short videos into English subtitles with English voiceover for English learning. TRIGGERS ON: any video URL from Douyin/TikTok/Bilibili/抖音 combined with translation-related words (翻译, translate, 英文, English, 字幕, subtitles, 配音, dubbing, 学英语, learn English). Key patterns: '翻译一下' + video link, '翻译这个视频', 'translate this video', video link + '翻译/translate'. Also triggers on: 抖音视频翻译, 配英文字幕, 英文配音, Chinese to English video translation, 把视频翻译成英文, 帮我翻译, v.douyin.com + 翻译. When user shares a Chinese short video platform link (v.douyin.com, douyin.com, tiktok.com, bilibili.com, b23.tv) and mentions ANY form of translation, this skill applies. Outputs: original video + translated video with burned-in English subtitles and English TTS audio track."
Learn English by Douyin
Overview
Transform Chinese short videos (Douyin, TikTok, Bilibili, etc.) into English learning materials by:
- Transcribing the Chinese audio to text
- Translating Chinese text to natural English (done by Agent/LLM)
- Generating English TTS voiceover
- Smart time expansion — automatically slows down video segments where TTS needs more time
- Burning English subtitles and replacing the audio track
Trigger Conditions
When the user:
- Shares a Douyin/TikTok/Bilibili link and says any translation-related word (翻译、translate、英文、English、字幕、subtitles)
- Simple patterns like "翻译一下" / "翻译" / "translate" + a Chinese video link
- "翻译这个视频" / "把这个视频翻译成英文" / "帮我翻译"
- Asks to "learn English from a Chinese video"
- Wants "English subtitles" or "English dubbing" for a Chinese video
- Says "translate this Douyin/TikTok video to English"
- Provides a Chinese video and asks for English learning content
- Uses keywords: 学英语、配英文字幕、英文配音、抖音视频翻译
URL patterns that indicate Chinese short videos:
v.douyin.com/*, douyin.com/*, tiktok.com/*, bilibili.com/*, b23.tv/*
Key rule: If the user shares a link from any of the above platforms and mentions translation in ANY form, use this skill — not video-summarizer.
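The key rule can be sketched as a small check: a message qualifies when it contains a link to one of the listed hosts and at least one translation keyword. This is only an illustrative sketch; the function names (`is_video_host`, `should_trigger`) are hypothetical and not part of this skill, and the keyword list is copied from the trigger conditions above.

```python
import re
from urllib.parse import urlparse

# Hosts taken from the URL patterns listed above.
VIDEO_HOSTS = ("douyin.com", "tiktok.com", "bilibili.com", "b23.tv")

# Keywords taken from the trigger conditions (compared in lowercase).
TRANSLATION_KEYWORDS = (
    "翻译", "translate", "英文", "english", "字幕", "subtitles",
    "配音", "dubbing", "学英语",
)

URL_RE = re.compile(r"https?://\S+")


def is_video_host(host: str) -> bool:
    """True for a listed host or any of its subdomains (e.g. v.douyin.com)."""
    host = host.lower()
    return any(host == h or host.endswith("." + h) for h in VIDEO_HOSTS)


def should_trigger(message: str) -> bool:
    """Hypothetical check: video link from a listed platform + a keyword."""
    urls = URL_RE.findall(message)
    has_video_link = any(is_video_host(urlparse(u).hostname or "") for u in urls)
    lowered = message.lower()
    has_keyword = any(k in lowered for k in TRANSLATION_KEYWORDS)
    return has_video_link and has_keyword
```

Matching on the parsed hostname rather than on a substring avoids false positives from look-alike domains such as notdouyin.com.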
Output Structure
All output is saved to output/<video-title>/ in the workspace:
output/<video-title>/
├── original.mp4 # Original video (copied/downloaded)
└── translated.mp4 # English subtitles + English voiceover
Intermediate files (audio, SRT files, TTS clips) are created in a system temp directory (mktemp -d). Do NOT manually delete them — macOS automatically cleans /tmp periodically (~3 days). Keeping temp files allows quick iteration (re-translate, adjust TTS, etc.) without re-running expensive steps like transcription.
Execution Workflow
Step 1: Install Dependencies
bash "$SKILL_DIR/scripts/install_deps.sh"
This checks and installs: yt-dlp, ffmpeg, uv (Python package manager).
Step 2: Download or Locate Video
If the user provides a URL:
# Create output directory
mkdir -p "output/<video-title>"
# Download with yt-dlp
yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" \
--merge-output-format mp4 \
-o "output/<video-title>/original.mp4" \
"<video-url>"
If the user provides a local file:
cp "<local-video-path>" "output/<video-title>/original.mp4"
Step 3: Transcribe Chinese Audio
First, extract audio from the video:
TMPDIR=$(mktemp -d)
ffmpeg -y -i "output/<video-title>/original.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$TMPDIR/audio.wav"
Then run the transcription script:
uv run "$SKILL_DIR/scripts/parallel_transcribe.py" \
--input "$TMPDIR/audio.wav" \
--output "$TMPDIR/chinese.srt" \
--model small \
--language zh
This outputs a Chinese SRT file with timestamps.
Step 4: Translate to English (Agent Task)
This step is performed by the Agent (you), NOT by a script.
Read the Chinese SRT file and translate each subtitle entry to English. Follow the guidelines in $SKILL_DIR/references/translation-guidelines.md.
Key rules:
- Keep the SRT index numbers and timestamps exactly the same
- Translate each Chinese text line to natural, spoken English
- Keep translations concise to fit the time window
- Use colloquial English suitable for language learners
- Preserve the speaker's tone and intent
- For internet slang/memes, find equivalent English expressions or explain briefly
- Keep Chinese names in pinyin (e.g., 哪吒 → Nezha)
Example transformation:
Input (chinese.srt):
1
00:00:01,200 --> 00:00:03,800
大家好,今天我们来聊聊哪吒闹海的故事
Output (english.srt):
1
00:00:01,200 --> 00:00:03,800
Hey everyone, today let's talk about the story of Nezha Conquers the Sea
Save the translated SRT to $TMPDIR/english.srt.
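Because the later steps rely on the index numbers and timestamps being unchanged, it can help to verify that before generating TTS. The following is a minimal sketch of such a check, not one of this skill's scripts; `parse_srt` and `timing_mismatches` are hypothetical helper names, and the parser assumes well-formed SRT blocks separated by blank lines.

```python
import re

TIMING_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def parse_srt(text: str) -> list[tuple[str, str, str]]:
    """Return (index, timing line, subtitle text) for each SRT block."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines()]
        if len(lines) < 2 or not TIMING_RE.match(lines[1]):
            raise ValueError(f"malformed SRT block: {block!r}")
        entries.append((lines[0], lines[1], "\n".join(lines[2:])))
    return entries


def timing_mismatches(chinese_srt: str, english_srt: str) -> list[str]:
    """List every place where the translation changed an index or timestamp."""
    zh, en = parse_srt(chinese_srt), parse_srt(english_srt)
    problems = []
    if len(zh) != len(en):
        problems.append(f"entry count differs: {len(zh)} vs {len(en)}")
    for (zh_idx, zh_time, _), (en_idx, en_time, _) in zip(zh, en):
        if zh_idx != en_idx:
            problems.append(f"index changed: {zh_idx} -> {en_idx}")
        elif zh_time != en_time:
            problems.append(f"entry {zh_idx}: timing changed to {en_time}")
    return problems
```

An empty result means the English SRT can be passed to Step 5 with the original timing intact.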
Step 4b: Speaker Identification & Voice Map (Multi-Speaker Videos)
When the video has multiple speakers (e.g., adult + child, interviewer + guest), identify who speaks each line and create a voice map JSON file.
This step is performed by the Agent (you), NOT by a script.
- Analyze the dialogue content and context to determine which speaker says each line
- Assign appropriate edge-tts voices based on speaker characteristics
- Save as $TMPDIR/voice_map.json
Voice map format:
{
"default": "en-US-AndrewMultilingualNeural",
"2": "en-US-AnaNeural",
"3": "en-US-AnaNeural"
}
Keys are subtitle index numbers (as strings). The "default" key sets the fallback voice. Only list indices that differ from default.
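The lookup described above can be sketched in a few lines. This is an illustration of the format, not the code in generate_tts.py; the `voice_for` helper and its `fallback` parameter (used when the map has no "default" key) are assumptions.

```python
import json


def voice_for(index: int, voice_map: dict[str, str],
              fallback: str = "en-US-AndrewMultilingualNeural") -> str:
    """Pick the voice for one subtitle index.

    Keys are subtitle indices as strings; "default" covers everything else.
    The `fallback` argument is a hypothetical last resort for maps that
    omit the "default" key.
    """
    return voice_map.get(str(index), voice_map.get("default", fallback))


# The voice map from the example above.
example_map = json.loads("""
{
  "default": "en-US-AndrewMultilingualNeural",
  "2": "en-US-AnaNeural",
  "3": "en-US-AnaNeural"
}
""")
```

With this map, subtitles 2 and 3 use the child voice and every other index resolves to the default voice.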
Recommended edge-tts voices by character type:
| Character | Voice | Description |
|---|---|---|
| Adult male (neutral) | en-US-AndrewMultilingualNeural | Natural, conversational |
| Adult male (casual) | en-US-GuyNeural | Casual, warm |
| Adult female | en-US-JennyNeural | Friendly, clear |
| Adult female (warm) | en-US-AvaMultilingualNeural | Natural, warm |
| Child/young girl | en-US-AnaNeural | Young, cheerful |
| Narrator | en-US-BrianMultilingualNeural | Smooth, authoritative |
Skip this step if the video has only one speaker — Step 5 will use a single default voice.
Step 5: Generate English TTS Clips
Generate TTS audio for each subtitle entry:
# Single speaker (no voice map):
uv run "$SKILL_DIR/scripts/generate_tts.py" \
--srt "$TMPDIR/english.srt" \
--output-dir "$TMPDIR/tts_clips"
# Multi-speaker (with voice map from Step 4b):
uv run "$SKILL_DIR/scripts/generate_tts.py" \
--srt "$TMPDIR/english.srt" \
--output-dir "$TMPDIR/tts_clips" \
--voice-map "$TMPDIR/voice_map.json"
Step 6: Smart Time Window Expansion
This is the key step for learner-friendly pacing. It compares each TTS clip's actual duration against the subtitle time window. If the TTS needs more time than the window allows, it expands that video segment by slowing it down slightly.
6a. Analyze and generate adjusted timeline:
# Get original video duration
DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "output/<video-title>/original.mp4")
python3 "$SKILL_DIR/scripts/smart_expand.py" \
--srt "$TMPDIR/english.srt" \
--tts-dir "$TMPDIR/tts_clips" \
--output-srt "$TMPDIR/adjusted_english.srt" \
--output-segments "$TMPDIR/segments.json" \
--video-duration "$DURATION" \
--buffer 0.3
This outputs:
- adjusted_english.srt: SRT with shifted timestamps accounting for expansions
- segments.json: per-segment expansion plan (which segments to slow down, and by how much)
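The arithmetic behind the expansion can be sketched as follows. This is a simplified model under stated assumptions, not the actual logic of smart_expand.py: it assumes a segment is expanded only when its TTS clip is longer than the subtitle window, that the expanded window equals the TTS duration plus the buffer, and that later subtitles shift by the total time added before them. The `Segment` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float         # subtitle window start, in seconds
    end: float           # subtitle window end, in seconds
    tts_duration: float  # measured length of the TTS clip, in seconds


def slowdown_factor(seg: Segment, buffer: float = 0.3) -> float:
    """Return 1.0 if the clip fits, else the stretch needed for clip + buffer."""
    window = seg.end - seg.start
    if window <= 0:
        raise ValueError("subtitle window must have positive length")
    if seg.tts_duration <= window:
        return 1.0
    return (seg.tts_duration + buffer) / window


def adjusted_timeline(segments: list[Segment], buffer: float = 0.3):
    """Return (new_start, new_end, factor) per segment, shifting later ones."""
    shift = 0.0  # total seconds added by earlier expansions
    result = []
    for seg in segments:
        window = seg.end - seg.start
        factor = slowdown_factor(seg, buffer)
        new_start = seg.start + shift
        new_end = new_start + window * factor
        result.append((new_start, new_end, factor))
        shift += window * (factor - 1.0)
    return result
```

For example, a 2.0 s window with a 2.2 s clip and a 0.3 s buffer needs 2.5 s, giving a factor of about 1.25 and pushing every later subtitle back by about 0.5 s.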
6b. Build expanded video:
python3 "$SKILL_DIR/scripts/expand_video.py" \
--video "output/<video-title>/original.mp4" \
--segments "$TMPDIR/segments.json" \
--output "$TMPDIR/expanded_video.mp4" \
--tmpdir "$TMPDIR"
This splits the video into segments, applies setpts slowdown (typically 1.05x–1.25x) on segments that need more time, and concatenates everything back together.
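As a rough illustration of the slowdown itself: ffmpeg's setpts filter multiplies each frame's presentation timestamp, so a factor of 1.25 makes a segment play 1.25 times longer. The sketch below builds one such command as an argument list. It is an assumption about how a single segment could be processed, not a copy of expand_video.py; the function names and the exact option layout are hypothetical, and the HEVC options follow the encoding note in the Notes section.

```python
def setpts_filter(factor: float) -> str:
    """Build the video filter that stretches timestamps by `factor`."""
    if factor < 1.0:
        raise ValueError("factor must be >= 1.0 (1.0 means no slowdown)")
    return f"setpts={factor:.3f}*PTS"


def segment_command(video: str, start: float, end: float,
                    factor: float, out_path: str) -> list[str]:
    """Hypothetical ffmpeg invocation for one slowed-down segment."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",  # cut the segment
        "-i", video,
        "-filter:v", setpts_filter(factor),           # slow the video down
        "-an",                                        # drop the original audio
        "-c:v", "libx265", "-tag:v", "hvc1",          # HEVC, per the Notes
        out_path,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`. Dropping the audio here is safe in this workflow because Step 7 replaces the audio track with the assembled TTS audio.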
Step 7: Assemble Audio and Burn Subtitles
Assemble TTS clips using the adjusted timeline:
# Use expanded video duration, not original
EXPANDED_DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$TMPDIR/expanded_video.mp4")
uv run "$SKILL_DIR/scripts/assemble_audio.py" \
--srt "$TMPDIR/adjusted_english.srt" \
--audio-dir "$TMPDIR/tts_clips" \
--output "$TMPDIR/english_audio.wav" \
--duration "$EXPANDED_DURATION"
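Conceptually, assembling the track means starting from silence of the expanded duration and writing each TTS clip at its adjusted start time. The sketch below shows that idea on plain lists of PCM samples. It is an assumption about the approach, not the implementation of assemble_audio.py; the `assemble` function and the 16 kHz default sample rate are hypothetical.

```python
def assemble(clips: list[tuple[float, list[int]]], duration: float,
             sample_rate: int = 16000) -> list[int]:
    """Place each (start_seconds, samples) clip on a silent mono track.

    Clips that run past the end of the track are truncated; clips that
    start after the end are skipped.
    """
    track = [0] * int(round(duration * sample_rate))
    for start, samples in clips:
        offset = int(round(start * sample_rate))
        if offset >= len(track):
            continue
        end = min(offset + len(samples), len(track))
        track[offset:end] = samples[: end - offset]
    return track
```

Using the expanded video's duration for the track length keeps the audio and the slowed-down video the same length when they are merged.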
Burn subtitles and merge audio — note we use the expanded video as input:
bash "$SKILL_DIR/scripts/burn_and_merge.sh" \
"$TMPDIR/expanded_video.mp4" \
"$TMPDIR/adjusted_english.srt" \
"$TMPDIR/english_audio.wav" \
"output/<video-title>/translated.mp4"
Step 8: Deliver
Send the translated video to the user. Do NOT delete $TMPDIR — macOS handles /tmp cleanup automatically, and keeping intermediate files enables quick iteration.
Script Reference
| Script | Purpose | Key Args |
|---|---|---|
| install_deps.sh | Install yt-dlp, ffmpeg, uv | None |
| parallel_transcribe.py | Chinese speech → SRT | --input, --output, --model, --language |
| generate_tts.py | English SRT → TTS audio clips | --srt, --output-dir, --voice-map (optional) |
| smart_expand.py | Analyze TTS vs windows, generate adjusted SRT + segment plan | --srt, --tts-dir, --output-srt, --output-segments, --video-duration, --buffer |
| expand_video.py | Split video, slow down expanded segments, concatenate | --video, --segments, --output, --tmpdir |
| assemble_audio.py | TTS clips → full audio track | --srt, --audio-dir, --output, --duration |
| burn_and_merge.sh | Burn subtitles + replace audio | video, srt, audio, output |
| burn_subtitles_cv.py | Subtitle burning fallback (OpenCV+Pillow) | --video, --srt, --output |
Notes
- ⚠️ Telegram video encoding: Telegram clients have an inline-preview bug with portrait (9:16) videos encoded as H.264 (libx264): the aspect ratio is distorted and the picture looks squashed. You must use HEVC (libx265) encoding and add -tag:v hvc1 to ensure compatibility. All scripts already use HEVC by default. If the target platform is not Telegram, H.264 also works, but keeping HEVC is recommended for general compatibility.
- The translation step (Step 4) is the most critical for quality. Always follow references/translation-guidelines.md.
- Smart expansion typically adds 1–3 seconds to a 30-second video. Each expanded segment gets a 0.3s buffer for breathing room. The slowdown factor is usually 1.05x–1.25x: subtle enough that viewers barely notice, but enough for clear English narration.
- Do NOT manually clean up $TMPDIR. macOS clears /tmp automatically. Keeping intermediate files (chinese.srt, english.srt, TTS clips) enables quick iteration without re-running expensive transcription or TTS steps.
- For very short videos (<10s), the transcription may produce minimal output. That's expected.
- Multi-speaker videos: When a video has multiple speakers (adult + child, interviewer + guest, etc.), use Step 4b to create a voice map. Different edge-tts voices make the translation feel natural and distinguishable. Note that multi-voice TTS clips may have different durations than single-voice clips, so smart expansion may add more time (e.g., child voices like en-US-AnaNeural tend to speak slower than adult voices).
- TTS uses sag (ElevenLabs) when available for high quality, and falls back to the system say command on macOS.
- Subtitle burning: burn_and_merge.sh uses ffmpeg libass first; if it is unavailable, the script automatically falls back to burn_subtitles_cv.py (OpenCV+Pillow frame-by-frame rendering). Both produce identical visual output.
- All Python scripts use uv run with inline script metadata, so no virtual environment setup is needed. smart_expand.py and expand_video.py use only stdlib, so plain python3 is fine.
- The $SKILL_DIR variable points to this skill's root directory.