video-assemble - SKILL.md Agent Skill

name: video-assemble
user-invocable: false
description: >
  Assemble a final recap video: mux narration audio over the source video, duck the original audio under the narration, render subtitles (SRT/ASS, optionally burned in), and loudness-normalize. Use as the last stage of the video-recap bundle. Consumes the source video + tts_meta.json (+ narration placement); produces recap_<stem>.mp4 + subtitles.srt/.ass. Trigger words: 视频合成 (video assembly), 混音 (audio mixing), 字幕 (subtitles), 压字幕 (burn subtitles), assemble video, mux, ducking, subtitles, 成片 (final cut).

What this does

Mixes the narration audio segments onto the source video at their placed times.
Ducks the original audio under narration (fixed / sidechain / zone modes).
Renders subtitles from the narration placement → subtitles.srt (+ subtitles.ass when burning, which is on by default; --no-burn-subtitles to disable).
Optionally normalizes the final mix to a target LUFS.
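The subtitle step above can be pictured as a small timing-to-text conversion. The sketch below is illustrative only and is not the code in scripts/assemble.py: it assumes each cue is a dict with start and end in seconds plus a text field, and the helper names format_srt_time and render_srt are hypothetical.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def render_srt(cues):
    """Render cues (assumed shape: {start, end, text}) as SRT text.

    Cues are sorted by start time and numbered from 1, as SRT expects.
    """
    blocks = []
    ordered = sorted(cues, key=lambda c: c["start"])
    for index, cue in enumerate(ordered, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(cue['start'])} --> {format_srt_time(cue['end'])}\n"
            f"{cue['text'].strip()}\n"
        )
    # A blank line separates consecutive cue blocks.
    return "\n".join(blocks)
```

The real renderer also applies the subtitle look settings (for example SUBTITLE_MAX_CHARS), which this sketch leaves out.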

Input contract

<video> — the source video (the original, or edited_source.mp4 in cut mode).
work_dir/tts_meta.json — {segments: [...]} from video-voiceover (each segment carries audio_path, timing, pause_after_ms, and overlaps_speech/placement used for ducking + subtitles).
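As a rough illustration of the input contract, a caller could sanity-check tts_meta.json before running the assemble step. This is a minimal sketch under assumptions: the function name load_segments is hypothetical, and only audio_path is treated as required here because the exact segment schema is defined by video-voiceover, not by this document.

```python
import json
from pathlib import Path

# Assumed minimum per-segment key; the full schema comes from video-voiceover.
REQUIRED_KEYS = ("audio_path",)


def load_segments(work_dir):
    """Load work_dir/tts_meta.json and return its list of segments."""
    meta_path = Path(work_dir) / "tts_meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    segments = meta.get("segments") if isinstance(meta, dict) else None
    if not isinstance(segments, list):
        raise ValueError(f"{meta_path}: expected a top-level 'segments' list")
    for i, seg in enumerate(segments):
        missing = [key for key in REQUIRED_KEYS if key not in seg]
        if missing:
            raise ValueError(f"{meta_path}: segment {i} is missing {missing}")
    return segments
```

Failing early on a malformed file gives a clearer error than a failure halfway through a render.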

Run

python3 scripts/assemble.py <video> --work-dir <work_dir> \
  [--recap-stem <name>] [--output-dir <dir>] [--no-burn-subtitles] \
  [--source-video <orig.mp4>] [--export-jianying [--jianying-out <dir>]]
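When the assemble step is driven from another script, the command line above can be built programmatically. The helper below is a hypothetical sketch, not part of the skill: build_assemble_cmd and its keyword names are assumptions, while the flags themselves are the ones listed in the synopsis.

```python
def build_assemble_cmd(video, work_dir, recap_stem=None, output_dir=None,
                       burn_subtitles=True, source_video=None,
                       export_jianying=False, jianying_out=None):
    """Build the argv list for scripts/assemble.py from keyword options."""
    cmd = ["python3", "scripts/assemble.py", str(video), "--work-dir", str(work_dir)]
    if recap_stem:
        cmd += ["--recap-stem", str(recap_stem)]
    if output_dir:
        cmd += ["--output-dir", str(output_dir)]
    if not burn_subtitles:
        # Burning is on by default, so only the opt-out flag is emitted.
        cmd.append("--no-burn-subtitles")
    if source_video:
        cmd += ["--source-video", str(source_video)]
    if export_jianying:
        cmd.append("--export-jianying")
        if jianying_out:
            cmd += ["--jianying-out", str(jianying_out)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the skill's root directory, so that the relative path scripts/assemble.py resolves.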

Output contract

recap_<stem>.mp4 — the final recap video (written to --output-dir or work_dir's parent). It is the stable output alias, overwritten in place on every run so iterating on the narration refreshes the same file.
work_dir/output.mp4 — the in-place render.
subtitles.srt — narration subtitles; subtitles.ass when burning subtitles (on by default).
timeline.json — backend-neutral multi-track model (video / original-audio / narration / BGM / subtitle tracks with ducking automation). Always written.
assembly_manifest.json — a slim render record: the input/source paths, the cut-mode source fingerprint (proving a stale ambient SOURCE_VIDEO did not leak into a full-mode export), the render settings, and the final output path.
剪映 draft folder (recap_<stem>/draft_content.json + draft_info.json + draft_meta_info.json) — only with --export-jianying.

Notes

Audio is mixed as tracks (like a cut-software timeline): the original audio, an optional BGM bed, and the narration.
Optional 剪映/JianYing export: --export-jianying (or EXPORT_JIANYING=1) turns timeline.json into an editable 剪映 draft — original clips, separate audio tracks, and volume keyframes for the ducking. Fully decoupled and lazy-imported: the ffmpeg render never depends on it, and 剪映 need not be installed. In cut mode pass --source-video <orig> so the draft references the real clips. Point --jianying-out at 剪映's drafts root to open it in-app. If a draft folder with the same name already has files, export writes a numbered sibling instead of overwriting it. Media is bundled into the draft folder by default (--jianying-no-bundle-media to reference in place) — this is required on macOS, where 剪映 is sandboxed and cannot read external paths. Note: the draft references the un-burned original, so the source's hardcoded subtitles are visible there (mask them in 剪映 if needed).
Subtitle look: SUBTITLE_FONT_SIZE, SUBTITLE_MARGIN_V, SUBTITLE_MAX_CHARS, etc.
Ducking / loudness: the original swells to IDLE_ORIG_VOLUME in the gaps and ducks to SPEECH_DUCKING_VOLUME under narration (DUCK_FADE_SECONDS smooths the transition); also DUCKING_MODE, ZONE_DUCKING_VOLUME, FINAL_LOUDNORM, TARGET_LUFS.
BGM (optional): set BGM_PATH to any audio file; it loops to length and ducks under narration (BGM_VOLUME / BGM_DUCKING_VOLUME).
Burning subtitles requires an ffmpeg with subtitles/libass support; assemble (and the recap orchestrator) preflight this and fail fast with a clear message if it is missing.
During original-audio blocks (the narration gaps), the original dialogue is also burned as subtitles so the band is never blank while the original speaks. These lines are wrapped in 「」 to set them apart from narration (SUBTITLE_ORIGINAL_IN_GAPS, default on). The preferred source is the agent-calibrated original_subtitles.json, a list of {start, end, text} entries in output-timeline time. Without it, a conservative auto-ASR mapping is used: cut mode remaps ASR timestamps from source to output via the clip plan, assigns each line to the one gap it lands in, and skips lines too dense to read.
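The ducking behaviour described in the notes above (idle volume in the gaps, ducked volume under narration, a fade between them) can be sketched as a piecewise-linear gain curve. This is an illustration under assumptions, not the skill's implementation: duck_gain is a hypothetical name, the default values are placeholders rather than the real IDLE_ORIG_VOLUME, SPEECH_DUCKING_VOLUME, or DUCK_FADE_SECONDS defaults, and placing the ramps just outside each narration interval is one possible choice.

```python
def duck_gain(t, narration, idle=1.0, ducked=0.2, fade=0.3):
    """Gain applied to the original audio at time t (seconds).

    narration: list of (start, end) intervals where narration plays.
    The gain is `ducked` inside an interval and ramps linearly to `idle`
    over `fade` seconds on either side. Assumes ducked <= idle.
    """
    gain = idle
    for start, end in narration:
        if start <= t <= end:
            return ducked
        if fade <= 0:
            continue
        if start - fade < t < start:
            frac = (start - t) / fade  # 1.0 far from narration, 0.0 at its start
        elif end < t < end + fade:
            frac = (t - end) / fade    # 0.0 at narration end, 1.0 after the fade
        else:
            continue
        # Overlapping ramps from neighbouring intervals: keep the lower gain.
        gain = min(gain, ducked + (idle - ducked) * frac)
    return gain
```

Sampling this function at regular times gives the kind of volume keyframes that the timeline model and the 剪映 export describe as ducking automation.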

What this skill does NOT do

Does NOT generate narration or synthesize TTS.
Does NOT re-transcribe or alter timing decisions — it consumes placement from tts_meta.json.
Does NOT preserve the source video stream while burning subtitles: burning is on by default (--no-burn-subtitles turns it off) and re-encodes the video to draw the subtitle band.
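Because burning is on by default and needs an ffmpeg build with the subtitles filter, a caller may want its own check before starting a long run. The sketch below shows one way to do that by parsing the output of ffmpeg -filters. It is an assumption about how such a preflight could look, not the check that assemble itself performs, and both function names are hypothetical.

```python
import shutil
import subprocess


def has_subtitles_filter(filters_output: str) -> bool:
    """Return True if `ffmpeg -filters` output lists the subtitles filter."""
    for line in filters_output.splitlines():
        parts = line.split()
        # Filter rows are: flags, filter name, pad types, description.
        if len(parts) >= 2 and parts[1] == "subtitles":
            return True
    return False


def preflight_subtitle_burn(ffmpeg="ffmpeg"):
    """Raise RuntimeError if ffmpeg is missing or lacks the subtitles filter."""
    if shutil.which(ffmpeg) is None:
        raise RuntimeError(f"{ffmpeg} was not found on PATH")
    result = subprocess.run(
        [ffmpeg, "-hide_banner", "-filters"],
        capture_output=True, text=True, check=True,
    )
    if not has_subtitles_filter(result.stdout):
        raise RuntimeError(
            "this ffmpeg build has no subtitles filter (libass); "
            "use --no-burn-subtitles or install a build with libass"
        )
```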