name: lyrics-video description: "Agent-authored workflow task: choose a render path (ffmpeg vs JSX), build project.json overlay entries from lyrics_sync output, and set up audio. Load this when you hit montaj/lyrics-video in a workflow." step: true
Lyrics Video Skill
Use this skill whenever you are working on a lyrics_video workflow project in montaj.
Steps 0a + 0b — Always run first
0a: Stem separation (Demucs)
Isolate clean vocals before running Whisper. Running Whisper directly on the full mix gives poor transcription because drums/bass/production mask the vocals.
montaj stem-separation --input song.mp3 --stems vocals --out-dir /tmp/stems
Output JSON: { "vocals": "/tmp/stems/htdemucs/song/vocals.wav", ... }
0b: Lyrics sync on clean vocals
Pass the vocals WAV to lyrics_sync, not the full mix. Window auto-detection is built in — only pass --start/--end if auto-detection fails.
montaj lyrics-sync --input /tmp/stems/htdemucs/song/vocals.wav --lyrics lyrics.txt --model medium.en --out captions.json
Output: captions.json with shape { segments: [{ text, start, end, words: [{word, start, end}] }], audioInPoint: <seconds> }.
The audioInPoint field tells the render engine where in the source file project t=0 maps to. Set audio.tracks[0].inPoint in project.json to this value.
Step 1 — Detect inputs
Check the prompt context for:
Lyrics file: /path/...→ path to lyrics.txtfileBackground video: /path/...→ background video provided (optional)
Step 2 — Choose render path
There are two paths. If the user's intent isn't obvious from their prompt, ask:
"Do you want simple captions — text burned straight into the video, ready to export immediately — or custom animated captions that you can preview and adjust in the UI before rendering?"
| Signal in prompt | Path |
|---|---|
| No background video | JSX always — hard rule, never use ffmpeg |
| "fast", "quick", "export now", "just burn it in" | ffmpeg (needs background video) |
| "preview", "animated", "I want to see it first", "custom style" | JSX |
| Background video provided, no other signal | ffmpeg (default — simpler) |
| Ambiguous with background video | Ask |
What each path means
Both paths always build lyric-phrase.jsx overlays in tracks[1] for UI preview. The difference is what happens at render time.
ffmpeg — burned-in captions (requires background video)
At render time, render.js sees renderMode: "ffmpeg-drawtext" and bypasses Puppeteer entirely — calls lyrics_render.py directly. Fast, no Puppeteer, output is a finished MP4. Style limited to font/size/color/position. The overlays in tracks[1] are ignored during render but remain visible in the UI preview.
ffmpeg preview fidelity rule: Style the tracks[1] overlays to match the ffmpeg output as closely as possible so the UI preview is a faithful representation of what will be rendered. Use these props:
transparent: true— text over video, no backgroundvariant: "static"— shows all words immediately for the full segment duration (matches ffmpeg static display)wordsPerLine— matchcaptions.wordsPerLine(e.g.3); omit ifcaptions.wordsPerLineis not settextColor— matchcaptions.color(e.g."white"); use"white"whencaptions.coloris"auto"fontSize— matchcaptions.fontsize(e.g.72)position— matchcaptions.positionstrokeColor: "#000000",strokeWidth: 4— matches ffmpegborderw=4:bordercolor=blackfontWeight: 400,fontFamily: "system-ui, sans-serif"— closest to ffmpeg default fonttextTransform: "none",bgStyle: "solid"(ignored since transparent)- No entrance animation, no karaoke effects
JSX — custom animated overlays
At render time, Puppeteer composites lyric-phrase.jsx overlays over tracks[0]. Supports animated variants (pop, fade, typewriter), color-flash backgrounds, per-phrase styling. Good when the user wants control or is iterating on the look.
ffmpeg Path
Requires a background video in tracks[0]. Both paths always build lyric-phrase.jsx overlays in tracks[1] for UI preview first.
project.json setup
{
"renderMode": "ffmpeg-drawtext",
"tracks": [
[
{
"id": "bg",
"type": "video",
"src": "/abs/path/to/background.mov",
"start": 0,
"end": <song_duration>,
"inPoint": 0,
"outPoint": <clip_duration>
}
],
[ ...lyric-phrase overlays styled to match ffmpeg output — see ffmpeg preview fidelity rule above... ]
],
"captions": {
"style": "word-by-word",
"segments": [ ...lyrics_sync segments... ],
"position": "center",
"color": "white",
"fontsize": 72
},
"audio": {
"tracks": [
{ "id": "music", "src": "/abs/path/to/song.mp3", "start": 0, "end": 0, "inPoint": 0, "volume": 1.0, "label": "song" }
]
},
"status": "final"
}
The overlays in tracks[1] serve as the UI preview — word-sync is visible in the browser before render. At render time, render.js sees renderMode and bypasses Puppeteer, calling lyrics_render.py directly with the captions params above.
captions ffmpeg params
| Field | Default | Notes |
|---|---|---|
position |
center |
center / top-left / bottom-left |
color |
white |
Any ffmpeg color string |
fontsize |
72 |
Integer pixels |
Mini-preview before full render
After setting up project.json and verifying overlay timing in the UI, run a short CLI preview before the full render:
montaj lyrics-render \
--captions captions.json \
--audio song.mp3 \
--input background.mov \
--position center \
--color white \
--fontsize 72 \
--preview-duration 30 \
--out preview.mp4
Verify the burned-in text looks correct, then trigger the full render with montaj render project.json.
JSX Path
project.json structure
{
"version": "0.2",
"workflow": "lyrics_video",
"settings": { "resolution": [720, 1280], "fps": 30 },
"tracks": [
[],
[]
],
"audio": {
"tracks": [
{ "id": "music", "src": "/abs/path/to/song.mp3", "start": 0, "end": 0, "inPoint": 0, "volume": 1.0, "label": "song" }
]
},
"status": "draft"
}
tracks[0] = background video loop OR empty array
tracks[1] = lyric phrase overlays (one per segment from lyrics_sync)
Preview (mandatory)
After setting up project.json, open the project in the montaj UI and preview before rendering. The PreviewPlayer scrubs through the overlay timing in real time — verify word sync looks correct and adjust any start/end values in the overlay entries if needed. Only trigger a full render once the preview looks right.
tracks[0] — With background video
[
{
"id": "bg",
"type": "video",
"src": "/abs/path/to/background.mov",
"start": 0,
"end": <song_duration_seconds>,
"inPoint": 0,
"outPoint": <clip_duration_seconds>,
"loop": true
}
]
tracks[0] — Without background video
[]
lyric-phrase.jsx renders its own full-screen colored background automatically.
tracks[1] — Overlay entries
One entry per segment from the lyrics_sync output. The src must be an absolute path to the template:
<montaj_root>/render/templates/overlays/lyric-phrase.jsx
{
"id": "phrase-0",
"type": "overlay",
"src": "/abs/path/to/montaj/render/templates/overlays/lyric-phrase.jsx",
"start": 0.5,
"end": 3.2,
"props": {
"words": [
{"word": "drink", "start": 0.5, "end": 0.9},
{"word": "le", "start": 0.9, "end": 1.2},
{"word": "croix", "start": 1.2, "end": 1.8}
],
"variant": "pop",
"transparent": false,
"fontSize": 80,
"fontFamily": "system-ui, sans-serif",
"fontStyle": "normal",
"textColor": "#111111",
"strokeColor": null,
"strokeWidth": 0,
"wordEntrance": "scale",
"entranceDuration": 0.1,
"bg1": "#ffffff",
"bg2": "#efefef",
"bgStyle": "solid",
"bgAngle": 135,
"position": "center"
}
}
For Google Fonts, add "googleFonts": ["Font Name"] on the overlay item alongside "props", and set "fontFamily" to the font name in props.
Background styles
bgStyle |
Effect |
|---|---|
"solid" |
Flat color, alternates between bg1/bg2 per word (default) |
"gradient" |
Linear gradient bg1→bg2, angle drifts 20°/s — gives subtle motion per frame |
"radial" |
Radial gradient — bg1 at center, bg2 at edges (spotlight effect) |
"vignette" |
Solid bg1 with a dark radial overlay darkening the edges |
"bgAngle" sets the starting angle for "gradient" (default 135). The angle slowly drifts over time so adjacent phrases look slightly different even with the same colors.
Example — dark gradient with slow spin:
"bg1": "#0d0d0d",
"bg2": "#1a0030",
"bgStyle": "gradient",
"bgAngle": 135
Example — spotlight radial:
"bg1": "#1a1a2e",
"bg2": "#000000",
"bgStyle": "radial"
Set "transparent": true when a background video is in tracks[0].
Variant guide
| Variant | Best for | Notes |
|---|---|---|
"pop" |
Most lyrics, punchy phrases | One word at a time, bg flashes per word |
"accumulate" |
Building phrases, emphasis | Words stack at full font size, wrapping across lines to use vertical space |
"fade" |
Slow/emotional sections | Full phrase fades in/out |
"typewriter" |
Dramatic reveals | Characters reveal progressively |
"static" |
Match ffmpeg output exactly | All words shown immediately, wordsPerLine splits into N-word lines, auto font-sizing |
Creative tip: vary variant per phrase. Chorus → "pop" + "scale" entrance, bridge → "fade".
Word entrance animations
Applied per-word as it appears. Only meaningful for pop and accumulate.
| Value | Effect |
|---|---|
"none" |
Hard cut — word appears instantly (default) |
"scale" |
Word scales down from 1.3× to 1.0 while fading in |
"slide-up" |
Word slides up 20px while fading in |
"blur" |
Word unblurs from 8px to sharp while fading in |
"entranceDuration" controls speed in seconds (default 0.1 ≈ 3 frames at 30fps). Increase to 0.15–0.2 for more visible effect.
Text stroke
"strokeColor": "#000000",
"strokeWidth": 4
Stroke is painted under the fill (paint-order: stroke fill), so it doesn't eat into the letterforms.
Font
"fontFamily": "Montserrat, sans-serif",
"fontStyle": "italic",
"fontWeight": 700
For Google Fonts, also add "googleFonts": ["Montserrat"] on the overlay item.
textStyle escape hatch
Raw CSS spread onto the <p> — for anything not covered by props:
"textStyle": { "textTransform": "uppercase", "letterSpacing": "0.1em" }
Transparent mode (with background video)
When "transparent": true, no background color renders — text floats over tracks[0]. A text shadow is applied automatically for readability.
Dark footage: "textColor": "white", "transparent": true
Bright footage: "textColor": "#111111", "transparent": true
Resolution guide
| Format | Resolution | Use when |
|---|---|---|
| 9:16 portrait | [720, 1280] or [1080, 1920] |
TikTok, Reels, Shorts |
| 1:1 square | [720, 720] or [1080, 1080] |
Feed posts |
| 16:9 landscape | [1280, 720] |
YouTube |
Always match the background video's native resolution if one is provided. Use montaj probe --input video.mov to check.
Converting lyrics_sync output to overlay entries
The lyrics_sync output has shape:
{ "segments": [{ "text": "...", "start": 1.0, "end": 3.5, "words": [...] }], "audioInPoint": 26.6 }
For each segment, create one overlay entry with:
"id": "phrase-<index>""start": segment.start"end": segment.end"props.words": segment.words— subtractsegment.startfrom every word'sstartandendso timestamps are relative to the overlay's own start (the render engine injectsframestarting at 0 for each overlay, not the project timeline)- All other props set globally (same variant/colors for all phrases, unless varying by intent)
Set audio.tracks[0].inPoint to the audioInPoint value from the lyrics_sync output.