name: short-drama-generation
description: Use when the user asks to create a short drama from a prompt, e.g. "生成短剧", "制作微短剧", "make a short drama", "turn this script into a short drama", "ショートドラマを生成", "숏드라마를 만들어". Orchestrates script, storyboard, optional character images, per-shot video clips, and ffmpeg composition through the existing media tools.
allowed-tools:
  - Bash
  - Read
  - Write
  - Canvas
  - CanvasState
user-invocable: true
disable-model-invocation: false
requires-action: true
You orchestrate a short drama by driving the existing media tools yourself. There is no single short-drama-generation tool. allowed-tools is guidance; media generation is enforced by the internal tool permission path.
Trigger only on a request to CREATE/generate a short drama. Requests to analyze, rewrite, summarize, or brainstorm a script do NOT trigger this skill unless the user asks to produce the drama. Progress-only or promise-only replies are not completion: after starting, either drive the pipeline or report the concrete blocker plainly.
Gating mechanism (confirm each authored stage before spending credits)
Each stage that authors content (script, storyboard, character images, per-shot
results, final order) is gated through the existing Canvas/CanvasState tools so
the user reviews and edits a draft before you act on it. The mechanism uses no new
infrastructure — it is render → end-turn → re-entry → read-back:
- In a desktop inline environment, render the draft with Canvas, using canvasId=canvas-drama-<slug>-stage<N> (N = the stage number; <slug> is just a cosmetic label, and the backend assigns the real canvasId). Then end the current turn; do NOT busy-wait or poll inside the skill.
- The user edits the Canvas and submits; submission automatically starts a new turn. That new turn first calls CanvasState with the same canvasId to read back the confirmed values, then writes artifacts and advances to the next stage.
- Non-desktop environment (no inline submit / no turn re-trigger): degrade to text-based per-stage confirmation, stating plainly that there is no visual gating in this environment.
- Never call imagegen/videogen and never advance on draft values before the confirmation for that stage has been read back.
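The read-back rule above can be sketched as a small guard. This is a minimal sketch, assuming the CanvasState result is available as a dict with a `values` key; that shape, and the names `stage_canvas_id`, `confirmed_values`, and `NoConfirmation`, are hypothetical and not part of the tools.

```python
from typing import Any, Mapping, Optional


class NoConfirmation(Exception):
    """Raised when a gated stage has no submitted values to act on."""


def stage_canvas_id(slug: str, stage: int) -> str:
    # Cosmetic label only: the backend assigns the real canvasId.
    return f"canvas-drama-{slug}-stage{stage}"


def confirmed_values(readback: Optional[Mapping[str, Any]]) -> Mapping[str, Any]:
    # ASSUMPTION: the read-back is dict-like and carries the user's
    # submitted fields under "values"; an empty or missing entry means
    # the user did not submit, so the draft must not be used.
    values = (readback or {}).get("values")
    if not values:
        raise NoConfirmation("no confirmation received")
    return values
```

Raising instead of returning the draft keeps the "never advance on draft values" rule in one place.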
Pipeline (run in order; skip any stage whose inputs the prompt already supplies)
Form the drama <id> as <slug>-<session8>, where <slug> is a short kebab slug of
the drama title and <session8> is the first 8 characters of the session id (the
sessionId field returned by every CanvasState read-back — first available after the
Stage 0 read-back). The <session8> suffix is what makes each run land in a fresh
directory; without it, re-running a similar prompt collides with an earlier run's files
and the manifest write fails. Project files go under .puffer/media/drama/<id>/.
Mint <id> exactly once — when you first create .puffer/media/drama/<id>/ (the
Stage 0 manifest write, which happens after the Stage 0 read-back) — then reuse that
exact <id> verbatim for every later stage, recovering it from your own earlier
writes in the conversation. Never re-derive a bare title slug, and never write into a
pre-existing drama directory. (Same session asked for a second drama → append -2,
-3, ….) Generated image/video artifacts are written by the tools to
.puffer/media/images|videos/ — you only reference them, never relocate them.
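The id rule above can be sketched as follows. This is a minimal sketch, not the skill's implementation: `slugify` and `mint_drama_id` are hypothetical helpers, the ASCII-only slug rule and the `drama` fallback for titles with no ASCII letters or digits are assumptions, and `existing_ids` stands for whatever ids this session has already minted.

```python
import re
from typing import Iterable


def slugify(title: str) -> str:
    # ASSUMPTION: lowercase ASCII letters and digits only, runs of
    # anything else collapsed to a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "drama"  # hypothetical fallback, e.g. for a CJK-only title


def mint_drama_id(title: str, session_id: str, existing_ids: Iterable[str] = ()) -> str:
    taken = set(existing_ids)
    # <slug>-<session8>: the first 8 characters of the session id.
    base = f"{slugify(title)}-{session_id[:8]}"
    if base not in taken:
        return base
    # Same session asked for another drama: append -2, -3, ...
    n = 2
    while f"{base}-{n}" in taken:
        n += 1
    return f"{base}-{n}"
```

Call it once, at the Stage 0 manifest write, and reuse the returned string verbatim afterwards rather than calling it again.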
Models (gate before any credit-consuming stage). Confirm the image and video provider/model up front. Both are mandatory.
Render Canvas with canvasId = canvas-drama-<slug>-stage0 whose body is a single mediaModelSelect node:

    { "type": "Canvas", "canvasId": "canvas-drama-<slug>-stage0", "spec": { "title": "Models", "body": [ { "type": "mediaModelSelect" } ] } }

The node is self-populating: it fetches the connected image/video capabilities itself, seeds each dropdown from the currently-saved global media defaults, renders an in-place "connect a provider in Settings" prompt for any kind with no connected provider, and on confirmation persists the choice back to the global media settings. Do not fetch capabilities, build options, or branch on empty lists yourself. Then end the turn.

In the next turn read back with CanvasState (same canvasId): values carries {imgProvider, imgModel, vidProvider, vidModel}. Validate: if imgModel or vidModel is empty, stop and report that both image and video models must be selected (direct the user to Settings if a kind has no provider). Record these values in manifest.json for Stages 3/4. This manifest write first creates .puffer/media/drama/<id>/, so this is where you mint <id> = <slug>-<session8> using the sessionId from this read-back.

Script. If the prompt already contains a script (or names a script file), use it directly (no gate needed). Otherwise draft one, then gate it: render
Canvas with canvasId = canvas-drama-<slug>-stage1 and spec {title:"Script draft",body:[{type:"textarea",id:"script",rows:14,value:"<draft>"}]}. The spec is exactly this; the canvas title is the only heading. Do not add a summary, do not wrap the textarea in a card, and do not set regenerable; the script draft textarea shows directly with only a Submit action. Then end the turn. In the next turn read it back with CanvasState (same canvasId) and save values.script to .puffer/media/drama/<id>/script.md.

Storyboard. If the prompt already contains a shot breakdown, use it directly. Otherwise break the script into ordered shots (aim for a handful; one beat per shot). Give each shot a stable lowercase id (shot-001, shot-002, …) and record: subject, action, scene, lighting, camera, style, target duration (seconds), which characters appear, and any stability constraints. These fields become the video prompt; richer shots yield better clips.

Gate the draft: render Canvas with canvasId = canvas-drama-<slug>-stage2 and spec {title:"Storyboard",body:[{type:"editableTable",id:"storyboard",layout:"cards",columns:["shotId","subject","action","duration","characters"],rows:<draft shots>}]} (layout:"cards" renders one card per shot with column 0 = shotId as the card title and the rest as labeled wrapping fields; the editableTable sits directly in body). Do not wrap it in a card and do not set regenerable. Then end the turn. In the next turn read it back with CanvasState: values for the editableTable id "storyboard" is the confirmed 2D array. In a single step, write .puffer/media/drama/<id>/storyboard.md (a markdown table of the confirmed rows) and seed manifest.json's shots[]: column 0 is the shotId, and the remaining columns become the shot's prompt fields.

Character images (reference for video). Scan the prompt for image references that are
https:// or asset:// URLs.

If present, use those URLs directly as --image-reference in stage 4. Do NOT generate images. Note: asset:// references only resolve on upload-capable video providers (e.g. WorldRouter, which uploads the reference itself). A direct-URL provider (e.g. BytePlus) sends the reference verbatim to the model and cannot fetch an asset:// handle; if the chosen video provider is direct-URL and the prompt supplied an asset:// reference, ask the user for a public https:// URL instead of passing it through.

If absent and the user wants character-consistent shots, generate one image per character in parallel; never fold the cast into a grouped call. Identity is bound by which call you issued: each imagegen call is 1:1 with one character, so the returned artifact IS that character's reference; never bind by returned image order (providers do not guarantee it) and never identify a character by inspecting pixels. The call-to-character mapping is known at issue time and is the only binding used.

Square is carried by the prompt; never pass --aspect. The reference image must read as square, but different image models support different ratio knobs and some reject any explicit ratio. Do not pass --aspect; squareness is carried entirely by the mandatory square clause in the per-character prompt below.

Style anchor (compose once, reuse verbatim). Before generating, write a single shared style phrase describing the drama's overall look (medium, rendering, palette, line/lighting), derived from the storyboard's style field, e.g. "flat 2D anime illustration, soft cel shading, muted warm palette, clean outlines". It is the anchor that keeps the whole cast in one consistent style; every per-character call reuses it verbatim and never varies it per character. This shared anchor, not a grouped call, is what makes the separately-generated cast cohere.

Collect the distinct character names from the confirmed storyboard's characters column and emit the per-character imagegen calls together in a single turn so the backend runs them concurrently (one approval unblocks the whole batch). Cap each turn at 5 imagegen calls; more than 5 characters → send successive turns of ≤5 (e.g. 7 characters → 5 then 2). One imagegen per character, never folding two into one call: N characters → N calls → N images. For each character build:

    <style anchor>, square 1:1 composition with equal width and height, full-body head-to-toe front view of <character + appearance>, standing, centered, plain pure-white background, even studio lighting, no text, no letters, no watermark, no logo, no captions

then run:

    imagegen --prompt "<that prompt>" --count 1 --provider <imgProvider> --model <imgModel>

One call → one character → one image; the N (≤5 per turn) calls go out together as one parallel batch, and you read every result back after the batch returns. Never add --aspect.

Make each character stylized / non-photorealistic (cartoon, 3D render, illustration): image-to-video providers (e.g. BytePlus) reject photoreal real-person images on moderation. Never combine multiple characters into one image. For each returned image read the tool result's remoteSourceUrl (same key the video tool uses):

- If remoteSourceUrl is present, record it under that character in manifest.json characterRefs ({ "<character>": "<url>" }) and use it as that character's --image-reference in stage 4.
- If a character's image failed or its remoteSourceUrl is absent while other characters got one, that single character has no usable reference: record no characterRefs entry for it and let it fall back to text-to-video for the shots it appears in (Stage 4). The rest proceed normally; one missing image never aborts the cast.
- If no character in the whole cast produced a remoteSourceUrl, that is the configured image provider not producing referenceable URLs at all: stop and report that image-to-video is unavailable. Do NOT silently degrade the entire cast to text-to-video; the user chose an image model on purpose.

If absent and consistency is not required, run text-to-video in stage 4.
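The per-character prompt and the ≤5-per-turn batching described above can be sketched like this. It is a minimal sketch under assumptions: `character_prompt`, `imagegen_command`, and `chunks` are hypothetical helper names, and the command string only mirrors the flags quoted in this document (no other imagegen flags are implied).

```python
import shlex
from typing import List, Sequence

SQUARE_CLAUSE = "square 1:1 composition with equal width and height"
SHEET_TAIL = (
    "standing, centered, plain pure-white background, even studio lighting, "
    "no text, no letters, no watermark, no logo, no captions"
)


def character_prompt(style_anchor: str, character: str) -> str:
    # The style anchor is reused verbatim for every character.
    return (
        f"{style_anchor}, {SQUARE_CLAUSE}, "
        f"full-body head-to-toe front view of {character}, {SHEET_TAIL}"
    )


def imagegen_command(prompt: str, provider: str, model: str) -> str:
    # One call per character; no ratio flag is ever added, because
    # squareness is carried by the prompt's square clause.
    argv = ["imagegen", "--prompt", prompt, "--count", "1",
            "--provider", provider, "--model", model]
    return shlex.join(argv)


def chunks(items: Sequence[str], size: int = 5) -> List[List[str]]:
    # 7 characters -> one turn of 5 calls, then one turn of 2.
    return [list(items[i:i + size]) for i in range(0, len(items), size)]
```

Because each command is built for exactly one character, the call-to-character mapping is fixed when the command is issued, which is the only binding the stage relies on.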
When you have generated the per-character images for all chunks (not after each chunk), gate the choice once: render
Canvas with canvasId = canvas-drama-<slug>-stage3 and title:"Character image", whose body is a single mediaPicker with no wrapping card: {type:"mediaPicker", id:"pick", multi:true, value:[<every item id>], items:[{id,url,label,description}, …]}, with one item per character. Set url to that character's remoteSourceUrl (or its asset url on desktop), label to the character name only, and description to that character's sheet description. value lists every item id, so all characters are checked by default. Then end the turn. In the next turn read it back with CanvasState: pick is the array of checked item ids; map each back to its character via characterRefs. For each checked character, its characterRefs url (the remote remoteSourceUrl) is the stage 4 --image-reference value; the picker's url is the thumbnail only (it may be a desktop asset:// url) and is never used as the reference. Any unchecked character falls back to text-to-video for the shots it appears in. There is no Regenerate toggle; to redo a character, generate it again and re-render this canvas.

Per-shot video. Generate the shots in parallel: emit a chunk's
videogen calls together in a single turn so the backend runs them concurrently (one approval unblocks the batch). Cap each turn at 5 videogen calls; more than 5 shots → successive turns of ≤5 (e.g. 12 shots → 5, 5, 2). Generation order does not matter here; the final play order is confirmed in Stage 5. Build one videogen command per shot:

    videogen --prompt "<@Image bindings><shot visual + action>" --provider <vidProvider> --model <vidModel>

- Add --image-reference <url> for each character in that shot's characters column that has a checked entry, taking the url from characterRefs[<character>] (the remote remoteSourceUrl recorded in Stage 3, not the picker's display url), in stable order. Bind every reference in the prompt: the provider does not map image→character by upload order, so a multi-character shot mis-assigns faces without explicit tags. Prefix the prompt with one binding line per reference, numbered to match the --image-reference flags exactly (@Image1 = the first --image-reference, @Image2 = the second, …): @Image1 = <character + one-line appearance>, keep this character's face, hair, and outfit consistent; @Image2 = <next character + appearance>, … and only THEN the shot's visual + action. A shot with one reference still gets its single @Image1 = … line. A shot whose characters are all unchecked or unavailable runs text-to-video (no @Image bindings).
- Each videogen call polls its clip to completion in its own parallel worker, so a chunk finishes in roughly the slowest single clip's time, not the sum. Set an explicit long Bash timeout within the current Bash cap on each call, sized for the slowest single clip, never for the whole drama. One call → one finished clip.
- Read path and artifactId from the tool result and record both into the manifest as videoPath and videoArtifactId (see below).
- After all shot chunks have finished (not after each chunk), gate the keep/drop selection once (mirroring stage 3): render Canvas with canvasId = canvas-drama-<slug>-stage4 and title:"Per-shot video", whose body is a single mediaPicker with no wrapping card: {type:"mediaPicker", id:"shots", multi:true, value:[<every succeeded shotId>], items:[{id,kind:"video",artifactId,label,description}, …]}, with one item per SUCCEEDED shot. Set id and label to the shotId, kind to "video", artifactId to that clip's videoArtifactId from the manifest, and description to the shot's prompt summary. The picker renders each tile from its artifact's first-frame poster, so the item needs no path. value lists every succeeded shotId, so all clips are checked by default. Then end the turn. In the next turn read it back with CanvasState: shots is the array of checked shotIds; these are the clips kept for composition, and unchecked shots are dropped. There is no retry; to redo a shot, re-run its videogen and re-render this canvas (same as stage 3's redo note). A shot whose videogen failed is not added as a tile; report failed shots plainly in turn text (see Failure contracts).
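The reference binding for one shot can be sketched as below. This is a minimal sketch, not the tool's contract: `videogen_argv` and its parameters are hypothetical, `appearances` stands for a per-character one-line description you already hold, and joining the bindings to the shot text with ". " is an assumption (the document only fixes the order, bindings first).

```python
from typing import Dict, List, Sequence, Set


def videogen_argv(
    visual: str,
    characters: Sequence[str],
    appearances: Dict[str, str],
    character_refs: Dict[str, str],
    checked: Set[str],
    provider: str,
    model: str,
) -> List[str]:
    # Keep only characters checked in the Stage 3 picker that also have
    # a recorded remote URL; everyone else falls back to text-to-video.
    bound = [c for c in characters if c in checked and c in character_refs]
    bindings = "; ".join(
        f"@Image{i} = {appearances.get(c, c)}, "
        "keep this character's face, hair, and outfit consistent"
        for i, c in enumerate(bound, start=1)
    )
    # ASSUMPTION: ". " as the separator between bindings and shot text.
    prompt = f"{bindings}. {visual}" if bound else visual
    argv = ["videogen", "--prompt", prompt, "--provider", provider, "--model", model]
    # Flags are appended in the same order as the @ImageN numbering.
    for c in bound:
        argv += ["--image-reference", character_refs[c]]
    return argv
```

Building the numbering and the flags from the same `bound` list is what keeps @Image1 aligned with the first --image-reference.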
Compose. Before composing, gate the final order and mux mode: render
Canvas with canvasId = canvas-drama-<slug>-stage5, a card containing an editableTable (id:"order", columns: ["shotId"], rows = the stage-4-kept clips in current order; the user confirms/reorders) and a singleSelect (id:"mux", options copy / re-encode), then end the turn. In the next turn read it back with CanvasState: compose in the confirmed values.order, preferring stream-copy unless values.mux is re-encode.

Stitch the successful shot clips in the confirmed order with ffmpeg. First probe ffmpeg: command -v ffmpeg. If missing, stop and report; do not fake a file. Include only the stage-4-kept clips (the confirmed values.order); if none were kept, skip composition and report. Build the concat list with single-quote escaping (each clip line is file '<path>', with any ' in the path written as '\''). Prefer stream-copy (clips from the same provider share codec/params); only if concat-copy fails with a codec/params mismatch, retry with a re-encode:

    : > .puffer/media/drama/<id>/concat.txt
    # append one line per kept clip, in the confirmed order (escape single quotes):
    printf "file '%s'\n" "<clip path, ' -> '\\''>" >> .puffer/media/drama/<id>/concat.txt
    # primary: fast, no re-encode
    ffmpeg -f concat -safe 0 -i .puffer/media/drama/<id>/concat.txt \
      -c copy .puffer/media/drama/<id>/final.mp4
    # fallback only if the copy fails on mismatched streams:
    ffmpeg -f concat -safe 0 -i .puffer/media/drama/<id>/concat.txt \
      -c:v libx264 -pix_fmt yuv420p .puffer/media/drama/<id>/final.mp4

If some shots failed but others composed, report it as a partial drama and list the missing shot ids.
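The concat-list escaping and the two ffmpeg invocations above can be sketched as pure helpers. This is a minimal sketch with hypothetical function names; it only assembles the list text and the argument vectors already shown in this document, and leaves running them to the caller.

```python
import shutil
from pathlib import Path
from typing import List, Sequence, Union


def concat_line(clip_path: str) -> str:
    # Inside a single-quoted concat entry, a literal ' is written as '\''.
    escaped = clip_path.replace("'", "'\\''")
    return f"file '{escaped}'\n"


def write_concat_list(list_path: Path, clip_paths: Sequence[str]) -> None:
    # One line per kept clip, in the confirmed order.
    list_path.write_text("".join(concat_line(p) for p in clip_paths), encoding="utf-8")


def ffmpeg_argv(list_path: Union[str, Path], out_path: Union[str, Path],
                reencode: bool = False) -> List[str]:
    # Stream-copy by default; re-encode only as the fallback.
    codec = ["-c:v", "libx264", "-pix_fmt", "yuv420p"] if reencode else ["-c", "copy"]
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(list_path),
            *codec, str(out_path)]


def ffmpeg_available() -> bool:
    # Same intent as `command -v ffmpeg`: stop and report if this is False.
    return shutil.which("ffmpeg") is not None
```

Passing an argument vector rather than a shell string means only the concat-file quoting needs care; the paths on the command line need no shell escaping.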
Manifest (your working ledger — keep it simple)
Maintain .puffer/media/drama/<id>/manifest.json as you go. It is a plain ordered list,
not a schema'd artifact:
{
"id": "<id>",
"shots": [
{ "shotId": "shot-001", "status": "succeeded", "prompt": "...", "imageReferences": ["https://..."], "videoArtifactId": "...", "videoPath": ".puffer/media/videos/<aid>/..." }
],
"final": ".puffer/media/drama/<id>/final.mp4"
}
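Keeping the ledger current can be sketched with two small helpers. This is a minimal sketch over the manifest shape shown above; `record_clip` and `missing_shot_ids` are hypothetical names, and treating every status other than "succeeded" as missing is an assumption, since only "succeeded" appears in this document.

```python
from typing import Any, Dict, List


def record_clip(manifest: Dict[str, Any], shot_id: str, path: str, artifact_id: str) -> None:
    # Store the tool result's path and artifactId under the manifest keys
    # videoPath and videoArtifactId for the matching shot.
    for shot in manifest["shots"]:
        if shot["shotId"] == shot_id:
            shot.update(status="succeeded", videoPath=path, videoArtifactId=artifact_id)
            return
    raise KeyError(shot_id)


def missing_shot_ids(manifest: Dict[str, Any]) -> List[str]:
    # ASSUMPTION: any status other than "succeeded" counts as missing.
    return [s["shotId"] for s in manifest["shots"] if s.get("status") != "succeeded"]
```

The second helper gives the list of shot ids to name when reporting a partial drama.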
Failure contracts (never paper over)
- If a kind has no connected provider, the Stage 0 node shows an in-place "connect a provider in Settings" prompt and that model stays empty; on read-back, stop and tell the user to connect a provider for that kind in Settings — never fall back to text-to-video or to config defaults.
- If imgModel or vidModel is empty on read-back, stop and report that both image and video models must be selected before continuing.
- Never pass --aspect to imagegen; squareness comes only from the prompt's square clause. This sidesteps "axis ratio value not allowed" / "ratio not mapped" rejections on models that only support their own default ratio. If an image still comes back slightly non-square, that is acceptable for a reference frame; do not retry with a ratio flag.
- A parallel media batch (Stage 3 images or Stage 4 videos) raises the media approval once for the whole turn. Choose Always allow when first prompted so later chunks run with no further prompt; "Approve once" only covers the current turn, so each chunk re-prompts. A trailing chunk of a single call uses the normal single-command path and may prompt on its own unless Always allow was chosen; this is expected, not an error.
- Do not advance any gated stage on draft values — wait for the stage's confirmation to be read back first.
- If CanvasState returns no value for a gated stage (the user did not submit), report "no confirmation received" and stop; do not fall back to the draft.
- In a non-desktop environment (no Canvas / no inline submit), degrade to text-based per-stage confirmation, including provider/model, and say so plainly; do not skip the confirmation.
- If a chosen video provider is Relaydance (prompt-only) and the user wants image references, report that the configured provider does not support image references.
- If ffmpeg is unavailable or composition fails, report it plainly and keep the per-shot clips; do not claim a composed drama was produced.
- Report final-video success only when final.mp4 actually exists; a missing final video can still leave useful per-shot clips, so say so rather than implying success.
- Do not hand-author placeholder media (SVG, stills, stub mp4) and present it as generated output.