name: local-comfy-wan-multigpu description: Operate local-only ComfyUI Wan image-to-video workflows with MultiGPU placement, quality-first source staging, long-clip validation, and Explorer-based export handoff. Use when Codex needs to turn private local stills into local video artifacts without browser-only UX, while avoiding destructive crops, stray console windows, and GPU contention with chat models.
Local Comfy Wan MultiGPU
Use this skill when the goal is a real local media-generation pass, not just an API or health smoke.
All local paths and endpoints in this skill are intentionally parameterized.
Set environment variables (for example LOCAL_MEDIA_WORKBENCH_TOOLS_ROOT, LOCAL_COMFY_*, and LOCAL_MUSETALK_REPO) instead of hardcoding workstation-specific paths.
Workflow
- Initialize a deterministic work item folder before heavy generation. Use
scripts/init_media_job.py. - Keep the source library read-only and stage copies into a workspace or Comfy input folder before rendering.
- Build a thumbnail/contact sheet from candidate stills before choosing a source; do not pick an arbitrary image. Use
scripts/build_contact_sheet.py. - Reject destructive center-crop staging. Preserve composition with fit or pad unless the user explicitly wants a crop.
- Use
scripts/stage_reference_frame.pyfor deterministic fit or pad staging.
- Use
- Distinguish
smoke_testfromquality_render:smoke_test: shortest cheap proof only.quality_render: choose a properly framed source and push the workflow to the longest supported clip length.- Inspect the live Wan node schema first with
scripts/inspect_wan_limits.py. - Treat
16 fpsas the native Wan baseline unless a specific workflow proves otherwise.
- For heavy Wan runs, unload competing NVIDIA-resident chat models first and route chat to another local host/GPU if available.
- When using MultiGPU nodes, do not hide devices with
--cuda-device; keep both CUDA devices visible and place loaders deliberately. - Start Comfy through a monitored headless launcher that writes logs and fails closed if the server never becomes ready. Use
scripts/launch_comfy_headless.py. - Persist prompt/result artifacts next to the output clip so the exact render path is reproducible.
- Use
scripts/run_wan_flf2v.pyfor first/last-frame quality renders and living-portrait baselines. - Use
scripts/run_keyframe_range_pipeline.pywhen the user supplies a range of approved keyframes that should become a single complete video. - Use
scripts/run_complete_media_loop.pywhen the job should drive staging, raw motion, Jessica TTS, MuseTalk, QC, and export packaging in one pass.
- Use
- Extract review frames and build a review sheet from the finished clip with
scripts/review_video_result.py. - Visually inspect the generated review sheet and, when available, capture a local screenshot of the result viewer before deciding the run is acceptable.
- When the MX3 runtime is available, use it as an assist plane for source scoring and QC rather than as a pretend generator:
face-yunet,face-landmarks,face-headpose,face-eye-statefor source preflightpose-simcc,emotion-ferplus,face-eye-statefor raw-render body/eye/expression QCsegmentation-bisenet,depth-hr, andDaSiamRPNtracker pair for masks, ROI tracking, and mouth-region inspectionaudio-speech-enhancementonly for recorded/noisy audio, not clean Jessica TTS
- For loop-targeted shots, prefer loop-aware Wan nodes such as
WanFirstLastFrameToVideoover a plain one-way image-to-video run when they are available locally.
- Reusing the same approved still as both first and last frame is acceptable for a living-portrait loop baseline.
- A keyframe range should be treated as ordered endpoints, not as arbitrary cuts to concatenate blindly.
- When the clip should loop, keep the approved forward render as the source of truth and only build the reverse leg after approval:
scripts/prepare_capcut_loop_handoff.pyfor the preferred forward + reverse editor handoffscripts/run_loop_vfi_master.pyfor a Comfy-based VFI masterscripts/make_loop_master.pyonly as the fallback ffmpeg-only lane
- If the user asks for
30or60fps, render Wan at native cadence first and master upward with dedicated VFI instead of pretending Wan itself should be the final30/60fps source. - For longer spoken or directed clips, split the job into phrase-sized raw segments first, then stitch after approval:
- keep raw Wan slices near a real measured wall-clock budget instead of one giant render
- current quality-first policy targets about
18minutes per raw slice with a30minute hard cap - prefer natural phrase boundaries over arbitrary word chops, and do not split a segment mid-thought just to satisfy a budget
- for Penny spoken passes, aim for
1sentence,6-14words, and<= 7sper accepted spoken slice - stitch the approved segment masters after render; do not force a single monolithic raw pass when the workstation can parallelize slices better
- on current hardware, smaller slices are a feature, not a failure: they improve queue balance, review cadence, retry cost, and whole-system throughput
- If the clip needs speech, keep TTS/lip-sync as a separate pass and run local ASR QC after it.
- For Penny spoken passes, use the approved local Kokoro
af_jessicalane unless the user explicitly chooses another approved voice. - Use
scripts/synthesize_kokoro_tts.pywith the local Kokoro model files when the workbench needs a deterministic offline TTS source. - Use
scripts/run_musetalk_pass.pyfor the current animated-source spoken pass lane. - Use
scripts/analyze_spoken_clip.pyto compare the spoken output against the expected line. - Treat
LatentSync 1.6as the primary quality-upgrade target for mouth clarity; treatMuseTalkas the current fallback/baseline until the local LatentSync lane is fully installed.
- Apply lip-sync only after the body or portrait motion pass is approved. A rigid still-image talking head is a benchmark lane, not the final-quality spoken workflow, unless the user explicitly asks for it.
- Normalize talking-head lip-sync sources to
25 fpsvideo and16 kHzmono audio before MuseTalk / LatentSync-class processing. - If the user wants a loop that feels polished rather than algorithmic, finish the forward + reverse assembly in CapCut instead of forcing the whole loop into one automatic render step.
- Stop at one reviewable finished clip before creating
30/60 fpsvariants unless the user explicitly asks for VFI masters.
- Default spoken/mastered handoff is the reviewed
25 fpsclip. - Treat
30/60 fpsas opt-in post-review variants, not the automatic default.
- Before and after a quality batch, run the heterogeneous workbench evaluation so the pipeline reflects the real machine state:
<WORKBENCH_TOOLS>/evaluate_heterogeneous_media_pipeline.py- treat required local stack services as a prerequisite for media creation; fail closed if those health surfaces are down
- treat AMUSE on
<AMUSE_BASE_URL>as a GUI/inventory bridge unless a future native execution API is proven - treat
DeviceSelectorMultiGPUoptions as the source of truth when/system_statsunderreports CUDA visibility - validate the selected Wan bundle and template currency before heavy rendering with
scripts/validate_wan_bundle.py - fail closed if the selected bundle is incomplete or the template pack is stale for the chosen lane
- keep the spoken benchmark ladder explicit:
wan22_i2v_a14b_flf2v -> wan22_ti2v_5b -> wan22_s2v_14b -> wan22_animate_14b
- Use the workstation heterogeneously instead of forcing every step through one GPU:
- NVIDIA
5060 Ti: preferred raw Wan lane and heavier segment owner - NVIDIA
4060 Ti: secondary raw Wan lane with smaller slices matched to its measured throughput - Radeon
7600 XT: peer generation lane when a Radeon Comfy worker is healthy; otherwise use it continuously for AMUSE DirectML generation/repair/upscale, Qwen review, and sidecar critique - do not call the Radeon a Comfy peer lane unless a HIP-enabled Comfy runtime is actually live on the configured peer endpoint; OpenCL visibility or AMUSE inventory alone is not enough
- do not idle the Radeon just because the current raw Wan path is NVIDIA-first; it must always have review, repair, upscale, benchmark, or peer-generation work
- When MX3 is healthy, switch the shim feeder into media-assist mode before long QC loops:
- capture feeder status first, then enable media assist, and restore the prior feeder state when the media pipeline exits cleanly
<WORKBENCH_TOOLS>/set_mx3_media_assist_mode.py --mode status<WORKBENCH_TOOLS>/set_mx3_media_assist_mode.py --mode enable<WORKBENCH_TOOLS>/set_mx3_media_assist_mode.py --mode restore --from <saved-status-json>- run
scripts/mx3_frame_gate.pyon each raw slice before lip-sync; the current implementation is a post-render frame-QC scaffold until a thinner dedicated MX3 face/headpose route is wired
- Build a structured multi-lane review artifact for every serious batch:
<WORKBENCH_TOOLS>/build_multilane_review_report.py- use Radeon
Qwen3.5-VLfor image review, RadeonQwen3.5text for distillation when useful, and include the current runtime evaluation - if an optional local vision supervisor is running, treat its
vision/analyze-framesurface as an extra review lane - when a segmented batch is running, dispatch review immediately on
slice_completed; do not wait for the whole batch to finish before the Radeon and MX3 start critiquing - for already-running batches that started before the latest orchestration patch, attach
<WORKBENCH_TOOLS>/monitor_segmented_batch.py --review-completed-slices --watch-until-idle
- Open the export folder in Explorer when finished only when the user actually wants that handoff. Do not treat browser links as the primary handoff.
- Clean up stray Comfy/Python console windows after the render completes unless the user explicitly wants the console left open.
Quality Guardrails
- Never reuse a previous cropped poster as the main creative source unless the user asked for that crop.
- Do not present a proof artifact as a quality result.
- If the chosen input is too tight, off-angle, or badly framed, stop using it and stage a better source.
- Prefer face-plus-shoulders, torso, or full-character references over mouth-only or chin-only crops for character animation tests.
- If the render loop loses the Comfy server or the output video record never appears, fail closed and report that explicitly instead of waiting indefinitely.
- Distinguish
raw_renderfrommastered_output. A higher-fps looping master may be derived from the raw render when that gives a better result than a more expensive rerun. - Do not call a
6 fpsor similarly under-cranked output a useful dry run unless the user explicitly asked for a bare smoke test. - Do not let one oversized segment monopolize the slower lane. Re-split before rerendering if the measured slice budget is wrong.
- Do not cut spoken text into nonsense fragments just to satisfy a budget. Prefer phrase-sized splits and only split again when a chunk truly cannot fit the available render lanes.
- Do not accept any slice where Penny turns away from camera or drifts into an off-camera/profile-view pose.
- Do not call a spoken clip acceptable until ASR QC says the spoken line is close enough to the intended line.
- Do not let the lip-sync lane quietly change the intended pacing or phoneme content without recording it in the QC output.
- Do not use Windows SAPI / Microsoft Sam / generic fallback voices for a Penny quality pass when the local Kokoro
af_jessicalane is available. - Do not treat a rigid-body talking-head lip-sync pass as the final spoken output when the user wants a full animated result.
- Do not treat MuseTalk output as clean just because ASR passes; inspect the mouth region for flashing, blur, or mask artifacts before accepting it.
- Do not let MX3-side scores override obvious visual failures; MX3 is an assist/QC plane, not final creative judgment.
- Do not create
30/60 fpsvariants until the25 fpsreviewed master is accepted.
MultiGPU Placement
For Wan image-to-video lanes:
- Use MultiGPU loader variants when available.
- Keep CLIP on CPU when that reduces VRAM pressure without hurting throughput materially.
- Prefer the
5060 Tilane for high-noise UNet work and the4060 Tilane for the low-noise / helper lane when both are visible to the runtime. - Split high-noise and low-noise UNet placement across both NVIDIA GPUs when the workflow supports it.
- Record the actual device placement in the saved prompt artifact.
- For segmented spoken batches with separate single-GPU Comfy lanes, treat throughput benchmarks as planning inputs rather than assuming both cards are equal.
Long-Clip Rule
When the user asks for a useful render, inspect the live node/object schema or workflow template and choose the longest supported clip length that is practical on the current machine. Do not default to a 4-second or near-minimal clip unless the user asked for a quick smoke.
Native Cadence Rule
For Wan image-to-video work, prefer the official/native cadence and master upward later:
- raw Wan baseline:
16 fps - dry run baseline:
81frames at a480p-class area or portrait-equivalent area - high-fps master:
RIFEorFILMafter raw approval
Scope Boundary
Use this skill for direct local ComfyUI/Wan media work.
Do not use this skill for:
- OpenAI Sora/API video generation only.
- Repo-scoped adapter validation only.
- Generic video editing guidance without local render execution.
References
references/complete-work-loop.mdreferences/media-mode-checklist.mdreferences/mx3-media-assist.mdreferences/penny-continuity-contract.mdreferences/review-loop.mdreferences/source-selection-rules.mdreferences/wan-native-vfi-best-practices.md
Loopback
If the lane is blocked or the render is poor:
- Capture the exact prompt/result artifact and output path.
- State whether the failure was source-choice, workflow-choice, model/runtime limits, or shell/UI policy.
- Re-stage a better source and rerun the longest practical clip instead of defending a bad artifact.