name: integrate-new-backbone
description: "Step-by-step recipe for plugging a new video DiT backbone into minWM, grounded in the HunyuanVideo (HY15) and Wan 2.1 reference integrations. Use when a user wants to add a new backbone to the framework."
license: MIT
integrate-new-backbone
Reference implementations: HY15 (MMDiT, 8B) and Wan21 (Cross-attention DiT, 1.3B). Whichever path you choose, it usually helps to follow the reference whose architecture is closer to your new backbone.
Framework Layout
minWM/
├── HY15/ # HunyuanVideo backbone
│ ├── hyvideo/ # model implementation (models/, pipelines/, prope/, schedulers/)
│ └── trainer/ # trainer (configs/, pipelines/, training/)
├── Wan21/ # Wan 2.1 backbone
│ ├── model/ # model implementation (base.py + per-stage model files)
│ ├── pipeline/ # inference / training pipelines
│ └── wan_utils/ # wrapper, loss, dataset, scheduler, etc.
└── shared/ # shared across backbones (SP comms, config bases, algorithms)
Each backbone is a self-contained top-level directory, mounted via PYTHONPATH; the two backbones do not import from each other. A new backbone follows the same pattern: create a new top-level directory (e.g. NewBB/).
Step 1: Wrapper Layer — wrap the new model behind a unified interface
Wan21 does this in wan_utils/wan_wrapper.py, exposing three wrappers:
- WanDiffusionWrapper: wraps the DiT, handles forward and the causal/bidirectional switch
- WanTextEncoder: text encoder
- WanVAEWrapper: VAE encode/decode
Suggested approach: write the equivalent three wrappers for the new backbone, ideally keeping the interface aligned with:
- DiffusionWrapper.forward(latents, timestep, encoder_hidden_states, **kwargs) -> noise_pred
- VAEWrapper.encode_to_latent(pixel) -> latent
- VAEWrapper.decode_to_pixel(latent) -> pixel
The HY15 counterparts live in hyvideo/models/ and trainer/models/ — useful as a second reference.
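The interface suggested above can be pinned down as abstract base classes before any real model code is written. The following is a minimal sketch, not the Wan21 or HY15 source: the base classes, the Any type hints, and the Toy* stand-ins are all hypothetical, and plain Python lists stand in for tensors so the example runs without a deep-learning framework.

```python
from abc import ABC, abstractmethod
from typing import Any


class DiffusionWrapper(ABC):
    """Hypothetical base class; tensors are typed as Any to stay framework-agnostic."""

    @abstractmethod
    def forward(self, latents: Any, timestep: Any,
                encoder_hidden_states: Any, **kwargs: Any) -> Any:
        """Return the noise prediction for `latents` at `timestep`."""


class VAEWrapper(ABC):
    """Hypothetical base class for the VAE encode/decode pair."""

    @abstractmethod
    def encode_to_latent(self, pixel: Any) -> Any:
        """Map pixel-space video to latents."""

    @abstractmethod
    def decode_to_pixel(self, latent: Any) -> Any:
        """Map latents back to pixel-space video."""


class ToyDiffusionWrapper(DiffusionWrapper):
    """Stand-in for a real DiT: predicts zero noise for every latent element."""

    def forward(self, latents, timestep, encoder_hidden_states, **kwargs):
        return [0.0 for _ in latents]


class ToyVAEWrapper(VAEWrapper):
    """Stand-in for a real VAE: scales values by a fixed factor."""

    scale = 0.5

    def encode_to_latent(self, pixel):
        return [p * self.scale for p in pixel]

    def decode_to_pixel(self, latent):
        return [z / self.scale for z in latent]
```

Keeping the method names identical across backbones means the pipelines and training models only ever depend on this surface, so a round-trip check such as decode_to_pixel(encode_to_latent(x)) can serve as the first smoke test for a new backbone.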
Step 2: BaseModel — hook into the training abstraction
Wan21 training models inherit from Wan21/model/base.py::BaseModel. The core is _initialize_models(), which instantiates:
self.generator # trainable DiT (causal)
self.real_score # frozen teacher (bidirectional)
self.fake_score # trainable fake score
self.text_encoder
self.vae
self.scheduler
Suggested approach: create NewBB/model/base.py with a similar structure, replacing WanDiffusionWrapper etc. with your own wrappers.
The per-stage model files (camera_dmd.py, dmd.py, ode_regression.py, ...) inherit BaseModel and implement their own forward(). The new backbone follows the same pattern.
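A skeleton for NewBB/model/base.py might look like the sketch below. It only mirrors the attribute names listed above; the constructor arguments, the factory callables, the ToyDiT stand-in, and the choice to make fake_score bidirectional are assumptions made for illustration and are not taken from Wan21/model/base.py.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyDiT:
    """Hypothetical stand-in for a wrapped DiT."""
    is_causal: bool
    trainable: bool = True


class BaseModel:
    """Hypothetical skeleton; factories are injected so the sketch stays backbone-agnostic."""

    def __init__(self,
                 make_dit: Callable[..., Any],
                 make_text_encoder: Callable[[], Any],
                 make_vae: Callable[[], Any],
                 make_scheduler: Callable[[], Any]) -> None:
        self._make_dit = make_dit
        self._make_text_encoder = make_text_encoder
        self._make_vae = make_vae
        self._make_scheduler = make_scheduler
        self._initialize_models()

    def _initialize_models(self) -> None:
        # Trainable causal student.
        self.generator = self._make_dit(is_causal=True)
        # Frozen bidirectional teacher. With a real framework this would be
        # something like requires_grad_(False) on the module.
        self.real_score = self._make_dit(is_causal=False)
        self.real_score.trainable = False
        # Trainable fake score (bidirectional here by assumption).
        self.fake_score = self._make_dit(is_causal=False)
        self.text_encoder = self._make_text_encoder()
        self.vae = self._make_vae()
        self.scheduler = self._make_scheduler()


model = BaseModel(make_dit=ToyDiT, make_text_encoder=object,
                  make_vae=object, make_scheduler=object)
```

Per-stage models would then subclass this BaseModel and add their own forward(), reusing the attributes set up in _initialize_models().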
Step 3: Pipeline — inference and training flow
Each backbone has its own pipeline directory:
- Wan21/pipeline/: SelfForcingTrainingPipeline, BidirectionalTrainingPipeline
- HY15/trainer/pipelines/: split per stage (ar_hunyuan_training_pipeline.py, ar_causal_cd_pipeline.py, ...)
A pipeline takes noise plus conditioning, runs the denoising steps, and returns a trajectory.
Suggested approach:
- Start by getting the bidirectional (Phase 1) pipeline working; the simplest implementation in Wan21/pipeline/ is a friendly place to begin.
- Then progressively integrate the AR diffusion TF → causal ODE → causal CD → DMD stages.
Key interface: inference_with_trajectory(noise, clean_image_or_video, **conditional_dict), returning (pred, timestep_from, timestep_to).
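To make the shape of that interface concrete, here is a toy sketch of a few-step denoising loop. Several things are assumptions rather than facts about minWM: the ToyTrainingPipeline class, a generator that returns a clean-latent estimate, the flow-matching style re-noising, reuse of the initial noise instead of fresh samples, and the reading of timestep_from / timestep_to as the bounds of the final step taken.

```python
from typing import Callable, List, Sequence, Tuple

Latent = List[float]


class ToyTrainingPipeline:
    """Hypothetical pipeline; lists of floats stand in for latent tensors."""

    def __init__(self, generator: Callable[..., Latent],
                 timesteps: Sequence[float]) -> None:
        if not timesteps:
            raise ValueError("need at least one timestep")
        self.generator = generator        # callable(latents, t, **cond) -> clean estimate
        self.timesteps = list(timesteps)  # descending noise levels in (0, 1]

    def inference_with_trajectory(self, noise: Latent,
                                  clean_image_or_video=None,
                                  **conditional_dict) -> Tuple[Latent, float, float]:
        # clean_image_or_video is accepted for signature parity; unused in this toy.
        latents = list(noise)
        pred = latents
        t_from = t_to = self.timesteps[0]
        for i, t in enumerate(self.timesteps):
            pred = self.generator(latents, t, **conditional_dict)
            t_from = t
            t_to = self.timesteps[i + 1] if i + 1 < len(self.timesteps) else 0.0
            # Re-noise the clean estimate down to the next noise level.
            latents = [(1.0 - t_to) * p + t_to * n for p, n in zip(pred, noise)]
        return pred, t_from, t_to


# Usage with a trivial generator that always predicts all-ones latents.
pipe = ToyTrainingPipeline(lambda latents, t, **cond: [1.0] * len(latents), [1.0, 0.5])
pred, t_from, t_to = pipe.inference_with_trajectory([0.3, -0.2], prompt="x")
```

When porting, it is worth checking the reference pipelines for what the two returned timesteps actually denote before relying on this reading.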
Step 4: ProPE — condition injection
Camera control is injected via ProPE, implemented in:
- Wan21/prope/ (Wan implementation)
- HY15/hyvideo/prope/ (HY implementation)
ProPE turns a pose string (e.g. "a*4,w*8,s*7") into positional encodings injected into the DiT's RoPE.
Suggested approach: locate the new backbone's RoPE implementation and follow the HY15 or Wan21 ProPE wiring pattern, passing the pose embedding through DiffusionWrapper.forward().
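As a first step before any encoding happens, the pose string has to be expanded into a per-step sequence. The sketch below assumes the format is a comma-separated list of key*count chunks, inferred only from the example "a*4,w*8,s*7"; the function name expand_pose_string is hypothetical, and how each key maps to a camera pose is left to the actual ProPE implementations.

```python
from typing import List


def expand_pose_string(pose: str) -> List[str]:
    """Expand 'a*4,w*8' into one action token per step: ['a', 'a', 'a', 'a', 'w', ...]."""
    actions: List[str] = []
    for chunk in pose.split(","):
        key, sep, count = chunk.strip().partition("*")
        if not key or not sep:
            raise ValueError(f"malformed pose chunk: {chunk!r}")
        # int() raises ValueError on a missing or non-numeric count.
        actions.extend([key] * int(count))
    return actions


steps = expand_pose_string("a*4,w*8,s*7")
```

For the example string this yields 4 + 8 + 7 = 19 tokens, which is a convenient sanity check against the number of frames or latent frames the pose embedding is expected to cover.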
Step 5: SP (Sequence Parallel) adaptation
Shared SP communication primitives live in shared/sp/. SP tends to be the trickiest part of integrating a new backbone, so it's worth walking through every item below carefully (a fuller checklist lives in the debug-world-model skill):
- FSDP process_group uses DP group, not world group
- seed offset uses dp_rank, not global_rank
- no random number calls inside any if rank == 0: block
- KV cache lives in the head-parallel domain; head count = H // sp_size
- strip SP padding before attention, restore after attention
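Three of the checklist items reduce to small pieces of arithmetic that can be unit-tested without a GPU. The helpers below are a sketch with hypothetical names; in particular, the rank layout (each SP group is a contiguous block of sp_size global ranks) is an assumption and should be checked against how shared/sp/ actually builds its groups.

```python
from typing import List, Tuple


def dp_rank_of(global_rank: int, sp_size: int) -> int:
    """Assumes ranks [0, sp_size) form SP group 0, the next sp_size ranks group 1, etc."""
    return global_rank // sp_size


def per_rank_seed(base_seed: int, global_rank: int, sp_size: int) -> int:
    # Ranks in the same SP group hold shards of the same sample, so they
    # should draw identical noise: offset by dp_rank, not global_rank.
    return base_seed + dp_rank_of(global_rank, sp_size)


def local_head_count(num_heads: int, sp_size: int) -> int:
    """Heads held per rank in the head-parallel domain: H // sp_size."""
    if num_heads % sp_size != 0:
        raise ValueError("num_heads must be divisible by sp_size")
    return num_heads // sp_size


def pad_to_multiple(tokens: List[int], sp_size: int, pad_value: int = 0) -> Tuple[List[int], int]:
    """Pad the sequence so it splits evenly across SP ranks; return (padded, pad_len)."""
    pad = (-len(tokens)) % sp_size
    return tokens + [pad_value] * pad, pad


def strip_padding(tokens: List[int], pad: int) -> List[int]:
    """Drop the trailing pad tokens before attention so they are never attended to."""
    return tokens[:len(tokens) - pad]


padded, pad = pad_to_multiple([1, 2, 3, 4, 5], sp_size=4)
unpadded = strip_padding(padded, pad)
```

After attention, calling pad_to_multiple again on the output restores the evenly divisible length that the SP communication expects.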
Step 6: Training entry point and configs
- Wan21 entry: wan_train.py + wan_utils/configs/
- HY15 entry: trainer/training/ + trainer/configs/
For a new backbone, the Wan21 entry style is often the easier starting point thanks to its simplicity. You might create NewBB/new_bb_train.py and use ModelConfig from shared/configs/base.py as the config base class.
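A backbone-specific config could extend the shared base along the following lines. This is only a sketch: the ModelConfig defined here is an empty stand-in so the example runs on its own (the real class in shared/configs/base.py has fields not shown in this document), and every field on NewBBConfig is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Stand-in for shared/configs/base.py::ModelConfig; real fields are not reproduced."""


@dataclass
class NewBBConfig(ModelConfig):
    """Hypothetical config for a new backbone."""
    num_heads: int = 16        # hypothetical: attention heads in the DiT
    sp_size: int = 1           # hypothetical: sequence-parallel group size
    is_causal: bool = False    # hypothetical: start with the bidirectional phase

    def __post_init__(self) -> None:
        # Catch an invalid head split at config time rather than deep inside attention.
        if self.num_heads % self.sp_size != 0:
            raise ValueError("num_heads must be divisible by sp_size")


cfg = NewBBConfig(num_heads=16, sp_size=4)
```

In the real entry point the import would replace the stand-in class, and new_bb_train.py would build its models from the resulting config object.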
Recommended Integration Order
- Get the wrapper layer running on a single GPU for bidirectional inference
- Hook in BaseModel, get Phase 1 SFT training running
- Integrate pipelines stage by stage (TF → ODE / CD → DMD)
- Add ProPE condition injection
- Turn on SP, validate every item in the checklist
Each phase has corresponding HY15 / Wan21 reference files. When you run into a concrete issue, a useful habit is to ask "how does HY do this? how does Wan do this?" and let the two references guide your decision.