name: mindspeed-mm-generative description: Universal MindSpeed-MM generative model training guide for Huawei Ascend NPU. Covers all backend patterns (Megatron, Megatron+FSDP2, FSDP2-native, Accelerate+DeepSpeed), feature extraction, weight conversion, and training for ALL supported generative models. Supports Wan2.1/2.2, HunyuanVideo/1.5, CogVideoX, OpenSoraPlan, VACE, LTX2, FLUX, SD3, SDXL, Sana, HiDream, StepVideo, Lumina and more. Use when training multimodal generative models on Ascend NPU. keywords: - mindspeed-mm - generative - video generation - image generation - wan - t2v - t2i - i2v - feature extraction - sora - diffusion - flux - hunyuanvideo - cogvideox - accelerate - fsdp2
MindSpeed-MM Generative Model Training
This Skill guides users through the end-to-end training pipeline for multimodal generative models (text-to-video, text-to-image) on Huawei Ascend NPU.
Prerequisites
Critical: Do NOT use
bash scripts/install.shfor generative models. Official MindSpeed-MM docs stateinstall.shonly fully supports Qwen3/Qwen3.5. For Wan2.1 and other generative models, follow the manual install flow below (matchesexamples/wan2.1/README.md).
Step P1: Clone repositories
git clone https://gitcode.com/Ascend/MindSpeed-MM.git /root/workspace/MindSpeed-MM
git clone https://github.com/NVIDIA/Megatron-LM.git /root/workspace/Megatron-LM
cd /root/workspace/Megatron-LM && git checkout core_v0.12.1
cp -r megatron /root/workspace/MindSpeed-MM/
cd /root/workspace/MindSpeed-MM
Step P2: Install PyTorch + torch_npu
Always use pre-built wheels from Ascend PyTorch Releases. Do not use pip install torch==2.7.1 — aarch64 wheels on PyPI are unreliable.
# Download matching wheels from https://gitcode.com/Ascend/pytorch/releases
# Replace cp310 with cp311 if using Python 3.11
pip install torch-2.7.1-cp310-cp310-manylinux_2_28_aarch64.whl
pip install torch_npu-2.7.1rc1-cp310-cp310-manylinux_2_28_aarch64.whl
# For x86_64: same naming pattern with _x86_64 suffix
# Required by torch_npu and other components:
pip install numpy pyyaml scipy attrs decorator psutil
Step P3: Install MindSpeed
git clone https://gitcode.com/Ascend/MindSpeed.git /root/workspace/MindSpeed
cd /root/workspace/MindSpeed
git checkout 93c45456c7044bacddebc5072316c01006c938f9
pip install -r requirements.txt
pip install -e .
cd /root/workspace/MindSpeed-MM
Step P4: Install MindSpeed-MM base dependencies
# Installs base stack from pyproject.toml: transformers==4.57.0, diffusers==0.30.3, peft==0.7.1, etc.
pip install -e .
Step P5: Activate CANN environment
source /usr/local/Ascend/cann/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
Step P6: Smoke test
python3 -c "
import torch, torch_npu
assert torch.npu.is_available(), 'NPU not available'
print(f'NPU count: {torch_npu.npu.device_count()}')
import transformers, diffusers, mindspeed
print(f'transformers={transformers.__version__}, diffusers={diffusers.__version__}')
"
Then proceed to Step 0 below for model-specific dependencies (diffusers upgrade, decord, etc.) — Prerequisites alone is NOT enough for any generative model.
For detailed troubleshooting and alternative install paths, see mindspeed-mm-env-setup.
Supported Models
| Model | Sub-tasks | Backend | Status |
|---|---|---|---|
| Wan2.1 | t2v, i2v, v2v, flf2v | Megatron | Released |
| Wan2.2 | t2v, i2v | Megatron+FSDP2 | Released |
| HunyuanVideo | t2v, i2v | Megatron | Prototype |
| HunyuanVideo 1.5 | t2v, i2v | Megatron+FSDP2 | Prototype |
| CogVideoX | t2v | Megatron | Released |
| OpenSoraPlan 1.3/1.5 | t2v | Megatron | Released |
| OpenSora 2.0 | t2v | Megatron | Released |
| StepVideo | t2v | Megatron | Prototype |
| VACE | t2v | Megatron+FSDP2 | Prototype |
| LTX2 | t2v | FSDP2-native | Prototype |
| FLUX | t2i | Accelerate+DeepSpeed | Prototype |
| SD3 | t2i | Accelerate+DeepSpeed | Prototype |
| SDXL | t2i | Accelerate+DeepSpeed | Prototype |
| Sana | t2i | Accelerate+DeepSpeed | Prototype |
| HiDream | t2i | Accelerate+DeepSpeed | Prototype |
| Lumina | t2v | Megatron | Prototype |
Sub-task descriptions:
- t2v (text-to-video): Generate video from text, the most fundamental generation task
- i2v (image-to-video): Generate video from image + text, requires a reference image and different model weights
- v2v (video-to-video): Video style transfer
- flf2v (first-last-frame-to-video): Generate intermediate video from the first and last frames
How to Train Any Generative Model
For ANY generative model, use the Model Registry to look up the exact backend, entry script, converter, and feature extraction script. Then:
- Check
examples/<model_name>/README.mdfor model-specific instructions - Read the shell script to identify: entry script, config files, backend pattern
- Check if the model needs feature extraction (not all do — see registry)
- Check if the model needs weight conversion (diffusers models use HF weights directly)
- Modify config files with your data/weight paths
- Launch the shell script
Backend Patterns
MindSpeed-MM uses four different training backends depending on the model:
| Backend | Entry Script | Models | Notes |
|---|---|---|---|
| Megatron | pretrain_sora.py |
wan2.1, hunyuanvideo, cogvideox, opensoraplan1.3/1.5, opensora2.0, stepvideo | Standard Megatron distributed training |
| Megatron+FSDP2 | pretrain_sora.py + --use-torch-fsdp2 |
wan2.2, hunyuanvideo_1.5, vace | Requires CUDA_DEVICE_MAX_CONNECTIONS=2 |
| FSDP2-native | mindspeed_mm/fsdp/train/trainer.py |
ltx2 | Standalone FSDP2 trainer, no Megatron |
| Accelerate+DeepSpeed | accelerate launch |
flux, sd3, sdxl, sana, hidream (all diffusers models) | Uses HF accelerate with DeepSpeed, completely different from Megatron |
Feature Extraction Matrix
Not all models require feature extraction. Models that train on raw data skip this step entirely.
| Script | Models | Notes |
|---|---|---|
get_wan_feature.py |
Wan2.1 | VAE + text encoder pre-encoding |
get_hunyuan_feature.py |
HunyuanVideo | HunyuanVideo-specific VAE |
get_sora_feature.py |
CogVideoX, StepVideo, OpenSoraPlan 1.3 | Shared extraction script |
get_lumina_feature.py |
Lumina | Lumina-specific |
get_vace_feature.py |
VACE | VACE-specific |
| Not needed | wan2.2, OpenSoraPlan 1.5, ltx2, all diffusers models (flux, sd3, sdxl, sana, hidream) | Train directly on raw data |
Weight Converter Reference
| Converter | Models |
|---|---|
WanConverter |
wan2.1, wan2.2 |
HunyuanVideoConverter |
hunyuanvideo, hunyuanvideo_1.5 |
CogVideoConverter |
cogvideox |
OpenSoraPlanConverter |
opensoraplan1.3/1.5 |
StepVideoConverter |
stepvideo |
LuminaConverter |
lumina |
VACEConverter |
vace |
| No converter needed | diffusers models (flux, sd3, sdxl, sana, hidream) -- use HF weights directly |
Key Difference from VLM
The generative model training pipeline includes an additional feature extraction step (for models that need it):
Data Preparation → Feature Extraction (VAE + TextEncoder) → Training → Inference
VLMs (Vision-Language Models) train directly from raw image/video inputs, whereas generative models first use a VAE to encode video into latent space features and a text encoder to encode text into vectors, then train a diffusion model in the feature space. Some newer models (wan2.2, ltx2, diffusers models) skip feature extraction and train on raw data directly.
Quick Start: Wan2.1-1.3B T2V (End-to-End, Flagship Example)
Using Wan2.1-1.3B text-to-video as the flagship example (Megatron backend with feature extraction). This section walks through the complete training pipeline. To adapt for other models, refer to the "How to Train Any Generative Model" section above and the "Adaptation Notes" section below.
Step 0: Install Model-Specific Dependencies
MANDATORY: Wan2.1 requires
diffusers==0.33.1. The base environment shipsdiffusers==0.30.3, which does NOT containAutoencoderKLWanand will fail at feature extraction. You must upgrade before proceeding.
# Wan2.1 — upgrade diffusers (MANDATORY)
pip install diffusers==0.33.1
# IMPORTANT: After upgrading diffusers, verify core dependencies haven't been broken
pip install numpy==1.26.0 pandas==2.0.3 # Pin back if upgraded
python3 -c "import numpy, pandas, sklearn; print('Dependencies OK')"
# Video decoder (DecordVideo is the official default for Wan2.1)
# x86 architecture:
pip install decord==0.6.0
# ARM (aarch64) — build decord from source (apt-get may need retries due to network):
apt-get update || (sleep 5 && apt-get update) # retry if first attempt times out
apt-get install -y cmake build-essential \
libavcodec-dev libavformat-dev libavutil-dev libswscale-dev libavfilter-dev
git clone --recursive https://github.com/dmlc/decord.git /tmp/decord
cd /tmp/decord && mkdir build && cd build
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
cd /tmp/decord/python && pip install .
# Note: libavfilter-dev is required — without it cmake fails with "FFMPEG_LIBAVFILTER-NOTFOUND"
# Verify: python3 -c "import decord; print(decord.__version__)"
Different models have significantly different dependency versions. See references/per-model-deps.md for details. It is strongly recommended to create a separate container for each model to avoid dependency conflicts.
Step 1: Weight Download and Conversion
Download the Wan-AI/Wan2.1-T2V-1.3B-Diffusers weights from HuggingFace.
Weight conversion (HuggingFace format -> MindSpeed-MM format):
mm-convert WanConverter hf_to_mm \
--cfg.source_path ./weights/Wan-AI/Wan2.1-T2V-1.3B-Diffusers/transformer/ \
--cfg.target_path ./weights/Wan-AI/Wan2.1-T2V-1.3B-Diffusers/transformer/
When using Pipeline Parallelism (PP), you need to additionally specify the layer splitting scheme:
mm-convert WanConverter hf_to_mm \
--cfg.source_path ./weights/Wan-AI/Wan2.1-T2V-1.3B-Diffusers/transformer/ \
--cfg.target_path ./weights/Wan-AI/Wan2.1-T2V-1.3B-Diffusers/transformer/ \
--cfg.target_parallel_config.pp_layers '[[7,8,8,7]]'
Why
LOAD_PATHusestransformer/(parent dir):mm-convert hf_to_mmwriteslatest_checkpointed_iteration.txtintransformer/and stores weights intransformer/release/. Megatron checkpoint loading readslatest_checkpointed_iteration.txtto resolve the actual subdirectory automatically. SetLOAD_PATHto.../transformer/, not.../transformer/release/. For detailed information on weight conversion, refer to themindspeed-mm-weight-prepSkill.
Step 2: Dataset Preparation
Dataset directory structure:
<dataset>/
├── data.json # Video-text pair metadata
└── videos/
├── video0001.mp4
└── video0002.mp4
data.json format:
[
{
"path": "videos/video0001.mp4",
"cap": "A cat playing with a ball in a garden.",
"num_frames": 81,
"fps": 24,
"resolution": {"height": 480, "width": 832}
},
{
"path": "videos/video0002.mp4",
"cap": "Ocean waves hitting the shore at sunset.",
"num_frames": 81,
"fps": 24,
"resolution": {"height": 480, "width": 832}
}
]
Field descriptions:
path: Relative path to the video file (relative to the dataset root directory)cap: Video description textnum_frames: Number of frames (Wan2.1 typically uses 81 frames)fps: Frame rateresolution: Resolution (height x width)
Then edit examples/wan2.1/feature_extract/data.txt, with each line in the format:
<dataset_root_directory>,<data.json_path>
Example:
/home/dataset/wan_training,/home/dataset/wan_training/data.json
Step 3: Feature Extraction
Docker users: If your container was not created with
--ipc=hostor--shm-size=16g, DataLoader workers will crash withBus error. Set--num-workers 0in the script as a workaround.
Feature extraction is a step unique to generative models. This step runs the VAE and text encoder on NPU to pre-encode raw video and text into feature vectors, so that training can directly use features without repeated encoding/decoding.
Configuration files (3 files need to be modified):
examples/wan2.1/feature_extract/model_t2v.json: Set thefrom_pretrainedpaths for the VAE and text encoderexamples/wan2.1/feature_extract/data.json: Setnum_frames,max_height,max_width, and the tokenizer'sfrom_pretrainedpathmindspeed_mm/tools/tools.json: Setsorafeature.save_path, the output directory for extracted features
Launch feature extraction:
bash examples/wan2.1/feature_extract/feature_extraction.sh
Underlying call: torchrun ... mindspeed_mm/tools/feature_extraction/get_wan_feature.py
After extraction completes, the output structure is:
./sora_features/ ← sorafeature.save_path (feature dataset root)
├── data.jsonl ← file paths are RELATIVE: "features/test_0000.pt"
└── features/
├── test_0000.pt
└── test_0001.pt
Path rule (important):
data_folderinfeature_data.jsonmust point to thesave_pathroot (e.g.,./sora_features/), NOT./sora_features/features/. Thedata.jsonlalready includesfeatures/in its relative paths. Settingdata_folderto thefeatures/subdirectory causes double-nesting (features/features/...) and data loading failures.
For detailed feature extraction configuration, see references/feature-extraction.md.
Step 4: Training Configuration
After feature extraction is complete, configure the training script.
Update data.txt: Edit examples/wan2.1/1.3b/t2v/data.txt to point to the feature dataset root (e.g., ./sora_features).
Key configuration files:
feature_data.json: Dataset configuration. Setdata_foldertosorafeature.save_pathroot (e.g.,./sora_features/),dataset_type: "feature", and tokenizer pathpretrain_model.json: Model architecture configuration, defines the diffusion scheduler and predictor (wandit) architecture parameters
Training script examples/wan2.1/1.3b/t2v/pretrain.sh key parameters:
# Path configuration
LOAD_PATH="./weights/Wan-AI/Wan2.1-T2V-1.3B-Diffusers/transformer/"
SAVE_PATH="./output/wan2.1-1.3b-t2v/"
# Parallelism strategy
TP=1 # Tensor Parallelism
PP=1 # Pipeline Parallelism
VP=1 # Virtual Pipeline
CP=1 # Context Parallelism
MBS=1 # Micro Batch Size
# Memory optimization (key parameters)
--recompute-granularity full
--recompute-method block
--recompute-num-layers 20
# Distributed optimizer
--use-distributed-optimizer
--overlap-grad-reduce
--overlap-param-gather
Step 5: Launch Training
Docker users: If your container was not created with
--ipc=hostor--shm-size=16g, DataLoader workers will crash withBus error. Set--num-workers 0in the script as a workaround.
bash examples/wan2.1/1.3b/t2v/pretrain.sh
Entry script: pretrain_sora.py
Training checkpoints are saved in the directory configured by SAVE_PATH.
Step 6: Inference Generation (Optional)
After training completes, you can use the inference script to generate video from text prompts:
bash examples/wan2.1/1.3b/t2v/inference.sh
Entry script: inference_sora.py
Adaptation Notes for Other Models
Megatron Backend Models
These models follow the same pipeline as Wan2.1 (feature extraction -> training via pretrain_sora.py):
- CogVideoX:
examples/cogvideox/, usesCogVideoConverter, feature extraction viaget_sora_feature.py - OpenSoraPlan 1.3:
examples/opensoraplan1.3/, usesOpenSoraPlanConverter(hf_to_mm), feature extraction viaget_sora_feature.py - OpenSoraPlan 1.5:
examples/opensoraplan1.5/, usesOpenSoraPlanConverter(source_to_mm), no feature extraction (trains on raw data) - OpenSora 2.0:
examples/opensora2.0/ - HunyuanVideo:
examples/hunyuanvideo/, usesHunyuanVideoConverter, feature extraction viaget_hunyuan_feature.py - StepVideo:
examples/stepvideo/, usesStepVideoConverter, feature extraction viaget_sora_feature.py - Lumina: uses
LuminaConverter, feature extraction viaget_lumina_feature.py
Megatron+FSDP2 Backend Models
These models use pretrain_sora.py with --use-torch-fsdp2 and require CUDA_DEVICE_MAX_CONNECTIONS=2:
- Wan2.2:
examples/wan2.2/, usesWanConverter, NO feature extraction needed (trains on raw data). Additional deps:pip install -r examples/wan2.2/requirements.txt(diffusers==0.35.1, peft==0.17.1) - HunyuanVideo 1.5:
examples/hunyuanvideo_1.5/, usesHunyuanVideoConverter. Additional deps:pip install -r examples/hunyuanvideo_1.5/requirements.txt(transformers==4.57.1, diffusers==0.35.0) - VACE:
examples/vace/, usesVACEConverter, feature extraction viaget_vace_feature.py
FSDP2-Native Backend Models
- LTX2: Uses
mindspeed_mm/fsdp/train/trainer.pydirectly, NO feature extraction needed, no Megatron dependency
Accelerate+DeepSpeed Backend Models (Diffusers)
These models use accelerate launch with DeepSpeed, NOT pretrain_sora.py. They use HuggingFace weights directly (no converter needed) and do NOT need feature extraction:
- FLUX:
examples/diffusers/flux/, text-to-image - SD3:
examples/diffusers/sd3/, text-to-image - SDXL:
examples/diffusers/sdxl/, text-to-image - Sana:
examples/diffusers/sana/, text-to-image - HiDream:
examples/diffusers/hidream/, text-to-image
Diffusers models have a completely different training flow from Megatron models. Read the shell scripts in
examples/diffusers/<model>/to understand the specific launch command and config structure.
Post-Training
MindSpeed-MM supports the following post-training optimization methods:
| Method | Entry Script | Description |
|---|---|---|
| DPO | posttrain_sora_dpo.py |
Direct Preference Optimization, used for aligning with human preferences |
| GRPO | posttrain_flux_dancegrpo.py |
Group Relative Policy Optimization |
Directory Structure Reference
MindSpeed-MM/
├── pretrain_sora.py # Megatron/FSDP2 training entry point
├── inference_sora.py # Inference entry point
├── posttrain_sora_dpo.py # DPO post-training entry point
├── posttrain_flux_dancegrpo.py # GRPO post-training entry point
├── mindspeed_mm/
│ ├── fsdp/train/trainer.py # FSDP2-native trainer (used by ltx2)
│ └── tools/
│ ├── tools.json # Global config (sorafeature.save_path for feature output)
│ └── feature_extraction/
│ ├── get_wan_feature.py # Wan2.1 feature extraction
│ ├── get_hunyuan_feature.py # HunyuanVideo feature extraction
│ ├── get_sora_feature.py # CogVideoX, StepVideo, OpenSoraPlan
│ ├── get_lumina_feature.py # Lumina feature extraction
│ └── get_vace_feature.py # VACE feature extraction
└── examples/
├── wan2.1/ # Megatron backend
│ ├── feature_extract/
│ │ ├── feature_extraction.sh
│ │ ├── model_t2v.json
│ │ ├── data.json
│ │ └── data.txt
│ ├── 1.3b/t2v/
│ │ ├── pretrain.sh
│ │ ├── inference.sh
│ │ ├── data.txt
│ │ ├── feature_data.json
│ │ └── pretrain_model.json
│ └── 14b/t2v/
├── wan2.2/ # Megatron+FSDP2 backend
├── hunyuanvideo/ # Megatron backend
├── hunyuanvideo_1.5/ # Megatron+FSDP2 backend
├── cogvideox/ # Megatron backend
├── opensoraplan1.3/ # Megatron backend
├── opensoraplan1.5/ # Megatron backend
├── opensora2.0/ # Megatron backend
├── stepvideo/ # Megatron backend
├── vace/ # Megatron+FSDP2 backend
├── ltx2/ # FSDP2-native backend
└── diffusers/ # Accelerate+DeepSpeed backend
├── flux/
├── sd3/
├── sdxl/
├── sana/
└── hidream/
FAQ
Q: Feature extraction reports OOM -- Reduce max_height/max_width or num_frames in data.json, or reduce --nproc_per_node in torchrun.
Q: Training loss does not converge -- Check that LOAD_PATH points to transformer/ (not model root), weight conversion completed successfully, and dataset_type in feature_data.json is "feature".
Q: Inference generates all-black or noisy video -- Confirm the inference script loads the correct checkpoint path and sufficient training steps have been completed.
Q: mm-convert command not found -- Run pip install -e . to install MindSpeed-MM. mm-convert is a console entry point registered during installation.
Q: Feature extraction cannot read videos -- Install decord: pip install decord==0.6.0 on x86, or compile from source on ARM.
Q: Communication_Error_Bind_IP_Port -- Stale process holding port. Kill with lsof -i :29500 + kill -9 <PID>, or change MASTER_PORT.
References
- Model Delta Cards - Execution checklists for non-flagship models (Wan2.2, HunyuanVideo, CogVideoX, Diffusers/FLUX)
- Per-Model Dependencies - Dependency versions and installation instructions for each generative model
- Feature Extraction Guide - Feature extraction configuration files, workflow, and troubleshooting
- Model Registry - Complete lookup table for all models
- MindSpeed-MM Repository