launch-experiment

star 3.7k

Generate and execute a training launch command for FastVideo models

hao-ai-lab By hao-ai-lab schedule Updated 5/18/2026

name: launch-experiment description: Generate and execute a training launch command for FastVideo models

Launch Experiment

Purpose

Construct a fully-specified torchrun training command for a FastVideo model given a target pipeline, dataset, and hyperparameter overrides. This skill automates the boilerplate of setting environment variables, picking the right entrypoint, and applying defaults from the closest example script.

Prerequisites

  • The repo is cloned and fastvideo is installed (uv pip install -e ".[dev]").
  • Dataset is preprocessed (see docs/training/data_preprocess.md).
  • WANDB_API_KEY is set in the environment (or WANDB_MODE=offline for local).
  • GPU resources are available (multi-GPU requires NCCL).

Inputs

Parameter Required Description
pipeline Yes Training pipeline type: finetune, distill-dmd, self-forcing, lora, consistency
model Yes Model family: wan-t2v-1.3B, wan-i2v-14B, ltx2, matrixgame
data_path Yes Path to preprocessed dataset (parquet)
num_gpus Yes Number of GPUs
overrides No Dict of hyperparameter overrides (any CLI arg)
output_dir No Output directory (default: outputs/<model>_<pipeline>)
run_name No W&B run name (default: auto-generated)

Steps

1. Identify the training entrypoint

Pipeline Entrypoint
finetune (Wan T2V) fastvideo/training/wan_training_pipeline.py
finetune (Wan I2V) fastvideo/training/wan_i2v_training_pipeline.py
finetune (LTX-2) fastvideo/training/ltx2_training_pipeline.py
finetune (Matrix-Game 2.0) fastvideo/training/matrixgame2_training_pipeline.py
distill-dmd fastvideo/training/wan_distillation_pipeline.py
self-forcing fastvideo/training/wan_self_forcing_distillation_pipeline.py

2. Resolve default hyperparameters

Find the closest example script in examples/training/ for the model:

Model Example Script Directory
wan-t2v-1.3B examples/training/finetune/wan_t2v_1.3B/crush_smol/
wan-i2v-14B examples/training/finetune/wan_i2v_14B_480p/crush_smol/
ltx2 examples/training/finetune/ltx2/
matrixgame examples/training/finetune/MatrixGame2.0/
distill-dmd scripts/distill/v1_distill_dmd_wan.sh

Read the script to extract default values for:

  • --learning_rate, --train_batch_size, --sp_size, --tp_size
  • --num_latent_t, --num_height, --num_width, --num_frames
  • --gradient_accumulation_steps, --max_train_steps
  • --mixed_precision, --weight_decay, --max_grad_norm
  • --validation_steps, --validation_sampling_steps

3. Set environment variables

export WANDB_API_KEY="${WANDB_API_KEY}"
export WANDB_BASE_URL="https://api.wandb.ai"
export FASTVIDEO_ATTENTION_BACKEND=FLASH_ATTN
export TOKENIZERS_PARALLELISM=false
export TRITON_CACHE_DIR=/tmp/triton_cache

4. Construct the torchrun command

torchrun --nnodes 1 --nproc_per_node <num_gpus> \
    <entrypoint> \
    --pretrained_model_name_or_path <model_hf_id> \
    --data_path "<data_path>" \
    --output_dir "<output_dir>" \
    --wandb_run_name "<run_name>" \
    --tracker_project_name "<project_name>" \
    --log_validation \
    <...all hyperparameters...>

5. Log to experiment journal

After launching, append an entry to .agents/memory/experiment-journal/README.md:

## [YYYY-MM-DD] Experiment: <run_name>
- **Hypothesis**: <user-provided or auto-generated>
- **Config**: model=<model>, lr=<lr>, sp_size=<sp>, gpus=<n>, script=<entrypoint>
- **W&B run**: <pending — will be updated by monitor skill>
- **Status**: running

Outputs

  • A ready-to-execute shell command.
  • An experiment journal entry.

Example Usage

Launch a Wan T2V 1.3B finetune on 4 GPUs with lr=5e-5 and max_train_steps=1000:

  pipeline: finetune
  model: wan-t2v-1.3B
  data_path: data/crush_smol_preprocessed/
  num_gpus: 4
  overrides:
    learning_rate: 5e-5
    max_train_steps: 1000

References

  • examples/training/finetune/wan_t2v_1.3B/crush_smol/finetune_t2v.sh
  • scripts/distill/v1_distill_dmd_wan.sh
  • docs/training/finetune.md (training arguments table)
  • fastvideo/training/trackers.py (tracker initialization)

Changelog

Date Change
2026-03-02 Initial version
Install via CLI
npx skills add https://github.com/hao-ai-lab/FastVideo --skill launch-experiment
Repository Details
star Stars 3,719
call_split Forks 362
navigation Branch main
article Path SKILL.md
More from Creator