name: add_hf_model
description: >
Add support for a HuggingFace model in XTuner so it can be trained with the
full set of parallelism strategies and optimization switches. The user can
provide a HuggingFace Hub repo id, a local model directory, or a model family
already supported by transformers. This skill walks through the full path:
locating the reference implementation, classifying the model, splitting new
code across the model / module / ops layers, implementing the XTuner model
class and config, registering the entry point, validating bitwise numerical
parity with HuggingFace, writing regression tests, and enabling the training
optimizations (EP/SP, micro-batch, torch.compile, fp8, activation offload).
Add a HuggingFace model to XTuner
A complete handbook for porting any HuggingFace causal LM into XTuner with
correct training behavior. Covers four buckets: Dense LLM, MoE LLM,
VLM / compose model, and trust_remote_code model. Every step below is
grounded in concrete file:line references in this repository — read the cited
code before writing yours.
Anti-spaghetti reminder — XTuner separates what the model is (
*Config), what it computes (Dense/MoE/BaseComposeModelsubclass), and how its weights map to HF safetensors (to_hf_key_list, optionalsafetensors_to_params/param_to_safetensor). Keep new code inside those three seams. Never addif model_name == "..."branches to the base classes.
0. Locate the source code
The only goal of this step is to find the HuggingFace reference implementation
you will port from, and to note the model_type you will register under. Do
not pick attention/router/layer here — that happens naturally in §1 as you read
the code.
- If the user gives a remote-code model, the modeling code ships in the repo
as
modeling_*.py— read it directly. - If the user gives weights only, confirm the target Python environment with
the user, then find the matching implementation inside that env's
transformersinstall. If no code exists for thatmodel_type, stop and ask the user. - Confirm the weights are training-suitable — at least bf16 — so training precision can be aligned later. Flag fp8/int4-only checkpoints to the user.
- Note two strings from the HF config — you will need both downstream:
from transformers import AutoConfig cfg = AutoConfig.from_pretrained(path, trust_remote_code=True) cfg.model_type # registration key (§4) cfg.architectures[0] # the modeling class you are mirroring
1. How to do the migration
XTuner splits a model into three layers. You port by reusing what already exists at each layer and only adding what is genuinely new — so the work is "read the HF modeling code, then match each of its pieces to the XTuner component below". The locations are listed plainly; you do not need extra guidance to map your model's attention / router / layer onto the matching one.
- model layer (
xtuner/v1/model) — the model body.- Dense body:
xtuner/v1/model/dense/dense.py - MoE body:
xtuner/v1/model/moe/moe.py - Multi-modal (compose) bodies:
xtuner/v1/model/compose/— reuse the language model and add only the ViT / projector pieces.
- Dense body:
- module layer (
xtuner/v1/module):- Decoder layers:
DenseDecoderLayer,MoEDecoderLayer—xtuner/v1/module/decoder_layer. Small changes add a switch on the existing layer; a genuinely novel architecture may add{model_name}_decoder_layer. - Attention configs:
MHAConfig(xtuner/v1/module/attention/mha.py),MLAConfig(.../mla.py, DeepSeek-style latent),GatedDeltaNetConfig(.../gated_deltanet.py, linear). - Router configs (MoE):
GreedyRouterConfig(xtuner/v1/module/router/greedy.py, softmax top-k — Qwen3 MoE, GptOss),NoAuxRouterConfig(xtuner/v1/module/router/noaux_router.py, grouped sigmoid aux-loss-free — DeepSeek V3). - Also: rope, rms_norm, etc.
- Decoder layers:
- ops layer (
xtuner/v1/ops) — kernels such as attention and rms_norm.
Existing models to copy from
Pick the one whose attention + (router) match yours; the closer it is, the smaller your diff.
| Your model is… | Copy from |
|---|---|
| Dense LLM | xtuner/v1/model/dense/qwen3.py |
| MoE LLM | xtuner/v1/model/moe/qwen3.py |
| MoE with MLA + aux-loss-free router | xtuner/v1/model/moe/deepseek_v3.py |
| MoE needing weight-layout transform | xtuner/v1/model/moe/gpt_oss.py |
| Compose / VLM | xtuner/v1/model/compose/qwen3_vl/ |
After you have a plan, write an integration design doc into docs/design/model
and confirm with the user that the decomposition is reasonable before
implementing.
File layout for the new model
| Bucket | Model file | Public re-export | Test file |
|---|---|---|---|
| Dense LLM | xtuner/v1/model/dense/<name>.py |
xtuner/v1/model/__init__.py |
tests/model/test_<name>_dense.py |
| MoE LLM | xtuner/v1/model/moe/<name>.py |
xtuner/v1/model/__init__.py |
tests/model/test_<name>_moe.py |
| Compose / VLM | xtuner/v1/model/compose/<name>/... |
xtuner/v1/model/__init__.py |
tests/model/test_<name>.py |
2. Implement the model class
Subclass Dense, MoE, or BaseComposeModel. You typically only need to
override one method.
2.1 to_hf_key_list(self, key: str) -> list[str] — mandatory, abstract on BaseModel
Declared at xtuner/v1/model/base.py. Translates an XTuner-side parameter
name into one or more HF-side safetensors keys. Return-list cardinality:
- 1 → 1 for plain weights (most params).
- 1 → N for fused MoE experts (one fused param explodes into N per-expert
HF keys; see
xtuner/v1/model/moe/qwen3.py).
Patterns you almost always need (study xtuner/v1/model/dense/qwen3.py
and xtuner/v1/model/moe/qwen3.py):
# 1. Tied embeddings: redirect lm_head → embed_tokens before adding the model. prefix.
if self.config.tie_word_embeddings and "lm_head" in key:
key = key.replace("lm_head", "embed_tokens")
# 2. Add the "model." prefix that HF wraps everything in.
if "layers" in key or "embed_tokens" in key:
key = "model." + key
# 3. HF MoE nests experts/gate under .mlp.; XTuner does not.
if "layers" in key:
key = re.sub(r"layers\.(\d+)\.(experts|gate)", r"layers.\1.mlp.\2", key)
# 4. Top-level norm.
if key.startswith("norm."):
return [key.replace("norm.", "model.norm.")]
# 5. MoE expert fusion: one fused param → N HF keys.
if "fused_w1w3.weight" in key:
out = []
for i in range(self.config.n_routed_experts):
out.append(key.replace("fused_w1w3.weight", f"{i}.gate_proj.weight"))
out.append(key.replace("fused_w1w3.weight", f"{i}.up_proj.weight"))
return out
# 6. Model-specific buffers (e.g. FoPE: rotary_emb.sin_coef / cos_coef are persistent buffers).
if key.startswith("rotary_emb."):
return [key.replace("rotary_emb.", "model.rotary_emb.")]
A prefix-region remap that differs by deployment (e.g. a text tower nested under a
VLM as model.language_model.) does not belong here — put it in the config's
hf_key_mapping. See §5.1.
2.2 safetensors_to_params / param_to_safetensor — optional, only for layout mismatches
If the HF storage layout cannot be expressed as “same tensor, different name”,
you must transform the bytes. Canonical example: GptOss stores expert weights as
(num_experts, hidden_size, expert_dim * 2) but XTuner fuses them as
(num_experts * 2 * expert_dim, hidden_size) — see overrides at
xtuner/v1/model/moe/gpt_oss.py.
Use this hook only when transposing/reshaping is unavoidable. Renames go in
to_hf_key_list; numerics go here.
2.3 Do not override
build_layers,build_embeddings,_init_weights— the base handles them via the config. Override only if the new model adds a layer type the base cannot express (rare).forward— same reasoning. If you find yourself touching it, re-evaluate the bucket choice; the problem is probably in the config (attention,router,first_k_dense_replace).
3. Implement the config
Two layers: a base config that reads HF and a size-specific subclass per released checkpoint.
3.1 Base config — three required members
Mirror Qwen3MoEConfig (xtuner/v1/model/moe/qwen3.py) or
Qwen3DenseConfig (xtuner/v1/model/dense/qwen3.py).
build(self) -> <ModelClass>— return<ModelClass>(self).@classmethod from_hf(cls, hf_path) -> Self— read the HF config and map its fields one-to-one onto your config. The field set is whatever the live config class declares (TransformerConfiginxtuner/v1/model/base.py,MoEConfiginxtuner/v1/model/moe/moe.py); read it, don't work from a hard-coded list. A few non-obvious traps:- Use
RopeParametersConfig.from_hf_config(hf_config)fromxtuner/v1/module/rope/rope.py. It already handles both therope_parametersdict (HF ≥ 5.2.0) and the legacyrope_scalingdict (HF 4.57.x), plus YARN / FoPE / mrope special fields. Do not parse rope yourself. - For optional HF fields, use
getattr(hf_config, "<field>", <default>)rather thanhf_config.<field>— older checkpoints drop fields.
- Use
@property hf_config(self) -> <HFConfigClass> | None— the inverse offrom_hf: every fieldfrom_hfreads must be re-emitted sosave_hfround-trips.
Built-in vs. trust_remote_code — the only axis that changes from_hf /
hf_config: a trust_remote_code model is structurally still Dense / MoE /
VLM; the only difference is whether transformers ships a built-in
<Name>Config class. Decide which case you are in and follow the matching column
— this is the single place that contract is defined.
| Built-in config (e.g. Qwen3) | trust_remote_code (no built-in config) |
|
|---|---|---|
from_hf reads via |
<HFConfigClass>.from_pretrained(hf_path) |
AutoConfig.from_pretrained(hf_path, trust_remote_code=True), then assert hf_config.model_type == "<expected>" |
hf_config returns |
a populated <HFConfigClass> re-emitting every field |
None |
save_hf weight-map |
written from hf_config |
same |
save_hf side files |
derived from hf_config |
falls back to copying config.json / tokenizer / *.py from self._hf_path so the dir stays loadable with trust_remote_code=True; module-cache test in the matrix below |
Compare Qwen3MoEConfig.hf_config (built-in) and Qwen3MoEFoPEConfig.hf_config
(returns None), both in xtuner/v1/model/moe/qwen3.py. Do not invent a
built-in HF config class for a remote-code model — return None.
Invariants:
from_hf↔hf_configmust round-trip. Thetest_save_hftest (§7) is the enforcement mechanism — every key in the original index must appear in the saved index with byte-equal tensors.- The config must be pickleable and side-effect free. No env vars, no file
I/O, no
torch.cuda.*at construction time.
3.2 Size-specific subclasses
Add one subclass per released checkpoint that hard-codes the published
dimensions. Mirror Qwen3MoE30BA3Config / Qwen3MoE235BA22Config in
xtuner/v1/model/moe/qwen3.py. Even if only one size ships, define at least one
— it is the entry point for get_model_config and for tests.
4. Register the entry point
Open xtuner/v1/model/__init__.py — all four edits are short:
- Import — bring in the new config classes.
- Alias — add
"<alias>": <NewSizeConfig>()tomodel_mapping. Lookup normalizes case and-/_, so"my-model-7b"and"my_model_7b"are the same key. - Dispatch — add
elif cfg.model_type == "<model_type>":toget_model_config_from_hfreturning<NewConfigClass>.from_hf(model_path). The catch-allraise ValueError(...)must stay last. __all__— re-export the new public config names.
Never key dispatch on parameter names, file paths, or architecture strings.
Only model_type.
5. Compose / VLM specifics
Compose models live under xtuner/v1/model/compose/<name>/ and follow a
three-tower pattern. See xtuner/v1/model/compose/base.py and
xtuner/v1/model/compose/qwen3_vl/qwen3_vl_config.py.
5.1 Config shape
BaseComposeConfig (compose/base.py) carries three sub-configs:
vision_config: XTunerBaseModelConfig # e.g. Qwen3VLVisionConfig
projector_config: XTunerBaseModelConfig # e.g. Qwen3VLProjectorConfig
text_config: XTunerBaseModelConfig # any Dense/MoEConfig subclass
freeze_vision: bool = False
freeze_projector: bool = False
freeze_language: bool = False
Reuse existing text-tower configs where possible. Qwen3VLDense* delegates to
the Dense path via xtuner/v1/model/dense/qwen3vl_text.py. Do not copy-paste
the entire dense/MoE config — compose over them.
Reused text-tower key prefixes — use hf_key_mapping, not to_hf_key_list. A
text tower reused inside a VLM keeps its standalone to_hf_key_list (it emits
model.<...> as if it were a plain LLM). The compose checkpoint nests it one level
deeper (model.language_model.<...>). Do not teach the text tower's
to_hf_key_list about that prefix — that leaks the compose context into a tower that
must also work standalone. Set the remap on the text config instead:
hf_key_mapping = {r"^model\.": "model.language_model."} (applied in
_init_load_spec). Keep structural renames in to_hf_key_list; the
deployment-dependent prefix region belongs in hf_key_mapping.
5.2 Model shape
BaseComposeModel (compose/base.py) builds three sub-models:
vision_tower, multi_modal_projector, language_model.
from_hf(hf_path)— delegates to each sub-model withstrict=False(vision keys are missing from the language-model state dict and vice versa) and unions the missing-keys sets.save_hf(hf_dir)— saves each tower with a distinct prefix ("model-language","model-vision","model-projector") and merges the threeweight_maps into onemodel.safetensors.index.json.
5.3 hf_config for VLMs
Many VLMs ship without a stable HF top-level config class; hf_config returns
None (compose/qwen3_vl/qwen3_vl_config.py). In that case, save_hf
falls back to copying source files from self._hf_path (see §3.1).
5.4 Vision-side quirks to watch for
attn_implselectable per checkpoint:"flash_attention" | "flex_attention" | "eager_attention".deepstack_visual_indexes(e.g.[8, 16, 24]) — auxiliary supervision depths on the vision tower; must round-trip.- Specialized rope:
rope_type="qwen3_vl",mrope_section=[24, 20, 20]— delegated toRopeParametersConfig.from_hf_config. - The projector’s
torch.compilepath is enabled on torch ≥ 2.9.1 (see commitf6d74efb). Tests must validate both compiled and eager outputs match.
6. Baseline parity — no complex parallelism yet
First make the model correct on the simplest execution path: single rank, no
EP/SP, no torch.compile, no fp8, no activation offload. Align precision from
the inside out — ops → decoder layer → full model — reusing XTuner's own modules
at every level. Only once this baseline holds do you add the parallel and
optimization features (§8).
Reuse, don't reimplement. Use the existing XTuner modules — MHA, MLA, GatedDeltaNet, GroupLinear, etc. Do not re-implement them with naive torch.
Route op differences through
XTUNER_HF_IMPL. This env var already exists (xtuner/v1/utils/misc.py) and selects HF-exact ops via per-op selectors:xtuner/v1/ops/attn_imp.py::get_attn_impl_fn(eager attention),xtuner/v1/ops/rms_norm/__init__.py::get_rms_norm_fn(native torch rms_norm), andxtuner/v1/ops/gated_deltanet/__init__.py::{get_chunk_gated_delta_rule_fn, get_causal_conv1d_fn}(return the canonical fla / causal_conv1d wrappers HF calls instead of XTuner's compile-friendlytorch.library.custom_opwraps — same forward output, but the backward op graph now matches HF). Selectors read the env var live so a test can toggle it per model instance. If your model introduces a new op whose fast path is not bitwise against HF (typical for fla-backed linear-attention variants), add its HF-exact branch in the corresponding op selector the same way. You may only patch at the ops level — never take the shortcut of patching the entireXTunerDecoderLayer.forward.Pair
XTUNER_HF_IMPLwithXTUNER_DETERMINISTIC=true. The latter gates an autotune-pin block inxtuner/v1/__init__.pythat monkey-patchestriton.autotuneto lock every kernel to its first config and disable the result cache. Without the pin, fla's@triton.autotunepicks tiling / reduction order per kernel and the choice drifts between runs, producing 1 ULP per linear-attention layer that the chain rule amplifies to ~1.0 at the model boundary (§12).tests/conftest.pysets the env var and runsimport xtuner.v1so the patch installs before fla is imported. The patch lives on thextuner.v1.*import chain (not in the conftest) becausemultiprocessing.spawnchildren ofMultiProcessTestCasedon't go through pytest's conftest — they re-import the test class top-level (which pullsxtuner.v1.model.*) and the patch must run in those processes too. Production training doesn't set the env var, so the block is inert there.Decoder-layer bitwise parity — forward AND backward, both required. Build one XTuner
DecoderLayer(one per distinct layer type — e.g. a linear and a full layer for a hybrid model) and the matching HF layer, setXTUNER_HF_IMPLandXTUNER_DETERMINISTIC, feed identical inputs, and align bitwise — not a tolerance. Backward is not optional: asingle layer → final norm → lm_head → CE loss → loss.backward()test asserts the layer output, the loss, and the input gradientdL/dxall bitwise-equal to HF — test one full and one linear layer separately (reference:test_decoder_layer_bitwise_parity). A bitwise forward does not imply a bitwise backward — the forward values can round identically while the backward graph differs (see the fp32-softmax note in §12). This is the primary path, required at all scales, and the only feasible one for large models (it never loads the full model — see "load only what you use" below). When you hand-build a single layer's inputs, the two sides'causal mask/seq_ctxmust match exactly — see §12. Load only what you use so the test fits any model size: build the XTuner tower onmetaand materialize (to_empty+ selective load) only the tested layer +norm+lm_headviaHFCheckpointLoader.load(key)(it reads just the needed safetensors shard); build a standalone HFDecoderLayer+ norm + lm_head and load the same keys. Neverfrom_pretrainedthe whole model for a single-layer test. (build_rotary_embeddingbuilds the rotary on CPU with real buffers even under meta, somodel.rotary_embis usable without materializing the layers.) For < 40B you may also read per-layer hidden states from one full-model forward (HFoutput_hidden_states+ XTunerreturn_hidden_states); mind the off-by-one in §12.Model config matches HF hyperparameters — see the §3.1 built-in vs. remote-code contract for how
from_hf/hf_configdiffer between the two.Full-model forward parity for models < 40B, bitwise against HF, on the single-rank path (FSDP on or off; no EP / compile / fp8 yet).
save_hf/from_hfcorrectness (round-trip; see §7).Loss-convergence trace + engine test. Once model forward and decoder-layer forward/backward are aligned, record a loss-convergence trajectory and add an engine test case.
Commit the baseline — do not stop to ask. Committing is part of this workflow: invoking this skill authorizes the in-workflow commits. Once parity holds on the simple path and lint/tests pass, commit immediately so this correct baseline is preserved before any parallel/optimization work begins (the §8 sub-agents branch off this commit). Commit each logical step as you finish it (§10) rather than batching everything to the end or waiting for a separate go-ahead.
Report back to the user: whether bitwise parity was achieved, and the residual error (decoder-layer level and model level) once XTuner's internal components are used.
7. Baseline tests — add these first
Before any parallel feature, add the most basic regression tests so the baseline
(§6) is locked and the §8 work has a fixed reference to measure against. Add
tests/model/test_<name>_<bucket>.py, mirroring
tests/model/test_qwen3_moe.py (MoE) or tests/model/test_qwen3_dense.py
(Dense). All tests extend xtuner._testing.DeterministicDDPTestCase for
deterministic distributed setup.
Reference test cases:
- decoder-layer bitwise parity + save_hf round-trip:
tests/model/test_qwen3_5_dense.py(hybrid linear/full dense VLM; togglesXTUNER_HF_IMPLper case) - model forward & save & load parity:
tests/model/test_qwen3_moe.py - engine training test:
tests/engine/test_moe_train_engine.py
7.1 Baseline test matrix
All single execution path — no EP / dispatcher / compile / fp8 here; those are §8 tests, written against this baseline.
| Case | Dense | MoE | Compose | Notes |
|---|---|---|---|---|
Decoder-layer parity — output + loss + dL/dx |
✅ | ✅ | ✅ | Required at all scales, bitwise. Test one full and one linear layer separately. Standalone & memory-light: build the XTuner tower on meta and materialize_xtuner_submodule only {tested layer, norm, lm_head}; build a standalone HF DecoderLayer; load via HFCheckpointLoader. Under XTUNER_HF_IMPL assert the layer output, loss, and dL/dx all bitwise (forward AND backward at the layer level — a bitwise forward ≠ bitwise backward, see §12). Reference: test_decoder_layer_bitwise_parity. |
| Whole-model forward + backward parity (by scale) | ✅(4B) | scale | ✅(4B) | Run the top-level model end-to-end vs HF, eager both sides (XTUNER_HF_IMPL + XTUNER_DETERMINISTIC + HF eager). For a VLM that's the compose model on an image prompt (vision + projector + text — subsumes "VLM forward parity"); for a plain LLM, the text/LM model. With both env switches on plus the autotune pin loaded (§6.2), forward AND backward are bitwise: assert logits + loss + dL/d(pixel_values) / dL/d(inputs_embeds) all == 0.0. Scale: small = fwd+bwd; medium = forward only; large (can't e2e-forward) = skip → integration. Reference: test_vl_forward_parity. |
| Vision-tower bitwise parity | — | — | ✅ | Compose/VLM only. Compare HF's visual output (pooler_output, which is post-merger) bitwise (0.0), both eager. XTuner splits that boundary into vision_tower (patches → merged hidden) + projector (the merger MLP → text dim), so load both standalone to match it — don't build the whole VLM just for the vision path. Build them on the real device, not meta: the vision rotary is computed into a buffer at forward time, so to_empty would leave garbage (unlike the text rotary, which build_rotary_embedding builds real even under meta — §6.3). Vision attention is non-causal → eager_attention(causal=False), which XTUNER_HF_IMPL selects. Reference: test_vision_tower_bitwise_parity. |
| FSDP forward parity | ✅ | ✅ | ✅ | Looser tolerance (≈ 3e-2 is the established budget). |
save_hf round-trip (byte-equal tensors) |
✅ | ✅ | ✅ | See tests/model/test_qwen3_moe.py. |
7.2 Parity bar by model scale
Two independent axes:
Per-layer (op) parity — required at all scales, bitwise. Test one full and
one linear decoder layer separately; for each assert layer output + loss + dL/dx
bitwise vs HF under XTUNER_HF_IMPL. Keep it memory-light (XTuner meta-build +
materialize only the tested layer + norm + lm_head; standalone HF layer; load
via HFCheckpointLoader) so it runs even for models too large to forward
end-to-end. This is the bitwise guarantee for both forward and backward ops.
Whole-model (integration) parity — on the top-level model, graded by what fits one GPU. Run the real model end-to-end: for a VLM that's the compose model on an image prompt (one test covers vision + projector + text — don't add a separate text-tower model test); for a plain LLM, the text/LM model.
- Small (e.g. dense 4B): both forward (logits/loss) AND backward
(
dL/d(pixel_values)for a VLM,dL/d(inputs_embeds)for text) bitwise. Prerequisites:XTUNER_HF_IMPL=trueso every op selector returns the HF-canonical callable (§6.2) ANDXTUNER_DETERMINISTIC=trueso Triton autotune is pinned (§6.2). With both on, even hybrid linear/full models match HF byte-for-byte through the whole compose chain. Don't check weight gradients — they accumulate across positions + tiedlm_headand aren't the right granularity for parity. - Medium: forward only (logits/loss bitwise).
- Large (cannot forward end-to-end on the test GPUs): skip the whole-model test; leave it to integration tests. The per-layer test still gives op-level forward+backward bitwise.
7.3 Required test idioms
tests/model/test_qwen3_5_dense.py is the reference template; extend
DeterministicDDPTestCase (xtuner._testing), which provides the shared
parity scaffolding so the test file only holds model-specific logic:
with self.hf_impl():— sets/restoresXTUNER_HF_IMPL=true; build the XTuner model inside the block so its attention picks the eager op at construction.self.materialize_xtuner_submodule(model, submodule, prefix, loader)— for a model built onmeta, materialize only that submodule on CUDA and load its weights (so a single-layer test fits any model size).self.load_params_from_hf(module, key_for, loader)/self.xtuner_ckpt_key(model, name)— selective per-param load by checkpoint key (HFCheckpointLoaderreads only the needed shard);xtuner_ckpt_keyappliesto_hf_key_list+ the config'shf_key_mapping.
# Construct on meta, materialize on device — single-rank baseline, eager, no EP.
with torch.device("meta"):
cfg = get_model_config_from_hf(hf_model_path)
cfg.compile_cfg = False # baseline tests run eager
model = cfg.build()._to_device_dtype(dtype=torch.bfloat16, skip_buffers_dtype=True)
# Optionally shard with FSDP (data-parallel only), then load.
model.fully_shard(fsdp_config=FSDPConfig(cpu_offload=False))
model.from_hf(hf_model_path)
Older patch_hf_rms_norm / patch_hf_rope helpers also live in
xtuner._testing. Check they actually fit your model before using them —
they were written for earlier models and match HF modules
by class-name substring and attribute name (e.g. variance_epsilon). On a newer
model they can silently mis-patch (wrong eps attribute, or replacing a
zero_centered RMSNorm with a default one). Often the native XTuner ops already
match HF bitwise (see §12), so no patching is needed; prefer XTUNER_HF_IMPL
over these helpers for parity. The §8 parallel-feature tests reuse this same
idiom with cfg.dispatcher / cfg.ep_size (and FSDPConfig(ep_size=...)) set.
7.4 Checkpoint paths
Read from env vars; never hard-code. Example:
QWEN3_MOE_PATH = os.environ["QWEN3_MOE_PATH"] at
tests/model/test_qwen3_moe.py. Document new env vars in the PR body.
7.5 Drop-in training config
Ship a runnable training config at ci/config/<name>.py so a developer (and CI) can verify
the port end-to-end with one command: load the HF checkpoint, run a few real training steps,
watch the loss curve descend as expected. This is the smoke test that catches anything the
unit tests miss — the full from_hf → model.forward → loss → backward → optimizer step →
FSDP shard/reduce chain — and the file doubles as the example users copy when wiring the
model into their own training pipeline. Mirror ci/config/qwen3_moe_30BA3.py (MoE) or
ci/config/qwen3_dense.py (dense), keeping its structure: one <NewSizeConfig>() (from §3.2)
fed into a TrainerConfig alongside optim_cfg / lr_cfg / fsdp_cfg / dataset_cfg /
dataloader_cfg / loss_cfg. load_from and tokenizer_path read from an env var (§7.4 —
typically the same one as the parity test). Verify by running ~50 steps and confirming the
loss drops monotonically into a plausible range for that model size; record the trajectory
in the PR body alongside the §6 convergence trace.
8. Training optimizations
Only after the §6 baseline is committed and the §7 tests are green: verify each switch runs correctly and keeps precision within budget against that baseline.
- EP and SP can be enabled normally.
- With
ep_size > 1,intra_layer_micro_batchenables correctly and stays precise. torch.compileruns correctly.- fp8 can be enabled.
XTUNER_ACTIVATION_OFFLOADcan be enabled.
Each switch needs a test that compares its output to the §6 baseline — e.g. the
dispatcher × ep_size parity matrix (at minimum {(None,1), ("all2all",4), ("all2all",8)}, add "deepep" if supported), compiled-vs-eager, or fp8-vs-bf16
within tolerance.
These items are well-suited to the multi-agent workflow in §11: the main agent analyzes and creates tasks (with test cases), sub-agents develop in parallel, and the main agent merges. Report the precision error introduced by each switch.
9. Reference cheat-sheet
| Concept | Concrete location |
|---|---|
| Dense base config | xtuner/v1/model/base.py (TransformerConfig) |
| MoE base config | xtuner/v1/model/moe/moe.py (MoEConfig) |
| Compose base config | xtuner/v1/model/compose/base.py (BaseComposeConfig) |
from_hf (model) |
xtuner/v1/model/base.py (from_hf) |
save_hf (model) |
xtuner/v1/model/base.py (save_hf) |
to_hf_key_list proto |
xtuner/v1/model/base.py |
| MoE layer split | xtuner/v1/model/moe/moe.py (first_k_dense_replace) |
| RoPE auto-parse | xtuner/v1/module/rope/rope.py (RopeParametersConfig.from_hf_config) |
| Dense reference | xtuner/v1/model/dense/qwen3.py |
| MoE reference | xtuner/v1/model/moe/qwen3.py |
| MoE w/ MLA + NoAux | xtuner/v1/model/moe/deepseek_v3.py |
| MoE w/ layout transform | xtuner/v1/model/moe/gpt_oss.py |
| Remote-code reference | xtuner/v1/model/moe/qwen3.py (Qwen3MoEFoPEConfig) |
| Compose reference (VLM) | xtuner/v1/model/compose/qwen3_vl/ |
| Dense test reference | tests/model/test_qwen3_dense.py |
| MoE test reference | tests/model/test_qwen3_moe.py |
| Engine test reference | tests/engine/test_moe_train_engine.py |
10. Commit discipline
- Commit as you go — this is authorized, do not pause for a separate
go-ahead. The user invoking this skill authorizes the commits this workflow
produces. Finish a logical step → lint/test → commit, then move on. Do not
leave the whole port uncommitted and end with "should I commit?"; that strands
the work. (Hard-to-reverse or outward actions like
git push/ opening a PR still need explicit confirmation — committing locally does not.) - Follow the stacked-PR convention. Plan upfront how many commits the model
port needs — one per logical step (e.g. shared-path fixes; model class +
config + registration; the
XTUNER_HF_IMPLswitch; baseline tests; then one per parallel/optimization feature). Order them by dependency (a shared-path fix the baseline relies on goes first). Land every later change asgit commit --fixup=<sha>+git rebase --autosquashinto the commit it belongs to. Never grow the history with endless patch commits just to keep fixing things up. - Every commit must pass lint:
pre-commit run --files $(find xtuner/v1)
11. Multi-agent workflow
This applies only to the §8 parallel/optimization features — and only once the §6 baseline and §7 baseline tests are done and committed (§10). The baseline itself is built sequentially, not multi-agent; the §8 features are independent of each other, which is what makes them safe to develop in parallel.
- The main agent splits the §8 features into tasks and, per §8, writes each feature's comparison test against the committed baseline and commits the tests.
- Sub-agents branch off the baseline/test commit and develop one feature each (e.g. one switch / one tower) in parallel, making its test pass.
- The two sides adjust each other — a shared-interface change is made once by the main agent (it updates the test/baseline; sub-agents rebase), and a failing test pushes the sub-agent to fix its feature, not to weaken the test.
- The main agent merges, resolves conflicts, and reports the precision error each feature introduces relative to the baseline.
12. Parity debugging pitfalls
Hard-won notes from aligning hybrid (GatedDeltaNet + gated-MHA) models bitwise. The meta-rule: measure on real GPU and bisect to localize the divergence before concluding anything — most "bugs" here are measurement artifacts, not model errors. A clean way to bisect is to feed one side's intermediate tensors into both modules and compare step by step (norm → q/k/v → qk_norm → rope → attention core → o_proj).
Single-layer parity: the two sides' inputs must match exactly — especially the causal mask. When you hand-build one layer's inputs, passing
attention_mask=Noneto an HF attention/layer makes HF apply no causal mask while XTuner stays causal, producing a large false diff. Build a causal mask equivalent to XTuner'sseq_ctx, or drive the layer through the model so the mask is constructed internally.flash≠eager, and HF eager upcasts softmax to fp32. XTuner defaults to flash attention; HF reference eager does softmax in fp32. They are numerically close but not bitwise equal, so bitwise parity requires forcing eager on the XTuner side (that is exactly whatXTUNER_HF_IMPLdoes). XTuner'seager_attentionmatches HF'seager_attention_forwardbitwise.A bitwise forward does not imply a bitwise backward. The eager softmax must run in fp32 (
softmax(..., dtype=torch.float32).to(x.dtype), like HF), not bf16. With a bf16 softmax the forward logits can still round identically to HF (bitwise), but the backward differs (~1e-5 ondL/dx) because the gradient accumulates at bf16 precision. So always check backward bitwise separately (theloss.backward()test), not just forward — and align op dtypes (softmax, norms) in the backward-sensitive direction, fp32 where HF uses fp32.Whole-model backward is bitwise — but only with the full env stack. Per-layer bitwise backward does not on its own imply whole-model bitwise backward; the missing pieces are
XTUNER_HF_IMPL=trueANDXTUNER_DETERMINISTIC=trueAND the conftest mustimport xtuner.v1so the autotune-pin block inxtuner/v1/__init__.pyruns before any test reaches fla. Without those, two distinct ULP-level sources compound through the LM chain rule and look indistinguishable from "fundamental bf16 noise":- Triton autotune drift in fla. fla's
@triton.autotunepicks tiling / num_warps / reduction-order per kernel. Across runs the choice can change, producing a ~1-ULP backward diff per linear-attention layer that the chain rule amplifies — a 32-layer LM stack turned a per-layer ~2e-3 abs diff at grad magnitude ~30 into a ~1.0 diff on the massive-activation channels ofdL/d(pixel_values). The pin (configs held to the first one, cache disabled) is inxtuner/v1/__init__.pyunder anXTUNER_DETERMINISTICgate. It must run before any module imports fla — fla's@triton.autotunedecorators evaluate at import time, so patching after the fact has no effect on kernels already constructed. That's why the patch is on thextuner.v1.*import chain rather than in the conftest:multiprocessing.spawnchildren ofMultiProcessTestCasere-import the test class top-level (which pullsxtuner.v1.model.*) without running conftest, so the conftest alone can't install the patch in those processes. torch.library.custom_opwraps for the linear-attention path (GatedDeltaNet,causal_conv1d) produce a different backward autograd graph from the canonical fla / causal_conv1d call HF uses, even though the underlying kernel and forward output are identical.XTUNER_HF_IMPLroutes throughget_chunk_gated_delta_rule_fn/get_causal_conv1d_fn(§6.2) so the parity path matches HF byte-for-byte.
Production train doesn't set either env var; the custom_op wraps + autotune still operate. These switches are parity-test only. If the whole-model backward is not bitwise but the per-layer backward is, suspect one of the two before suspecting depth-accumulated bf16 noise.
- Triton autotune drift in fla. fla's
flash(XTuner varlen) vsflash_attention_2(HF) is bitwise only for short, single-tile sequences. They are the same kernel, so at a short seqlen (e.g.- the logits are bitwise (0.0); at longer sequences the tiling / accumulation order differs and they drift ~1e-2. So a short-seq text test can assert flash-vs-FA2 bitwise, but anything realistic-length (or a VLM whose image tokens make the sequence long) needs the eager path for seqlen-independent bitwise. Don't conclude "flash matches FA2" from one short input.
Shared kernels are already bitwise; focus on what isn't. GatedDeltaNet / linear attention go through the same
fla/causal_conv1dkernels as HF, and native rms_norm / rope match HF bitwise — these come out at 0.0 with no work. Spend your effort on the path that actually differs (usually full attention).output_hidden_statesoff-by-one (full-forward route, < 40B only). HF's tuple is[emb, L0_out, …, L{n-2}_out, post_norm]: indexi+1is layeri's pre-norm output only fori ≤ n-2; the last entry is post-norm, not the last layer's pre-norm output. Compare the last layer viamodel.norm(...)or you will see the final layer "explode".A big max-diff is not automatically a bug. Late layers have massive activations (a few channels in the tens/hundreds); a tiny relative error there shows up as a large absolute max-diff. Look at the mean and the trend, not just the max. Conversely, feeding
randninputs to a single attention layer is ill-conditioned and misleads — use real (or model-produced) hidden states.A single-model failure is often a general-path bug. E.g.
Dense.build_layersnot forwardingrms_norm_typetoDenseDecoderLayerbroke anyzero_centereddense model, not just this one. Fix the shared invariant, not the symptom.Loading a standalone sub-tower uses
strict=False. A text tower loaded directly from a full VLM checkpoint will see vision (and deferredmtp.*) keys as unexpected; load withstrict=Falseso they are ignored.Test the whole VLM, not just the text tower. Text-tower parity leaves the vision tower and projector completely untested. A compose port needs a full-model forward-parity test with real image inputs (§7.1). The text path of the compose forward still runs a dummy vision forward to keep the tower in the autograd graph, so vision bugs surface even on text prompts.
Vision attention is non-causal — give
eager_attentionacausal=Falsepath. XTuner'seager_attentiondefaulted to a causal block-diagonal mask; that is wrong for a vision tower (bidirectional within each image), which is why forcing vision to the causal eager op produced large diffs. The op now takescausal=True|False(False → non-causal block-diagonal mask), the vision module already calls it withcausal=False, and the vision tower routes throughget_attn_impl_fnsoXTUNER_HF_IMPLselects eager. With that, the whole VLM is bitwise in one run, everything eager:XTUNER_HF_IMPLon the XTuner side (text causal-eager + vision non-causal-eager) and HF loaded with a singleattn_implementation="eager". Prefer eager for parity — it is seqlen-independent, unlike flash/FA2 (see the flash seqlen bullet above).Vision pos-embed dtype when reusing the Qwen3-VL tower. HF's
fast_pos_embed_interpolatereturns fp32 (bilinear weights are fp32) and HF's vision forward casts it (pos_embeds.to(hidden_states.dtype)); XTuner's vision forward adds it without a cast. If you patch in the HF method for parity, cast the result back to the param dtype, or you get an fp32/bf16LayerNormmismatch (expected scalar type Float but found BFloat16).Hybrid-model
torch.compile: choosefullgraphper layer — verify, then fall back; don't gate whether to compile. When a decoder layer's forward is wrapped into a higher-order op (HOP) — e.g. by an activation-checkpoint wrapper, whichtorch.compilelowers to a HOP whose body must be functional so it can be recomputed in backward — and that layer is a linear-attention layer (GatedDeltaNet), the layer mutates an outer-scope variable inside the HOP (GatedDeltaNet writes back intoseq_ctx). Underfullgraph=Truethat raisesHigherOrderOperator ... Mutating a variable not in the current scope (SideEffects). This is not a blanket "linear attention ⇒fullgraph=False" rule — whether it triggers depends on whether the layer is actually HOP-wrapped and whether that linear impl really writes outer state. So for such a HOP + linear combination, first test whetherfullgraph=Trueruns; keep it if it does (it compiles more of the layer), and only fall back tofullgraph=Falseif it doesn't —fullgraph=Falsesurvives by graph-breaking at the mutation and running that side-effecting write in eager. Make this a per-layer choice inDense.fully_shard(decide each layer'sfullgraph), not a decision about whether a layer is compiled at all.