add-model

name: add-model
description: >-
  Adapt and port new LLM model architectures to this xinfer project. Use when the user asks to add, port, support, or adapt a new model (e.g. Llama, Gemma, Qwen, GPT-OSS, DeepSeek, or any HuggingFace architecture) including safetensors and GGUF formats, Dense and MoE architectures, and quantization formats (MXFP4, NVFP4, FP8, ISQ).

Add Model — Adapt New LLM Architectures to xinfer

Phase 0: Gather Required Information

Before starting, the agent must collect the inputs below. If any are missing, ask the user explicitly.

Resolution order (try top to bottom, stop at the first that succeeds):

1. Local model path (preferred): If the user provides a local directory containing safetensors, read config.json and inspect weight tensors directly from disk. If the user provides a .gguf file, extract metadata and tensor info from the GGUF header (GGUF is self-contained; there is no separate config.json). No further input is needed in either case.
2. HuggingFace model ID: Look for the model in the local HuggingFace Hub cache (~/.cache/huggingface/hub/). If not cached, fetch config.json from the HuggingFace model repo. If that also fails, fall back to step 3.
3. Manual input: ask the user to provide:
- config.json contents (for safetensors models) or GGUF metadata (for GGUF models).
- Weight tensor info (names, shapes, dtypes). The user can obtain this by clicking a weight file in the HuggingFace model repo.
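
The first two steps of the resolution order can be sketched as below. This is an illustration, not part of xinfer: `find_config` and its `hub_dir` parameter are hypothetical names, and the lookup assumes the standard Hub cache layout `models--<org>--<name>/snapshots/<revision>/`. A `None` result means the agent should try a network fetch or ask the user.

```python
from pathlib import Path
from typing import Optional


def find_config(source: str, hub_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the path to config.json for a local dir or a cached HF model ID."""
    # Step 1: treat `source` as a local model directory.
    local = Path(source).expanduser()
    if local.is_dir() and (local / "config.json").is_file():
        return local / "config.json"

    # Step 2: treat `source` as a HuggingFace model ID and look in the Hub cache.
    hub = hub_dir or (Path.home() / ".cache" / "huggingface" / "hub")
    snapshots = hub / ("models--" + source.replace("/", "--")) / "snapshots"
    if snapshots.is_dir():
        for snap in sorted(snapshots.iterdir()):
            cfg = snap / "config.json"
            if cfg.is_file():
                return cfg

    # Not found locally: caller falls back to fetching or manual input.
    return None
```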

Additionally, ask the user if they can provide the Python reference implementation (modeling_<arch>.py from HuggingFace Transformers). This is not strictly required, but significantly improves accuracy — it clarifies the exact forward pass, attention variants, MoE routing, activation functions, and normalization order that config fields alone cannot fully describe.

Input	How to obtain
HuggingFace model ID (e.g. `google/gemma-4-26B-A4B-it`)	User provides, or infer from context
Model config (`config.json`)	Fetch from HF: `https://huggingface.co/<id>/resolve/main/config.json` (the `blob/main` URL serves the HTML viewer page, not the raw file). Not needed if a local model path is provided.
HF tensor info (weight names + shapes)	User provides, or read from local safetensors with `scripts/inspect_weights.py` (create the script if it doesn't exist)
GGUF metadata + tensor info (if GGUF support needed)	User provides, or extract from local `.gguf` with `scripts/inspect_gguf.py` (create the script if it doesn't exist)
Python reference implementation (optional but recommended)	Fetch `modeling_<arch>.py` from the HuggingFace Transformers GitHub repo
Local model path (optional)	User provides path containing `config.json` + `.safetensors` or `.gguf`

Extracting info from local weights

For safetensors (Python required):

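# scripts/inspect_weights.py
# Usage: python scripts/inspect_weights.py <file.safetensors>
import json
import struct
import sys

path = sys.argv[1]
with open(path, "rb") as f:
    # A safetensors file starts with an 8-byte little-endian header length,
    # followed by that many bytes of JSON describing every tensor.
    n = struct.unpack("<Q", f.read(8))[0]
    header = json.loads(f.read(n))
# Print name, shape, and dtype per tensor, skipping the metadata entry.
for k, v in sorted(header.items()):
    if k != "__metadata__":
        print(f"{k}\t{v.get('shape')}\t{v.get('dtype')}")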
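
The same header also gives a quick size estimate, which helps judge whether the weights fit in GPU memory. The sketch below relies on the `data_offsets` pair that safetensors stores for each tensor; the function names are illustrative and not part of the project.

```python
import json
import struct


def read_safetensors_header(path):
    """Parse the JSON header at the start of a .safetensors file."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n))


def total_tensor_bytes(header):
    """Sum the byte size of every tensor using its [begin, end) data offsets."""
    total = 0
    for name, info in header.items():
        if name == "__metadata__":
            continue
        begin, end = info["data_offsets"]
        total += end - begin
    return total
```

For sharded checkpoints, sum the result over all shard files.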

For GGUF, use the gguf_helper CLI or Python gguf library to list tensors and metadata keys.
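
If neither tool is at hand, the fixed part of the GGUF header is easy to read directly. The following is a minimal sketch for a hypothetical first step of `scripts/inspect_gguf.py`; it assumes a little-endian GGUF v2 or v3 file (v1 used 32-bit counts) and does not parse the metadata key/value pairs or tensor descriptors that follow.

```python
import struct


def read_gguf_header(path):
    """Read magic, version, tensor count, and metadata KV count from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        # Assumption: v2+ layout, where both counts are unsigned 64-bit integers.
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```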

Phase 1: Analyze the Architecture

Read and compare:

config.json — identify: architectures, hidden_size, num_hidden_layers, num_attention_heads, num_key_value_heads, head_dim, intermediate_size, hidden_act, vocab_size, rope_theta, partial_rotary_factor, sliding_window, tie_word_embeddings.
MoE fields (if any): num_experts, num_experts_per_tok / top_k_experts, moe_intermediate_size, norm_topk_prob, routed_scaling_factor.
Special features: attention_k_eq_v, layer_types array, layer_scalar, final_logit_softcapping, attn_logit_softcapping, dual head_dim, dual RoPE, per-expert scaling, shared experts, etc.
Python reference — understand the forward pass, especially:
- Attention: GQA/MQA, head dim, RoPE variant, sliding window, QK norm, V norm
- MLP: standard SiLU-gate, GeluTanh, custom SwiGLU, etc.
- MoE: router structure, expert computation, parallel dense+MoE, renormalization
- Layer structure: pre-norm vs post-norm, residual connections, layer scalar

Identify which existing model in src/models/ is the closest match. Use it as a starting template.
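
A small helper can pull the attention geometry out of config.json before reading the reference code. This is a sketch under two assumptions that are common HF conventions rather than guarantees: `head_dim` falls back to `hidden_size // num_attention_heads` when absent, and multimodal wrappers nest the language model under `text_config`. The function name is illustrative.

```python
def summarize_text_config(cfg):
    """Derive head_dim and the attention variant from a parsed config.json dict."""
    # Multimodal wrappers usually nest the language-model fields.
    text = cfg.get("text_config", cfg)
    heads = text["num_attention_heads"]
    kv_heads = text.get("num_key_value_heads") or heads
    # Assumption: a missing or null head_dim means hidden_size / num_heads.
    head_dim = text.get("head_dim") or text["hidden_size"] // heads

    if kv_heads == heads:
        kind = "MHA"
    elif kv_heads == 1:
        kind = "MQA"
    else:
        kind = "GQA"

    return {
        "head_dim": head_dim,
        "attention": kind,
        "q_heads_per_kv_head": heads // kv_heads,
        "tied_embeddings": bool(text.get("tie_word_embeddings", False)),
    }
```

Models with dual head_dim or per-layer attention types need the reference implementation; a single summary like this cannot capture them.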

Architecture classification

Type	Characteristics	Example models
Dense	Standard transformer, no MoE	Llama, Gemma3, Phi4
MoE	Router + experts per layer	Qwen3-MoE, Gemma4-MoE
Hybrid MoE	Dense MLP + MoE in parallel per layer	Gemma4 (dense+MoE parallel)
Multimodal	Vision encoder + language model	Gemma3-VL, Qwen3-VL
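
A first-pass classification can be read off the config, as in the sketch below. It assumes `vision_config` marks a multimodal wrapper (an HF convention, not something this project defines) and uses the `num_experts` field listed earlier. It cannot distinguish Hybrid MoE from plain MoE, since that depends on the forward pass in the reference implementation.

```python
def classify_architecture(cfg):
    """Rough architecture type from a parsed config.json dict."""
    # Assumption: multimodal wrappers carry a vision_config block.
    if "vision_config" in cfg:
        return "Multimodal"
    text = cfg.get("text_config", cfg)
    if (text.get("num_experts") or 0) > 0:
        # May still be Hybrid MoE; check the reference forward pass.
        return "MoE"
    return "Dense"
```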

Phase 2: Check if New Kernels Are Needed

If the model requires operators not in attention.rs or src/models/layers/:

Clone attention.rs to a local sibling directory (if not already present):
```
cd .. && git clone https://github.com/guoqingbao/attention.rs.git
```
Then check out the commit that this project depends on.

Point xinfer/Cargo.toml to local attention.rs:

# In [dependencies], change the git URL to:
attention-rs = { path = "../attention.rs", ... }

Add both CUDA and Metal kernel implementations:
- CUDA kernel: attention.rs/src/kernels/src/<name>.cu
- Metal kernel: attention.rs/src/metal-kernels/src/<name>.metal
- FFI bindings: attention.rs/src/kernels/src/ffi.rs
- Rust wrapper: attention.rs/src/<name>.rs
- Update attention.rs/src/lib.rs: add pub mod <name>;
- Update attention.rs/src/kernels/build.rs: add rerun-if-changed for .cu
- Update attention.rs/src/metal-kernels/build.rs: add to METAL_SOURCES
For CUDA BF16 kernels (Ampere+ only), guard with #ifndef NO_BF16_KERNEL and provide F16 fallback or dummy stubs for older GPUs.
Verify compilation: cargo check --features metal (macOS) or cargo check --features cuda (Linux).

Phase 3: Implement the Model

3a. Create model file

Create src/models/<arch>.rs. Follow the established pattern from the closest existing model:

Key struct pattern:

pub struct <Arch>DecoderLayer {
    self_attn: Attention,
    mlp: MLP,
    moe: Option<<Arch>MoE>,      // if MoE
    input_layernorm: NormX,
    post_attention_layernorm: NormX,
    // ... additional norms for hybrid MoE
}

pub struct <Arch>ForCausalLM {
    embed_tokens: candle_nn::Embedding,
    layers: Vec<<Arch>DecoderLayer>,
    norm: NormX,
    lm_head: ReplicatedLinear,
    // ...
}

Required public methods on <Arch>ForCausalLM:

pub fn new(vb: &VarBuilderX, comm: Rc<Comm>, config: &Config, dtype: DType,
           is_rope_i: bool, device: &Device,
           progress_reporter: Arc<RwLock<Box<dyn ProgressLike>>>) -> Result<Self>

pub fn forward(&self, input_ids: &Tensor, positions: &Tensor,
               kv_caches: Option<&Vec<(Tensor, Tensor)>>,
               input_metadata: &InputMetadata,
               _embeded_inputs: bool) -> Result<Tensor>

pub fn forward_embedding(&self, ..same args..) -> Result<Tensor>

pub fn get_vocab_size(&self) -> usize

The fifth argument, _embeded_inputs: bool, is required by the model_call! macro.

3b. Weight loading paths

The model must support two VarBuilder types via vb.is_qvar_builder():

Path	VarBuilder	Weight prefix pattern
HF safetensors	`Either::Left`	`language_model.model.layers.{i}` (multimodal) or `model.layers.{i}`
GGUF	`Either::Right`	`model.layers.{i}` (maps from `blk.{i}`)

For norm and lm_head:

Component	HF prefix	GGUF prefix
Final norm	`language_model.model.norm`	`model.norm`
LM head (untied)	`lm_head`	`model.output`
LM head (tied)	`language_model.model.embed_tokens`	`model.embed_tokens`
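
Before wiring up prefixes, it helps to confirm which one the checkpoint actually uses. The sketch below works on the tensor names printed by the inspection scripts; both function names are illustrative. The GGUF rename covers only the `blk.{i}` layer prefix from the table above, not the per-tensor suffixes.

```python
import re

# Prefixes from the tables above, plus the raw GGUF block prefix.
CANDIDATE_PREFIXES = ("language_model.model.layers", "model.layers", "blk")


def detect_layer_prefix(tensor_names):
    """Return the first known layer prefix that appears in the tensor names."""
    for prefix in CANDIDATE_PREFIXES:
        pattern = re.compile(r"^" + re.escape(prefix) + r"\.\d+\.")
        if any(pattern.match(name) for name in tensor_names):
            return prefix
    return None


def gguf_layer_to_model_layer(name):
    """Rewrite a leading blk.{i}. to model.layers.{i}. and leave the rest as-is."""
    return re.sub(r"^blk\.(\d+)\.", r"model.layers.\1.", name)
```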

3c. MoE construction

For models with MoE, follow the Gemma4MoE / FusedMoe selection pattern:

let moe = if is_qvar_builder {
    FusedMoeGGUF::new(config, vb.clone(), comm.clone(), dtype)?
} else if quant_config == "mxfp4" {
    FusedMoeMxfp4::new(config, vb.pp("mlp"), comm.clone(), dtype)?
} else if config.quant.is_some() {
    FusedMoeISQ::new(config, vb.pp("mlp"), comm.clone(), dtype)?
} else {
    FusedMoe::new(config, vb.pp("mlp"), comm.clone(), dtype)?
};

Important: If the model's router gate is NOT at the standard mlp.gate path (e.g. Gemma4 uses router.proj), use FusedMoe::new_with_gate(config, gate_vb, experts_vb, ...) and FusedMoeISQ::new_with_gate(...).

For GGUF models with packed ffn_gate_up_exps (instead of separate ffn_gate_exps + ffn_up_exps), FusedMoeGGUF::new() auto-detects and handles this.

3d. Packed expert layout

If the model uses packed gate_up_proj with a non-standard layout, add the architecture name to the layout resolvers in src/models/layers/moe.rs:

resolve_packed_gate_up_layout() — InterPacked if shape is [experts, 2*intermediate, hidden]
resolve_packed_down_layout() — HiddenInter if shape is [experts, hidden, intermediate]
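
The two shape rules can be checked against the inspected tensor shapes before touching moe.rs. This Python sketch only mirrors the rules stated above. The function name and the "other" label are illustrative, and the real resolvers may recognize more layouts.

```python
def classify_packed_shapes(gate_up_shape, down_shape, experts, hidden, intermediate):
    """Map packed expert tensor shapes to the layout names used above."""
    if tuple(gate_up_shape) == (experts, 2 * intermediate, hidden):
        gate_up = "InterPacked"
    else:
        gate_up = "other"  # placeholder: layout not covered by this sketch

    if tuple(down_shape) == (experts, hidden, intermediate):
        down = "HiddenInter"
    else:
        down = "other"  # placeholder: layout not covered by this sketch

    return gate_up, down
```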

Phase 4: Register the Model

4a. `src/models/mod.rs`

pub mod <arch>;

4b. `src/utils/config.rs`

Add variant to ModelType enum:

pub enum ModelType {
    // ...
    <Arch>,
}

4c. `src/utils/mod.rs` — `get_arch_rope` function

Add mappings in order:

GGUF canonical_arch: "<gguf_arch>" => "<HFArchitectureName>".to_string()
Architecture to ModelType + chat template: Add match arm for HF architecture names
Rope type selection: ("<arch_lower>", false) in the rope map
HF config parsing (multimodal wrapper): If the model wraps text_config, add handler to extract nested config, MoE config, and rope_parameters
GGUF extra_config_json: If the model has special GGUF metadata (sliding_window_pattern, layer_types, etc.), add extraction block
GGUF MoE config: Add moe_cfg construction from GGUF expert metadata
hidden_act override: If the model uses a non-Silu activation, override after config construction
require_model_penalty(): Add architecture names
Multimodal GGUF warning: Add if applicable

4d. `src/core/runner.rs`

Add use crate::models::<arch>::<Arch>ForCausalLM;
Add <Arch>(Arc<<Arch>ForCausalLM>) to Model enum
Add <Arch> => <Arch>ForCausalLM to build_model! macro
Add <Arch> => EmbedInputs to graph_wrapper! macro (or ImageData for multimodal)
Add <Arch> => false to both model_call! invocations (or true / image handling for multimodal)
Add ModelType::<Arch> to disable_flash_attn if needed
Add Model::<Arch>(model) => model.get_vocab_size() to get_vocab_size

4e. `src/server/parser.rs`

Add ModelType::<Arch> to ToolConfig::for_model_type()
Add ModelType::<Arch> to parser_name_for_model()
Add ModelType::<Arch> to structured output format handling

Phase 5: Build and Verify

Build commands

Platform	Command
macOS (Metal)	`cargo build --release --features metal`
CUDA (basic)	`cargo build --release --features "cuda,flashinfer"`
CUDA (full)	`cargo build --release --features "cuda,flashinfer,nccl"`
CUDA (sm90+)	Add `cutlass` feature: `--features "cuda,flashinfer,nccl,cutlass"`

If permission errors occur on target/, use: CARGO_TARGET_DIR=/tmp/xinfer-check cargo check --features metal

Verify compilation

# Check xinfer compiles
cargo check --features metal   # or cuda

# Check attention.rs compiles (if modified)
cd ../attention.rs && cargo check --features metal   # or cuda

Fix all errors and warnings before proceeding.

Phase 6: Check Model Compatibility

Before testing, invoke the check-model skill (.cursor/skills/check-model/SKILL.md) to validate the new model's tensor format and multi-rank compatibility.

Provide:

The model's config.json (local path or HuggingFace URL)
Weight tensor info (names, shapes, dtypes) — from a local safetensors file or HuggingFace model page

The check-model skill will verify:

Tensor names match the loader's expected format (e.g. weight_packed vs weight vs blocks for FP4)
Quantized vs unquantized layer detection matches the ignore list in quantization_config
All TP-sharded dimensions are divisible by common GPU counts (1, 2, 4, 8)
FP8 block alignment and FP4 scale group alignment are correct for multi-rank
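
The divisibility and alignment checks amount to simple modular arithmetic, sketched below. This is not the check-model skill's code: the function names are illustrative, and the default block size of 128 is an assumption that should be replaced with the value from the model's quantization_config.

```python
def check_tp_divisibility(dims, world_sizes=(1, 2, 4, 8)):
    """Return (name, value, world_size) for every dimension that does not split evenly."""
    problems = []
    for name, value in dims.items():
        for ws in world_sizes:
            if value % ws != 0:
                problems.append((name, value, ws))
    return problems


def shard_is_block_aligned(value, world_size, block=128):
    """True if each rank's shard is a whole number of quantization blocks."""
    # Assumption: block=128 is only a placeholder default.
    if value % world_size != 0:
        return False
    return (value // world_size) % block == 0
```

For example, `intermediate_size = 11008` splits evenly across 8 ranks, but each 1376-wide shard is not a multiple of 128, so block-quantized weights would straddle rank boundaries.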

Fix any [ERROR] findings before proceeding to the test phase. [WARN] items should be reviewed but may not block loading.

Phase 7: Test the Model

Start the server

Single-GPU:

# Metal (macOS)
cargo run --release --features metal -- --m <model_id_or_path> --port 8080 # or use --w to specify local model path

# CUDA
cargo run --release --features "cuda,flashinfer,cutlass" -- --m <model_id_or_path> --port 8080

Multi-GPU (CUDA):

# --d specifies the device IDs
xinfer --m <model_id_or_path> --port 8080 --d 0,1

Before retrying model loading

Always kill all previous instances and verify GPU memory is freed:

# Kill all xinfer and runner processes
pkill -f xinfer
sleep 2

# Check GPU memory (CUDA)
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader

# Check GPU activity (Metal); powermetrics reports GPU utilization and power rather than memory use
sudo powermetrics --samplers gpu_power -i 1000 -n 1 | grep 'GPU'

Ensure the target GPU(s) have sufficient free memory for the model before loading.
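
The nvidia-smi output above is easy to check from a script. The sketch below parses the `csv,noheader` format, where each line looks like `1024 MiB, 23000 MiB`. The function names and the required-memory threshold are illustrative.

```python
def parse_nvidia_smi(output):
    """Parse 'used, free' lines (MiB) from nvidia-smi csv,noheader output."""
    gpus = []
    for line in output.strip().splitlines():
        # Each field looks like "1024 MiB"; keep only the number.
        used, free = (int(field.strip().split()[0]) for field in line.split(","))
        gpus.append({"used_mib": used, "free_mib": free})
    return gpus


def gpus_with_room(gpus, required_mib):
    """Indices of GPUs whose free memory meets the requirement."""
    return [i for i, g in enumerate(gpus) if g["free_mib"] >= required_mib]
```

The output string can come from `subprocess.run(..., capture_output=True, text=True).stdout` using the same nvidia-smi arguments as above.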

Send test requests

# Basic completion; use the port the API server was started with (8080 in the commands above; the default is 8000)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_id>",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 64,
    "temperature": 0.7
  }' | python3 -m json.tool

# Streaming
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_id>",
    "messages": [{"role": "user", "content": "Write a haiku about Rust."}],
    "max_tokens": 64,
    "stream": true
  }'
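
To check a streamed response programmatically, the chunks can be reassembled into the final text. The sketch below assumes the server emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), which the `/v1/chat/completions` path suggests but this document does not confirm. The function name is illustrative.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from OpenAI-style SSE lines until [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

An empty result from a stream that otherwise completes usually points at the chat template or parser registration rather than the model weights.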

Debugging common issues

Symptom	Likely cause	Fix
Panic at weight loading	Wrong VarBuilder prefix or tensor shape mismatch	Check HF vs GGUF prefix mapping; verify tensor shapes match config
NaN/Inf in output	Missing activation, wrong dtype, or norm misconfiguration	Verify `hidden_act`, check if model needs GeluPytorchTanh vs Silu; check norm dtype (F32 for GGUF/ISQ)
Gibberish output	Wrong RoPE params, wrong head_dim, or tied embeddings misconfigured	Verify `rope_theta`, `partial_rotary_factor`, `head_dim`, `tie_word_embeddings`
Server crash on decode	KV cache shape mismatch or sliding window misconfigured	Verify `head_dim` per attention layer matches KV cache allocation
CUDA OOM	Model too large for GPU	Use ISQ quantization (`--quant q4k`), or spread the model across GPUs with `--d` (e.g. `--d 0,1`)
Slow performance	Flash attention disabled or wrong features	Ensure `flashinfer` feature is enabled; check `disable_flash_attn` isn't matching your model

Precision validation

Compare outputs against the Python reference:

Run the same prompt through both the Python HF model and xinfer
Compare logits for the first few tokens (top-5 should match)
For MoE models, verify expert routing produces the same top-k experts
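
The top-k comparison can be done with plain lists once the logits (or router scores) from both sides have been dumped. This is a minimal sketch with illustrative function names. How the values are exported from the HF model and from xinfer is left open.

```python
def top_k_indices(values, k):
    """Indices of the k largest values, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def top_k_match(ref, test, k=5):
    """True if both vectors select the same set of top-k indices."""
    return set(top_k_indices(ref, k)) == set(top_k_indices(test, k))


def max_abs_diff(ref, test):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(a - b) for a, b in zip(ref, test))
```

Comparing sets rather than ordered lists tolerates small rank swaps from BF16 or quantization noise. For expert routing, pass the router scores with `k` set to the model's experts-per-token value.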

Quick Reference: Key Files

File	Purpose
`src/models/<arch>.rs`	Model implementation
`src/models/mod.rs`	Module registration
`src/models/layers/moe.rs`	MoE layer (FusedMoe, FusedMoeGGUF, FusedMoeISQ, FusedMoeMxfp4)
`src/models/layers/attention.rs`	Attention layer (handles GQA, RoPE, sliding window)
`src/models/layers/mlp.rs`	Dense MLP layer
`src/models/layers/rotary_emb.rs`	RoPE implementations (standard, scaling, YaRN)
`src/models/layers/mod.rs`	VarBuilderX definition
`src/utils/config.rs`	Config struct, ModelType enum, MoEConfig
`src/utils/mod.rs`	Config parsing (HF + GGUF), architecture registration, chat templates
`src/core/runner.rs`	Model enum, build_model!, model_call!, graph_wrapper! macros
`src/server/parser.rs`	Tool parsing, structured output per model type
`Cargo.toml`	Feature flags: metal, cuda, flashinfer, nccl, cutlass
`build.sh`	Build script