track-models

name: track-models description: Track popular/new open-source LLMs and update docs/model_coverage.mdx with their kernel support status. Use when discovering new models to add to the coverage tracker, checking if a specific model is covered, or refreshing model coverage documentation.

Track Models

Discover popular or newly released open-source LLMs, determine their kernel coverage in FlashInfer-Bench, and update docs/model_coverage.mdx with summary and detailed per-kernel status tables.

Usage

# Auto-discover new popular models not yet tracked
/track-models --discover

# Add a specific model by name
/track-models --model-name mistral-7b
/track-models --model-name gemma-3-27b --hf-repo-id google/gemma-3-27b-it

# Refresh coverage status for all already-tracked models
/track-models --refresh-status

# Do everything: discover new models + refresh existing ones
/track-models --discover --refresh-status

Parameters

--discover (optional): Auto-discover new popular models not yet in model_coverage.mdx
--model-name (optional): Specific model to add or refresh (e.g., "mistral-7b", "gemma-3-27b")
--hf-repo-id (optional): Override the HuggingFace repo ID (e.g., "google/gemma-3-27b-it")
--refresh-status (optional): Re-check definition files for all already-tracked models and update ✅/❌ status

Prerequisites

docs/model_coverage.mdx must exist (it does — managed by this skill)
tmp/flashinfer-trace/definitions/ must exist with current definition JSON files (this is the HuggingFace dataset clone — populated by /clone-repos; the in-repo flashinfer_trace/ directory was removed in the trace-dataset refactor)
Run /clone-repos first if you need SGLang or sgl-cookbook configs for a model that isn't in the existing patterns, or to ensure the trace dataset clone is current

What This Skill Does

Phase 1: Model Discovery

When --discover is set, find popular open-source LLMs not yet tracked in model_coverage.mdx.

Discovery Sources (in order)

SGLang day-0 additions (highest priority — newly supported models):
```
# Models added to SGLang in the last 30 days (new .py files in models dir)
git -C tmp/sglang log --since="30 days ago" --name-status --diff-filter=A \
    -- "python/sglang/srt/models/*.py" | grep "^A" | awk '{print $2}'
```
These are models with brand-new SGLang support — they are the most likely to have novel kernels not yet tracked. Parse each new filename to derive the model slug.
SGLang supported models list (full catalog):
```
ls tmp/sglang/python/sglang/srt/models/
```
SGLang models are a curated list of production-quality LLMs. Every model in SGLang is a candidate.
sgl-cookbook (models with recommended serving configs):
```
ls tmp/sgl-cookbook/data/models/generated/v0.5.6/
```
Models with a YAML config are actively deployed — highest priority.

Inference API provider model lists (strong popularity signal): A model served by 2+ commercial inference APIs is definitively production-critical. Fetch or browse the following provider pages and extract the open-weight model list:

Provider	URL	Notes
Together AI	https://www.together.ai/pricing	Full model catalog with pricing tiers
Fireworks AI	https://fireworks.ai/pricing	Open-source models section
Groq	https://groq.com/pricing	Smaller curated list, very high-traffic models
OpenRouter	https://openrouter.ai/models	Largest aggregator; filter by `free` or sort by throughput
DeepInfra	https://deepinfra.com/models	Broad open-source catalog
Hyperbolic	https://app.hyperbolic.xyz/models	GPU-native provider
Cerebras	https://inference.cerebras.ai	Wafer-scale inference, limited but popular subset

Heuristic: Any model listed by ≥ 2 providers should be added to coverage tracking. Models on ≥ 3 providers are the highest priority (most widely deployed in production).

Known high-priority models from provider crawls (update this list over time):

Llama 4 Scout (17Bx16E) — Together, Groq, sgl-cookbook
Llama 4 Maverick (17Bx128E MoE) — Together, Groq
Llama 3.1 405B — Together, Fireworks
gpt-oss 120B / gpt-oss 20B — Together, Fireworks, Groq (OpenAI open-source)
GLM-4.5 / GLM-4.6 / GLM-5 — Together, Fireworks; sgl-cookbook has glm46.yaml
MiniMax M2 / M2.5 — Together, Fireworks
Kimi K2 / K2.5 — Together, Fireworks, Groq
DeepSeek-V3.1 / R1-0528 — Together, Fireworks (updated DeepSeek variants)
Qwen3 235B A22B — Together, Fireworks, Groq
Mistral Small 3 (24B) — Together; sgl-cookbook has mistral.yaml

HuggingFace website (primary web source — fetch directly): Use WebFetch to retrieve the HuggingFace trending models page and extract newly released open-weight LLMs. This gives the most up-to-date signal for models gaining traction before they appear on inference provider lists.
```
# Fetch trending models (text/LLM category)
WebFetch: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending
```
Also check the HuggingFace "new" filter for models released in the last 7 days:
```
WebFetch: https://huggingface.co/models?pipeline_tag=text-generation&sort=created
```
Focus on models that are:
- Open-weight (not API-only or gated without public access)
- From established labs (Meta, Google, Mistral, Alibaba/Qwen, DeepSeek, Moonshot, etc.)
- Have ≥ 1K downloads/week or trending in the top 50
- Transformer-based decoder (not vision-only or encoder-only)
For each candidate, fetch its model page to confirm the architecture:
```
WebFetch: https://huggingface.co/{org}/{model}/blob/main/config.json
```
HuggingFace trending (supplemental web search): Search for "huggingface trending LLM open source {current_year}" to find newly popular models. Focus on models that are:
- Open-weight (not API-only)
- LLM/VLM with transformer-based decoder
- Have ≥ 1K downloads/week or are from a major lab (Meta, Google, Mistral, Alibaba, etc.)

Filtering Already-Tracked Models

Read the current docs/model_coverage.mdx and extract all model names from the ## Summary table. Skip any model whose name is already listed.

Architecture Priority

Prefer to discover models that use kernel types already supported by FlashInfer-Bench:

GQA attention (most LLMs)
MLA attention (DeepSeek family)
GDN linear attention (Qwen3-Next family)
MoE FFN (Mixtral, Qwen-MoE, DeepSeek, etc.)
Mamba/SSM (NemotronH, GraniteMoe-Hybrid)
Dense FFN (Llama, Gemma, Mistral)

Phase 2: Architecture Extraction

For each new model, extract architecture details. Use the same approach as the extract-kernel-definitions skill.

Step 1: Get config.json from HuggingFace

from huggingface_hub import hf_hub_download
import json

config_path = hf_hub_download(repo_id="meta-llama/Llama-3.1-8B", filename="config.json")
with open(config_path) as f:
    config = json.load(f)

Key fields to extract:

Field	Used For
`hidden_size`	RMSNorm definition names, GEMM K dim
`num_hidden_layers`	Layer count
`num_attention_heads`	GQA/MLA q_heads
`num_key_value_heads`	GQA kv_heads
`head_dim` or `hidden_size / num_attention_heads`	Attention head dim
`intermediate_size`	GEMM N dim for MLP
`vocab_size`	Sampling definition names
`architectures`	Model class name → infer op types
`num_experts` / `num_local_experts`	MoE expert count
`num_experts_per_tok`	MoE topk

Step 2: Determine attention type from architecture

Use the architectures field in config.json or the SGLang model class:

Architecture String	Attention Type	Notes
`LlamaForCausalLM`, `MistralForCausalLM`, `GemmaForCausalLM`, `Qwen2ForCausalLM`	GQA	Standard
`DeepseekV2ForCausalLM`, `DeepseekV3ForCausalLM`	MLA	Has `q_lora_rank`, `kv_lora_rank`, `qk_rope_head_dim`
`Qwen3NextForCausalLM`	GDN + GQA hybrid	Has `gdn_*` config keys
`NemotronHForCausalLM`	GQA + Mamba2 hybrid	Has `ssm_*` config keys
`MixtralForCausalLM`	GQA + MoE	Has `num_local_experts`

Step 3: Find sgl-cookbook TP/EP config

find tmp/sgl-cookbook/data/models/generated/v0.5.6/ -name "*.yaml" | xargs grep -l "{model_keyword}"

Parse the YAML to extract all unique tp and ep values. Use these to compute per-TP head counts.

If no sgl-cookbook config exists, use TP=1 (single GPU baseline).

Phase 3: Map Kernels to Definitions

For each model, compute the full list of expected definition names, then check which ones exist in tmp/flashinfer-trace/definitions/ (the HF dataset clone — the only location where definition JSONs live).

3a: Compute expected definitions

Follow these rules (same as CLAUDE.md):

RMSNorm (not TP-dependent):

rmsnorm_h{hidden_size}
fused_add_rmsnorm_h{hidden_size}
If MLA: also rmsnorm_h{q_lora_rank} and rmsnorm_h{kv_lora_rank}

GEMM (not TP-dependent):

gemm_n{intermediate_size}_k{hidden_size} (gate/up proj)
gemm_n{hidden_size}_k{intermediate_size} (down proj)
gemm_n{hidden_size}_k{hidden_size} (o_proj, if square)
For MLA: gemm_n{q_lora_rank * num_heads}_k{hidden_size} etc. (check SGLang impl)

GQA (TP-dependent, per TP value):

gqa_paged_prefill_causal_h{q//TP}_kv{kv//TP}_d{head_dim}_ps1
gqa_paged_prefill_causal_h{q//TP}_kv{kv//TP}_d{head_dim}_ps64
gqa_paged_decode_h{q//TP}_kv{kv//TP}_d{head_dim}_ps1
gqa_paged_decode_h{q//TP}_kv{kv//TP}_d{head_dim}_ps64
gqa_ragged_prefill_causal_h{q//TP}_kv{kv//TP}_d{head_dim}

MLA (TP-dependent, per TP value):

mla_paged_prefill_causal_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_ps1
mla_paged_prefill_causal_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_ps64
mla_paged_decode_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_ps1
mla_paged_decode_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_ps64

Where ckv_dim = kv_lora_rank + qk_rope_head_dim and kpe_dim = qk_rope_head_dim.

DSA (for DeepSeek V3.2-style, TP-dependent):

dsa_topk_indexer_fp8_h{num_heads//TP}_d{head_dim}_topk{topk}_ps64
dsa_sparse_attention_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_topk{topk}_ps1
dsa_sparse_attention_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_topk{topk}_ps64

GDN (for Qwen3-Next-style, TP-dependent, per TP value):

gdn_prefill_qk{q_heads//TP}_v{v_heads//TP}_d{head_dim}_k_last
gdn_decode_qk{q_heads//TP}_v{v_heads//TP}_d{head_dim}_k_last
gdn_mtp_qk{q_heads//TP}_v{v_heads//TP}_d{head_dim}_k_last

Mamba2 SSU (for NemotronH-style, TP-dependent):

mamba_ssu_decode_h{nheads//TP}_d{head_dim}_s{dstate}_ng{ngroups//TP}
Constraints: head_dim ∈ [64,128,256], dstate ∈ [64,128,256], nheads/ngroups ∈ [1,8,16]

MoE (EP-dependent):

moe_fp8_block_scale_ds_routing_topk{topk}_ng{n_group}_kg{topk_group}_e{num_experts//EP}_h{hidden_size}_i{intermediate_size}

RoPE (not TP-dependent):

rope_with_cos_sin_cache_neox_style_d{head_dim}_rd{rotary_dim}
rope_with_cos_sin_cache_gptj_style_d{head_dim}_rd{rotary_dim}

Sampling (not TP-dependent):

top_k_sampling_from_probs_v{vocab_size}
top_k_top_p_sampling_from_probs_v{vocab_size}
top_p_sampling_from_probs_v{vocab_size}

3b: Check definition existence

For each expected definition name, check:

find tmp/flashinfer-trace/definitions/ -name "{definition_name}.json"

Assign status:

✅ — JSON file exists in tmp/flashinfer-trace/definitions/
❌ — Name computed from model config, but no JSON file found (missing, needs to be created)
— — Module exists in the model architecture but definition is not computed/mapped (unmapped)

3c: Compute coverage summary

coverage = (count of ✅) / (count of ✅ + count of ❌)

✅ Fully covered — all expected definitions exist (0 ❌)
🟡 Partial — some definitions exist, some missing
❌ Not covered — no definitions exist (all ❌ or —)

Phase 4: Update `docs/model_coverage.mdx`

4a: Add row to Summary table

Find the ## Summary section and add (or update) the model row:

| {Model Display Name} | {Architecture description} | {Coverage emoji + label} |

Architecture description examples:

GQA + Dense (standard transformer)
GQA + MoE (standard attention + mixture of experts)
MLA + Dense/MoE (multi-head latent attention)
GDN + GQA + MoE (hybrid linear + standard attention)
GQA + Mamba2 (hybrid attention + SSM)

4b: Add detailed section

Append (or update) a new ## {Model Display Name} section before the closing ---:

## {Model Display Name}

**Architecture**: {num_layers} decoder layers, {attention_type} attention, {ffn_type} FFN{extra details}

{If multiple TP configs, add note}: Standard serving configuration: **TP={N}** (or TP={N} / TP={M}).

| Definition | Op Type | Status |
|-----------|---------|:------:|
| `rmsnorm_h{hidden_size}` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h{hidden_size}` | rmsnorm | ✅ |
| ... | ... | ... |
| MoE gate / topk / experts | moe | — |

**Coverage**: {N} / {M} definitions present. {Optional: note about missing definitions.}

---

4c: Preserve existing sections

Do NOT overwrite existing model sections unless --refresh-status is set. When refreshing:

Re-check all ✅/❌ in the existing table
Update the coverage count line
Update the summary table row

Documentation Standards

Model Display Names

Use consistent naming matching the summary table:

DeepSeek V3 / R1 (joint entry for same-architecture variants)
Llama 3.1 8B (include size)
Qwen3 30B A3B (include size + variant)
Mistral 7B v0.3
Gemma 3 27B

Architecture Descriptions

Keep brief (5–8 words), following existing patterns:

{layer_count} decoder layers, {attention} attention, {ffn} FFN
Add hybrid prefix for mixed architectures (e.g., 48 layers, hybrid GDN+GQA attention, MoE FFN)

TP/EP Notes

When multiple TP configs exist, show all in the table using separate rows with TP label in Op Type column:

| `gdn_prefill_qk16_v32_d128_k_last` | gdn TP=1 | ❌ |
| `gdn_prefill_qk8_v16_d128_k_last` | gdn TP=2 | ✅ |
| `gdn_prefill_qk4_v8_d128_k_last` | gdn TP=4 | ✅ |

For MoE with EP:

| `moe_..._e256_...` | moe EP=1 | ❌ |
| `moe_..._e32_...` | moe EP=8 | ✅ |

Unmapped Modules

For modules that exist in the architecture but don't have definitions mapped, use —:

| MoE gate / topk / experts | moe | — |

Output

After running, report:

Models added: List of new models added to model_coverage.mdx
Models refreshed: List of models whose coverage status was updated
Missing definitions: List of ❌ definitions that need to be created (suggest running /extract-kernel-definitions)
Summary delta: What changed in the summary table (newly fully covered, newly partial, etc.)

Common Model Architectures Reference

GQA Standard (Llama / Mistral / Gemma / Qwen2.5)

Expected definitions:

2× RMSNorm
4× GEMM (qkv_proj, o_proj, gate_up, down)
5× GQA (paged decode ps1, paged decode ps64, paged prefill ps1, paged prefill ps64, ragged prefill)
3× Sampling (top_k, top_k_top_p, top_p)

GQA + MoE (Mixtral / Qwen2-MoE / Qwen3-MoE)

Expected definitions: same as GQA Standard, plus:

1× MoE (per EP config)

MLA + Dense/MoE (DeepSeek V2/V3/R1)

Expected definitions:

4× RMSNorm (hidden, q_lora, kv_lora, plus sometimes kv_lora variants)
N× GEMM (MLA has more projections: qkv_a, q_b, kv_b, o_proj, gate/up/down)
4× MLA (paged decode ps1, paged decode ps64, paged prefill ps1, paged prefill ps64)
1× MLA ragged (prefill)
1× MoE (per EP config)
3× Sampling

GDN + GQA + MoE (Qwen3-Next)

Expected definitions:

2× RMSNorm
3× GDN per TP config (prefill, decode, mtp)
5× GQA per TP config
1× MoE
(Sampling: if tracked)

GQA + Mamba2 (NemotronH / GraniteMoe-Hybrid)

Expected definitions:

2× RMSNorm
N× GEMM
5× GQA
1× Mamba SSU per TP config
3× Sampling

Note: Check FlashInfer Mamba SSU constraints before adding: nheads/ngroups ∈ [1, 8, 16]. Models that violate this (e.g., FalconH1 ngroups=1 nheads=128) cannot use FlashInfer SSU and should be marked ❌ for Mamba kernel with a note.

Integration with Other Skills

# Discover new models and see what's missing
/track-models --discover

# Create missing kernel definitions for a new model
/extract-kernel-definitions --model-name mistral_7b

# Add reference tests for new definitions
/add-reference-tests --op-type gqa_paged

# Full workflow for a brand new model
/clone-repos
/track-models --model-name gemma-3-27b --hf-repo-id google/gemma-3-27b-it
/extract-kernel-definitions --model-name gemma3
/add-reference-tests --op-type gqa_paged

# Or use the end-to-end pipeline (discovery + definition + workload + PR submission)
/onboard-model --discover
/onboard-model --model-name gemma-3-27b --hf-repo-id google/gemma-3-27b-it

This skill is also called by onboard-model (Phase 1 and Phase 4b) to update docs/model_coverage.mdx as part of the automated pipeline.

Error Handling

HuggingFace config not accessible

Cause: Private model or network error
Handling: Note the model as ❌ Not covered with a "config unavailable" note in the detailed section. Proceed with other models.

Architecture not recognized

Cause: Novel architecture not in the known patterns
Handling: Add model to summary as ❌ Not covered, add a detailed section with — (architecture not yet mapped) for all kernel rows and a note explaining the unknown pattern.

Definition file naming ambiguity

Cause: Computed name doesn't match any definition (off-by-one in dims, different naming convention)
Handling: Mark as ❌ and list the computed name in the detailed section. Cross-check with actual files in tmp/flashinfer-trace/definitions/ before marking as missing.

Model not in SGLang

Cause: Model isn't implemented in SGLang yet
Handling: Still add to coverage doc using HuggingFace config.json. Note "SGLang implementation pending" in the architecture description.

Maintaining This Document

Update this file when:

New op_types are added to FlashInfer-Bench (add to Phase 3 mapping and Common Architectures)
Definition naming conventions change
New major model families emerge (add to Common Architectures)

name: track-models description: Track popular/new open-source LLMs and update docs/model_coverage.mdx with their kernel support status. Use when discovering new models to add to the coverage tracker, checking if a specific model is covered, or refreshing model coverage documentation.

Track Models

Usage

Parameters

Prerequisites

What This Skill Does

Phase 1: Model Discovery

Discovery Sources (in order)

Filtering Already-Tracked Models

Architecture Priority

Phase 2: Architecture Extraction

Step 1: Get config.json from HuggingFace

Step 2: Determine attention type from architecture

Step 3: Find sgl-cookbook TP/EP config

Phase 3: Map Kernels to Definitions

3a: Compute expected definitions

3b: Check definition existence

3c: Compute coverage summary

Phase 4: Update docs/model_coverage.mdx

4a: Add row to Summary table

4b: Add detailed section

4c: Preserve existing sections

Documentation Standards

Model Display Names

Architecture Descriptions

TP/EP Notes

Unmapped Modules

Output

Common Model Architectures Reference

GQA Standard (Llama / Mistral / Gemma / Qwen2.5)

GQA + MoE (Mixtral / Qwen2-MoE / Qwen3-MoE)

MLA + Dense/MoE (DeepSeek V2/V3/R1)

GDN + GQA + MoE (Qwen3-Next)

GQA + Mamba2 (NemotronH / GraniteMoe-Hybrid)

Integration with Other Skills

Error Handling

HuggingFace config not accessible

Architecture not recognized

Definition file naming ambiguity

Model not in SGLang

Maintaining This Document

See Also

Phase 4: Update `docs/model_coverage.mdx`