adding-a-model - SKILL.md Agent Skill

name: adding-a-model description: Use when adding a new model (or model family) to the llm-calc database (calc/src/data/models.ts). Walks through sourcing fields from HuggingFace config.json, reading papers for novel attention variants, writing tests first, and registering the entry. Invoke this whenever a model is being added or updated — even for a single SKU, since the failure modes (wrong KV bytes, mis-sized attention, silently broken roofline) all look fine until someone runs the calc against a real workload.

Adding a Model

Source priority

HuggingFace config.json — primary truth for architecture. Fetch from https://huggingface.co/<org>/<repo>/raw/main/config.json. This is the file the model actually loads from; mismatches break inference, so vendors keep it accurate.
HF model card README — for active parameter count on MoE, family/release notes, anything not in config.
Paper / technical report — required for any novel attention variant or sparsity scheme (MLA, DSA, CSA, linear attention, delta-net). config.json names the fields but doesn't tell you what the geometry means.
Vendor blog / launch post — supplementary; verify any numeric claim against config.json before trusting.

Aggregators (TechPowerUp, marketing comparison sites, LLM-summarized model cards) are not acceptable primary sources. Use them only to cross-check after vendor sources agree. See docs/data-philosophy.md for the reasoning.

ModelArch field mapping

Schema in calc/src/engine/types.ts (ModelArch).

config.json key → ModelArch field:

HF key	Field	Notes
`num_hidden_layers`	`layers`
`hidden_size`	`hiddenDim`
`intermediate_size`	`intermediateDim`	For MoE: per-expert FFN inner dim (small, e.g. 768), not dense-FFN width
`num_attention_heads`	`numHeads`
`num_key_value_heads`	`numKvHeads`	Defaults to `numHeads` if absent (no GQA)
`head_dim`	`headDim`	If absent, compute `hidden_size / num_attention_heads`
`vocab_size`	`vocabSize`
`max_position_embeddings`	`maxContext`	Trained ceiling; calc extrapolates past it with a soft warning
`num_nextn_predict_layers`	`numNextnLayers`	Multi-Token Prediction depth; 0 for non-MTP models

paramCount — take from the model card, not derived. Vendors quote total parameters including embeddings and norms; recompute is unreliable.

publisher and releaseDate are catalog metadata, not from config.json:

publisher — the releasing organization (Alibaba, Meta, Google, DeepSeek, Moonshot AI, Zhipu AI, Microsoft, Mistral AI, …). This is the primary grouping key in the model picker, so reuse the exact string an existing same-publisher entry uses — don't introduce "Alibaba" and "Alibaba (Qwen)" as two groups.
releaseDate — public release/announcement month as ISO YYYY-MM. Day granularity is noisy (announce vs. weights vs. paper); month is the contract. Source from the official launch post or paper date. For a forward-looking/speculative entry, use a projected month consistent with its version ordering and say so in the commit.
nativeDtype — the dtype the released weights ship in (the model-aware quant default). bf16 for most; fp8 for natively-fp8 releases (DeepSeek V3/R1/V3.2/V4). Source from the model card / release notes; default bf16 when unstated. Not a config.json field.

Architecture (dense vs MoE)

Detect MoE by presence of num_experts / num_local_experts in config:

architecture: {
  type: 'moe',
  numExperts: <num_experts>,
  numExpertsActive: <num_experts_per_tok>,
  numSharedExperts: <num_shared_experts ?? 0>,
  activeParamCount: <from model card>  // routed-active + shared aggregate
}

Otherwise: architecture: { type: 'dense' }.

Attention variant — when to use which

Walk through these in order. First match wins.

full — default; MHA or GQA. config.json has no special attention keys beyond the standard ones.
sliding — every layer uses a sliding window (Mistral 7B, Mixtral). Config has sliding_window and applies it globally.
hybrid — interleaved sliding + global (Gemma 3, Phi-4 with sliding pattern). Config has sliding_window_pattern or the paper specifies a per-layer pattern.
mla — DeepSeek V2/V3 base. Config has kv_lora_rank, q_lora_rank, qk_rope_head_dim, qk_nope_head_dim, v_head_dim. Read the V2 paper if you haven't.
mla-dsa — DeepSeek V3.2. MLA fields + sparse-attention indexer. Adds topK.
linear-mla-hybrid — Kimi-Linear (linear attention + MLA). Per-layer counts must sum to model.layers.
csa-hca-hybrid — DeepSeek V4-Flash/Pro. Sliding + CSA (Compressed Sparse Attention) + HCA (Heavily Compressed Attention).
delta-hybrid — Qwen3.5 (Gated DeltaNet + Gated Attention).
mamba2-hybrid — Nemotron-H / Nemotron 3 (Mamba2 + attention + FFN-only blocks). Config is model_type: nemotron_h with hybrid_override_pattern (* = attention, M = Mamba2, E/- = FFN block) or layers_block_type. Note these are SEPARATE blocks in num_hidden_layers, not attention+FFN pairs; SSM state is cached fp32 (mamba_ssm_cache_dtype), independent of KV quant.
partial — NAS-pruned models (DeciLM / Llama-Nemotron Super). Config has block_configs with some attention.no_op: true; numFullLayers = blocks that kept attention. Requires uniform geometry on the surviving attention blocks (check n_heads_in_group is constant); variable FFN widths are absorbed by paramCount.

If the model fits none of these, stop. Don't shoehorn into the closest variant — add a new variant to AttentionConfig in types.ts with engine integration in memory.ts, prefill.ts, decode.ts. Brainstorm with the user first; new attention is a meaningful design change, not a data update.

Process

Pull config.json. Pull model card. Note paramCount and activeParamCount (if MoE).
Identify attention variant. Read the paper if it's not vanilla.
Find the most similar existing entry in calc/src/data/models.ts (same family, same attention variant). Copy it as a template — saves more time than typing from scratch and inherits proven shape.
Write a test first in calc/test/engine/calc.test.ts. Pick one or two assertions that catch field mis-entry:
- paramCount within a few % of expected (compute layer params manually — embeddings + per-layer attention+FFN — and check)
- kvCachePerRequest at a known prompt length (verifies attention variant fields)
- For MoE: activeParamCount aligns with model card
Run the test → red.
Add the entry to the MODELS array. Run tests → green.
npm run check (svelte-check / TS).
npm run dev, click the model in the UI, verify the perf panel renders without NaN and the memory bar looks sane against its siblings.
Commit.

Reading novel attention papers

WebFetch struggles with PDFs. Two options:

Ask the user to paste the relevant section inline.
Dispatch a research subagent with the paper URL and a focused question: "extract per-layer attention dims, KV state size, compression ratio M, indexer head count" — list exactly the fields you need.

Don't guess. A misread MLA paper produces silently-wrong KV cache numbers for the rest of the model's life in the database.

Anti-patterns

Trusting aggregator sites or LLM-generated summaries for architecture numbers.
Deriving paramCount from a formula instead of taking it from the model card.
Setting numNextnLayers: 0 reflexively; DeepSeek V3 onwards has it.
Adding the entry before the test — the test is what catches typos in numKvHeads or intermediateDim.
Picking the "closest" attention variant when the model is actually novel. Either it matches exactly or it gets a new variant.
Skipping the UI check; a NaN in the perf panel is the fastest signal something is wrong.