hf-architecture-tikz - SKILL.md Agent Skill

name: hf-architecture-tikz
description: "Draw Sebastian-Raschka-gallery-style TikZ architecture diagrams for any HuggingFace decoder-only LLM, with per-block parameter formulas and concrete numbers. Supports MHA, GQA, MLA, DeepSeek-V4-Flash (Hyper-Connections + Sparse Attention with learned indexer), dense and MoE FFNs (incl. hash routing), and MTP heads. Use when the user asks to visualize / diagram / illustrate a transformer or LLM architecture (DeepSeek, Qwen, Llama, Mistral, gpt-oss, etc.), wants a Raschka-style figure, or wants a TikZ/LaTeX rendering of an HF model."

HF Architecture → TikZ

Generate a publication-quality vertical architecture diagram (in the style of Sebastian Raschka's LLM Architecture Gallery) for any HuggingFace decoder-only LLM. The diagram annotates every sub-block with its parameter-count formula and the concrete number for the loaded config.

When to use

"Draw the architecture of <HF repo>."
"Visualize how <model> is structured" / "make a diagram of <model> like Raschka's gallery."
"I want a TikZ figure of <model> for a paper / blog post."
The user mentions DeepSeek-V4-Flash, mHC / Hyper-Connections, MLA, MoE, sparse attention, MTP, and asks for a figure.

If the user only wants memory or parallelism numbers, use megatron-memory-estimator instead.

Quick start

cd hf-architecture-tikz/

# 1. Pull config from HF + emit normalized arch.json
uv run python scripts/extract_arch.py deepseek-ai/DeepSeek-V4-Flash \
    --output examples/deepseek-v4-flash/arch.json

# 2. Render TikZ from arch.json
uv run python scripts/render_tikz.py \
    examples/deepseek-v4-flash/arch.json \
    --output examples/deepseek-v4-flash/deepseek-v4-flash.tex

# 3. Compile to PNG
bash scripts/compile.sh examples/deepseek-v4-flash/deepseek-v4-flash.tex

For a model that ships custom code (e.g. a brand-new architecture), pass --trust-remote-code. For a local config:

uv run python scripts/extract_arch.py /path/to/config.json --output arch.json

Workflow

Acquire config. extract_arch.py tries transformers.AutoConfig first; if the installed transformers doesn't recognize the model_type (e.g. deepseek_v4, which introduces hc_mult and compress_ratios), it falls back to raw JSON via huggingface_hub.hf_hub_download. Local file paths bypass the network.
Detect architecture family. Pure config-field rules — see references/architecture_families.md. The script labels the model with a family tag (mha, gqa, mla, dsv4) plus orthogonal flags (MoE, hash routing, shared experts, MTP, tied LM head, first_k_dense_replace).
Compute parameter counts. Closed-form formulas keyed by family — see references/param_formulas.md. The script (not Claude) does the arithmetic and emits arch.json with one entry per architectural unit, each carrying name, family, shape_in, shape_out, formula_symbolic, formula_concrete, param_count.
Assemble TikZ. render_tikz.py reads arch.json plus templates/anthropic.tex.j2 (Jinja2 template — all block macros are inlined for shared coordinate-space layout). The repeated transformer block is drawn once with a × N layers annotation; per-layer-varying behavior (V4-Flash compress_ratios, hash vs score routing) appears as a small pattern strip beneath the block.
Compile. bash scripts/compile.sh out.tex runs xelatex twice (TikZ fit/positioning needs a second pass), then pdftocairo -png -r 300 -singlefile. It falls back to pdflatex if XeTeX is unavailable.
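As a rough illustration of the per-unit entries described in the "Compute parameter counts" step, the sketch below builds one record and serializes it. The field names come from that step; the `ArchUnit` class name, the top-level `units` key, and the example values are assumptions made for illustration and may not match the schema that extract_arch.py actually emits.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ArchUnit:
    """One architectural unit (hypothetical class; fields mirror the step above)."""
    name: str
    family: str
    shape_in: list
    shape_out: list
    formula_symbolic: str
    formula_concrete: str
    param_count: int


# Illustrative hidden size and FFN size, not read from a real arch.json.
d, f = 4096, 14336

unit = ArchUnit(
    name="ffn_swiglu",
    family="gqa",
    shape_in=[d],
    shape_out=[d],
    formula_symbolic="3*d*f",
    formula_concrete=f"3*{d}*{f}",
    param_count=3 * d * f,
)

# The "units" wrapper key is an assumption; only the per-entry fields are documented.
arch_json = json.dumps({"units": [asdict(unit)]}, indent=2)
print(arch_json)
```

Keeping both the symbolic and the concrete formula in each entry lets the renderer print them side by side without redoing any arithmetic.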

Architecture family detection

Detection rules live in references/architecture_families.md. Summary:

| Family | Detector | Examples |
| --- | --- | --- |
| `dsv4` | `model_type == "deepseek_v4"` or presence of `hc_mult` + `compress_ratios` + `index_n_heads` | DeepSeek-V4-Flash |
| `mla` | `q_lora_rank` + `kv_lora_rank` + `qk_nope_head_dim` + `qk_rope_head_dim` + `v_head_dim` | DeepSeek-V2/V3 |
| `gqa` | `num_key_value_heads < num_attention_heads` | Llama-3, Qwen3, Mistral |
| `mha` | otherwise | GPT-2, OPT |

Orthogonal flags: MoE (n_routed_experts/num_local_experts), hash routing (num_hash_layers > 0), shared experts (n_shared_experts > 0), MTP head (num_nextn_predict_layers > 0), tied LM head (tie_word_embeddings), dense-prefix layers (first_k_dense_replace > 0).
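The rules above can be sketched as a pair of pure functions over a config dict. This is a minimal sketch, not the code in extract_arch.py: the function names `detect_family` and `detect_flags`, the flag key names, and the top-to-bottom rule order are assumptions; only the config field names come from the table and the flag list.

```python
def detect_family(cfg: dict) -> str:
    """Return a family tag by checking the table's rules from top to bottom (assumed order)."""
    dsv4_fields = ("hc_mult", "compress_ratios", "index_n_heads")
    if cfg.get("model_type") == "deepseek_v4" or all(k in cfg for k in dsv4_fields):
        return "dsv4"

    mla_fields = ("q_lora_rank", "kv_lora_rank", "qk_nope_head_dim",
                  "qk_rope_head_dim", "v_head_dim")
    if all(k in cfg for k in mla_fields):
        return "mla"

    n_heads = cfg.get("num_attention_heads")
    # Assumption: a missing num_key_value_heads means one KV head per query head.
    n_kv = cfg.get("num_key_value_heads", n_heads)
    if n_heads is not None and n_kv is not None and n_kv < n_heads:
        return "gqa"
    return "mha"


def detect_flags(cfg: dict) -> dict:
    """Orthogonal flags; `or 0` treats explicit nulls in the config as absent."""
    return {
        "moe": bool(cfg.get("n_routed_experts") or cfg.get("num_local_experts")),
        "hash_routing": (cfg.get("num_hash_layers") or 0) > 0,
        "shared_experts": (cfg.get("n_shared_experts") or 0) > 0,
        "mtp": (cfg.get("num_nextn_predict_layers") or 0) > 0,
        "tied_lm_head": bool(cfg.get("tie_word_embeddings", False)),
        "dense_prefix": (cfg.get("first_k_dense_replace") or 0) > 0,
    }


print(detect_family({"num_attention_heads": 32, "num_key_value_heads": 8}))
print(detect_flags({"n_routed_experts": 64, "n_shared_experts": 1}))
```

Checking `dsv4` before `mla` matters if a config carries both sets of fields, which is why the order is called out as an assumption.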

Parameter formulas

Full table in references/param_formulas.md. Per-family attention summary, with d = hidden size, Hkv = number of KV heads, dh = head dimension: MHA 4·d²; GQA 2·d² + 2·d·Hkv·dh; MLA six projections; DSv4 wq_a + q_norm + wq_b + wkv + kv_norm + wo_a + wo_b + attn_sink (+ Compressor + Indexer). FFN summary, with f = FFN intermediate size, E = number of routed experts, Es = number of shared experts: SwiGLU 3·d·f; standard MoE = E routed experts (each 3·d·f) + router d·E + Es shared experts. Hash MoE replaces the router with a vocab×topk token→expert table.
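To make the closed-form counts concrete, here is a small sketch of the MHA, GQA, SwiGLU, and routed-MoE formulas from the summary above. The function names are hypothetical, the example dimensions are illustrative rather than read from a specific config, and the shared-expert term is left out because its per-expert size is not stated here.

```python
def mha_attn_params(d: int) -> int:
    """MHA: Q, K, V, and O projections, each d x d."""
    return 4 * d * d


def gqa_attn_params(d: int, h_kv: int, d_h: int) -> int:
    """GQA: Q and O are d x d; K and V are d x (h_kv * d_h)."""
    return 2 * d * d + 2 * d * h_kv * d_h


def swiglu_params(d: int, f: int) -> int:
    """SwiGLU FFN: gate, up, and down projections."""
    return 3 * d * f


def moe_routed_params(d: int, f: int, n_experts: int) -> int:
    """Routed experts plus the d x E router; shared experts are not included."""
    return n_experts * swiglu_params(d, f) + d * n_experts


# Illustrative dimensions (assumed for the example).
d, h_kv, d_h, f = 4096, 8, 128, 14336

print(f"GQA attention: {gqa_attn_params(d, h_kv, d_h):,}")  # 41,943,040
print(f"SwiGLU FFN:    {swiglu_params(d, f):,}")            # 176,160,768
```

A useful sanity check is that the GQA formula reduces to the MHA one when Hkv·dh equals d, that is, when every query head has its own KV head.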

Worked example: DeepSeek-V4-Flash

The example under examples/deepseek-v4-flash/ covers the most architecturally novel components in the supported set:

Hyper-Connections (mHC): four parallel hidden-state copies, with Sinkhorn-balanced reduction (hc_sinkhorn_iters=20) before each sublayer and weighted expansion + cross-copy mixing after. Drawn as a fan-in / fan-out inside each block.
Sparse Attention: Q-LoRA (d → q_lora_rank → H·dh), KV projection (d → dh, Hkv=1), per-layer Compressor (overlap pooling for compress_ratio=4, block pooling for compress_ratio=128), learned Indexer for compress_ratio=4 layers (top-k selection over compressed KV, with index_topk=512), sliding window of 128, grouped O-LoRA (o_groups=8, o_lora_rank=1024).
MoE with hash routing: first 3 layers use a learned tid2eid table (vocab × topk); remaining 40 layers use sqrtsoftplus scoring + top-6 routing.
MTP head: one MTPBlock (= e_proj + h_proj + their RMSNorms + a full Block) for multi-token prediction.
Compress-ratios pattern strip: drawn beneath the block to make the per-layer alternation [0, 0, 4, 128, 4, 128, …, 4, 0] visible.
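The hash-versus-score split in the MoE item above can be sketched as a per-layer schedule, which is the kind of data a pattern strip would visualize. The function names are hypothetical, the layer counts (3 hash + 40 score) and top-6 come from the item above, and the vocabulary size is a placeholder chosen only to show the vocab × topk table-size arithmetic.

```python
def routing_schedule(num_layers: int, num_hash_layers: int) -> list[str]:
    """Label each layer 'hash' (tid2eid table) or 'score' (scored top-k routing)."""
    return ["hash" if i < num_hash_layers else "score" for i in range(num_layers)]


def hash_table_entries(vocab_size: int, topk: int) -> int:
    """Entries in one token -> expert table of shape vocab x topk."""
    return vocab_size * topk


# 3 hash-routed layers followed by 40 score-routed layers.
schedule = routing_schedule(num_layers=43, num_hash_layers=3)
print(schedule[:5])

# vocab_size below is a hypothetical placeholder, not the model's real vocabulary size.
print(hash_table_entries(vocab_size=100_000, topk=6))
```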

Customization

Palette. Reuses the warm-pastel palette from tikz-flowchart/themes/anthropic.md (lavender = attention, mint = norm, teal = projection, cream = router/MoE infra, amber = experts, peach = embedding/output).
Detail level. The default is full expansion (every sub-block separately). To collapse sub-blocks, edit the dsv4 branch of templates/anthropic.tex.j2 and replace the inner attention expansion with a single rounded card.
Other models. The non-dsv4 branch of templates/anthropic.tex.j2 covers mha / gqa / mla (with optional MoE FFN) as a simpler vertical stack. The renderer dispatches based on the family flag emitted by extract_arch.py.

Troubleshooting

AutoConfig raises on unknown fields. Expected for very new model types. The loader catches and falls back to raw JSON automatically. If both fail, pass a local config.json path.
mbridge is unavailable / unsupported model. Not required — we use transformers + raw JSON. mbridge is referenced only for cross-checking V3/Qwen counts.
trust_remote_code warnings. extract_arch.py does not enable this flag silently. Pass --trust-remote-code only if the user explicitly requests it.
Tied embeddings double-counting. When tie_word_embeddings=True, the embedding-table contribution is folded into the LM head and not counted twice.
Tall PNG. Full expansion + side annotations + MTP branch typically renders to 4–6k pixels tall. Use --no-mtp (renderer flag) to suppress the MTP branch if you need a shorter figure.
xelatex not installed. The compile script falls back to pdflatex automatically. Font macros are guarded with \IfFontExistsTF.
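For the tied-embeddings item above, a minimal sketch of the counting rule looks like this. It assumes a bias-free LM head and attributes the shared matrix to the LM head entry when weights are tied; the function name and the returned keys are hypothetical.

```python
def embedding_and_head_params(vocab_size: int, d: int, tie_word_embeddings: bool) -> dict:
    """Count the vocab x d matrix once when tied, twice when untied."""
    matrix = vocab_size * d
    if tie_word_embeddings:
        # Shared weights: report them once, under the LM head (assumed attribution).
        return {"embedding": 0, "lm_head": matrix, "total": matrix}
    return {"embedding": matrix, "lm_head": matrix, "total": 2 * matrix}


# Small illustrative sizes so the totals are easy to check by hand.
print(embedding_and_head_params(vocab_size=1000, d=64, tie_word_embeddings=True))
print(embedding_and_head_params(vocab_size=1000, d=64, tie_word_embeddings=False))
```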

Dependencies

Python: transformers, huggingface_hub, jinja2. Run via uv run. System: xelatex (preferred) or pdflatex; pdftocairo (from poppler).