name: hf-architecture-tikz
description: "Draw Sebastian-Raschka-gallery-style TikZ architecture diagrams for any HuggingFace decoder-only LLM, with per-block parameter formulas and concrete numbers. Supports MHA, GQA, MLA, DeepSeek-V4-Flash (Hyper-Connections + Sparse Attention with learned indexer), dense and MoE FFNs (incl. hash routing), and MTP heads. Use when the user asks to visualize / diagram / illustrate a transformer or LLM architecture (DeepSeek, Qwen, Llama, Mistral, gpt-oss, etc.), wants a Raschka-style figure, or wants a TikZ/LaTeX rendering of an HF model."
HF Architecture → TikZ
Generate a publication-quality vertical architecture diagram (in the style of Sebastian Raschka's LLM Architecture Gallery) for any HuggingFace decoder-only LLM. The diagram annotates every sub-block with its parameter-count formula and the concrete number for the loaded config.
When to use
- "Draw the architecture of <HF repo>."
- "Visualize how <model> is structured" / "make a diagram of <model> like Raschka's gallery."
- "I want a TikZ figure of <model> for a paper / blog post."
- The user mentions DeepSeek-V4-Flash, mHC / Hyper-Connections, MLA, MoE, sparse attention, or MTP, and asks for a figure.
If the user just wants memory / parallelism numbers, prefer megatron-memory-estimator instead.
Quick start
cd hf-architecture-tikz/
# 1. Pull config from HF + emit normalized arch.json
uv run python scripts/extract_arch.py deepseek-ai/DeepSeek-V4-Flash \
--output examples/deepseek-v4-flash/arch.json
# 2. Render TikZ from arch.json
uv run python scripts/render_tikz.py \
examples/deepseek-v4-flash/arch.json \
--output examples/deepseek-v4-flash/deepseek-v4-flash.tex
# 3. Compile to PNG
bash scripts/compile.sh examples/deepseek-v4-flash/deepseek-v4-flash.tex
For a model with custom code (e.g. brand-new architectures), pass --trust-remote-code. For a local config:
uv run python scripts/extract_arch.py /path/to/config.json --output arch.json
Workflow
- Acquire config. extract_arch.py tries transformers.AutoConfig first; if the installed transformers doesn't recognize the model_type (e.g. deepseek_v4 introduces hc_mult, compress_ratios), it falls back to raw JSON via huggingface_hub.hf_hub_download. Local file paths bypass the network.
- Detect architecture family. Pure config-field rules; see references/architecture_families.md. The script labels the model with a family tag (mha, gqa, mla, dsv4) plus orthogonal flags (MoE, hash routing, shared experts, MTP, tied LM head, first_k_dense_replace).
- Compute parameter counts. Closed-form formulas keyed by family; see references/param_formulas.md. The script (not Claude) does the arithmetic and emits arch.json with one entry per architectural unit, each carrying name, family, shape_in, shape_out, formula_symbolic, formula_concrete, param_count.
- Assemble TikZ. render_tikz.py reads arch.json plus templates/anthropic.tex.j2 (a Jinja2 template; all block macros are inlined for shared coordinate-space layout). The repeated transformer block is drawn once with a × N layers annotation; per-layer-varying behavior (V4-Flash compress_ratios, hash vs score routing) appears as a small pattern strip beneath the block.
- Compile. bash scripts/compile.sh out.tex runs xelatex twice (TikZ fit/positioning needs a second pass), then pdftocairo -png -r 300 -singlefile. Falls back to pdflatex if XeTeX is unavailable.
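To make the arch.json entry shape concrete, here is a minimal sketch of one entry using the field names listed in the workflow. The class name ArchUnit, the list-of-ints representation for shape_in / shape_out, the unit name ffn_swiglu, and the sizes d=4096, f=14336 are illustrative assumptions, not taken from extract_arch.py.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ArchUnit:
    """One architectural unit; field names follow the workflow description."""
    name: str
    family: str
    shape_in: list[int]      # assumed representation: list of dimension sizes
    shape_out: list[int]     # assumed representation: list of dimension sizes
    formula_symbolic: str
    formula_concrete: str
    param_count: int


# Hypothetical SwiGLU FFN entry with illustrative sizes.
d, f = 4096, 14336
unit = ArchUnit(
    name="ffn_swiglu",               # hypothetical unit name
    family="gqa",
    shape_in=[d],
    shape_out=[d],
    formula_symbolic="3·d·f",
    formula_concrete=f"3·{d}·{f}",
    param_count=3 * d * f,
)

entry = asdict(unit)
# ensure_ascii=False keeps the "·" in the formulas readable in the JSON output.
print(json.dumps(entry, indent=2, ensure_ascii=False))
```

Keeping both a symbolic and a concrete formula string next to the integer count lets the renderer print the annotation without redoing any arithmetic.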
Architecture family detection
Detection rules live in references/architecture_families.md. Summary:
| Family | Detector | Examples |
|---|---|---|
| dsv4 | model_type == "deepseek_v4" or presence of hc_mult + compress_ratios + index_n_heads | DeepSeek-V4-Flash |
| mla | q_lora_rank + kv_lora_rank + qk_nope_head_dim + qk_rope_head_dim + v_head_dim | DeepSeek-V2/V3 |
| gqa | num_key_value_heads < num_attention_heads | Llama-3, Qwen3, Mistral |
| mha | otherwise | GPT-2, OPT |
Orthogonal flags: MoE (n_routed_experts/num_local_experts), hash routing (num_hash_layers > 0), shared experts (n_shared_experts > 0), MTP head (num_nextn_predict_layers > 0), tied LM head (tie_word_embeddings), dense-prefix layers (first_k_dense_replace > 0).
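The rules above can be sketched as a pair of pure functions over the raw config dict. This is a simplified illustration, not the code in extract_arch.py: the function names detect_family and detect_flags are hypothetical, the rules are assumed to be checked top to bottom in table order, and "presence" is taken to mean the key exists in the config.

```python
def detect_family(cfg: dict) -> str:
    """Return 'dsv4', 'mla', 'gqa', or 'mha' from raw config fields."""
    dsv4_keys = ("hc_mult", "compress_ratios", "index_n_heads")
    if cfg.get("model_type") == "deepseek_v4" or all(k in cfg for k in dsv4_keys):
        return "dsv4"

    mla_keys = ("q_lora_rank", "kv_lora_rank", "qk_nope_head_dim",
                "qk_rope_head_dim", "v_head_dim")
    if all(k in cfg for k in mla_keys):
        return "mla"

    n_heads = cfg.get("num_attention_heads")
    # A missing num_key_value_heads is treated as "same as query heads" (assumption).
    n_kv = cfg.get("num_key_value_heads") or n_heads
    if n_heads is not None and n_kv < n_heads:
        return "gqa"
    return "mha"


def detect_flags(cfg: dict) -> dict:
    """Orthogonal flags; absent or null fields count as 'off'."""
    return {
        "moe": bool(cfg.get("n_routed_experts") or cfg.get("num_local_experts")),
        "hash_routing": (cfg.get("num_hash_layers") or 0) > 0,
        "shared_experts": (cfg.get("n_shared_experts") or 0) > 0,
        "mtp": (cfg.get("num_nextn_predict_layers") or 0) > 0,
        "tied_lm_head": bool(cfg.get("tie_word_embeddings")),
        "dense_prefix": (cfg.get("first_k_dense_replace") or 0) > 0,
    }


# Illustrative config fragment, not a real model's config.json.
demo = {"num_attention_heads": 32, "num_key_value_heads": 8, "num_local_experts": 8}
print(detect_family(demo), detect_flags(demo))
```

Checking dsv4 before mla matters in this sketch because a config could satisfy more than one detector; the first match wins.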
Parameter formulas
Full table in references/param_formulas.md. Attention parameters per family: MHA 4·d²; GQA 2·d² + 2·d·Hkv·dh; MLA six projections; DSv4 wq_a + q_norm + wq_b + wkv + kv_norm + wo_a + wo_b + attn_sink (+ Compressor + Indexer). FFN: SwiGLU 3·d·f. Standard MoE = E routed experts (each 3·d·f) + router d·E + Es shared experts. Hash MoE replaces the router with a vocab×topk token→expert table.
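The closed-form summaries above translate directly into arithmetic. The sketch below covers only the formulas stated in this section (MHA, GQA, SwiGLU, standard MoE, hash MoE); the function names are hypothetical, biases and norms are ignored, shared experts are assumed to have the same 3·d·f size as routed experts, and the hash table is assumed to be counted as vocab·topk entries. The sizes in the demo are illustrative.

```python
def mha_attn_params(d: int) -> int:
    """MHA: four d×d projections (Q, K, V, O)."""
    return 4 * d * d


def gqa_attn_params(d: int, h_kv: int, d_h: int) -> int:
    """GQA: Q and O are d×d; K and V are d×(Hkv·dh)."""
    return 2 * d * d + 2 * d * h_kv * d_h


def swiglu_params(d: int, f: int) -> int:
    """SwiGLU FFN: gate, up, and down projections."""
    return 3 * d * f


def moe_params(d: int, f: int, n_routed: int, n_shared: int = 0) -> int:
    """Standard MoE: routed experts + d·E router + shared experts (assumed 3·d·f each)."""
    return n_routed * swiglu_params(d, f) + d * n_routed + n_shared * swiglu_params(d, f)


def hash_moe_params(d: int, f: int, n_routed: int, vocab: int, topk: int,
                    n_shared: int = 0) -> int:
    """Hash MoE: the d·E router is replaced by a vocab×topk table (assumed counted)."""
    return n_routed * swiglu_params(d, f) + vocab * topk + n_shared * swiglu_params(d, f)


# Illustrative sizes: d=4096, Hkv=8, dh=128, f=14336.
d, h_kv, d_h, f = 4096, 8, 128, 14336
print("MHA attention:", mha_attn_params(d))
print("GQA attention:", gqa_attn_params(d, h_kv, d_h))
print("SwiGLU FFN:   ", swiglu_params(d, f))
```

A quick sanity check on the GQA formula: when Hkv·dh equals d, it reduces to the MHA count of 4·d².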
Worked example: DeepSeek-V4-Flash
The example under examples/deepseek-v4-flash/ covers the most architecturally novel components in the supported set:
- Hyper-Connections (mHC): four parallel hidden-state copies, with Sinkhorn-balanced reduction (hc_sinkhorn_iters=20) before each sublayer and weighted expansion + cross-copy mixing after. Drawn as a fan-in / fan-out inside each block.
- Sparse Attention: Q-LoRA (d → q_lora_rank → H·dh), KV projection (d → dh, Hkv=1), per-layer Compressor (overlap pooling for compress_ratio=4, block pooling for compress_ratio=128), learned Indexer for compress_ratio=4 layers (top-k selection over compressed KV, index_topk=512), sliding window of 128, grouped O-LoRA (o_groups=8, o_lora_rank=1024).
- MoE with hash routing: first 3 layers use a learned tid2eid table (vocab × topk); remaining 40 layers use sqrtsoftplus scoring + top-6 routing.
- MTP head: one MTPBlock (= e_proj + h_proj + their RMSNorms + a full Block) for next-token prediction.
- Compress-ratios pattern strip: drawn beneath the block to make the per-layer alternation [0, 0, 4, 128, 4, 128, …, 4, 0] visible.
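The per-layer variation described in the bullets above can be sketched as two small lookups. This is an illustration only: the function names are hypothetical, the demo_ratios list is a shortened stand-in rather than the real compress_ratios value, and reading compress_ratio=0 as "no compressor" is an assumption the text does not state.

```python
NUM_HASH_LAYERS = 3    # from the text: first 3 layers use the tid2eid table
NUM_SCORE_LAYERS = 40  # from the text: remaining 40 layers use scored top-6 routing


def routing_pattern(n_hash: int, n_score: int) -> list[str]:
    """Per-layer routing kind: hash-routed layers first, then score-routed layers."""
    return ["hash"] * n_hash + ["score"] * n_score


def attention_variant(ratio: int) -> str:
    """Map one compress_ratios entry to the attention path described in the text."""
    if ratio == 4:
        return "overlap pooling + indexer"
    if ratio == 128:
        return "block pooling"
    if ratio == 0:
        return "no compressor"  # assumption: not stated in the text
    raise ValueError(f"unexpected compress_ratio: {ratio}")


routing = routing_pattern(NUM_HASH_LAYERS, NUM_SCORE_LAYERS)

# Shortened, hypothetical ratio list for demonstration only.
demo_ratios = [0, 0, 4, 128, 4, 0]
for layer, ratio in enumerate(demo_ratios):
    print(f"layer {layer}: ratio={ratio:>3} attn={attention_variant(ratio)} moe={routing[layer]}")
```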
Customization
- Palette. Reuses the warm-pastel palette from tikz-flowchart/themes/anthropic.md (lavender = attention, mint = norm, teal = projection, cream = router/MoE infra, amber = experts, peach = embedding/output).
- Detail level. The default is full expansion (every sub-block drawn separately). To collapse sub-blocks, edit the dsv4 branch of templates/anthropic.tex.j2 and replace the inner attention expansion with a single rounded card.
- Other models. The non-dsv4 branch of templates/anthropic.tex.j2 covers mha / gqa / mla (with optional MoE FFN) as a simpler vertical stack. The renderer dispatches based on the family flag emitted by extract_arch.py.
Troubleshooting
- AutoConfig raises on unknown fields. Expected for very new model types. The loader catches the error and falls back to raw JSON automatically. If both fail, pass a local config.json path.
- mbridge is unavailable / unsupported model. Not required; we use transformers + raw JSON. mbridge is referenced only for cross-checking V3/Qwen counts.
- trust_remote_code warnings. extract_arch.py does not enable this flag silently. Pass --trust-remote-code only if the user explicitly requests it.
- Tied embeddings double-counting. When tie_word_embeddings=True, the embedding-table contribution is folded into the LM head and not counted twice.
- Tall PNG. Full expansion + side annotations + MTP branch typically renders to 4–6k pixels tall. Use --no-mtp (renderer flag) to suppress the MTP branch if you need a shorter figure.
- xelatex not installed. The compile script falls back to pdflatex automatically. Font macros are guarded with \IfFontExistsTF.
Dependencies
Python: transformers, huggingface_hub, jinja2. Run via uv run.
System: xelatex (preferred) or pdflatex; pdftocairo (from poppler).