hf-architecture-tikz

Stars: 134

Draw Sebastian-Raschka-gallery-style TikZ architecture diagrams for any HuggingFace decoder-only LLM, with per-block parameter formulas and concrete numbers. Supports MHA, GQA, MLA, DeepSeek-V4-Flash (Hyper-Connections + Sparse Attention with learned indexer), dense and MoE FFNs (incl. hash routing), and MTP heads. Use when the user asks to visualize / diagram / illustrate a transformer or LLM architecture (DeepSeek, Qwen, Llama, Mistral, gpt-oss, etc.), wants a Raschka-style figure, or wants a TikZ/LaTeX rendering of an HF model.

By yzlnew. Updated 5/22/2026

HF Architecture → TikZ

Generate a publication-quality vertical architecture diagram (in the style of Sebastian Raschka's LLM Architecture Gallery) for any HuggingFace decoder-only LLM. The diagram annotates every sub-block with its parameter-count formula and the concrete number for the loaded config.

When to use

  • "Draw the architecture of <HF repo>."
  • "Visualize how <model> is structured" / "make a diagram of <model> like Raschka's gallery."
  • "I want a TikZ figure of <model> for a paper / blog post."
  • The user mentions DeepSeek-V4-Flash, mHC / Hyper-Connections, MLA, MoE, sparse attention, or MTP and asks for a figure.

If the user only wants memory / parallelism numbers, use megatron-memory-estimator instead.

Quick start

cd hf-architecture-tikz/

# 1. Pull config from HF + emit normalized arch.json
uv run python scripts/extract_arch.py deepseek-ai/DeepSeek-V4-Flash \
    --output examples/deepseek-v4-flash/arch.json

# 2. Render TikZ from arch.json
uv run python scripts/render_tikz.py \
    examples/deepseek-v4-flash/arch.json \
    --output examples/deepseek-v4-flash/deepseek-v4-flash.tex

# 3. Compile to PNG
bash scripts/compile.sh examples/deepseek-v4-flash/deepseek-v4-flash.tex

For a model with custom code (e.g. brand-new architectures), pass --trust-remote-code. For a local config:

uv run python scripts/extract_arch.py /path/to/config.json --output arch.json

Workflow

  1. Acquire config. extract_arch.py tries transformers.AutoConfig first; if the installed transformers doesn't recognize the model_type (e.g. deepseek_v4, which introduces hc_mult and compress_ratios), it falls back to raw JSON via huggingface_hub.hf_hub_download. Local file paths bypass the network.
  2. Detect architecture family. Pure config-field rules — see references/architecture_families.md. The script labels the model with a family tag (mha, gqa, mla, dsv4) plus orthogonal flags (MoE, hash routing, shared experts, MTP, tied LM head, first_k_dense_replace).
  3. Compute parameter counts. Closed-form formulas keyed by family — see references/param_formulas.md. The script (not Claude) does the arithmetic and emits arch.json with one entry per architectural unit, each carrying name, family, shape_in, shape_out, formula_symbolic, formula_concrete, param_count.
  4. Assemble TikZ. render_tikz.py reads arch.json plus templates/anthropic.tex.j2 (Jinja2 template — all block macros are inlined for shared coordinate-space layout). The repeated transformer block is drawn once with a × N layers annotation; per-layer-varying behavior (V4-Flash compress_ratios, hash vs score routing) appears as a small pattern strip beneath the block.
  5. Compile. bash scripts/compile.sh out.tex runs xelatex ×2 (TikZ fit/positioning needs a second pass) then pdftocairo -png -r 300 -singlefile. Falls back to pdflatex if XeTeX is unavailable.
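
The config-acquisition fallback in step 1 can be sketched roughly as follows. This is a minimal illustration, not the actual contents of extract_arch.py: the function name load_config and its signature are hypothetical, and it assumes the raw config lives in a file named config.json at the repo root.

```python
import json
import os


def load_config(source: str, trust_remote_code: bool = False) -> dict:
    """Return a model config as a plain dict (hypothetical helper)."""
    # Local file paths bypass the network entirely.
    if os.path.isfile(source):
        with open(source, encoding="utf-8") as fh:
            return json.load(fh)

    try:
        # Preferred path: let transformers parse the config.
        from transformers import AutoConfig

        cfg = AutoConfig.from_pretrained(
            source, trust_remote_code=trust_remote_code
        )
        return cfg.to_dict()
    except Exception:
        # Unknown model_type (or transformers not installed):
        # fall back to the raw JSON file from the Hub.
        from huggingface_hub import hf_hub_download

        path = hf_hub_download(repo_id=source, filename="config.json")
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
```

Note that trust_remote_code defaults to False here, matching the rule that the flag is never enabled silently.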

Architecture family detection

Detection rules live in references/architecture_families.md. Summary:

| Family | Detector | Examples |
| --- | --- | --- |
| dsv4 | model_type == "deepseek_v4" or presence of hc_mult + compress_ratios + index_n_heads | DeepSeek-V4-Flash |
| mla | q_lora_rank + kv_lora_rank + qk_nope_head_dim + qk_rope_head_dim + v_head_dim | DeepSeek-V2/V3 |
| gqa | num_key_value_heads < num_attention_heads | Llama-3, Qwen3, Mistral |
| mha | otherwise | GPT-2, OPT |

Orthogonal flags: MoE (n_routed_experts/num_local_experts), hash routing (num_hash_layers > 0), shared experts (n_shared_experts > 0), MTP head (num_nextn_predict_layers > 0), tied LM head (tie_word_embeddings), dense-prefix layers (first_k_dense_replace > 0).
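
As a rough illustration of the rules above, the detection could be written as two small functions over the config dict. The function names and the flag keys in the returned dict are hypothetical; the sketch also assumes that "presence" means the key exists in the config, which may differ from what references/architecture_families.md actually specifies.

```python
DSV4_KEYS = ("hc_mult", "compress_ratios", "index_n_heads")
MLA_KEYS = (
    "q_lora_rank",
    "kv_lora_rank",
    "qk_nope_head_dim",
    "qk_rope_head_dim",
    "v_head_dim",
)


def detect_family(cfg: dict) -> str:
    """Return one of 'dsv4', 'mla', 'gqa', 'mha'. Rules are checked in order."""
    if cfg.get("model_type") == "deepseek_v4" or all(k in cfg for k in DSV4_KEYS):
        return "dsv4"
    if all(k in cfg for k in MLA_KEYS):
        return "mla"
    heads = cfg.get("num_attention_heads")
    kv_heads = cfg.get("num_key_value_heads")
    if heads is not None and kv_heads is not None and kv_heads < heads:
        return "gqa"
    return "mha"


def detect_flags(cfg: dict) -> dict:
    """Return the orthogonal flags as booleans (key names are illustrative)."""

    def positive(key: str) -> bool:
        # Missing or null fields count as zero.
        return (cfg.get(key) or 0) > 0

    return {
        "moe": positive("n_routed_experts") or positive("num_local_experts"),
        "hash_routing": positive("num_hash_layers"),
        "shared_experts": positive("n_shared_experts"),
        "mtp": positive("num_nextn_predict_layers"),
        "tied_lm_head": bool(cfg.get("tie_word_embeddings", False)),
        "dense_prefix": positive("first_k_dense_replace"),
    }
```

The dsv4 check comes first so that a config matching several rules resolves to the most specific family.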

Parameter formulas

Full table in references/param_formulas.md. Here d is the hidden size, Hkv the number of KV heads, dh the head dimension, f the FFN intermediate size, E the number of routed experts, and Es the number of shared experts. Attention, one line per family: MHA 4·d²; GQA 2·d² + 2·d·Hkv·dh; MLA six projections; DSv4 wq_a + q_norm + wq_b + wkv + kv_norm + wo_a + wo_b + attn_sink (+ Compressor + Indexer). FFN: SwiGLU 3·d·f. Standard MoE = E routed experts (each 3·d·f) + router d·E + Es shared experts. Hash MoE replaces the router with a vocab × topk token→expert table.
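
The closed-form summaries above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the formulas count weight matrices without biases or norms (as the one-line summaries do), the GQA formula assumes H·dh = d, and the shared experts are assumed to have the same 3·d·f size as routed experts. The example entry reuses the arch.json field names from the Workflow section with Llama-3-8B-like dimensions as sample inputs.

```python
def mha_attention_params(d: int) -> int:
    # Q, K, V, O projections, each d x d.
    return 4 * d * d


def gqa_attention_params(d: int, n_kv_heads: int, head_dim: int) -> int:
    # Q and O are d x d; K and V are d x (Hkv * dh).
    return 2 * d * d + 2 * d * n_kv_heads * head_dim


def swiglu_params(d: int, f: int) -> int:
    # Gate, up, and down projections.
    return 3 * d * f


def moe_params(d: int, f: int, n_routed: int, n_shared: int = 0) -> int:
    routed = n_routed * swiglu_params(d, f)
    router = d * n_routed
    # Assumption: each shared expert is as wide as a routed expert.
    shared = n_shared * swiglu_params(d, f)
    return routed + router + shared


# Sample inputs only (Llama-3-8B-like dimensions).
d, n_kv_heads, head_dim, f = 4096, 8, 128, 14336

attention_entry = {
    "name": "attention",
    "family": "gqa",
    "formula_symbolic": "2·d² + 2·d·Hkv·dh",
    "formula_concrete": f"2·{d}² + 2·{d}·{n_kv_heads}·{head_dim}",
    "param_count": gqa_attention_params(d, n_kv_heads, head_dim),
}
ffn_param_count = swiglu_params(d, f)
```

With these inputs the attention entry carries 41,943,040 parameters and the SwiGLU FFN 176,160,768.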

Worked example: DeepSeek-V4-Flash

The example under examples/deepseek-v4-flash/ covers the most architecturally novel components in the supported set:

  • Hyper-Connections (mHC): four parallel hidden-state copies, with Sinkhorn-balanced reduction (hc_sinkhorn_iters=20) before each sublayer and weighted expansion + cross-copy mixing after. Drawn as a fan-in / fan-out inside each block.
  • Sparse Attention: Q-LoRA (d → q_lora_rank → H·dh), KV projection (d → dh, Hkv=1), per-layer Compressor (overlap pooling for compress_ratio=4, block pooling for compress_ratio=128), learned Indexer for compress_ratio=4 layers (top-index_topk=512 selection over compressed KV), sliding window of 128, grouped O-LoRA (o_groups=8, o_lora_rank=1024).
  • MoE with hash routing: first 3 layers use a learned tid2eid table (vocab × topk); remaining 40 layers use sqrtsoftplus scoring + top-6 routing.
  • MTP head: one MTPBlock (= e_proj + h_proj + their RMSNorms + a full Block) for next-token prediction.
  • Compress-ratios pattern strip: drawn beneath the block to make the per-layer alternation [0, 0, 4, 128, 4, 128, …, 4, 0] visible.
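
The per-layer variation described in the bullets above can be summarized by a small planning function. This is a hedged sketch: layer_plan and pattern_strip are hypothetical names, the mapping of compress_ratio=0 to "no compressor" is an assumption, and the ratio list in the example is a shortened toy list, not the actual V4-Flash configuration.

```python
def layer_plan(compress_ratios: list, num_hash_layers: int) -> list:
    """Describe each layer's compressor, indexer use, and routing mode."""
    pooling = {4: "overlap pooling", 128: "block pooling"}
    plan = []
    for i, ratio in enumerate(compress_ratios):
        plan.append(
            {
                "layer": i,
                "compress_ratio": ratio,
                # Assumption: ratio 0 (or any unlisted ratio) has no compressor.
                "compressor": pooling.get(ratio),
                # The learned indexer applies to compress_ratio=4 layers.
                "indexer": ratio == 4,
                # The first num_hash_layers layers use the tid2eid hash table.
                "routing": "hash" if i < num_hash_layers else "score",
            }
        )
    return plan


def pattern_strip(compress_ratios: list) -> str:
    """Plain-text stand-in for the strip drawn beneath the block."""
    return " ".join(str(r) for r in compress_ratios)


# Toy input: NOT the real V4-Flash list, which is longer.
toy_ratios = [0, 0, 4, 128, 4, 0]
plan = layer_plan(toy_ratios, num_hash_layers=3)
strip = pattern_strip(toy_ratios)
```

Keeping the plan as data makes it easy for a renderer to draw the block once and place the varying behavior in a separate strip.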

Customization

  • Palette. Reuses the warm-pastel palette from tikz-flowchart/themes/anthropic.md (lavender = attention, mint = norm, teal = projection, cream = router/MoE infra, amber = experts, peach = embedding/output).
  • Detail level. The default is full expansion (every sub-block separately). To collapse sub-blocks, edit the dsv4 branch of templates/anthropic.tex.j2 and replace the inner attention expansion with a single rounded card.
  • Other models. The non-dsv4 branch of templates/anthropic.tex.j2 covers mha / gqa / mla (with optional MoE FFN) as a simpler vertical stack. The renderer dispatches based on the family flag emitted by extract_arch.py.

Troubleshooting

  • AutoConfig raises on unknown fields. Expected for very new model types. The loader catches and falls back to raw JSON automatically. If both fail, pass a local config.json path.
  • mbridge is unavailable / unsupported model. Not required — we use transformers + raw JSON. mbridge is referenced only for cross-checking V3/Qwen counts.
  • trust_remote_code warnings. extract_arch.py does not enable this flag silently. Pass --trust-remote-code only if the user explicitly requests it.
  • Tied embeddings double-counting. When tie_word_embeddings=True, the embedding-table contribution is folded into the LM head and not counted twice.
  • Tall PNG. Full expansion + side annotations + MTP branch typically renders to 4–6k pixels tall. Use --no-mtp (renderer flag) to suppress the MTP branch if you need a shorter figure.
  • xelatex not installed. The compile script falls back to pdflatex automatically. Font macros are guarded with \IfFontExistsTF.
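
To make the tied-embeddings rule concrete, here is a minimal sketch of how the vocabulary-sized matrices might be counted. The function name and the returned keys are hypothetical; it simply mirrors the statement that, with tie_word_embeddings=True, the shared table is attributed to the LM head and counted once.

```python
def embedding_and_head_params(vocab_size: int, d: int, tied: bool) -> dict:
    """Count embedding and LM-head weights without double-counting."""
    table = vocab_size * d
    return {
        # When tied, the shared table is folded into the LM head.
        "embedding": 0 if tied else table,
        "lm_head": table,
    }


tied_counts = embedding_and_head_params(vocab_size=1000, d=64, tied=True)
untied_counts = embedding_and_head_params(vocab_size=1000, d=64, tied=False)
```

Summing the two values gives vocab·d for a tied model and 2·vocab·d for an untied one.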

Dependencies

Python: transformers, huggingface_hub, jinja2. Run via uv run. System: xelatex (preferred) or pdflatex; pdftocairo (from poppler).

Install via CLI
npx skills add https://github.com/yzlnew/infra-skills --skill hf-architecture-tikz
Repository Details
Forks: 9
Branch: main
Path: SKILL.md