tokenizer-vs-adapter-picker - SKILL.md Agent Skill

name: tokenizer-vs-adapter-picker
description: Pick between Chameleon-style early fusion (shared-vocab tokenizer) and LLaVA-style late fusion (adapter on frozen LLM) for a VLM project.
title: "Tokenizer Vs Adapter Picker"
version: 1.0.0
phase: 12
lesson: 11
tags: [chameleon, early-fusion, vq-vae, late-fusion, adapter]
category: tokenizer-vs-adapter-picker
audience: user

Given a product specification (understanding-only or understanding+generation), target image quality (social-post / magazine / print / broadcast), and cost budget (training + inference), recommend Chameleon-family or LLaVA-family with a concrete architecture outline.

Produce:

Verdict. Early-fusion (Chameleon / Emu3 / AnyGPT) or late-fusion (LLaVA / BLIP-2 / Qwen-VL) family.
Tokenizer pick (for early-fusion verdicts). VQ-VAE (Chameleon), MAGVIT-v2, IBQ, or SBER-MoVQGAN; cite the expected reconstruction ceiling in PSNR.
Training-stability plan. QK-Norm, dropout placement, LayerNorm ordering for early-fusion at scale.
Cost estimate. Training GPU-hours and inference latency per image vs the late-fusion alternative.
Generation-quality ceiling. PSNR / FID range the user can expect; whether the product's quality bar is reachable with discrete tokens or needs continuous (Transfusion-style) generation.
Migration path. If the product grows and late-fusion becomes limiting (for example, image output is now required), describe what the migration to early-fusion or continuous generation looks like.
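
The tokenizer and quality-ceiling items both quote PSNR, which is derived from the mean squared error between an original image and its tokenizer reconstruction. A minimal sketch, assuming 8-bit pixel values flattened into plain Python sequences; the function name `psnr` and its parameters are illustrative, not part of any tokenizer library:

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better; when comparing candidate tokenizers, measure all of them on the same held-out images at the same resolution so the numbers are comparable.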

Hard rejects:

Recommending Chameleon-style early fusion for understanding-only products. Late-fusion is simpler, cheaper, and has a higher quality ceiling for pure understanding.
Proposing a VQ-VAE with codebook size K < 4096 for production image generation. The codebook is too small and artifacts are visible.
Claiming early-fusion inference is free. The VQ decoder adds 50-200 ms per generated image, often more than the LLM output time.
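
The three hard rejects above can be expressed as a check run against a draft plan before it is returned. This is only a sketch under stated assumptions: `DraftPlan`, its field names, and `hard_reject_reasons` are hypothetical names that mirror the "Produce" items, and the only numeric threshold used is the K < 4096 bound from the list.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_CODEBOOK_SIZE = 4096  # K threshold taken from the hard-reject list


@dataclass
class DraftPlan:
    # Hypothetical fields; adapt to however the plan is actually represented.
    family: str                           # "early-fusion" or "late-fusion"
    needs_image_output: bool
    codebook_size: Optional[int] = None   # K, only meaningful for early-fusion
    vq_decode_ms: Optional[float] = None  # per generated image; None = not costed


def hard_reject_reasons(plan: DraftPlan) -> List[str]:
    """Return the hard-reject rules a draft plan violates (empty list if none)."""
    reasons = []
    if plan.family != "early-fusion":
        return reasons  # all three rules concern early-fusion plans
    if not plan.needs_image_output:
        reasons.append("early-fusion recommended for an understanding-only product")
        return reasons
    if plan.codebook_size is not None and plan.codebook_size < MIN_CODEBOOK_SIZE:
        reasons.append("codebook size K below 4096 for production image generation")
    if plan.vq_decode_ms is None or plan.vq_decode_ms <= 0:
        reasons.append("VQ decoder latency missing from the cost estimate")
    return reasons
```

A plan that returns a non-empty list should be revised rather than shipped with a caveat.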

Refusal rules:

If the user wants frontier-quality image generation (FID < 15, print-ready), refuse discrete tokens and point to Transfusion / Stable Diffusion 3 / MMDiT (Lesson 12.13).
If the product never needs image output, refuse early-fusion; the complexity is unwarranted.
If the user wants to plug in existing Llama / Qwen LLM weights, refuse early-fusion because it requires pretraining a fresh model.
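
The refusal rules can be read as a small function from a product spec to the options that must be refused. A minimal sketch, assuming a hypothetical `ProductSpec` with illustrative field names; the FID < 15 and print-ready conditions come from the rules above, and everything else is an assumption about how the spec is captured:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ProductSpec:
    # Hypothetical fields describing what the user asked for.
    needs_image_output: bool
    target_fid: Optional[float] = None   # lower is better; None = not specified
    print_ready: bool = False
    reuse_pretrained_llm: bool = False   # e.g. existing Llama / Qwen weights


def refusals(spec: ProductSpec) -> List[Tuple[str, str]]:
    """Return (refused_option, reason) pairs implied by the refusal rules."""
    out = []
    if not spec.needs_image_output:
        out.append(("early-fusion", "product never needs image output"))
        return out
    frontier = spec.print_ready or (spec.target_fid is not None and spec.target_fid < 15)
    if frontier:
        out.append(("discrete-tokens", "frontier-quality generation needs continuous generation"))
    if spec.reuse_pretrained_llm:
        out.append(("early-fusion", "requires pretraining a fresh model"))
    return out
```

Returning the refused options rather than a single verdict keeps conflicting requests visible, such as image output combined with reuse of pretrained LLM weights, so the plan can state the trade-off explicitly.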

Output: one-page plan with verdict, tokenizer pick, stability checklist, cost estimate, quality ceiling, migration path. End with arXiv 2405.09818 (Chameleon) and 2408.11039 (Transfusion) for comparison reading.