name: trtllm-model-onboard-multimodal
description: >
Onboard a HuggingFace multimodal model (vision/audio/video + text) to the
TensorRT-LLM PyTorch backend. Use when writing a new
tensorrt_llm/_torch/models/modeling_<vlm>.py plus its input processor and
weight mapper, or extending an existing VLM. Not for AutoDeploy — use
ad-model-onboard for that path.
license: Apache-2.0
metadata:
author: NVIDIA Corporation
TensorRT-LLM Multimodal Model Onboarding (PyTorch backend)
Scope. PyTorch backend only (
tensorrt_llm/_torch/) — the default forLLM(..., backend="pytorch"),trtllm-serve,trtllm-bench. Not for AutoDeploy (tensorrt_llm/_torch/auto_deploy/); usead-model-onboardfor that.
Output:
tensorrt_llm/_torch/models/modeling_{name}.py— wrapper class (multimodal encoder + LLM) decorated with@register_auto_model,@register_vision_encoder,@register_input_processor(and@support_multimodal_disaggregatedif EPD is supported), plus aBaseMultimodalInputProcessor(+BaseMultimodalDummyInputsBuilder) subclass._torch/models/checkpoints/hf/{name}_weight_mapper.pyif HF prefixes need surgery.- Per-model unit test under
tests/unittest/_torch/modeling/test_modeling_<name>.py(subclass ofTestModelingMultimodal); supplemental utility tests undertests/unittest/_torch/multimodal/if needed; an accuracy test undertests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py; support-matrix entry; verifiedtrtllm-serveflow.
System map
Aggregated path (default)
[1] API event loop (server-side, async)
The chat handler wraps each image/video/audio URL part as an async_load_*
coroutine (not yet awaited). apply_chat_template builds the text prompt.
asyncio.gather then decodes all media for one request in parallel.
[2] Input pipeline (asyncio.to_thread, off the event loop)
BaseMultimodalInputProcessor.__call__ dispatches by input shape: a text
prompt goes to the per-model call_with_text_prompt; prompt_token_ids +
mm_data goes to the base-class call_with_token_ids fast path (or is
detokenized back to call_with_text_prompt when the model opts out). The
per-model HF processing lives in call_with_text_prompt:
HF AutoProcessor → pixel_values + token_ids
mm-token layout (positions / lengths / special_token_offsets)
(mRoPE) mrope_position_ids + deltas computed on CPU
_postprocess: HF mm token ids → tllm_multimodal_token_id (OOV sentinel)
The framework wrapper around your processor computes blake3 content hashes
for KV-cache reuse.
MultimodalParams.to_handle("multimodal_data") at the end → each tensor in
multimodal_data is replaced by a small dict pointing at its CUDA-IPC / shm
handle, so the broadcast in [3] carries pointers, not megabytes of pixels.
[3] Worker fan-out (TP / PP / CP)
Each worker rebuilds local tensor views via to_tensor("multimodal_data").
multimodal_input (hashes / positions / lengths) is forwarded to the C++
executor to drive KV-cache hash matching.
[4] Per-iteration staging (model engine)
Context: build MultimodalRuntimeData (positions / lengths / chunk bounds)
→ push pixel_values to CUDA pinned + non_blocking, obeying the
model's multimodal_data_device_paths declaration; pad
mrope_position_ids into a preallocated CUDA buffer.
Generation (mRoPE only): strip everything except mrope_position_deltas.
Post-prefill: drop mm_data so it doesn't ride along in decode.
[5] Model.forward(attn_metadata, input_ids, position_ids, multimodal_params=…)
get_multimodal_embeddings: runs encoder.forward only on params whose
multimodal_data["multimodal_embedding"] is empty (chunked-prefill iter
2+ hits the per-request cache; results written back automatically).
find_input_mm_embeds: slices the cached embedding to the current chunk
under chunked prefill / KV-cache reuse.
prepare_mrope_config (mRoPE models): one-shot mrope_rotary_cos_sin per
request from the staged mrope_position_ids buffer.
fuse_input_embeds: text + mm merged via precomputed indices
(with optional extra_embeds for multi-feature encoders).
self.llm.forward(inputs_embeds=..., mrope_config=...) → logits.
Key invariants:
- [1] and [2] both run off the API event loop. [1] fans media decode out across one request's items with
asyncio.gather; [2] is single-threaded per request because the HF processor is request-scoped. - The producer hands off as handles at the end of [2], so the broadcast in [3] stays small (Contract 3).
- [4] is the only per-iteration GPU staging; H2D is
non_blocking=Truefrom pinned host memory. - [5] runs on the compute stream and must be sync-free (Contract 1).
EPD-disaggregated path
When @support_multimodal_disaggregated is set and the deployment uses TLLM_MULTIMODAL_DISAGGREGATED=1:
- Encoder worker: runs as a standalone
MultimodalEncoder(mm_encoder_only=True). It executes only the multimodal encoder and shipsmm_embeddings(+ mRoPE position ids/deltas) to prefill+decode workers as shared-tensor handles. - Prefill+decode worker: the model's
__init__skips constructingself.mm_encoderwhen_is_mm_disagg()is true; the input processor'sattach_multimodal_embeddings()override binds the encoder handles into the request. For context-only requests, the engine re-clones mrope tensors so IPC handles outlive the encoder worker's freed memory — replicate that pattern for any new GPU-resident mm tensors.
Templates to study
modeling_qwen3vl.py, modeling_llava_next.py, and modeling_gemma3vl.py are the canonical references — fully-ported encoder, single-class wrapper, text_config-based LLM resolution. Other examples by modality: modeling_pixtral.py, modeling_phi4mm.py (audio), modeling_mllama.py, modeling_hyperclovax.py, modeling_mistral_large3.py. Pick the closest one (modality + LLM family + RoPE variant). modeling_qwen2vl.py retains an HF-passthrough vision tower for the outdated Qwen2-VL family — read it for context but don't copy that pattern.
Reuse before you write (the most important rule)
Every common block has a TRT-LLM implementation; compose, don't reimplement. Hand-rolled nn.Linear / nn.LayerNorm / nn.MultiheadAttention silently work in fp16/bf16 single-GPU eager and silently break under quantization, TP, attention-backend selection, KV cache, and CUDA graphs. Browse tensorrt_llm/_torch/modules/ before writing a layer; the reference VLMs (modeling_qwen3vl.py, modeling_llava_next.py, modeling_gemma3vl.py) show canonical wiring. Reuse is also where every future perf improvement lands automatically.
Compute modules
Mappings most often missed by adapters:
| Concern | Module | Non-obvious wiring |
|---|---|---|
| Linear | _torch.modules.linear.Linear |
Pass mapping=model_config.mapping, tensor_parallel_mode=TensorParallelMode.{COLUMN,ROW,NONE}, allreduce_strategy=model_config.allreduce_strategy. Every quant scheme (FP8 / NVFP4 / W4A8 / AWQ / weight-only) is automatic — never substitute nn.Linear. |
| Attention (text and vision) | _torch.modules.attention.Attention (variants: qk_norm_attention.QKNormRoPEAttention for QK-norm + YARN; attention.MLA for DeepSeek-style) |
Same module runs the LLM and the multimodal encoder. For the encoder side, build an ad-hoc attn_metadata per forward and pass predefined_attention_mask=PredefinedAttentionMask.FULL (or windowed). Reference: Qwen2_5_VLVisionAttention.prepare_attn_metadata. |
| MLP / Gated MLP | _torch.modules.mlp.MLP, _torch.modules.gated_mlp.GatedMLP, _torch.modules.swiglu.swiglu |
GatedMLP covers the SwiGLU pattern (gate + up + silu + down) with fused gate/up weights — don't roll it from two Linears and F.silu. Plain MLP for non-gated cases. Both inherit the same TP / quant story as Linear. Reference: Qwen2_5_VLMLP. |
| RoPE | _torch.modules.rotary_embedding.{RotaryEmbedding, MRotaryEmbedding} |
MRotaryEmbedding (mRoPE) is for the LLM side of mRoPE-using VLMs (Qwen-VL family), with mrope_section-aware cos/sin slicing and 3D position_ids. The encoder's internal 2D RoPE uses plain RotaryEmbedding. |
LLM backbone — reuse via AutoModel
The inner LLM is loaded via TRT-LLM's own AutoModelForCausalLM (tensorrt_llm._torch.models.modeling_auto), not transformers.AutoModelForCausalLM. It dispatches on pretrained_config.architectures[0] to whichever class is registered via @register_auto_model. The canonical wiring (using text_config to surface the inner LLM) lives in the Phase 2 template; if the inner LLM doesn't yet have a TRT-LLM modeling file, finish that text-only onboarding first.
Multimodal encoder — port to TRT-LLM modules
This is required, not a preference. Re-implement encoder blocks from _torch.modules.*; the encoder builds its own attn_metadata via prepare_attn_metadata. Reference: Qwen2_5_VisionModel, Qwen2_5_VLVisionAttention, Qwen2_5_VLPatchMerger. Two reasons:
- Performance. HF-eager runs on PyTorch's stock kernels — vanilla SDPA,
nn.LayerNorm, plainnn.Linear— losing TRT-LLM-attention / FlashInfer, fused RMSNorm, FP8/NVFP4/AWQ Linear, TP, and CUDA-graph capture for static-shape paths. For a 0.5–7 B encoder running every prefill, the regression compounds each iteration. - Version coupling. Every
from transformers.models.<family> import <X>ties the modeling file to a specifictransformersrelease. Upstream HF refactors (renamed classes, signature changes, internal helper migrations) silently break TRT-LLM imports months later, often surfacing only when users upgrade their environment. Porting cuts the dependency. The same applies to importing HF computations / helper functions, not just modules — keep both out of new modeling files.
The lone existing exception is Qwen2VLModel, which keeps Qwen2VisionTransformerPretrainedModel from transformers because Qwen2-VL is an outdated family on life support — not because passthrough is acceptable for new onboarding. Don't copy that pattern. If you genuinely cannot port (e.g. patching an existing legacy model), the HF import must carry a code comment explicitly justifying it; otherwise PR review will bounce.
Weight loading — reuse mappers
If HF prefixes don't match (model.vision_tower.* → mm_encoder.*, fused/un-fused QKV, etc.), inherit from a related mapper rather than ad-hoc translation. Reference: _torch/models/checkpoints/hf/qwen2vl_weight_mapper.py, qwen3vl_weight_mapper.py.
Host memory during init / weight loading. Large VLMs can blow past host RAM if every rank materializes the full state_dict before sharding. Two patterns from modeling_nemotron_nano.py:NemotronH_Nano_VL_V2 (PR #13283):
- Defer multimodal-encoder construction out of
__init__and intoload_weights()when the encoder contains HF submodules whose deterministic init ops (ones_,zeros_,fill_,.detach(),.to(dtype=...)) clash with the LLM'sMetaInitModefast path. Snapshot the multimodalModelConfigin__init__(sincepost_config()overwritesself.model_config.pretrained_configto the LLM-only config), construct the encoder +.to("cuda")insideload_weights(). OtherwiseMetaInitModeraises and the entire model falls back to slow CPU init. - Call
weights.mark_consumed(<prefix>)after each sub-module'sload_weights(...)so the mmap-backed shards behind those weights can be released. Without it, peak host memory holds the entire checkpoint; with it, peak holds only the shard you're currently loading. Tag every prefix you've finished — encoder, sound, projector, LLM.
Don't touch
PyExecutor + the C++ core own AttentionMetadata, KV cache, scheduler, sampler, decoder. Your model receives attn_metadata and multimodal_params as inputs and returns logits — never builds request-level metadata. The only attn_metadata you build yourself is the multimodal encoder's own, on the synthetic per-item batch (concatenated patches with per-image seqlens, mel frames, etc.).
Performance contracts
Three rules. Multimodal prefill is long (image/audio tokens balloon sequence length) and media tensors are big (MBs–GBs); the overlap scheduler hides host work behind GPU work only if all three hold.
Contract 1 — Zero CPU-GPU syncs inside forward
A single sync inside forward collapses overlap, and per-iteration GPU work is long for VLMs, so stalls compound.
Banned in forward and anything it calls:
.item(),.tolist(),int(t),bool(t),float(t)on GPU tensorst.cpu(),t.to("cpu"), any device-crossing read- Python
if/whileon tensor values (shape is fine; values are not) torch.nonzero, single-argtorch.where(condition)(index form; documented sync hazard infilter_mm_token_from_input_idswhen run on GPUinput_ids),torch.unique,masked_selecttorch.tensor([...], device="cuda")from a Python list (hidden H2D)- HF runtime branches (
if pixel_values is None: ...) that change tensor shapes
Three-arg torch.where(cond, x, y) is fine when cond is built only on-device (no scalar readback). fuse_input_embeds: kwargs text_token_indices + mm_token_indices together ⇒ skip internal filter_*. trtllm-serve usually supplies both via model_engine.py (CPU-side index build → inputs → fuse_input_embeds(..., **kwargs)). Pure-text batches have no MM inputs; bare unit tests / direct calls may omit indices ⇒ in-model filter_* runs.
Patterns:
- Static graph for mixed batches. Don't add
if has_mm:branches.find_input_mm_embedsreturns input unchanged when runtime is None;fuse_input_embedsreturns(input_ids, None)whenmm_embeds == []— preserve that contract. - mRoPE: compute once per request, never per layer. The pipeline (input processor → engine →
prepare_mrope_config) is laid out in the system map; the constraint here is that per-layer attention must read pre-sliced(cos, sin)— never recompute mrope inside the decoder loop.
Audit: grep for the banned constructs; run one prefill iteration with torch.cuda.set_sync_debug_mode("warn") and confirm zero warnings from your model.
Contract 2 — Preprocessing on CPU, async, server-side
CPU-bound work (decode / resize / normalize / mel-spectrogram / frame extraction) must not compete with GPU work, block the request loop, or serialize across requests.
- HF AutoProcessor + image_processor + tokenizer run inside the input processor's
call_with_text_prompt(dispatched from__call__) — not in the model worker. - URL/bytes media goes through
async_load_image/async_load_video/async_load_audio(all wrap blocking decode inasyncio.to_thread). Never callPIL.Image.open(...).load()/cv2.VideoCapture/soundfile.readsynchronously on the request hot path. - Pin host tensors before H2D with
prefer_pinned()(False under Confidential Compute (CC), True otherwise). The engine pinsmultimodal_dataautomatically viato_device(..., pin_memory=prefer_pinned()). - Declare
multimodal_data_device_pathson the model — list of dotted paths (e.g.["image.pixel_values", "image.image_grid_thw", "video.pixel_values_videos", "video.video_grid_thw", "multimodal_embedding"]) telling the engine which fields go to CUDA. Anything not listed stays on CPU. - Optional tokenized+MM fast path: set
supports_token_id_mm_expansion = True(aClassVar, defaultFalse) and implementget_text_with_mm_placeholders+expand_prompt_token_ids_for_mm. The base-class__call__then routesprompt_token_ids + multi_modal_data(noprompt) requests throughcall_with_token_ids, skipping redundant detokenization. When the flag isFalse(most VLMs), the base class detokenizesprompt_token_ids → promptand re-runscall_with_text_prompt, so token-ID inputs still work — just less efficiently. Only LlavaNext + NanoV2VL opt in today. - Forward
mm_processor_kwargsfrominputs.get("mm_processor_kwargs", {})to the HF processor (callers tune things like video sample rate via this).
Contract 3 — Large media via shared tensors, never raw pickle
A 1024×1024 fp32 patch tensor is ~12 MB; a video clip can be hundreds of MB. Naive pickle through MPI broadcast turns the leader into the IPC bottleneck.
- Always use
MultimodalParams.to_handle/to_tensor.to_handleswaps each tensor insidemultimodal_datafor a small dict —{method_key, tensor_size, storage_handle, ...}— that points at the same memory: a CUDA-IPC handle for GPU tensors (REBUILD_CUDA) or a POSIX-shm handle for CPU tensors (REBUILD_CPU). The dict is a few hundred bytes regardless of the original tensor size. Consumers callto_tensorto rebuild local tensor views from the handle. See_torch/shared_tensor/. - Where it crosses ranks: the executor broadcasts
py_multimodal_dataviadist.broadcast/tp_cp_broadcast/ PP send-recv. Payload size = the literal byte size of whatever's inpy_multimodal_data— confirm every tensor inside has been swapped for its handle dict (i.e.to_handleran) before this point. - Strip after prefill.
_strip_py_multimodal_data_post_prefillclears everything exceptmrope_config.mrope_position_deltas. If your model needs to retain something across decode, updatestrip_mm_data_for_generationexplicitly. - EPD disagg. Embeddings still cross workers as shared tensors, not bytes — see the EPD-disaggregated path section above for the encoder/prefill-worker split.
- Hashes are small; broadcast eagerly.
MultimodalInput.multimodal_hashes(blake3) drives KV-cache reuse — never substitute raw pixels for them.
Audit: payload size in NVTX broadcast_requests / tp_broadcast_requests ranges should be < 1 MB per rank per request. More means a broadcast leaked raw tensors.
Contract 4 — Batch the multimodal encoder across requests
get_multimodal_embeddings hands the encoder a list of MultimodalParams covering every uncached request in the current batch. The encoder must consume that list as a single batched forward pass — concatenate every request's pixel_values / image_grid_thw / mel frames into one tensor, build one ad-hoc attn_metadata whose seq_lens carries per-image boundaries, and run the encoder blocks once. Looping for p in mm_params: encoder.forward(p) loses kernel-launch coalescing and serializes N requests' worth of encoder work.
Pattern (Qwen2.5-VL). Qwen2_5_VisionModel concatenates every request's pixel_values into one [total_patches, ...] tensor and builds attn_metadata with batch_size=1 and seq_lens=[img1_patches, img2_patches, ...]. The TRT-LLM Attention module respects seq_lens so cross-image attention doesn't bleed. The patch merger / projector at the end then splits the result back per-request via torch.split over the same lengths (this is what _cache_multimodal_embeddings expects too).
Audit. Under load with several multimodal requests in one batch, the encoder kernels in nsys should appear as one wide block per iteration, not N narrow blocks. A fan of N narrow blocks means the encoder is being looped per request instead of batched — one of the easiest VLM perf regressions to introduce while refactoring.
Phases
Phase 0 — Gather resources
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
Confirm preprocessor_config.json and chat_template.json are pulled. Verify AutoProcessor.from_pretrained(model_path) loads. Estimate LLM + multimodal encoder params for VRAM sanity (multimodal encoders are often 0.5–7 B on top of the LLM).
Phase 1 — Survey existing coverage
Read config.json's architectures and model_type. If a _torch/models/modeling_*.py already claims that architecture via @register_auto_model, extend rather than create new. Identify the closest existing multimodal model and note which TRT-LLM modules it reuses.
Phase 2 — Model wrapper
Create tensorrt_llm/_torch/models/modeling_{name}.py. The default pattern below mirrors modeling_llava_next.py and modeling_gemma3vl.py — a single wrapper class that composes a multimodal encoder + an LLM resolved through AutoModelForCausalLM.from_config(text_config). The *ModelBase + *Model Base/non-Base split in modeling_qwen2vl.py and modeling_qwen3vl.py is an implementation detail for sharing one wrapper between two variants of the same family (Qwen2-VL ↔ Qwen2.5-VL; Qwen3-VL ↔ Qwen3-VL-MoE) — keep the wrapper a single class unless you have the same multi-variant need.
class {Name}VisionModel(nn.Module):
"""Multimodal encoder. Composes _torch.modules.{Attention,Linear,RMSNorm,GatedMLP,RotaryEmbedding}."""
def forward(self, multimodal_params: List[MultimodalParams]) -> torch.Tensor:
# Concat pixel_values across all requests, build per-image attn_metadata
# via prepare_attn_metadata, then run encoder blocks once (Contract 4).
...
class {Name}Model(PreTrainedModel):
config_class = {Name}Config
def __init__(self, model_config: ModelConfig[PretrainedConfig], *args, **kwargs):
config = model_config.pretrained_config
super().__init__(config)
if hasattr(self, "llm"):
return # idempotency guard — re-entry from `post_config` etc.
if not _is_mm_disagg():
self.mm_encoder = {Name}VisionModel(model_config)
else:
self.mm_encoder = None
# Inner LLM is resolved from text_config; no architectures rewrite needed.
llm_model_config = copy.deepcopy(model_config)
llm_model_config.pretrained_config = model_config.pretrained_config.text_config
# TRT-LLM's AutoModel (tensorrt_llm._torch.models.modeling_auto), not transformers'.
self.llm = AutoModelForCausalLM.from_config(llm_model_config)
self.model_config = model_config
self.post_config()
def post_config(self):
# After llm is constructed, downstream code expects self.config to be the LLM-shaped config.
self.config = self.llm.config
self.model_config.pretrained_config = self.llm.config
@property
def vocab_size_padded(self) -> int:
return self.llm.vocab_size_padded
def infer_max_seq_len(self) -> int:
return self.llm.infer_max_seq_len()
@torch.inference_mode()
def forward(
self,
attn_metadata: AttentionMetadata,
input_ids: Optional[torch.IntTensor] = None,
position_ids: Optional[torch.IntTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
return_context_logits: bool = False,
**kwargs,
) -> torch.Tensor:
num_context_requests = attn_metadata.num_contexts
multimodal_params = kwargs.get("multimodal_params", [])
mm_embeds = []
if len(multimodal_params) > 0 and not _is_mm_disagg():
mm_embeds = get_multimodal_embeddings(
encoder_forward_fn=self.mm_encoder.forward,
multimodal_params=multimodal_params[:num_context_requests],
)
mm_embeds = find_input_mm_embeds(
mm_embeds, multimodal_params[:num_context_requests])
input_ids, inputs_embeds = fuse_input_embeds(
self.llm.model.embed_tokens, input_ids, mm_embeds, **kwargs)
return self.llm.forward(
attn_metadata=attn_metadata, input_ids=input_ids,
position_ids=position_ids, inputs_embeds=inputs_embeds,
return_context_logits=return_context_logits)
@property
def multimodal_data_device_paths(self) -> List[str]:
return ["image.pixel_values", "image.image_grid_thw", "multimodal_embedding"]
Required (every multimodal model):
forwardtakesmultimodal_paramsvia**kwargs. Never addpixel_values/image_grid_thw/attention_maskas direct args — they live inmultimodal_params.multimodal_data.- Encoder output length must match the input processor's MM placeholder count.
mm_encoder.forwardmust return a single tensor whose first dimension equals the total number of MM tokens (excluding special tokens) the input processor placed inprompt_token_ids. If lengths don't agree — or if the encoder returns a list with more than one element —get_multimodal_embeddingssilently skips caching the embedding back intomultimodal_data, and chunked prefill re-runs the encoder from scratch on every chunk.
Family-specific extras (apply only when relevant):
- mRoPE (Qwen-VL family): add
init_mrope_embedding(model_config)in__init__to preallocateself.mrope_position_ids_padding_cuda, plusprepare_mrope_config(multimodal_params, num_context_requests)returningmrope_rotary_cos_sin. Pass through toself.llm.forward(..., mrope_config=...). Reference:Qwen3VLModelBase.prepare_mrope_config. - Deepstack features (Qwen3-VL): split encoder output into
mm_embed+deepstack_embeds, callfuse_input_embeds(..., extra_embeds=deepstack_embeds), forwarddeepstack_embeds=into the LLM. - HF wrapper without a clean
text_config: Qwen2-VL'sQwen2VLModelBaserewritesarchitecturesto surface the inner LLM. Fall back to that pattern only when the multimodal HF config does not expose atext_configsub-config. - Inner LLM that doesn't match HF's
text_configschema (Qwen3.5-MoE-VL → Qwen3Next). When the VLM's HFtext_configschema differs from the TRT-LLM runtime model you want to reuse, write a config normalizer (e.g._normalize_qwen35_moe_vl_config) that maps HF aliases to the runtime's expected names (mRoPE keys,intermediate_sizealiases, quantization-exclude module paths). Wire it via lazy import frompyexecutor.config_utils.load_pretrained_config— theMistralandQwen3_5branches are templates. Two gotchas: transformers 5.x'srope_scalingis a property aliasingrope_parameters— setting either silently overwrites the other, so the normalizer should mutaterope_parametersdirectly if the HF code still reads from it. And for VLMs, the normalizer must run on the composite config (withtext_config/vision_config), not flattened away. - Thin wrapper for runtime reuse. Even when the LM class body is identical to the runtime's existing class, still create a
@register_auto_model("YourArch")-decorated thin subclass — that's how weight-mapper dispatch picks the family-specific mapper. You can't stack two@register_auto_modeldecorators on a single shared class.
Phase 3 — Input processor + dummy builder
Subclass both BaseMultimodalInputProcessor (drives every real request) and BaseMultimodalDummyInputsBuilder (drives engine warmup / profiling — the base shrinks dummy image resolution until the synthetic prompt fits input_seq_len). Colocate in the modeling file. Reference: Qwen3VLInputProcessorBase.
Implement call_with_text_prompt(inputs, sampling_params) — the per-model text-prompt path. Don't override __call__: the base class's concrete __call__ dispatches here for text prompts, and also detokenizes prompt_token_ids → prompt and falls through to here for non-fast-path VLMs. call_with_text_prompt does:
- Pull
text_prompt,mm_data,mm_processor_kwargsfrominputs. _preprocess(...)— HF processor producespixel_values/pixel_values_videos/*_grid_thw/input_ids.- Build
multimodal_datakeyed by modality:{"image": {"pixel_values": ..., "image_grid_thw": ...}, "video": {...}}. - Compute
mrope_configon CPU (.to("cpu").clone()) intomultimodal_data["mrope_config"]. Required even on text-only Qwen-VL prompts — no branch. _postprocess(input_ids)rewrites HF'simage_token_id/video_token_idtotllm_multimodal_token_id = vocab_size + 1(the OOV sentinel). Skip whenmm_datais empty.- Return
(prompt_token_ids_list, {"multimodal_data": multimodal_data}).
Optional tokenized+MM fast path (skip unless needed): set supports_token_id_mm_expansion = True (ClassVar) and implement get_text_with_mm_placeholders(mm_counts) + expand_prompt_token_ids_for_mm(prompt_token_ids, num_mm_tokens, ...). The base-class call_with_token_ids then builds dummy placeholder text, runs call_with_text_prompt on it, expands the real token IDs, and merges any returned mm_data_updates (e.g. video evs_ids) into multimodal_data. Leave the flag False and the base class just detokenizes token-ID inputs and re-runs call_with_text_prompt. Only LlavaNext + NanoV2VL opt in today.
EPD override (if @support_multimodal_disaggregated): override _attach_multimodal_embeddings_impl(inputs, multimodal_embedding, sampling_params) — not the attach_multimodal_embeddings wrapper — to consume encoder outputs in the prefill+decode worker. The base wrapper detokenizes tokenized inputs for non-fast-path VLMs before delegating to your impl.
Decorator stack — bottom-up application; register_vision_encoder requires register_auto_model to have run first:
@support_multimodal_disaggregated # outermost (after validation)
@register_vision_encoder({Name}VisionModel, vlm_base_model=HFVisionTransformerClass)
@register_auto_model("{ArchName}ForConditionalGeneration")
@register_input_processor(
{Name}InputProcessor, model_type="{model_type}",
placeholder_metadata=MultimodalPlaceholderMetadata(
placeholder_map={"image": "<|vision_start|><|image_pad|><|vision_end|>", ...},
placeholder_placement=MultimodalPlaceholderPlacement.BEFORE_TEXT,
placeholders_separator="",
content_format=ContentFormat.STRING,
),
)
class {Name}Model(PreTrainedModel): ...
Phase 4 — Weight loading
def load_weights(self, weights, weight_mapper):
if not _is_mm_disagg():
self.mm_encoder.load_weights(weights)
# Release mmap pages backing the encoder weights as soon as we're done.
if hasattr(weights, "mark_consumed"):
weights.mark_consumed("vision_model") # adjust prefix per your checkpoint
weight_mapper = {Name}HfWeightMapper()
weight_mapper.init_model_and_config(self.llm, self.model_config)
filtered = {k: v for k, v in weights.items() if not k.startswith("model.visual.")}
self.llm.load_weights(filtered, weight_mapper)
if hasattr(weights, "mark_consumed"):
weights.mark_consumed("language_model")
Inherit from a related mapper for prefix surgery — don't write a one-off translator.
Phase 5 — Tests
Per-model unit test (the main one) at tests/unittest/_torch/modeling/test_modeling_<name>.py. Subclass TestModelingMultimodal from tests/unittest/_torch/modeling/test_modeling_multimodal.py (an abstract unittest.TestCase) and implement six abstract methods: get_model_config, get_trtllm_model_class, get_hf_model_class, get_weight_mapper_class, get_model_type, get_model_config_class. The base class drives a MultimodalScenario-parameterized run (modality ∈ image / multiple_image / video / text / mixture_text_image / audio, with optional use_cuda_graph / chunked_prefill / kv_cache_reuse) — comparing TRT-LLM logits to HF reference, exercising the KV cache manager, attn metadata, mrope, fusion path, and CUDA graph capture in one harness. Override get_scenarios to declare which combinations apply to your model. Reference: test_modeling_qwen3vl.py, test_modeling_qwen2_5vl.py, test_modeling_nemotron_nano_v2_vl.py. Test data lives under ${LLM_MODELS_ROOT}/multimodals/test_data/.
Hybrid linear-attention models. Override _dummy_request_kwargs to return {"use_mrope": True} if the model uses mRoPE (allocates the 3-D position-id buffer at dummy-request time). The base class's init_kv_cache_manager already dispatches on is_qwen3_hybrid / is_nemotron_hybrid to build CppMambaHybridCacheManager — don't override unless you need a different concrete manager. Use PyKvCacheConfig from llmapi.llm_args (Pydantic), not the C++ bindings KvCacheConfig — CppMambaHybridCacheManager.__init__ reads mamba_state_cache_interval which only exists on the Pydantic side. CUDA-graph capture in the harness doesn't currently address the Mamba SSM state buffer — keep use_cuda_graph=False in get_scenarios for hybrid models until that's wired through; production CUDA-graph support is independent and unaffected.
Synthetic-config shape couplings. head_dim × partial_rotary_factor / 2 == sum(mrope_section) — head_dim can't be shrunk independently. If the test loads the real tokenizer via _name_or_path, vocab_size must equal the real tokenizer's vocab — otherwise chat-template specials at ids >= your synthetic vocab_size get misclassified as mm tokens by fuse_input_embeds's OOV filter (manifests as "found N image tokens but received M image embeddings", off by exactly the number of chat-template specials). Vision deepstack indices [i, j, k] require depth > k — the HF processor reserves placeholder tokens for deepstack outputs regardless of whether the encoder is configured to emit them.
Two-config Approach B for tests. If you've added a config normalizer (Phase 2), keep self.hf_config raw and route a deepcopy + normalize only through create_trtllm_model. Reusing one normalized config for both HF and TRT-LLM construction trips the transformers 5.x property aliasing and silently corrupts HF-side schema (rope_scaling ↔ rope_parameters).
Tolerance band. Default get_tolerance returns 0.4 / 0.4, calibrated to pass for the existing VLM tests but wide enough to mask argmax-changing bugs. After your test passes, dial it tighter — keep atol = 0.4 to absorb single-logit tail outliers seen on multiple_image / video scenarios; tighten rtol toward 0.1 to gate bulk-of-logits relative agreement. Don't drop rtol below 0.05 without cross-SKU validation.
Supplemental utility tests (only if your model exercises new logic in shared utilities) under tests/unittest/_torch/multimodal/: test_fuse_input_embeds.py, test_multimodal_runtime.py, test_find_num_image_tokens.py, test_external_embedding.py, test_share_multiparams.py, test_mm_encoder_standalone.py. Extend the existing tests rather than creating new files when the coverage is generic.
Accuracy test at tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py — subclass LlmapiAccuracyTestHarness, set MODEL_NAME / MODEL_PATH / MAX_NUM_TOKENS=16384, run MMMU (or ChartQA/ScienceQA). Reference: TestQwen2_5_VL_7B. Wire into tests/integration/test_lists/test-db/l0_<gpu>.yml.
- First-run reference capture. Set
TRTLLM_ACCURACY_NO_REFERENCE=1for the first local run; the harness synthesizes a baseline reference (0for higher-is-better metrics like MMMU), runs end-to-end, and prints the achieved accuracy. Paste the printed value verbatim intotests/integration/defs/accuracy/references/mmmu.yaml— that's the measured reference; the threshold derives from it viasigma/alpha/beta. quant_algoassertion intest_fp8_prequantizedmust match what the checkpoint actually advertises. Flat per-tensor FP8 isQuantAlgo.FP8; block-scaled FP8 (DeepSeek-V3 / Qwen3.5 style) isQuantAlgo.FP8_BLOCK_SCALES. Same applies to NVFP4 variants. Easy to copy from a peer model and assert the wrong one.
Be parsimonious. The cartesian product modality × use_cuda_graph × chunked_prefill × kv_cache_reuse explodes fast. In get_scenarios(), pick the smallest set covering this model's distinctive paths — e.g. one image, one mixture_text_image, plus one chunked-prefill / one cuda-graph entry only if the model claims those features. One accuracy benchmark per model (MMMU for image VLMs); add another only for capabilities the first doesn't exercise (audio, video, very long context).
Phase 6 — Docs + serve verification
docs/source/models/supported-models.md:
- Supported Models table: row alphabetical by architecture class.
- Multimodal Feature Support Matrix (PyTorch Backend): row with columns Overlap Scheduler / CUDA Graph / Chunked Prefill / Torch Sampler / TLLM C++ Sampler / KV Cache Reuse / Logits Post Processor / EPD Disaggregated Serving / Modality (L+I+V+A). Mark
Yesonly what you've verified.
First line of defense — quickstart smoke test. Before bringing up a server, run the bundled quickstart against your model:
python examples/llm-api/quickstart_multimodal.py \
--model_dir <hf_model_id> --modality image \
--media <url-or-path>
It exercises setup_llm + default_multimodal_input_loader + the chat template + LLM.generate end-to-end with a couple of bundled prompts. Cheaper than spinning up trtllm-serve and fails fast on input-processor / encoder / fusion bugs. Run for every modality your model supports (--modality image|video|audio|image_audio|...).
Then aggregated serving:
trtllm-serve <hf_model_id> --backend pytorch --max_num_tokens 16384 --port 8000
Send a chat completion with a real image; confirm coherent output. (TODO: 2ez4bz to provide ready-to-use curl examples.)
Chunked-prefill cache verification (mandatory). Re-run with a deliberately small --max_num_tokens to force the prefill of one image to span multiple chunks, then grep the server log for these two lines:
Multimodal hashing failed:→ the input processor's hash path fell back; KV-cache reuse across requests with the same image is broken (Contract 3 hash invariant).Multimodal runtime data missing or incomplete, will not cache embeddings.→ the encoder-output cache is being skipped, so the encoder is being re-run on every chunked-prefill iteration of the same request (Phase 2 Required: encoder output length must match MM placeholder count).
A clean serving log shows neither line. If either appears, fix it before declaring the model done — these are silent perf cliffs, not crashes.
For EPD: run MultimodalEncoder and LLM as separate process groups; verify embeddings cross via disaggregated_params.multimodal_embedding_handles.
Phase 7 — Pull request
Follow CONTRIBUTING.md. Title [JIRA/NVBUG/None][type] description, git commit -s. Body: one full multimodal prompt → output verbatim, reproduction commands, pytest output verbatim. Trigger CI via /bot run.
Pre-PR checklist
Architecture & registration
- Decorator stack in correct order:
@support_multimodal_disaggregated(outermost, optional) →@register_vision_encoder→@register_auto_model→@register_input_processor(innermost). -
forwardtakesmultimodal_paramsvia**kwargs; nopixel_values/image_grid_thw/attention_maskdirect args. -
multimodal_data_device_pathslists every GPU-resident mm field. - If runtime-reusing (e.g. Qwen3.5 → Qwen3Next): thin
@register_auto_modelwrapper class present; config normalizer lazy-imported frompyexecutor.config_utils.load_pretrained_config.
Module reuse
- No raw
nn.Linear/nn.LayerNorm/nn.MultiheadAttention/ hand-rolled attention — use_torch/modules/*. - LLM backbone surfaced via
text_config+AutoModelForCausalLM.from_config(TRT-LLM's, not transformers');architecturesrewrite only as the documented Family-specific fallback. - Multimodal encoder is ported to TRT-LLM modules. No imports of HF modules or computations from
transformers.models.<family>— they couple the file to a specifictransformersrelease and break silently on upstream refactors. The only existing exception is the outdated Qwen2-VL family (Qwen2VLModel); new models must not follow that pattern. - HF→TRT-LLM weight surgery via a
BaseWeightMappersubclass under_torch/models/checkpoints/hf/. -
weights.mark_consumed(<prefix>)called after each sub-module load (mm encoder / projector / LLM) so mmap shards release incrementally; multimodal-encoder construction deferred toload_weights()if its HF submodules clash withMetaInitMode.
Input processor
- Subclasses both
BaseMultimodalInputProcessorandBaseMultimodalDummyInputsBuilder. -
call_with_text_prompt(not__call__— that's the base-class dispatcher) runs HF AutoProcessor + tokenizer, buildsmultimodal_databy modality, computesmrope_configon CPU,_postprocess-rewrites mm token ids to the OOV sentinel. -
mm_processor_kwargsflow-through preserved. (Tokenized fast path is optional: setsupports_token_id_mm_expansion = True+ implementget_text_with_mm_placeholders/expand_prompt_token_ids_for_mm; otherwise the base class detokenizes token-ID inputs automatically.) -
_attach_multimodal_embeddings_implimplemented (not theattach_multimodal_embeddingswrapper) if@support_multimodal_disaggregated.
Performance contracts
- Grep clean for Contract 1 bans (
.item()/.cpu()/.tolist()/torch.nonzero/ single-argtorch.where/ value-dependentif, etc.) in modelingforwardpaths — elementwisetorch.where(cond, x, y)with GPU-onlycondis fine. -
set_sync_debug_mode("warn")audit on prefill: zero warnings from your model. - Async loaders used for URL/bytes inputs.
- Broadcast payload < 1 MB per rank per request (NVTX
broadcast_requests/tp_broadcast_requests); media crosses ranks only viato_handle/to_tensor. - Decode-iteration
mm_datais empty (post-prefill strip exercised in e2e test). - Encoder output is a single tensor whose first dim equals the input processor's MM placeholder count; verified by running with chunked prefill on (small
--max_num_tokens) and confirming the encoder runs once per request, not once per chunk. - Encoder is batched across requests: a multi-request batch produces a single wide encoder block in nsys, not N narrow blocks (Contract 4).
Tests & docs
- Per-model unit test at
tests/unittest/_torch/modeling/test_modeling_<name>.pysubclassingTestModelingMultimodal; six abstract methods implemented;get_scenarios()declares the minimum modality × cuda_graph × chunked_prefill × kv_cache_reuse combinations that cover this model's distinctive paths (no full cartesian product) and they all pass. - Mixed-batch scenario (
mixture_text_image) included and passes against HF reference logits. - If hybrid linear-attention:
_dummy_request_kwargsoverridden to setuse_mrope=True; CUDA-graph scenarios skipped (harness limitation, not a model limitation). - Accuracy test under
test_llm_api_pytorch_multimodal.py; entry intest-db/l0_<gpu>.yml. - Reference accuracy captured via
TRTLLM_ACCURACY_NO_REFERENCE=1and entered inreferences/mmmu.yaml; threshold derived value sanity-checked. - Prequantized assertions use the correct
QuantAlgovariant for the checkpoint (flat FP8 vsFP8_BLOCK_SCALES; FP4 vsNVFP4). - Rows added to Supported Models + Multimodal Feature Support Matrix.
-
examples/llm-api/quickstart_multimodal.pyround-trip passes for every modality the model supports (--modality image|video|audio|...); thentrtllm-serveround-trip verified with a real image prompt; EPD round-trip verified if applicable. - Chunked-prefill verification: ran
trtllm-servewith low--max_num_tokensand confirmed the log contains neitherMultimodal hashing failed:norMultimodal runtime data missing or incomplete, will not cache embeddings. -
/bot runtriggered and multimodal stages pass.