name: vllm-omni
description: |-
vLLM-Omni output-side multimodal generation — image (FLUX.1/2, Qwen-Image, GLM-Image, BAGEL, SD3.5, HunyuanImage-3.0), video (Wan2.1/2.2, LTX-2, HunyuanVideo-1.5), TTS (Qwen3-TTS, CosyVoice3, Voxtral-TTS), any-to-any omni (Qwen3-Omni, Qwen2.5-Omni, MiMo-Audio) via vllm serve --omni. Stage-based disaggregation (OmniConnector + Mooncake + RDMA), /v1/images/generations, async+sync /v1/videos, /v1/audio/speech with voice-upload, PCM16 WebSocket /v1/realtime, Ulysses/Ring SP + CFG-parallel, DiT FP8/INT8/GGUF, CUDA/ROCm/NPU/XPU/MUSA matrix, release pitfalls (v0.19.0rc1 FLUX regression, GLM-Image transformers>=5.0, Qwen3-TTS enforce-eager).
when_to_use: |-
Trigger on any vLLM deployment producing non-text output (image/video/audio) or any-to-any omni model, or model names ending -Image/-TTS/-Omni/-Video. Keywords — vllm serve --omni, vllm-omni, /v1/images/generations, /v1/videos, /v1/audio/speech, /v1/audio/voices, /v1/realtime, async_chunk, stage_configs_path, OmniConnector, MooncakeStore, OmniDiffusionSamplingParams, FlowUniPC, TeaCache, Cache-DiT, Sage/Ring/Ulysses, --ulysses-degree, --ring-degree, Thinker/Talker/Code2Wav, BAGEL, Wan2.2, FLUX.2-klein, ComfyUI bridge, verl RL. Narrow phrasings — "serve Qwen-Image", "Qwen3-Omni streaming audio", "async video job". Also implicit — "deploy image gen", "TTS endpoint", "video gen pipeline", "audit omni", "deploy-memo for {model}-Image/-TTS/-Video". NOT for embeddings/reranking/STT/OCR (→ vllm-input-modalities).
vLLM-Omni — output-side multimodal serving
Target: operators who serve image / video / audio / any-to-any generation models with vLLM-Omni. vllm-omni extends upstream vLLM (same CUDA/ROCm/NPU/XPU runtime, same OpenAI-compat API server) to add non-autoregressive DiT models, multi-stage pipeline execution, diffusion schedulers, CFG plumbing, and real-time streaming audio I/O, none of which upstream vLLM ships.
This skill is a reference, not a tutorial. SKILL.md holds the mental model, quick-answer router, top pitfalls, and operator cheat sheet. The references/ files hold endpoint catalogs, supported-model tables, stage-config grammar, and the diffusion/DiT details. Read only the reference file that matches the question.
The one thing to know before anything else
vllm-omni is not a fork — it layers on top of upstream vLLM, registers OmniModelConfig, and adds one CLI flag: --omni. Adding --omni to vllm serve routes the server through vllm_omni.entrypoints. As of v0.20.0 the old vLLM entrypoint-hijack / patch.py early-import mechanism was removed — the v0.20.0 release notes state "removal of the old vLLM entrypoint hijack, and runtime changes needed for the 0.20.0 integration path (#3232, #3082, #3352, #3393, #2306)". The omni runtime is now rebased onto upstream vLLM v0.20.0 (rebase PR #3232) rather than monkey-patching it. The architectural claim is to decompose any-to-any models into a graph of disaggregated stages (Thinker / Talker / Code2Wav for Qwen3-Omni; AR-encoder / DiT for Qwen-Image) connected via OmniConnector, so each stage scales independently. The paper (arXiv:2602.02204) claims up to 91.4% JCT reduction vs an unspecified baseline — treat as an architectural argument, not a deployment benchmark.
Version alignment is strict: vllm-omni major.minor must match upstream vLLM major.minor. v0.20.0 (2026-05-07) is the current stable, rebased on upstream vLLM v0.20.0 (CUDA 13.0 / PyTorch 2.11). First stable was v0.14.0 (2026-01-31). Latest pre-release is v0.21.0rc1 (2026-05-25). The v0.19.0rc1 FLUX.1-dev regression (#2730) is fixed in v0.20.0 stable (PR #2760) — no version pin needed anymore.
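The alignment rule above can be checked mechanically before a deploy. The following is a minimal sketch, not part of vllm-omni: the helper names `major_minor` and `versions_aligned` are hypothetical, and the check only compares the leading major.minor of two version strings.

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '0.20.0', 'v0.21.0rc1', '0.20.0+rocm700'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_aligned(vllm_version: str, omni_version: str) -> bool:
    """True when upstream vLLM and vllm-omni share the same major.minor."""
    return major_minor(vllm_version) == major_minor(omni_version)


if __name__ == "__main__":
    # Hypothetical inputs; patch level and local suffix are ignored by design.
    print(versions_aligned("0.20.0+rocm700", "0.20.0"))
    print(versions_aligned("0.19.2", "0.20.0"))
```

In a live environment the two strings could come from `importlib.metadata.version("vllm")` and `importlib.metadata.version("vllm-omni")`, assuming the installed distribution names match the ones used in the install commands.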
Quick-answer router
Serving a specific endpoint → references/endpoints.md
- /v1/images/generations, /v1/images/edits (DALL·E-shape)
- /v1/videos (async job) + /v1/videos/sync (raw MP4, 1200s timeout)
- /v1/audio/speech, /v1/audio/voices (list + upload), /v1/audio/speech/batch, /v1/audio/speech/stream (WebSocket)
- /v1/realtime (WebSocket PCM16 in/out for Qwen3-Omni)
- /v1/chat/completions with diffusion via extra_body
Picking a model → references/models.md
- Full supported-architecture → HuggingFace-ID table
- Per-model platform matrix (CUDA / ROCm / NPU / XPU / MUSA)
- Known-issue flags per family
Writing / debugging stage configs → references/stage-config.md
- OmniModelConfig + StageConfig YAML grammar
- OmniConnector types (Shared-memory / Mooncake-Store / Mooncake-Transfer-Engine / RDMA / Yuanrong)
- Pipeline edge validation, entry-point requirement
- stage_id, model_stage, worker_type, engine_output_type, async_chunk
DiT-specific questions → references/diffusion.md
- Schedulers (FlowUniPC + model-specific)
- CFG plumbing (dual CFG for Wan2.2, true_cfg_scale for Qwen-Image, cfg_branch_past_key_values)
- Caches: TeaCache / Cache-DiT / latent cache / noise_pred cache
- Quantization: FP8 (Flux #1640), INT8 (Z-Image/Qwen-Image #1470), GGUF (#1755), all per-component via ComponentQuantizationConfig
- Ulysses / Ring sequence parallel, CFG-parallel merged-batch TP
Qwen3-Omni realtime + Qwen3-TTS → references/realtime-tts.md
- PCM16 mono @ 16 kHz in / 24 kHz out, OpenAI realtime event shape
- async_chunk: false requirement
- Qwen3-TTS CustomVoice / VoiceDesign / Base modes, 12 Hz / 25 Hz tokenizers
- Voice-upload surface (10 MB cap, consent/ref_text/speaker_description required)
The top operator mistakes this skill exists to prevent
- /v1/realtime with async_chunk: true. The realtime WebSocket rejects at connection if async_chunk is enabled (api_server.py:1208). Use the default stage-config (vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml), not the ...moe_async_chunk.yaml variant, for realtime sessions. The async-chunk config is for higher-throughput non-realtime Qwen3-Omni serving.
- Qwen3-TTS with CUDA graphs on (v0.18 only). Issue #2866: on v0.18 the code2wav stage crashed when enforce_eager: false, so --enforce-eager was mandatory. #2866 is CLOSED (2026-04-29) and v0.20.0 ships TTS CUDA-graph capture + shared memory pools (release notes cite #2690/#2758/#2803), lifting the requirement. On v0.20.0+ keep --trust-remote-code, but --enforce-eager is no longer forced: drop it to regain CUDA-graph throughput, and re-test latency.
- Running the v0.19.0rc1 FLUX artifacts. Issue #2730: FLUX.1-dev generated incorrect images in v0.19.0rc1 (T5 text-encoder bug). Fixed in v0.20.0 stable (PR #2760, merged 2026-04-24). The v0.19.0rc1 tag artifacts are still broken, so do not deploy that specific tag; use v0.20.0+ for any FLUX deployment.
- GLM-Image on v0.18 without transformers>=5.0. On v0.18 GLM-Image required a manual pip install 'transformers>=5.0' before serving (the default wheel pinned transformers below 5.0 and GLM-Image silently failed to load). v0.20.0 ships Transformers 5.x compatibility fixes from the upstream rebase; verify whether the manual upgrade is still needed on v0.20.0+ before adding it.
- PCM format on /v1/realtime. Qwen3-Omni realtime hard-expects 16-bit PCM mono @ 16 kHz input and outputs PCM at 24 kHz. Stereo, 8 kHz, 24-bit, or WAV-with-header inputs produce garbage or silent failures. Use the reference client in examples/online_serving/qwen3_omni/openai_realtime_client.py as a template.
- Default guidance_scale=0.0 sentinel. OmniDiffusionSamplingParams treats guidance_scale=0.0 as "not provided"; passing 0.5 intending partial CFG gets coerced. To disable CFG, leave the field unset; to enable, pass > 1.0.
- Prefix caching on a stage that emits latents. Any stage with engine_output_type: latent (thinker stages producing hidden states) must set enable_prefix_caching: false in its engine_args. Prefix cache reuses token-level blocks, which makes no sense for latent outputs; leaving it on surfaces as intermittent stale responses.
- /v1/videos/sync for long jobs. The sync endpoint has a hardcoded VIDEO_SYNC_TIMEOUT_S (default ~1200s) and returns 504 past that. Long Wan2.2 / HunyuanVideo-1.5 jobs should use POST /v1/videos (async), then poll GET /v1/videos/{id} and fetch /content.
- Orphan processes after a Wan2.2 crash. Issue #2768: killing one Wan2.2 worker leaves sibling stage processes alive. Wrap launches in a process group + pkill -9 sweep on failure, or use systemd's KillMode=control-group.
- Assuming vllm-omni serves text-only models. If the model has no multimodal output, use stock vLLM: vllm-omni adds overhead for features a text-only model won't exercise, and the community skill explicitly recommends against it. The decision rule: output modality is non-text OR the model name ends in -Omni/-Image/-TTS/-Video → vllm-omni; otherwise stock vLLM.
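The PCM format constraint in the list above can be verified before any audio reaches /v1/realtime. The sketch below uses only the Python standard library `wave` module. The function names are hypothetical and not part of vllm-omni, and how the raw frames are wrapped into realtime events is left to the reference client.

```python
import io
import wave

EXPECTED_RATE_HZ = 16_000   # input rate stated for /v1/realtime
EXPECTED_CHANNELS = 1       # mono
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16-bit PCM


def realtime_input_problems(wav_file) -> list[str]:
    """Return human-readable mismatches against PCM16 mono @ 16 kHz (empty list if none)."""
    problems = []
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels={wav.getnchannels()}, expected {EXPECTED_CHANNELS}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width={wav.getsampwidth()} bytes, expected {EXPECTED_SAMPLE_WIDTH}")
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"rate={wav.getframerate()} Hz, expected {EXPECTED_RATE_HZ}")
    return problems


def raw_pcm_from_wav(wav_file) -> bytes:
    """Strip the WAV header and return only the PCM frames."""
    with wave.open(wav_file, "rb") as wav:
        return wav.readframes(wav.getnframes())


def make_wav(rate_hz: int, channels: int, sample_width: int, n_frames: int = 160) -> io.BytesIO:
    """Demo helper: build an in-memory WAV of silence with the given format."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (n_frames * channels * sample_width))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(realtime_input_problems(make_wav(16_000, 1, 2)))   # compliant input
    print(realtime_input_problems(make_wav(8_000, 2, 3)))    # stereo, 24-bit, 8 kHz
```

Checking on the client side turns the "garbage or silent failure" cases into explicit errors, and `raw_pcm_from_wav` covers the WAV-with-header case by sending frames only.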
Operator cheat sheet
Install
uv venv --python 3.12 --seed
source .venv/bin/activate
# CUDA — pin upstream vLLM to the matching minor:
uv pip install vllm==0.20.0 --torch-backend=auto
# ROCm:
uv pip install vllm==0.20.0+rocm700 \
--extra-index-url https://wheels.vllm.ai/rocm/0.20.0/rocm700
# Then the omni package (prebuilt wheel OR editable clone):
uv pip install vllm-omni==0.20.0
# OR: git clone https://github.com/vllm-project/vllm-omni && cd vllm-omni && uv pip install -e .
Python 3.12 is required (3.11 is not supported). Docker image: vllm/vllm-omni:0.20.0.
Serving canonical forms
# Text-to-image (default Z-Image-Turbo quickstart):
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091
# Qwen-Image with tensor parallelism:
vllm serve Qwen/Qwen-Image --omni --tensor-parallel-size 2 --port 8091
# Qwen3-Omni realtime (default stage config, async_chunk OFF):
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--tensor-parallel-size 2 --gpu-memory-utilization 0.9 --port 8091
# Qwen3-Omni high-throughput non-realtime (async_chunk ON):
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml
# Qwen3-TTS (trust-remote-code; --enforce-eager only required on v0.18, lifted by TTS CUDA-graph capture in v0.20.0+):
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni \
--trust-remote-code --task-type CustomVoice
# Wan2.2 T2V with Ulysses sequence parallel:
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni \
--ulysses-degree 4 --ulysses-mode strict --port 8091
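Once one of the servers above is running, a client request can be assembled as follows. This is a hedged sketch: the endpoint path, port 8091, and model ID come from this document, while the body fields (`model`, `prompt`, `n`, `size`) assume the OpenAI images request shape that the "DALL·E-shape" note implies. `build_image_request` is a hypothetical helper; confirm the accepted fields in references/endpoints.md.

```python
import json
import urllib.request


def build_image_request(
    prompt: str,
    model: str = "Tongyi-MAI/Z-Image-Turbo",
    base_url: str = "http://localhost:8091",
    size: str = "1024x1024",
    n: int = 1,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/images/generations."""
    # Assumed field names, following the OpenAI images API shape.
    payload = {"model": model, "prompt": prompt, "n": n, "size": size}
    return urllib.request.Request(
        url=f"{base_url}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_image_request("a lighthouse at dusk, oil painting")
    print(req.get_method(), req.full_url)
    # Sending requires a live server, e.g.:
    # with urllib.request.urlopen(req, timeout=300) as resp:
    #     body = json.loads(resp.read())
```

Building the request separately from sending it keeps the payload inspectable in logs and tests without a GPU server.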
Common extra flags
| Flag | Purpose |
|---|---|
| --omni | Enable vllm-omni entrypoint (load-bearing) |
| --stage-configs-path | Override default stage-config YAML |
| --task-type | Qwen3-TTS: CustomVoice, VoiceDesign, or Base |
| --ulysses-degree / --usp | Ulysses sequence parallelism for DiT |
| --ulysses-mode | strict (divisibility) or advanced_uaa (uneven shapes) |
| --ring-degree | Ring-based parallelism |
| --num-gpus | GPUs allocated to diffusion pipeline |
| --omni-master-address / -oma | Orchestrator hostname (multi-node) |
| --omni-master-port / -omp | Orchestrator port |
| --stage-id | Single-stage mode (requires master address) |
| --worker-backend | multi_process or ray |
| --model-class-name | Override diffusion pipeline class |
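Launches built from these commands and flags can be wrapped so that a crash does not leave sibling stage processes behind (the orphan-process pitfall, issue #2768). The sketch below is one POSIX-only way to get the "process group" behaviour using the standard library. `launch_in_own_group` and `kill_group` are hypothetical helper names, and the child command in the demo is a stand-in for a real `vllm serve ... --omni` invocation.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(cmd: list[str]) -> subprocess.Popen:
    """Start cmd as the leader of a new session, and therefore a new process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc: subprocess.Popen, grace_s: float = 5.0) -> None:
    """SIGTERM the whole group, wait briefly, then SIGKILL anything still alive."""
    pgid = proc.pid  # the session leader's PID doubles as its process-group ID
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        proc.wait()  # group already gone; just reap the leader
        return
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(pgid, signal.SIGKILL)  # sweep stragglers, if any
    except ProcessLookupError:
        pass
    proc.wait()


if __name__ == "__main__":
    # Stand-in child; a real launch would pass the vllm serve argument list instead.
    child = launch_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    kill_group(child)
    print("exit code:", child.returncode)
```

Under systemd the same effect comes from KillMode=control-group, so this wrapper matters mainly for ad hoc or scripted launches.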
Key numbers to memorize
| Metric | Value |
|---|---|
| Current stable | v0.20.0 (2026-05-07, rebased on vLLM v0.20.0, CUDA 13.0 / PyTorch 2.11) |
| Latest pre-release | v0.21.0rc1 (2026-05-25) |
| First stable | v0.14.0 (2026-01-31) |
| Minimum Python | 3.12 |
| /v1/realtime input | PCM16 mono @ 16 kHz |
| Qwen3-Omni audio output rate | 24 kHz |
| Qwen3-TTS tokenizer rate | 12 Hz or 25 Hz |
| /v1/videos/sync timeout | ~1200s (hard) |
| Voice upload size cap | 10 MB |
| Paper claim | up to 91.4% JCT reduction vs "baseline" (unspecified) |
| Qwen3-TTS published RTF (v0.16) | 0.22–0.45 |
| MiMo-Audio published RTF (v0.16) | ~0.2 (11× baseline) |
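Two of the numbers in the table translate directly into capacity arithmetic. The sketch below assumes the usual definition of real-time factor (synthesis time divided by audio duration), which this document does not spell out. The function names are hypothetical.

```python
def pcm16_bytes_per_second(sample_rate_hz: int, channels: int = 1) -> int:
    """Raw PCM16 bandwidth: 2 bytes per sample per channel."""
    return sample_rate_hz * channels * 2


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to synthesize audio_seconds of speech at a given RTF.

    Assumes RTF = synthesis time / audio duration, so RTF < 1 is faster than real time.
    """
    return audio_seconds * rtf


if __name__ == "__main__":
    # /v1/realtime input at 16 kHz mono and Qwen3-Omni output at 24 kHz mono.
    print(pcm16_bytes_per_second(16_000), "bytes/s in")
    print(pcm16_bytes_per_second(24_000), "bytes/s out")
    # Bounds of the published Qwen3-TTS RTF range applied to a 10 s utterance.
    for rtf in (0.22, 0.45):
        print(f"RTF {rtf}: {synthesis_seconds(10.0, rtf):.1f} s for 10 s of audio")
```

The byte rates are useful for sizing WebSocket chunks and buffers, and the RTF conversion gives a first-order latency estimate before any load test.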
Paired skills
- vllm-input-modalities: the complement, covering text embeddings, reranking, STT (Whisper/Voxtral-STT/Qwen3-ASR), OCR (DeepSeek-OCR). Trigger together when the deployment does both input and output non-text modalities.
- vllm-nvidia-hardware: for sizing GB300/NVL72/Rubin capacity for diffusion + CFG-parallel + Ulysses footprints.
- vllm-caching: OmniConnector borrows Mooncake from upstream vLLM; the caching skill has the connector-config surface.
- vllm-observability: vllm-omni inherits upstream /metrics; profiler hooks (OmniTorchProfilerWrapper) add stage_id + rank awareness to trace files.
Source policy
All claims are cited with file:line, release-note PR refs, or issue IDs. Full anchor list + community channels + third-party plugin catalog in references/sources.md. Compiled 2026-04-18 against v0.18.0; last freshened 2026-05-28 (rebased to v0.20.0 stable; refresh again when the next upstream-rebase release ships).