vllm-omni-audio-tts - SKILL.md Agent Skill

name: vllm-omni-audio-tts description: Generate audio and speech with vLLM-Omni using Qwen3-TTS, Fish Speech S2 Pro, CosyVoice3, MiMo-Audio, and Stable-Audio models. Use when synthesizing speech from text, generating audio effects or music, configuring TTS parameters, cloning voices, adding new TTS models, or working with text-to-speech models.

vLLM-Omni Audio & TTS

Overview

vLLM-Omni supports text-to-speech (TTS), text-to-audio (sound effects, music), and audio understanding through multiple model families. TTS models use a two-stage autoregressive pipeline (Code Predictor + Code2Wav decoder), while audio generation uses diffusion.

Supported Audio Models

Model	HF ID	Type	Min VRAM
Qwen3-TTS 1.7B CustomVoice	`Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`	TTS + voice cloning	8 GB
Qwen3-TTS 1.7B VoiceDesign	`Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`	TTS + voice design	8 GB
Qwen3-TTS 1.7B Base	`Qwen/Qwen3-TTS-12Hz-1.7B-Base`	Basic TTS	8 GB
Qwen3-TTS 0.6B CustomVoice	`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`	TTS + voice cloning	4 GB
Qwen3-TTS 0.6B Base	`Qwen/Qwen3-TTS-12Hz-0.6B-Base`	Basic TTS	4 GB
Fish Speech S2 Pro	`fishaudio/s2-pro`	TTS + voice cloning (dual-AR + DAC)	16 GB
CosyVoice3 0.5B	`FunAudioLLM/Fun-CosyVoice3-0.5B-2512`	TTS (AR + flow matching)	4 GB
MiMo-Audio-7B	`XiaomiMiMo/MiMo-Audio-7B-Instruct`	Audio understanding + TTS	24 GB
MiMo-V2.5-ASR	`XiaomiMiMo/MiMo-V2.5-ASR`	ASR (speech-to-text)	24 GB
OmniVoice	`nvidia/OmniVoice`	TTS + voice cloning (HiggsAudioV2)	8 GB
VoxCPM2	`openbmb/VoxCPM2`	TTS (native AR, 30+ languages)	8 GB
Stable-Audio-Open	`stabilityai/stable-audio-open-1.0`	Text-to-audio (music/effects)	8 GB

OmniVoice supports voice cloning via ref_audio + ref_text (requires transformers>=5.3). VoxCPM2 is a 2B tokenizer-free native AR TTS model producing 48kHz audio in 30+ languages (requires pip install voxcpm).

Model Architectures

Both Qwen3-TTS and CosyVoice3 use a two-stage autoregressive pipeline. See the reference docs for architecture details, key files, and model variants:

Quick Start: Text-to-Speech

Offline

from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
outputs = omni.generate("Hello, welcome to vLLM-Omni!")
audio = outputs[0].request_output[0].audio
audio.save("greeting.wav")

Online API

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8091

curl -s http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "input": "Hello, welcome to vLLM-Omni!",
    "voice": "default"
  }' --output greeting.wav

Voice Cloning (CustomVoice variants)

Clone a voice from a reference audio sample:

omni = Omni(model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
outputs = omni.generate(
    prompt="This is a test of voice cloning with vLLM-Omni.",
    audio_references=["reference_voice.wav"],
)
outputs[0].request_output[0].audio.save("cloned_speech.wav")

Voice Design (VoiceDesign variant)

Design a voice by describing its characteristics:

omni = Omni(model="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign")
outputs = omni.generate(
    prompt="Welcome to our product launch event!",
    voice_description="A warm, professional female voice with a calm tone",
)
outputs[0].request_output[0].audio.save("designed_voice.wav")

Text-to-Audio (Music & Effects)

Generate music or sound effects with Stable-Audio-Open:

vllm serve stabilityai/stable-audio-open-1.0 --omni --port 8091

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="unused")

response = client.chat.completions.create(
    model="stabilityai/stable-audio-open-1.0",
    messages=[{"role": "user", "content": "Relaxing piano music with rain sounds"}],
)

Audio Understanding (MiMo-Audio)

MiMo-Audio can both understand audio input and generate speech:

omni = Omni(model="XiaomiMiMo/MiMo-Audio-7B-Instruct")

# Transcribe/understand audio
outputs = omni.generate(
    prompt="What is being said in this audio?",
    audio_inputs=["recording.wav"],
)
print(outputs[0].request_output[0].text)

Stage Configuration (Qwen3-TTS)

async_scheduling is enabled by default for Qwen3-TTS models, improving first-packet latency and throughput.

Default stage config uses async_chunk streaming (qwen3_tts.yaml). Key knobs:

Config	Description	Default
`async_chunk`	Enable inter-stage streaming	`true`
`runtime.max_batch_size`	Max requests batched per stage	`1`
`enforce_eager`	Disable CUDA Graph (Stage 0: false, Stage 1: true)	varies
`codec_chunk_frames`	AR frames per async chunk (inter-stage streaming only)	`25`
`codec_left_context_frames`	Sliding context window for smooth boundaries	`25`
`initial_codec_chunk_frames`	Frames for first emitted codec chunk only (lowers TTFA)	`0`
`decode_chunk_frames`	Code2Wav internal decode chunk size (independent of codec streaming)	`300`
`decode_left_context_frames`	Code2Wav internal left context for decode	`25`

Connector streaming chunking (codec_chunk_frames / codec_left_context_frames) is decoupled from Code2Wav internal decode chunking (decode_chunk_frames / decode_left_context_frames). The connector controls inter-stage streaming windows only, while Code2Wav keeps its own independent decode parameters. Use initial_codec_chunk_frames to emit a small first chunk for low TTFA, then subsequent chunks return to the normal codec_chunk_frames window.

The uniproc Code2Wav stage default max_num_seqs is now 10 (was 1). Avoid reducing below 10 for latency-sensitive deployments.

CUDA Graph warmup for Qwen3-TTS now accounts for custom decode_chunk_frames / decode_left_context_frames overrides.

High-Concurrency Profile

For high-concurrency TTS serving (voice cloning, c=64+), use qwen3_tts_high_concurrency.yaml:

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni \
  --stage-configs-path vllm_omni/deploy/qwen3_tts_high_concurrency.yaml

This profile enables batched CUDA graph decoder, prefix CUDA graphs for the code predictor, bounded reference-code context, and first-chunk fast emit (initial_codec_chunk_frames: 1). Tuned for 2-GPU serving with Seed-TTS voice-clone workload. Median TTFP is higher than default profile; use for throughput/E2E rather than first-packet-latency optimization.

Additional high-concurrency knobs available in the deploy config:

decode_cudagraph_batch_sizes: Multi-batch-size CUDA graph capture for Code2Wav
decode_batch_bucket_frames / decode_batch_max_size: Variable-length chunk batching
ref_code_context_frames: Limits reference-audio code frames per chunk for stable stage-1 shapes
decode_enable_tf32: true: Opt-in TF32 for Code2Wav
code_predictor_prefix_graphs: true: Prefix CUDA graph warmup for Stage0 code predictor

For batch mode (no streaming), use qwen3_tts_batch.yaml.

Fish Speech uses fish_speech_s2_pro.yaml with similar knobs. Its DAC codec outputs at 44.1 kHz (vs Qwen3-TTS's 24 kHz).

Note: CosyVoice3 does not support async_chunk streaming yet - use cosyvoice3.yaml (batch mode only).

Streaming Audio

For real-time TTS streaming:

response = client.chat.completions.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    messages=[{"role": "user", "content": "A long paragraph of text to stream..."}],
    stream=True,
)

Adding a New TTS Model

For a step-by-step guide on integrating a new TTS model into vLLM-Omni, see the TTS model developer guide. Offline examples are consolidated under examples/offline_inference/text_to_speech/<model>/end2end.py, and online serving examples under examples/online_serving/text_to_speech/<model>/.

Troubleshooting

Audio quality issues: Ensure reference audio for voice cloning is clean (no background noise), 10-20 seconds, single speaker.

Qwen3-TTS code predictor crash: Fixed in #1619. If you encounter a crash in the code predictor stage, update to the latest vllm-omni.

Qwen3-TTS NaN on fp16-only GPUs: The code predictor auto-upcasts to float32 for numerical stability on GPUs without bf16 support (Turing, Volta). No manual override needed. Fixed in #3253.

Qwen3-TTS speaker_embedding dimension error: Speaker embedding dimensions must match the model's talker hidden_size (2048 for 1.7B, 1024 for 0.6B). Mismatched dimensions return HTTP 400. Fixed in #3191.

Qwen3-TTS load_format: dummy: speaker_encoder is always constructed at init time. Voice cloning works under load_format: dummy without extra configuration. Fixed in #3117.

Slow generation: TTS models are autoregressive - generation time scales with output duration. Enable async_chunk for lower first-packet latency. For throughput, increase max_batch_size.

Fish Speech voice cloning latency: Uploaded voices via /v1/audio/voice/upload now auto-cache DAC-encoded reference audio. First request encodes the reference; subsequent requests reuse the cached codes for faster TTFP. Fixed in #2609.

Event loop blocking under concurrent TTS: Blocking tokenizer operations (_build_voxtral_prompt, _build_fish_speech_prompt) now run in a shared ThreadPoolExecutor(max_workers=1). This prevents /health latency spikes under concurrent load. Fixed in #2511.

References

For Qwen3-TTS details and voice options, see references/qwen-tts.md
For Fish Speech S2 Pro details, see references/fish-speech.md
For CosyVoice3 details, see references/cosyvoice3.md
For MiMo-Audio capabilities, see references/mimo-audio.md