add-tts-model - SKILL.md Agent Skill

name: add-tts-model
description: "Integrate a new text-to-speech model into vLLM-Omni from HuggingFace reference implementation through production-ready serving with streaming and CUDA graph acceleration. Use when adding a new TTS model, wiring stage separation for speech synthesis, enabling online voice generation serving, debugging TTS integration behavior, or building audio output pipelines."

TTS Model Integration Workflow

Overview

HF Reference -> Stage Separation -> Online Serving -> Async Chunk -> CUDA Graph -> Pre-commit/DCO
   (Phase 1)      (Phase 2)          (Phase 3)        (Phase 4)     (Phase 5)      (Phase 6)

Three architecture patterns are supported:

Two-stage pipeline (e.g. Qwen3-TTS, Fish Speech, CosyVoice3): AR code-predictor → audio decoder, connected via async_chunk for low-latency streaming. Use this for maximum performance.
Single-stage AR via generator (e.g. MOSS-TTS-Nano): the entire model runs inside one AR worker, streaming audio chunks through a per-request inference_stream() generator. Use this when the upstream model bundles AR + codec inseparably. See references/single-stage-ar.md.
Single-stage, vLLM-native base LM + side computation (e.g. VoxCPM2): the base language model runs under vLLM's PagedAttention as a normal AR model; diffusion / VAE / side computations run outside vLLM and are attached via the runner post-processing hook. This is a distinct pattern from the generator approach above — do not confuse the two.

The single-stage variants skip Phase 4 (async_chunk) but Phase 5 (CUDA graph) is still encouraged for the inner AR loop.

Cross-Cutting Invariants

These rules apply to every TTS model regardless of architecture (AR vs AR+diffusion, single-stage vs two-stage, codec-based vs VAE-based). They surface repeatedly across PRs — check them at the end of every phase.

I1. Streaming output contract

Pick exactly one per-step semantics for forward() and document it in the docstring:

Delta: yield only new audio samples produced this step. Preferred — linear cost, low memory.
Cumulative: re-decode from step 0 every call. O(N²); only acceptable if the codec has no streaming decode path.

If you choose delta, verify the full emit→consolidate→consume chain:

forward() returns {"model_outputs": <new_chunk_only>, ...}
_consolidate_multimodal_tensors() in vllm_omni/engine/output_processor.py concatenates the audio key into one tensor at finish. If it skips the key (continue), offline consumers receive only the final chunk. See output_processor.py for the concrete list of handled modality keys.
Streaming consumers (SSE, Gradio) receive per-step deltas; offline consumers (engine.generate()) receive a single concatenated tensor.

Cumulative-vs-delta mismatch is the most common silent bug — offline RTF benchmarks pass, but users hear replays or truncation.
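A minimal sketch of the mismatch described above, using plain Python lists in place of audio tensors. The `consolidate` helper is a hypothetical stand-in for concatenation at finish; it is not the actual `_consolidate_multimodal_tensors()` code.

```python
def consolidate(step_outputs):
    """Concatenate per-step chunks, as a delta-style consumer would (hypothetical)."""
    full = []
    for chunk in step_outputs:
        full.extend(chunk)
    return full


reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # illustrative waveform samples

# Delta producer: each step yields only the samples created in that step.
delta_steps = [reference[0:2], reference[2:4], reference[4:6]]

# Cumulative producer: each step re-emits everything from step 0.
cumulative_steps = [reference[0:2], reference[0:4], reference[0:6]]

delta_audio = consolidate(delta_steps)
replayed_audio = consolidate(cumulative_steps)

# Delta + concatenation reproduces the waveform exactly.
print(delta_audio == reference)
# Cumulative + concatenation is longer than the reference: earlier audio is replayed.
print(len(replayed_audio), "samples vs", len(reference))
```

The second case is what a listener hears as a replay: the last cumulative step alone equals the reference, but the concatenated stream does not.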

I2. Multimodal output consumer hygiene

outputs[0].outputs[0].multimodal_output[<key>] can be any of Tensor, list[Tensor] (pre-consolidation snapshot), np.ndarray, or scalar. When writing tests, examples, and benchmarks:

Never use dict.get("a") or dict.get("b") on tensor values — Python evaluates the tensor's boolean, raising RuntimeError: Boolean value of Tensor with more than one value is ambiguous. Use explicit if x is None chains.
Always defensively handle the list form: if isinstance(x, list): x = torch.cat([t.reshape(-1) for t in x], dim=0).
Assert shape / dtype / duration explicitly; do not rely on truthiness for presence checks.
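The explicit `is None` chain can be sketched without any tensor library. `FakeTensor` below is a hypothetical stand-in that only reproduces the ambiguous-truthiness error, and the keys `"audio"`, `"model_outputs"`, and `"gain"` are illustrative, not a list of keys the engine guarantees.

```python
class FakeTensor:
    """Stand-in for a multi-element tensor whose truthiness is ambiguous."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        raise RuntimeError("Boolean value of Tensor with more than one value is ambiguous")


def first_present(mapping, *keys):
    """Return the first value that is not None, without ever calling bool() on it."""
    for key in keys:
        value = mapping.get(key)
        if value is not None:
            return value
    return None


mm = {"audio": FakeTensor([0.0, 0.1]), "gain": 0}

audio = first_present(mm, "model_outputs", "audio")  # safe presence check

try:
    audio_bad = mm.get("audio") or mm.get("model_outputs")  # evaluates bool(tensor)
except RuntimeError as exc:
    print(f"truthiness check failed: {exc}")
```

The same helper also keeps falsy scalars such as `0`, which an `or` chain would silently skip.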

I3. Hot-loop GPU discipline

Inside any per-step model loop (AR decode, diffusion solver, CFM Euler, vocoder block loop):

No tensor.item(), .cpu(), or .tolist() — each triggers a GPU→CPU sync; at 10 steps × 60 frames × 4 ops that is 2400 syncs per request.
Prefer dst.copy_(src) over dst.fill_(src.item()) when writing a scalar tensor into a buffer.
Prefer torch.compile(Model.forward, fullgraph=False) on the whole forward over per-submodule compile — fewer dispatch boundaries, larger fusion regions. Measure before choosing granularity.
No Python-side control flow that depends on tensor values; use torch.where / masking instead.

Profile first, optimize second. See the profiling docs / project memory for the trace-analysis workflow.

I4. Validation pyramid

Offline RTF alone is necessary but not sufficient. Every new TTS model must pass all three:

Layer	Catches	Tool
Offline RTF / duration check	Throughput regressions, missing audio, wrong sample rate	`end2end.py`, pytest e2e
Browser streaming playback	Delta/cumulative bugs, chunk boundary glitches, TTFP regressions	Gradio demo over `/v1/audio/speech?stream=true`
Concurrent requests	Per-request state leaks, codec window round-robin gaps	`max_num_seqs>1` smoke test with 4+ parallel prompts

Declaring a model "done" without all three has shipped regressions more than once.
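For the first layer, a rough sketch of an RTF and duration check, assuming the common definition of real-time factor (processing time divided by audio duration). The 20% tolerance and the helper names are assumptions for illustration, not values taken from `end2end.py`.

```python
def real_time_factor(wall_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("no audio produced or invalid sample rate")
    return wall_seconds / (num_samples / sample_rate)


def duration_matches(num_samples: int, sample_rate: int,
                     expected_seconds: float, tolerance: float = 0.2) -> bool:
    """Compare generated duration with a reference, within a relative tolerance."""
    actual_seconds = num_samples / sample_rate
    return abs(actual_seconds - expected_seconds) <= tolerance * expected_seconds


rtf = real_time_factor(wall_seconds=1.5, num_samples=72000, sample_rate=24000)
print(f"RTF = {rtf:.2f}")

# Interpreting the same samples at the wrong rate halves the duration and is caught.
print(duration_matches(72000, 24000, expected_seconds=3.0))
print(duration_matches(72000, 48000, expected_seconds=3.0))
```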

I5. Per-request state is owned by the request, not the model

If the model caches anything across forward() calls (streaming generators, codec buffers, sliding-window pads, CUDA graph state), key it by request ID:

self._state: dict[str, YourState] = {}    # request_key → state
# fetch: request_key = str(info.get("_omni_req_id", "0"))
# free on finish: del self._state[request_key]

A shared buffer silently corrupts audio across concurrent requests — the symptom is crosstalk or truncation only under load.
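The three-line pattern above can be wrapped in a small registry. This is a sketch under assumptions: `StreamState` and `PerRequestState` are hypothetical names, and only the `_omni_req_id` lookup comes from the snippet above.

```python
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Hypothetical per-request cache: accumulated codes and emitted sample count."""
    codes: list = field(default_factory=list)
    emitted_samples: int = 0


class PerRequestState:
    def __init__(self):
        self._state: dict[str, StreamState] = {}  # request_key -> state

    @staticmethod
    def key_for(info: dict) -> str:
        return str(info.get("_omni_req_id", "0"))

    def get(self, info: dict) -> StreamState:
        # Create the entry lazily on the first step of a request.
        return self._state.setdefault(self.key_for(info), StreamState())

    def free(self, info: dict) -> None:
        # pop() with a default tolerates a repeated finish event.
        self._state.pop(self.key_for(info), None)

    def __len__(self) -> int:
        return len(self._state)


states = PerRequestState()
states.get({"_omni_req_id": "req-a"}).codes.append(11)
states.get({"_omni_req_id": "req-b"}).codes.append(22)
print(len(states))  # two independent entries, one per request
```

Calling `free()` when the request finishes keeps the dictionary from growing with every served request.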

Phase 1: HuggingFace Reference

Goal: Understand the reference implementation and verify it produces correct audio.

Steps

Run the reference model end-to-end using the official HuggingFace / GitHub code
Document the architecture:
- What are the sub-models? (AR decoder, codec decoder, vocoder, etc.)
- What is the token vocabulary? (semantic codes, RVQ codebooks, special tokens)
- What is the output format? (sample rate, channels, codec type)
Capture reference outputs for comparison during integration
Identify the config structure: config.json fields, model_type, sub-model configs

Key Questions

How many codebooks? What are the codebook sizes?
What special tokens exist? (<|voice|>, <|audio_start|>, <|im_end|>, etc.)
What is the token-to-ID mapping for codec codes?
What is the hop length / frame rate of the codec?
Does the model support voice cloning? How? (reference audio encoding, speaker embeddings, etc.)
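When documenting the token-to-ID mapping, it can help to write it down as a round-trippable function. The sketch below assumes one possible layout (codec tokens appended after the text vocabulary, one contiguous block per codebook); the real model may interleave or offset codebooks differently, and the numbers are illustrative only.

```python
def code_to_token_id(code: int, codebook: int, vocab_base: int, codebook_size: int) -> int:
    """Map a (codebook, code) pair to a flat token ID under the assumed layout."""
    if not 0 <= code < codebook_size:
        raise ValueError(f"code {code} outside codebook of size {codebook_size}")
    return vocab_base + codebook * codebook_size + code


def token_id_to_code(token_id: int, vocab_base: int, codebook_size: int) -> tuple[int, int]:
    """Inverse mapping: returns (codebook index, code)."""
    offset = token_id - vocab_base
    if offset < 0:
        raise ValueError("token ID is below the codec range (text or special token)")
    return divmod(offset, codebook_size)


VOCAB_BASE = 100_000   # illustrative: first ID after the text vocabulary
CODEBOOK_SIZE = 1024   # illustrative

token_id = code_to_token_id(code=5, codebook=2, vocab_base=VOCAB_BASE, codebook_size=CODEBOOK_SIZE)
print(token_id, token_id_to_code(token_id, VOCAB_BASE, CODEBOOK_SIZE))
```

Checking the round trip against a handful of IDs dumped from the reference implementation is a cheap way to catch an off-by-offset error before Phase 2.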

Deliverables

Working reference script that produces audio
Architecture diagram / notes
Token vocabulary mapping
Reference audio samples for regression testing

Phase 2: Stage Separation (Offline Inference)

Goal: Split the model into vLLM-Omni stages and get offline inference working.

Steps

Register the model in vllm_omni/model_executor/models/registry.py
Create config classes (configuration_<model>.py) with model_type registration
Implement Stage 0 (AR model):
- Subclass appropriate base (e.g., wrap Qwen3 decoder layers)
- Implement forward() for autoregressive token generation
- Handle special token logic (start/stop tokens, codec token mapping)
- If dual-AR (like Fish Speech), implement Fast AR as a nested module
Implement Stage 1 (Decoder):
- Load codec weights (may need lazy loading from separate checkpoint)
- Implement forward(): codec codes -> audio waveform
- Return OmniOutput with multimodal_outputs
Create stage config YAML defining both stages, memory allocation, and model paths
Create stage input processor for prompt building
Write end2end.py test script

Critical Parameters to Get Right

Parameter	Impact if Wrong
Hop length	Audio duration wrong, streaming noise
Token ID mapping	Garbage codes -> noise output
Codebook count/size	Shape mismatch crashes
Stop token	Generation never stops or stops too early
dtype / autocast	Numerical issues, silent quality degradation
Repetition penalty	Must match reference (often 1.0 for TTS)

Debugging Priority (from experience)

When audio output is wrong, check in this order:

RoPE / attention: Are position encodings correct? Is the attention mask right?
Normalization: RMSNorm epsilon, layer norm placement (pre vs post)
Hop length: Product of all upsample rates in the codec decoder
Token mapping: Are codec IDs correctly offset from the vocabulary base?
Sampling parameters: Temperature, top_k, top_p, repetition_penalty
Tensor layout: Codebook-major vs frame-major ordering
dtype: Float32 for codec decoders (autocast can corrupt audio)
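The hop-length check in the list above is simple arithmetic and worth scripting. The upsample rates and sample rate below are illustrative values, not taken from a specific codec.

```python
import math


def hop_length(upsample_rates) -> int:
    """Hop length is the product of all upsample rates in the codec decoder."""
    return math.prod(upsample_rates)


def expected_num_samples(num_frames: int, upsample_rates) -> int:
    """Number of audio samples the decoder should emit for num_frames codec frames."""
    return num_frames * hop_length(upsample_rates)


rates = [8, 5, 4, 2]          # illustrative decoder upsample rates
sample_rate = 24000           # illustrative output sample rate

hop = hop_length(rates)
frame_rate = sample_rate / hop
samples = expected_num_samples(150, rates)

print(f"hop={hop}, frame_rate={frame_rate} Hz, duration={samples / sample_rate} s")
```

If the decoded waveform length differs from `expected_num_samples`, the hop length or the codebook layout is the first thing to re-check.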

Streaming Correctness Rules (single-stage and two-stage)

These bugs appear in almost every new TTS PR. Check all before the first push. See also the cross-cutting invariants I1 (output contract) and I5 (per-request state) above — the rules below are the Phase 2-specific instances of those invariants:

Accumulate codes across AR steps — each forward() appends new codes; do not reset between steps or audio will be truncated (fish speech: fix: accumulate audio_codes across steps)
Emit delta audio, not full waveform — in streaming mode yield only the new chunk per step, not the re-decoded full waveform from step 0 (fish speech: fix: emit delta audio not full waveform)
All return paths must emit model_outputs — if any early-return branch skips setting model_outputs, the serving layer silently drops that step's audio (fish speech: fix: ensure ALL return paths emit model_outputs)
Per-request state isolation — for batched concurrent requests, key all state by request ID; a shared buffer corrupts audio across requests (fish speech: fix: per-request vocode + delta emission)
Codec tensor device — move codec codes to the codec decoder's device before calling decode; mismatches cause silent CPU fallback or crashes (fish speech: fix: use model device for CUDA stream)
AR stage max_num_seqs — set to at least 4 in production configs; for single-stage models this is the only stage. For two-stage models, Stage 0 (AR) needs max_num_seqs ≥ 4 to pipeline concurrent requests; Stage 1 (codec decoder) typically uses max_num_seqs: 1 intentionally. Default of 1 everywhere causes audio gaps under concurrency because the codec window round-robins across requests (RFC #2568)
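The first three rules can be illustrated together. This is a toy sketch: `fake_decode` is a hypothetical codec that emits two samples per code, and `StreamingStep` stands for the state of a single request (see I5), not for an actual vLLM-Omni class.

```python
def fake_decode(codes):
    """Hypothetical stand-in codec: two audio samples per code."""
    audio = []
    for code in codes:
        audio.extend([code, code])
    return audio


class StreamingStep:
    def __init__(self):
        self.codes = []     # accumulated across AR steps, never reset mid-request
        self.emitted = 0    # number of samples already sent downstream

    def forward(self, new_codes, finished=False):
        if not new_codes and not finished:
            # Early return still carries model_outputs (an empty delta).
            return {"model_outputs": []}
        self.codes.extend(new_codes)
        # Re-decoding everything keeps the sketch short; a codec with a
        # streaming decode path would decode only the new frames.
        audio = fake_decode(self.codes)
        delta = audio[self.emitted:]
        self.emitted = len(audio)
        return {"model_outputs": delta}


step = StreamingStep()
outputs = [step.forward([1, 2]), step.forward([]), step.forward([3], finished=True)]
streamed = [s for out in outputs for s in out["model_outputs"]]
print(streamed == fake_decode([1, 2, 3]))
```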

Optional Dependency Handling

Patch optional dependencies (torchaudio / torchcodec / soundfile) at the top of load_weights(), not at module import. Otherwise, cryptic errors surface only in environments that lack the optional package, typically after the model is already deployed. See references/optional-deps.md for the full pattern, signature constraints, and MOSS-TTS-Nano reference.
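The sketch below only illustrates the timing of the check (inside `load_weights()`, before anything else); the actual patching pattern is the one in references/optional-deps.md. `ModelSketch` and `pick_audio_backend` are hypothetical names.

```python
import importlib


def pick_audio_backend(candidates=("torchaudio", "soundfile")):
    """Return (name, module) for the first importable backend, or None."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except (ImportError, OSError):
            continue  # not installed, or installed without its native libraries
    return None


class ModelSketch:
    """Hypothetical model class; only the ordering inside load_weights() matters."""

    def __init__(self, candidates=("torchaudio", "soundfile")):
        self._candidates = candidates
        self.audio_backend = None
        self.weights_loaded = False

    def load_weights(self, weights):
        # Resolve optional dependencies first so a missing package fails here,
        # with a clear message, rather than at import time or mid-request.
        backend = pick_audio_backend(self._candidates)
        if backend is None:
            raise ImportError(f"none of {self._candidates} is installed")
        self.audio_backend = backend[0]
        for _name, _tensor in weights:
            pass  # real weight loading goes here
        self.weights_loaded = True
```

Importing the module that defines `ModelSketch` never touches the optional packages, so environments without them can still import the model registry.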

Single-Stage AR Pattern (alternative to two-stage)

When the upstream model cannot be cleanly split into an AR stage and a separate decoder, run the full pipeline inside a single AR worker and stream audio through a per-request inference_stream() generator keyed by _omni_req_id. Stage config must set worker_type: ar, engine_output_type: audio, final_output: true, is_comprehension: true, and async_chunk: false at the top level. Only extract params from additional_information that you actually forward, or pre-commit fails ruff F841.

Full walkthrough with the complete forward() / _create_stream_gen() skeleton and stage-config fields: references/single-stage-ar.md. For an in-tree reference, look for any single-stage AR model under vllm_omni/model_executor/models/ — e.g. the MOSS-TTS-Nano integration when it lands.

VoxCPM2 is a different pattern and should not reuse this skeleton — it runs the base LM under vLLM PagedAttention with external side-computation. See plan/voxcpm2_native_ar_design.md.

Deliverables

Model files in vllm_omni/model_executor/models/<model_name>/
Stage config YAML
Working end2end.py at examples/offline_inference/text_to_speech/<model>/end2end.py
New section in examples/offline_inference/text_to_speech/README.md (table row + per-model section). Do not create a top-level examples/offline_inference/<model>/ dir or a per-model README.md inside text_to_speech/<model>/ — the hub README is the documented surface and the mkdocs generate_examples hook only descends one level into examples/<category>/.

Phase 3: Online Serving

Goal: Expose the model via /v1/audio/speech API endpoint.

Steps

Register in serving_speech.py — add all 5 points in a single commit; partial integration causes hard-to-debug failures. This file is modified by every model PR and is the most common source of rebase conflicts — see conflict note below.

Point 1 — stage constant (near the top, alongside the other _*_TTS_MODEL_STAGES sets):
```
_YOUR_MODEL_TTS_MODEL_STAGES = {"your_stage_key"}
```
Point 2 — union into _TTS_MODEL_STAGES:
```
_TTS_MODEL_STAGES: set[str] = (
    ...
    | _YOUR_MODEL_TTS_MODEL_STAGES
)
```
Point 3 — model type detection in _detect_tts_model_type():
```
if model_stage in _YOUR_MODEL_TTS_MODEL_STAGES:
    return "your_model"
```
Point 4 — validation dispatch in _validate_tts_request():
```
if self._tts_model_type == "your_model":
    return self._validate_your_model_request(request)
```
Point 5 — validation + parameter-builder methods:
```
def _validate_your_model_request(self, request) -> str | None:
    if not request.input or not request.input.strip():
        return "Input text cannot be empty"
    return None

def _build_your_model_params(self, request) -> dict:
    params = {"text": [request.input]}
    if request.voice is not None:
        params["voice"] = [request.voice]
    return params
```
Wire _build_your_model_params into _create_tts_request() alongside the other model-specific param builders.

Two dispatch patterns coexist: Fish Speech uses a self._is_fish_speech boolean instance attribute checked before elif self._is_tts, while all newer models (CosyVoice3, MOSS-TTS-Nano) use the _tts_model_type string returned by _detect_tts_model_type(). For new models, always use the _tts_model_type string pattern — do not add new _is_* flags.

Unused variable rule: only extract fields in _build_your_model_params that are actually forwarded to the model. Unused extractions fail ruff F841. For voice-cloning fields (ref_audio → prompt_audio_path, ref_text → prompt_text), add them to the param builder and verify they reach the model call.

Rebase conflict note: when rebasing onto main after another model was merged, serving_speech.py will conflict. Resolution: always keep both the upstream model's additions and your own — never discard either side.
Handle model-specific parameters:
- Voice cloning: ref_audio encoding and prompt injection
- max_new_tokens override in sampling params
- Model-specific default values
Create client scripts: speech_client.py, run_server.sh
Test all response formats: wav, mp3, flac, pcm
Add Gradio demo: Interactive web UI with streaming support
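For the response-format step, a client script can build one request body per format. This is a sketch: the field names follow the OpenAI speech API (`model`, `input`, `voice`, `response_format`), and `"your-model"` is a placeholder; whether a given model accepts `voice` depends on its param builder.

```python
import json

RESPONSE_FORMATS = ("wav", "mp3", "flac", "pcm")


def build_speech_payload(text, response_format, model="your-model", voice=None):
    """Build a JSON body for POST /v1/audio/speech (placeholder model name)."""
    if not text or not text.strip():
        raise ValueError("Input text cannot be empty")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    payload = {"model": model, "input": text, "response_format": response_format}
    if voice is not None:
        payload["voice"] = voice
    return payload


payloads = [build_speech_payload("Hello from the smoke test.", fmt) for fmt in RESPONSE_FORMATS]
for payload in payloads:
    print(json.dumps(payload))
```

Each body can then be posted to `/v1/audio/speech` with any HTTP client, saving the response bytes under the matching file extension.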

Voice Cloning Pattern

from pathlib import Path

def build_voice_clone_prompt(ref_audio_path: str, text: str, codec) -> list[dict[str, str]]:
    """Build prompt with reference audio for voice cloning in serving_speech.py."""
    audio_bytes = Path(ref_audio_path).read_bytes()
    # `codec` is the model's own codec wrapper (e.g., DAC); encode() runs on CPU
    # and vocab_offset shifts raw codes into the model's token-ID range.
    codes = codec.encode(audio_bytes)
    token_ids = [code + codec.vocab_offset for code in codes.flatten().tolist()]
    return [
        {"role": "system", "content": f"<|voice|>{''.join(chr(t) for t in token_ids)}"},
        {"role": "user", "content": text},
    ]

Test Case Writing (CI Levels)

Follow the vllm-omni-test skill for markers, file naming (test_{slug}.py / test_{slug}_expansion.py), Buildkite wiring, and copy-paste run commands. Also read CI_5levels.md and tests_style.md.

Classify the model's CI priority first (high / medium / low). High-priority TTS models are typically those on the integration hot path or listed in tracking issues such as #1832; medium and low tiers cover the long tail. When unsure, ask the reviewer which tier applies.

Priority	Required test levels	Files & markers
High	L1 unit/logic · L2 online smoke · L3 online + offline integration · L4 feature + performance	See table below
Medium	L3 online + offline · L4 feature only	Skip dedicated L1/L2 unless fixing a logic bug
Low	L4 feature only	One or two `*_expansion.py` parametrized cases

Per-level deliverables (TTS / pytest.mark.tts):

Level	Location	Marker	CI pipeline	Notes
L1	`tests/model_executor/…`, `tests/entrypoints/openai_api/…`, stage-processor tests	`core_model` + `cpu`	`test-ready.yml`	Prompt assembly, async_chunk helpers, `serving_speech` validation — no GPU
L2	`tests/e2e/online_serving/test_{slug}.py`	`core_model` + `advanced_model` (both on baseline smoke) + `tts` + `@hardware_test(...)`	`test-ready.yml` (`ready` label)	Default deploy smoke: single `/v1/audio/speech` or offline `OmniRunner` path
L3	`tests/e2e/online_serving/test_{slug}.py` and `tests/e2e/offline_inference/test_{slug}.py`	Baseline smoke: `core_model` + `advanced_model`; heavier cases: `advanced_model` only (+ `tts`)	`test-merge.yml` or merged into nightly TTS function job	Streaming, voice clone, batch/queue, async_chunk
L4	`tests/e2e/online_serving/test_{slug}_expansion.py`, optional offline expansion	`full_model` + `tts`	`test-nightly.yml` (`:full_moon: TTS · Function Test with L4`)	Feature matrix; perf → `tests/dfx/perf/tests/test_tts.json`

L2 & L3 online — same file, dual marks on the baseline smoke: The first / simplest case in test_{slug}.py (default deploy, single non-streaming /v1/audio/speech or equivalent offline path) should carry both @pytest.mark.core_model and @pytest.mark.advanced_model on the same function so it runs in L2 (test-ready.yml, --run-level core_model, basic validation) and L3 (test-merge.yml, --run-level advanced_model, deeper validation) without duplicating the test. In-tree examples: test_voxcpm2_tts.py::test_text_to_audio_001, test_qwen3_tts_customvoice.py::test_text_to_audio_001.

Heavier scenarios in the same file use advanced_model only (streaming, extra languages, concurrency, async_chunk, batch). Example: test_voice_clone_en_streaming_001 → advanced_model only. When migrating L3 to nightly, move those heavier cases into test_{slug}_expansion.py with full_model and drop the dedicated merge job (see test_ming_tts_expansion.py, test_glm_tts_expansion.py).

@pytest.mark.core_model
@pytest.mark.advanced_model
@pytest.mark.tts
@hardware_test(res={"cuda": "L4"}, num_cards=1)
@pytest.mark.parametrize("omni_server", tts_server_params, indirect=True)
def test_voice_clone_en_non_streaming_001(omni_server, openai_client) -> None:
    openai_client.send_audio_speech_request({...})

L4 consolidation: Prefer parametrized OmniServerParams rows (default, async_chunk, feature flags) in one expansion module rather than many merge-only files (#1832).

L4 performance (high-priority models): Add latency / throughput / stress rows in tests/dfx/perf/tests/test_tts.json, or a dedicated tests/dfx/perf/tests/test_{slug}.json when the model must not join the shared nightly server matrix before integration lands (see VoxCPM2 / Coqui XTTS pattern). Register the model in benchmarks/tts/model_configs.yaml for local bench_tts.py. Wire a separate test-nightly.yml Perf Test step when the JSON is not merged into test_tts.json yet.

Keep model-specific code inside test modules — not tests/helpers/{slug}.py:

Put MODEL, deploy path, vendored REF_AUDIO_URL, get_prompt(), and inline request_config dicts in each test_{slug}.py, test_{slug}_expansion.py, offline test_{slug}.py, and L1 test_{slug}_*.py as needed.
Do not add tests/helpers/{slug}.py (or tests/helpers/{model_name}.py) to deduplicate constants or request builders across those files. A little duplication is intentional; follow in-tree references such as tests/e2e/online_serving/test_glm_tts.py and tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py.
tests/helpers/ is for repo-wide harness code only (mark.py, media.py, runtime.py, stage_config.py, assertions.py, fixtures/). Import those; do not extend the tree with per-model modules.

Runtime send helpers (tests/helpers/runtime.py) — online and offline e2e:

Path	Fixture	Call
Online `/v1/*`	`openai_client`	`openai_client.send_*_request(request_config)`
Offline inference	`omni_runner_handler`	`omni_runner_handler.send_*_request(request_config)`

Grep runtime.py first — reuse send_omni_request, send_diffusion_request, send_audio_speech_request (online + offline Qwen-style TTS), send_single_stage_tts_request (Coqui XTTS / MOSS-TTS-Nano offline), etc.
No matching helper → add send_<feature>_request (or send_<route>_http_request for negative/dfx) in runtime.py with general assert_* bundled inside, then call it from the test.
Test file holds request_config dicts only — not omni.generate, not _collect_audio(), not raw HTTP/SDK.
Wire tests/helpers/runtime.py into Buildkite source_file_dependencies when you add helpers.

See vllm-omni-test skill § Runtime send helpers for full tables and exceptions.

Deliverables

Updated serving_speech.py with all 5 integration points (single commit)
Client scripts and server launcher under examples/online_serving/text_to_speech/<model>/
Gradio demo with streaming and voice cloning UI in the same dir
E2E tests per Test Case Writing (CI Levels) above (priority tier determines L1–L4 scope)
Buildkite wired per level: test-ready.yml (L1/L2), test-merge.yml or nightly function job (L3), test-nightly.yml (L4) — see vllm-omni-test skill
New section in examples/online_serving/text_to_speech/README.md (table row + per-model section). Do not create a top-level examples/online_serving/<model>/ dir or a per-model README.md inside text_to_speech/<model>/.

E2E test pitfalls to avoid

One OmniServerParams set per file. omni_server is module-scoped; a second id in the same file forces mid-module teardown/restart and exposes startup races (APIConnectionError on the first request post-restart). Split variants into separate files instead.
No external URL fetches from the server. CI and some dev hosts can't reach raw.githubusercontent.com over TLS. Inline ref audio as data:audio/wav;base64,...; the serving layer accepts both URL and data URL.
Use the harness readiness gate. The fixture waits for HTTP 200 on /health; don't add time.sleep in tests. If warmup is incomplete, make /health return non-200 until you're actually ready.
Mark tests per the CI Levels table — baseline smoke: core_model + advanced_model; heavier cases: advanced_model only; L4 expansion: full_model
No per-model helper modules — do not create tests/helpers/{slug}.py; keep constants and request_config payloads in the test file
Online and offline e2e go through runtime.py — openai_client.send_audio_speech_request (online); omni_runner_handler.send_audio_speech_request (Qwen-style offline) or send_single_stage_tts_request (single-stage offline). Add a new send_*_request in runtime.py when none fits; do not embed omni.generate or HTTP in tests
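Inlining reference audio as a data URL needs only the standard library. The sketch below synthesizes a short tone so that it is self-contained; a real test would more likely embed a small vendored clip. `REF_AUDIO` and the helper names are illustrative.

```python
import base64
import io
import math
import struct
import wave


def make_wav_bytes(num_samples=1600, sample_rate=16000, freq=440.0):
    """Synthesize a short mono 16-bit sine tone and return it as WAV bytes."""
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / sample_rate)))
        for i in range(num_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buf.getvalue()


def to_data_url(wav_bytes: bytes) -> str:
    """Encode WAV bytes in the data:audio/wav;base64,... form."""
    return "data:audio/wav;base64," + base64.b64encode(wav_bytes).decode("ascii")


REF_AUDIO = to_data_url(make_wav_bytes())
print(REF_AUDIO[:40] + "...")
```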

Phase 4: Async Chunk (Streaming)

Goal: Enable inter-stage streaming so audio chunks are produced while AR generation continues.

Steps

Update stage config YAML:

async_chunk: true
codec_chunk_frames: 25      # frames per chunk
codec_left_context_frames: 25  # overlap for smooth boundaries

Implement chunk handling in Stage 1:
- Accept partial input (chunk of codec codes)
- Handle left context for smooth audio boundaries
- Return partial audio in OmniOutput
Test streaming:
- Verify audio quality matches non-streaming output
- Check for artifacts at chunk boundaries
- Measure TTFA (time to first audio)
Update online serving to support stream=true with PCM output

Streaming Architecture

Stage 0 (AR)                    Stage 1 (Decoder)
  |                                |
  |-- chunk 0 (25 frames) ------> decode -> audio chunk 0 -> client
  |-- chunk 1 (25 frames) ------> decode -> audio chunk 1 -> client
  |-- chunk 2 (25 frames) ------> decode -> audio chunk 2 -> client
  ...

Key Considerations

Left context overlap: Prevents audible artifacts at chunk boundaries
Hop length matters: context_audio_samples = context_frames * hop_length
First chunk latency: Can use larger initial chunk for better quality, then smaller chunks
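The left-context arithmetic can be checked with a toy decoder. In this sketch `fake_decode` is a hypothetical codec in which every frame expands to `hop_length` samples with no cross-frame dependence, so it only verifies the windowing and trimming logic, not boundary smoothness.

```python
def iter_chunks(num_frames, chunk_frames=25, left_context_frames=25):
    """Yield (context_start, chunk_start, chunk_end) frame windows for Stage 1."""
    for start in range(0, num_frames, chunk_frames):
        end = min(start + chunk_frames, num_frames)
        yield max(0, start - left_context_frames), start, end


def trim_context(decoded, ctx_start, chunk_start, hop_length):
    """Drop samples that belong to the left context, keeping only new audio."""
    context_audio_samples = (chunk_start - ctx_start) * hop_length
    return decoded[context_audio_samples:]


def fake_decode(frames, hop_length):
    """Hypothetical codec: each frame becomes hop_length identical samples."""
    return [frame for frame in frames for _ in range(hop_length)]


codes = list(range(60))
hop = 4
streamed = []
for ctx_start, start, end in iter_chunks(len(codes)):
    decoded = fake_decode(codes[ctx_start:end], hop)
    streamed.extend(trim_context(decoded, ctx_start, start, hop))

print(streamed == fake_decode(codes, hop))
```

Note that the first chunk has no left context, so nothing is trimmed from it.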

Deliverables

Updated stage config with async_chunk enabled
Smooth streaming audio without boundary artifacts
TTFA metrics
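TTFA can be measured around any iterator of audio chunks, for example the body of a streaming HTTP response. A minimal sketch; the injectable `clock` parameter is an assumption added here to make the function testable.

```python
import time


def measure_ttfa(chunk_iter, clock=time.perf_counter):
    """Return (ttfa_seconds, total_seconds, num_chunks) for a stream of audio chunks."""
    start = clock()
    ttfa = None
    count = 0
    for chunk in chunk_iter:
        # Empty keep-alive chunks do not count as first audio.
        if ttfa is None and len(chunk) > 0:
            ttfa = clock() - start
        count += 1
    total = clock() - start
    return ttfa, total, count


def slow_stream():
    """Toy stream: a short delay before each PCM chunk."""
    for _ in range(3):
        time.sleep(0.01)
        yield b"\x00\x00" * 100


ttfa, total, n = measure_ttfa(slow_stream())
print(f"TTFA={ttfa * 1000:.1f} ms, total={total * 1000:.1f} ms, chunks={n}")
```

Start the clock before the request is sent, not when the first byte arrives, so that queueing and prefill time are included.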

Phase 5: CUDA Graph Acceleration

Goal: Capture the AR loop as a CUDA graph for significant speedup.

Steps

Identify the hot loop: The AR decoding loop that runs N steps per token
Create static buffers:
- KV caches with fixed max sequence length
- Pre-built causal masks and position tensors per step
- Static input/output tensors
Implement graph capture:
- Warm up with real data
- Capture the forward pass
- Replay with updated inputs
Handle constraints:
- Use torch.argmax instead of torch.multinomial (graph-safe)
- Fixed batch size (fall back to eager for other sizes)
- No dynamic control flow inside the graph

See references/cuda-graph-example.md for a worked skeleton (Qwen3-TTS code predictor, 16-step AR loop), performance expectations (3–5× on the graphed component for fixed batch_size=1), and the graph-safety constraints you must honor inside the captured region.

Deliverables

CUDA graph implementation for the AR hot loop
Benchmark script comparing eager vs graph performance
Documentation of constraints and fallback behavior
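The benchmark deliverable can start from a generic timing harness like the sketch below. It times arbitrary callables; on GPU, pass `sync=torch.cuda.synchronize` so that asynchronous kernel launches are included in the measurement. The CPU lambdas are placeholders for the eager and graph-replay paths.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=20, sync=None):
    """Median wall time of fn() in seconds, after warmup runs."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued GPU work before stopping the timer
        times.append(time.perf_counter() - start)
    return statistics.median(times)


def speedup(eager_seconds, graph_seconds):
    """Ratio of eager time to graph time; above 1.0 means the graph path is faster."""
    if graph_seconds <= 0:
        raise ValueError("graph time must be positive")
    return eager_seconds / graph_seconds


eager_t = benchmark(lambda: sum(range(20000)))   # placeholder for the eager loop
graph_t = benchmark(lambda: sum(range(2000)))    # placeholder for graph replay
print(f"speedup: {speedup(eager_t, graph_t):.1f}x")
```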

Phase 6: Pre-commit and DCO

Goal: Every commit passes pre-commit lint and carries a DCO Signed-off-by line that matches the author email.

Install hooks once: pre-commit install.
Run pre-commit run --files <changed-files> before every push; accept any auto-fixes, stage, re-commit.
Sign every commit with git commit -s. DCO checks that author email and Signed-off-by email match — git config user.email must match your GitHub account email.

Common pre-commit failures, recovery commands for missing sign-off, and the full pre-commit run invocation for a TTS model: references/precommit-dco.md.
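A local approximation of the sign-off check can catch mismatches before pushing. This sketch only compares the author email with the `Signed-off-by:` trailers that `git commit -s` writes; it is not the logic of the actual DCO bot.

```python
import re

SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <([^<>]+)>\s*$", re.MULTILINE)


def signoff_emails(commit_message: str) -> list[str]:
    """Extract the emails from all Signed-off-by trailers, lowercased."""
    return [email.lower() for email in SIGNOFF_RE.findall(commit_message)]


def dco_ok(author_email: str, commit_message: str) -> bool:
    """True if some Signed-off-by trailer carries the author's email."""
    return author_email.lower() in signoff_emails(commit_message)


message = "Add TTS model\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
print(dco_ok("jane@example.com", message))
print(dco_ok("jane@users.noreply.example.com", message))
```

The author email and message for a commit can be obtained with `git log -1 --format=%ae` and `git log -1 --format=%B`.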

Integration Checklist

Use this checklist when integrating a new TTS model:

Cross-Cutting Invariants (verify at end of every phase)

I1: forward() docstring states cumulative vs delta; consolidation path audited end-to-end
I2: Tests / examples / benchmarks never use dict.get(a) or dict.get(b) on tensor values; list form handled
I3: No .item() / .cpu() / Python branch on tensor values inside per-step loops
I4: Offline RTF, browser streaming playback, and concurrent-request smoke test all pass
I5: Any cross-step cache keyed by _omni_req_id; entries freed when the request finishes

Phase 1: HF Reference

Reference model runs and produces correct audio
Architecture documented (stages, codebooks, tokens, sample rate)
Reference audio samples saved for comparison

Phase 2: Stage Separation

Model registered in registry.py
Config classes created with model_type registration
Stage 0 (AR) implemented and generates correct tokens
Stage 1 (Decoder) produces correct audio from tokens — dtype float32 for codec decoder
Stage 0 (AR) max_num_seqs ≥ 4 in production config (default 1 causes gaps under concurrency); Stage 1 (codec decoder) typically stays at max_num_seqs: 1
Optional dependency fallbacks handled at load_weights() time (torchaudio/soundfile/etc.)
Streaming: codec codes accumulated across AR steps (not reset per step)
Streaming: delta audio emitted per chunk, not full re-decoded waveform
Streaming: all forward() return paths emit model_outputs
Streaming: per-request state keyed by request ID (not shared across requests)
Streaming: codec tensors moved to codec decoder device before decode
Stage config YAML created
end2end.py produces audio matching reference quality
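New section added to examples/offline_inference/text_to_speech/README.md (table row + per-model section; no per-model README.md)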

Phase 3: Online Serving

All 5 serving_speech.py integration points added in one commit
Only extract params in _build_*_params that are forwarded to the model call (ruff F841)
Prompt builder handles text input correctly
Voice cloning works (if supported)
All response formats work (wav, mp3, flac, pcm)
Client scripts and server launcher created
E2E tests added per model priority tier (see Test Case Writing (CI Levels))
Buildkite entries match level: test-ready.yml / test-merge.yml or nightly TTS job / test-nightly.yml
Gradio demo working
Documentation added (offline + online docs, nav, supported models)

Phase 4: Async Chunk

Stage config updated with async_chunk: true
Stage 1 handles partial chunks correctly
No audio artifacts at chunk boundaries
Streaming via API (stream=true) works
TTFA measured and acceptable

Phase 5: CUDA Graph

Hot loop identified and profiled
Static buffers allocated
Graph captured and replays correctly
Benchmark shows meaningful speedup
Fallback to eager works for unsupported configs

Phase 6: Pre-commit and DCO

pre-commit run --files <changed> passes before every push
Every commit has Signed-off-by matching the author email (git commit -s)
git config user.email matches the email registered on your GitHub account
Details and failure-recovery commands: references/precommit-dco.md

References

In-skill references (details split out of the main body):

references/single-stage-ar.md — full forward() / generator skeleton for the MOSS-TTS-Nano-style pattern
references/optional-deps.md — torchaudio / torchcodec fallback pattern
references/cuda-graph-example.md — Qwen3-TTS code-predictor CUDA graph skeleton
references/precommit-dco.md — full pre-commit invocation, failure table, DCO recovery

Project docs and adjacent skills:

TTS audio skill — supported models and usage
Fish Speech integration — complete example of Phases 1–3
Qwen3-TTS reference: complete example of Phases 1–5
vllm-omni-test skill — L1–L4 markers, naming, Buildkite wiring, run commands
Adding a TTS model (developer guide)
plan/voxcpm2_native_ar_design.md — VoxCPM2's vLLM-native AR + side-computation pattern (distinct from the generator-based single-stage described above)