qwen3-tts-local-inference


Generate speech from text using Qwen3-TTS via direct Python inference — no server required. Use when: (1) converting text to speech / synthesising audio, (2) creating voiceovers or spoken content, (3) cloning a voice from reference audio, (4) generating TTS with built-in speakers or custom voice descriptions. Supports custom-voice (9 speakers), voice-design (natural language), and voice-clone (~3 s reference). Outputs .wav files. Both 0.6B (small, default) and 1.7B (large) models available. Runs entirely offline after model download.

By modbender. Updated 3/6/2026.


Qwen3-TTS — Local Inference (No Server)

Run Qwen3-TTS directly in Python — no HTTP server, no REST API. Call a script or import the engine in your own code.

Quick reference

| Mode | What it does | Key args |
|---|---|---|
| custom-voice | 9 built-in speakers, optional emotion/style | --speaker, --instruct |
| voice-design | Describe the voice in natural language | --instruct (required) |
| voice-clone | Clone from ~3 s reference audio | --ref-audio, --ref-text |

Available Speakers

The CustomVoice model includes 9 premium voices:

| Speaker | Language | Description |
|---|---|---|
| Vivian | Chinese | Bright, slightly edgy young female |
| Serena | Chinese | Warm, gentle young female |
| Uncle_Fu | Chinese | Seasoned male, low mellow timbre |
| Dylan | Chinese (Beijing) | Youthful Beijing male, clear |
| Eric | Chinese (Sichuan) | Lively Chengdu male, husky |
| Ryan | English | Dynamic male, rhythmic |
| Aiden | English | Sunny American male |
| Ono_Anna | Japanese | Playful female, light nimble |
| Sohee | Korean | Warm female, rich emotion |

Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, Auto
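
If you want to check a speaker or language before starting a slow inference run, the table and language list above can be mirrored as plain data. This is a minimal sketch: the names `SPEAKERS`, `LANGUAGES`, `speakers_for`, and `validate_request` are hypothetical helpers, not part of the skill's scripts.

```python
# Built-in speakers and their listed language, copied from the table above.
SPEAKERS = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing)",
    "Eric": "Chinese (Sichuan)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}

# Values documented for the language setting.
LANGUAGES = {
    "Chinese", "English", "Japanese", "Korean", "German", "French",
    "Russian", "Portuguese", "Spanish", "Italian", "Auto",
}


def speakers_for(language):
    """Return the speakers whose listed language starts with `language`."""
    return [name for name, lang in SPEAKERS.items() if lang.startswith(language)]


def validate_request(speaker, language):
    """Raise ValueError if the speaker or language is not listed above."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
```

Calling `validate_request("Ryan", "English")` before invoking the CLI or the engine catches typos such as `Uncle Fu` (with a space) early.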


1 — Setup

Install dependencies once, from the skill directory:

bash scripts/setup.sh

Custom download location:

python scripts/download_models.py --model-dir /path/to/models

Models are stored under {baseDir}/models/ by default. Override with QWEN_TTS_MODEL_DIR env var or --model-dir flag.
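
The lookup described above could be sketched as follows. This is an illustration only, not the skill's actual code: `resolve_model_dir` is a hypothetical name, and the assumption that the --model-dir flag wins over the environment variable is not confirmed by the source.

```python
import os
from pathlib import Path


def resolve_model_dir(base_dir, cli_model_dir=None, env=None):
    """Pick a model directory: flag, then env var, then default (assumed order)."""
    env = os.environ if env is None else env
    if cli_model_dir:                      # value passed via --model-dir
        return Path(cli_model_dir)
    env_dir = env.get("QWEN_TTS_MODEL_DIR")
    if env_dir:                            # environment override
        return Path(env_dir)
    return Path(base_dir) / "models"       # documented default location


if __name__ == "__main__":
    print(resolve_model_dir("."))
```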


2 — Generate speech (CLI)

Custom Voice (default)

cd {baseDir}
python scripts/tts.py "Hello, how are you today?" --speaker Ryan --language English

With emotion/style instruction:

python scripts/tts.py "Great news everyone!" --speaker Aiden --instruct "cheerful and energetic"

Voice Design

Describe the voice in natural language:

python scripts/tts.py "Welcome to our show!" \
  --mode voice-design \
  --language English \
  --instruct "Warm, confident female voice in her 30s with a slight British accent"

Voice Clone

Clone a voice from a short (~3 s) reference audio clip:

python scripts/tts.py "This is spoken in the cloned voice." \
  --mode voice-clone \
  --language English \
  --ref-audio path/to/reference.wav \
  --ref-text "Transcript of the reference audio."
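
Because the clone mode expects a short reference clip, it can help to check the clip's length before running inference. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM .wav files only. The helper names and the 2 to 10 second bounds are illustrative assumptions, not limits enforced by the skill.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path, min_s=2.0, max_s=10.0):
    """Return (ok, duration). The bounds are illustrative, not from the skill."""
    duration = wav_duration_seconds(path)
    return min_s <= duration <= max_s, duration
```

For example, `ok, seconds = check_reference_clip("path/to/reference.wav")` lets a wrapper script warn about a clip that is far shorter or longer than the ~3 s the mode is described as needing.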

Common options

| Flag | Purpose |
|---|---|
| -o output.wav | Save to exact file path instead of auto-named file |
| --output-dir DIR | Override output directory (default: tts_output/) |
| --model-dir DIR | Override model directory |
| --json | Print result as JSON |
| -v | Verbose logging |
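
The --json flag makes the CLI easier to drive from another program. The following is a rough sketch of such a wrapper, using only flags from the table above. It assumes that the script is run from the skill directory and that stdout contains nothing but the JSON result; neither is confirmed by the source. `build_tts_command` and `run_tts` are hypothetical names.

```python
import json
import subprocess
import sys


def build_tts_command(text, speaker="Ryan", language="English", output=None):
    """Assemble the argument list for scripts/tts.py from documented flags."""
    cmd = [sys.executable, "scripts/tts.py", text,
           "--speaker", speaker, "--language", language, "--json"]
    if output:
        cmd += ["-o", output]              # exact output path instead of auto-naming
    return cmd


def run_tts(text, **kwargs):
    """Run the CLI and parse stdout as JSON (assumes stdout is JSON only)."""
    proc = subprocess.run(build_tts_command(text, **kwargs),
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or punctuation.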

3 — Python API

Use the engine directly in code:

import sys
sys.path.insert(0, "{baseDir}/scripts")

from inference import TTSInferenceEngine

engine = TTSInferenceEngine(
    model_dir="{baseDir}/models",   # optional, uses default if omitted
    output_dir="./tts_output",       # optional
)

result = engine.generate_custom_voice(
    text="Hello world!",
    language="English",
    speaker="Ryan",
    instruct="calm and professional",
)
print(result)
# {"file": "tts_output/custom_voice_20260218_...wav", "duration_s": 1.23, "inference_s": 4.56}

Available methods:

  • engine.generate_custom_voice(text, language, speaker, instruct)
  • engine.generate_voice_design(text, language, instruct)
  • engine.generate_voice_clone(text, language, ref_audio, ref_text)
  • engine.status() — returns loaded variant, device, paths
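
As an illustration of building on these methods, the sketch below synthesises several lines with one speaker and summarises speed from the `duration_s` and `inference_s` keys shown in the example result. `synthesize_lines` and `real_time_factor` are hypothetical helpers. The code assumes `generate_custom_voice` accepts keyword arguments as in the example above, and it passes `instruct` only when one is given.

```python
def synthesize_lines(engine, lines, speaker="Ryan", language="English", instruct=None):
    """Call generate_custom_voice once per non-empty line and collect the results."""
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue                       # skip blank lines
        kwargs = {"text": text, "language": language, "speaker": speaker}
        if instruct is not None:
            kwargs["instruct"] = instruct  # only pass a style when one was given
        results.append(engine.generate_custom_voice(**kwargs))
    return results


def real_time_factor(results):
    """Total inference time divided by total audio duration.

    A value above 1 means generation was slower than real time.
    """
    audio = sum(r["duration_s"] for r in results)
    compute = sum(r["inference_s"] for r in results)
    return compute / audio if audio else float("inf")
```

With the sample result above (1.23 s of audio in 4.56 s), the factor would be roughly 3.7, that is, slower than real time.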

4 — Configuration

All settings are controlled via environment variables. Set them before running.

| Variable | Default | Description |
|---|---|---|
| QWEN_TTS_MODEL_SIZE | small | small (0.6B) or large (1.7B) |
| QWEN_TTS_MODEL_DIR | {baseDir}/models | Where model weights are stored |
| QWEN_TTS_DEVICE | auto (cuda:0 or cpu) | Inference device |
| QWEN_TTS_DTYPE | auto (bfloat16 / float32) | Model precision |
| QWEN_TTS_OUTPUT_DIR | ./tts_output | Where generated .wav files are saved |

Switch to the 1.7B model (Windows cmd syntax shown; in bash or zsh use export QWEN_TTS_MODEL_SIZE=large instead of set):

set QWEN_TTS_MODEL_SIZE=large
python scripts/tts.py "Hello world"

Use a custom model directory:

set QWEN_TTS_MODEL_DIR=D:\my-models\qwen-tts
python scripts/tts.py "Hello world"
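
A shell-independent alternative is to set the variables from Python for a single child process. This is a minimal sketch using the variable names from the table above. `tts_env` and `run_with_env` are hypothetical helpers, and the script path assumes the working directory is the skill directory.

```python
import os
import subprocess
import sys


def tts_env(model_size="large", model_dir=None, base_env=None):
    """Copy the current environment and set the documented QWEN_TTS_* variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["QWEN_TTS_MODEL_SIZE"] = model_size
    if model_dir:
        env["QWEN_TTS_MODEL_DIR"] = str(model_dir)
    return env


def run_with_env(text, **overrides):
    """Launch scripts/tts.py with the overrides applied to that process only."""
    return subprocess.run([sys.executable, "scripts/tts.py", text],
                          env=tts_env(**overrides), check=True)
```

Because the overrides live in a copied environment, they do not leak into the parent shell or into later commands.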

Important notes

  • Small model (0.6B) is the default. It uses less RAM and is faster. Switch to large (1.7B) for higher quality.
  • CPU inference is slow. Expect 30-120 s per sentence for the 1.7B model. The 0.6B model is roughly 2x faster.
  • Only one model variant is loaded at a time. Switching modes (e.g. custom-voice to voice-clone) triggers a model swap.
  • Output .wav files land in tts_output/ by default.
  • Models are downloaded to {baseDir}/models/ by default. Run python scripts/download_models.py --size all to pre-download both sizes for offline use.
  • Voice Design mode has no 0.6B variant — it always uses the 1.7B model regardless of QWEN_TTS_MODEL_SIZE.
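
Two of the notes above lend themselves to a short sketch: Voice Design always resolving to the large model, and each mode switch costing a model swap. The helpers below are hypothetical and not part of the skill. They show one way a batch script might work out which size will really be used and reorder jobs so that each mode runs as one contiguous group.

```python
def effective_model_size(mode, configured_size="small"):
    """Voice Design has no 0.6B variant, so it always resolves to 'large'."""
    if mode == "voice-design":
        return "large"
    return configured_size


def order_jobs_by_mode(jobs):
    """Stable-sort jobs so each mode runs as one contiguous batch.

    Modes keep the order in which they first appear, and jobs within
    a mode keep their original relative order.
    """
    first_seen = {}
    for job in jobs:
        first_seen.setdefault(job["mode"], len(first_seen))
    return sorted(jobs, key=lambda job: first_seen[job["mode"]])
```

For a queue that alternates between custom-voice and voice-clone, reordering this way brings the number of mode switches down to one.
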
Install via CLI
npx skills add https://github.com/modbender/skill-library-mcp --skill qwen3-tts-local-inference