audio-tts - SKILL.md Agent Skill

name: audio-tts
description: Generate speech audio from text using Qwen3 TTS, or clone a voice from reference audio. Triggered when the user wants to convert text to speech, generate audio, read text aloud, or clone/mimic a voice. Supports multiple speakers, English and Chinese, and emotion/style control.

Qwen3 TTS — Text-to-Speech and Voice Cloning

Generate speech audio from text, or clone a voice from a reference audio file.

Binaries

{baseDir}/scripts/tts — Text-to-speech generation with named speakers.
{baseDir}/scripts/voice_clone — Voice cloning from a reference audio file.

Models

{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice — Named speaker TTS (0.6B parameters).
{baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base — Voice cloning from reference audio (0.6B parameters).

Reference Audio

Pre-packaged reference audio files for voice cloning are available at {baseDir}/scripts/reference_audio/. Each speaker has two files:

{baseDir}/scripts/reference_audio/<speaker_name>.wav — Reference audio (mono 24kHz 16-bit WAV)
{baseDir}/scripts/reference_audio/<speaker_name>.txt — Transcript of the reference audio

Available reference speakers: trump, elon_musk.

When to Use Which Tool

tts — When the user wants to generate speech from text using a named speaker (Vivian, Ryan, etc.). Supports English and Chinese.
voice_clone — When the user wants to clone a specific voice from a reference audio file and generate new speech in that voice. If the user asks to clone a voice by speaker name (e.g., "speak like Trump", "use Elon Musk's voice"), check {baseDir}/scripts/reference_audio/ for a matching <speaker_name>.wav and <speaker_name>.txt pair, and use ICL mode with both files.

Linux Environment Setup

On Linux, the binaries require libtorch shared libraries. Set the library path before running any command:

export LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH

On macOS, no environment setup is needed (the binaries use the MLX backend). All commands below show the macOS form. On Linux, either export the variable once per shell session as shown above, or prefix each individual command with LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH.
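
One detail of the Linux form is worth noting: if LD_LIBRARY_PATH is unset, the expansion leaves a trailing colon, and ld.so(8) documents a zero-length entry as the current working directory. The sketch below avoids that and applies the setting only on Linux. BASE_DIR and prepend_lib_path are hypothetical names; BASE_DIR stands in for {baseDir} and its default value is a placeholder.

```shell
# BASE_DIR is a hypothetical variable standing in for {baseDir}.
BASE_DIR="${BASE_DIR:-/opt/audio-tts}"

# Print "dir" when nothing is set yet, otherwise "dir:existing".
# ${existing:+...} expands to nothing when $existing is empty,
# so the result never ends in a dangling ":".
prepend_lib_path() {
  dir="$1"
  existing="$2"
  printf '%s%s\n' "$dir" "${existing:+:$existing}"
}

# Only Linux needs the libtorch path; macOS uses the MLX backend.
if [ "$(uname -s)" = "Linux" ]; then
  LD_LIBRARY_PATH=$(prepend_lib_path "$BASE_DIR/scripts/libtorch/lib" "${LD_LIBRARY_PATH:-}")
  export LD_LIBRARY_PATH
fi
```

Sourcing this once per shell session has the same effect as the export shown above, so the per-command prefix is then unnecessary.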

Text-to-Speech

Generate speech audio from text with a named speaker.

{baseDir}/scripts/tts \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  "<text>" \
  <speaker> \
  <language>

Parameters

Parameter	Required	Description
model_path	Yes	Path to the model directory
text	Yes	The text to synthesize as speech
speaker	Yes	Speaker name (see Available Speakers below)
language	Yes	`english` or `chinese`

Available Speakers

Vivian, Serena, Ryan, Aiden, Uncle_fu, Ono_anna, Sohee, Eric, Dylan.
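
Because speaker and language are positional arguments, a misspelled value may only surface once the binary runs. A small pre-check, sketched here, can catch that earlier. The function names are hypothetical, and the exact-match comparison assumes the binary expects the names spelled exactly as listed above.

```shell
# Speaker names copied from the list above.
TTS_SPEAKERS="Vivian Serena Ryan Aiden Uncle_fu Ono_anna Sohee Eric Dylan"

# Succeeds if $1 exactly matches one listed speaker name.
is_known_speaker() {
  for s in $TTS_SPEAKERS; do
    [ "$s" = "$1" ] && return 0
  done
  return 1
}

# Succeeds if $1 is one of the two documented language values.
is_known_language() {
  case "$1" in
    english|chinese) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_known_speaker "$speaker" && is_known_language "$language" ||
#     echo "unsupported speaker or language" >&2
```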

Output

Generates output.wav (24kHz mono WAV) in the current working directory.

Example

{baseDir}/scripts/tts \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  "Hello! Welcome to the Qwen3 text-to-speech system." \
  Vivian \
  english

Voice Cloning (ICL Mode)

Clone a voice from a reference audio file using ICL (In-Context Learning). ICL mode encodes the reference audio into codec tokens and conditions generation on both the speaker embedding and the reference audio together with its transcript, which produces a high-fidelity clone.

Both a reference audio file and its transcript text are required.

{baseDir}/scripts/voice_clone \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
  <reference_audio.wav> \
  "<text>" \
  <language> \
  "<reference_text>"

Parameters

Parameter	Required	Description
model_path	Yes	Path to the Base model directory
reference_audio	Yes	Path to reference WAV file (mono 24kHz 16-bit)
text	Yes	The text to synthesize in the cloned voice
language	Yes	`english` or `chinese`
reference_text	Yes	Transcript of the reference audio

Reference Audio Requirements

The reference audio must be a mono 24kHz 16-bit WAV file. Convert from other formats with ffmpeg:

ffmpeg -i input.m4a -ac 1 -ar 24000 -sample_fmt s16 reference.wav
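
To avoid re-encoding a file that already matches, the format can be checked first. The sketch below assumes ffprobe (shipped with ffmpeg) is installed; is_reference_format and probe_audio are hypothetical helper names.

```shell
# Succeeds if the ffprobe "key=value" lines in $1 describe
# mono, 24 kHz, 16-bit signed PCM audio.
is_reference_format() {
  for want in channels=1 sample_rate=24000 sample_fmt=s16; do
    # -x requires the whole line to match, so "channels=10" is rejected.
    printf '%s\n' "$1" | grep -qx "$want" || return 1
  done
  return 0
}

# Print the first audio stream's properties as key=value lines.
probe_audio() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=channels,sample_rate,sample_fmt \
    -of default=noprint_wrappers=1 "$1"
}

# Usage (needs ffmpeg installed):
#   is_reference_format "$(probe_audio input.wav)" ||
#     ffmpeg -i input.wav -ac 1 -ar 24000 -sample_fmt s16 reference.wav
```

The check covers the audio stream only, so a file in a non-WAV container should still go through the ffmpeg conversion.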

Output

Generates output_voice_clone.wav (24kHz mono WAV) in the current working directory.

Example

{baseDir}/scripts/voice_clone \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
  reference.wav \
  "This is a voice cloning test with in-context learning." \
  english \
  "The transcript of what was said in the reference audio."

Workflow

1. Determine the Task

If the user wants to generate speech from text with a named speaker (Vivian, Ryan, etc.), use tts.
If the user wants to clone a voice from an audio file, use voice_clone.
If the user asks to clone a voice by speaker name (e.g., "speak like Trump", "in Elon Musk's voice"), use voice_clone with the pre-packaged reference audio.

2. Prepare Input

For tts: Identify the text, speaker name, and language from the user's request. Default to Vivian and english if not specified.
For voice_clone with a named reference speaker:
  1. Look up {baseDir}/scripts/reference_audio/<speaker_name>.wav and {baseDir}/scripts/reference_audio/<speaker_name>.txt.
  2. Read the transcript from the .txt file.
  3. Pass both the .wav path and the transcript text to voice_clone.
For voice_clone with a user-provided audio file: Ensure the reference audio is a mono 24kHz 16-bit WAV, converting it with ffmpeg if needed. Ask the user for the transcript of the reference audio, since ICL mode requires it.
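
The named-speaker lookup can be sketched as follows. The lowercase-with-underscores mapping is an assumption inferred from the file names trump and elon_musk, and speaker_stem and find_reference are hypothetical helpers.

```shell
# Map a spoken name such as "Elon Musk" to the file stem "elon_musk".
speaker_stem() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr ' ' '_'
}

# Print the .wav path if both the .wav and the .txt exist in
# directory $1 for speaker $2; fail otherwise.
find_reference() {
  stem=$(speaker_stem "$2")
  wav="$1/$stem.wav"
  txt="$1/$stem.txt"
  [ -f "$wav" ] && [ -f "$txt" ] || return 1
  printf '%s\n' "$wav"
}

# Usage, with REF_DIR pointing at the reference_audio directory:
#   wav=$(find_reference "$REF_DIR" "Elon Musk") &&
#     ref_text=$(cat "${wav%.wav}.txt")
```

If the lookup fails, the request falls back to the user-provided path: ask for an audio file and its transcript.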

3. Run the Command

Run the appropriate binary using the full paths to the binaries and model directories. On Linux, prefix with LD_LIBRARY_PATH={baseDir}/scripts/libtorch/lib:$LD_LIBRARY_PATH.

Example: Clone by Speaker Name

If the user says "Say hello world in Trump's voice":

# Read the transcript
REF_TEXT=$(cat {baseDir}/scripts/reference_audio/trump.txt)

# Run voice clone with ICL mode
{baseDir}/scripts/voice_clone \
  {baseDir}/scripts/models/Qwen3-TTS-12Hz-0.6B-Base \
  {baseDir}/scripts/reference_audio/trump.wav \
  "Hello world" \
  english \
  "$REF_TEXT"

4. Return the Output

The output WAV file will be in the current working directory:

tts produces output.wav
voice_clone produces output_voice_clone.wav

Inform the user of the output file path.
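
Since both binaries write to a fixed file name, a later run in the same directory would presumably overwrite an earlier result. A minimal sketch for keeping each result and reporting its path follows; keep_output and the timestamped naming scheme are hypothetical and not part of the skill.

```shell
# Move a generated file ($1) into directory $2 under a timestamped
# name and print the new path. Fails if the file is missing or empty.
# Names are unique per second only; two calls within the same second
# would collide.
keep_output() {
  src="$1"
  dest_dir="$2"
  [ -s "$src" ] || { echo "missing or empty: $src" >&2; return 1; }
  mkdir -p "$dest_dir" || return 1
  base=$(basename "$src" .wav)
  dest="$dest_dir/${base}_$(date +%Y%m%d_%H%M%S).wav"
  mv "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage after a tts run:
#   keep_output output.wav "$PWD/audio"
```

The printed path is what would be reported back to the user.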