voicebox - SKILL.md Agent Skill

name: voicebox
version: 1.0.0
author: zanearcher
category: audio
description: "Text-to-speech voice toolkit. DEFAULT ACTION: When called with text (e.g. /voicebox hello), IMMEDIATELY run: uv run $SKILL_DIR/scripts/voicebox.py generate "Calm Narrator" "<text>" --play. Do NOT ask questions, do NOT greet the user — just generate and play the speech. Also supports: voice cloning, multi-speaker conversations, recording, and transcription."
user_invocable: true

Voicebox TTS Skill

Path convention: All commands below use $SKILL_DIR to mean this skill's base directory. When Claude loads a skill, the system prompt includes a line of the form "Base directory for this skill: <path>"; use that path. Example: if the skill is installed at ~/.agent/skills/voicebox/, then $SKILL_DIR = ~/.agent/skills/voicebox.

Standalone text-to-speech using mlx-audio. Supports custom voice design (from text descriptions) and voice cloning (from audio samples). No external app required.

CRITICAL: Default Action Rule

When /voicebox is followed by plain text (not a keyword like "create", "clone", "transcribe", "record", "list", "delete"), the DEFAULT action is ALWAYS to generate speech immediately using the "Calm Narrator" profile. Do NOT ask follow-up questions. Do NOT present a menu of options. Just run the generate command and play the audio.

Example: /voicebox hello → immediately run:

uv run $SKILL_DIR/scripts/voicebox.py generate "Calm Narrator" "hello" --play

Only route to other modes when the input contains explicit keywords:

"create" / "design" → Mode 2 (create profile)
"clone" / "record" → Mode 3/4 (clone voice)
"transcribe" → Mode 5 (transcription)
"conversation" / "drama" / "audiobook" → Mode 6 (multi-speaker)
"list" / "delete" / "speakers" / "models" → Management commands

Everything else = generate speech immediately.
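
The routing rule above can be sketched as a small keyword lookup. This is an illustration only, not code from voicebox.py: the `route` function, the mode labels, and the checking order are assumptions made for this sketch. Multi-speaker keywords are checked first so that a request like "create a conversation" is not mistaken for profile creation.

```python
import re

# Keyword sets copied from the routing list above. The checking order is an
# assumption of this sketch, not something the skill specifies.
ROUTES = [
    ("conversation", {"conversation", "drama", "audiobook"}),
    ("transcribe", {"transcribe"}),
    ("clone", {"clone", "record"}),
    ("create_profile", {"create", "design"}),
    ("manage", {"list", "delete", "speakers", "models"}),
]


def route(user_input: str) -> str:
    """Return a mode label for the text that follows /voicebox."""
    words = set(re.findall(r"[a-z]+", user_input.lower()))
    for mode, keywords in ROUTES:
        if words & keywords:
            return mode
    # Everything else: generate speech immediately with the default profile.
    return "generate"
```

This is a literal reading of the keyword list. It does not capture requests such as "create a news broadcast" that imply Mode 6 without using one of the listed multi-speaker keywords, so some judgment is still needed.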

Usage

Generate speech

/voicebox "Hello world"                                     (uses default: Calm Narrator)
/voicebox "Calm Narrator" "Hello world"                     (explicit profile)
/voicebox "angry tone" "My Voice" "I can't believe this!"   (style + profile)
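
The three invocation forms above differ only in how many quoted arguments they carry. A minimal sketch of that mapping, assuming the arguments arrive as one shell-style quoted string; `build_generate_command` is a hypothetical helper, while the `generate` subcommand and the `--instruct` and `--play` flags are the ones documented in this skill:

```python
import shlex

DEFAULT_PROFILE = "Calm Narrator"


def build_generate_command(raw_args: str, skill_dir: str) -> list:
    """Map one, two, or three quoted arguments to a generate command."""
    args = shlex.split(raw_args)
    style = None
    if len(args) == 1:
        profile, text = DEFAULT_PROFILE, args[0]   # text only
    elif len(args) == 2:
        profile, text = args                        # profile + text
    elif len(args) == 3:
        style, profile, text = args                 # style + profile + text
    else:
        raise ValueError(f"expected 1-3 quoted arguments, got {len(args)}")

    cmd = ["uv", "run", f"{skill_dir}/scripts/voicebox.py",
           "generate", profile, text]
    if style:
        cmd += ["--instruct", style]
    return cmd + ["--play"]
```

The sketch relies on the arguments being quoted. Unquoted plain text such as `/voicebox hello world` would be split into two words, so it should be treated as a single text argument before reaching a parser like this.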

Create profiles

/voicebox create a calm narrator voice profile              (designed - from description)
/voicebox create a custom profile using Ryan                (custom - preset speaker)
/voicebox clone my voice from /path/to/audio.wav            (cloned - from audio file)
/voicebox clone my voice                                    (record from mic + clone)

Transcribe audio/video

/voicebox transcribe /path/to/audio.wav
/voicebox transcribe /path/to/video.mp4

Generate multi-speaker conversation

/voicebox create a news broadcast with anchor, reporter, and expert
/voicebox make a conversation between Calm Narrator and Cheerful Girl
/voicebox generate a drama scene with these characters...

Trigger phrases (these activate the skill automatically)

"clone my voice", "record my voice", "create a voice clone"
"transcribe this", "transcribe audio", "transcribe video"
"create a conversation", "make a drama", "generate a dialogue", "audiobook"
Any request involving audio/video transcription or speech-to-text

Architecture

Three TTS model categories plus ASR, each with quality tiers:

| Category | High (default) | Standard | Use Case |
| --- | --- | --- | --- |
| Voice Design | `Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16` | (same — only 1.7B exists) | Custom voices from description |
| Voice Clone | `Qwen3-TTS-12Hz-1.7B-Base-bf16` | `Qwen3-TTS-12Hz-0.6B-Base-bf16` | Clone a real voice |
| Custom Voice | `Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16` | `Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16` | 9 preset premium speakers with style control |
| ASR (Transcription) | `Qwen/Qwen3-ASR-1.7B` | `Qwen/Qwen3-ASR-0.6B` | Speech-to-text |

All commands accept --quality standard or --quality high (default) to select model tier.
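
For reference, the category and tier combinations can be written as a lookup table. The model identifiers are copied from the table in this section; the dictionary keys and the `model_for` helper are hypothetical names chosen for this sketch and are not taken from voicebox.py.

```python
# Model names as listed in this section. Keys are labels for this sketch only.
MODELS = {
    "design": {
        "high": "Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16",
        "standard": "Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16",  # only 1.7B exists
    },
    "clone": {
        "high": "Qwen3-TTS-12Hz-1.7B-Base-bf16",
        "standard": "Qwen3-TTS-12Hz-0.6B-Base-bf16",
    },
    "custom": {
        "high": "Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16",
        "standard": "Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16",
    },
    "asr": {
        "high": "Qwen/Qwen3-ASR-1.7B",
        "standard": "Qwen/Qwen3-ASR-0.6B",
    },
}


def model_for(category: str, quality: str = "high") -> str:
    """Return the model name for a category and quality tier."""
    return MODELS[category][quality]
```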

All state is in the skill's data/ directory (default: $SKILL_DIR/data/):

profiles.json — profile registry
samples/ — WAV files for reference audio

Script: scripts/voicebox.py (relative to skill root)

Default Profiles

| Context | Default Profile | Type | Best For |
| --- | --- | --- | --- |
| Overall default (English text) | Calm Narrator | Designed | Neutral narration |
| CustomVoice Chinese | Dylan | Custom (speaker: dylan) | Chinese with emotion |
| CustomVoice English | Aiden | Custom (speaker: aiden) | English with emotion |
| CustomVoice Japanese | Ono Anna | Custom (speaker: ono_anna) | Japanese with emotion |

When the user runs /voicebox "some text" with only text (a single quoted argument), use Calm Narrator by default.

When the text requires emotional delivery or the user requests a specific emotion (angry, sad, excited, etc.), prefer CustomVoice defaults by language: Dylan for Chinese, Aiden for English, Ono Anna for Japanese. CustomVoice supports --instruct for emotion control.
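
The default-profile choice described above can be sketched as follows. `default_profile` is a hypothetical helper. Falling back to Calm Narrator for languages without a CustomVoice default is an assumption of this sketch, since the text does not say what to do in that case.

```python
# CustomVoice defaults by language code, as listed in the table above.
CUSTOM_DEFAULTS = {"zh": "Dylan", "en": "Aiden", "ja": "Ono Anna"}
OVERALL_DEFAULT = "Calm Narrator"


def default_profile(lang: str = "en", emotional: bool = False) -> str:
    """Pick a default profile when the user did not name one."""
    if emotional:
        # Assumption: languages without a CustomVoice default fall back
        # to the overall default profile.
        return CUSTOM_DEFAULTS.get(lang, OVERALL_DEFAULT)
    return OVERALL_DEFAULT
```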

Mode 1: Generate Speech

Step-by-step workflow

Parse arguments — Count the quoted arguments:
- One quoted arg → Text only. Use the default profile ("Calm Narrator").
- Two quoted args → First is profile name, second is text to speak.
- Three quoted args → First is style/instruct, second is profile name, third is text.
Find the profile — Look up the profile name in profiles.json (case-insensitive, partial match OK).
Emotion segmentation (OPT-IN ONLY — never do this by default):

By default, long text is generated as a single pass with generate. Do NOT automatically split text into emotional segments. Only activate emotion segmentation when the user explicitly requests it with phrases like:
- "with emotions", "auto-emotion", "segment emotions", "emotional delivery"
- "split by emotion", "vary the emotions", "make it expressive"
If the user does NOT ask for emotion segmentation, skip this entire step and go straight to step 4 (simple generate), regardless of text length.

When the user explicitly requests emotion segmentation:

a. Analyze the text — Identify distinct emotional segments (e.g., excitement → worry → determination → gratitude).

b. Split into segments — Break the text at natural emotion boundaries (sentence or clause level).

c. Tag each segment with an appropriate instruct emotion from this palette:
- Joy/Excitement: "excited, enthusiastic, joyful tone", "ecstatic, breathless with joy"
- Anger: "angry, furious, intense tone", "irritated, sharp and impatient"
- Sadness/Sorrow: "sorrowful, heartbroken, tearful voice", "melancholic, quiet and reflective"
- Fear/Panic: "panicked, terrified, desperate tone", "anxious, nervous, trembling voice"
- Calm/Neutral: "calm, steady, composed delivery", "warm and reassuring tone"
- Surprise: "shocked, astonished, wide-eyed disbelief"
- Determination: "determined, resolute, firm and confident"
- Tenderness: "gentle, tender, soft-spoken with warmth"
d. Generate a conversation script — CRITICAL: USE THE SAME SINGLE PROFILE FOR EVERY LINE.

THIS IS NOT A MULTI-SPEAKER CONVERSATION. Emotion segmentation is ONE person speaking with varying emotions. Every "profile" value in the script MUST be identical — the same profile the user selected (or the default). Only the "instruct" field varies between lines. Do NOT assign different profiles to different segments.
```
{
  "title": "emotional_speech",
  "gap": 0.15,
  "lines": [
    {"profile": "Dylan", "text": "Segment 1 text...", "instruct": "excited, enthusiastic tone"},
    {"profile": "Dylan", "text": "Segment 2 text...", "instruct": "worried, anxious, trembling voice"},
    {"profile": "Dylan", "text": "Segment 3 text...", "instruct": "determined, resolute, firm tone"},
    {"profile": "Dylan", "text": "Segment 4 text...", "instruct": "gentle, grateful, warm and tender"}
  ]
}
```
Note: ALL four lines above use "Dylan" — the same profile. This is intentional and required.

e. Use the conversation command instead of single generate:
```
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/emotional_script.json --play --gap 0.15
```
IMPORTANT — Emotion segmentation vs Multi-speaker:
- Emotion segmentation (this section): OPT-IN only. ONE voice, varying emotions. Same profile on every line.
- Multi-speaker conversation (Mode 6): MULTIPLE voices, different profiles per line. Only when user asks for conversation/dialogue/drama.
- Default for long text: Single generate call. No segmentation. No splitting.
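
The single-profile requirement for emotion segmentation is easy to enforce when the script is built programmatically. The sketch below produces the JSON structure shown above from a list of (text, instruct) pairs. `build_emotion_script` and `uses_single_profile` are hypothetical helpers for illustration, not part of voicebox.py.

```python
def build_emotion_script(profile, segments, gap=0.15, title="emotional_speech"):
    """Build a one-speaker script; `segments` is a list of (text, instruct)."""
    return {
        "title": title,
        "gap": gap,
        # The same profile is written on every line; only instruct varies.
        "lines": [
            {"profile": profile, "text": text, "instruct": instruct}
            for text, instruct in segments
        ],
    }


def uses_single_profile(script) -> bool:
    """True when every line of the script uses the same profile."""
    return len({line["profile"] for line in script["lines"]}) == 1


script = build_emotion_script("Dylan", [
    ("Segment 1 text...", "excited, enthusiastic tone"),
    ("Segment 2 text...", "worried, anxious, trembling voice"),
])
```

Serializing `script` with `json.dump` to a temp file gives the input expected by the conversation command.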

Simple generate (the default for text of any length unless emotion segmentation was explicitly requested):

uv run $SKILL_DIR/scripts/voicebox.py generate "Profile Name" "text to speak" --play

With style override (short emotional text):

uv run $SKILL_DIR/scripts/voicebox.py generate "Profile Name" "text to speak" --instruct "angry tone" --play

With the high-quality tier (1.7B model) for cloned voices; high is already the default, so the flag is optional:

uv run $SKILL_DIR/scripts/voicebox.py generate "Profile Name" "text to speak" --quality high --play

IMPORTANT: Use timeout of 300000ms — model loading + generation takes time on first run.

Report result — Tell the user the audio was generated and played. Show duration and profile used.

If no profiles exist

Offer to create one using Mode 2.
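
The profile lookup used in this mode is described as case-insensitive with partial matches allowed. A minimal sketch of such a lookup over a list of profile names is shown below. How voicebox.py itself resolves ambiguous partial matches is not stated here, so returning `None` in that case is an assumption, and `find_profile` is a hypothetical helper.

```python
from typing import List, Optional


def find_profile(query: str, names: List[str]) -> Optional[str]:
    """Case-insensitive lookup; an exact match wins over partial matches."""
    q = query.casefold()
    for name in names:
        if name.casefold() == q:
            return name
    partial = [name for name in names if q in name.casefold()]
    # Assumption: an ambiguous or missing match yields None so the caller
    # can show the available profiles instead of guessing.
    return partial[0] if len(partial) == 1 else None
```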

Mode 2: Create Voice Profile from Description (Designed)

When the user says "create a ... voice profile":

Parse the voice description from the user's request.
Build a rich voice description — Expand the user's short description into a detailed multi-dimensional voice prompt:
- Template: [Age] [gender] with a [pitch] [characteristic] voice, [speaking rate] pace, [emotion/tone], suitable for [use case]
- Example: "calm narrator" → "Calm middle-aged male narrator with a deep warm baritone voice, slow measured pace, soothing and trustworthy tone, suitable for audiobook narration"
- Be specific: use "deep", "crisp", "fast-paced", not vague words like "nice"
Choose a sample text that matches the voice emotion:
- Neutral: "The morning sun rose gently over the quiet village, casting golden light across the cobblestone streets. Birds sang their morning songs as the world slowly came to life."
- Angry: "I told you a hundred times not to do that. This is absolutely unacceptable and I will not stand for it anymore."
- Cheerful: "Hey everyone, welcome back! I have some amazing news to share with you today, and I just can't wait to get started!"

Create the profile:

uv run $SKILL_DIR/scripts/voicebox.py create-designed "Calm Narrator" \
  --desc "Calm middle-aged male narrator with a deep warm baritone voice, slow measured pace, soothing and trustworthy tone, suitable for audiobook narration" \
  --lang en

IMPORTANT: Use timeout of 300000ms.

Confirm — Tell the user the profile was created and is ready to use.
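
The description template used in this mode can be filled in mechanically once each attribute is chosen. The sketch below is illustrative: `build_description` and `VAGUE_WORDS` are hypothetical names, and the vague-word check covers only the two words this document names as examples to avoid.

```python
VAGUE_WORDS = {"nice", "good"}  # examples of vague words named in this document


def build_description(age, gender, pitch, characteristic, rate, tone, use_case):
    """Fill the voice description template with specific attributes."""
    description = (
        f"{age} {gender} with a {pitch} {characteristic} voice, "
        f"{rate} pace, {tone}, suitable for {use_case}"
    )
    vague = VAGUE_WORDS & set(description.lower().replace(",", " ").split())
    if vague:
        raise ValueError(f"replace vague words: {sorted(vague)}")
    return description


desc = build_description(
    "Calm middle-aged", "male narrator", "deep", "warm baritone",
    "slow measured", "soothing and trustworthy tone", "audiobook narration",
)
```

Here `desc` reproduces the "calm narrator" example description and can be passed to `create-designed` through `--desc`.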

Profile Naming Convention

Derive from the user's description, capitalize as title: "Calm Narrator", "Angry Woman"
Keep it short (2-3 words)

Mode 2b: Create Voice Profile from Preset Speaker (Custom)

When the user wants to use a premium preset voice (no description or audio needed):

Show available speakers if the user hasn't picked one:
- Chinese: serena, vivian, uncle_fu, dylan, eric
- English: ryan, aiden
- Japanese: ono_anna
- Korean: sohee
Create the profile:
```
uv run $SKILL_DIR/scripts/voicebox.py create-custom "Profile Name" <speaker>
```
Language is auto-detected from speaker. Optional --desc for a label.
Generate with style override — Custom profiles support --instruct for emotion/style:
```
uv run $SKILL_DIR/scripts/voicebox.py generate "Profile Name" "text" --instruct "angry tone" --play
```

When to use Custom vs Designed vs Cloned

Custom — Fast setup, consistent high-quality preset voices, supports style instructions
Designed — Unlimited creativity, describe any voice you want
Cloned — Reproduce a specific real person's voice from audio

Mode 3: Create Voice Profile from Audio File (Cloned)

When the user says "clone my voice from /path/to/file.wav" or provides an audio file:

Get the audio file path and a transcript of what was said in the recording.
If no transcript is provided, auto-transcribe the file with the built-in transcription script:
```
uv run $SKILL_DIR/scripts/transcribe.py /path/to/audio.wav
```
Only ask the user as a last resort.

Create the profile:

uv run $SKILL_DIR/scripts/voicebox.py create-cloned "My Voice" \
  --audio /path/to/sample.wav \
  --ref-text "transcript of what was said" \
  --lang en

Confirm — Tell the user the profile was created.

Mode 4: Record from Microphone and Clone (IMPORTANT)

This mode activates when the user says "clone my voice", "record my voice", "I want to clone a voice", or any request to clone without providing an audio file.

Step-by-step workflow

Ask the user for a profile name (or derive one like "My Voice", "[User's Name]'s Voice").
Ask what they'd like to say, or suggest a good sample sentence:
- "The morning sun rose gently over the quiet village, casting golden light across the cobblestone streets."
- Or let them say anything — 5-15 seconds of clear speech works best.
Confirm they're ready, then record and auto-clone in one command:
```
uv run $SKILL_DIR/scripts/voicebox.py record "My Voice" --duration 10 --lang en
```
- Default is 10 seconds. Adjust with --duration if the user wants more/less.
- If the user already knows what they'll say, pass it: --ref-text "what they said" (skips transcription)
- Without --ref-text, the command auto-transcribes using the built-in transcribe.py (Qwen3-ASR) — no external skill needed!
- IMPORTANT: Use timeout of 300000ms.
Play back the recording so the user can verify:
```
afplay $SKILL_DIR/data/samples/<slug>.wav
```
Confirm — Tell the user the profile was created and is ready to use with /voicebox "My Voice" "text to speak".

Requirements

ffmpeg must be installed (brew install ffmpeg)
macOS microphone permission must be granted to the terminal app

Mode 5: Transcribe Audio/Video

This mode activates when the user says "transcribe this", "transcribe audio/video", provides an audio/video file for transcription, or any speech-to-text request.

Step-by-step workflow

Get the file path from the user's request.

Run transcription:

uv run $SKILL_DIR/scripts/transcribe.py /path/to/file.wav

With optional language:

uv run $SKILL_DIR/scripts/transcribe.py /path/to/file.wav --language zh

IMPORTANT: Use timeout of 300000ms.

Return the transcript to the user.

Supported formats

Audio: wav, mp3, flac, m4a, ogg, aac, wma
Video: mp4, mkv, mov, avi, webm, m4v, flv, wmv (ffmpeg extracts audio automatically)
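
A quick pre-check of the file extension can catch unsupported inputs before the transcription script is started. The extension sets below are copied from the lists above; `media_kind` is a hypothetical helper, and transcribe.py may perform its own validation.

```python
from pathlib import Path
from typing import Optional

AUDIO_EXTS = {"wav", "mp3", "flac", "m4a", "ogg", "aac", "wma"}
VIDEO_EXTS = {"mp4", "mkv", "mov", "avi", "webm", "m4v", "flv", "wmv"}


def media_kind(path: str) -> Optional[str]:
    """Return "audio", "video", or None for an unsupported extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    return None
```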

Supported languages

52 languages with auto-detection including: English, Chinese (+ dialects), Japanese, Korean, German, French, Spanish, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, Indonesian, and more.

Mode 6: Generate Conversation / Audiobook / Drama

This mode activates when the user asks for a multi-speaker conversation, dialogue, drama, audiobook with multiple characters, news broadcast, or any scenario involving multiple voice profiles speaking in sequence.

Step-by-step workflow

Create a JSON script file based on the user's request. The script format is:
```
{
  "title": "Evening News",
  "gap": 0.25,
  "lines": [
    {"profile": "News Anchor", "text": "Good evening and welcome to the six o'clock news."},
    {"profile": "Cheerful Girl", "text": "Thanks, Tom! I'm here live at the scene.", "instruct": "excited field reporting tone"},
    {"profile": "Aiden", "text": "Well, this is actually quite common in my experience."}
  ]
}
```
Script fields:
- title — Name for the output directory and combined file
- gap — Silence between segments in seconds (default: 0.25)
- lines — Array of dialogue lines, each with:
  - profile — Name of an existing voice profile (must match exactly or partially)
  - text — The text to speak
  - instruct — (Optional) Style/emotion override for "designed" and "custom" profiles

Save the script to a temp file:

cat > /tmp/my_script.json << 'EOF'
{ ... }
EOF

Check that required profiles exist — Run list first. If profiles are missing, create them first using Mode 2 or Mode 3.
Run the conversation command:
```
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/my_script.json --play
```
IMPORTANT: Use timeout of 300000ms — multi-segment generation can take several minutes.

Options:
- --output-dir DIR / -o DIR — Where to save segments + combined (default: /tmp/voicebox_{title_slug})
- --gap 0.5 — Override gap between segments (overrides script value)
- --quality standard — Use 0.6B models for faster generation (the default, high, uses 1.7B models)
- --trim-silence (default) / --no-trim-silence — ffmpeg silence trimming on each segment
- --play / --no-play — Play the combined result when done
Report results — Show the per-segment durations and combined total from the script output.

How to write good conversation scripts

Keep lines short — 1-3 sentences per line works best. Split long monologues into multiple lines.
Use instruct for emotion — For designed and custom profiles, add "instruct": "excited tone" or "instruct": "whispering" to override the default voice style per-line.
Mix profile types — You can freely mix designed, custom, and cloned profiles in the same script.
Gap tuning — 0.15-0.25s for fast dialogue, 0.4-0.6s for dramatic pauses, 0.8-1.0s for scene breaks.

Example: News broadcast

{
  "title": "Evening News",
  "gap": 0.3,
  "lines": [
    {"profile": "News Anchor", "text": "Good evening. Tonight's top story: a breakthrough in renewable energy."},
    {"profile": "Cheerful Girl", "text": "Thanks, Tom! I'm here at the research lab where scientists made the announcement earlier today.", "instruct": "excited field reporting tone"},
    {"profile": "Aiden", "text": "This discovery could fundamentally change how we think about solar power. The efficiency gains are remarkable."},
    {"profile": "News Anchor", "text": "Fascinating. We'll have more on this story after the break."}
  ]
}

Requirements

ffmpeg must be installed for --trim-silence (default: on). Use --no-trim-silence if ffmpeg is unavailable.
All profiles referenced in the script must already exist.
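
Because every referenced profile must already exist, it can help to validate a script before starting a long generation run. The sketch below checks the required line fields and profile names against a known set. `validate_script` is a hypothetical helper. It uses exact name matching, which is stricter than the partial matching the conversation command accepts.

```python
def validate_script(script: dict, known_profiles: set) -> list:
    """Return a list of human-readable problems; empty means the script is OK."""
    errors = []
    lines = script.get("lines") or []
    if not lines:
        errors.append("script has no lines")
    for i, line in enumerate(lines, start=1):
        for field in ("profile", "text"):
            if not line.get(field):
                errors.append(f"line {i}: missing {field}")
        profile = line.get("profile")
        if profile and profile not in known_profiles:
            errors.append(f"line {i}: unknown profile {profile!r}")
    return errors
```

The set of known profile names would come from the output of the `list` command; how that output is parsed is left out of this sketch.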

Script Commands Reference

# List all profiles
uv run $SKILL_DIR/scripts/voicebox.py list

# List available models and quality tiers
uv run $SKILL_DIR/scripts/voicebox.py models

# Create designed voice profile
uv run $SKILL_DIR/scripts/voicebox.py create-designed "Name" --desc "description" --lang en

# Create custom voice profile (preset speaker)
uv run $SKILL_DIR/scripts/voicebox.py create-custom "Name" <speaker>

# List available preset speakers
uv run $SKILL_DIR/scripts/voicebox.py speakers

# Create cloned voice profile (from existing audio file)
uv run $SKILL_DIR/scripts/voicebox.py create-cloned "Name" --audio /path/to.wav --ref-text "transcript" --lang en

# Record from microphone and clone (with known transcript)
uv run $SKILL_DIR/scripts/voicebox.py record "Name" --duration 10 --lang en --ref-text "what I said"

# Record from microphone and clone (auto-transcribe, high quality ASR)
uv run $SKILL_DIR/scripts/voicebox.py record "Name" --duration 10 --lang en --quality high

# Transcribe an audio file (built-in, no external skill needed)
uv run $SKILL_DIR/scripts/transcribe.py /path/to/audio.wav

# Transcribe with high quality ASR (1.7B model)
uv run $SKILL_DIR/scripts/transcribe.py /path/to/audio.wav --model Qwen/Qwen3-ASR-1.7B

# Generate speech
uv run $SKILL_DIR/scripts/voicebox.py generate "Name" "text" --play

# Generate with high quality (1.7B clone model)
uv run $SKILL_DIR/scripts/voicebox.py generate "Name" "text" --play --quality high

# Generate with style override
uv run $SKILL_DIR/scripts/voicebox.py generate "Name" "text" --instruct "angry" --play

# Generate a multi-speaker conversation from JSON script
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/script.json --play

# Conversation with custom gap and no silence trimming
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/script.json --gap 0.5 --no-trim-silence -o /tmp/my_show

# Conversation with high quality models
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/script.json --quality high --play

# Delete a profile
uv run $SKILL_DIR/scripts/voicebox.py delete "Name"

Quality Tiers

All TTS and recording commands default to --quality high (1.7B). Use --quality standard for faster 0.6B models:

| Tier | Clone Model | ASR Model | RAM Needed | Trade-off |
| --- | --- | --- | --- | --- |
| high (default) | 1.7B (~3.5GB) | 1.7B (~3.5GB) | ~8GB+ | Better quality |
| standard | 0.6B (~1.5GB) | 0.6B (~1.5GB) | ~4GB+ | Faster, less RAM |

Voice Design always uses 1.7B (only available size).

When the user asks for "faster" or "lighter", use --quality standard.
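
The tier selection rule above amounts to adding one flag when the request asks for speed. A minimal sketch, assuming the request is available as plain text; `quality_flags` is a hypothetical helper:

```python
def quality_flags(request: str) -> list:
    """Return extra CLI flags for the requested speed/quality trade-off."""
    text = request.lower()
    if "faster" in text or "lighter" in text:
        return ["--quality", "standard"]
    # high is the default tier, so no flag is needed.
    return []
```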

Voice Description Guide

Quick Rules

Be Specific — Use "deep", "crisp", "fast-paced", not "nice" or "good"
Multi-Dimensional — Combine gender + age + emotion + speaking style
Be Objective — Describe voice features, not preferences
Be Original — Never request celebrity imitations
Be Concise — Every word should add meaning

Good Examples

"Calm middle-aged male with deep magnetic voice, medium pace, warm and trustworthy"
"Young adult female with crisp energetic tone, fast pace, cheerful and engaging"
"Senior male narrator with slow rich baritone, composed delivery, for audiobook"
"Angry adult female with sharp intense voice, fast aggressive pace, furious and commanding"

Supported Languages

English (en), Chinese (zh), Japanese (ja), Korean (ko), German (de), French (fr), Russian (ru), Portuguese (pt), Spanish (es), Italian (it). Default is English.

Error Handling

| Error | Action |
| --- | --- |
| No matching profile | Show available profiles, offer to create one |
| No profiles exist | Offer to create one using Mode 2 |
| Model not yet downloaded | Inform the user; it auto-downloads on first use (~3.5GB per 1.7B model) |
| Generation fails | Show the error message and check that mlx-audio is installed |
| `uv` not accessible from sandbox | Use the Task tool with a general-purpose subagent |

First-Time Download

On first use, models are downloaded from HuggingFace (~3.5GB each for 1.7B). The scripts detect this automatically and print:

First-time setup: downloading <model> (~3.5GB)...
This is a one-time download — future runs will be instant.

IMPORTANT for Claude: When running any voicebox command for the first time (or after clearing the HF cache), warn the user that the first run will take several minutes to download models. Use a timeout of 300000ms (5 minutes) for all generation, recording, and transcription commands. Subsequent runs load from cache and are much faster.
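
When the commands are driven from Python rather than from an agent's shell tool, the 300000 ms recommendation translates to a 300-second `subprocess` timeout. The wrapper below is a sketch under that assumption. `run_voicebox` is a hypothetical helper, and the `runner` parameter exists only so the call can be exercised without `uv` installed.

```python
import subprocess

TIMEOUT_MS = 300_000  # 5 minutes, as recommended above


def run_voicebox(args, skill_dir, timeout_ms=TIMEOUT_MS, runner=subprocess.run):
    """Run a voicebox.py subcommand with the recommended timeout."""
    cmd = ["uv", "run", f"{skill_dir}/scripts/voicebox.py", *args]
    # subprocess timeouts are given in seconds; subprocess.run raises
    # subprocess.TimeoutExpired if the command runs longer than that.
    return runner(cmd, capture_output=True, text=True,
                  timeout=timeout_ms / 1000)
```

For example, `run_voicebox(["generate", "Calm Narrator", "hello", "--play"], skill_dir)` would run the default generate command with a five-minute limit.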

Implementation Notes

Audio output is WAV format at 24000 Hz sample rate
Models are cached at ~/.cache/huggingface/hub/
The mlx-audio package auto-installs via uv on first run
If the Bash sandbox blocks access to the skill directory (for example ~/.claude/skills/), use the Task tool with the general-purpose subagent type, which gets a fresh shell
Clean up temp files after playing: rm -f /tmp/voicebox_output.wav