name: voicebox
version: 1.0.0
author: zanearcher
category: audio
description: "Text-to-speech voice toolkit. DEFAULT ACTION: When called with text (e.g. /voicebox hello), IMMEDIATELY run: uv run $SKILL_DIR/scripts/voicebox.py generate \"Calm Narrator\" \"<text>\" --play. Do NOT ask questions, do NOT greet the user; just generate and play the speech. Also supports: voice cloning, multi-speaker conversations, recording, and transcription."
user_invocable: true
Voicebox TTS Skill
Path convention: All commands below use $SKILL_DIR to mean this skill's base directory. When Claude loads a skill, the system prompt includes "Base directory for this skill: <path>"; use that path. Example: if installed at ~/.agent/skills/voicebox/, then $SKILL_DIR = ~/.agent/skills/voicebox.
Standalone text-to-speech using mlx-audio. Supports custom voice design (from text descriptions) and voice cloning (from audio samples). No external app required.
CRITICAL: Default Action Rule
When /voicebox is followed by plain text (not a keyword like "create", "clone", "transcribe", "record", "list", "delete"), the DEFAULT action is ALWAYS to generate speech immediately using the "Calm Narrator" profile. Do NOT ask follow-up questions. Do NOT present a menu of options. Just run the generate command and play the audio.
Example: /voicebox hello → immediately run:
uv run $SKILL_DIR/scripts/voicebox.py generate "Calm Narrator" "hello" --play
Only route to other modes when the input contains explicit keywords:
- "create" / "design" → Mode 2 (create profile)
- "clone" / "record" → Mode 3/4 (clone voice)
- "transcribe" → Mode 5 (transcription)
- "conversation" / "drama" / "audiobook" → Mode 6 (multi-speaker)
- "list" / "delete" / "speakers" / "models" → Management commands
Everything else = generate speech immediately.
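The routing rule above can be sketched as a small keyword lookup. This is only an illustration: the `route` function and the mode labels are hypothetical names, and two details are assumptions not stated in the skill: quoted arguments are treated as speech text and ignored for routing, and overlapping keywords are resolved in the listed order.

```python
import re

# Keyword groups copied from the routing list above; the mode labels are
# hypothetical names chosen for this sketch.
ROUTES = [
    ({"create", "design"}, "create_profile"),
    ({"clone", "record"}, "clone_voice"),
    ({"transcribe"}, "transcribe"),
    ({"conversation", "drama", "audiobook"}, "conversation"),
    ({"list", "delete", "speakers", "models"}, "management"),
]


def route(user_input: str) -> str:
    """Return the mode for a /voicebox invocation, defaulting to generate."""
    # Assumption: anything inside double quotes is text to speak, not a keyword.
    unquoted = re.sub(r'"[^"]*"', " ", user_input)
    words = set(re.findall(r"[a-z]+", unquoted.lower()))
    for keywords, mode in ROUTES:
        if words & keywords:
            return mode
    return "generate"


print(route("hello"))
print(route("clone my voice"))
```

Matching whole words rather than substrings keeps plain text such as "recorded" from triggering the "record" route.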
Usage
Generate speech
/voicebox "Hello world" (uses default: Calm Narrator)
/voicebox "Calm Narrator" "Hello world" (explicit profile)
/voicebox "angry tone" "My Voice" "I can't believe this!" (style + profile)
Create profiles
/voicebox create a calm narrator voice profile (designed - from description)
/voicebox create a custom profile using Ryan (custom - preset speaker)
/voicebox clone my voice from /path/to/audio.wav (cloned - from audio file)
/voicebox clone my voice (record from mic + clone)
Transcribe audio/video
/voicebox transcribe /path/to/audio.wav
/voicebox transcribe /path/to/video.mp4
Generate multi-speaker conversation
/voicebox create a news broadcast with anchor, reporter, and expert
/voicebox make a conversation between Calm Narrator and Cheerful Girl
/voicebox generate a drama scene with these characters...
Trigger phrases (activates this skill automatically)
- "clone my voice", "record my voice", "create a voice clone"
- "transcribe this", "transcribe audio", "transcribe video"
- "create a conversation", "make a drama", "generate a dialogue", "audiobook"
- Any request involving audio/video transcription or speech-to-text
Architecture
Four model categories (three TTS, one ASR) with quality tiers:
| Category | High (default) | Standard | Use Case |
|---|---|---|---|
| Voice Design | Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 | (same; only 1.7B exists) | Custom voices from description |
| Voice Clone | Qwen3-TTS-12Hz-1.7B-Base-bf16 | Qwen3-TTS-12Hz-0.6B-Base-bf16 | Clone a real voice |
| Custom Voice | Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16 | Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16 | 9 preset premium speakers with style control |
| ASR (Transcription) | Qwen/Qwen3-ASR-1.7B | Qwen/Qwen3-ASR-0.6B | Speech-to-text |
All commands accept --quality standard or --quality high (default) to select model tier.
All state is in the skill's data/ directory (default: $SKILL_DIR/data/):
- profiles.json: profile registry
- samples/: WAV files for reference audio
Script: scripts/voicebox.py (relative to skill root)
Default Profiles
| Context | Default Profile | Type | Best For |
|---|---|---|---|
| Overall default (English text) | Calm Narrator | Designed | Neutral narration |
| CustomVoice Chinese | Dylan | Custom (speaker: dylan) | Chinese with emotion |
| CustomVoice English | Aiden | Custom (speaker: aiden) | English with emotion |
| CustomVoice Japanese | Ono Anna | Custom (speaker: ono_anna) | Japanese with emotion |
When the user runs /voicebox "some text" with only text (a single quoted argument), use Calm Narrator by default.
When the text requires emotional delivery or the user requests a specific emotion (angry, sad, excited, etc.), prefer CustomVoice defaults by language: Dylan for Chinese, Aiden for English, Ono Anna for Japanese. CustomVoice supports --instruct for emotion control.
Mode 1: Generate Speech
Step-by-step workflow
Parse arguments — Count the quoted arguments:
- One quoted arg → Text only. Use the default profile ("Calm Narrator").
- Two quoted args → First is profile name, second is text to speak.
- Three quoted args → First is style/instruct, second is profile name, third is text.
Find the profile — Look up the profile name in profiles.json (case-insensitive, partial match OK).
Emotion segmentation (OPT-IN ONLY — never do this by default):
By default, long text is generated as a single pass with generate. Do NOT automatically split text into emotional segments. Only activate emotion segmentation when the user explicitly requests it with phrases like:
- "with emotions", "auto-emotion", "segment emotions", "emotional delivery"
- "split by emotion", "vary the emotions", "make it expressive"
If the user does NOT ask for emotion segmentation, skip this entire step and go straight to step 4 (simple generate), regardless of text length.
When the user explicitly requests emotion segmentation:
a. Analyze the text — Identify distinct emotional segments (e.g., excitement → worry → determination → gratitude).
b. Split into segments — Break the text at natural emotion boundaries (sentence or clause level).
c. Tag each segment with an appropriate instruct emotion from this palette:
- Joy/Excitement: "excited, enthusiastic, joyful tone", "ecstatic, breathless with joy"
- Anger: "angry, furious, intense tone", "irritated, sharp and impatient"
- Sadness/Sorrow: "sorrowful, heartbroken, tearful voice", "melancholic, quiet and reflective"
- Fear/Panic: "panicked, terrified, desperate tone", "anxious, nervous, trembling voice"
- Calm/Neutral: "calm, steady, composed delivery", "warm and reassuring tone"
- Surprise: "shocked, astonished, wide-eyed disbelief"
- Determination: "determined, resolute, firm and confident"
- Tenderness: "gentle, tender, soft-spoken with warmth"
d. Generate a conversation script. CRITICAL: USE THE SAME SINGLE PROFILE FOR EVERY LINE.
THIS IS NOT A MULTI-SPEAKER CONVERSATION. Emotion segmentation is ONE person speaking with varying emotions. Every "profile" value in the script MUST be identical: the same profile the user selected (or the default). Only the "instruct" field varies between lines. Do NOT assign different profiles to different segments.
{
  "title": "emotional_speech",
  "gap": 0.15,
  "lines": [
    {"profile": "Dylan", "text": "Segment 1 text...", "instruct": "excited, enthusiastic tone"},
    {"profile": "Dylan", "text": "Segment 2 text...", "instruct": "worried, anxious, trembling voice"},
    {"profile": "Dylan", "text": "Segment 3 text...", "instruct": "determined, resolute, firm tone"},
    {"profile": "Dylan", "text": "Segment 4 text...", "instruct": "gentle, grateful, warm and tender"}
  ]
}
Note: ALL four lines above use "Dylan", the same profile. This is intentional and required.
e. Use the conversation command instead of single generate:
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/emotional_script.json --play --gap 0.15
IMPORTANT: Emotion segmentation vs Multi-speaker:
- Emotion segmentation (this section): OPT-IN only. ONE voice, varying emotions. Same profile on every line.
- Multi-speaker conversation (Mode 6): MULTIPLE voices, different profiles per line. Only when user asks for conversation/dialogue/drama.
- Default for long text: Single generate call. No segmentation. No splitting.
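The single-profile rule for emotion segmentation can be made hard to violate by building the script from one profile name plus a list of (text, instruct) pairs. The sketch below is an illustration only; `build_emotion_script` and `uses_single_profile` are hypothetical helpers, not part of voicebox.py.

```python
import json


def build_emotion_script(profile, segments, gap=0.15, title="emotional_speech"):
    """Build a one-voice script: the same profile on every line, varying instruct.

    `segments` is a list of (text, instruct) pairs.
    """
    lines = [
        {"profile": profile, "text": text, "instruct": instruct}
        for text, instruct in segments
    ]
    return {"title": title, "gap": gap, "lines": lines}


def uses_single_profile(script):
    """True when every line of the script names the same profile."""
    return len({line["profile"] for line in script["lines"]}) <= 1


script = build_emotion_script(
    "Dylan",
    [
        ("We won the contract!", "excited, enthusiastic, joyful tone"),
        ("But the deadline is very tight.", "anxious, nervous, trembling voice"),
    ],
)
print(json.dumps(script, indent=2))
```

Running `uses_single_profile` on a hand-written script before calling the conversation command catches an accidental multi-speaker script early.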
Simple generate (short text or no emotion needed):
uv run $SKILL_DIR/scripts/voicebox.py generate "Profile Name" "text to speak" --play
With style override (short emotional text):
uv run $SKILL_DIR/scripts/voicebox.py generate "Profile Name" "text to speak" --instruct "angry tone" --play
With high quality (1.7B model) for cloned voices:
uv run $SKILL_DIR/scripts/voicebox.py generate "Profile Name" "text to speak" --quality high --play
IMPORTANT: Use timeout of 300000ms. Model loading + generation takes time on first run.
Report result — Tell the user the audio was generated and played. Show duration and profile used.
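The argument-counting and profile-lookup steps of this workflow can be sketched as follows. This is a minimal illustration that assumes the arguments arrive as a double-quoted string; `parse_generate_args` and `find_profile` are hypothetical names, and the tie-breaking order (exact match first, then first partial match) is an assumption, since the skill only says partial matches are acceptable.

```python
import shlex


def parse_generate_args(raw, default_profile="Calm Narrator"):
    """Map 1, 2, or 3 quoted arguments to (instruct, profile, text)."""
    args = shlex.split(raw)
    if len(args) == 1:
        return None, default_profile, args[0]
    if len(args) == 2:
        return None, args[0], args[1]
    if len(args) == 3:
        return args[0], args[1], args[2]
    raise ValueError(f"expected 1 to 3 quoted arguments, got {len(args)}")


def find_profile(name, profile_names):
    """Case-insensitive lookup: exact match wins, then the first partial match."""
    wanted = name.lower()
    for candidate in profile_names:
        if candidate.lower() == wanted:
            return candidate
    for candidate in profile_names:
        if wanted in candidate.lower():
            return candidate
    return None


instruct, profile, text = parse_generate_args('"Calm Narrator" "Hello world"')
print(instruct, profile, text)
```

Unquoted multi-word text would be split into several arguments by `shlex.split`, so this sketch only applies to the quoted forms shown under Usage.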
If no profiles exist
Offer to create one using Mode 2.
Mode 2: Create Voice Profile from Description (Designed)
When the user says "create a ... voice profile":
Parse the voice description from the user's request.
Build a rich voice description — Expand the user's short description into a detailed multi-dimensional voice prompt:
- Template: [Age] [gender] with a [pitch] [characteristic] voice, [speaking rate] pace, [emotion/tone], suitable for [use case]
- Example: "calm narrator" → "Calm middle-aged male narrator with a deep warm baritone voice, slow measured pace, soothing and trustworthy tone, suitable for audiobook narration"
- Be specific: use "deep", "crisp", "fast-paced", not vague words like "nice"
Choose a sample text that matches the voice emotion:
- Neutral: "The morning sun rose gently over the quiet village, casting golden light across the cobblestone streets. Birds sang their morning songs as the world slowly came to life."
- Angry: "I told you a hundred times not to do that. This is absolutely unacceptable and I will not stand for it anymore."
- Cheerful: "Hey everyone, welcome back! I have some amazing news to share with you today, and I just can't wait to get started!"
Create the profile:
uv run $SKILL_DIR/scripts/voicebox.py create-designed "Calm Narrator" \
  --desc "Calm middle-aged male narrator with a deep warm baritone voice, slow measured pace, soothing and trustworthy tone, suitable for audiobook narration" \
  --lang en
IMPORTANT: Use timeout of 300000ms.
Confirm — Tell the user the profile was created and is ready to use.
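The description template from this workflow can be filled mechanically once each dimension has been chosen. The sketch below is illustrative; `build_voice_description` and `find_vague_words` are hypothetical helpers, and the vague-word list contains only the words this document names as examples.

```python
# Words this document calls out as too vague for a voice description.
VAGUE_WORDS = {"nice", "good"}


def build_voice_description(age, gender, pitch, characteristic, rate, tone, use_case):
    """Fill the template: [Age] [gender] with a [pitch] [characteristic] voice, ..."""
    return (
        f"{age} {gender} with a {pitch} {characteristic} voice, "
        f"{rate} pace, {tone}, suitable for {use_case}"
    )


def find_vague_words(description):
    """Return any vague words found, in order of appearance."""
    words = description.lower().replace(",", " ").split()
    return [word for word in words if word in VAGUE_WORDS]


description = build_voice_description(
    "Calm middle-aged", "male narrator", "deep", "warm baritone",
    "slow measured", "soothing and trustworthy tone", "audiobook narration",
)
print(description)
```

With these inputs the result is the same string used in the create-designed example in this section.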
Profile Naming Convention
- Derive from the user's description, capitalize as title: "Calm Narrator", "Angry Woman"
- Keep it short (2-3 words)
Mode 2b: Create Voice Profile from Preset Speaker (Custom)
When the user wants to use a premium preset voice (no description or audio needed):
Show available speakers if the user hasn't picked one:
- Chinese: serena, vivian, uncle_fu, dylan, eric
- English: ryan, aiden
- Japanese: ono_anna
- Korean: sohee
Create the profile:
uv run $SKILL_DIR/scripts/voicebox.py create-custom "Profile Name" <speaker>
Language is auto-detected from speaker. Optional --desc for a label.
Generate with style override: Custom profiles support --instruct for emotion/style:
uv run $SKILL_DIR/scripts/voicebox.py generate "Profile Name" "text" --instruct "angry tone" --play
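One way to picture "language is auto-detected from speaker" is a reverse lookup over the preset speaker list. How voicebox.py actually does this is not shown here, so the sketch below is an assumption; `SPEAKER_LANGUAGES` and `language_for_speaker` are hypothetical names, while the speaker names and language codes come from this document.

```python
# Preset speakers grouped by language, as listed in this section.
SPEAKER_LANGUAGES = {
    "zh": ["serena", "vivian", "uncle_fu", "dylan", "eric"],
    "en": ["ryan", "aiden"],
    "ja": ["ono_anna"],
    "ko": ["sohee"],
}


def language_for_speaker(speaker):
    """Return the language code for a preset speaker name (case-insensitive)."""
    key = speaker.strip().lower()
    for lang, speakers in SPEAKER_LANGUAGES.items():
        if key in speakers:
            return lang
    raise ValueError(f"unknown preset speaker: {speaker!r}")


print(language_for_speaker("Ryan"))
```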
When to use Custom vs Designed vs Cloned
- Custom — Fast setup, consistent high-quality preset voices, supports style instructions
- Designed — Unlimited creativity, describe any voice you want
- Cloned — Reproduce a specific real person's voice from audio
Mode 3: Create Voice Profile from Audio File (Cloned)
When the user says "clone my voice from /path/to/file.wav" or provides an audio file:
Get the audio file path and a transcript of what was said in the recording.
If no transcript provided, auto-transcribe using the built-in transcription:
uv run $SKILL_DIR/scripts/transcribe.py /path/to/audio.wav
Only ask the user as a last resort.
Create the profile:
uv run $SKILL_DIR/scripts/voicebox.py create-cloned "My Voice" \
  --audio /path/to/sample.wav \
  --ref-text "transcript of what was said" \
  --lang en
Confirm: Tell the user the profile was created.
Mode 4: Record from Microphone and Clone (IMPORTANT)
This mode activates when the user says "clone my voice", "record my voice", "I want to clone a voice", or any request to clone without providing an audio file.
Step-by-step workflow
Ask the user for a profile name (or derive one like "My Voice", "[User's Name]'s Voice").
Ask what they'd like to say, or suggest a good sample sentence:
- "The morning sun rose gently over the quiet village, casting golden light across the cobblestone streets."
- Or let them say anything — 5-15 seconds of clear speech works best.
Confirm they're ready, then record and auto-clone in one command:
uv run $SKILL_DIR/scripts/voicebox.py record "My Voice" --duration 10 --lang en
- Default is 10 seconds. Adjust with --duration if the user wants more/less.
- If the user already knows what they'll say, pass it: --ref-text "what they said" (skips transcription)
- Without --ref-text, the command auto-transcribes using the built-in transcribe.py (Qwen3-ASR); no external skill needed!
- IMPORTANT: Use timeout of 300000ms.
Play back the recording so the user can verify:
afplay $SKILL_DIR/data/samples/<slug>.wav
Confirm: Tell the user the profile was created and is ready to use with /voicebox "My Voice" "text to speak".
Requirements
- ffmpeg must be installed (brew install ffmpeg)
- macOS microphone permission must be granted to the terminal app
Mode 5: Transcribe Audio/Video
This mode activates when the user says "transcribe this", "transcribe audio/video", provides an audio/video file for transcription, or any speech-to-text request.
Step-by-step workflow
Get the file path from the user's request.
Run transcription:
uv run $SKILL_DIR/scripts/transcribe.py /path/to/file.wav
With optional language:
uv run $SKILL_DIR/scripts/transcribe.py /path/to/file.wav --language zh
IMPORTANT: Use timeout of 300000ms.
Return the transcript to the user.
Supported formats
- Audio: wav, mp3, flac, m4a, ogg, aac, wma
- Video: mp4, mkv, mov, avi, webm, m4v, flv, wmv (ffmpeg extracts audio automatically)
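Before running transcription it can help to check whether a path has one of the supported extensions. The sketch below uses the two extension lists above; `media_kind` is a hypothetical helper, and matching on the file extension alone (rather than inspecting the file) is an assumption.

```python
from pathlib import Path

# Extension lists copied from the supported formats above.
AUDIO_EXTS = {"wav", "mp3", "flac", "m4a", "ogg", "aac", "wma"}
VIDEO_EXTS = {"mp4", "mkv", "mov", "avi", "webm", "m4v", "flv", "wmv"}


def media_kind(path):
    """Return "audio", "video", or None based on the file extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    return None


print(media_kind("/path/to/video.mp4"))
```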
Supported languages
52 languages with auto-detection including: English, Chinese (+ dialects), Japanese, Korean, German, French, Spanish, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, Indonesian, and more.
Mode 6: Generate Conversation / Audiobook / Drama
This mode activates when the user asks for a multi-speaker conversation, dialogue, drama, audiobook with multiple characters, news broadcast, or any scenario involving multiple voice profiles speaking in sequence.
Step-by-step workflow
Create a JSON script file based on the user's request. The script format is:
{
  "title": "Evening News",
  "gap": 0.25,
  "lines": [
    {"profile": "News Anchor", "text": "Good evening and welcome to the six o'clock news."},
    {"profile": "Cheerful Girl", "text": "Thanks, Tom! I'm here live at the scene.", "instruct": "excited field reporting tone"},
    {"profile": "Aiden", "text": "Well, this is actually quite common in my experience."}
  ]
}
Script fields:
- title: Name for the output directory and combined file
- gap: Silence between segments in seconds (default: 0.25)
- lines: Array of dialogue lines, each with:
  - profile: Name of an existing voice profile (must match exactly or partially)
  - text: The text to speak
  - instruct: (Optional) Style/emotion override for "designed" and "custom" profiles
Save the script to a temp file:
cat > /tmp/my_script.json << 'EOF'
{ ... }
EOF
Check that required profiles exist: Run list first. If profiles are missing, create them first using Mode 2 or Mode 3.
Run the conversation command:
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/my_script.json --play
IMPORTANT: Use timeout of 300000ms. Multi-segment generation can take several minutes.
Options:
- --output-dir DIR / -o DIR: Where to save segments + combined (default: /tmp/voicebox_{title_slug})
- --gap 0.5: Override gap between segments (overrides script value)
- --quality high: Use 1.7B models for better quality
- --trim-silence (default) / --no-trim-silence: ffmpeg silence trimming on each segment
- --play / --no-play: Play the combined result when done
Report results — Show the per-segment durations and combined total from the script output.
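The script-writing and profile-check steps of this workflow can be sketched in Python instead of a heredoc. This is an illustration under stated assumptions: `validate_script` and `save_script` are hypothetical helpers, the list of known profile names is passed in by the caller (for example, taken from the output of the list command), and partial matching is approximated with a case-insensitive substring test.

```python
import json
import tempfile


def validate_script(script, known_profiles):
    """Return a list of problems; an empty list means the script looks usable."""
    problems = []
    if not script.get("title"):
        problems.append("missing title")
    lines = script.get("lines") or []
    if not lines:
        problems.append("no lines")
    known = [name.lower() for name in known_profiles]
    for i, line in enumerate(lines, start=1):
        if not line.get("text"):
            problems.append(f"line {i}: missing text")
        profile = str(line.get("profile", "")).lower()
        if not profile or not any(profile in name for name in known):
            problems.append(f"line {i}: unknown profile {line.get('profile')!r}")
    return problems


def save_script(script):
    """Write the script to a temporary .json file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as fh:
        json.dump(script, fh, ensure_ascii=False, indent=2)
        return fh.name


script = {
    "title": "Evening News",
    "gap": 0.25,
    "lines": [
        {"profile": "News Anchor", "text": "Good evening."},
        {"profile": "Ghost", "text": ""},
    ],
}
print(validate_script(script, ["News Anchor", "Aiden"]))
```

The path returned by `save_script` can then be passed to the conversation command in place of /tmp/my_script.json.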
How to write good conversation scripts
- Keep lines short — 1-3 sentences per line works best. Split long monologues into multiple lines.
- Use instruct for emotion: For designed profiles, add "instruct": "excited tone" or "instruct": "whispering" to override the default voice description per-line.
- Mix profile types: You can freely mix designed and cloned profiles in the same script.
- Gap tuning — 0.15-0.25s for fast dialogue, 0.4-0.6s for dramatic pauses, 0.8-1.0s for scene breaks.
Example: News broadcast
{
"title": "Evening News",
"gap": 0.3,
"lines": [
{"profile": "News Anchor", "text": "Good evening. Tonight's top story: a breakthrough in renewable energy."},
{"profile": "Cheerful Girl", "text": "Thanks, Tom! I'm here at the research lab where scientists made the announcement earlier today.", "instruct": "excited field reporting tone"},
{"profile": "Aiden", "text": "This discovery could fundamentally change how we think about solar power. The efficiency gains are remarkable."},
{"profile": "News Anchor", "text": "Fascinating. We'll have more on this story after the break."}
]
}
Requirements
- ffmpeg must be installed for --trim-silence (default: on). Use --no-trim-silence if ffmpeg is unavailable.
- All profiles referenced in the script must already exist.
Script Commands Reference
# List all profiles
uv run $SKILL_DIR/scripts/voicebox.py list
# List available models and quality tiers
uv run $SKILL_DIR/scripts/voicebox.py models
# Create designed voice profile
uv run $SKILL_DIR/scripts/voicebox.py create-designed "Name" --desc "description" --lang en
# Create custom voice profile (preset speaker)
uv run $SKILL_DIR/scripts/voicebox.py create-custom "Name" <speaker>
# List available preset speakers
uv run $SKILL_DIR/scripts/voicebox.py speakers
# Create cloned voice profile (from existing audio file)
uv run $SKILL_DIR/scripts/voicebox.py create-cloned "Name" --audio /path/to.wav --ref-text "transcript" --lang en
# Record from microphone and clone (with known transcript)
uv run $SKILL_DIR/scripts/voicebox.py record "Name" --duration 10 --lang en --ref-text "what I said"
# Record from microphone and clone (auto-transcribe, high quality ASR)
uv run $SKILL_DIR/scripts/voicebox.py record "Name" --duration 10 --lang en --quality high
# Transcribe an audio file (built-in, no external skill needed)
uv run $SKILL_DIR/scripts/transcribe.py /path/to/audio.wav
# Transcribe with high quality ASR (1.7B model)
uv run $SKILL_DIR/scripts/transcribe.py /path/to/audio.wav --model Qwen/Qwen3-ASR-1.7B
# Generate speech
uv run $SKILL_DIR/scripts/voicebox.py generate "Name" "text" --play
# Generate with high quality (1.7B clone model)
uv run $SKILL_DIR/scripts/voicebox.py generate "Name" "text" --play --quality high
# Generate with style override
uv run $SKILL_DIR/scripts/voicebox.py generate "Name" "text" --instruct "angry" --play
# Generate a multi-speaker conversation from JSON script
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/script.json --play
# Conversation with custom gap and no silence trimming
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/script.json --gap 0.5 --no-trim-silence -o /tmp/my_show
# Conversation with high quality models
uv run $SKILL_DIR/scripts/voicebox.py conversation /tmp/script.json --quality high --play
# Delete a profile
uv run $SKILL_DIR/scripts/voicebox.py delete "Name"
Quality Tiers
All TTS and recording commands default to --quality high (1.7B). Use --quality standard for faster 0.6B models:
| Tier | Clone Model | ASR Model | RAM Needed | Trade-off |
|---|---|---|---|---|
| high (default) | 1.7B (~3.5GB) | 1.7B (~3.5GB) | ~8GB+ | Better quality |
| standard | 0.6B (~1.5GB) | 0.6B (~1.5GB) | ~4GB+ | Faster, less RAM |
Voice Design always uses 1.7B (only available size).
When the user asks for "faster" or "lighter", use --quality standard.
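The tier rules can be summarized as a lookup table plus a one-line decision. The model names below come from the Architecture table; the dictionary keys and the `pick_quality` and `model_for` helpers are hypothetical, and the substring check for "faster" or "lighter" is a simplification of how a request would really be interpreted.

```python
# Model names per category and tier, copied from the Architecture table.
MODELS = {
    "voice_design": {
        "high": "Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16",
        "standard": "Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16",  # only 1.7B exists
    },
    "voice_clone": {
        "high": "Qwen3-TTS-12Hz-1.7B-Base-bf16",
        "standard": "Qwen3-TTS-12Hz-0.6B-Base-bf16",
    },
    "custom_voice": {
        "high": "Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16",
        "standard": "Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16",
    },
    "asr": {
        "high": "Qwen/Qwen3-ASR-1.7B",
        "standard": "Qwen/Qwen3-ASR-0.6B",
    },
}


def pick_quality(request):
    """Return "standard" when the user asks for faster or lighter, else "high"."""
    lowered = request.lower()
    if "faster" in lowered or "lighter" in lowered:
        return "standard"
    return "high"


def model_for(category, quality="high"):
    """Look up the model name for a category and quality tier."""
    return MODELS[category][quality]


print(model_for("voice_clone", pick_quality("make it faster please")))
```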
Voice Description Guide
Quick Rules
- Be Specific — Use "deep", "crisp", "fast-paced", not "nice" or "good"
- Multi-Dimensional — Combine gender + age + emotion + speaking style
- Be Objective — Describe voice features, not preferences
- Be Original — Never request celebrity imitations
- Be Concise — Every word should add meaning
Good Examples
"Calm middle-aged male with deep magnetic voice, medium pace, warm and trustworthy"
"Young adult female with crisp energetic tone, fast pace, cheerful and engaging"
"Senior male narrator with slow rich baritone, composed delivery, for audiobook"
"Angry adult female with sharp intense voice, fast aggressive pace, furious and commanding"
Supported Languages
English (en), Chinese (zh), Japanese (ja), Korean (ko), German (de), French (fr), Russian (ru), Portuguese (pt), Spanish (es), Italian (it). Default is English.
Error Handling
| Error | Action |
|---|---|
| No matching profile | Show available profiles, offer to create one |
| No profiles exist | Offer to create one using Mode 2 |
| Model not yet downloaded | Inform user, it auto-downloads on first use (~3.5GB for 1.7B models) |
| Generation fails | Show error message, check mlx-audio is installed |
| uv not accessible from sandbox | Use Task tool with general-purpose subagent |
First-Time Download
On first use, models are downloaded from HuggingFace (~3.5GB each for 1.7B). The scripts detect this automatically and print:
First-time setup: downloading <model> (~3.5GB)...
This is a one-time download — future runs will be instant.
IMPORTANT for Claude: When running any voicebox command for the first time (or after clearing the HF cache), warn the user that the first run will take several minutes to download models. Use a timeout of 300000ms (5 minutes) for all generation, recording, and transcription commands. Subsequent runs load from cache and are much faster.
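When the commands are driven from a script rather than an agent tool call, the same 300000ms limit can be applied with `subprocess.run`, which takes its timeout in seconds. The sketch below is an assumption about one way to wrap the CLI; `build_command` and `run_voicebox` are hypothetical helpers, and only the command layout comes from this document.

```python
import subprocess

TIMEOUT_MS = 300_000  # 5 minutes, as recommended above


def build_command(skill_dir, *args):
    """Assemble the argv list for a voicebox.py call."""
    script = f"{skill_dir.rstrip('/')}/scripts/voicebox.py"
    return ["uv", "run", script, *args]


def run_voicebox(skill_dir, *args, timeout_ms=TIMEOUT_MS):
    """Run a voicebox command, raising TimeoutExpired after timeout_ms."""
    return subprocess.run(
        build_command(skill_dir, *args),
        timeout=timeout_ms / 1000,  # subprocess.run expects seconds
        check=True,
    )


print(build_command("~/.agent/skills/voicebox", "generate", "Calm Narrator", "hello", "--play"))
```

Passing the arguments as a list avoids shell quoting problems with profile names and text that contain spaces or apostrophes. Note that a literal "~" is not expanded without a shell, so a real call would pass an absolute path.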
Implementation Notes
- Audio output is WAV format at 24000 Hz sample rate
- Models are cached at ~/.cache/huggingface/hub/
- The mlx-audio package auto-installs via uv on first run
- If the Bash sandbox blocks access to ~/.claude/skills/, use a Task tool with general-purpose subagent type, which gets a fresh shell
- Clean up temp files after playing: rm -f /tmp/voicebox_output.wav