name: impressions
description: Add and use custom voices for VoiceMode TTS via local mlx-audio. Use when the user wants to clone a voice, do an impression, add a reference clip, or use voice="" in converse.
Impressions
Make VoiceMode speak in any voice. The model takes a short reference clip and synthesises fresh speech in that voice via local Qwen3-TTS on top of mlx-audio.
Status: Preview / experimental. Apple Silicon only. Opt-in.
When to use this skill
- User asks for "voice cloning", "do an impression", "speak as X", "add my voice"
- A `voice=` argument in `voicemode:converse` doesn't match a known Kokoro voice
- User wants to install or troubleshoot the `mlx-audio` service
- User asks how to configure a remote mlx-audio server
Quick start
# 1. Install the local TTS service (one-time, Apple Silicon only)
voicemode service install mlx-audio
# 2. Add a voice from a reference clip
voicemode clone add fleabag ~/Downloads/fleabag-clip.wav
# 3. Use it
voicemode converse --voice fleabag
In the MCP converse tool, pass voice="fleabag" -- VoiceMode auto-routes any voice that matches a profile in VOICEMODE_VOICES_DIR to mlx-audio instead of Kokoro / OpenAI.
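The routing rule above can be pictured with a small sketch. It assumes a voice "matches a profile" when `<voices dir>/<name>/default.wav` exists. The function name `pick_tts_backend` and the `"default"` label for the Kokoro / OpenAI path are hypothetical, not VoiceMode's actual API.

```python
from pathlib import Path


def pick_tts_backend(voice: str, voices_dir: Path) -> str:
    """Return "mlx-audio" when `voice` names a profile directory, else "default".

    Hypothetical helper: "default" stands in for the Kokoro / OpenAI path.
    """
    if not voice:
        return "default"
    # Assumption: a profile counts only if its reference clip is present.
    reference = voices_dir / voice / "default.wav"
    return "mlx-audio" if reference.is_file() else "default"
```

With a `fleabag/default.wav` profile on disk, `pick_tts_backend("fleabag", voices_dir)` picks mlx-audio, while an unknown name such as `af_sky` falls through to the existing providers.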
Reference clip requirements
voicemode clone add validates the input before doing any expensive work:
- Duration: 3-9 seconds (5-9s sweet spot). Clips outside this window are rejected with an actionable error.
- Mono speech, no music or cross-talk. The model copies what it hears -- including hum, music beds, laugh tracks, and overlapping speakers.
- Any input format accepted. WAV, MP3, M4A, etc. -- ffmpeg normalises whatever you hand it.
- Output is always mono 24 kHz 16-bit PCM with loudnorm I=-16 TP=-1.5 LRA=11. This is the canonical voice-lab format; the original input is replaced by this normalised render at `default.wav`.
- ALWAYS pair the clip with its transcript. The model conditions on the reference text; without one it ASRs the clip itself, and any mis-hearing (noisy or vintage audio especially) corrupts the conditioning -- the symptom is stammering / stuttered synthesis. `voicemode clone add` auto-transcribes into `voice.md` (verify it -- correct mis-hearings by hand); voice-lab's `sayas` reads `<clip>.txt` next to each wav; the MCP `converse` tool takes `ref_text` alongside a clip-path `voice`. (Root-caused on VL-50, 2026-06-11: 1977 Doctor Who clips stammered until transcripts were supplied -- then "much better!!!".)
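To pre-check a clip before handing it to `voicemode clone add`, the duration rule can be sketched as below. This is a minimal illustration, not the tool's own validator: it reads only WAV input via the standard-library `wave` module (the real command accepts any format through ffmpeg), and the helper names and error wording are hypothetical.

```python
import wave

MIN_SECONDS = 3.0  # lower bound of the accepted window
MAX_SECONDS = 9.0  # upper bound of the accepted window


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from frame count and sample rate."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def check_duration(seconds: float) -> None:
    """Raise ValueError with an actionable hint when the clip is out of range."""
    if seconds < MIN_SECONDS:
        raise ValueError(
            f"clip is {seconds:.1f}s; need at least {MIN_SECONDS:.0f}s of speech"
        )
    if seconds > MAX_SECONDS:
        raise ValueError(
            f"clip is {seconds:.1f}s; trim it first, "
            "e.g. ffmpeg -i in.wav -ss 0 -t 8 out.wav"
        )
```

Call `check_duration(wav_duration_seconds("clip.wav"))` and surface the exception message to the user if it raises.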
Trimming a too-long clip
If your source is longer than 9 seconds, trim with the same one-liner the runtime error suggests:
ffmpeg -i in.wav -ss 0 -t 8 out.wav
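If you want to trim and produce the canonical format in one pass, the following sketch builds an ffmpeg command line from the values stated in this skill (mono, 24 kHz, 16-bit PCM, loudnorm I=-16 TP=-1.5 LRA=11). The flags are standard ffmpeg options, but whether `voicemode clone add` uses exactly this invocation is an assumption, and `normalise_command` is a hypothetical helper.

```python
def normalise_command(src: str, dst: str, max_seconds: float = 8.0) -> list[str]:
    """Build an ffmpeg argv that trims `src` and renders it in the voice-lab format."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ss", "0", "-t", str(max_seconds),      # keep the first max_seconds
        "-ac", "1",                              # mono
        "-ar", "24000",                          # 24 kHz sample rate
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # loudness normalisation
        "-c:a", "pcm_s16le",                     # 16-bit PCM
        dst,
    ]
```

Run it with `subprocess.run(normalise_command("in.wav", "out.wav"), check=True)`; passing an argv list avoids shell-quoting problems with paths that contain spaces.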
On-disk layout
Voices live as directories under ~/.voicemode/voices/<name>/:
~/.voicemode/voices/fleabag/
├── default.wav # required: 3-9s of clean reference audio, mono 24kHz 16-bit PCM
└── voice.md # auto-generated by `voicemode clone add` -- name, source, duration, format, transcript
voice.md carries YAML front matter with name, source (original input path), duration_seconds, format (literal mono 24kHz 16-bit PCM, loudnorm I=-16 TP=-1.5 LRA=11), and transcript. It documents what the clip is and where it came from.
voices.json at the voices root is retained as a legacy index -- voicemode clone add writes an entry pointing at <name>/default.wav so older consumers keep working. Prefer the directory layout above for new work.
Multiple WAVs are allowed alongside default.wav; symlink whichever one is "active" to default.wav. A directory with multiple WAVs and no default.wav is treated as a sample bin and skipped.
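The layout rules can be sketched as two small helpers: one lists directories that qualify as voices (those with a `default.wav`, so sample bins are skipped), the other points `default.wav` at the active clip. Both names are hypothetical and this is not VoiceMode's own loader. Refusing to overwrite a real `default.wav` is a safety choice made here, not documented behaviour.

```python
from pathlib import Path


def list_voices(voices_dir: Path) -> list[str]:
    """Names of directories that contain a default.wav (symlinks are followed)."""
    names = []
    for entry in sorted(voices_dir.iterdir()):
        if entry.is_dir() and (entry / "default.wav").is_file():
            names.append(entry.name)
    return names


def activate(voice_dir: Path, wav_name: str) -> None:
    """Make default.wav a relative symlink to one of the WAVs in voice_dir."""
    if not (voice_dir / wav_name).is_file():
        raise FileNotFoundError(wav_name)
    link = voice_dir / "default.wav"
    if link.is_symlink():
        link.unlink()  # replace an earlier choice
    elif link.exists():
        # Never clobber a real reference clip.
        raise FileExistsError(f"{link} is a regular file, not a symlink")
    link.symlink_to(wav_name)
```

A directory holding only `a.wav` and `b.wav` is absent from `list_voices` until `activate(dir, "a.wav")` creates the symlink.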
Picking a clip
5-9 seconds of clean conversational speech beats 30 seconds of noisy podcast audio. The model copies what it hears -- including hum, music beds, and laugh tracks. See docs/finding-samples.md for ranking heuristics, an mlx-whisper word-timestamp ranker concept, and ffmpeg loudnorm recipes.
Configuration
| Variable | Default | Purpose |
|---|---|---|
| `VOICEMODE_VOICES_DIR` | `~/.voicemode/voices` | Where voice profiles live |
| `VOICEMODE_REMOTE_VOICES_DIR` | (unset) | Path on remote mlx-audio host (path translation) |
| `VOICEMODE_MLX_AUDIO_BASE_URL` | `http://127.0.0.1:8890/v1` | OpenAI-compatible mlx-audio endpoint |
| `VOICEMODE_IMPRESSIONS_MODEL` | `mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16` | Hugging Face model ID |
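As a rough illustration of how these settings might combine, the sketch below resolves a variable against the defaults in the table and translates a local clip path for a remote host. The translation rule (swap the local voices root for `VOICEMODE_REMOTE_VOICES_DIR`, keep the relative part) is an assumption about what "path translation" means here. `setting` and `remote_clip_path` are hypothetical names.

```python
from pathlib import Path, PurePosixPath

# Defaults copied from the table above.
DEFAULTS = {
    "VOICEMODE_VOICES_DIR": "~/.voicemode/voices",
    "VOICEMODE_MLX_AUDIO_BASE_URL": "http://127.0.0.1:8890/v1",
    "VOICEMODE_IMPRESSIONS_MODEL": "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16",
}


def setting(name: str, env: dict) -> "str | None":
    """Value from env, else the documented default, else None."""
    return env.get(name) or DEFAULTS.get(name)


def remote_clip_path(local_clip: Path, env: dict) -> str:
    """Rewrite a local clip path for the remote host (assumed semantics)."""
    remote_root = env.get("VOICEMODE_REMOTE_VOICES_DIR")
    if not remote_root:
        return str(local_clip)  # local server: no translation needed
    local_root = Path(setting("VOICEMODE_VOICES_DIR", env)).expanduser()
    relative = local_clip.relative_to(local_root)
    return str(PurePosixPath(remote_root) / relative.as_posix())
```

Pass `os.environ` as `env` in real use; taking a plain mapping keeps the helper easy to test.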
Deprecated aliases (one release only)
The unreleased 8.7.0 candidate used VOICEMODE_CLONE_* names. They're honoured in 8.7.x with a one-shot deprecation warning and removed in 8.8.0:
| Deprecated | Use instead |
|---|---|
| `VOICEMODE_CLONE_BASE_URL` | `VOICEMODE_MLX_AUDIO_BASE_URL` |
| `VOICEMODE_CLONE_MODEL` | `VOICEMODE_IMPRESSIONS_MODEL` |
| `VOICEMODE_CLONE_PORT` | `VOICEMODE_MLX_AUDIO_PORT` |
If you see those in a user's voicemode.env, suggest updating them.
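The alias behaviour described above (new name wins, old name still works, warn once) could look roughly like this. It is a sketch of the pattern, not VoiceMode's actual implementation. `resolve` and the module-level `_warned` set are hypothetical.

```python
import warnings

# Mapping copied from the table above: deprecated name -> replacement.
DEPRECATED_ALIASES = {
    "VOICEMODE_CLONE_BASE_URL": "VOICEMODE_MLX_AUDIO_BASE_URL",
    "VOICEMODE_CLONE_MODEL": "VOICEMODE_IMPRESSIONS_MODEL",
    "VOICEMODE_CLONE_PORT": "VOICEMODE_MLX_AUDIO_PORT",
}
_warned = set()  # aliases already reported in this process


def resolve(name: str, env: dict) -> "str | None":
    """Look up `name`; fall back to its deprecated alias with a one-time warning."""
    if name in env:
        return env[name]  # the new name always wins
    for old, new in DEPRECATED_ALIASES.items():
        if new == name and old in env:
            if old not in _warned:
                _warned.add(old)
                warnings.warn(
                    f"{old} is deprecated; use {new}",
                    DeprecationWarning,
                    stacklevel=2,
                )
            return env[old]
    return None
```

The same mapping can drive the suggestion to the user: any key of `DEPRECATED_ALIASES` found in `voicemode.env` should be renamed to its value.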
Footguns
- Missing reference transcript = stammering. A clip without its transcript forces the model to ASR the reference itself; on anything but clean modern audio that mis-hears, and the synthesis stutters. Fix: `<clip>.txt` beside the wav (`sayas`), `ref_text` in converse, corrected `transcript:` in `voice.md`. See "Reference clip requirements".
- Kokoro name collisions -- naming a voice `af_sky` (or any other Kokoro voice name) shadows the Kokoro voice. Pick distinctive names like `fleabag`, `mike-2026`, `bryan_morning`.
- Apple Silicon only -- no fallback for Intel Macs / Linux / Windows. Don't suggest installing mlx-audio on those platforms.
- First synthesis is slow -- ~3.4 GB model download on first call. Warn the user.
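A quick guard against the name-collision footgun might look like the sketch below. Only `af_sky` comes from this skill. The two-letter-prefix pattern is an assumption about how Kokoro voice names are shaped, so treat a match as a hint rather than proof. `name_warning` is a hypothetical helper.

```python
import re

# Illustrative subset only; the real Kokoro voice list is longer.
KNOWN_KOKORO_VOICES = {"af_sky"}
# Assumption: Kokoro names are two lowercase letters, an underscore, then a name.
KOKORO_STYLE = re.compile(r"^[a-z]{2}_[a-z]+$")


def name_warning(name: str) -> "str | None":
    """Return a warning string when `name` may shadow a Kokoro voice, else None."""
    if name in KNOWN_KOKORO_VOICES:
        return f"'{name}' shadows a Kokoro voice; pick a more distinctive name"
    if KOKORO_STYLE.match(name):
        return f"'{name}' looks like a Kokoro voice name and may collide"
    return None
```

Names such as `fleabag`, `mike-2026`, and `bryan_morning` pass cleanly because none of them starts with a two-letter prefix followed by an underscore.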
Deep dives
- docs/setup.md -- install path, model quants table, remote mlx-audio config, troubleshooting.
- docs/finding-samples.md -- clip ranking heuristic, ffmpeg loudnorm recipe, link to voice-lab.
Related
- Impressions guide -- user-facing prose version of this skill.
- VoiceMode skill -- primary voice interaction skill.
- voice-lab -- companion repo for curating reference clips and personas.