impressions

Add and use custom voices for VoiceMode TTS via local mlx-audio. Use when the user wants to clone a voice, do an impression, add a reference clip, or use voice="<name>" in converse.

By mbailey. Updated 5/4/2026.

name: impressions
description: Add and use custom voices for VoiceMode TTS via local mlx-audio. Use when the user wants to clone a voice, do an impression, add a reference clip, or use voice="<name>" in converse.

Impressions

Make VoiceMode speak in any voice. A local Qwen3-TTS model, running on top of mlx-audio, takes a short reference clip and synthesises fresh speech in that voice.

Status: Preview / experimental. Apple Silicon only. Opt-in.

When to use this skill

  • User asks for "voice cloning", "do an impression", "speak as X", "add my voice"
  • A voice= argument in voicemode:converse doesn't match a known Kokoro voice
  • User wants to install or troubleshoot the mlx-audio service
  • User asks how to configure a remote mlx-audio server

Quick start

# 1. Install the local TTS service (one-time, Apple Silicon only)
voicemode service install mlx-audio

# 2. Add a voice from a reference clip
voicemode clone add fleabag ~/Downloads/fleabag-clip.wav

# 3. Use it
voicemode converse --voice fleabag

In the MCP converse tool, pass voice="fleabag" -- VoiceMode auto-routes any voice that matches a profile in VOICEMODE_VOICES_DIR to mlx-audio instead of Kokoro / OpenAI.

Reference clip requirements

voicemode clone add validates the input before doing any expensive work:

  • Duration: 3-9 seconds (5-9s sweet spot). Clips outside this window are rejected with an actionable error.
  • Mono speech, no music or cross-talk. The model copies what it hears -- including hum, music beds, laugh tracks, and overlapping speakers.
  • Any input format accepted. WAV, MP3, M4A, etc. -- ffmpeg normalises whatever you hand it.
  • Output is always mono 24 kHz 16-bit PCM with loudnorm I=-16 TP=-1.5 LRA=11. This is the canonical voice-lab format; the original input is replaced by this normalised render at default.wav.
  • ALWAYS pair the clip with its transcript. The model conditions on the reference text; without one it runs speech recognition on the clip itself, and any mis-hearing (especially on noisy or vintage audio) corrupts the conditioning. The symptom is stammering or stuttered synthesis. There are three ways to supply the text: voicemode clone add auto-transcribes into voice.md (verify it and correct mis-hearings by hand); voice-lab's sayas reads <clip>.txt next to each wav; the MCP converse tool takes ref_text alongside a clip-path voice. (Root-caused on VL-50, 2026-06-11: 1977 Doctor Who clips stammered until transcripts were supplied, after which the output was "much better!!!".)
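The duration window and the canonical output format above can be sketched in a few lines of Python. This is an illustration only, not VoiceMode's actual implementation: the function names, the error wording, and the ordering of the ffmpeg arguments are assumptions. The ffmpeg options themselves (-ac, -ar, -c:a pcm_s16le, and the loudnorm filter) are standard.

```python
from pathlib import Path

# Limits and loudnorm targets taken from the requirements list above.
MIN_SECONDS = 3.0
MAX_SECONDS = 9.0
LOUDNORM = "loudnorm=I=-16:TP=-1.5:LRA=11"


def check_duration(seconds: float) -> None:
    """Raise ValueError with an actionable message if the clip is out of range.

    Hypothetical helper; the real CLI's error text may differ.
    """
    if seconds < MIN_SECONDS:
        raise ValueError(
            f"clip is {seconds:.1f}s; need at least {MIN_SECONDS:.0f}s of speech"
        )
    if seconds > MAX_SECONDS:
        raise ValueError(
            f"clip is {seconds:.1f}s; trim to {MAX_SECONDS:.0f}s or less, "
            "e.g. ffmpeg -i in.wav -ss 0 -t 8 out.wav"
        )


def normalise_command(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg argv that renders mono 24 kHz 16-bit PCM with loudnorm."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-af", LOUDNORM,      # loudness normalisation
        "-ac", "1",           # mono
        "-ar", "24000",       # 24 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(dst),
    ]
```

Passing the returned list to subprocess.run(..., check=True) would perform the render. Building the argv as a list avoids shell quoting problems with paths that contain spaces.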

Trimming a too-long clip

If your source is longer than 9 seconds, trim with the same one-liner the runtime error suggests:

ffmpeg -i in.wav -ss 0 -t 8 out.wav
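To decide whether a trim is needed before calling voicemode clone add, the clip length can be measured first. The sketch below uses Python's standard wave module, which only reads PCM WAV files (other formats would need ffprobe or similar). Both helper names are hypothetical; the trim argv mirrors the one-liner above.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def trim_command(src: Path, dst: Path, seconds: float = 8.0) -> list[str]:
    """Build the ffmpeg argv that keeps the first `seconds` of `src`."""
    return ["ffmpeg", "-i", str(src), "-ss", "0", "-t", f"{seconds:g}", str(dst)]
```

A caller might run the trim only when wav_duration_seconds(clip) exceeds 9, and adjust -ss to skip past leading silence or music rather than always starting at 0.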

On-disk layout

Voices live as directories under ~/.voicemode/voices/<name>/:

~/.voicemode/voices/fleabag/
├── default.wav        # required: 3-9s of clean reference audio, mono 24kHz 16-bit PCM
└── voice.md           # auto-generated by `voicemode clone add` -- name, source, duration, format, transcript

voice.md carries YAML front matter with name, source (original input path), duration_seconds, format (literal mono 24kHz 16-bit PCM, loudnorm I=-16 TP=-1.5 LRA=11), and transcript. It documents what the clip is and where it came from.

voices.json at the voices root is retained as a legacy index -- voicemode clone add writes an entry pointing at <name>/default.wav so older consumers keep working. Prefer the directory layout above for new work.

Multiple WAVs are allowed alongside default.wav; symlink whichever one is "active" to default.wav. A directory with multiple WAVs and no default.wav is treated as a sample bin and skipped.
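The layout rules above suggest a simple discovery pass, sketched here in Python. This is an assumption about how a consumer could walk the directory, not VoiceMode's own code. The function names and the "unknown" label (for a directory with no default.wav and at most one WAV, a case the rules above do not cover) are hypothetical.

```python
from pathlib import Path


def classify(profile: Path) -> str:
    """Classify one directory under the voices root."""
    # Path.exists() follows symlinks, so a default.wav symlink that points
    # at another clip in the same directory counts as present.
    if (profile / "default.wav").exists():
        return "voice"
    if len(list(profile.glob("*.wav"))) > 1:
        return "sample-bin"  # several WAVs, none marked active: skipped
    return "unknown"         # not covered by the documented rules


def usable_voices(root: Path) -> list[str]:
    """Return the sorted names of directories that have a default.wav."""
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and classify(p) == "voice"
    )
```

Calling usable_voices(Path("~/.voicemode/voices").expanduser()) would list the names that can be passed as voice= under these assumptions.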

Picking a clip

5-9 seconds of clean conversational speech beats 30 seconds of noisy podcast audio, because the model reproduces hum, music beds, and laugh tracks along with the voice. See docs/finding-samples.md for ranking heuristics, an mlx-whisper word-timestamp ranker concept, and ffmpeg loudnorm recipes.

Configuration

Variable                       Default                                        Purpose
VOICEMODE_VOICES_DIR           ~/.voicemode/voices                            Where voice profiles live
VOICEMODE_REMOTE_VOICES_DIR    (unset)                                        Path on remote mlx-audio host (path translation)
VOICEMODE_MLX_AUDIO_BASE_URL   http://127.0.0.1:8890/v1                       OpenAI-compatible mlx-audio endpoint
VOICEMODE_IMPRESSIONS_MODEL    mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16    Hugging Face model ID
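As a rough illustration of how these variables and defaults fit together, the Python sketch below loads them into one object. The class and function names are hypothetical. The to_remote_path helper is only a guess at what "path translation" means here: swapping the local voices root for the remote one. The real behaviour may differ.

```python
import os
from dataclasses import dataclass
from pathlib import Path, PurePosixPath
from typing import Mapping, Optional


@dataclass(frozen=True)
class ImpressionsConfig:
    voices_dir: Path
    remote_voices_dir: Optional[str]
    base_url: str
    model: str


def load_config(env: Mapping[str, str] = os.environ) -> ImpressionsConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return ImpressionsConfig(
        voices_dir=Path(
            env.get("VOICEMODE_VOICES_DIR", "~/.voicemode/voices")
        ).expanduser(),
        remote_voices_dir=env.get("VOICEMODE_REMOTE_VOICES_DIR") or None,
        base_url=env.get("VOICEMODE_MLX_AUDIO_BASE_URL", "http://127.0.0.1:8890/v1"),
        model=env.get(
            "VOICEMODE_IMPRESSIONS_MODEL",
            "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16",
        ),
    )


def to_remote_path(local_clip: Path, cfg: ImpressionsConfig) -> str:
    """ASSUMPTION: translate a local clip path by swapping the voices root."""
    if cfg.remote_voices_dir is None:
        return str(local_clip)
    relative = local_clip.relative_to(cfg.voices_dir)
    return str(PurePosixPath(cfg.remote_voices_dir) / relative.as_posix())
```

Taking the environment as a mapping argument keeps the loader easy to test without touching the process environment.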

Deprecated aliases (one release only)

The unreleased 8.7.0 candidate used VOICEMODE_CLONE_* names. They're honoured in 8.7.x with a one-shot deprecation warning and removed in 8.8.0:

Deprecated                  Use instead
VOICEMODE_CLONE_BASE_URL    VOICEMODE_MLX_AUDIO_BASE_URL
VOICEMODE_CLONE_MODEL       VOICEMODE_IMPRESSIONS_MODEL
VOICEMODE_CLONE_PORT        VOICEMODE_MLX_AUDIO_PORT

If you see those in a user's voicemode.env, suggest updating them.
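The alias handling can be pictured as a small rename pass over the environment, sketched below in Python. It is not the shipped code. The function name is hypothetical, and the rule that the new name wins when both old and new are set is an assumption.

```python
import warnings

# Mapping taken from the deprecation table above.
DEPRECATED_ALIASES = {
    "VOICEMODE_CLONE_BASE_URL": "VOICEMODE_MLX_AUDIO_BASE_URL",
    "VOICEMODE_CLONE_MODEL": "VOICEMODE_IMPRESSIONS_MODEL",
    "VOICEMODE_CLONE_PORT": "VOICEMODE_MLX_AUDIO_PORT",
}


def resolve_aliases(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with deprecated names mapped to their replacements."""
    resolved = dict(env)
    for old, new in DEPRECATED_ALIASES.items():
        if old not in resolved:
            continue
        warnings.warn(
            f"{old} is deprecated; use {new}", DeprecationWarning, stacklevel=2
        )
        value = resolved.pop(old)
        # ASSUMPTION: an explicitly set new name takes precedence.
        resolved.setdefault(new, value)
    return resolved
```

The same mapping is enough to generate the suggested edits for a user's voicemode.env: each line that starts with an old name can be rewritten to start with the new one.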

Footguns

  • Missing reference transcript = stammering. A clip without its transcript forces the model to run speech recognition on the reference itself; on anything but clean modern audio the recognition mis-hears, and the synthesis stutters. Fix: put <clip>.txt beside the wav (sayas), pass ref_text in converse, or correct the transcript: field in voice.md. See "Reference clip requirements".
  • Kokoro name collisions -- naming a voice af_sky (or any other Kokoro voice name) shadows the Kokoro voice. Pick distinctive names like fleabag, mike-2026, bryan_morning.
  • Apple Silicon only -- no fallback for Intel Macs / Linux / Windows. Don't suggest installing mlx-audio on those platforms.
  • First synthesis is slow -- ~3.4 GB model download on first call. Warn the user.
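One way to catch the name-collision footgun before a profile is created is a pattern check, sketched below in Python. It is a heuristic, not VoiceMode behaviour. It assumes Kokoro voice names follow the shape seen in af_sky: a language letter, then f or m, an underscore, and a short name. A check against the actual Kokoro voice list would be more reliable.

```python
import re

# ASSUMPTION: Kokoro voices look like "<lang letter><f|m>_<name>", e.g. af_sky.
KOKORO_LIKE = re.compile(r"^[a-z][fm]_[a-z0-9]+$")


def looks_like_kokoro_voice(name: str) -> bool:
    """Return True if `name` has the shape of a Kokoro voice identifier."""
    return KOKORO_LIKE.match(name) is not None
```

Under this heuristic af_sky is flagged, while fleabag, mike-2026, and bryan_morning pass, which matches the naming advice in the list above.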

Deep dives

  • docs/setup.md -- install path, model quants table, remote mlx-audio config, troubleshooting.
  • docs/finding-samples.md -- clip ranking heuristic, ffmpeg loudnorm recipe, link to voice-lab.

Install via CLI
npx skills add https://github.com/mbailey/voicemode --skill impressions