kokoro-tts - SKILL.md Agent Skill

name: kokoro-tts description: "Generate speech audio from text using local TTS models (mlx-audio pocket-tts or Kokoro-ONNX). Use when user asks to generate a voice message, text-to-speech, speak something aloud, create an audio clip from text, or send a voice reply. Also use for iMessage voice bubble replies. Runs entirely locally — free, no API keys. NOT for speech-to-text or transcription (that's whisper/STT)."

kokoro-tts — Local Text-to-Speech

Project: https://github.com/lucasnewman/kokoro-onnx (Kokoro-ONNX) Alt: https://github.com/ml-explore/mlx-audio (mlx-audio pocket-tts)

Generate speech audio from text using local models on Apple Silicon. Zero cost, no API keys, no network required.

Models

Model	Library	Size	Notes
pocket-tts-4bit	mlx-audio	~200MB	No named voices, auto-detects language. Fastest on Apple Silicon.
kokoro-v1.0	kokoro-onnx	~82MB	Named voices (af_bella, am_puck, ef_dora, etc.), explicit language selection.

Both produce high-quality speech at near-realtime speeds on M-series Macs.

Prerequisites

mlx-audio (recommended)

pip install mlx-audio soundfile numpy

Kokoro-ONNX (alternative)

pip install kokoro-onnx soundfile numpy
# Models auto-download to ~/.cache/kokoro-onnx/ on first use

For iMessage voice bubbles (CAF/Opus format)

macOS afconvert is required (pre-installed on all Macs).

Quick Usage

Generate WAV audio

# mlx-audio
from mlx_audio.tts.utils import load_model
import soundfile as sf
import numpy as np

model = load_model("mlx-community/pocket-tts-4bit")
results = list(model.generate("Hello, this is a test."))
sf.write("output.wav", np.array(results[0].audio), results[0].sample_rate)

# Kokoro-ONNX
from kokoro_onnx import Kokoro
import soundfile as sf

kokoro = Kokoro("~/.cache/kokoro-onnx/kokoro-v1.0.onnx",
                "~/.cache/kokoro-onnx/voices-v1.0.bin")
samples, sr = kokoro.create("Hello, this is a test.", voice="af_bella", speed=1.0)
sf.write("output.wav", samples, sr)

Generate iMessage Voice Bubble (CAF/Opus)

iMessage voice messages require CAF container with Opus codec at 48kHz mono. Use macOS afconvert:

# Step 1: Generate WAV (using either model above)
# Step 2: Resample to 48kHz mono
afconvert input.wav -o resampled.wav -d LEI16 -c 1 -r 48000
# Step 3: Convert to CAF/Opus
afconvert resampled.wav -o voice_reply.caf -f caff -d opus -b 32000

The output .caf file plays as a native voice bubble in iMessage (with waveform, not as a file attachment).

Prepend silence (recommended)

iMessage's audio player clips the first ~100ms. Prepend 150ms silence:

silence = np.zeros(int(sample_rate * 0.15), dtype=samples.dtype)
samples = np.concatenate([silence, samples])

Sending via iMessage

After generating the .caf file, send as a voice message:

message action=upload-file channel=imessage target=<recipient>
  filePath=/path/to/voice_reply.caf
  filename="Audio Message.caf"
  contentType="audio/x-caf"

Always include a text reply alongside voice messages for accessibility.

Kokoro-ONNX Voice Reference

Voice ID	Description	Language
af_bella	Female, warm	English
af_heart	Female, gentle	English
am_puck	Male, clear	English
am_adam	Male, deep	English
ef_dora	Female	Spanish
em_alex	Male	Spanish

Full list: https://github.com/lucasnewman/kokoro-onnx#voices

Tips

Speed control: Kokoro-ONNX supports speed= param (0.5-2.0). mlx-audio pocket-tts does not.
Language: pocket-tts auto-detects. Kokoro requires explicit lang= param ("en-us", "es", etc.).
Batch: For long text, split into sentences and generate each separately for lower latency.
Streaming: For real-time applications, generate sentence-by-sentence as LLM output streams in.
Quality: Both models produce natural-sounding speech. Kokoro has more voice variety; pocket-tts is simpler.