name: gemini-tts-narrator-tf
description: Narrate text into a high-quality MP3 audiobook using Google Cloud Text-to-Speech API with Gemini TTS models. Accepts local file paths (desktop or VPS), URLs, or raw text. Produces native MP3 with warm narration style, smart paragraph chunking with pause markup, and multi-model support. Trigger on "narrate", "read aloud", "audiobook", "TTS this story", "convert to audio", "read this to me", or "narrator".
Gemini TTS Narrator
Goal
Convert text into a high-quality narrated MP3 audio file using Google Cloud Text-to-Speech API with Gemini TTS models.
Success Criteria
- Input: local file path (.txt, .md), URL, or inline text
- Output: single MP3 file saved to `work_products/` or specified path
- Audio is clear, naturally paced, and covers the full input text
- Long texts are chunked with `[medium pause]` markup between paragraphs
Context
Canonical reference: `docs/03_Operations/123_Gemini_TTS_Source_Of_Truth_2026-04-22.md`
- API: Google Cloud Text-to-Speech API (`texttospeech.googleapis.com`), NOT AI Studio (`generativelanguage.googleapis.com`); see doc 123 for why
- Auth: Service account key at `/opt/universal_agent/.gcp-tts-sa-key.json` (VPS) or gcloud ADC (desktop)
- Default model: `gemini-3.1-flash-tts-preview` (newest, best controllability, requires `global` endpoint)
- Fallback models: `gemini-2.5-flash-tts` (GA, fast), `gemini-2.5-pro-tts` (highest quality)
- Voices: Aoede (female, warm/narrative, default), Charon (male, deep), Kore, Leda, etc. (30 total)
- Output: Native MP3, no ffmpeg conversion needed
- Limits: 4000 bytes text + 4000 bytes prompt per request (separate fields)
- Chunking: Texts >3800 bytes split at paragraph then sentence boundaries
- GCP Project: `gen-lang-client-0229532959`
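The chunking rule in the list above (split at paragraph boundaries first, then at sentence boundaries, keeping each chunk under 3800 UTF-8 bytes) can be sketched roughly as follows. This is an illustrative sketch, not the code in `narrate_gemini.py`: the function names, the greedy packing strategy, and the naive sentence regex are all assumptions.

```python
import re

MAX_CHUNK_BYTES = 3800  # safety margin under the 4000-byte per-field limit


def _byte_len(s: str) -> int:
    """Length of s in UTF-8 bytes (the API limit is in bytes, not characters)."""
    return len(s.encode("utf-8"))


def _split_sentences(paragraph: str) -> list[str]:
    # Naive split (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def chunk_text(text: str, max_bytes: int = MAX_CHUNK_BYTES) -> list[str]:
    """Greedily pack paragraphs, then sentences, into chunks of <= max_bytes.

    A single sentence longer than max_bytes is emitted as-is; a real
    implementation would need a further fallback for that case.
    """
    chunks: list[str] = []
    current = ""

    def add(piece: str, sep: str) -> None:
        nonlocal current
        candidate = piece if not current else current + sep + piece
        if _byte_len(candidate) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece

    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if _byte_len(para) <= max_bytes:
            add(para, "\n\n")
        else:
            # Oversized paragraph: fall back to sentence boundaries.
            sep = "\n\n"
            for sentence in _split_sentences(para):
                add(sentence, sep)
                sep = " "
    if current:
        chunks.append(current)
    return chunks
```

Measuring length with `encode("utf-8")` rather than `len()` matters for non-ASCII text, where one character can occupy several bytes.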
Execution Steps
- Read input: Resolve the source (file path, URL, or raw text).
  - Desktop paths like `/home/kjdragan/lrepos/universal_agent/...` are automatically mapped to VPS paths `/opt/universal_agent/...`
- Validate: Confirm the source exists and is readable. Block if missing.
- Run `narrate_gemini.py`:
  GOOGLE_APPLICATION_CREDENTIALS=/opt/universal_agent/.gcp-tts-sa-key.json \
  uv run .claude/skills/gemini-tts-narrator-tf/scripts/narrate_gemini.py <input> -o <output.mp3> -v Aoede
- Fallback: If Cloud TTS fails, try with the `gemini-2.5-flash-tts` model:
  uv run .claude/skills/gemini-tts-narrator-tf/scripts/narrate_gemini.py <input> -o <output.mp3> -m gemini-2.5-flash-tts
- Deliver: You MUST deliver the resulting `.mp3` file to Kevin via email. Use `mcp__internal__agentmail_send_with_local_attachments` to email it to `kevinjdragan@gmail.com` using his contact info from `USER.md`.
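The read and validate steps above might look something like the sketch below. It is a hypothetical illustration: `map_desktop_path`, `classify_source`, and the "looks like a path" heuristic are assumptions, not the script's actual logic. Only the two path prefixes come from this document.

```python
from pathlib import Path

DESKTOP_ROOT = "/home/kjdragan/lrepos/universal_agent"
VPS_ROOT = "/opt/universal_agent"


def map_desktop_path(path: str) -> str:
    """Rewrite a desktop repo path to its VPS equivalent; leave others unchanged."""
    if path == DESKTOP_ROOT or path.startswith(DESKTOP_ROOT + "/"):
        return VPS_ROOT + path[len(DESKTOP_ROOT):]
    return path


def _looks_like_path(source: str) -> bool:
    # Heuristic (assumption): single line that starts like a path
    # or ends with a supported text extension.
    if "\n" in source:
        return False
    return source.startswith(("/", "./", "~")) or source.endswith((".txt", ".md"))


def classify_source(source: str) -> str:
    """Return 'url', 'file', or 'text'; raise if a file path is missing."""
    if source.startswith(("http://", "https://")):
        return "url"
    if _looks_like_path(source):
        mapped = map_desktop_path(source)
        if not Path(mapped).expanduser().is_file():
            # Block immediately, reporting the exact path that was needed.
            raise FileNotFoundError(f"Source file not found: {mapped}")
        return "file"
    return "text"
```

Raising with the mapped path in the message keeps the failure consistent with the rule that a missing source blocks the task and names the exact path required.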
Script Options (narrate_gemini.py)
narrate_gemini.py <input> [-o OUTPUT] [-v VOICE] [-m MODEL] [--prompt PROMPT] [--language LANG]
input File path, URL, or raw text
-o, --output Output MP3 path (default: <input>_narrated.mp3)
-v, --voice Voice name (default: Aoede)
-m, --model TTS model (default: gemini-3.1-flash-tts-preview)
--prompt Custom narration style prompt
--language Language code (default: en-US)
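For readers who want to see how these options fit together, here is a minimal `argparse` sketch of the same interface. Whether the real script uses `argparse` is an assumption; the flag names and defaults are taken from the usage text above, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented narrate_gemini.py options."""
    parser = argparse.ArgumentParser(
        prog="narrate_gemini.py",
        description="Narrate text to MP3 with Gemini TTS (illustrative sketch).",
    )
    parser.add_argument("input", help="File path, URL, or raw text")
    # The documented default is <input>_narrated.mp3; how the script derives
    # it is not specified here, so this sketch leaves it as None.
    parser.add_argument("-o", "--output", default=None, help="Output MP3 path")
    parser.add_argument("-v", "--voice", default="Aoede", help="Voice name")
    parser.add_argument(
        "-m", "--model", default="gemini-3.1-flash-tts-preview", help="TTS model"
    )
    parser.add_argument("--prompt", default=None, help="Custom narration style prompt")
    parser.add_argument("--language", default="en-US", help="Language code")
    return parser
```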
⏳ Long-Running Execution (CRITICAL)
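The TTS script takes 30-90 seconds PER CHUNK. A typical article produces 5-10 chunks, so a run usually takes 5-10 minutes total; at the stated per-chunk range the bounds are roughly 2.5 minutes (5 fast chunks) to 15 minutes (10 slow chunks).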
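As a quick check of the timing arithmetic, the following sketch multiplies the chunk count by the 30-90 second per-chunk range. The helper names are hypothetical, and the 600-second default is simply 10 minutes expressed in seconds.

```python
def estimate_runtime_seconds(num_chunks: int, per_chunk=(30, 90)) -> tuple:
    """Return (best_case, worst_case) total runtime in seconds."""
    fastest, slowest = per_chunk
    return num_chunks * fastest, num_chunks * slowest


def fits_timeout(num_chunks: int, timeout_s: int = 600) -> bool:
    """True if even the worst-case estimate finishes within timeout_s."""
    _, worst = estimate_runtime_seconds(num_chunks)
    return worst <= timeout_s
```

Under these numbers, 5 chunks take between 150 and 450 seconds, while 10 chunks take between 300 and 900 seconds, so a 10-minute limit only covers the worst case for up to 6 chunks.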
You MUST run this command SYNCHRONOUSLY with a long timeout:
GOOGLE_APPLICATION_CREDENTIALS=/opt/universal_agent/.gcp-tts-sa-key.json \
uv run .claude/skills/gemini-tts-narrator-tf/scripts/narrate_gemini.py <input> -o <output.mp3> -v Aoede 2>&1
Set timeout: 600000 (10 minutes) on the Bash tool call. The command will block until all chunks are rendered and the final MP3 is assembled.
🚫 DO NOT use run_in_background: true — the session will end before the script finishes and the result will never be delivered.
🚫 DO NOT repeatedly cat or tail an output file in a polling loop. This wastes tool calls and triggers the circuit breaker.
The script prints progress lines (Chunk N/M ... ✓) and a final Saved: <path> line. Wait for the command to complete, then deliver the MP3.
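If the script were driven from Python instead of the Bash tool, the same "block until done, then read the final line" pattern could be sketched as below. This is an assumption-laden illustration: `run_blocking` and `parse_saved_path` are hypothetical helpers, and only the `Saved: <path>` line format comes from this document.

```python
import re
import subprocess
from typing import List, Optional


def parse_saved_path(output: str) -> Optional[str]:
    """Return the path from the last 'Saved: <path>' line, or None if absent."""
    matches = re.findall(r"^Saved:\s*(.+?)\s*$", output, flags=re.MULTILINE)
    return matches[-1] if matches else None


def run_blocking(cmd: List[str], timeout_s: int = 600) -> str:
    """Run cmd synchronously, merge stderr into stdout, and return the saved path.

    Raises subprocess.TimeoutExpired on timeout and CalledProcessError on a
    non-zero exit, so failures surface instead of being polled for.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1
        text=True,
        timeout=timeout_s,
        check=True,
    )
    saved = parse_saved_path(result.stdout)
    if saved is None:
        raise RuntimeError("script finished without printing a 'Saved:' line")
    return saved
```

Because `subprocess.run` blocks until the child exits, there is no need for a polling loop over an output file.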
Available Models
| Model | Type | Best For |
|---|---|---|
| `gemini-3.1-flash-tts-preview` | Preview | Default. Best controllability, multi-speaker |
| `gemini-2.5-flash-tts` | GA | Fast, cheap, reliable |
| `gemini-2.5-pro-tts` | GA | Highest quality for audiobooks |
| `gemini-2.5-flash-lite-preview-tts` | Preview | Ultra cost-efficient |
Markup Tags (auto-inserted)
The script auto-inserts [medium pause] between paragraphs. For manual control:
- `[sigh]`, `[laughing]`, `[uhm]`: non-speech sounds
- `[whispering]`, `[shouting]`, `[sarcasm]`: style modifiers
- `[short pause]`, `[medium pause]`, `[long pause]`: pacing
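The automatic pause insertion can be pictured with a small sketch like this one. It assumes paragraphs are separated by blank lines and that the tag is placed inline between them; the script's exact placement may differ, and `add_paragraph_pauses` is a hypothetical name.

```python
import re

PAUSE_TAG = "[medium pause]"


def add_paragraph_pauses(text: str, tag: str = PAUSE_TAG) -> str:
    """Join blank-line-separated paragraphs with a pause tag between them."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return f" {tag} ".join(paragraphs)
```

For example, `add_paragraph_pauses("First.\n\nSecond.")` returns `"First. [medium pause] Second."`. Since the tag adds bytes to the text field, any pause insertion has to be counted against the chunk size budget.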
Constraints
- If the source file is missing, BLOCK the task immediately with the exact path needed.
- Each chunk must stay under ~3800 UTF-8 bytes (API limit is 4000 bytes per field).
- Output is native MP3 — no intermediate WAV or ffmpeg conversion.
- Auth requires service account key or gcloud ADC. Do NOT use GEMINI_IMAGE_API_KEY.
Dependencies
- `google-cloud-texttospeech>=2.29.0` (via PEP 723 inline deps)
- `gcloud` CLI (for auth token generation)
- No ffmpeg required (native MP3 output)
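For reference, a PEP 723 inline metadata header that declares this dependency would look roughly like the fragment below, placed at the top of the script so that `uv run` can resolve it. The `requires-python` value is an assumption; only the package pin comes from this document.

```python
# /// script
# requires-python = ">=3.10"  # assumed; not stated in this document
# dependencies = [
#     "google-cloud-texttospeech>=2.29.0",
# ]
# ///
```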