gemini-tts-narrator-tf - SKILL.md Agent Skill

name: gemini-tts-narrator-tf
description: Narrate text into a high-quality MP3 audiobook using Google Cloud Text-to-Speech API with Gemini TTS models. Accepts local file paths (desktop or VPS), URLs, or raw text. Produces native MP3 with warm narration style, smart paragraph chunking with pause markup, and multi-model support. Trigger on "narrate", "read aloud", "audiobook", "TTS this story", "convert to audio", "read this to me", or "narrator".

Gemini TTS Narrator

Goal

Convert text into a high-quality narrated MP3 audio file using Google Cloud Text-to-Speech API with Gemini TTS models.

Success Criteria

Input: local file path (.txt, .md), URL, or inline text
Output: single MP3 file saved to work_products/ or specified path
Audio is clear, naturally paced, and covers the full input text
Long texts are chunked with [medium pause] markup between paragraphs

Context

Canonical reference: docs/03_Operations/123_Gemini_TTS_Source_Of_Truth_2026-04-22.md

API: Google Cloud Text-to-Speech API (texttospeech.googleapis.com)
NOT AI Studio (generativelanguage.googleapis.com) — see doc 123 for why
Auth: Service account key at /opt/universal_agent/.gcp-tts-sa-key.json (VPS) or gcloud ADC (desktop)
Default model: gemini-3.1-flash-tts-preview (newest, best controllability, requires global endpoint)
Fallback models: gemini-2.5-flash-tts (GA, fast), gemini-2.5-pro-tts (highest quality)
Voices: Aoede (female, warm/narrative, default), Charon (male, deep), Kore, Leda, etc. (30 total)
Output: Native MP3 — no ffmpeg conversion needed
Limits: 4000 bytes text + 4000 bytes prompt per request (separate fields)
Chunking: Texts >3800 bytes split at paragraph then sentence boundaries
GCP Project: gen-lang-client-0229532959
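
The chunking rule above (split at paragraph boundaries first, then at sentence boundaries, keeping each chunk under 3800 UTF-8 bytes) can be sketched as follows. This is a minimal illustration, not the code in narrate_gemini.py: the function names, the regular expressions, and the choice to raise an error on a single oversized sentence are all assumptions.

```python
import re
from typing import Iterator

MAX_CHUNK_BYTES = 3800  # safety margin under the 4000-byte text field limit


def byte_len(s: str) -> int:
    """Length in UTF-8 bytes, which is what the limit is measured in."""
    return len(s.encode("utf-8"))


def split_units(text: str, limit: int) -> Iterator[tuple[str, str]]:
    """Yield (separator, unit) pairs: whole paragraphs where they fit,
    individual sentences for paragraphs that exceed the limit."""
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if byte_len(para) <= limit:
            yield "\n\n", para
            continue
        sep = "\n\n"  # first sentence still starts a new paragraph
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if sentence:
                yield sep, sentence
                sep = " "  # later sentences continue the same paragraph


def chunk_text(text: str, limit: int = MAX_CHUNK_BYTES) -> list[str]:
    """Greedily pack units into chunks that each stay within `limit` bytes."""
    chunks: list[str] = []
    current = ""
    for sep, unit in split_units(text, limit):
        if byte_len(unit) > limit:
            # Assumption: this sketch refuses rather than cutting mid-sentence.
            raise ValueError("a single sentence exceeds the chunk limit")
        candidate = current + sep + unit if current else unit
        if byte_len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

Measuring in bytes rather than characters matters for non-ASCII text, where one character can occupy two to four bytes.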

Execution Steps

Read input: Resolve the source — file path, URL, or raw text.
- Desktop paths like /home/kjdragan/lrepos/universal_agent/... are automatically mapped to VPS paths /opt/universal_agent/...
Validate: Confirm the source exists and is readable. Block if missing.

Run narrate_gemini.py:

GOOGLE_APPLICATION_CREDENTIALS=/opt/universal_agent/.gcp-tts-sa-key.json \
uv run .claude/skills/gemini-tts-narrator-tf/scripts/narrate_gemini.py <input> -o <output.mp3> -v Aoede

Fallback: If the run with the default model fails, retry with the gemini-2.5-flash-tts model:

uv run .claude/skills/gemini-tts-narrator-tf/scripts/narrate_gemini.py <input> -o <output.mp3> -m gemini-2.5-flash-tts

Deliver: You MUST email the resulting .mp3 file to Kevin. Use mcp__internal__agentmail_send_with_local_attachments to send it to kevinjdragan@gmail.com (his contact info is in USER.md).
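
The read and validate steps can be sketched in Python as below. This is a hedged illustration only: the helper names and the rule used to tell a file path from raw text (a single line ending in .txt or .md) are assumptions, and the real script may resolve inputs differently. The two root paths are the ones named in the steps above.

```python
from pathlib import Path

DESKTOP_ROOT = "/home/kjdragan/lrepos/universal_agent"
VPS_ROOT = "/opt/universal_agent"


def map_desktop_path(path: str) -> str:
    """Rewrite a desktop repo path to its VPS equivalent; leave others alone."""
    if path == DESKTOP_ROOT or path.startswith(DESKTOP_ROOT + "/"):
        return VPS_ROOT + path[len(DESKTOP_ROOT):]
    return path


def classify_source(source: str) -> str:
    """Return 'url', 'file', or 'text' (heuristic, assumed for this sketch)."""
    if source.startswith(("http://", "https://")):
        return "url"
    if "\n" not in source and source.lower().endswith((".txt", ".md")):
        return "file"
    return "text"


def read_file_source(source: str) -> str:
    """Read a file input, blocking with the exact path if it is missing."""
    path = Path(map_desktop_path(source))
    if not path.is_file():
        raise FileNotFoundError(f"Source file missing: {path}")
    return path.read_text(encoding="utf-8")
```

Raising with the resolved path in the message mirrors the requirement to block immediately and report the exact path needed.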

Script Options (narrate_gemini.py)

narrate_gemini.py <input> [-o OUTPUT] [-v VOICE] [-m MODEL] [--prompt PROMPT] [--language LANG]
  input          File path, URL, or raw text
  -o, --output   Output MP3 path (default: <input>_narrated.mp3)
  -v, --voice    Voice name (default: Aoede)
  -m, --model    TTS model (default: gemini-3.1-flash-tts-preview)
  --prompt       Custom narration style prompt
  --language     Language code (default: en-US)
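
For readers who want to see how these options map onto a parser, here is a hypothetical argparse definition consistent with the usage text above. It is an assumption about structure, not a copy of the script's code; in particular, the output default is left as None here because the exact rule for deriving <input>_narrated.mp3 is not specified.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented narrate_gemini.py options."""
    parser = argparse.ArgumentParser(prog="narrate_gemini.py")
    parser.add_argument("input", help="File path, URL, or raw text")
    # The documented default is <input>_narrated.mp3; derivation is left
    # to the caller in this sketch.
    parser.add_argument("-o", "--output", default=None, help="Output MP3 path")
    parser.add_argument("-v", "--voice", default="Aoede", help="Voice name")
    parser.add_argument(
        "-m", "--model",
        default="gemini-3.1-flash-tts-preview",
        help="TTS model",
    )
    parser.add_argument("--prompt", default=None, help="Custom narration style prompt")
    parser.add_argument("--language", default="en-US", help="Language code")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```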

⏳ Long-Running Execution (CRITICAL)

The TTS script takes 30-90 seconds PER CHUNK. A typical article produces 5-10 chunks, so a run usually takes 5-10 minutes (by straight multiplication, anywhere from 2.5 to 15 minutes).

You MUST run this command SYNCHRONOUSLY with a long timeout:

GOOGLE_APPLICATION_CREDENTIALS=/opt/universal_agent/.gcp-tts-sa-key.json \
uv run .claude/skills/gemini-tts-narrator-tf/scripts/narrate_gemini.py <input> -o <output.mp3> -v Aoede 2>&1

Set timeout: 600000 (10 minutes) on the Bash tool call. The command will block until all chunks are rendered and the final MP3 is assembled.

🚫 DO NOT use run_in_background: true — the session will end before the script finishes and the result will never be delivered.

🚫 DO NOT repeatedly cat or tail an output file in a polling loop. This wastes tool calls and triggers the circuit breaker.

The script prints progress lines (Chunk N/M ... ✓) and a final Saved: <path> line. Wait for the command to complete, then deliver the MP3.
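
When the same blocking behaviour is needed from Python rather than from the Bash tool, a sketch along these lines would work. It assumes the script path and key path shown above; the function names are hypothetical, and the 600-second default simply mirrors the 10-minute timeout.

```python
import os
import subprocess
from typing import Optional

SCRIPT = ".claude/skills/gemini-tts-narrator-tf/scripts/narrate_gemini.py"
KEY_PATH = "/opt/universal_agent/.gcp-tts-sa-key.json"


def build_command(source: str, output: str, voice: str = "Aoede",
                  model: Optional[str] = None) -> list[str]:
    """Assemble the uv command line documented in this skill."""
    cmd = ["uv", "run", SCRIPT, source, "-o", output, "-v", voice]
    if model:
        cmd += ["-m", model]
    return cmd


def narrate_blocking(source: str, output: str, timeout_s: int = 600) -> str:
    """Run the narrator synchronously and return its combined output.

    Raises subprocess.TimeoutExpired if the run exceeds timeout_s and
    subprocess.CalledProcessError if the script exits non-zero.
    """
    env = dict(os.environ, GOOGLE_APPLICATION_CREDENTIALS=KEY_PATH)
    result = subprocess.run(
        build_command(source, output),
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout


def saved_path(stdout: str) -> Optional[str]:
    """Extract the path from the final 'Saved: <path>' line, if present."""
    for line in reversed(stdout.splitlines()):
        if line.startswith("Saved:"):
            return line[len("Saved:"):].strip()
    return None
```

Because subprocess.run blocks until the child exits, no polling of the output file is needed; the Saved: line is read once from the captured output.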

Available Models

| Model | Type | Best For |
| --- | --- | --- |
| `gemini-3.1-flash-tts-preview` | Preview | Default. Best controllability, multi-speaker |
| `gemini-2.5-flash-tts` | GA | Fast, cheap, reliable |
| `gemini-2.5-pro-tts` | GA | Highest quality for audiobooks |
| `gemini-2.5-flash-lite-preview-tts` | Preview | Ultra cost-efficient |
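
The default-then-fallback order can be expressed as a small retry loop. The sketch below is illustrative: `narrate_with_fallback` and its `run` callback are hypothetical names, and the callback stands in for whatever actually invokes the script with a given model.

```python
from typing import Callable, Sequence

# Default model first, then the GA fallback named in this skill.
MODEL_ORDER = ("gemini-3.1-flash-tts-preview", "gemini-2.5-flash-tts")


def narrate_with_fallback(
    run: Callable[[str], str],
    models: Sequence[str] = MODEL_ORDER,
) -> tuple[str, str]:
    """Try each model in order; return (model_used, result) on first success."""
    errors = []
    for model in models:
        try:
            return model, run(model)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all TTS models failed: " + "; ".join(errors))
```

Collecting the per-model errors keeps the final failure message useful when every model in the list fails.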

Markup Tags (auto-inserted)

The script auto-inserts [medium pause] between paragraphs. For manual control:

[sigh], [laughing], [uhm] — non-speech sounds
[whispering], [shouting], [sarcasm] — style modifiers
[short pause], [medium pause], [long pause] — pacing
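
As an illustration of the auto-insertion, the sketch below joins paragraphs with the [medium pause] tag. The function name and the exact whitespace around the tag are assumptions; the script's own placement may differ.

```python
import re

PAUSE_TAG = "[medium pause]"


def add_paragraph_pauses(text: str, tag: str = PAUSE_TAG) -> str:
    """Replace blank-line paragraph breaks with an inline pause tag."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return f" {tag} ".join(paragraphs)


print(add_paragraph_pauses("First paragraph.\n\nSecond paragraph."))
```

Each inserted tag adds bytes to the text field, so it counts toward the per-request size limit.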

Constraints

If the source file is missing, BLOCK the task immediately with the exact path needed.
Each chunk must stay under ~3800 UTF-8 bytes (API limit is 4000 bytes per field).
Output is native MP3 — no intermediate WAV or ffmpeg conversion.
Auth requires service account key or gcloud ADC. Do NOT use GEMINI_IMAGE_API_KEY.
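
Because the limits are measured in bytes and apply to the text and prompt fields separately, a pre-flight check like the following can catch oversized requests before they are sent. The function name is hypothetical; only the 4000-byte figure comes from this skill.

```python
FIELD_LIMIT_BYTES = 4000  # per-field API limit stated in this skill


def check_request_size(text: str, prompt: str = "") -> None:
    """Raise ValueError if either field exceeds the per-field byte limit."""
    for name, value in (("text", text), ("prompt", prompt)):
        size = len(value.encode("utf-8"))
        if size > FIELD_LIMIT_BYTES:
            raise ValueError(
                f"{name} is {size} bytes; limit is {FIELD_LIMIT_BYTES}"
            )
```

For example, a string of 2500 "é" characters is only 2500 characters long but encodes to 5000 bytes, so it would be rejected.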

Dependencies

google-cloud-texttospeech>=2.29.0 (via PEP 723 inline deps)
gcloud CLI (for auth token generation)
No ffmpeg required (native MP3 output)