livekit-tts - SKILL.md Agent Skill

name: livekit-tts
description: Configure Text-to-Speech models for LiveKit agents with Cartesia, ElevenLabs, OpenAI, and custom voices
argument-hint: ""
allowed-tools: Read, Write, Bash(pip install, npm install), Glob, Grep, WebSearch

LiveKit Text-to-Speech (TTS)

Configure voice synthesis for AI agents: $ARGUMENTS

Expert Knowledge

You are a LiveKit TTS specialist with expertise in:

TTS model selection and comparison
Voice customization and cloning
Streaming audio synthesis
Emotion and style control
Multilingual voice synthesis

TTS Overview

Provider	Latency	Voices	Best For
Cartesia Sonic	Ultra-low	30+	Production, real-time
ElevenLabs	Low	1000+	Voice cloning, quality
OpenAI TTS	Medium	6	Simplicity
Google Cloud	Medium	300+	Multilingual
Azure	Medium	400+	Enterprise
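
When provider choice has to happen in code, the table above can be encoded as data. The following is a minimal sketch, not part of LiveKit: `PROVIDERS`, `LATENCY_RANK`, and `pick_provider` are hypothetical names, and the latency tiers are the coarse labels from the table rather than measurements.

```python
# Hypothetical helper: encodes the comparison table above as plain data.
LATENCY_RANK = {"ultra-low": 0, "low": 1, "medium": 2}

PROVIDERS = {
    "Cartesia Sonic": {"latency": "ultra-low", "best_for": {"production", "real-time"}},
    "ElevenLabs": {"latency": "low", "best_for": {"voice cloning", "quality"}},
    "OpenAI TTS": {"latency": "medium", "best_for": {"simplicity"}},
    "Google Cloud": {"latency": "medium", "best_for": {"multilingual"}},
    "Azure": {"latency": "medium", "best_for": {"enterprise"}},
}


def pick_provider(need: str) -> str:
    """Return the lowest-latency provider whose 'Best For' column lists `need`."""
    matches = [name for name, info in PROVIDERS.items() if need in info["best_for"]]
    if not matches:
        raise ValueError(f"no provider in the table is listed for {need!r}")
    return min(matches, key=lambda name: LATENCY_RANK[PROVIDERS[name]["latency"]])
```

For example, `pick_provider("voice cloning")` selects ElevenLabs, and an unknown requirement raises `ValueError` instead of silently falling back to a default.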

LiveKit Inference Models

# Cartesia
tts="cartesia/sonic"
tts="cartesia/sonic:voice-id"

# ElevenLabs
tts="elevenlabs/flash-v2.5"
tts="elevenlabs/flash-v2.5:voice-id"

# OpenAI
tts="openai/tts-1"
tts="openai/tts-1:alloy"
tts="openai/tts-1:echo"
tts="openai/tts-1:fable"
tts="openai/tts-1:onyx"
tts="openai/tts-1:nova"
tts="openai/tts-1:shimmer"
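
The strings above follow a `provider/model[:voice]` shape. If you need to inspect or validate such a string before handing it to a session, a small parser can help. This sketch assumes only the shape visible in the examples; `TTSDescriptor` and `parse_tts_descriptor` are hypothetical names, not LiveKit APIs.

```python
from typing import NamedTuple, Optional


class TTSDescriptor(NamedTuple):
    provider: str
    model: str
    voice: Optional[str]  # None when the string has no ":voice" suffix


def parse_tts_descriptor(descriptor: str) -> TTSDescriptor:
    """Split 'provider/model[:voice]' into its parts."""
    provider, slash, rest = descriptor.partition("/")
    if not slash or not provider or not rest:
        raise ValueError(f"expected 'provider/model[:voice]', got {descriptor!r}")
    # Only the first ':' separates model from voice; voice IDs may contain '-'
    model, _, voice = rest.partition(":")
    if not model:
        raise ValueError(f"missing model name in {descriptor!r}")
    return TTSDescriptor(provider, model, voice or None)
```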

Plugin Configuration

Cartesia Sonic

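from livekit.agents import AgentSession
from livekit.plugins.cartesia import TTS as CartesiaTTS

tts = CartesiaTTS(
    model="sonic",
    voice="a0e99841-438c-4a64-b679-ae501e7d6091",  # Voice ID

    # Speed control
    speed=1.0,  # supported range depends on the Sonic model version

    # Emotion control (experimental)
    emotion=["positivity:high", "curiosity:medium"],

    # Language (auto-detected if not set)
    language="en",
)

# Use in session: models are passed to AgentSession, then the session is
# started in a room (ctx is the JobContext handed to your entrypoint)
session = AgentSession(
    stt="deepgram/nova-3",
    llm="openai/gpt-4o-mini",
    tts=tts,
)
await session.start(room=ctx.room, agent=MyAgent())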

Cartesia Voice Selection

# Browse voices at: https://play.cartesia.ai/

# Popular Cartesia voices
voices = {
    "barbershop_man": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "sweet_lady": "e00d0e4c-a5c8-443f-a8a3-473eb9a62355",
    "confident_man": "2ee87190-8f84-4925-97da-e52547f9462c",
    "calm_woman": "b7d50908-b17c-442d-ad8d-810c63997ed9",
}

tts = CartesiaTTS(
    model="sonic",
    voice=voices["sweet_lady"],
)

ElevenLabs

from livekit.plugins.elevenlabs import TTS as ElevenLabsTTS, VoiceSettings

tts = ElevenLabsTTS(
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel voice

    # Model selection
    model="eleven_flash_v2_5",  # Fastest
    # model="eleven_turbo_v2_5",  # Balanced
    # model="eleven_multilingual_v2",  # Best quality

    # Voice settings are grouped in a VoiceSettings object
    voice_settings=VoiceSettings(
        stability=0.5,         # 0-1, lower = more expressive
        similarity_boost=0.75, # 0-1, higher = closer to original
        style=0.0,             # 0-1, style exaggeration
        use_speaker_boost=True,
    ),

    # Language (for multilingual model)
    # language="en",
)

ElevenLabs Voice Cloning

# First, clone voice via ElevenLabs dashboard or API
# Then use the cloned voice ID

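from livekit.plugins.elevenlabs import TTS as ElevenLabsTTS, VoiceSettings

tts = ElevenLabsTTS(
    voice_id="your-cloned-voice-id",
    model="eleven_turbo_v2_5",
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.85,  # Higher for cloned voices
    ),
)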

OpenAI TTS

from livekit.plugins.openai import TTS as OpenAITTS

tts = OpenAITTS(
    model="tts-1",        # tts-1 (fast) or tts-1-hd (quality)
    voice="nova",         # alloy, echo, fable, onyx, nova, shimmer
    speed=1.0,            # 0.25 to 4.0
)

# Voice characteristics:
# alloy - Neutral, balanced
# echo - Warm, conversational
# fable - British, storytelling
# onyx - Deep, authoritative
# nova - Friendly, upbeat
# shimmer - Clear, professional

Google Cloud TTS

from livekit.plugins.google import TTS as GoogleTTS

tts = GoogleTTS(
    voice_name="en-US-Neural2-F",  # Neural voice
    language="en-US",
    speaking_rate=1.0,
    pitch=0.0,
)

# Voice types:
# Neural2 - Best quality
# Studio - Professional
# Wavenet - High quality
# Standard - Basic

Azure TTS

from livekit.plugins.azure import TTS as AzureTTS

tts = AzureTTS(
    voice="en-US-JennyNeural",
    language="en-US",
    rate=1.0,
    pitch=0,
)

Voice Embedding (Cartesia)

# Create voice from embedding (the vector is passed as `voice`)
tts = CartesiaTTS(
    model="sonic",
    voice=[0.123, 0.456, ...],  # 192-dim vector
)

# Mix voice embeddings
import numpy as np

# get_embedding is a placeholder: fetch each voice's embedding yourself
voice_a = np.asarray(get_embedding("voice_a_id"))
voice_b = np.asarray(get_embedding("voice_b_id"))
mixed = 0.7 * voice_a + 0.3 * voice_b
mixed = mixed / np.linalg.norm(mixed)  # Normalize to unit length

tts = CartesiaTTS(
    model="sonic",
    voice=mixed.tolist(),
)
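
The weighted mix and normalization can also be written without NumPy, which makes the arithmetic explicit and easy to unit test. This is a dependency-free sketch; `mix_embeddings` is a hypothetical helper, and it assumes embeddings are plain lists of floats of equal length.

```python
import math


def mix_embeddings(a: list[float], b: list[float], weight_a: float = 0.7) -> list[float]:
    """Blend two embeddings as weight_a*a + (1-weight_a)*b, scaled to unit length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    mixed = [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]
    norm = math.sqrt(sum(v * v for v in mixed))
    if norm == 0.0:
        raise ValueError("mixed embedding has zero length and cannot be normalized")
    return [v / norm for v in mixed]
```

Normalizing after mixing keeps the result on the same scale regardless of the weights. Whether a given model expects unit-length embeddings is an assumption carried over from the snippet above.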

Emotion Control (Cartesia)

# Emotion tags
emotions = [
    "anger",
    "positivity",
    "surprise",
    "sadness",
    "curiosity",
]

# Intensity levels
levels = ["low", "medium", "high"]

tts = CartesiaTTS(
    model="sonic",
    voice="voice-id",
    emotion=["positivity:high", "curiosity:medium"],
)

# Change emotion dynamically
async def speak_with_emotion(text: str, emotions: list[str]):
    tts.update_options(emotion=emotions)  # applies to subsequent synthesis
    await session.say(text)

Word-Level Timing

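# Cartesia and ElevenLabs support word timestamps
from typing import AsyncIterable

from livekit.agents import Agent, AgentSession, ModelSettings
from livekit.agents.voice.io import TimedString
from livekit.plugins.cartesia import TTS as CartesiaTTS

tts = CartesiaTTS(
    model="sonic",
    voice="voice-id",
    word_timestamps=True,
)

# Ask the session to align the transcript with the TTS word timings
session = AgentSession(tts=tts, use_tts_aligned_transcript=True)

class TimedAgent(Agent):
    # Timed words arrive through the transcription node
    async def transcription_node(self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings):
        async for chunk in text:
            if isinstance(chunk, TimedString):
                print(f"Word '{chunk}' at {chunk.start_time:.2f}s - {chunk.end_time:.2f}s")
            yield chunk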

Streaming Synthesis

# TTS streams audio as it's generated
# This is the default behavior

# For explicit control:
async def stream_speech(text: str):
    stream = tts.synthesize(text)

    async for audio_chunk in stream:
        # Process each audio chunk
        yield audio_chunk
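
The consumption pattern can be tried without any provider credentials. In the sketch below, `fake_synthesize` is a hypothetical stand-in for a TTS stream that yields bytes in small pieces, and `collect` shows a consumer handling each chunk as it arrives instead of waiting for the full audio.

```python
import asyncio
from typing import AsyncIterator


async def fake_synthesize(text: str, chunk_size: int = 4) -> AsyncIterator[bytes]:
    """Stand-in for a TTS stream: yields the UTF-8 bytes of `text` in small chunks."""
    data = text.encode("utf-8")
    for i in range(0, len(data), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield data[i:i + chunk_size]


async def collect(stream: AsyncIterator[bytes]) -> tuple[int, bytes]:
    """Consume a stream chunk by chunk; return (chunk count, joined bytes)."""
    chunks = []
    async for chunk in stream:
        chunks.append(chunk)  # a real consumer would forward each chunk immediately
    return len(chunks), b"".join(chunks)
```

Running `asyncio.run(collect(fake_synthesize("hello world")))` returns the chunk count together with the reassembled bytes. In a real agent the first chunk can be played while later ones are still being generated, which is where the latency benefit comes from.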

Standalone Usage

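# Use TTS independently
import wave

from livekit.plugins.cartesia import TTS

tts = TTS(model="sonic", voice="voice-id")

async def synthesize_audio(text: str):
    frames = []
    async for chunk in tts.synthesize(text):
        frames.append(chunk.frame)  # each chunk wraps an rtc.AudioFrame

    # Save to file: write a WAV header so players can read the 16-bit PCM
    with wave.open("output.wav", "wb") as f:
        f.setnchannels(tts.num_channels)
        f.setsampwidth(2)  # 16-bit samples
        f.setframerate(tts.sample_rate)
        for frame in frames:
            f.writeframes(bytes(frame.data))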

Multi-Voice Agents

class MultiVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You can speak with different voices.",
            # One TTS instance; the voice is switched before each utterance
            tts=CartesiaTTS(model="sonic", voice="narrator-id"),
        )

        self.voices = {
            "narrator": "narrator-id",
            "character1": "char1-id",
            "character2": "char2-id",
        }

    async def narrate(self, text: str):
        await self.speak_as("narrator", text)

    async def speak_as(self, character: str, text: str):
        self.tts.update_options(voice=self.voices[character])
        await self.session.say(text)
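
To drive a multi-voice agent from a script, each line first has to be assigned to a voice. The sketch below assumes a simple `speaker: text` line format, which is an illustrative convention and not something LiveKit defines; `route_script` is a hypothetical helper.

```python
import re

# "speaker: text" with optional surrounding whitespace
SPEAKER_LINE = re.compile(r"^\s*(\w+)\s*:\s*(.+?)\s*$")


def route_script(
    script: str, known_voices: set[str], default: str = "narrator"
) -> list[tuple[str, str]]:
    """Split a script into (voice name, text) pairs, one per non-empty line."""
    segments = []
    for line in script.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = SPEAKER_LINE.match(line)
        if match and match.group(1) in known_voices:
            segments.append((match.group(1), match.group(2)))
        else:
            # Unlabeled lines, or labels that are not voices, go to the default voice
            segments.append((default, line.strip()))
    return segments
```

Each resulting pair can then be passed to a method such as `speak_as(character, text)`.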

Pronunciation Control

# Cartesia pronunciation dictionary
tts = CartesiaTTS(
    model="sonic",
    voice="voice-id",
    pronunciation_dictionary={
        "LiveKit": "LAHYV-kit",
        "WebRTC": "web R T C",
        "API": "A P I",
    },
)

# SSML for other providers (Azure, Google)
text = """
<speak>
  Welcome to <phoneme alphabet="ipa" ph="ˈlaɪv.kɪt">LiveKit</phoneme>.
  <break time="500ms"/>
  How can I help you today?
</speak>
"""

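Where a provider or plugin version offers neither a pronunciation dictionary nor SSML, the same effect can be approximated by rewriting the text before synthesis. This is a provider-agnostic sketch; `apply_pronunciations` is a hypothetical helper, and matching is case-sensitive and limited to whole words.

```python
import re


def apply_pronunciations(text: str, replacements: dict[str, str]) -> str:
    """Replace whole-word occurrences of each key with its phonetic spelling."""
    if not replacements:
        return text
    # Longest keys first so overlapping entries prefer the longer match
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(1)], text)
```

Word boundaries keep `API` from being rewritten inside `APIs`. Add separate entries for plural or inflected forms if they need their own spelling.
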
Environment Variables

# Cartesia
CARTESIA_API_KEY=your-key

# ElevenLabs
ELEVEN_API_KEY=your-key

# OpenAI
OPENAI_API_KEY=your-key

# Google Cloud
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

# Azure
AZURE_SPEECH_KEY=your-key
AZURE_SPEECH_REGION=eastus
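
A missing key usually surfaces only when the first synthesis request fails, so it can help to check the environment at startup. The sketch below uses the variable names listed above; `REQUIRED_ENV` and `missing_env` are hypothetical names.

```python
import os
from typing import Mapping, Optional

REQUIRED_ENV = {
    "cartesia": ["CARTESIA_API_KEY"],
    "elevenlabs": ["ELEVEN_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "google": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "azure": ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"],
}


def missing_env(provider: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables for `provider` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]
```

Calling `missing_env("azure")` before building the TTS instance returns an empty list when both Azure variables are set, and the names of the missing ones otherwise.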

Complete Example

import logging
from dotenv import load_dotenv
from livekit import agents
from livekit.agents import Agent, AgentServer, AgentSession
from livekit.plugins import silero
from livekit.plugins.cartesia import TTS as CartesiaTTS

load_dotenv()
logging.basicConfig(level=logging.INFO)

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a friendly assistant.
            Speak naturally with appropriate emotion."""
        )

    async def on_enter(self):
        # Enthusiastic greeting: raise positivity, then speak a fixed line
        self.session.tts.update_options(emotion=["positivity:high"])
        await self.session.say("Hello! I'm so excited to help you today!")

server = AgentServer()

@server.rtc_session()
async def create_session(ctx: agents.JobContext):
    tts = CartesiaTTS(
        model="sonic",
        voice="sweet_lady_id",  # Replace with actual voice ID
        speed=1.0,
        emotion=["positivity:medium"],
    )

    # Models are configured on the session, not passed to start()
    session = AgentSession(
        stt="deepgram/nova-3",
        llm="openai/gpt-4o-mini",
        tts=tts,
        vad=silero.VAD.load(),
    )

    # Speaking start and end are observed through agent state changes
    @session.on("agent_state_changed")
    def on_state_changed(ev):
        if ev.new_state == "speaking":
            logging.info("Agent started speaking")
        elif ev.old_state == "speaking":
            logging.info("Agent finished speaking")

    await session.start(room=ctx.room, agent=VoiceAgent())

if __name__ == "__main__":
    agents.cli.run_app(server)

Best Practices

Match voice to persona: Choose voices that fit your agent's character
Use streaming: Lower latency for real-time conversations
Control speed: Slow down for clarity, speed up for casual chat
Test with users: Voice preference is subjective
Handle pauses: Use breaks for natural speech rhythm

Deliverables

For: $ARGUMENTS

Provide:

TTS provider selection
Voice configuration
Emotion/style settings if applicable
Streaming setup
Multi-voice logic if needed
Integration code