name: livekit-tts
description: Configure Text-to-Speech models for LiveKit agents with Cartesia, ElevenLabs, OpenAI, and custom voices
argument-hint: ""
allowed-tools: Read, Write, Bash(pip install, npm install), Glob, Grep, WebSearch
LiveKit Text-to-Speech (TTS)
Configure voice synthesis for AI agents: $ARGUMENTS
Expert Knowledge
You are a LiveKit TTS specialist with expertise in:
- TTS model selection and comparison
- Voice customization and cloning
- Streaming audio synthesis
- Emotion and style control
- Multilingual voice synthesis
TTS Overview
| Provider | Latency | Voices | Best For |
|---|---|---|---|
| Cartesia Sonic | Ultra-low | 30+ | Production, real-time |
| ElevenLabs | Low | 1000+ | Voice cloning, quality |
| OpenAI TTS | Medium | 6 | Simplicity |
| Google Cloud | Medium | 300+ | Multilingual |
| Azure | Medium | 400+ | Enterprise |
LiveKit Inference Models
# Cartesia
tts="cartesia/sonic"
tts="cartesia/sonic:voice-id"
# ElevenLabs
tts="elevenlabs/flash-v2.5"
tts="elevenlabs/flash-v2.5:voice-id"
# OpenAI
tts="openai/tts-1"
tts="openai/tts-1:alloy"
tts="openai/tts-1:echo"
tts="openai/tts-1:fable"
tts="openai/tts-1:onyx"
tts="openai/tts-1:nova"
tts="openai/tts-1:shimmer"
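The descriptor strings above follow a `provider/model` pattern with an optional `:voice` suffix. The following is a minimal sketch of splitting and rebuilding such strings, for example to validate configuration before a session starts. `TTSDescriptor`, `parse_tts_descriptor`, and `format_tts_descriptor` are hypothetical helper names and are not part of LiveKit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TTSDescriptor:
    """Parts of a 'provider/model[:voice]' string (hypothetical helper type)."""
    provider: str
    model: str
    voice: Optional[str] = None


def parse_tts_descriptor(descriptor: str) -> TTSDescriptor:
    """Split 'provider/model[:voice]' into provider, model, and optional voice."""
    provider, sep, rest = descriptor.partition("/")
    if not sep or not provider or not rest:
        raise ValueError(f"expected 'provider/model[:voice]', got {descriptor!r}")
    model, _, voice = rest.partition(":")
    if not model:
        raise ValueError(f"missing model name in {descriptor!r}")
    return TTSDescriptor(provider, model, voice or None)


def format_tts_descriptor(d: TTSDescriptor) -> str:
    """Rebuild the string form, adding ':voice' only when a voice is set."""
    base = f"{d.provider}/{d.model}"
    return f"{base}:{d.voice}" if d.voice else base


print(parse_tts_descriptor("openai/tts-1:alloy"))
```

Parsing up front lets a typo such as a missing slash fail at startup instead of at the first synthesis request.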
Plugin Configuration
Cartesia Sonic
from livekit.plugins.cartesia import TTS as CartesiaTTS
tts = CartesiaTTS(
    model="sonic",
    voice="a0e99841-438c-4a64-b679-ae501e7d6091",  # Voice ID
    # Speed control
    speed=1.0,  # 0.5 to 2.0
    # Emotion control (experimental)
    emotion=["positivity:high", "curiosity:medium"],
    # Language (auto-detected if not set)
    language="en",
)
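# Use in session: models are configured on AgentSession, then start() binds the agent to a room
# (ctx is the JobContext passed to your entrypoint)
from livekit.agents import AgentSession
session = AgentSession(
    stt="deepgram/nova-3",
    llm="openai/gpt-4o-mini",
    tts=tts,
)
await session.start(agent=MyAgent(), room=ctx.room)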
Cartesia Voice Selection
# Browse voices at: https://play.cartesia.ai/
# Popular Cartesia voices
voices = {
    "barbershop_man": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "sweet_lady": "e00d0e4c-a5c8-443f-a8a3-473eb9a62355",
    "confident_man": "2ee87190-8f84-4925-97da-e52547f9462c",
    "calm_woman": "b7d50908-b17c-442d-ad8d-810c63997ed9",
}
tts = CartesiaTTS(
    model="sonic",
    voice=voices["sweet_lady"],
)
ElevenLabs
from livekit.plugins.elevenlabs import TTS as ElevenLabsTTS, VoiceSettings
tts = ElevenLabsTTS(
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel voice
    # Model selection
    model="eleven_flash_v2_5",  # Fastest
    # model="eleven_turbo_v2_5",  # Balanced
    # model="eleven_multilingual_v2",  # Best quality
    # Voice settings are grouped in a VoiceSettings object
    voice_settings=VoiceSettings(
        stability=0.5,  # 0-1, lower = more expressive
        similarity_boost=0.75,  # 0-1, higher = closer to original
        style=0.0,  # 0-1, style exaggeration
        use_speaker_boost=True,
    ),
    # Language (for multilingual model)
    # language="en",
)
ElevenLabs Voice Cloning
# First, clone voice via ElevenLabs dashboard or API
# Then use the cloned voice ID
from livekit.plugins.elevenlabs import VoiceSettings
tts = ElevenLabsTTS(
    voice_id="your-cloned-voice-id",
    model="eleven_turbo_v2_5",
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.85,  # Higher for cloned voices
    ),
)
OpenAI TTS
from livekit.plugins.openai import TTS as OpenAITTS
tts = OpenAITTS(
    model="tts-1",  # tts-1 (fast) or tts-1-hd (quality)
    voice="nova",  # alloy, echo, fable, onyx, nova, shimmer
    speed=1.0,  # 0.25 to 4.0
)
# Voice characteristics:
# alloy - Neutral, balanced
# echo - Warm, conversational
# fable - British, storytelling
# onyx - Deep, authoritative
# nova - Friendly, upbeat
# shimmer - Clear, professional
Google Cloud TTS
from livekit.plugins.google import TTS as GoogleTTS
tts = GoogleTTS(
    voice_name="en-US-Neural2-F",  # Neural voice
    language="en-US",
    speaking_rate=1.0,
    pitch=0,
)
# Voice types:
# Neural2 - Best quality
# Studio - Professional
# Wavenet - High quality
# Standard - Basic
Azure TTS
from livekit.plugins.azure import TTS as AzureTTS
tts = AzureTTS(
    voice="en-US-JennyNeural",
    language="en-US",
    rate=1.0,
    pitch=0,
)
Voice Embedding (Cartesia)
# Create voice from embedding
# (the plugin's voice argument accepts either a voice ID string or an embedding list)
tts = CartesiaTTS(
    model="sonic",
    voice=[0.123, 0.456, ...],  # 192-dim vector
)
# Mix voice embeddings
import numpy as np
# get_embedding is a placeholder for your own lookup; it should return a NumPy array
voice_a = get_embedding("voice_a_id")
voice_b = get_embedding("voice_b_id")
mixed = 0.7 * voice_a + 0.3 * voice_b
mixed = mixed / np.linalg.norm(mixed)  # Normalize to unit length
tts = CartesiaTTS(
    model="sonic",
    voice=mixed.tolist(),
)
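The mixing step above is a weighted sum followed by L2 normalization. The following is a dependency-free sketch of the same arithmetic with input checks. `mix_embeddings` is a hypothetical helper name, and the two-element vectors are toy data rather than real voice embeddings.

```python
import math
from typing import Sequence


def mix_embeddings(
    a: Sequence[float], b: Sequence[float], weight_a: float = 0.5
) -> list[float]:
    """Blend two embeddings as weight_a * a + (1 - weight_a) * b, then unit-normalize."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    mixed = [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]
    norm = math.sqrt(sum(v * v for v in mixed))
    if norm == 0.0:
        raise ValueError("mixed embedding has zero length and cannot be normalized")
    return [v / norm for v in mixed]


# Toy 2-dimensional vectors; a real embedding would be much longer.
blended = mix_embeddings([1.0, 0.0], [0.0, 1.0], weight_a=0.7)
print(blended)
```

Checking dimensions and the zero-norm case avoids silently passing a truncated or all-zero vector to the TTS provider.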
Emotion Control (Cartesia)
# Emotion tags
emotions = [
    "anger",
    "positivity",
    "surprise",
    "sadness",
    "curiosity",
]
# Intensity levels
levels = ["low", "medium", "high"]
tts = CartesiaTTS(
    model="sonic",
    voice="voice-id",
    emotion=["positivity:high", "curiosity:medium"],
)
# Change emotion dynamically
async def speak_with_emotion(text: str, emotions: list[str]):
    # update_options applies the new emotion tags to subsequent synthesis
    tts.update_options(emotion=emotions)
    await session.say(text)
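Emotion values above are plain `name:level` strings, so a typo would only surface at synthesis time. The following is a small sketch that builds and validates the tags against the names and levels listed in this section. `emotion_tag` and `emotion_tags` are hypothetical helpers, not part of the Cartesia plugin.

```python
EMOTIONS = ("anger", "positivity", "surprise", "sadness", "curiosity")
LEVELS = ("low", "medium", "high")


def emotion_tag(name: str, level: str = "medium") -> str:
    """Return 'name:level' after checking both parts against the known lists."""
    if name not in EMOTIONS:
        raise ValueError(f"unknown emotion {name!r}; expected one of {EMOTIONS}")
    if level not in LEVELS:
        raise ValueError(f"unknown level {level!r}; expected one of {LEVELS}")
    return f"{name}:{level}"


def emotion_tags(**levels_by_emotion: str) -> list[str]:
    """Build a tag list from keyword arguments, keeping the order they were given."""
    return [emotion_tag(name, level) for name, level in levels_by_emotion.items()]


print(emotion_tags(positivity="high", curiosity="medium"))
```

The resulting list has the same shape as the `emotion=[...]` argument shown above.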
Word-Level Timing
# Cartesia and ElevenLabs support word timestamps
from livekit.plugins.cartesia import TTS as CartesiaTTS
tts = CartesiaTTS(
    model="sonic",
    voice="voice-id",
    word_timestamps=True,
)
@session.on("tts_word")
def on_word(word: str, start_time: float, end_time: float):
    print(f"Word '{word}' at {start_time:.2f}s - {end_time:.2f}s")
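A common use of word-level timing is caption display. The following sketch groups timed words into short caption cues, independent of any provider. `WordTiming`, `Cue`, and `group_into_cues` are hypothetical names, and the `max_words` and `max_gap` defaults are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    word: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Cue:
    text: str
    start: float
    end: float


def group_into_cues(
    words: list[WordTiming], max_words: int = 4, max_gap: float = 0.6
) -> list[Cue]:
    """Start a new cue when the current one is full or a long silence occurs."""
    cues: list[Cue] = []
    current: list[WordTiming] = []

    def flush() -> None:
        if current:
            text = " ".join(w.word for w in current)
            cues.append(Cue(text, current[0].start, current[-1].end))
            current.clear()

    for w in words:
        if current and (len(current) >= max_words or w.start - current[-1].end > max_gap):
            flush()
        current.append(w)
    flush()
    return cues


timings = [
    WordTiming("Hello", 0.0, 0.4),
    WordTiming("there", 0.45, 0.8),
    WordTiming("friend", 2.0, 2.5),
]
for cue in group_into_cues(timings):
    print(f"{cue.start:.2f}-{cue.end:.2f}: {cue.text}")
```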
Streaming Synthesis
# TTS streams audio as it's generated
# This is the default behavior
# For explicit control:
async def stream_speech(text: str):
    stream = tts.synthesize(text)
    async for audio_chunk in stream:
        # Process each audio chunk
        yield audio_chunk
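Streaming matters because playback can begin on the first chunk instead of waiting for the full utterance. The following sketch measures time to first chunk over any async iterator of byte chunks. `fake_tts_stream` is a stand-in generator that only simulates latency, and `measure_stream` is a hypothetical helper; neither calls a real TTS service.

```python
import asyncio
import time
from typing import AsyncIterator


async def fake_tts_stream(n_chunks: int = 3, delay: float = 0.01) -> AsyncIterator[bytes]:
    """Stand-in for a TTS stream: yields silent 16-bit PCM chunks after a delay."""
    for _ in range(n_chunks):
        await asyncio.sleep(delay)
        yield b"\x00\x00" * 240  # 240 samples of silence


async def measure_stream(stream: AsyncIterator[bytes]) -> dict:
    """Consume a stream and record when the first chunk arrived and the totals."""
    started = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    total_bytes = 0
    async for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - started
        chunks += 1
        total_bytes += len(chunk)
    return {
        "time_to_first_chunk": first_chunk_at,
        "total_time": time.perf_counter() - started,
        "chunks": chunks,
        "bytes": total_bytes,
    }


stats = asyncio.run(measure_stream(fake_tts_stream()))
print(stats)
```

With a real provider, the gap between `time_to_first_chunk` and `total_time` is the latency that streaming hides from the listener.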
Standalone Usage
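# Use TTS independently
import wave
from livekit.plugins.cartesia import TTS
tts = TTS(model="sonic", voice="voice-id")
async def synthesize_audio(text: str):
    pcm = bytearray()
    async for chunk in tts.synthesize(text):
        # Each chunk wraps an audio frame of 16-bit PCM samples
        pcm += bytes(chunk.frame.data)
    # Save to file: wrap the raw PCM in a WAV header so players can open it
    with wave.open("output.wav", "wb") as f:
        f.setnchannels(tts.num_channels)
        f.setsampwidth(2)  # 16-bit samples
        f.setframerate(tts.sample_rate)
        f.writeframes(bytes(pcm))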
Multi-Voice Agents
class MultiVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You can speak with different voices."
        )
        self.voices = {
            "narrator": CartesiaTTS(model="sonic", voice="narrator-id"),
            "character1": CartesiaTTS(model="sonic", voice="char1-id"),
            "character2": CartesiaTTS(model="sonic", voice="char2-id"),
        }
    async def narrate(self, text: str):
        await self.session.say(text, tts=self.voices["narrator"])
    async def speak_as(self, character: str, text: str):
        await self.session.say(text, tts=self.voices[character])
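Before calling `narrate` or `speak_as`, a script has to be split into per-voice segments. The following sketch assumes a simple `speaker: text` line convention, which is an illustrative format and not something LiveKit defines. `split_script` is a hypothetical helper; lines without a known speaker prefix fall back to the narrator.

```python
def split_script(
    script: str, voices: set[str], default: str = "narrator"
) -> list[tuple[str, str]]:
    """Return (voice_key, text) pairs, one per non-empty script line."""
    segments: list[tuple[str, str]] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        key = speaker.strip().lower()
        if sep and key in voices and text.strip():
            segments.append((key, text.strip()))
        else:
            # Unknown prefix or no prefix: keep the whole line as narration.
            segments.append((default, line))
    return segments


script = """
The door creaked open.
character1: Who is there?
Note: this stays narration.
"""
for voice, text in split_script(script, {"narrator", "character1", "character2"}):
    print(voice, "->", text)
```

Each pair maps directly onto a key of the `voices` dictionary in the agent above.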
Pronunciation Control
# Cartesia pronunciation dictionary
tts = CartesiaTTS(
    model="sonic",
    voice="voice-id",
    pronunciation_dictionary={
        "LiveKit": "LAHYV-kit",
        "WebRTC": "web R T C",
        "API": "A P I",
    },
)
# SSML for other providers (Azure, Google)
text = """
<speak>
Welcome to <phoneme alphabet="ipa" ph="ˈlaɪv.kɪt">LiveKit</phoneme>.
<break time="500ms"/>
How can I help you today?
</speak>
"""
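Hand-written SSML breaks easily when the spoken text contains characters such as `&` or `<`. The following sketch builds the same kind of markup with escaping and a well-formedness check, using only the standard library. The helper names (`text`, `phoneme`, `pause`, `speak`) are hypothetical, and which SSML tags a provider honors still depends on that provider.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def text(content: str) -> str:
    """Escape plain text so characters like & and < stay valid XML."""
    return escape(content)


def phoneme(content: str, ipa: str) -> str:
    """Wrap a word in a phoneme tag carrying its IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(content)}</phoneme>'


def pause(milliseconds: int) -> str:
    """Insert a silent break of the given length."""
    return f'<break time="{int(milliseconds)}ms"/>'


def speak(*parts: str) -> str:
    """Join parts inside <speak> and verify the result parses as XML."""
    ssml = "<speak>" + "".join(parts) + "</speak>"
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


ssml = speak(
    text("Welcome to "),
    phoneme("LiveKit", "ˈlaɪv.kɪt"),
    text(". "),
    pause(500),
    text("Questions & answers start now."),
)
print(ssml)
```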
Environment Variables
# Cartesia
CARTESIA_API_KEY=your-key
# ElevenLabs
ELEVEN_API_KEY=your-key
# OpenAI
OPENAI_API_KEY=your-key
# Google Cloud
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
# Azure
AZURE_SPEECH_KEY=your-key
AZURE_SPEECH_REGION=eastus
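A missing key usually shows up only when the first synthesis request fails. The following sketch checks the variables listed above at startup. `REQUIRED_ENV` and `missing_env_vars` are hypothetical names, and the mapping simply mirrors this section.

```python
import os
from typing import Mapping, Optional

# Variable names copied from the list above, keyed by provider.
REQUIRED_ENV = {
    "cartesia": ["CARTESIA_API_KEY"],
    "elevenlabs": ["ELEVEN_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "google": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "azure": ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"],
}


def missing_env_vars(
    provider: str, env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown provider {provider!r}")
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


# Passing an explicit mapping makes the check easy to try without real keys.
print(missing_env_vars("azure", {"AZURE_SPEECH_KEY": "your-key"}))
```

Calling `missing_env_vars(provider)` with no mapping checks the real process environment.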
Complete Example
import logging
from dotenv import load_dotenv
from livekit.agents import Agent, AgentServer, AgentSession, JobContext, cli
from livekit.plugins import silero
from livekit.plugins.cartesia import TTS as CartesiaTTS
load_dotenv()
logging.basicConfig(level=logging.INFO)
class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a friendly assistant.
Speak naturally with appropriate emotion."""
        )
    async def on_enter(self):
        # Enthusiastic greeting: raise the emotion, then speak a fixed line
        self.session.tts.update_options(emotion=["positivity:high"])
        await self.session.say("Hello! I'm so excited to help you today!")
server = AgentServer()
@server.rtc_session()
async def create_session(ctx: JobContext):
    tts = CartesiaTTS(
        model="sonic",
        voice="sweet_lady_id",  # Replace with actual voice ID
        speed=1.0,
        emotion=["positivity:medium"],
    )
    # Models are configured on the session itself
    session = AgentSession(
        stt="deepgram/nova-3",
        llm="openai/gpt-4o-mini",
        tts=tts,
        vad=silero.VAD.load(),
    )
    # Log transitions into and out of the "speaking" state
    @session.on("agent_state_changed")
    def on_agent_state_changed(ev):
        if ev.new_state == "speaking":
            logging.info("Agent started speaking")
        elif ev.old_state == "speaking":
            logging.info("Agent finished speaking")
    await session.start(agent=VoiceAgent(), room=ctx.room)
if __name__ == "__main__":
    cli.run_app(server)
Best Practices
- Match voice to persona: Choose voices that fit your agent's character
- Use streaming: Lower latency for real-time conversations
- Control speed: Slow down for clarity, speed up for casual chat
- Test with users: Voice preference is subjective
- Handle pauses: Use breaks for natural speech rhythm
Deliverables
For: $ARGUMENTS
Provide:
- TTS provider selection
- Voice configuration
- Emotion/style settings if applicable
- Streaming setup
- Multi-voice logic if needed
- Integration code