
---
name: "voice-agents"
description: 'Build low-latency voice agents using speech-to-speech (S2S) realtime APIs and cascaded STT->LLM->TTS pipelines. Covers OpenAI Realtime / GPT Realtime, Azure Voice Live, Gemini Live, Deepgram Voice Agent, ElevenLabs Conversational AI; turn-taking, barge-in, latency budgets, tool use during speech, and telephony (Twilio, LiveKit, Pipecat).'
metadata:
  author: "AgentX"
  version: "1.0.0"
  created: "2026-04-30"
  updated: "2026-04-30"
compatibility:
  frameworks: ["openai-realtime", "azure-voice-live", "gemini-live", "deepgram-voice-agent", "elevenlabs-conversational", "livekit-agents", "pipecat", "twilio", "vocode"]
  languages: ["python", "typescript"]
---

Voice Agents

Purpose: Build voice experiences that feel like a phone call, not a chatbot reading aloud.

When to Use This Skill

Real-time voice assistants (web, mobile, kiosk)
Telephony agents (inbound / outbound calls, IVR replacement)
Multimodal copilots that speak and listen
Accessibility features that require interruption / barge-in

Architecture Choice

Speech-to-Speech (S2S, single model)
   audio in -> [Realtime model] -> audio out
   Pros: <500ms latency, natural prosody, can hear tone
   Cons: fewer model choices, higher cost, fewer guardrail hooks

Cascaded
   audio in -> [STT] -> [LLM] -> [TTS] -> audio out
   Pros: any LLM, full control, cheaper, easier guardrails
   Cons: 800-1500ms latency, lost prosody, brittle endpointing

Pick S2S when latency and natural turn-taking dominate. Pick cascaded when you need a specific LLM, tighter cost control, or richer logging.

Provider Snapshot (April 2026)

Provider	Mode	Notes
OpenAI Realtime / GPT Realtime	S2S	WebRTC, voice activity detection, function calling
Azure Voice Live	S2S	Foundry-integrated, Azure RAI controls
Gemini Live	S2S	Multimodal (audio + video), tool use
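Deepgram Voice Agent	Bundled cascade (single API)	STT + LLM + TTS behind one streaming API; telephony focus, low latency
ElevenLabs Conversational AI	Bundled cascade (single API)	STT + LLM + TTS with managed turn-taking; strong TTS quality, broad voice catalog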
Cartesia, Rime, Deepgram Aura	TTS	For cascaded pipelines
AssemblyAI, Deepgram Nova	STT	For cascaded pipelines
LiveKit Agents, Pipecat, Vocode	Orchestration	Wire the pipeline together

Latency Budget

End-to-end target for natural conversation: <800ms from the end of the user's speech to the start of the agent's audio (mouth-close to mouth-open).

Stage	Cascaded budget	S2S budget
Endpointing (silence detection)	200-300ms	100-200ms
STT first-token	150-300ms	n/a
LLM first-token	200-500ms	200-500ms
TTS first-audio	150-300ms	bundled
Network jitter	50-150ms	50-150ms
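
To see how the per-stage numbers add up, the table above can be summed per architecture. This is a minimal sketch: the dictionary simply copies the ranges from the table, and the names `BUDGET_MS` and `total_budget` are made up for illustration. It assumes the stages run strictly in sequence, which is the worst case; streaming overlaps some of them in practice.

```python
# Per-stage budgets in milliseconds as (low, high), copied from the table above.
# None marks a stage that does not apply (n/a) or is bundled into the model.
BUDGET_MS = {
    "endpointing": {"cascaded": (200, 300), "s2s": (100, 200)},
    "stt_first_token": {"cascaded": (150, 300), "s2s": None},
    "llm_first_token": {"cascaded": (200, 500), "s2s": (200, 500)},
    "tts_first_audio": {"cascaded": (150, 300), "s2s": None},
    "network_jitter": {"cascaded": (50, 150), "s2s": (50, 150)},
}


def total_budget(mode: str) -> tuple[int, int]:
    """Sum the low and high ends of every stage that applies to `mode`."""
    low = high = 0
    for stage in BUDGET_MS.values():
        budget = stage[mode]
        if budget is None:
            continue  # stage is absent or folded into another stage
        low += budget[0]
        high += budget[1]
    return low, high


if __name__ == "__main__":
    for mode in ("cascaded", "s2s"):
        print(mode, total_budget(mode))
```

Summed this way, the cascaded range is 750-1550ms and the S2S range is 350-850ms, so a cascaded pipeline clears the 800ms target only near the low end of every stage.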

Tactics: stream at every stage, feed partial STT results into the LLM and partial LLM output into TTS, use regional endpoints, and prefer WebRTC to WebSocket for client audio.
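
One way to feed partial LLM output into TTS is to cut the token stream at sentence boundaries and hand each sentence to the synthesizer as soon as it is complete. The sketch below is illustrative only: `sentences_from_tokens` is a hypothetical helper, and the regex-based boundary rule is a simplifying assumption that will split abbreviations such as "Dr." incorrectly.

```python
import re
from typing import Iterable, Iterator

# A boundary is sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the streamed tokens contain one."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is known to be a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Your order", " shipped.", " It arrives", " Friday", "."]
    for sentence in sentences_from_tokens(llm_tokens):
        print("send to TTS:", sentence)
```

The first sentence reaches TTS while the LLM is still generating the second, so TTS first-audio latency is paid once per turn rather than after the full response.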

Turn-Taking and Barge-In

Server-side voice activity detection (VAD) is mandatory; do not rely on client only
Barge-in: when user starts speaking, cancel TTS playback and the in-flight LLM/TTS request, then handle new input
Distinguish real interruptions from back-channel ("uh-huh", "right") -- do not stop speaking on back-channel
Endpointing: combine VAD silence with semantic completeness so that a natural mid-sentence pause does not end the user's turn
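
The barge-in and back-channel rules above can be sketched with `asyncio` task cancellation. Everything here is an assumption for illustration: `TurnController`, `is_backchannel`, the word list, and `fake_tts_playback` are hypothetical names, and a production system would use a classifier or timing cues rather than a fixed word list.

```python
import asyncio

# Hypothetical back-channel vocabulary; real systems need something richer.
BACKCHANNELS = {"uh-huh", "mm-hmm", "right", "yeah", "ok", "okay"}


def is_backchannel(partial_transcript: str) -> bool:
    """True for very short utterances made only of acknowledgement words."""
    words = [w.strip(".,!?") for w in partial_transcript.lower().split()]
    return 0 < len(words) <= 2 and all(w in BACKCHANNELS for w in words)


class TurnController:
    def __init__(self) -> None:
        self.agent_task = None  # the in-flight LLM/TTS task, if any

    def start_agent_turn(self, coro) -> asyncio.Task:
        self.agent_task = asyncio.create_task(coro)
        return self.agent_task

    async def on_user_speech(self, partial_transcript: str) -> bool:
        """Cancel the agent turn on a real interruption; return True if cancelled."""
        if is_backchannel(partial_transcript):
            return False  # keep talking through "uh-huh"
        task = self.agent_task
        if task is None or task.done():
            return False
        task.cancel()
        try:
            await task  # wait until playback has actually stopped
        except asyncio.CancelledError:
            pass
        return True


async def fake_tts_playback(played: list) -> None:
    """Stand-in for streaming TTS audio to the caller, one chunk at a time."""
    for chunk in ["Your", "order", "ships", "Friday"]:
        played.append(chunk)
        await asyncio.sleep(0.05)


async def demo():
    played = []
    controller = TurnController()
    controller.start_agent_turn(fake_tts_playback(played))
    await asyncio.sleep(0.06)
    ignored = await controller.on_user_speech("uh-huh")
    interrupted = await controller.on_user_speech("wait, cancel that")
    return ignored, interrupted, played


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Awaiting the cancelled task before handling the new input matters: it ensures no stale audio chunk is sent after the user has taken the floor.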

Tool Use During Speech

Announce slow tools ("let me check that...") to fill latency
Use parallel tool calls when independent (see tool-use-and-function-calling)
For long-running tools (>3s), stream interim audio updates
Confirm destructive tool calls verbally ("you'd like me to cancel order 123, correct?")
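
A minimal sketch of the "announce slow tools" rule: start the tool, and speak a filler phrase only if it has not returned within a short threshold. The names `call_tool_with_filler`, `say`, and the 0.5s default are assumptions for illustration, not part of any provider SDK.

```python
import asyncio
from typing import Any, Awaitable, Callable

FILLER_AFTER_S = 0.5  # assumed threshold before the silence feels awkward


async def call_tool_with_filler(
    tool: Callable[[], Awaitable[Any]],
    say: Callable[[str], Awaitable[None]],
    filler: str = "Let me check that...",
    filler_after_s: float = FILLER_AFTER_S,
) -> Any:
    """Run `tool`; speak `filler` only if it is still running after the threshold."""
    task = asyncio.ensure_future(tool())
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await say(filler)  # fill the gap while the tool keeps running
    return await task


async def demo():
    spoken = []

    async def say(text: str) -> None:
        spoken.append(text)  # stand-in for sending text to TTS

    async def slow_lookup() -> dict:
        await asyncio.sleep(0.2)
        return {"status": "shipped"}

    async def fast_lookup() -> dict:
        return {"status": "ok"}

    slow = await call_tool_with_filler(slow_lookup, say, filler_after_s=0.05)
    fast = await call_tool_with_filler(fast_lookup, say, filler_after_s=0.05)
    return slow, fast, spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Fast tools return without any filler, so the agent does not say "let me check" before an answer it already has.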

Telephony Specifics

Codec: G.711 mu-law / PCMU at 8kHz; some platforms upsample to 16kHz
DTMF: handle keypad input alongside voice
Dropped-call handling: persist conversation state for callback
Recording / consent: verbal disclosure required in many jurisdictions
E911 / regulated routing: never route emergencies through an LLM agent
Stack: Twilio Voice + LiveKit Agents / Pipecat is the common production combo
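
For the codec bullet above, this sketch decodes G.711 mu-law bytes to 16-bit PCM and doubles the sample rate with naive linear interpolation. It uses only the standard library. The JSON shape in `decode_media_message` (an `event` of `"media"` carrying a base64 `media.payload`) is modeled on Twilio Media Streams and should be treated as an assumption to check against the platform's documentation; a production pipeline would also use a proper resampling filter.

```python
import base64
import json
import struct


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                      # mu-law bytes are stored inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 values once; decoding is then a table lookup.
_ULAW_TABLE = [ulaw_byte_to_pcm16(b) for b in range(256)]


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Decode mu-law bytes to little-endian 16-bit PCM."""
    samples = [_ULAW_TABLE[b] for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


def upsample_2x(pcm16: bytes) -> bytes:
    """Naive 8kHz -> 16kHz: insert the midpoint between neighbouring samples."""
    n = len(pcm16) // 2
    samples = struct.unpack(f"<{n}h", pcm16)
    out = []
    for i, sample in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < n else sample
        out.append(sample)
        out.append((sample + nxt) // 2)
    return struct.pack(f"<{len(out)}h", *out)


def decode_media_message(raw: str) -> bytes:
    """Return PCM16 audio for a media frame, or b'' for any other event."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return b""
    return ulaw_to_pcm16(base64.b64decode(msg["media"]["payload"]))


if __name__ == "__main__":
    frame = json.dumps(
        {"event": "media", "media": {"payload": base64.b64encode(b"\xff\x80").decode()}}
    )
    print(len(upsample_2x(decode_media_message(frame))), "bytes of 16kHz PCM")
```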

Safety and Compliance

PII redaction in transcripts before storage
Voice biometrics: do not use raw voice as auth without consent and a fallback
Consent for recording (jurisdiction dependent)
Hallucination risk is amplified in voice -- a fluent, confident voice makes errors harder to notice, so ground responses in tool or retrieval results
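
As an illustration of redacting PII before a transcript is stored, the sketch below masks emails, card-like digit runs, and phone-like numbers with regular expressions. The patterns and the `redact_transcript` name are assumptions for this example; they are deliberately simple, will miss spoken-out numbers ("four one five..."), and are not a substitute for a dedicated PII detection service.

```python
import re

# Order matters: mask emails and long card-like digit runs before phone numbers.
_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    ("PHONE", re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")),
]


def redact_transcript(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in _PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    line = "Call me at +1 (415) 555-0100 or jane.doe@example.com"
    print(redact_transcript(line))
```

Running redaction at the storage boundary, rather than inside the live loop, keeps the conversation itself unaffected while still keeping raw identifiers out of logs.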

Eval

Word Error Rate (WER) per accent / language slice
Response latency at p50 / p95
Barge-in handling success rate
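Turn-taking errors: false cuts (agent starts speaking before the user has finished) and false misses (agent fails to respond after the user has finished)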
Task-completion rate (not just transcript correctness)
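
Two of the metrics above are simple to compute offline. The sketch below implements WER as word-level edit distance divided by reference length, and a nearest-rank percentile for latency samples. The function names and the sample data are illustrative; text normalization (punctuation, numerals) is left out and is assumed to happen before scoring.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    print(word_error_rate("cancel order one two three", "cancel order one too three"))
    latencies_ms = [420, 510, 480, 950, 610, 530, 700, 460, 1200, 560]
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

Computing WER per accent or language slice means calling `word_error_rate` on each slice separately and reporting the slices side by side rather than averaging them into one number.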

Skills to Load Alongside

Need	Skill
Tool calls inside the voice loop	`tool-use-and-function-calling`
Latency / cost telemetry	`agent-observability`
Choosing the right LLM behind the voice	`reasoning-models`, `llm-gateway-and-routing`
Guardrails for voice-bound output	`ai-safety-and-red-teaming`

References

OpenAI Realtime / GPT Realtime docs
Azure Voice Live, Gemini Live, Deepgram Voice Agent docs
LiveKit Agents, Pipecat