name: voice
description: Voice input/output, transcription, TTS, API keys; manages /.voice/voice_config.json.
serve: PORT=$(curl -sk -X POST https://localhost:$VESTAD_PORT/agents/$AGENT_NAME/services -H "X-Agent-Token: $AGENT_TOKEN" -H 'Content-Type: application/json' -d '{"name":"voice"}' | python3 -c "import sys,json; print(json.load(sys.stdin)['port'])") && SKILL_PORT=$PORT PYTHONPATH=/agent/skills screen -dmS voice uv run python -m voice.server
Voice setup (STT/TTS)
Voice lets the user talk to you through the mic and hear your responses spoken aloud in the Vesta app.
Once configured, the user can manage voice settings directly from the agent settings page in the app, including changing voices, listening to voice previews, toggling STT/TTS on or off, and adjusting sensitivity. Let them know this after setup.
When to offer setup
- User mentions voice, microphone, speaking aloud, hearing you, TTS, STT, transcription
- User complains the mic button is disabled or they can't hear you
- New container, user hasn't set up voice yet. Offer once early, then drop it (track in memory so you don't nag)
The setup flow
- Ask which they want: Deepgram for input (speech-to-text), ElevenLabs for output (text-to-speech). Both are independent; the user may configure only one.
- Walk them through getting a key: see SETUP.md for the per-provider link and where to find the key.
- Validate the key before saving (shown for Deepgram; for ElevenLabs use --provider elevenlabs):
  uv run ~/agent/skills/voice/scripts/voice_keys.py validate --provider deepgram --key <key>
- Save the key (for ElevenLabs use --domain tts --provider elevenlabs):
  uv run ~/agent/skills/voice/scripts/voice_keys.py set-key --domain stt --provider deepgram --key <key>
- Pick a voice (TTS only). Ask the user if they'd prefer a male or female voice, then set an appropriate default:
  - Male voices: Roger (laid-back), Charlie (deep, Australian), George (warm, British), Liam (energetic), Chris (charming), Brian (deep, resonant), Daniel (steady, British)
  - Female voices: Sarah (mature), Laura (enthusiastic), Alice (clear, British), Matilda (professional), Jessica (playful), Lily (velvety, British)
  uv run ~/agent/skills/voice/scripts/voice_keys.py set-voice --id <voice_id>
  Let them know they can browse all voices and listen to previews in the app settings later.
- Ensure the voice server is running. The app fetches config from it. Check with:
  screen -ls | grep voice
  If it's not running, start it:
  PORT=$(curl -sk -X POST https://localhost:$VESTAD_PORT/agents/$AGENT_NAME/services -H "X-Agent-Token: $AGENT_TOKEN" -H 'Content-Type: application/json' -d '{"name":"voice"}' | python3 -c "import sys,json; print(json.load(sys.stdin)['port'])")
  SKILL_PORT=$PORT PYTHONPATH=~/agent/skills screen -dmS voice uv run python -m voice.server
- Confirm, e.g. "Voice is ready! You can use the mic button now. You can also change voices, listen to previews, and tweak settings from the settings page in the app."
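The start command above pulls the assigned port out of the JSON returned by the services endpoint. A minimal Python sketch of that extraction step, assuming the response is a JSON object with a port field (as the shell one-liner implies). The function name parse_service_port and the sample body are hypothetical, for illustration only.

```python
import json


def parse_service_port(response_body: str) -> int:
    """Extract the assigned port from a service-registration response.

    Assumes the body is a JSON object with a "port" field, which is
    what the shell one-liner in the setup flow reads.
    """
    payload = json.loads(response_body)
    if not isinstance(payload, dict) or "port" not in payload:
        raise ValueError(f"no 'port' field in response: {payload!r}")
    return int(payload["port"])


# Hypothetical response body, not captured from a real server.
print(parse_service_port('{"port": 8123}'))
```

Failing loudly on a missing field avoids starting the voice server with an empty SKILL_PORT.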
Commands
# See current state
uv run ~/agent/skills/voice/scripts/voice_keys.py status
# Keys
uv run ~/agent/skills/voice/scripts/voice_keys.py validate --provider {deepgram|elevenlabs} --key <k>
uv run ~/agent/skills/voice/scripts/voice_keys.py set-key --domain {stt|tts} --provider {deepgram|elevenlabs} --key <k>
uv run ~/agent/skills/voice/scripts/voice_keys.py clear --domain {stt|tts} # removes provider + keys entirely
# Enable/disable (keeps configuration intact, just toggles on/off)
uv run ~/agent/skills/voice/scripts/voice_keys.py enable --domain {stt|tts}
uv run ~/agent/skills/voice/scripts/voice_keys.py disable --domain {stt|tts}
# TTS voice selection
uv run ~/agent/skills/voice/scripts/voice_keys.py set-voice --id <voice_id>
uv run ~/agent/skills/voice/scripts/voice_keys.py add-voice --id <voice_id> --name <name> --description "..."
uv run ~/agent/skills/voice/scripts/voice_keys.py remove-voice --id <voice_id>
# STT keyterms (words the transcription should bias toward)
uv run ~/agent/skills/voice/scripts/voice_keys.py add-keyterm <term>
uv run ~/agent/skills/voice/scripts/voice_keys.py remove-keyterm <term>
# STT end-of-turn tuning
uv run ~/agent/skills/voice/scripts/voice_keys.py set-eot --threshold 0.8
uv run ~/agent/skills/voice/scripts/voice_keys.py set-eot --timeout-ms 10000
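If you drive these commands from Python rather than a shell, a small argv builder keeps the flag spelling consistent. This is a sketch, not part of the skill: the helper name voice_keys_cmd is hypothetical, and it only assembles the subcommands and flags listed above without running anything.

```python
import os

# Script path as used throughout this document; "~" is expanded here
# because subprocess does not perform shell tilde expansion.
SCRIPT = os.path.expanduser("~/agent/skills/voice/scripts/voice_keys.py")


def voice_keys_cmd(subcommand, *positional, **flags):
    """Build an argv list for voice_keys.py (hypothetical helper).

    Keyword names map to long flags: timeout_ms=10000 -> --timeout-ms 10000.
    """
    argv = ["uv", "run", SCRIPT, subcommand, *positional]
    for name, value in flags.items():
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


print(voice_keys_cmd("set-eot", threshold=0.8)[3:])
print(voice_keys_cmd("add-keyterm", "Vesta")[3:])
```

The resulting list can be passed to subprocess.run(..., check=True) to execute the command.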
Common asks
- "Disable TTS / stop speaking" → disable --domain tts (keeps keys, just turns it off)
- "Enable TTS / start speaking again" → enable --domain tts
- "Disable STT / turn off the mic" → disable --domain stt
- "Remove voice completely" → clear --domain tts (wipes provider + keys)
- "I want you to sound like <name>" → set-voice --id <matching voice_id from status> (or tell them they can browse and preview voices in the app settings)
- "Make sure you recognize '{AGENT_NAME}'" → add-keyterm {AGENT_NAME}
- "Finalize my turns faster" → lower --threshold (e.g. 0.6)
- "Stop cutting me off" → raise --threshold (e.g. 0.9) or raise --timeout-ms
Providers
Deepgram (STT, voice input)
- Domain: stt, provider name: deepgram
- Model: flux-general-en (~$0.0048/min)
- New accounts get $200 free credit
- Keyterms bias the transcription toward specific words (e.g. the agent's name)
- End-of-turn detection is tuned via --threshold (confidence, 0-1) and --timeout-ms (silence timeout)
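If the user asks how far the free credit goes, the arithmetic is simple. A rough sketch using only the two figures quoted above; both are approximate and may change, so treat the result as an estimate.

```python
# Figures quoted in this section; both are approximate.
PRICE_PER_MINUTE_USD = 0.0048
FREE_CREDIT_USD = 200.0


def minutes_covered(credit_usd, price_per_minute=PRICE_PER_MINUTE_USD):
    """Minutes of transcription a given credit covers at a flat per-minute rate."""
    return credit_usd / price_per_minute


minutes = minutes_covered(FREE_CREDIT_USD)
print(f"{minutes:,.0f} minutes (~{minutes / 60:,.0f} hours)")
```

At the quoted rate this comes to roughly 41,667 minutes, or about 694 hours of audio.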
ElevenLabs (TTS, voice output)
- Domain: tts, provider name: elevenlabs
- Model: eleven_flash_v2_5, output format: mp3_22050_32
- Free tier: 10k characters/month
- Ships with premade voices; users can also add custom/cloned voices from their ElevenLabs account
- Adding a voice: when the user provides an ElevenLabs voice ID without a name or description, fetch them from the API before calling add-voice:
  curl -s https://api.elevenlabs.io/v1/voices/<id> | python3 -c "import sys,json; v=json.load(sys.stdin); l=v.get('labels',{}); print(v.get('name','')); print(', '.join(p for p in [l.get('description',''),l.get('accent',''),l.get('gender','')] if p))"
  Use the first line as --name and the second as --description. If the fetch fails, ask the user.
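The same name and description extraction can be written as a small Python function, which is easier to read and test than the one-liner. This is a sketch: the function name is hypothetical, and the sample payload is made up for illustration rather than taken from a real API response. It assumes the response carries name and labels fields, as the one-liner does.

```python
import json


def voice_name_and_description(voice: dict) -> tuple:
    """Return (name, description) for add-voice from a voice JSON object.

    The description joins whichever of the description, accent and
    gender labels are present, in that order.
    """
    labels = voice.get("labels") or {}  # tolerate a missing or null labels field
    parts = [
        labels.get("description", ""),
        labels.get("accent", ""),
        labels.get("gender", ""),
    ]
    return voice.get("name", ""), ", ".join(p for p in parts if p)


# Hypothetical payload, for illustration only.
sample = json.loads(
    '{"name": "Example", "labels": {"description": "warm", "accent": "british"}}'
)
print(voice_name_and_description(sample))
```

An empty name in the result is a signal to ask the user instead of calling add-voice.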