name: voice-architecture description: Umbrella reference for how voice + chat work in the Guided Growth app. Three cost-tier paths keyed to the UX-26 dual-button orb — Path 1 Vapi (both halves on, State 1, full-duplex), Path 2 async voice (exactly one voice half on, States 2/3 — Soniox STT in or Cartesia Sonic out), Path 3 Direct LLM text only (both halves off, State 4). STT is Soniox; TTS is Cartesia Sonic 3.5. Read this when asking "which path do I use?", reasoning about callLLM / screen_contexts / session_log, choosing between Vapi and Cartesia, or comparing the paths. Per-path implementation details live in path-1-vapi / path-2-async / path-3-direct-llm. user-invocable: false
Voice & Chat — Three-Path Architecture
The three paths are cost tiers keyed to the UX-26 dual-button orb, not screen groups. The orb has two independent halves — AI-output (Cartesia TTS) and mic (Soniox STT). The live button-state picks the path; a surface has a default, but the moment a user flips a half, the path changes under it.
PATH 1 — Vapi (State 1: both halves on) full STT + LLM + TTS bundled in Vapi assistant
User ⇄ Frontend ⇄ Vapi ⇄ User realtime, bidirectional, tool webhooks for side effects
STT Soniox · LLM OpenAI inside Vapi (no BYO/callLLM; context via variableValues) · TTS Cartesia Sonic 3.5
PATH 2 — async voice (State 2 or 3: one voice half on) single voice half, asynchronous turns
State 2 (AI on, mic off): User → callLLM → LLM → Cartesia Sonic 3.5 (or MP3) → User
State 3 (AI off, mic on): User speaks → Soniox STT → callLLM → LLM → text → User
Check-ins are ONE pattern here (State 2 prompt then State 3 reply), not the whole tier.
PATH 3 — Direct LLM (State 4: both halves off) text in, text out, no voice
User → Frontend → callLLM → LLM → Frontend renders text → User
Boundary rule: any voice present is at least Path 2; dropping one Vapi half falls to Path 2, never straight to Path 3. Only both halves off reaches Path 3.
Cost tier vs. implementation. The path-* skill folders are named by implementation and predate this cost-tier reframe: path-1-vapi = the Vapi loop (tier Path 1); path-2-async = the check-in async-reflection pipeline (one tier-Path-2 implementation); path-3-direct-llm = the Direct-LLM implementation (useLLM → /api/llm, with optional direct Soniox STT / Cartesia TTS) that backs orb States 2, 3, and 4 — i.e. tier Path 2 (States 2/3) and tier Path 3 (State 4). So one implementation spans two cost tiers; the folder name is historical.
Reference files
- paths.md — full diagram, decision matrix, per-surface mapping
- shared.md —
callLLM(),screen_contexts,session_log, side-effect pattern (tool → DB → Realtime → UI) - glossary.md — Soniox (STT) vs Cartesia (Sonic 3.5 TTS, legacy Line) vs Vapi — what each thing actually is
When this skill is the right one to read
- "Which path does [screen X] use?" → paths.md
- "What is callLLM?" / "Where does screen_contexts get used?" → shared.md
- "Soniox vs Sonic vs Line?" / "Vapi vs Cartesia?" → glossary.md
- "Why are some flows tap-driven and others voice?" → shared.md ("caught up" principle)
When to skip this and read a path skill
| Working on | Skill |
|---|---|
| Vapi assistant config, onboarding voice, gcartesia-agents repo, useOnboardingAgent, useRealtimeVoice | path-1-vapi |
| Morning/evening check-in voice, MP3 + Sonic composition, Cartesia REST endpoints, useVoiceCommand/Chat/Input, ActionDispatcher | path-2-async |
| The Direct-LLM implementation behind the non-Vapi orb states (States 2/3/4 — tier Path 2 + Path 3), useLLM consumers, /api/llm, tap-driven LLM use | path-3-direct-llm |
Migration posture
The repo is mid-migration. Vapi (@vapi-ai/web) and Soniox STT (soniox-stream.ts, /api/stt) are live; the dual-button orb-state model (src/lib/orb/orbState.ts) is implemented. The legacy Cartesia Line agent (gcartesia-agents/) is retiring. Target is the diagram above: Vapi for Path 1, callLLM-orchestrated MP3 + Cartesia Sonic 3.5 for Path 2 (one voice half), callLLM-only for Path 3 (text, State 4).
Each path skill leads with its target state. Current code is documented as legacy (current-cartesia-*.md files inside each path) — preserve while reading existing code, do not extend.
Source of truth
Product-side reference: ~/Documents/Upwork/YA/Voice_System_Implementation_Guide.md (v6.0, April 2026). The diagram in this skill is the engineering view of that doc.