name: path-3-direct-llm
description: Direct-LLM implementation behind the three non-Vapi orb states (UX-26 States 2, 3, 4). Frontend → useLLM → /api/llm → frontend renders text (optionally spoken via Cartesia TTS in State 2 (tts-service.ts), or driven by Soniox STT in State 3). No Vapi orchestration. NOTE on cost tiers: by the dual-button model these states split across Path 2 (States 2/3 — one voice half on) and Path 3 (State 4 — text only); this one implementation spans both, the folder name is historical. Covers any surface running a non-Vapi orb combination — onboarding chat overlay, post-onboarding CHAT, tap-driven LLM consumers. Auto-invoked when working on the non-Vapi orb states (voice_out_only / voice_in_only / text_only), useLLM, /api/llm, /api/stt, or the onboarding parser. NOT for the Vapi orb State 1 (path-1-vapi) and NOT for the check-in async-reflection loop (path-2-async).
user-invocable: false
Path 3 — Direct LLM (Three Non-Vapi Orb States)
This skill covers the Direct-LLM implementation that runs UX-26 States 2, 3, and 4 wherever those orb combinations appear (onboarding chat overlay, post-onboarding CHAT, tap-driven LLM consumers). Path 1 (Vapi) owns State 1 (full-duplex voice).
Cost tier vs. this skill. The dual-button model assigns these states to two cost tiers: State 2 (AI on, mic off) and State 3 (AI off, mic on) are Path 2 (one voice half on); only State 4 (both off) is Path 3 (text only). They share this single implementation: useLLM → /api/llm, plus direct Cartesia TTS (State 2) or direct Soniox STT (State 3). So "path-3-direct-llm" the folder ≠ "Path 3" the cost tier; the name predates the reframe. See voice-architecture/paths.md for the state→path table.
Per gg-spec/docs/global-ux-rules.md:
- State 2 (AI on, mic off): Cartesia TTS via tts-service.ts speaks; user types or taps. (cost tier: Path 2)
- State 3 (AI off, mic on): Soniox STT captures user speech; LLM reply rendered as text only, no audio. (cost tier: Path 2)
- State 4 (both off): pure text in / text out. (cost tier: Path 3)
All three states route through useLLM → /api/llm (or the onboarding parser for screen-bound flows). STT runs Soniox via /api/stt.
User → Frontend → callLLM() → LLM → Frontend renders text → User
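The three states described above can be summarized as a lookup table. This is a minimal sketch, not the project's actual types: the state keys (voice_out_only, voice_in_only, text_only) are taken from this skill's description and are assumed to map to States 2, 3, and 4 in that order; the type names, ORB_STATE_CONFIG, and llmRouteFor are hypothetical.

```typescript
// Hypothetical names throughout; only the state keys, provider names,
// and cost tiers come from this document.
type NonVapiOrbState = "voice_out_only" | "voice_in_only" | "text_only";

interface OrbStateConfig {
  uxState: 2 | 3 | 4;            // UX-26 state number
  tts: "cartesia" | null;        // spoken output provider, if any
  stt: "soniox" | null;          // speech input provider, if any
  costTier: "Path 2" | "Path 3"; // dual-button cost tier
}

const ORB_STATE_CONFIG: Record<NonVapiOrbState, OrbStateConfig> = {
  voice_out_only: { uxState: 2, tts: "cartesia", stt: null, costTier: "Path 2" },
  voice_in_only: { uxState: 3, tts: null, stt: "soniox", costTier: "Path 2" },
  text_only: { uxState: 4, tts: null, stt: null, costTier: "Path 3" },
};

// Every non-Vapi state sends its text through the same LLM route;
// only the voice half (if any) differs.
function llmRouteFor(_state: NonVapiOrbState): string {
  return "/api/llm";
}
```

Keeping the voice provider as data rather than as a branch in the LLM call reflects the point of this skill: one implementation, three orb combinations.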
Reference files
- surfaces.md — surfaces that run the three non-Vapi orb states
- side-effects.md — session_log writes and the "caught up" principle
When Path 3 is the right path
- The surface is in orb State 2, 3, or 4 (anywhere in the app).
- The user is typing into a chat overlay or hearing TTS output without the mic open.
- STT captures user speech but the response is text-only (no TTS).
- Tap-driven LLM consumers (suggestions, summaries, parse-on-submit).
When Path 3 is the wrong path
- The surface is in full duplex (orb State 1) → Path 1 (Vapi).
- The surface is a daily check-in or single-utterance command → Path 2 (async composition, path-2-async).
How callLLM is involved
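callLLM() is the single entry point for every LLM call across all three paths. For Path 3 it is the only dependency of the LLM call itself: no voice provider sits between the frontend and the LLM. Cartesia TTS (State 2) and Soniox STT (State 3) run outside callLLM().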
const response = await callLLM({
  userId,
  screenId,
  userInput: text,
});
A channel discriminator (e.g. 'direct' vs 'vapi') is planned but not implemented — useLLM / src/api/llm.ts do not accept it today.
callLLM() prepends:
- The screen_contexts row for screenId (where the user is)
- The session_log delta since the last callLLM for userId (what happened since last call)
- The base system prompt
Then it calls the LLM provider directly (no Vapi, no Cartesia), records the call timestamp, and returns the response.
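The prepend step can be sketched as a pure function. This is an illustration under assumptions, not the planned callLLM implementation: the interface shapes, the buildMessages name, the section headings, and the ordering (base prompt first, then screen context, then session delta) are all hypothetical.

```typescript
// Illustrative shapes only; the real screen_contexts and session_log
// columns may differ.
interface ScreenContext {
  screenId: string;
  description: string;
}

interface SessionLogEvent {
  eventType: string;
  payload: Record<string, unknown>;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Combine base prompt, screen context, and session delta into one system
// message, followed by the user's input. The ordering is an assumption.
function buildMessages(
  baseSystemPrompt: string,
  screenContext: ScreenContext,
  sessionDelta: SessionLogEvent[],
  userInput: string
): ChatMessage[] {
  const deltaLines = sessionDelta.map(
    (e) => `- ${e.eventType}: ${JSON.stringify(e.payload)}`
  );
  const system = [
    baseSystemPrompt,
    `## Screen context (${screenContext.screenId})\n${screenContext.description}`,
    `## Since last call\n${deltaLines.length > 0 ? deltaLines.join("\n") : "(no new events)"}`,
  ].join("\n\n");
  return [
    { role: "system", content: system },
    { role: "user", content: userInput },
  ];
}
```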
Status: callLLM not built yet (P1-34). Today's text/STT surfaces make ad-hoc OpenAI calls or skip the LLM entirely. The data foundation (screen_contexts, session_log) exists. See voice-architecture/shared.md.
The "caught up" principle
This is the cost optimization that keeps the LLM aware without burning a token on every tap.
Tap-driven actions (add a habit, log a goal, change a preference, accept a suggestion):
- Write to session_log with event_type + payload.
- Do NOT call the LLM right now.
Later, when the next LLM call fires (user types, or speaks in State 3, or opens a surface that calls callLLM):
- callLLM() runs.
- Reads the screen_contexts row for the current screen.
- Reads session_log rows since the last recordLLMCallTimestamp(userId).
- Sees the tap events in the delta.
- Prepends both to the prompt.
- The LLM is fully caught up.
Cheap and aware.
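The write-now, read-later flow can be modeled with an in-memory log. This sketch stands in for the session_log table and is not the project's data layer: the InMemorySessionLog class, its field names, and the explicit now argument are hypothetical; only recordLLMCallTimestamp is a name from this document.

```typescript
// In-memory stand-in for session_log; field names are illustrative.
interface SessionLogEvent {
  userId: string;
  eventType: string;
  payload: Record<string, unknown>;
  createdAt: number; // epoch milliseconds
}

class InMemorySessionLog {
  private events: SessionLogEvent[] = [];
  private lastLLMCall = new Map<string, number>();

  // Tap path: record the event and return. No LLM call happens here.
  write(event: SessionLogEvent): void {
    this.events.push(event);
  }

  // LLM path: everything for this user newer than the last recorded call.
  deltaFor(userId: string): SessionLogEvent[] {
    const since = this.lastLLMCall.get(userId) ?? Number.NEGATIVE_INFINITY;
    return this.events.filter(
      (e) => e.userId === userId && e.createdAt > since
    );
  }

  // Called after each LLM call so the next delta starts from here.
  recordLLMCallTimestamp(userId: string, now: number): void {
    this.lastLLMCall.set(userId, now);
  }
}
```

Several taps cost zero LLM calls; the next real call picks all of them up in a single delta, and the delta is empty again once the timestamp is recorded.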
Trigger points where Path 3 actually calls the LLM
- User submits text in a chat surface → callLLM() runs.
- User finishes speaking in State 3 (STT transcript ready) → callLLM() runs.
- Tap explicitly asks for LLM help ("suggest a new habit", "summarize my week") → callLLM() runs.
- A Path 1 (Vapi) session starts → that path's LLM call also goes through callLLM() (channel discrimination planned, see above).
Everything else writes to session_log and waits.
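The trigger rule for Path 3 surfaces can be expressed as a small routing function. This is a sketch under assumptions: the SurfaceEvent union, the requestsLLM flag, and routeEvent are hypothetical names, and the Path 1 session-start trigger is left out because it belongs to the Vapi path.

```typescript
// Hypothetical event model for a Path 3 surface.
type SurfaceEvent =
  | { kind: "text_submitted"; text: string }
  | { kind: "stt_transcript_ready"; transcript: string }
  | { kind: "tap"; action: string; requestsLLM: boolean };

type Route = "call_llm" | "write_session_log";

// Decide whether an event fires an LLM call or only leaves a log entry.
function routeEvent(event: SurfaceEvent): Route {
  switch (event.kind) {
    case "text_submitted":
    case "stt_transcript_ready":
      // Typed text and finished State 3 transcripts always reach the LLM.
      return "call_llm";
    case "tap":
      // Only taps that explicitly ask for LLM help trigger a call.
      return event.requestsLLM ? "call_llm" : "write_session_log";
  }
}
```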
Anti-patterns
| Pattern | Why it's wrong |
|---|---|
| Calling OpenAI / Anthropic directly from a chat hook | Bypasses callLLM. LLM goes blind to screen_contexts + session_log. |
| Routing every tap through the LLM "for awareness" | Burns tokens. Use session_log instead — LLM reads delta on next call. |
| Skipping session_log writes "to save a request" | Breaks the delta. LLM gets out of sync. Always write the event. |
| Adding a fourth path | The doc forbids it. Extend callLLM (planned channel discriminator) or use a tool webhook (Path 1) or a composition piece (path-2-async). |
Side effects
Path 3 actions can still trigger the same side-effect pattern as Path 1 and path-2-async — when the LLM's response includes a CRUD intent, it routes through the same ActionDispatcher → DataService → Supabase chain. Base tool names: update_profile, navigate_next, log_event, get_user_context.
For tap-driven flows that are pure CRUD (no LLM), call DataService directly and write to session_log so the next LLM call is aware. See side-effects.md for the exact pattern.
Onboarding tool-calling
On ONBOARD-* screens the LLM receives a per-screen tool set and the base tools are excluded. Eight tools in api/_lib/llm/onboarding/schemas.ts:
- submit_profile (ONBOARD-01--FORM)
- submit_path_choice (ONBOARD-FORK--FORM)
- submit_category (ONBOARD-BEGINNER-01)
- submit_goals (ONBOARD-BEGINNER-02)
- add_habit / remove_habit (ONBOARD-BEGINNER-03)
- submit_reflection_config (ONBOARD-BEGINNER-07)
- submit_brain_dump (ONBOARD-ADVANCED)
Gating: getOnboardingTools(screen_id) in api/_lib/llm/onboarding/registry.ts. The allowedToolNames gate in api/llm/[...path].ts rejects hallucinated base-tool calls on these screens.
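The gating idea can be sketched as follows. This is not the code in registry.ts or api/llm/[...path].ts: the screen-to-tool pairs and base tool names are taken from this document, but ONBOARDING_TOOLS, allowedToolNamesFor, and filterToolCalls are hypothetical names, and treating any screen id with the ONBOARD- prefix as an onboarding screen is an assumption.

```typescript
// Screen-to-tool pairs as listed in this document; structure is illustrative.
const ONBOARDING_TOOLS: Record<string, string[]> = {
  "ONBOARD-01--FORM": ["submit_profile"],
  "ONBOARD-FORK--FORM": ["submit_path_choice"],
  "ONBOARD-BEGINNER-01": ["submit_category"],
  "ONBOARD-BEGINNER-02": ["submit_goals"],
  "ONBOARD-BEGINNER-03": ["add_habit", "remove_habit"],
  "ONBOARD-BEGINNER-07": ["submit_reflection_config"],
  "ONBOARD-ADVANCED": ["submit_brain_dump"],
};

const BASE_TOOLS = ["update_profile", "navigate_next", "log_event", "get_user_context"];

// Onboarding screens get only their per-screen tools; base tools are excluded.
function allowedToolNamesFor(screenId: string): Set<string> {
  if (screenId.startsWith("ONBOARD-")) {
    return new Set(ONBOARDING_TOOLS[screenId] ?? []);
  }
  return new Set(BASE_TOOLS);
}

// Split the model's tool calls into accepted and rejected (hallucinated) ones.
function filterToolCalls(screenId: string, calls: { name: string }[]) {
  const allowed = allowedToolNamesFor(screenId);
  return {
    accepted: calls.filter((c) => allowed.has(c.name)),
    rejected: calls.filter((c) => !allowed.has(c.name)),
  };
}
```

For example, a navigate_next call returned on ONBOARD-BEGINNER-03 would land in rejected, while add_habit would be accepted.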
Eager-call directive: ONBOARDING_TOOL_ADDENDUM in api/_lib/llm/onboarding/systemPromptAddendum.ts. The prompt also injects an "## Already-Filled Fields" block from onboarding_states.data so the LLM doesn't re-ask for them after a session restart.
Handlers UPSERT into onboarding_states keyed on anon_id with GREATEST(current_step, X) monotonic guards and JSONB || merges. See side-effects.md for the full chain.
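The semantics of the monotonic guard and the merge can be illustrated in memory. This sketch models the behavior only and is not the handlers' SQL: the OnboardingState shape, the Map-based store, and upsertOnboardingState are hypothetical. Math.max plays the role of GREATEST(current_step, X), and the object spread mirrors the JSONB || operator, which overwrites matching top-level keys and keeps the rest.

```typescript
// Illustrative row shape; the real onboarding_states columns may differ.
interface OnboardingState {
  anonId: string;
  currentStep: number;
  data: Record<string, unknown>;
}

function upsertOnboardingState(
  store: Map<string, OnboardingState>,
  anonId: string,
  step: number,
  patch: Record<string, unknown>
): OnboardingState {
  const existing = store.get(anonId);
  const next: OnboardingState = {
    anonId,
    // Monotonic guard: the stored step never moves backwards.
    currentStep: Math.max(existing?.currentStep ?? 0, step),
    // Shallow merge: patch keys win, untouched keys survive.
    data: { ...(existing?.data ?? {}), ...patch },
  };
  store.set(anonId, next);
  return next;
}
```

A late or replayed call for an earlier screen therefore adds its fields without rewinding the user's progress.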
Relationship to the other paths
| | Path 3 calls | Other paths call |
|---|---|---|
| callLLM() | yes, direct (channel discriminator planned, not implemented) | Path 1: yes (via Vapi); path-2-async: yes (same callLLM entry) |
| screen_contexts table | yes, via callLLM ctx | yes, same |
| session_log table | yes: read (delta) + write (tap events) | yes: read on each callLLM; write at meaningful points |
| ActionDispatcher | optional, when LLM returns a CRUD intent (update_profile, navigate_next, log_event) | path-2-async: yes (single-utterance commands). Path 1: replaced by Vapi tool webhooks. |
| Voice infra (Vapi / Soniox / Cartesia) | State 4: none. State 2: Cartesia Sonic out. State 3: Soniox STT in. | Path 1: Vapi (Soniox + Cartesia inside). path-2-async: Soniox + Cartesia Sonic. |