name: voice-transcription description: >- Framework-wide voice dictation in the agent sidebar composer. Use when changing composer microphone UX, the transcribe-voice route, or the Voice Transcription settings section. Covers transcription-source routing, cleanup routing, Google realtime gating, and the voice transcription application-state keys. scope: dev metadata: internal: true
Voice Transcription
Click-to-toggle microphone inside the sidebar composer turns speech into
text. Users configure live transcription separately from AI cleanup in
Settings → Voice Transcription. The feature is available in every template
that renders TiptapComposer.
UX rules
- Always show the mic alongside the send button. Cursor replaces send with mic when the composer is empty; their users complain. We keep both visible — Lovable does the same.
- Click-to-toggle, not push-to-talk. More forgiving in a sidebar, avoids
host-app hotkey clashes. Keyboard shortcut is
Cmd/Ctrl+Shift+MandEscapecancels mid-recording. - Transcript lands in the composer, editable, never auto-sent. Insert at
the caret via
editor.chain().focus().insertContent(text).run(). - No CSS transitions for the recording state. Framework rule; use static
brand color (
#625DF5) instead of pulses. - Icon: Tabler
IconMicrophone(idle) /IconPlayerStopFilled(recording). Never use a sparkle or robot icon. - Errors via inline alert or toast, never
window.alert.
Source And Cleanup
Settings must keep these as separate choices:
- Live transcription source:
mac-native,google-realtime, orbatch. - AI cleanup: independent off/on toggle. Cleanup uses managed Gemini first
when a managed AI services connection is configured, then BYOK Gemini (
GEMINI_API_KEY). Gemini cleanup/title/summary generation is not a live STT source.
application_state["voice-transcription-prefs"] stores
{ transcriptionMode, provider, instructions }. The legacy provider field
is still written for old clients and batch provider preferences:
| Value | Meaning | Needs key |
|---|---|---|
mac-native |
Native macOS/Tauri speech path; web clients normalize to browser-native where needed | No |
google-realtime |
Dedicated WebSocket → Google Speech-to-Text gRPC StreamingRecognize path |
GOOGLE_APPLICATION_CREDENTIALS |
batch |
Upload audio after stop through the existing batch route | Builder/Gemini/Groq/OpenAI depending on fallback |
auto provider |
Browser SpeechRecognition when supported; server batch fallback chain otherwise | No key needed in browsers that support SpeechRecognition |
builder-gemini |
Managed Gemini Flash-Lite batch/cleanup preference | Managed AI services account connected |
gemini |
Direct Google Gemini BYOK batch/cleanup preference | GEMINI_API_KEY |
groq |
Groq Whisper batch preference | GROQ_API_KEY |
openai |
OpenAI Whisper batch preference | OPENAI_API_KEY |
browser |
Legacy native/browser live speech preference | No |
Default behavior:
- The shared web settings/composer default to Batch /
auto. Inautomode,useVoiceDictationusesstartBrowser()(Web Speech API, no key required, incremental streaming) when the browser supportsSpeechRecognition. It only falls back to the MediaRecorder → server upload path whenSpeechRecognitionis not available (e.g. Firefox). This means dictation works out of the box in Chrome, Edge, and Safari without any API key configuration. - Dedicated macOS Tauri-native surfaces may save
mac-native, but do not assume the shared React settings default to it. - Old stored
buildervalues are treated asbuilder-gemini. - Old stored
browservalues are treated asmac-native. - Saved
google-realtimepreferences must never hit/_agent-native/transcribe-voice. They go through the dedicated session bridgePOST /_agent-native/transcribe-stream/session, which mints an opaque ai-services websocket session and keeps the Google service-account JSON off the client. - In the current bridge, the Google option is only actually ready when both the user's
GOOGLE_APPLICATION_CREDENTIALSsecret exists and a managed AI services connection is configured, because the framework mints the managed ai-services session before streaming begins.
Where the pieces live
| File | Purpose |
|---|---|
packages/core/src/client/composer/useVoiceDictation.ts |
Provider-routing hook (MediaRecorder / Web Speech) |
packages/core/src/client/composer/VoiceButton.tsx |
Mic button + live amplitude + cancel overlay |
packages/core/src/client/composer/TiptapComposer.tsx |
Wires the hook, insertion, and keyboard shortcut |
packages/core/src/client/settings/VoiceTranscriptionSection.tsx |
Live source + cleanup controls in sidebar settings |
packages/core/src/client/transcription/BuilderTranscriptionCta.tsx |
CTA shown when Builder account isn't connected |
packages/core/src/client/transcription/use-live-transcription.ts |
Web Speech live-transcription hook for recordings |
packages/core/src/server/transcribe-voice.ts |
Route handler (routes to Builder/Gemini/Groq/Whisper) |
packages/core/src/transcription/builder-transcription.ts |
Builder proxy transcription client |
packages/core/src/secrets/register-framework-secrets.ts |
Framework-level provider key registration |
Key resolution (server)
transcribe-voice.ts is batch-only. Do not add realtime streaming to this
route. Google Speech-to-Text realtime uses a dedicated audio-frame protocol:
client audio frames → /_agent-native/transcribe-stream/session →
ai-services WebSocket → Google gRPC StreamingRecognize → partial / final
transcript events. Use the canonical docs URL:
https://cloud.google.com/speech-to-text/v2/docs/streaming-recognize
Batch routing is based on the user's provider preference:
- If
builder-geminiandresolveHasBuilderPrivateKey()→ callstranscribeWithBuilder({ model: "gemini-3-1-flash-lite" })via Builder proxy, or uses Builder Gemini Flash-Lite to clean up a live native/browser transcript when the desktop client sends text instead of audio. - If
builderandresolveHasBuilderPrivateKey()→ legacy alias; preferbuilder-gemini. - If
gemini→ resolvesGEMINI_API_KEYand calls the direct Google Gemini path. - If
groq→ resolvesGROQ_API_KEYand calls Groq's Whisper-compatible endpoint. - If
openai→ resolvesOPENAI_API_KEY:readAppSecret({ key: "OPENAI_API_KEY", scope: "user", scopeId: session.email })— user's encrypted secret.resolveCredential("OPENAI_API_KEY")— env var + SQL settings fallback.
In auto mode / no preference, the route tries Builder Gemini Flash-Lite first
when Builder is connected, then Gemini BYOK, Groq, and OpenAI.
When a request includes instructions, pass them through to the selected LLM
provider. Gemini uses them in the transcription prompt, Builder receives them
as transcription/cleanup instructions, and Whisper-compatible providers receive
them as provider prompt/context.
Never hardcode a shared key. Never log the value. Never echo it back to the client.
Overriding per-template
Templates can:
- Disable the mic: pass
voiceEnabled={false}toTiptapComposer. - Replace the button: wrap
TiptapComposerand render your ownextraActionButton(the framework mic sits betweenextraActionButtonand the send button). - Pre-register provider keys as
required: true: callregisterRequiredSecret(...)from your own server plugin when a template needs a specific BYOK provider in onboarding.
Don'ts
- Don't call transcription providers from the client — go through
/_agent-native/transcribe-voiceso the user's secret stays server-side. - Don't remove the cancel affordance — mic permission abuse paranoia is real.
- Don't auto-submit the transcript — users always edit before sending.
- Don't copy Cursor's "hide send when empty" pattern — it confuses users.