name: voice-first-planning
description: Use when starting a new feature, writing a spec, or brainstorming architecture and you want to capture richer intent than typing allows. Speak your thoughts into a transcription tool, then feed the raw transcript to Claude Code for structuring.
Voice-First Planning
Overview
Use speech-to-text tools to dictate your initial specs, feature ideas, and architectural thoughts instead of typing them. Speaking naturally produces longer, more context-rich input because people self-edit heavily when typing but ramble freely when talking. That rambling is gold — LLMs are excellent at extracting structured meaning from unstructured speech.
Core principle: Speaking freely captures intent that is hard to express in typed text. Don't self-edit — ramble and let the LLM find the structure.
Dependency: A speech-to-text tool (Wispr Flow, macOS Dictation, or similar).
When to Use
- At the start of a new feature when you have a rough idea but haven't formalized it
- During brainstorming when you want to explore multiple approaches quickly
- When writing specs, PRDs, or design documents from scratch
- When you catch yourself staring at a blank prompt, unsure how to phrase what you want
- When explaining a bug or problem you understand intuitively but struggle to articulate in text
When NOT to Use
- During active implementation (typing precise code instructions is faster)
- For short, well-defined commands ("fix the typo on line 42")
- When you are in a shared/quiet workspace without a private space to speak
- For editing or refining an already-written spec (just edit the text directly)
Common Mistakes
| Mistake | Why it's wrong |
|---|---|
| Editing yourself while speaking | The whole point is to capture raw, unfiltered intent. Self-editing while speaking defeats the purpose — you lose the same context you lose when typing. Just talk. |
| Skipping the transcription review | Speech-to-text makes errors. Quickly scan the transcript for mangled names, technical terms, or homophones before pasting it in. A 10-second scan prevents confused output. |
| Using voice during implementation | Voice shines during planning and ideation. Once you are writing code, typed instructions are more precise. Don't force voice where typing is better. |
| Pasting the transcript without a framing prompt | Claude Code needs to know what to do with the wall of text. Always prepend a short instruction like "Structure this into a feature spec" or "Extract the requirements from this transcript." |
| Speaking in short, clipped sentences | You are not typing. Speak in full, natural paragraphs. Explain the why, the context, the constraints, the edge cases. Longer is better — the LLM will compress it. |
The Workflow
Step 1: Set up a speech-to-text tool
Pick one and install it:
| Tool | Platform | Notes |
|---|---|---|
| Wispr Flow | macOS, Windows, iOS, Android | AI-powered voice-to-text with auto-editing. The vendor advertises it as 4x faster than typing. Recommended. |
| macOS Dictation | macOS | Built-in. By default, press Fn twice (or the Globe key twice) in any text field. Good enough for most use cases. |
| Superwhisper | macOS | Polished Whisper app with hotkey activation. |
| Windows Voice Typing | Windows | Press Win+H. Built-in, decent quality. |
| Google Docs Voice Typing | Browser | Tools > Voice typing. Works well, requires Chrome. |
Any tool that converts speech to editable text works. The key requirement is that you can paste the output into a terminal or editor.
Step 2: Speak your idea freely
Open your transcription tool and start talking. Cover:
- What you want to build and why it matters
- Who will use it and what their workflow looks like
- The constraints — what you cannot change, what is already decided
- Edge cases you are already thinking about
- Anything you are unsure about — name the uncertainty out loud
Do not organize your thoughts first. Do not outline. Just talk. Aim for 1-3 minutes of continuous speech. This typically produces 200-500 words of transcript, which is far more context than most people would type.
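As a rough check against the 200-500 word guideline above, a transcript's length can be measured before moving on. The following is a minimal sketch; the function name, constants, and verdict strings are hypothetical and not part of any tool mentioned here.

```python
# Hypothetical helper: compares a raw transcript against the 200-500 word
# range mentioned in Step 2. Names and messages are illustrative only.

TARGET_MIN_WORDS = 200  # lower bound taken from the guideline above
TARGET_MAX_WORDS = 500  # upper bound taken from the guideline above


def check_transcript_length(transcript: str) -> tuple[int, str]:
    """Return the whitespace-separated word count and a short verdict."""
    word_count = len(transcript.split())
    if word_count < TARGET_MIN_WORDS:
        verdict = "short: consider talking through more context"
    elif word_count > TARGET_MAX_WORDS:
        verdict = "long: fine, the LLM will compress it"
    else:
        verdict = "within the typical range"
    return word_count, verdict


if __name__ == "__main__":
    sample = "we need a way for users to export reports " * 30
    print(check_transcript_length(sample))
```

A short result is only a nudge to keep talking, not a failure. The range describes typical output, not a requirement.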
Step 3: Quick-scan the transcript for errors
Speech-to-text tools mangle technical terms. Before pasting, scan for:
- Variable/function names: "getUserById" might become "get user by ID" or worse
- Library names: "Playwright" might become "play right"
- Homophones: "route" vs "root", "cache" vs "cash"
- Jargon: domain-specific terms the speech-to-text model may not have in its vocabulary
Fix only the obviously wrong terms. Do not rewrite — the raw, spoken style is the point.
Tip: Use tmux with Vim keybindings to jump through the transcript and fix transcription errors before pasting. Vim's word motions (w, b) and change-word command (cw) make surgical fixes fast.
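If the same terms get mangled repeatedly, part of the Step 3 scan can be scripted. The sketch below assumes a small hand-maintained glossary; `GLOSSARY`, `HOMOPHONES`, and both function names are hypothetical, and the entries simply mirror the examples listed in this step. Homophones are only flagged, not replaced, because choosing between "route" and "root" needs a human.

```python
import re

# Hypothetical glossary: spoken form -> intended technical term.
# Entries mirror the examples in Step 3; extend with your project's names.
GLOSSARY = {
    "get user by id": "getUserById",
    "play right": "Playwright",
}

# Homophones from Step 3. These need a human decision, so they are only flagged.
HOMOPHONES = {"route", "root", "cache", "cash"}


def fix_known_terms(transcript: str) -> str:
    """Replace known mis-transcriptions (case-insensitive); leave the rest untouched."""
    for spoken, intended in GLOSSARY.items():
        pattern = re.compile(r"\b" + re.escape(spoken) + r"\b", re.IGNORECASE)
        # A function replacement avoids any backslash interpretation in `intended`.
        transcript = pattern.sub(lambda _match, term=intended: term, transcript)
    return transcript


def flag_homophones(transcript: str) -> list[str]:
    """Return homophones present in the transcript, in order of first appearance."""
    found: list[str] = []
    for word in re.findall(r"[A-Za-z]+", transcript):
        lowered = word.lower()
        if lowered in HOMOPHONES and lowered not in found:
            found.append(lowered)
    return found


if __name__ == "__main__":
    raw = "call Get User By ID from the play right test and check the cash"
    fixed = fix_known_terms(raw)
    print(fixed)
    print("Check these words by hand:", flag_homophones(fixed))
```

This deliberately touches only exact glossary matches, in keeping with the rule to fix obvious errors and leave the spoken style alone.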
Step 4: Paste into Claude Code with a framing prompt
Prepend a one-line instruction that tells Claude Code what to produce:
Structure this spoken transcript into a feature spec with requirements,
constraints, and open questions:
[paste transcript here]
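For repeated use, the framing step can be captured in a small helper so the instruction is never forgotten. This is a sketch under assumptions: `frame_transcript` and `DEFAULT_INSTRUCTION` are hypothetical names, and the code only builds the text to paste. It does not call Claude Code.

```python
# The default instruction is quoted from Step 4; the function and
# parameter names are hypothetical.
DEFAULT_INSTRUCTION = (
    "Structure this spoken transcript into a feature spec with requirements, "
    "constraints, and open questions:"
)


def frame_transcript(transcript: str, instruction: str = DEFAULT_INSTRUCTION) -> str:
    """Prepend a one-line instruction to the raw transcript, separated by a blank line."""
    cleaned = transcript.strip()
    if not cleaned:
        raise ValueError("transcript is empty; nothing to frame")
    return f"{instruction.strip()}\n\n{cleaned}"


if __name__ == "__main__":
    print(frame_transcript("so the idea is we add a CSV export to the reports page"))
```

Passing a different `instruction` swaps in any other framing prompt without changing the rest of the flow.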
Other useful framing prompts:
- "Extract the architectural decisions from this transcript and list the trade-offs for each."
- "Turn this into a task breakdown with estimated complexity for each item."
- "Identify the requirements, assumptions, and risks in this transcript."
- "Summarize the core idea in 2 sentences, then list all the details I mentioned."
Step 5: Iterate on the structured output
Claude Code will return a well-organized version of your rambling thoughts. Now you can:
- Remove sections that were tangents
- Add precision where the spoken version was vague
- Ask follow-up questions about parts the LLM flagged as ambiguous
- Feed the structured spec back into your planning workflow
This is where the real leverage appears: you went from a blank prompt to a structured spec in under 5 minutes, and it contains context you never would have typed.
Quick Reference
| Item | Details |
|---|---|
| Recommended tool (macOS) | Wispr Flow or macOS Dictation (Fn Fn) |
| Recommended tool (Windows) | Windows Voice Typing (Win+H) |
| Ideal speaking length | 1-3 minutes (~200-500 words of transcript) |
| Best stage to use | Planning, ideation, spec writing |
| Worst stage to use | Active implementation, precise code edits |
| Transcript editing | tmux + Vim mode for quick surgical fixes |
| Key framing prompt | "Structure this spoken transcript into a feature spec with requirements, constraints, and open questions:" |
Key Principles
- Typing makes you self-edit. Speaking does not. When you type, you delete and rephrase to sound "right." When you speak, you explain what you actually mean, including the messy context that matters most.
- LLMs are excellent at extracting structure from rambling speech. You do not need to organize your thoughts before speaking. That is the LLM's job. Your job is to provide maximum raw signal.
- Voice is for planning, not implementation. This technique is most effective in the ideation and spec-writing phase. Once you are deep in code, typed instructions give you the precision you need.
- Fix transcription errors, but do not rewrite. A quick scan for mangled technical terms is essential. Rewriting the transcript into polished prose is counterproductive — you are just re-introducing the self-editing problem.
- Always frame the paste. A raw transcript dump without a prompt like "structure this into a spec" forces the LLM to guess what you want. One sentence of framing turns a wall of text into useful output.
Attribution
Based on Josh's technique from the Coding Agents: AI Driven Dev Conference. Josh uses Wispr Flow to speak initial specs, noting that speaking gives more context than typing because people naturally self-edit when they type. He also recommends tmux with Vim mode for quickly fixing transcription errors before feeding the text to an LLM.