name: spec-driven-dev description: Use when starting any non-trivial feature, refactor, or new project that will touch more than one file. Drives an AI coding agent through a gated Spec → Plan → Build → Test → Review → Ship lifecycle so work is specified before it is built, verified before it is reviewed, and reviewed before it ships.
Spec-Driven Development
A gated, six-phase workflow for shipping real software with an AI coding agent. Each phase has a goal, exit criteria, and a rationalization table for the excuses an agent will invent to skip it. Do not advance until the current gate is closed.
When to Use
- Starting a new project, service, or significant feature
- Requirements arrive as a paragraph, screenshot, or vague idea
- The change touches more than one file or subsystem
- You are about to make an architectural decision
- An agent is about to write more than ~100 lines without a plan
- Picking up half-finished work from another session
When NOT to Use
- One-line typo or copy fix
- Renaming a single local variable
- Changes fully scoped by a one-line request where "done" is unambiguous
- Pure dependency bumps with no behavior change
- Throwaway spikes clearly marked as experiments and never merged
If unsure, do the spec phase anyway. A 10-minute spec is cheap insurance.
The Gated Workflow
SPEC ──▶ PLAN ──▶ BUILD ──▶ TEST ──▶ REVIEW ──▶ SHIP
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
What & Tasks & Thin Prove it Five-axis Deploy,
why order slices works quality observe,
by test gate roll back
│ │ │ │ │ │
└──────────┴──────────┴──────────┴──────────┴───────────┘
human or reviewer approves each gate
Rules of the gate:
- You may loop backward (Test reveals a broken spec — return to Spec).
- You may not skip forward. No building without a plan. No shipping without review.
- Each gate produces a named artifact committed to the repo:
SPEC.md,PLAN.md, code + tests, a review note / PR description, a launch + rollback note.
Phase 1: SPEC
Goal. Turn a vague request into a written contract that answers what, why, for whom, and how we know it is done. Surface assumptions before they become bugs.
Inputs. User request, existing codebase, any prior specs, constraints.
Steps.
- Restate the request in one sentence. Confirm with the human.
- List assumptions explicitly. Example:
ASSUMPTIONS: web app, Postgres, session cookies. Correct me or I proceed. - Reframe vague requirements as measurable success criteria. "Make it fast" becomes "dashboard LCP < 2.5s on 4G, initial data < 500ms".
- Write the spec using
templates/spec.mdand save it in the repo. - Define Always / Ask First / Never boundaries.
- Log open questions. Do not guess answers.
Exit criteria.
- Spec file saved in the repository
- Human has read and approved it
- Success criteria are specific and testable
- Boundaries are explicit
- No unanswered blocking questions
Common Rationalizations
| Excuse | Reality |
|---|---|
| "It's obvious what to build" | If it were, two engineers would write the same code. They won't. |
| "I'll write the spec after the code works" | That's documentation, not specification. It can no longer change the design. |
| "The user told me what they want" | Users describe symptoms, not contracts. Extract the contract. |
| "Requirements will change anyway" | That's exactly why you write them down — so the change is visible. |
Phase 2: PLAN
Goal. Convert the spec into an ordered list of small, verifiable tasks with an explicit dependency graph. No code is written in this phase.
Inputs. Approved spec, relevant source files, existing conventions.
Steps.
- Enter read-only mode. Read the spec and the code it will touch.
- Draw the dependency graph: what must exist before what.
- Slice vertically. Thin end-to-end slices beat horizontal layers (schema → all APIs → all UI is the wrong shape).
- Write tasks using
templates/plan.md. Each task fits one focused session and touches ~5 files or fewer. - Insert verification checkpoints between task groups.
- Call out risks and parallelization coordination (e.g. agree an API contract before splitting frontend/backend).
Task sizing.
| Size | Files | Example |
|---|---|---|
| XS | 1 | Add a validation rule |
| S | 1–2 | One endpoint, one component |
| M | 3–5 | One vertical feature slice |
| L | 5–8 | Multi-component feature (split if possible) |
| XL | 8+ | Too big. Break it down. |
Exit criteria.
- Plan file saved in the repository
- Every task has acceptance criteria and a verify step
- Dependencies are ordered correctly
- No task is XL
- Checkpoints exist between phases
- Human approved the plan
Common Rationalizations
| Excuse | Reality |
|---|---|
| "I'll figure it out as I go" | That's how tangled diffs are born. Ten minutes planning saves hours untangling. |
| "Planning is overhead" | Planning is the work. Implementation without a plan is just typing. |
| "I can hold it in my head" | Context windows are finite. Compaction will eat it. |
Phase 3: BUILD
Goal. Implement the plan in thin, compilable, revertible increments. Every increment leaves the tree green.
Inputs. Approved plan, the current task.
The increment cycle: implement → test → verify → commit → next slice
Rules.
- Simplicity first. What is the simplest thing that could work? Three similar lines beat a premature abstraction. Do not generalize before the third use case demands it.
- Scope discipline. Touch only what the task requires. Note unrelated issues; do not fix them. No "while I'm here" refactors.
- One thing per commit. Feature, refactor, and config change are three commits, not one.
- Always compilable. Build and existing tests must pass after every slice.
- Feature flags for WIP. Merge behind a flag if users should not see it yet.
- Rollback-friendly. Additive > destructive. Each increment revertable alone.
Exit criteria for each increment.
- Does one thing, completely
- Build passes
- Lint and type checks pass
- Existing tests still pass
- Committed with a descriptive, imperative message
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Faster to do it all at once" | It feels faster until something breaks in 500 changed lines. |
| "Too small to commit separately" | Small commits are free. Large commits hide bugs. |
| "Let me just quickly add this too" | Scope creep wearing a disguise. |
| "This refactor is small enough to include" | Mixing refactor with feature makes both harder to review. |
Phase 4: TEST
Goal. Prove the code works with tests that will survive refactoring. "Seems to work" is not done.
Inputs. Built code, spec's success criteria, bug reports (if any).
Steps.
- New behavior: write the test before the code (RED → GREEN → REFACTOR).
- Bug fixes: Prove-It pattern. Write a failing test that reproduces the bug. Only then fix it. The test becomes the regression guard.
- Follow the pyramid: ~80% unit, ~15% integration, ~5% end-to-end.
- Test state, not interactions. Assert on outcomes, not on which private methods were called.
- Prefer real > fakes > stubs > mocks. Mock only at boundaries you cannot control (external APIs, clocks, randomness).
- Name tests like a spec:
it('sets completedAt when task is completed'). - Browser work: verify in an actual browser (DOM, console, network, screenshots). Unit tests alone do not prove a page renders.
Exit criteria.
- Every behavior in the spec has a test
- Bug fixes include a reproduction test that failed before the fix
- Full suite green:
<test command> - No skipped or disabled tests
- Coverage did not regress
- No flaky tests introduced
Common Rationalizations
| Excuse | Reality |
|---|---|
| "I'll write tests after it works" | You won't. And post-hoc tests test implementation, not behavior. |
| "I tested it manually" | Manual testing does not persist. Tomorrow's change will silently break it. |
| "Tests slow me down" | They slow you now; they speed you up every time you change this code. |
| "All tests pass" (unverified) | "Passes" without output is a claim, not a fact. Run them. |
Phase 5: REVIEW
Goal. A five-axis quality gate. Nothing merges without it — including code you wrote yourself.
Inputs. Built, tested code. The spec. The plan.
The five axes.
- Correctness — matches spec, handles edges and errors, tests test the right thing.
- Readability & simplicity — clear names, clean flow, no cleverness, no dead code, abstractions earn their complexity.
- Architecture — fits existing patterns, clean boundaries, correct dependency direction, right level of abstraction.
- Security — input validated, secrets out of code, auth checked, queries parameterized, external data treated as untrusted.
- Performance — no N+1, no unbounded loops, pagination on list endpoints, async where required.
Findings must carry severity. Label every comment so the author knows what is required vs optional:
| Prefix | Meaning |
|---|---|
| Critical: | Blocks merge — security, data loss, correctness |
| (no prefix) | Required before merge |
| Consider: / Optional: | Suggestion |
| Nit: | Style or formatting preference |
| FYI: | Informational only |
Approval standard. Approve when the change improves overall code health,
even if not how you would have written it. Do not block on preference. Do not
rubber-stamp either. Use templates/review.md as the checklist.
Exit criteria.
- All five axes evaluated
- Every Critical issue resolved
- Every required issue resolved or explicitly deferred with a filed bug
- Tests and build green on the final revision
- Change description stands alone (first line is imperative and informative)
Common Rationalizations
| Excuse | Reality |
|---|---|
| "It works, good enough" | Working + unreadable + insecure = debt that compounds. |
| "Tests pass, therefore good" | Tests do not catch architecture or security. Review does. |
| "AI-generated, probably fine" | AI output is confident and plausibly wrong. Scrutinize more, not less. |
| "We'll clean it up later" | Later never comes. Clean up before merge or file a tracked bug. |
Phase 6: SHIP
Goal. Deploy safely. Every launch must be reversible, observable, and incremental.
Inputs. Reviewed, merged code. A rollback plan. A monitoring plan.
Steps.
- Run the pre-launch checklist from
templates/ship.md(code quality, security, performance, accessibility, infra, docs). - Put user-visible behavior behind a feature flag with an owner and expiry.
- Stage the rollout:
staging → prod (flag off) → team → 5% → 25% → 50% → 100% - Document rollback triggers before deploying: error rate > 2× baseline, p95 latency > 50% above baseline, new client JS errors > 0.1% of sessions, any data integrity or security issue.
- Watch dashboards for the first hour: error rate, latency, business metric, logs flowing, health endpoint 200.
- Clean up the feature flag within two weeks of full rollout.
Exit criteria.
- Pre-launch checklist green
- Rollback plan written and linked from the PR/launch note
- Feature flag configured, with owner and expiration
- Monitoring dashboards exist and show data
- Team knows the deploy is happening
- First-hour post-deploy verification completed
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Works in staging, will work in prod" | Prod has different data, traffic, and edge cases. Monitor after deploy. |
| "We don't need a flag for this" | Every feature benefits from a kill switch. Even simple ones break things. |
| "We'll add monitoring later" | You cannot debug what you cannot see. Add it before launch. |
| "Rolling back is admitting failure" | Rolling back is responsible engineering. Shipping broken is the failure. |
| "It's Friday afternoon, let's ship it" | No. |
Red Flags: Signs the Agent Is Skipping Phases
- Code written before any spec exists
PLAN.mdtasks that say "implement the feature" and nothing else- Touching files outside the current task "while I'm here"
- A single commit mixing schema, UI, and config changes
- "Tests pass" claimed with no output pasted
- More than 100 lines added with no test run in between
- Review comments with no severity labels
- A deploy with no rollback plan
- A feature flag with no owner and no expiry
If you see any of these, return to the most recent unpassed gate and redo it.
Verification Checklist
Before declaring the workflow done:
- Spec committed, approved, still accurate
- Plan committed, every task checked off
- Every task has tests proving its behavior
- Five-axis review completed, all Critical resolved
- Pre-launch checklist green
- Rollback plan written and understood
- Feature flag (if any) has an owner and a kill date
- Monitoring live and showing traffic from new code
- First-hour post-deploy verification done
Templates
templates/spec.md— write this in Phase 1templates/plan.md— write this in Phase 2templates/review.md— use this in Phase 5templates/ship.md— use this in Phase 6
Related Skills
This workflow is intentionally coarse. Reach for focused skills for depth on individual phases (planning, TDD, code review, shipping). This skill is the spine; others are the organs.