spec-driven-dev - SKILL.md Agent Skill

name: spec-driven-dev description: Use when starting any non-trivial feature, refactor, or new project that will touch more than one file. Drives an AI coding agent through a gated Spec → Plan → Build → Test → Review → Ship lifecycle so work is specified before it is built, verified before it is reviewed, and reviewed before it ships.

Spec-Driven Development

A gated, six-phase workflow for shipping real software with an AI coding agent. Each phase has a goal, exit criteria, and a rationalization table for the excuses an agent will invent to skip it. Do not advance until the current gate is closed.

When to Use

Starting a new project, service, or significant feature
Requirements arrive as a paragraph, screenshot, or vague idea
The change touches more than one file or subsystem
You are about to make an architectural decision
An agent is about to write more than ~100 lines without a plan
Picking up half-finished work from another session

When NOT to Use

One-line typo or copy fix
Renaming a single local variable
Changes fully scoped by a one-line request where "done" is unambiguous
Pure dependency bumps with no behavior change
Throwaway spikes clearly marked as experiments and never merged

If unsure, do the spec phase anyway. A 10-minute spec is cheap insurance.

The Gated Workflow

  SPEC  ──▶  PLAN  ──▶  BUILD  ──▶  TEST  ──▶  REVIEW  ──▶  SHIP
   │          │          │          │          │           │
   ▼          ▼          ▼          ▼          ▼           ▼
 What &     Tasks &    Thin       Prove it   Five-axis   Deploy,
  why       order      slices     works      quality     observe,
                                   by test    gate       roll back
   │          │          │          │          │           │
   └──────────┴──────────┴──────────┴──────────┴───────────┘
            human or reviewer approves each gate

Rules of the gate:

You may loop backward (Test reveals a broken spec — return to Spec).
You may not skip forward. No building without a plan. No shipping without review.
Each gate produces a named artifact committed to the repo: SPEC.md, PLAN.md, code + tests, a review note / PR description, a launch + rollback note.

Phase 1: SPEC

Goal. Turn a vague request into a written contract that answers what, why, for whom, and how we know it is done. Surface assumptions before they become bugs.

Inputs. User request, existing codebase, any prior specs, constraints.

Steps.

Restate the request in one sentence. Confirm with the human.
List assumptions explicitly. Example: ASSUMPTIONS: web app, Postgres, session cookies. Correct me or I proceed.
Reframe vague requirements as measurable success criteria. "Make it fast" becomes "dashboard LCP < 2.5s on 4G, initial data < 500ms".
Write the spec using templates/spec.md and save it in the repo.
Define Always / Ask First / Never boundaries.
Log open questions. Do not guess answers.

Exit criteria.

Spec file saved in the repository
Human has read and approved it
Success criteria are specific and testable
Boundaries are explicit
No unanswered blocking questions

Common Rationalizations

Excuse	Reality
"It's obvious what to build"	If it were, two engineers would write the same code. They won't.
"I'll write the spec after the code works"	That's documentation, not specification. It can no longer change the design.
"The user told me what they want"	Users describe symptoms, not contracts. Extract the contract.
"Requirements will change anyway"	That's exactly why you write them down — so the change is visible.

Phase 2: PLAN

Goal. Convert the spec into an ordered list of small, verifiable tasks with an explicit dependency graph. No code is written in this phase.

Inputs. Approved spec, relevant source files, existing conventions.

Steps.

Enter read-only mode. Read the spec and the code it will touch.
Draw the dependency graph: what must exist before what.
Slice vertically. Thin end-to-end slices beat horizontal layers (schema → all APIs → all UI is the wrong shape).
Write tasks using templates/plan.md. Each task fits one focused session and touches ~5 files or fewer.
Insert verification checkpoints between task groups.
Call out risks and parallelization coordination (e.g. agree an API contract before splitting frontend/backend).

Task sizing.

Size	Files	Example
XS	1	Add a validation rule
S	1–2	One endpoint, one component
M	3–5	One vertical feature slice
L	5–8	Multi-component feature (split if possible)
XL	8+	Too big. Break it down.

Exit criteria.

Plan file saved in the repository
Every task has acceptance criteria and a verify step
Dependencies are ordered correctly
No task is XL
Checkpoints exist between phases
Human approved the plan

Common Rationalizations

Excuse	Reality
"I'll figure it out as I go"	That's how tangled diffs are born. Ten minutes planning saves hours untangling.
"Planning is overhead"	Planning is the work. Implementation without a plan is just typing.
"I can hold it in my head"	Context windows are finite. Compaction will eat it.

Phase 3: BUILD

Goal. Implement the plan in thin, compilable, revertible increments. Every increment leaves the tree green.

Inputs. Approved plan, the current task.

The increment cycle: implement → test → verify → commit → next slice

Rules.

Simplicity first. What is the simplest thing that could work? Three similar lines beat a premature abstraction. Do not generalize before the third use case demands it.
Scope discipline. Touch only what the task requires. Note unrelated issues; do not fix them. No "while I'm here" refactors.
One thing per commit. Feature, refactor, and config change are three commits, not one.
Always compilable. Build and existing tests must pass after every slice.
Feature flags for WIP. Merge behind a flag if users should not see it yet.
Rollback-friendly. Additive > destructive. Each increment revertable alone.

Exit criteria for each increment.

Does one thing, completely
Build passes
Lint and type checks pass
Existing tests still pass
Committed with a descriptive, imperative message

Common Rationalizations

Excuse	Reality
"Faster to do it all at once"	It feels faster until something breaks in 500 changed lines.
"Too small to commit separately"	Small commits are free. Large commits hide bugs.
"Let me just quickly add this too"	Scope creep wearing a disguise.
"This refactor is small enough to include"	Mixing refactor with feature makes both harder to review.

Phase 4: TEST

Goal. Prove the code works with tests that will survive refactoring. "Seems to work" is not done.

Inputs. Built code, spec's success criteria, bug reports (if any).

Steps.

New behavior: write the test before the code (RED → GREEN → REFACTOR).
Bug fixes: Prove-It pattern. Write a failing test that reproduces the bug. Only then fix it. The test becomes the regression guard.
Follow the pyramid: ~80% unit, ~15% integration, ~5% end-to-end.
Test state, not interactions. Assert on outcomes, not on which private methods were called.
Prefer real > fakes > stubs > mocks. Mock only at boundaries you cannot control (external APIs, clocks, randomness).
Name tests like a spec: it('sets completedAt when task is completed').
Browser work: verify in an actual browser (DOM, console, network, screenshots). Unit tests alone do not prove a page renders.

Exit criteria.

Every behavior in the spec has a test
Bug fixes include a reproduction test that failed before the fix
Full suite green: <test command>
No skipped or disabled tests
Coverage did not regress
No flaky tests introduced

Common Rationalizations

Excuse	Reality
"I'll write tests after it works"	You won't. And post-hoc tests test implementation, not behavior.
"I tested it manually"	Manual testing does not persist. Tomorrow's change will silently break it.
"Tests slow me down"	They slow you now; they speed you up every time you change this code.
"All tests pass" (unverified)	"Passes" without output is a claim, not a fact. Run them.

Phase 5: REVIEW

Goal. A five-axis quality gate. Nothing merges without it — including code you wrote yourself.

Inputs. Built, tested code. The spec. The plan.

The five axes.

Correctness — matches spec, handles edges and errors, tests test the right thing.
Readability & simplicity — clear names, clean flow, no cleverness, no dead code, abstractions earn their complexity.
Architecture — fits existing patterns, clean boundaries, correct dependency direction, right level of abstraction.
Security — input validated, secrets out of code, auth checked, queries parameterized, external data treated as untrusted.
Performance — no N+1, no unbounded loops, pagination on list endpoints, async where required.

Findings must carry severity. Label every comment so the author knows what is required vs optional:

Prefix	Meaning
Critical:	Blocks merge — security, data loss, correctness
(no prefix)	Required before merge
Consider: / Optional:	Suggestion
Nit:	Style or formatting preference
FYI:	Informational only

Approval standard. Approve when the change improves overall code health, even if not how you would have written it. Do not block on preference. Do not rubber-stamp either. Use templates/review.md as the checklist.

Exit criteria.

All five axes evaluated
Every Critical issue resolved
Every required issue resolved or explicitly deferred with a filed bug
Tests and build green on the final revision
Change description stands alone (first line is imperative and informative)

Common Rationalizations

Excuse	Reality
"It works, good enough"	Working + unreadable + insecure = debt that compounds.
"Tests pass, therefore good"	Tests do not catch architecture or security. Review does.
"AI-generated, probably fine"	AI output is confident and plausibly wrong. Scrutinize more, not less.
"We'll clean it up later"	Later never comes. Clean up before merge or file a tracked bug.

Phase 6: SHIP

Goal. Deploy safely. Every launch must be reversible, observable, and incremental.

Inputs. Reviewed, merged code. A rollback plan. A monitoring plan.

Steps.

Run the pre-launch checklist from templates/ship.md (code quality, security, performance, accessibility, infra, docs).
Put user-visible behavior behind a feature flag with an owner and expiry.
Stage the rollout: staging → prod (flag off) → team → 5% → 25% → 50% → 100%
Document rollback triggers before deploying: error rate > 2× baseline, p95 latency > 50% above baseline, new client JS errors > 0.1% of sessions, any data integrity or security issue.
Watch dashboards for the first hour: error rate, latency, business metric, logs flowing, health endpoint 200.
Clean up the feature flag within two weeks of full rollout.

Exit criteria.

Pre-launch checklist green
Rollback plan written and linked from the PR/launch note
Feature flag configured, with owner and expiration
Monitoring dashboards exist and show data
Team knows the deploy is happening
First-hour post-deploy verification completed

Common Rationalizations

Excuse	Reality
"Works in staging, will work in prod"	Prod has different data, traffic, and edge cases. Monitor after deploy.
"We don't need a flag for this"	Every feature benefits from a kill switch. Even simple ones break things.
"We'll add monitoring later"	You cannot debug what you cannot see. Add it before launch.
"Rolling back is admitting failure"	Rolling back is responsible engineering. Shipping broken is the failure.
"It's Friday afternoon, let's ship it"	No.

Red Flags: Signs the Agent Is Skipping Phases

Code written before any spec exists
PLAN.md tasks that say "implement the feature" and nothing else
Touching files outside the current task "while I'm here"
A single commit mixing schema, UI, and config changes
"Tests pass" claimed with no output pasted
More than 100 lines added with no test run in between
Review comments with no severity labels
A deploy with no rollback plan
A feature flag with no owner and no expiry

If you see any of these, return to the most recent unpassed gate and redo it.

Verification Checklist

Before declaring the workflow done:

Spec committed, approved, still accurate
Plan committed, every task checked off
Every task has tests proving its behavior
Five-axis review completed, all Critical resolved
Pre-launch checklist green
Rollback plan written and understood
Feature flag (if any) has an owner and a kill date
Monitoring live and showing traffic from new code
First-hour post-deploy verification done

Templates

templates/spec.md — write this in Phase 1
templates/plan.md — write this in Phase 2
templates/review.md — use this in Phase 5
templates/ship.md — use this in Phase 6

Related Skills

This workflow is intentionally coarse. Reach for focused skills for depth on individual phases (planning, TDD, code review, shipping). This skill is the spine; others are the organs.