name: tdd description: Use when implementing any feature or bugfix, before writing implementation code
Test-Driven Development (TDD)
Write the test first. Watch it fail. Write minimal code to pass.
If you didn't watch the test fail, you don't know if it tests the right thing.
When to Use
- New features, bug fixes, refactoring, behavior changes
Exceptions (confirm with user): throwaway prototypes, generated code, configuration files.
Red-Green-Refactor Cycle
RED — Write Failing Test
Write one minimal test for one behavior.
test('retries failed operations 3 times', async () => {
let attempts = 0;
const operation = () => {
attempts++;
if (attempts < 3) throw new Error('fail');
return 'success';
};
const result = await retryOperation(operation);
expect(result).toBe('success');
expect(attempts).toBe(3);
});
Requirements:
- One behavior per test
- Name describes the behavior
- Real code over mocks
Verify RED
Run the test. Confirm it fails (not errors) for the expected reason — the feature is missing, not a typo.
Test passes immediately? You're testing existing behavior. Fix the test. Test errors? Fix the error, re-run until it fails correctly.
GREEN — Minimal Code
Write the simplest code that passes.
async function retryOperation<T>(fn: () => Promise<T>): Promise<T> {
for (let i = 0; i < 3; i++) {
try {
return await fn();
} catch (e) {
if (i === 2) throw e;
}
}
throw new Error('unreachable');
}
No extra features, no "improvements" beyond the test.
Verify GREEN
Run the test. Confirm it passes with clean output (no errors or warnings). Confirm other tests still pass.
Test fails? Fix code, not test. Other tests break? Fix now.
REFACTOR
After green only: remove duplication, improve names, extract helpers.
Keep tests green. Add no new behavior. Then write the next failing test.
Good Tests
| Quality | Good | Bad |
|---|---|---|
| Minimal | One thing. "and" in name → split it. | test('validates email and domain and whitespace') |
| Clear | Name describes behavior | test('test1') |
| Shows intent | Demonstrates desired API | Obscures what code should do |
Example: Bug Fix
Bug: Empty email accepted
RED
test('rejects empty email', async () => {
const result = await submitForm({ email: '' });
expect(result.error).toBe('Email required');
});
→ Run: FAIL: expected 'Email required', got undefined ✓
GREEN
function submitForm(data: FormData) {
if (!data.email?.trim()) return { error: 'Email required' };
// ...
}
→ Run: PASS ✓
REFACTOR — Extract validation for multiple fields if needed.
The Test-First Rule
Write production code only after a failing test exists for it.
If code was written before its test: delete it and restart with TDD. Keeping pre-written code as "reference" leads to testing-after — you test what you built rather than what's required.
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Too simple to test" | Simple code breaks. The test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
| "Deleting X hours is wasteful" | Sunk cost. Unverified code is technical debt. |
| "Need to explore first" | Explore freely, then discard and start with TDD. |
| "Hard to test" | Hard to test = hard to use. Simplify the design. |
When Stuck
| Problem | Solution |
|---|---|
| Don't know how to test | Write the desired API first. Write the assertion first. Ask user. |
| Test too complicated | Design too complicated. Simplify the interface. |
| Must mock everything | Code too coupled. Use dependency injection. |
| Test setup huge | Extract helpers. Still complex → simplify design. |
Debugging Integration
Bug found → write a failing test reproducing it → follow TDD cycle. The test proves the fix and prevents regression.
Testing Anti-Patterns
When adding mocks or test utilities, read @testing-anti-patterns.md to avoid:
- Testing mock behavior instead of real behavior
- Adding test-only methods to production classes
- Mocking without understanding dependencies
Eval-Driven TDD for AI Code
When AI generates implementation code, the test suite doubles as an evaluation harness.
pass@k Metric
Measures probability that at least one of k generated samples passes all tests.
| Metric | Meaning |
|---|---|
| pass@1 | First attempt passes all tests |
| pass@5 | At least one of 5 attempts passes |
| pass@10 | At least one of 10 attempts passes |
Workflow:
- Write the test suite (RED phase) — this is your eval spec.
- Generate k candidate implementations (varying temperature / prompt).
- Run each candidate against the full suite. Record pass/fail per candidate.
- Select the passing candidate, then REFACTOR as usual.
Eval Harness Design
- Tests should be deterministic and fast so they can evaluate many candidates.
- Cover functional correctness, edge cases, and performance bounds.
- Separate behavioral tests (part of eval) from integration tests (not part of eval).
- Keep eval-critical tests tagged or in a dedicated suite for automated scoring.
# Run eval suite against a candidate
npm test -- --testPathPattern="eval/" --bail
# Score: count passing candidates out of k
When to Use Eval-Driven TDD
- Generating utility functions, algorithms, data transformers
- Comparing prompt strategies for the same spec
- Validating AI-assisted refactors against the existing suite
Coverage Integration
Threshold Configuration
{
"coverageThreshold": {
"global": { "branches": 80, "functions": 80, "lines": 80, "statements": 80 }
}
}
Quick Coverage Commands
npx vitest run --coverage # Vitest
npm test -- --coverage # Jest
pytest --cov --cov-report=term # pytest
Review coverage by risk priority: auth, money, mutations, uploads, error paths.
Verification Checklist
Before marking work complete:
- Every new function/method has a test
- Watched each test fail before implementing
- Each failure was for the expected reason
- Wrote minimal code to pass each test
- All tests pass with clean output
- Mocks used only when unavoidable
- Edge cases and errors covered
- Coverage meets project thresholds