name: testing
description: How to write good tests. Use when writing tests, improving test coverage, or evaluating test quality. Also invoked by other skills (BDD at RED phase, tdd-review at GREEN gate, refactor at PROTECT phase, and debug). Core test quality knowledge across all workflows.
user-invocable: false
allowed-tools: '*'
Writing Good Tests
Tests prove the system behaves correctly. Every test — unit, integration, E2E, eval — must verify observable behavior, not implementation details.
Core Principle: TEST BEHAVIOR, NOT IMPLEMENTATION
Philosophy: Behavior-Biased Testing
What this means: At every test level, assert on what the system does (outputs, side effects, user-visible outcomes) — never on how it does it (internal state, mock call counts, private methods).
Why: Tests coupled to implementation break on every refactor. Behavioral tests survive refactoring because behavior doesn't change — only the internals do.
Scope preference: When multiple test types can verify a behavior, prefer the highest scope that covers the behavior with acceptable feedback speed. Higher scope = more confidence that the real system works.
Scope order, from preferred (highest confidence) to fallback (lowest scope):
- E2E → proves the user can do the thing
- Integration → proves components work together
- Unit → proves the algorithm is correct
When to drop to a lower scope:
- Pure function with many edge cases (20+ combinations) → unit test
- Internal service boundary, no UI involved → integration test
- Algorithm with complex logic (parsing, math, state machines) → unit test
- Only one module's contract matters → integration test
When to stay at higher scope:
- User-facing feature or workflow → E2E
- Multiple modules must cooperate → integration or E2E
- "If this breaks, users notice immediately" → E2E
Announce your decision: "Test type: [unit/integration/E2E/eval] because [reason]."
For the full decision tree, bug detection matrix, and edge cases: .safeword/guides/testing-guide.md
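One way to read the bullets above is as an ordered checklist. The sketch below is a hypothetical encoding, not the decision tree from the guide: `chooseTestScope`, `announce`, the trait names, and the order in which the checks run are all assumptions made for illustration.

```javascript
// Hypothetical helper: encodes the "drop lower" and "stay higher" bullets
// as ordered checks. Trait names and check order are illustrative assumptions.
function chooseTestScope(traits) {
  if (traits.pureFunctionWithManyEdgeCases || traits.complexAlgorithm) {
    return { type: 'unit', reason: 'the logic has many input combinations' };
  }
  if (traits.internalServiceBoundary || traits.singleModuleContract) {
    return { type: 'integration', reason: 'a component contract matters and no UI is involved' };
  }
  if (traits.userFacingWorkflow || traits.usersNoticeImmediately) {
    return { type: 'E2E', reason: 'users would see the breakage directly' };
  }
  if (traits.multipleModulesCooperate) {
    // The text allows integration or E2E here; this sketch picks the faster one.
    return { type: 'integration', reason: 'several modules must cooperate' };
  }
  // With no reason to drop lower, stay at the highest scope.
  return { type: 'E2E', reason: 'no reason found to drop to a lower scope' };
}

// Formats the decision in the announcement shape used above.
function announce(decision) {
  return `Test type: ${decision.type} because ${decision.reason}.`;
}

console.log(announce(chooseTestScope({ complexAlgorithm: true })));
console.log(announce(chooseTestScope({ userFacingWorkflow: true })));
```

The function matters less than the habit it models: state which trait drove the choice, so a reviewer can challenge the reasoning.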
Iron Laws
Non-negotiable at every test level. Violating these produces tests that pass but catch nothing.
1. Test Behavior, Not Implementation
// WRONG — tests internal state
expect(component.state.count).toBe(1);
expect(mockFn).toHaveBeenCalledWith('internal-detail');
// RIGHT — tests observable behavior
expect(screen.getByText('Count: 1')).toBeVisible();
expect(result).toEqual({ total: 42 });
This applies at EVERY level:
- Unit: assert on return values, not on which helpers were called
- Integration: assert on API responses, not on which service methods fired
- E2E: assert on what the user sees, not on DOM structure
- Eval: grade the output quality, not the path the LLM took
2. Every Test Needs a Meaningful Assertion
If your assertion would pass for ANY input, it asserts nothing.
// WRONG — asserts nothing useful
expect(() => processData(input)).not.toThrow();
expect(result).toBeTruthy();
expect(result).toBeDefined();
// RIGHT — asserts specific behavior
expect(processData(input)).toEqual({ status: 'ok', count: 3 });
expect(result.errors).toHaveLength(0);
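A quick way to judge an assertion is to ask whether it would still pass against a broken implementation. The framework-free sketch below makes that concrete; `processData`, `processDataBroken`, and the two check functions are hypothetical stand-ins written for this example.

```javascript
// A correct implementation and a deliberately broken one (both hypothetical).
function processData(items) {
  return { status: 'ok', count: items.length };
}
function processDataBroken(items) {
  return { status: 'ok', count: 0 }; // bug: ignores its input
}

// Weak check: true for any truthy result, like toBeTruthy().
const weakCheck = (result) => Boolean(result);

// Specific check: pins down the exact expected output.
const specificCheck = (result) => result.status === 'ok' && result.count === 3;

const input = ['a', 'b', 'c'];
console.log(weakCheck(processData(input)));           // true
console.log(weakCheck(processDataBroken(input)));     // true  (bug slips through)
console.log(specificCheck(processData(input)));       // true
console.log(specificCheck(processDataBroken(input))); // false (bug caught)
```

The weak check cannot tell the two implementations apart, which is exactly what "asserts nothing" means in practice.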
3. Tests Must Fail First
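A new test for new behavior that passes immediately either asserts nothing or covers behavior that already works, so it adds no value. For new behavior: RED, then GREEN. Characterization tests on existing code are the exception: they are expected to pass on the first run, and one that fails against the behavior you believe is correct points to a bug (or to a wrong assumption about what the code does).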
4. One Test, One Behavior
If a test name has "and" in it, split it. Each test verifies ONE observable outcome.
// WRONG
it('validates input and saves to database', ...);
// RIGHT
it('rejects input missing required field', ...);
it('saves valid input to database', ...);
5. Tests Must Be Independent
No test depends on another test's side effects. Fresh state per test. Run in any order.
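A minimal sketch of fresh state per test, assuming a hypothetical `createCart` factory: each test builds its own cart, so the outcome does not change when the order does. In a real suite the same effect usually comes from creating state inside each test or in a per-test setup hook.

```javascript
// Hypothetical module under test: a factory that returns an isolated cart.
function createCart() {
  const items = [];
  return {
    add(item) { items.push(item); },
    count() { return items.length; },
  };
}

// Each test creates its own cart, so neither depends on the other.
function testStartsEmpty() {
  const cart = createCart();
  return cart.count() === 0;
}
function testCountsAddedItem() {
  const cart = createCart();
  cart.add('book');
  return cart.count() === 1;
}

// Running in either order gives the same results.
const forward = [testStartsEmpty(), testCountsAddedItem()];
const reverse = [testCountsAddedItem(), testStartsEmpty()];
console.log(forward.every(Boolean), reverse.every(Boolean)); // true true
```

If the cart were a single module-level object shared by both tests, `testStartsEmpty` would fail whenever it ran second.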
Anti-Patterns
The most common ways AI-generated tests go wrong. Watch for all of them.
| Pattern | Problem | Fix |
|---|---|---|
| Coverage theater | High line coverage, tests catch no bugs | Every test should fail if you break the behavior it guards |
| Mock everything | Tests only verify mock wiring, not real behavior | Use real dependencies where practical; mock only external services |
| Duplicate tests | 20 tests with same structure, different values | Use parameterized/table-driven tests: it.each(...) |
| Happy-path only | Misses edge cases where real bugs live | Always include: empty input, boundary values, error paths |
| Hardcoded magic values | Timestamps, IDs, paths break across environments | Use builders, relative values, or factories |
| Snapshot overuse | Large snapshots pass review without scrutiny | Prefer targeted assertions; snapshots only for large stable structures |
| Testing private methods | Couples tests to implementation | Test through the public API |
| Exact UI text matching | Breaks on copy changes | Use regex /submit/i or data-testid attributes |
| Bug-locking | Tests written against buggy code encode the bug | Write tests BEFORE implementation (TDD), or verify behavior is correct first |
| Scope defaulting | AI defaults to unit tests for everything | Ask "what's the highest scope with acceptable feedback speed?" first |
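The "Duplicate tests" and "Happy-path only" rows can be addressed together with a case table. The sketch below is framework-free so it runs on its own; `parseQuantity` and its rules (whole numbers from 1 to 99) are hypothetical. In Jest or Vitest, a table like this would typically be passed to `it.each`.

```javascript
// Hypothetical function under test: parses a quantity between 1 and 99.
function parseQuantity(text) {
  if (text.trim() === '') throw new Error('quantity is required');
  const n = Number(text);
  if (!Number.isInteger(n)) throw new Error('quantity must be a whole number');
  if (n < 1 || n > 99) throw new Error('quantity out of range');
  return n;
}

// One row per behavior: happy path, boundaries, and error paths.
const cases = [
  { input: '5', expected: 5 },                                 // happy path
  { input: '1', expected: 1 },                                 // lower boundary
  { input: '99', expected: 99 },                               // upper boundary
  { input: '', error: 'quantity is required' },                // empty input
  { input: '0', error: 'quantity out of range' },              // just below range
  { input: '100', error: 'quantity out of range' },            // just above range
  { input: '2.5', error: 'quantity must be a whole number' },  // wrong kind of number
];

// Returns true when the case behaves as the row says it should.
function runCase({ input, expected, error }) {
  try {
    const actual = parseQuantity(input);
    return error === undefined && actual === expected;
  } catch (e) {
    return e.message === error;
  }
}

const failures = cases.filter((c) => !runCase(c));
console.log(`${cases.length - failures.length}/${cases.length} cases passed`); // 7/7 cases passed
```

Adding an edge case is then a one-line change to the table rather than another copy of the test body.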
Behavioral Testing by Type
Unit Tests — Behavioral
Test the contract (inputs → outputs), not the internals.
// Behavioral: asserts on output
it('applies 20% discount for VIP users', () => {
  expect(calculateDiscount(100, { tier: 'VIP' })).toBe(80);
});
// Non-behavioral: asserts on internal call
it('calls applyRate with 0.2', () => {
  calculateDiscount(100, { tier: 'VIP' });
  expect(applyRate).toHaveBeenCalledWith(0.2);
});
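To see why the second style is fragile, the sketch below runs both kinds of check against two implementations with identical behavior. It is framework-free and all names (`makeDiscountV1`, `makeDiscountV2`, the `calls` array standing in for a mock) are hypothetical.

```javascript
// Version 1: delegates to an internal helper. `calls` records helper usage,
// standing in for a mock or spy.
function makeDiscountV1(calls) {
  const applyRate = (price, rate) => {
    calls.push(rate);
    return price * (1 - rate);
  };
  return (price, user) => (user.tier === 'VIP' ? applyRate(price, 0.2) : price);
}

// Version 2: same behavior after a refactor that inlines the helper.
function makeDiscountV2() {
  return (price, user) => (user.tier === 'VIP' ? price * 0.8 : price);
}

// Behavioral check: looks only at the returned price.
function behavioralCheck(discount) {
  return discount(100, { tier: 'VIP' }) === 80;
}

// Implementation check: looks at whether the helper was called with 0.2.
function implementationCheck(discount, calls) {
  discount(100, { tier: 'VIP' });
  return calls.includes(0.2);
}

const callsV1 = [];
const v1 = makeDiscountV1(callsV1);
const callsV2 = [];
const v2 = makeDiscountV2();

console.log(behavioralCheck(v1), behavioralCheck(v2));                           // true true
console.log(implementationCheck(v1, callsV1), implementationCheck(v2, callsV2)); // true false
```

The refactor changed nothing a caller can observe, yet the implementation check reports a failure. That is a false alarm the behavioral check avoids.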
Integration Tests — Behavioral
Test that components produce correct combined outcomes with real dependencies.
// Behavioral: asserts on combined outcome
it('returns user profile with computed permissions', async () => {
  const response = await api.get('/users/1/profile');
  expect(response.data.permissions).toContain('edit_posts');
});
// Non-behavioral: asserts on which services were called
it('calls UserService then PermissionService', async () => { ... });
E2E Tests — Behavioral
Test what the user can see and do. E2E tests are naturally behavioral — lean into this.
// Behavioral: user-visible outcome
test('new user lands on the dashboard after signing up', async ({ page }) => {
  await page.goto('/signup');
  await page.fill('[name="email"]', 'test@example.com');
  await page.fill('[name="password"]', 'secure123');
  await page.click('button:has-text("Sign Up")');
  await expect(page).toHaveURL('/dashboard');
  await expect(page.getByText('Welcome')).toBeVisible();
});
LLM Evals — Behavioral
Grade what the output achieves, not the path the model took. Use deterministic assertions first, LLM-as-judge second.
# Deterministic assertion (cheap, run every commit)
- type: javascript
  value: JSON.parse(output).intent === 'order_pizza'
# LLM-as-judge (for subjective quality, run on PR/schedule)
- type: llm-rubric
  value: |
    PASS: Correctly identifies pizza order, confirms size and type
    FAIL: Wrong intent, ignores key details, or generic response
Eval-specific principles:
- Grade outcomes, not paths — the LLM can take any route to the right answer
- Binary PASS/FAIL over scales — "3 vs 4" is meaningless; force clarity
- One dimension per scorer — don't bundle factuality + tone + completeness
- Deterministic checks first — regex, schema validation, required fields before LLM-as-judge
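The principles above can be sketched as a set of small scorers, each grading one dimension with a binary result. This is an illustrative sketch only: the scorer names and the `size` and `type` fields are assumptions, and a real eval harness would have its own scorer interface.

```javascript
// Safely parse model output; returns undefined when it is not valid JSON.
function tryParse(output) {
  try {
    return JSON.parse(output);
  } catch {
    return undefined;
  }
}

// One dimension per scorer, each returning 'PASS' or 'FAIL'.
function scoreValidJson(output) {
  return tryParse(output) !== undefined ? 'PASS' : 'FAIL';
}
function scoreIntent(output, expectedIntent) {
  const parsed = tryParse(output);
  return parsed != null && parsed.intent === expectedIntent ? 'PASS' : 'FAIL';
}
function scoreRequiredFields(output, fields) {
  const parsed = tryParse(output);
  if (parsed == null || typeof parsed !== 'object') return 'FAIL';
  return fields.every((field) => parsed[field] !== undefined) ? 'PASS' : 'FAIL';
}

// Cheap deterministic gate: only outputs that pass every scorer would be
// worth sending on to a slower LLM-as-judge step.
function runDeterministicScorers(output) {
  const scores = {
    validJson: scoreValidJson(output),
    intent: scoreIntent(output, 'order_pizza'),
    requiredFields: scoreRequiredFields(output, ['size', 'type']),
  };
  const passed = Object.values(scores).every((score) => score === 'PASS');
  return { scores, passed };
}

const good = '{"intent":"order_pizza","size":"large","type":"margherita"}';
const bad = '{"intent":"order_pizza"}';
console.log(runDeterministicScorers(good).passed); // true
console.log(runDeterministicScorers(bad).scores.requiredFields); // FAIL
```

Keeping the scorers separate means a failure report names the dimension that broke instead of a single blended score.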
Writing Approach
Match Existing Style
Before writing any test, find existing tests near the code under test. Match their imports, describe/it structure, helpers, and patterns. Don't introduce new conventions into an established test suite.
If no existing tests: use AAA pattern (Arrange-Act-Assert).
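For reference, a minimal AAA-shaped test might look like the sketch below. It is written without a test framework so it runs on its own; `formatPrice` is a hypothetical function under test, and in a real suite the assertion would use the runner's `expect`.

```javascript
// Hypothetical function under test: formats cents as a dollar string.
function formatPrice(cents) {
  return `$${(cents / 100).toFixed(2)}`;
}

function testFormatsCentsAsDollars() {
  // Arrange: set up the input.
  const cents = 1999;

  // Act: exercise the behavior under test once.
  const result = formatPrice(cents);

  // Assert: check the observable output.
  if (result !== '$19.99') {
    throw new Error(`expected "$19.99" but got "${result}"`);
  }
}

testFormatsCentsAsDollars();
```

Keeping the three phases visually separate makes it easy to spot a test that acts twice or asserts before acting.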
Design Before Writing
List planned tests before coding. For each test, name:
- What behavior it verifies (not what code it calls)
- What the key assertion is (not "it doesn't throw")
- Why this test matters (what bug would slip through without it?)
Aim for: happy path + edge cases + error cases + at least one test the implementation could plausibly get wrong.
One Test at a Time
Write one test → run it → verify it fails (or passes for characterization) → move to next. Never write all tests at once then run them.
Patterns
Test Data Builders
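// Builder supplies valid defaults; each test overrides only the fields it cares about.
function buildUser(overrides = {}) {
  return { id: 'test-1', name: 'Test User', tier: 'standard', ...overrides };
}
it('applies VIP discount', () => {
  const user = buildUser({ tier: 'VIP' });
  // Same calculateDiscount(price, user) contract as the unit test example
  expect(calculateDiscount(100, user)).toBe(80);
});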
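Builders also help with the "hardcoded magic values" anti-pattern when time is involved. The sketch below derives timestamps from a `now` value passed in by the test instead of a fixed date; `buildSession` and `isExpired` are hypothetical names used only for illustration.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Hypothetical builder: expiry is relative to `now`, never a hardcoded date.
function buildSession(overrides = {}, now = Date.now()) {
  return { id: 'session-1', userId: 'test-1', expiresAt: now + ONE_HOUR_MS, ...overrides };
}

// Hypothetical function under test: takes the current time as a parameter.
function isExpired(session, now = Date.now()) {
  return session.expiresAt <= now;
}

// The test controls "now", so the result is the same on any machine or date.
const now = 1_000_000;
const fresh = buildSession({}, now);
const stale = buildSession({ expiresAt: now - 1 }, now);

console.log(isExpired(fresh, now)); // false
console.log(isExpired(stale, now)); // true
```

Passing the clock in as a parameter is one option; a test runner's fake timers can serve the same purpose.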
Async Testing — Never Use Arbitrary Timeouts
// WRONG
await sleep(3000);
await page.waitForTimeout(500);
// RIGHT — wait for condition
await expect.poll(() => getStatus()).toBe('ready');
await waitFor(() => expect(element).toBeVisible());
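Under the hood, "wait for condition" helpers poll until a check succeeds or a deadline passes. The sketch below shows the idea; `waitForCondition` and `startJob` are hypothetical, and in a real suite the test framework's built-in polling helpers are the better choice.

```javascript
// Hypothetical helper: resolves as soon as check() is truthy,
// rejects if the deadline passes first.
async function waitForCondition(check, { timeoutMs = 1000, intervalMs = 10 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Simulated async job whose completion time the test should not hardcode.
function startJob(delayMs) {
  const job = { status: 'pending' };
  setTimeout(() => { job.status = 'ready'; }, delayMs);
  return job;
}

async function demo() {
  const job = startJob(30);
  // Returns shortly after the job is ready instead of sleeping a fixed time.
  await waitForCondition(() => job.status === 'ready');
  return job.status;
}

demo().then((status) => console.log(status)); // ready
```

Unlike a fixed sleep, the wait ends as soon as the condition holds, and a slow run fails with a clear timeout error instead of a flaky assertion.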
Descriptive Test Names
// WRONG
it('works correctly');
it('should handle edge case');
// RIGHT — describes the behavior
it('returns 401 when API key is missing');
it('preserves user input after validation error');
Quick Reference
| Need | Action |
|---|---|
| Full test type selection guide | .safeword/guides/testing-guide.md |
| Test definition template (BDD) | .safeword/templates/test-definitions-feature.md |
| Test quality review | /audit |
| Feature-level TDD with scenarios | /bdd |
| Debugging failing tests | /debug |