testing - SKILL.md Agent Skill

name: testing
description: How to write good tests. Use when writing tests, improving test coverage, or evaluating test quality. Also invoked by other skills (BDD at RED phase, tdd-review at GREEN gate, refactor at PROTECT phase, and debug). Core test quality knowledge across all workflows.
user-invocable: false
allowed-tools: '*'

Writing Good Tests

Tests prove the system behaves correctly. Every test — unit, integration, E2E, eval — must verify observable behavior, not implementation details.

Core Principle: TEST BEHAVIOR, NOT IMPLEMENTATION

Philosophy: Behavior-Biased Testing

What this means: At every test level, assert on what the system does (outputs, side effects, user-visible outcomes) — never on how it does it (internal state, mock call counts, private methods).

Why: Tests coupled to implementation break on every refactor. Behavioral tests survive refactoring because behavior doesn't change — only the internals do.
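
The refactoring claim can be made concrete with a minimal sketch. Everything named here (`totalV1`, `totalV2`, `addLine`, and the two check functions) is hypothetical and invented for illustration. The checks are plain functions rather than test-framework calls so the snippet runs on its own.

```typescript
// Hypothetical example: one behavior, two internal designs.
let helperCalls = 0;

// Internal helper used only by the first version.
function addLine(total: number, price: number): number {
  helperCalls += 1;
  return total + price;
}

// Version 1: loops and delegates to the helper.
function totalV1(prices: number[]): number {
  let total = 0;
  for (const price of prices) {
    total = addLine(total, price);
  }
  return total;
}

// Version 2: refactored to reduce(); same behavior, no helper.
function totalV2(prices: number[]): number {
  return prices.reduce((sum, price) => sum + price, 0);
}

type TotalFn = (prices: number[]) => number;

// Behavioral check: looks only at the returned value.
function behavioralCheck(total: TotalFn): boolean {
  return total([10, 20, 12]) === 42;
}

// Implementation-coupled check: counts calls to the internal helper.
function implementationCheck(total: TotalFn): boolean {
  helperCalls = 0;
  total([10, 20, 12]);
  return helperCalls === 3;
}

console.log(behavioralCheck(totalV1), behavioralCheck(totalV2)); // true true
console.log(implementationCheck(totalV1), implementationCheck(totalV2)); // true false
```

The behavioral check passes for both versions. The implementation-coupled check fails after the refactor even though nothing observable changed.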

Scope preference: When multiple test types can verify a behavior, prefer the highest scope that covers the behavior with acceptable feedback speed. Higher scope = more confidence that the real system works.

Ordered from highest confidence (prefer) to lowest scope (fallback):
  E2E         → proves the user can do the thing
  Integration → proves components work together
  Unit        → proves the algorithm is correct

When to drop to a lower scope:

Pure function with many edge cases (20+ combinations) → unit test
Internal service boundary, no UI involved → integration test
Algorithm with complex logic (parsing, math, state machines) → unit test
Only one module's contract matters → integration test

When to stay at higher scope:

User-facing feature or workflow → E2E
Multiple modules must cooperate → integration or E2E
"If this breaks, users notice immediately" → E2E

Announce your decision: "Test type: [unit/integration/E2E/eval] because [reason]."
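
One way to picture the rules above is as a small decision function that produces the announcement string. This is only a sketch: the trait names are hypothetical, and the source does not define precedence between the rules. The assumptions made here are that reasons to stay at a higher scope are checked first, that "integration or E2E" resolves to integration, and that the default is E2E.

```typescript
type TestType = 'unit' | 'integration' | 'E2E';

// Hypothetical flags, one per rule in the two lists above.
interface BehaviorTraits {
  userFacingWorkflow?: boolean;
  usersNoticeImmediately?: boolean;
  multipleModulesCooperate?: boolean;
  internalServiceBoundary?: boolean; // no UI involved
  singleModuleContract?: boolean;
  manyEdgeCasePureFunction?: boolean; // 20+ combinations
  complexAlgorithm?: boolean; // parsing, math, state machines
}

function pickTestType(t: BehaviorTraits): { type: TestType; reason: string } {
  // Assumption: reasons to stay at higher scope win over reasons to drop lower.
  if (t.userFacingWorkflow) return { type: 'E2E', reason: 'it is a user-facing workflow' };
  if (t.usersNoticeImmediately) return { type: 'E2E', reason: 'users notice immediately if it breaks' };
  // Assumption: "integration or E2E" is resolved to integration here.
  if (t.multipleModulesCooperate) return { type: 'integration', reason: 'multiple modules must cooperate' };
  if (t.internalServiceBoundary) return { type: 'integration', reason: 'it is an internal service boundary with no UI' };
  if (t.singleModuleContract) return { type: 'integration', reason: "only one module's contract matters" };
  if (t.manyEdgeCasePureFunction) return { type: 'unit', reason: 'it is a pure function with many edge cases' };
  if (t.complexAlgorithm) return { type: 'unit', reason: 'the algorithm has complex logic' };
  // Default: prefer the highest scope when nothing argues for a lower one.
  return { type: 'E2E', reason: 'nothing justifies dropping to a lower scope' };
}

function announce(t: BehaviorTraits): string {
  const { type, reason } = pickTestType(t);
  return `Test type: ${type} because ${reason}.`;
}

console.log(announce({ complexAlgorithm: true }));
// Test type: unit because the algorithm has complex logic.
```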

For the full decision tree, bug detection matrix, and edge cases, see `.safeword/guides/testing-guide.md`.

Iron Laws

Non-negotiable at every test level. Violating these produces tests that pass but catch nothing.

1. Test Behavior, Not Implementation

// WRONG — tests internal state
expect(component.state.count).toBe(1);
expect(mockFn).toHaveBeenCalledWith('internal-detail');

// RIGHT — tests observable behavior
expect(screen.getByText('Count: 1')).toBeVisible();
expect(result).toEqual({ total: 42 });

This applies at EVERY level:

Unit: assert on return values, not on which helpers were called
Integration: assert on API responses, not on which service methods fired
E2E: assert on what the user sees, not on DOM structure
Eval: grade the output quality, not the path the LLM took

2. Every Test Needs a Meaningful Assertion

If an assertion would pass no matter what the code returns, it asserts nothing.

// WRONG — asserts nothing useful
expect(() => processData(input)).not.toThrow();
expect(result).toBeTruthy();
expect(result).toBeDefined();

// RIGHT — asserts specific behavior
expect(processData(input)).toEqual({ status: 'ok', count: 3 });
expect(result.errors).toHaveLength(0);

3. Tests Must Fail First

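A new test that passes immediately either asserts nothing or covers behavior that already works, so it adds no new protection. For new behavior: RED, then GREEN. For existing code, a characterization test is expected to pass on the first run; if it fails, either the expectation is wrong or you found a bug, so investigate before changing the test.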
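
A minimal sketch of why the RED step matters: run the same check against a working implementation and a deliberately broken one. The names `withDiscount`, `withoutDiscount`, `weakCheck`, and `strongCheck` are hypothetical, and the checks are plain functions so the snippet has no test-framework dependency.

```typescript
type DiscountFn = (price: number, tier: string) => number;

// Hypothetical implementation under test: 20% off for VIP users.
const withDiscount: DiscountFn = (price, tier) =>
  tier === 'VIP' ? (price * 80) / 100 : price;

// Stand-in for the RED state: the discount logic does not exist yet.
const withoutDiscount: DiscountFn = (price) => price;

// Weak check: passes for any numeric result, so it can never go RED.
function weakCheck(fn: DiscountFn): boolean {
  return typeof fn(100, 'VIP') === 'number';
}

// Strong check: pins the specific behavior, so it fails until the behavior exists.
function strongCheck(fn: DiscountFn): boolean {
  return fn(100, 'VIP') === 80;
}

console.log(weakCheck(withoutDiscount), weakCheck(withDiscount)); // true true
console.log(strongCheck(withoutDiscount), strongCheck(withDiscount)); // false true
```

Only the strong check moves from failing to passing when the behavior is implemented, which is the signal that it guards something.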

4. One Test, One Behavior

If a test name has "and" in it, split it. Each test verifies ONE observable outcome.

// WRONG
it('validates input and saves to database', ...);

// RIGHT
it('rejects input missing required field', ...);
it('saves valid input to database', ...);

5. Tests Must Be Independent

No test depends on another test's side effects. Fresh state per test. Run in any order.
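
As an illustration, the sketch below builds fresh state inside each check and shows that the results are the same in either order. `Cart`, `Check`, and `runAll` are hypothetical names. In Jest or Vitest the same idea is usually expressed by creating state inside each test or in a `beforeEach` hook.

```typescript
// Hypothetical unit under test: a tiny in-memory cart.
class Cart {
  private items: string[] = [];

  add(item: string): void {
    this.items.push(item);
  }

  count(): number {
    return this.items.length;
  }
}

interface Check {
  name: string;
  run: () => boolean;
}

// Each check builds its own Cart, so no state leaks between checks.
const checks: Check[] = [
  { name: 'new cart is empty', run: () => new Cart().count() === 0 },
  {
    name: 'adding an item makes the count 1',
    run: () => {
      const cart = new Cart();
      cart.add('apple');
      return cart.count() === 1;
    },
  },
];

function runAll(list: Check[]): boolean[] {
  return list.map((check) => check.run());
}

// Same results regardless of execution order.
const forward = runAll(checks);
const reversed = runAll([...checks].reverse());
console.log(forward.every(Boolean), reversed.every(Boolean)); // true true
```

If both checks shared one module-level `Cart`, the "new cart is empty" check would fail whenever it ran second.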

Anti-Patterns

The most common ways AI-generated tests go wrong. Watch for all of them.

Pattern	Problem	Fix
Coverage theater	High line coverage, tests catch no bugs	Every test should fail if you break the behavior it guards
Mock everything	Tests only verify mock wiring, not real behavior	Use real dependencies where practical; mock only external services
Duplicate tests	20 tests with same structure, different values	Use parameterized/table-driven tests: `it.each(...)`
Happy-path only	Misses edge cases where real bugs live	Always include: empty input, boundary values, error paths
Hardcoded magic values	Timestamps, IDs, paths break across environments	Use builders, relative values, or factories
Snapshot overuse	Large snapshots pass review without scrutiny	Prefer targeted assertions; snapshots only for large stable structures
Testing private methods	Couples tests to implementation	Test through the public API
Exact UI text matching	Breaks on copy changes	Use regex `/submit/i` or data-testid attributes
Bug-locking	Tests written against buggy code encode the bug	Write tests BEFORE implementation (TDD), or verify behavior is correct first
Scope defaulting	AI defaults to unit tests for everything	Ask "what's the highest scope with acceptable feedback speed?" first
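
The "Duplicate tests" and "Happy-path only" rows can be addressed together with one table of cases. The sketch below is framework-free so it runs on its own; `parseQuantity`, its 1 to 99 range, and the helper names are hypothetical. In Jest or Vitest the same table would be passed to `it.each(...)`.

```typescript
// Hypothetical unit under test: parses a quantity between 1 and 99.
function parseQuantity(raw: string): number {
  const trimmed = raw.trim();
  const value = Number(trimmed);
  if (trimmed === '' || !Number.isInteger(value) || value < 1 || value > 99) {
    throw new RangeError(`invalid quantity: "${raw}"`);
  }
  return value;
}

interface Case {
  name: string;
  input: string;
  expected: number | 'error';
}

// One table covers the happy path, boundaries, empty input, and error paths.
const cases: Case[] = [
  { name: 'typical value', input: '5', expected: 5 },
  { name: 'lower boundary', input: '1', expected: 1 },
  { name: 'upper boundary', input: '99', expected: 99 },
  { name: 'below range', input: '0', expected: 'error' },
  { name: 'above range', input: '100', expected: 'error' },
  { name: 'empty input', input: '', expected: 'error' },
  { name: 'non-numeric input', input: 'abc', expected: 'error' },
];

// Normalizes a thrown error into the 'error' marker used by the table.
function outcome(input: string): number | 'error' {
  try {
    return parseQuantity(input);
  } catch {
    return 'error';
  }
}

// Returns the names of the cases whose actual outcome differs from the table.
function failingCases(table: Case[]): string[] {
  return table.filter((c) => outcome(c.input) !== c.expected).map((c) => c.name);
}

console.log(failingCases(cases)); // []
```

Adding an edge case is one new row rather than another copy of the test body.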

Behavioral Testing by Type

Unit Tests — Behavioral

Test the contract (inputs → outputs), not the internals.

// Behavioral: asserts on output
it('applies 20% discount for VIP users', () => {
  expect(calculateDiscount(100, { tier: 'VIP' })).toBe(80);
});

// Non-behavioral: asserts on internal call
it('calls applyRate with 0.2', () => {
  calculateDiscount(100, { tier: 'VIP' });
  expect(applyRate).toHaveBeenCalledWith(0.2);
});

Integration Tests — Behavioral

Test that components produce correct combined outcomes with real dependencies.

// Behavioral: asserts on combined outcome
it('returns user profile with computed permissions', async () => {
  const response = await api.get('/users/1/profile');
  expect(response.data.permissions).toContain('edit_posts');
});

// Non-behavioral: asserts on which services were called
it('calls UserService then PermissionService', async () => { ... });

E2E Tests — Behavioral

Test what the user can see and do. E2E tests are naturally behavioral — lean into this.

// Behavioral: user-visible outcome
test('new user reaches the dashboard after signing up', async ({ page }) => {
  await page.goto('/signup');
  await page.fill('[name="email"]', 'test@example.com');
  await page.fill('[name="password"]', 'secure123');
  await page.getByRole('button', { name: /sign up/i }).click();
  await expect(page).toHaveURL('/dashboard');
  await expect(page.getByText('Welcome')).toBeVisible();
});

LLM Evals — Behavioral

Grade what the output achieves, not the path the model took. Use deterministic assertions first, LLM-as-judge second.

# Deterministic assertion (cheap, run every commit)
- type: javascript
  value: JSON.parse(output).intent === 'order_pizza'

# LLM-as-judge (for subjective quality, run on PR/schedule)
- type: llm-rubric
  value: |
    PASS: Correctly identifies pizza order, confirms size and type
    FAIL: Wrong intent, ignores key details, or generic response

Eval-specific principles:

Grade outcomes, not paths — the LLM can take any route to the right answer
Binary PASS/FAIL over scales — "3 vs 4" is meaningless; force clarity
One dimension per scorer — don't bundle factuality + tone + completeness
Deterministic checks first — regex, schema validation, required fields before LLM-as-judge
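
A rough sketch of how these principles might compose in a custom grader: cheap deterministic scorers run first, each covers one dimension, each returns a binary verdict, and the judge is consulted only if all of them pass. The field names `size` and `type`, the scorer names, and the synchronous `judge` callback are assumptions made for illustration; a real LLM-as-judge call would be asynchronous.

```typescript
type Verdict = 'PASS' | 'FAIL';
type Scorer = (output: string) => Verdict;

// Parses the output as a JSON object, or returns null if that is not possible.
function parseObject(output: string): Record<string, unknown> | null {
  try {
    const parsed: unknown = JSON.parse(output);
    return typeof parsed === 'object' && parsed !== null
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null;
  }
}

// Dimension 1: schema. Required fields are present and non-empty strings.
const hasRequiredFields: Scorer = (output) => {
  const record = parseObject(output);
  if (record === null) return 'FAIL';
  const ok = ['intent', 'size', 'type'].every(
    (key) => typeof record[key] === 'string' && record[key] !== '',
  );
  return ok ? 'PASS' : 'FAIL';
};

// Dimension 2: intent. The outcome matters, not how the model got there.
const hasPizzaIntent: Scorer = (output) =>
  parseObject(output)?.['intent'] === 'order_pizza' ? 'PASS' : 'FAIL';

// Runs deterministic scorers first; the judge runs only if all of them pass.
function grade(output: string, deterministic: Scorer[], judge: Scorer): Verdict {
  for (const scorer of deterministic) {
    if (scorer(output) === 'FAIL') return 'FAIL';
  }
  return judge(output);
}

// Stand-in judge that records whether it was called.
let judgeCalls = 0;
const fakeJudge: Scorer = () => {
  judgeCalls += 1;
  return 'PASS';
};

const good = '{"intent":"order_pizza","size":"large","type":"margherita"}';
console.log(grade(good, [hasRequiredFields, hasPizzaIntent], fakeJudge)); // PASS
console.log(grade('not json', [hasRequiredFields, hasPizzaIntent], fakeJudge)); // FAIL
console.log(judgeCalls); // 1
```

The malformed output is rejected before the judge is ever called, which keeps the expensive scorer off inputs that cannot pass.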

Writing Approach

Match Existing Style

Before writing any test, find existing tests near the code under test. Match their imports, describe/it structure, helpers, and patterns. Don't introduce new conventions into an established test suite.

If no existing tests: use AAA pattern (Arrange-Act-Assert).
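
For reference, a minimal sketch of the AAA shape. `applyCoupon` and the coupon fields are hypothetical, and the test is written as a plain function returning a boolean so it runs without a framework. In a real suite the final line would be an `expect(...)` assertion.

```typescript
// Hypothetical unit under test.
function applyCoupon(
  subtotal: number,
  coupon: { code: string; percentOff: number },
): number {
  return subtotal - (subtotal * coupon.percentOff) / 100;
}

function appliesCouponPercentageToSubtotal(): boolean {
  // Arrange: set up only the inputs this behavior needs.
  const subtotal = 200;
  const coupon = { code: 'SPRING10', percentOff: 10 };

  // Act: perform the single action under test.
  const total = applyCoupon(subtotal, coupon);

  // Assert: check one observable outcome.
  return total === 180;
}

console.log(appliesCouponPercentageToSubtotal()); // true
```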

Design Before Writing

List planned tests before coding. For each test, name:

What behavior it verifies (not what code it calls)
What the key assertion is (not "it doesn't throw")
Why this test matters (what bug would slip through without it?)

Aim for: happy path + edge cases + error cases + at least one test the implementation could plausibly get wrong.

One Test at a Time

Write one test → run it → verify it fails (or passes for characterization) → move to next. Never write all tests at once then run them.

Patterns

Test Data Builders

// Defaults cover every field; each test overrides only what it cares about.
function buildUser(overrides = {}) {
  return { id: 'test-1', name: 'Test User', tier: 'standard', ...overrides };
}

it('applies 20% discount for VIP users', () => {
  const user = buildUser({ tier: 'VIP' });
  expect(calculateDiscount(100, user)).toBe(80);
});

Async Testing — Never Use Arbitrary Timeouts

// WRONG
await sleep(3000);
await page.waitForTimeout(500);

// RIGHT — wait for condition
await expect.poll(() => getStatus()).toBe('ready');
await waitFor(() => expect(element).toBeVisible());

Descriptive Test Names

// WRONG
it('works correctly');
it('should handle edge case');

// RIGHT — describes the behavior
it('returns 401 when API key is missing');
it('preserves user input after validation error');

Quick Reference

Need	Action
Full test type selection guide	`.safeword/guides/testing-guide.md`
Test definition template (BDD)	`.safeword/templates/test-definitions-feature.md`
Test quality review	`/audit`
Feature-level TDD with scenarios	`/bdd`
Debugging failing tests	`/debug`