name: test-design-reviewer
description: "Assess test suite quality using Farley's 8 Properties and Tautology Theatre detection. Use when user says review tests, test quality, are my tests good, test assessment, or test design review. Not for writing tests (use language skills) or code review (use gemini-review)."
ABOUTME: Test quality assessment using Farley's 8 Properties of Good Tests
ABOUTME: Detects tautological tests, mock theatre, and structural test weaknesses
Test Design Reviewer
Quality Notes
- Read every test file thoroughly before scoring
- Quality over speed: analyze what each test actually verifies
- Do not skip the Tautology Theatre check
Process
Step 1: Collect test files
Identify all test files in scope. Use language-appropriate patterns:
- Go: *_test.go
- Python: test_*.py, *_test.py
- Ruby: *_spec.rb
- JS/TS: *.test.ts, *.spec.ts
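A minimal sketch of the collection step, assuming a plain recursive walk from a project root. The function names and the `TEST_PATTERNS` keys are illustrative and not part of this skill; only the glob patterns come from the list above.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Patterns copied from the list above; the dictionary keys are illustrative labels.
TEST_PATTERNS = {
    "go": ["*_test.go"],
    "python": ["test_*.py", "*_test.py"],
    "ruby": ["*_spec.rb"],
    "js_ts": ["*.test.ts", "*.spec.ts"],
}


def is_test_file(filename: str) -> bool:
    """Return True when the bare file name matches any known test pattern."""
    return any(
        fnmatchcase(filename, pattern)
        for patterns in TEST_PATTERNS.values()
        for pattern in patterns
    )


def collect_test_files(root: str) -> list[Path]:
    """Walk `root` recursively and return matching files in sorted order."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and is_test_file(path.name)
    )
```

Projects with non-standard layouts would need extra patterns added to the dictionary.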
Step 2: Score against Farley's 8 Properties
Rate each property 0-10 across the test suite. Provide evidence.
| # | Property | Question to ask | Red flags |
|---|---|---|---|
| 1 | Understandable | Can you tell what's being tested in 5 seconds? | Cryptic names, no arrange/act/assert structure, shared state |
| 2 | Maintainable | Will this break when implementation changes? | Testing private methods, brittle selectors, hardcoded values |
| 3 | Repeatable | Same result every run, any order, any machine? | Time-dependent, filesystem-dependent, test ordering, shared DB state |
| 4 | Atomic | Isolated from other tests, with one reason to fail? | Shared mutable state, order-dependent tests, setup-heavy |
| 5 | Necessary | Does this test earn its keep? | Duplicate coverage, testing framework/language behavior |
| 6 | Granular | Pinpoints the failure location? | Coarse assertions (assert result), catch-all tests |
| 7 | Fast | Runs in milliseconds? | Real HTTP calls, sleep/wait, full DB setup per test |
| 8 | First | Written before production code? | Tests that mirror implementation structure, not behavior |
Scoring methodology:
- Static scoring: compute 0-10 per property using sigmoid normalization on signal densities (negative signals / test methods, positive signals / test methods). Use lib/cli_calculator.py for deterministic math (JSON in, JSON out).
- LLM scoring: assess holistically per property, focusing on semantic aspects static analysis misses (naming quality, assertion appropriateness, tautology theatre)
- Blend: final_property_score = 0.60 * static_score + 0.40 * llm_score per property
- Conservative default: when no signals detected for a property, default to 5.0 (unknown quality, not good quality)
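The scoring rules above can be sketched as follows. The 60/40 blend and the 5.0 default come from this document; the logistic curve and its `steepness` parameter are assumptions made for illustration, since the actual normalization lives in lib/cli_calculator.py and is not reproduced here.

```python
import math

STATIC_WEIGHT = 0.60  # from the blend rule above
LLM_WEIGHT = 0.40
DEFAULT_SCORE = 5.0   # conservative default when no signals are detected


def static_score(neg_count: int, pos_count: int, total_methods: int,
                 steepness: float = 4.0) -> float:
    """Map signal densities to a 0-10 score with a logistic curve.

    ASSUMPTION: the curve shape and `steepness` are illustrative only;
    the real calculator may normalize differently.
    """
    if total_methods <= 0 or (neg_count == 0 and pos_count == 0):
        return DEFAULT_SCORE
    # Net density: positive signals per method minus negative signals per method.
    net_density = (pos_count - neg_count) / total_methods
    return 10.0 / (1.0 + math.exp(-steepness * net_density))


def blend(static: float, llm: float) -> float:
    """Combine static and LLM scores for one property using the 60/40 rule."""
    return STATIC_WEIGHT * static + LLM_WEIGHT * llm
```

For example, `blend(8.0, 6.0)` gives 7.2, so a weaker LLM assessment pulls the static score down without overriding it.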
Per-property scoring rubrics (anchor scores to these bands):
| Property | 9-10 | 7-8 | 5-6 | 3-4 | 1-2 |
|---|---|---|---|---|---|
| U | Reads like specs; behavior clear without reading impl | Clear with minor ambiguities | Requires code inspection to understand | Cryptic; relies on impl details | test1/test2; magic numbers throughout |
| M | Proper abstractions; verifies behavior not impl | Good separation; occasional brittleness | Some impl coupling; some over-specified mocks | Tightly coupled; verify with exact counts | Reflection for private fields; mirrors impl exactly |
| R | Fully deterministic; no external deps | Rarely flaky; minimal env deps | Occasional flakiness; timing deps | Filesystem, timing, env deps present | sleep, file I/O, network, system time, unseeded random |
| A | Fully isolated; no shared state; parallelizable | Mostly isolated; minor shared setup | Some shared state; order sometimes matters | Heavy interdeps; must run in order | Shared mutable statics; ordering annotations |
| N | Every test adds unique value; parameterized for variations | Most tests valuable; minor redundancy | Checkbox exercises; moderate redundancy | Redundant tests; framework testing; mock tautologies | assertTrue(true); disabled tests; tests verify only mocks |
| G | Each test verifies single outcome; pinpoints issues | Focused; occasional logical assertion groups | Multiple behaviors; failure diagnosis takes effort | Sprawling; multiple unrelated assertions | 20+ assertions; testEverything() methods |
| F | Pure computation; no I/O; milliseconds | Quick; minor optimization opportunities | Some slow tests; noticeable suite time | File I/O or database calls | sleep, network calls, heavy setup/teardown |
| T | Clear test-first evidence; tests drive design | Likely test-first; good design influence | Unclear; tests may be afterthoughts | Mirrors impl; likely test-after; mock-heavy | Clearly written after code; coverage patches |
Aggregation methodology:
- Per-test-method: collect signals at individual method level
- Per-test-file: mean for positive signals, P90 for negative signals (worst offenders must surface)
- Per-test-suite: LOC-weighted mean across files
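A sketch of the three aggregation levels, assuming per-method signals arrive as (positive, negative) count pairs and per-file scores as (score, lines of code) pairs. The nearest-rank percentile definition is an assumption; this document only says P90.

```python
from statistics import fmean


def p90(values: list[float]) -> float:
    """Nearest-rank 90th percentile (assumed definition); 0.0 for empty input."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return float(ordered[rank - 1])


def aggregate_file(method_signals: list[tuple[int, int]]) -> dict[str, float]:
    """Per-file roll-up: mean of positive signals, P90 of negative signals."""
    positives = [pos for pos, _ in method_signals]
    negatives = [neg for _, neg in method_signals]
    return {
        "positive": fmean(positives) if positives else 0.0,
        "negative": p90(negatives),
    }


def aggregate_suite(file_scores: list[tuple[float, int]]) -> float:
    """Per-suite roll-up: LOC-weighted mean of per-file scores."""
    total_loc = sum(loc for _, loc in file_scores)
    if total_loc == 0:
        return 0.0
    return sum(score * loc for score, loc in file_scores) / total_loc
```

Using P90 instead of the mean for negative signals means a single badly behaved test method in a small file still dominates that file's negative figure.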
Sampling for large suites:
- Under 50 test files: analyze all
- Over 50: select a deterministic 30% sample by SHA-256 hash, plus every file with more than 100 test methods
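One way to make the sampling rule reproducible is to order files by the SHA-256 digest of their path and take the first 30%. This is a sketch under assumptions: hashing the path (not the file contents), rounding the quota up, and treating a suite of exactly 50 files as small are all choices this document does not specify.

```python
import hashlib

SMALL_SUITE_LIMIT = 50     # at or below this, analyze everything (assumed boundary)
LARGE_FILE_METHODS = 100   # files above this method count are always included


def select_files(method_counts: dict[str, int]) -> list[str]:
    """Pick files to analyze from a mapping of path -> test method count."""
    paths = sorted(method_counts)
    if len(paths) <= SMALL_SUITE_LIMIT:
        return paths
    # Ordering by digest is stable across runs and machines.
    by_digest = sorted(
        paths, key=lambda p: hashlib.sha256(p.encode("utf-8")).hexdigest()
    )
    quota = (3 * len(paths) + 9) // 10  # integer form of ceil(0.30 * n)
    chosen = set(by_digest[:quota])
    chosen.update(p for p, count in method_counts.items() if count > LARGE_FILE_METHODS)
    return sorted(chosen)
```

Because the selection depends only on the path strings, two reviewers running against the same tree get the same sample.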
Weighted Farley Index = (U*1.5 + M*1.5 + R*1.25 + A*1.0 + N*1.0 + G*1.0 + F*0.75 + T*1.0) / 9.0
Divisor is 9.0 (sum of weights), not 8 (number of properties). U/M weighted highest (readability, coupling); F weighted lowest (speed is contextual).
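As a worked illustration of the formula, the sketch below computes the index for the sample scores used later in the calculator examples. The function name is illustrative; production arithmetic should still go through lib/cli_calculator.py.

```python
# Weights from the Weighted Farley Index formula; they sum to 9.0.
WEIGHTS = {
    "U": 1.5, "M": 1.5, "R": 1.25, "A": 1.0,
    "N": 1.0, "G": 1.0, "F": 0.75, "T": 1.0,
}


def farley_index(scores: dict[str, float]) -> float:
    """Weighted mean of the eight blended property scores."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing property scores: {sorted(missing)}")
    weighted_sum = sum(scores[prop] * weight for prop, weight in WEIGHTS.items())
    return weighted_sum / sum(WEIGHTS.values())


example = {"U": 8.5, "M": 7.0, "R": 9.0, "A": 8.0,
           "N": 7.5, "G": 8.0, "F": 6.0, "T": 7.0}
print(round(farley_index(example), 2))  # weighted sum 69.5 / 9.0 -> 7.72
```

Dividing by 8 instead of 9.0 would inflate this example to about 8.69, which is why the divisor note matters.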
| Range | Rating | Interpretation |
|---|---|---|
| 9.0-10.0 | Exemplary | Model suite; tests serve as living documentation |
| 7.5-8.9 | Excellent | High quality with minor improvement opportunities |
| 6.0-7.4 | Good | Solid foundation with clear areas for improvement |
| 4.5-5.9 | Fair | Functional but needs significant attention to test design |
| 3.0-4.4 | Poor | Tests provide limited value; major refactoring needed |
| 0.0-2.9 | Critical | Tests may be harmful; consider rewriting from scratch |
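The rating table can be read as a lookup on lower bounds. This sketch assumes that a value falling between two printed bands (for example 8.95) belongs to the lower band; the table itself does not say how such values are handled.

```python
# (lower bound, rating) pairs from the table above, highest band first.
RATING_BANDS = [
    (9.0, "Exemplary"),
    (7.5, "Excellent"),
    (6.0, "Good"),
    (4.5, "Fair"),
    (3.0, "Poor"),
    (0.0, "Critical"),
]


def get_rating(index: float) -> str:
    """Return the rating label for a Farley Index in the range 0.0-10.0."""
    if not 0.0 <= index <= 10.0:
        raise ValueError(f"index out of range: {index}")
    for lower_bound, label in RATING_BANDS:
        if index >= lower_bound:
            return label
    return "Critical"  # unreachable for valid input; kept as a safe fallback
```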
Step 3: Tautology Theatre Detection
The critical question: "Would this test still pass if all production code were deleted?"
Scan for these 4 patterns:
| Pattern | What it looks like | Example |
|---|---|---|
| Mock tautology | Test verifies that a mock returns what it was told to return | mock.return_value = 42; assert service.get() == 42 (only tests the mock) |
| Mock-only test | Every dependency is mocked, nothing real executes | Test with 5 mocks and zero real objects |
| Trivial tautology | Assertion is always true regardless of code | assert isinstance(result, dict) when function signature guarantees dict |
| Framework test | Tests framework behavior, not application logic | Testing that Rails validations work, that pytest fixtures inject |
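To make the mock tautology pattern concrete, the sketch below contrasts it with a test that exercises real logic. `PriceService` and its 20% tax rule are hypothetical and exist only for this illustration.

```python
from unittest.mock import Mock


class PriceService:
    """Hypothetical production class used only for this illustration."""

    def __init__(self, repo):
        self.repo = repo

    def price_with_tax(self, sku: str) -> float:
        # Real logic: apply a 20% tax to the stored base price.
        return round(self.repo.base_price(sku) * 1.2, 2)


def test_mock_tautology():
    # Anti-pattern: the "service" is itself a mock, so the assertion only
    # confirms that the mock returns what it was told to return.
    service = Mock()
    service.get.return_value = 42
    assert service.get() == 42


def test_real_behavior():
    # Only the boundary (the repository) is mocked; the tax calculation in
    # PriceService actually runs and is what the assertion checks.
    repo = Mock()
    repo.base_price.return_value = 10.0
    service = PriceService(repo)
    assert service.price_with_tax("sku-1") == 12.0
```

Deleting `PriceService` breaks the second test but leaves the first one green, which is exactly what the critical question is probing for.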
Also scan for mock interaction anti-patterns (affect Maintainable score):
| Pattern | What it looks like |
|---|---|
| Over-specified interactions | verify with exact call counts, call ordering, verifyNoMoreInteractions |
| Testing internal details | ArgumentCaptor deep inspection, verify(never()) mirroring branches, high verify-to-assert ratio |
For each tautology or anti-pattern found: report the file, line, pattern type, and why it's problematic.
Step 4: Report
## Test Design Review
### Farley Index: X.X / 10.0 (Rating)
| Property | Static | LLM | Blended | Weight | Weighted | Key Evidence |
|----------|--------|-----|---------|--------|----------|--------------|
| Understandable | X.X | X.X | X.X | 1.50x | X.XX | ... |
| Maintainable | X.X | X.X | X.X | 1.50x | X.XX | ... |
| Repeatable | X.X | X.X | X.X | 1.25x | X.XX | ... |
| Atomic | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Necessary | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Granular | X.X | X.X | X.X | 1.00x | X.XX | ... |
| Fast | X.X | X.X | X.X | 0.75x | X.XX | ... |
| First (TDD) | X.X | X.X | X.X | 1.00x | X.XX | ... |
### Tautology Theatre Analysis
Each subsection always present; use "None detected." when empty.
#### Mock Tautologies
| Test Method | Line | Mock Setup | Assertion |
#### Mock-Only Tests
| Test Method | Line | Evidence |
#### Trivial Tautologies
| Test Method | Line | Assertion |
#### Framework Tests
| Test Method | Line | Assertion | What It Actually Tests |
**Summary**: {total} instances across {affected}/{total_methods} test methods.
### Top 3 Improvements
1. [Highest-impact fix targeting weakest high-weight property]
2. [Second priority]
3. [Third priority]
### Methodology Notes
- Static/LLM blend: 60/40
- Files analyzed: {count} ({sampling note})
- Language: {lang}, Framework: {framework}
Integration with Review Pipeline
This skill is invoked by the orchestrator when test files are in scope (see orchestrator-protocol.md, review routing step). It can also be invoked directly via /test-design-reviewer.
Deterministic Scoring Calculator
lib/cli_calculator.py provides JSON-in, JSON-out math for reproducible scores. Delegate all Farley Index arithmetic to this CLI to avoid LLM rounding drift.
Commands: normalize-property, blend-scores, compute-farley, get-rating, aggregate-file, aggregate-suite, full-pipeline.
# Normalize a single property from signal counts
python lib/cli_calculator.py normalize-property '{"prop":"U","neg_count":2,"pos_count":8,"total_methods":20}'
# Compute Farley Index from 8 blended scores
python lib/cli_calculator.py compute-farley '{"U":8.5,"M":7.0,"R":9.0,"A":8.0,"N":7.5,"G":8.0,"F":6.0,"T":7.0}'
# End-to-end: raw signals + optional LLM scores -> index + rating
python lib/cli_calculator.py full-pipeline '{"properties":{"U":{"neg_count":2,"pos_count":8,"total_methods":20},...},"llm_scores":{"U":8.0,...}}'
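When calling the calculator from another script, a thin wrapper keeps the JSON handling in one place. This is a sketch under assumptions: that the CLI prints a single JSON document to stdout and exits non-zero on failure. Neither detail is confirmed here, and the helper names are illustrative.

```python
import json
import subprocess
import sys

CALCULATOR = "lib/cli_calculator.py"  # path as used in the commands above


def build_command(command: str, payload: dict) -> list[str]:
    """Build the argv list: interpreter, script, subcommand, JSON payload."""
    return [sys.executable, CALCULATOR, command, json.dumps(payload)]


def run_calculator(command: str, payload: dict) -> dict:
    """Run one calculator subcommand and parse its stdout as JSON.

    ASSUMPTION: stdout holds exactly one JSON document and a non-zero
    exit code signals failure (check=True raises CalledProcessError then).
    """
    completed = subprocess.run(
        build_command(command, payload),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

Passing the payload as a single argv element avoids shell quoting issues, since no shell is involved.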
Common Issues
| Issue | Solution |
|---|---|
| No test files found | Check file patterns; some projects use non-standard locations |
| High mock count ≠ bad | Mocks are fine when testing boundaries; flag only when they replace all real logic |
| Property scores vary by test type | Score unit tests and integration tests separately if the suite is mixed |
| Legacy test suite scores low | Focus improvements on the top 3, not a full rewrite |