name: test-confidence description: AI-driven test execution. Opus decides what to run and how confident to be, based on your diff. argument-hint: "--full to run to 100% | --strict to halt on pre-existing failures" allowed-tools: Bash(git *), Bash(bundle exec rspec *), Bash(cat *), Bash(find *), Bash(wc *), Bash(head *), Bash(tail *), Bash(grep *), Bash(bin/test-confidence *)
Test Confidence
Run bin/test-confidence to have Opus 4.7 analyze your diff, decide the risk level, plan which tests to run and in what order, and set confidence milestones. The AI decides the shape of the curve based on this specific diff.
Usage
bin/test-confidence # Run to 99%, stop. Skips past pre-existing failures.
bin/test-confidence --full # Run to 100%
bin/test-confidence --strict # Halt on any failure, including pre-existing
ANTHROPIC_API_KEY is auto-sourced from .env if not exported.
If $ARGUMENTS is provided, pass it through: bin/test-confidence $ARGUMENTS
How it works
- Finds changed files (branch diff vs main for PR branches, local diff on main)
- Hashes the diff and checks
tmp/test-confidence/for a cached plan; reuses if found - Otherwise sends the diff + spec tree + touched directories to Opus 4.7 in one call
- Opus returns a plan: risk level, ordered test list, confidence milestones
- Script executes the plan, showing yellow progress bar toward 99%
- At 99%, bar turns green. Safe to commit.
- With
--full, continues running remaining tests toward 100%
Pre-existing failure detection
When a spec fails, the script applies a two-step check:
- Path heuristic — does the failing spec file or its source counterpart (e.g.,
app/models/user.rbforspec/models/user_spec.rb) appear in the diff?- In the diff → real regression. Halt immediately.
- Not in the diff → suspect pre-existing. Verify on merge-base.
- Merge-base verify — re-run the failing rspec examples in a temporary worktree at the branch's merge-base with
main.- Still fails on merge-base → confirmed pre-existing. Continue.
- Passes on merge-base → cross-file regression caught. Halt.
The verify step is skipped (heuristic verdict trusted) when the diff includes a db/migrate/*.rb (DB schema drift) or Gemfile.lock (bundler drift), or when no merge-base is available. --strict skips both checks and halts on every failure.
Cost profile: zero overhead in the common no-failure case; verify only runs on heuristic-flagged "pre-existing" failures. Catches the dangerous direction (silent regression skipped) while keeping false-alarm investigation costs bounded.
The key insight: Opus decides ad hoc how many tests are needed for each confidence level. A comment-only change might need 2 tests for 99%. A payment model refactor might need 100.
When to use
Run this before every commit. It replaces manually picking which specs to run.
$ARGUMENTS