name: testing
description: |
  Test orchestration pipeline for arc Phase 7.7: 4-tier testing (unit, property-based, integration, E2E/browser) with diff-scoped discovery and structured reporting.
  Extended tier covers contract validation, visual regression, design tokens, a11y, test history, and flaky detection.
  Auto-loaded by the arc orchestrator during the test phase.
  Keywords: testing, test pipeline, unit test, integration, E2E, PBT, property-based, fast-check, hypothesis, proptest, visual regression, design token, accessibility, flaky test, contract validation.
user-invocable: false
disable-model-invocation: false
Testing Orchestration — Arc Phase 7.7
This skill provides the knowledge base for the arc pipeline's testing phase. It is auto-loaded by the arc orchestrator and injected into test runner agents.
Testing Pyramid Hierarchy
             /\
            /E2E\            ← Slow, few (max 3 routes)
           /------\
          /Integr. \         ← Moderate speed, moderate count
         /----------\
        / PBT (1.5) \        ← Fast, invariant-based (when PBT lib available)
       /--------------\
      / Unit Tests (1) \     ← Fast, many (diff-scoped)
     /------------------\
Execution order: Unit → PBT → Integration → E2E (serial by tier, parallel within tier).

Failure cascade: Tier failures are non-blocking. Every tier enabled by scope detection and service health executes, regardless of prior tier results.
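The serial, non-blocking cascade can be sketched as follows. This is a minimal illustration rather than the orchestrator's code: `runTiers` and the `run` callback are hypothetical names, and the scope-detection and service-health gating is reduced to a plain `enabled` list.

```typescript
type Tier = "unit" | "pbt" | "integration" | "e2e";

interface TierResult {
  tier: Tier;
  passed: boolean;
}

// Fixed tier order taken from the text above.
const TIER_ORDER: Tier[] = ["unit", "pbt", "integration", "e2e"];

// Runs every enabled tier in order. A failing tier is recorded but never
// stops the tiers that follow it (non-blocking cascade).
function runTiers(enabled: Tier[], run: (tier: Tier) => boolean): TierResult[] {
  return TIER_ORDER
    .filter((tier) => enabled.includes(tier))
    .map((tier) => ({ tier, passed: run(tier) }));
}
```

With `enabled = ["unit", "integration", "e2e"]` and a failing unit tier, the integration and E2E tiers still run and report their own results.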
Tier 1.5: Property-Based Testing
PBT tests invariants with randomly generated inputs, catching edge cases that example-based tests miss. Runs between unit (Tier 1) and integration (Tier 2) tests.
Skip conditions: No PBT library in dependencies AND no PBT-suitable patterns detected in changed code.
Discovery: Check package.json for fast-check, requirements.txt/pyproject.toml for hypothesis, Cargo.toml for proptest, go.mod for rapid. If library present, run PBT tier. If absent but patterns detected, suggest adding the library.
Timeout: 2x the unit test timeout (PBT generation is CPU-intensive). The v3.x baked-in default multiplier is 2.
See property-based-testing.md for library selection, code templates, discovery protocol, and common property patterns.
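As a rough sketch of the discovery and timeout rules in this section, the check below looks for each library name in its manifest. The input shape (`manifests` as a file-name-to-contents map), the function names, and the plain substring match are assumptions made for illustration; property-based-testing.md holds the actual discovery protocol.

```typescript
type PbtLibrary = "fast-check" | "hypothesis" | "proptest" | "rapid";

// Manifest file -> library to look for, as listed in the discovery rule.
const PBT_MANIFESTS: Array<[string, PbtLibrary]> = [
  ["package.json", "fast-check"],
  ["requirements.txt", "hypothesis"],
  ["pyproject.toml", "hypothesis"],
  ["Cargo.toml", "proptest"],
  ["go.mod", "rapid"],
];

// Assumption: a substring match stands in for real manifest parsing.
function detectPbtLibrary(manifests: Record<string, string>): PbtLibrary | null {
  for (const [file, lib] of PBT_MANIFESTS) {
    const content = manifests[file];
    if (content !== undefined && content.includes(lib)) return lib;
  }
  return null;
}

// PBT timeout = multiplier x unit test timeout (multiplier 2 per the text).
function pbtTimeoutMs(unitTimeoutMs: number, multiplier = 2): number {
  return unitTimeoutMs * multiplier;
}
```

A `null` result corresponds to the "library absent" branch, where the tier is skipped or the library is suggested depending on whether PBT-suitable patterns were detected.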
Model Routing Rules
| Role | Model | Rationale |
|---|---|---|
| Test orchestration (team lead) | Opus | Complex coordination, strategy |
| Unit test runner | Sonnet | Fast execution, low complexity |
| Contract validator | Sonnet | API/schema validation, non-blocking |
| Integration test runner | Sonnet | Moderate complexity, service interaction |
| E2E browser tester | Sonnet | Browser interaction, snapshot analysis |
| Extended test runner | Sonnet | Long-running scenarios, checkpoint support |
| Failure analyst | Opus (inherit) | Root cause analysis, multi-file reasoning |
Strict enforcement: Team lead (Opus) NEVER executes test commands directly. All test execution happens via Sonnet teammates.
Scope Detection
See scope-detection.md for the shared resolveTestScope() algorithm.
Summary:
- Input: PR number string, branch name, or empty (auto-detect current branch)
- Output: `{ files: string[], source: "pr"|"branch"|"current", label: string }`
- Priority: PR files (via `gh`) → branch diff → current-branch diff → fallback warn
- Security: PR numbers must be digit-only; branch names validated against `[a-zA-Z0-9._/-]+`
- Used by arc Phase 7.7 TEST. (Note: standalone `/rune:test-browser` skill was removed in v3.0.0-alpha.2; algorithm now lives only inside the arc test phase.)
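The input validation in the summary above might look like the following sketch. `classifyScopeInput` and the `ScopeInput` type are hypothetical names; only the digit-only rule and the branch character class come from the text, and the real `resolveTestScope()` lives in scope-detection.md.

```typescript
type ScopeInput =
  | { kind: "pr"; pr: string }
  | { kind: "branch"; branch: string }
  | { kind: "current" };

const PR_NUMBER = /^\d+$/;                  // digit-only PR numbers
const BRANCH_NAME = /^[a-zA-Z0-9._\/-]+$/;  // character class from the text

// Classifies raw user input before any `gh` or git command is built from it.
// Anything that matches neither pattern is rejected outright.
function classifyScopeInput(raw: string): ScopeInput {
  const input = raw.trim();
  if (input === "") return { kind: "current" };
  if (PR_NUMBER.test(input)) return { kind: "pr", pr: input };
  if (BRANCH_NAME.test(input)) return { kind: "branch", branch: input };
  throw new Error(`Rejected unsafe scope input: ${JSON.stringify(input)}`);
}
```

Checking the PR pattern first matters in this sketch, because a digit-only string also satisfies the branch character class.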
Diff-Scoped Test Discovery
See test-discovery.md for the full algorithm.
Summary:
- Get changed files from `resolveTestScope()`, NOT from raw `git diff`
- Map each source file to its test counterpart by convention
- If no test file found → flag as "uncovered implementation"
- Include changed test files directly
- For shared utilities (`lib/`, `utils/`, `core/`) → trigger full unit suite
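A minimal sketch of the discovery steps above, under one loudly assumed naming convention (`foo.ts` maps to `foo.test.ts`). The actual conventions are defined in test-discovery.md; `discoverTests`, `toTestPath`, and the result shape are hypothetical.

```typescript
interface DiscoveryResult {
  tests: string[];           // test files to execute
  uncovered: string[];       // "uncovered implementation" files
  runFullUnitSuite: boolean; // set when a shared utility changed
}

const SHARED_DIRS = ["lib/", "utils/", "core/"];

const isTestFile = (file: string): boolean => /\.test\.[a-z]+$/.test(file);

// Assumed convention: src/foo.ts -> src/foo.test.ts
function toTestPath(source: string): string {
  return source.replace(/(\.[a-z]+)$/, ".test$1");
}

function discoverTests(changed: string[], existing: Set<string>): DiscoveryResult {
  const tests = new Set<string>();
  const uncovered: string[] = [];
  let runFullUnitSuite = false;
  for (const file of changed) {
    if (isTestFile(file)) {
      tests.add(file); // changed test files are included directly
      continue;
    }
    if (SHARED_DIRS.some((dir) => file.startsWith(dir) || file.includes("/" + dir))) {
      runFullUnitSuite = true;
    }
    const candidate = toTestPath(file);
    if (existing.has(candidate)) tests.add(candidate);
    else uncovered.push(file);
  }
  return { tests: Array.from(tests), uncovered, runFullUnitSuite };
}
```

Here `changed` stands for the `files` array returned by `resolveTestScope()`, and `existing` for the set of test files present in the repository.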
Service Startup Patterns
See service-startup.md for the full protocol.
Summary:
- Auto-detect: docker-compose.yml → Docker; package.json → npm; Makefile → make
- Health check: HTTP GET every 2s, max 30 attempts (60s total)
- Hard timeout: 3 minutes for Docker startup
- Snapshot verification: after health check, open browser and check page is not blank/error
- Arc mode: WARN and proceed if verification fails
- Standalone mode: abort with framework-specific fix instructions
- Failure → skip integration/E2E tiers, unit tests still run
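The auto-detection and health-check rules above can be sketched as below. Treating the listed order as a priority is an assumption, as are the names `detectStarter` and `waitForHealthy`. The probe and sleep functions are injected so the loop carries no real HTTP or timer dependency; service-startup.md defines the actual protocol.

```typescript
type Starter = "docker" | "npm" | "make";

// Assumption: the order in the summary is also the detection priority.
const DETECTION_ORDER: Array<[string, Starter]> = [
  ["docker-compose.yml", "docker"],
  ["package.json", "npm"],
  ["Makefile", "make"],
];

function detectStarter(rootFiles: string[]): Starter | null {
  for (const [file, starter] of DETECTION_ORDER) {
    if (rootFiles.includes(file)) return starter;
  }
  return null;
}

const HEALTH_INTERVAL_MS = 2_000;
const HEALTH_MAX_ATTEMPTS = 30; // 30 attempts x 2s = 60s total

// Polls `probe` until it succeeds or the attempt ceiling is reached.
// A `false` result maps to "skip integration/E2E tiers, unit tests still run".
async function waitForHealthy(
  probe: () => Promise<boolean>,
  sleep: (ms: number) => Promise<void>,
): Promise<boolean> {
  for (let attempt = 1; attempt <= HEALTH_MAX_ATTEMPTS; attempt++) {
    if (await probe()) return true;
    await sleep(HEALTH_INTERVAL_MS);
  }
  return false;
}
```

In a real runner, `probe` would wrap the HTTP GET against the service and `sleep` would wrap a timer; both are left abstract here.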
File-to-Route Mapping
See file-route-mapping.md for framework patterns.
Test Report Format
See test-report-template.md for the output spec.
Discipline Integration (v1.173.0)
Echo-Back Requirement (AC-8.4.1, AC-8.4.2)
Test runner agents MUST echo back their test strategy before execution:
I will verify:
AC-1 (user authentication) → unit test: test_login_flow
AC-2 (rate limiting) → integration test: test_rate_limiter
AC-5 (audit logging) → no test available (WARN: criteria not covered)
This echo-back is logged to the test strategy document and lets the orchestrator detect criteria misalignment BEFORE tests run. If a criterion has no test, it's flagged early rather than discovered post-execution.
Failure Classification with F-Codes (AC-8.4.3, AC-8.4.4, AC-8.4.5)
The fix loop classifies failures with discipline failure codes for pattern tracking (see failure-codes.md for full F1-F17 registry):
| F-Code | Name | Meaning | Recovery Action |
|---|---|---|---|
| F3 | PROOF_FAILURE | Implementation is wrong — code doesn't meet criterion | Fix code, re-run |
| F8 | INFRASTRUCTURE_FAILURE | Test itself is broken or infra is misconfigured | Fix test/infra, re-run |
| F17 | CONVERGENCE_STAGNATION | Same test fails same assertion across 2+ fix attempts | Escalate immediately — stop retrying |
F17 detection: When the same test fails with the same assertion message across 2+ fix attempts, the fix loop breaks immediately instead of retrying. This prevents wasting cycles on unfixable issues.
F-code → discipline metrics: Classification feeds into discipline metrics (Shard 5 T5.1) for failure pattern tracking across pipeline runs. Patterns like "F3 on auth tests" recurring across arcs indicate systemic implementation gaps.
Implementation status: F-code classification is emitted via `warn()` during fix loops and recorded in convergence history (checkpoint). Structured metrics persistence to a cross-run metrics store is planned as part of the discipline metrics pipeline (Shard 5 T5.1) but not yet implemented; current tracking is per-run via checkpoint data and echo entries.
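The F17 rule (same test, same assertion message, 2+ fix attempts) can be expressed as a small check over the fix-loop history. This is a sketch under assumed names: `FixAttempt` and `isStagnant` are hypothetical, and the convergence history in the checkpoint may be shaped differently.

```typescript
interface FixAttempt {
  test: string;      // failing test identifier
  assertion: string; // assertion message of the failure
}

const STAGNATION_THRESHOLD = 2; // "across 2+ fix attempts"

// `attempts` holds the failure observed after each fix attempt, oldest first.
// Returns true as soon as one (test, assertion) pair reaches the threshold,
// which is the point where the fix loop should break and escalate as F17.
function isStagnant(attempts: FixAttempt[]): boolean {
  const counts = new Map<string, number>();
  for (const { test, assertion } of attempts) {
    const key = JSON.stringify([test, assertion]);
    const seen = (counts.get(key) ?? 0) + 1;
    if (seen >= STAGNATION_THRESHOLD) return true;
    counts.set(key, seen);
  }
  return false;
}
```

A test that keeps failing but with a different assertion message each time does not trip this check, since that suggests the fixes are changing behavior.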
Failure Escalation Protocol
Test runner detects failure
  → Write structured failure to tier result file
  → Continue remaining tests in tier
  → After all tiers complete:
    → Team lead reads tier results (summary only, Glyph Budget pattern)
    → If failures detected:
      → Spawn test-failure-analyst (Opus, 3-min deadline)
      → Analyst reads: failure traces + source code + error logs
      → Analyst receives plan context via test strategy document (which includes
        planFilePath from checkpoint, enabling spec-aware root cause analysis)
      → Analyst produces: root cause + fix proposal + confidence
      → With plan context, analyst can identify "test fails because criterion
        AC-X was never implemented", not just "test fails on line Y"
      → If analyst times out: attach raw test output instead
Batch Execution Model (v1.165.0+)
Phase 7.7 uses sequential batched execution instead of parallel background agents. Each batch = 1 foreground agent (blocking call, zero idle risk).
- Execution order: unit batches → PBT → contract → integration → e2e → extended
- Batch sizing: `TARGET_BATCH_DURATION_MS / avg_test_duration` (clamped to 1-20)
- Fix loop: On failure, lead analyzes + fixes + reruns (max 2 retries)
- Checkpoint: testing-plan.json is both plan AND checkpoint (atomic writes)
- Fresh context: Stop hook re-injects per batch for unlimited context budget
See batch-execution.md for the full algorithm. See testing-plan-schema.md for the JSON schema.
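The batch sizing formula can be illustrated as follows. The rounding mode (floor) and the handling of a non-positive average duration are assumptions, and the value of `TARGET_BATCH_DURATION_MS` is not given here, so it is passed in as a parameter.

```typescript
const MIN_BATCH_SIZE = 1;
const MAX_BATCH_SIZE = 20;

// batch size = TARGET_BATCH_DURATION_MS / avg_test_duration, clamped to 1-20.
function batchSize(targetBatchDurationMs: number, avgTestDurationMs: number): number {
  // Assumption: with no usable duration estimate, fall back to the smallest batch.
  if (!(avgTestDurationMs > 0)) return MIN_BATCH_SIZE;
  const raw = Math.floor(targetBatchDurationMs / avgTestDurationMs);
  return Math.min(MAX_BATCH_SIZE, Math.max(MIN_BATCH_SIZE, raw));
}
```

For example, a 60-second target with tests averaging 5 seconds gives 12 tests per batch, while very fast tests saturate at 20 and very slow ones bottom out at 1.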
Anti-Skip Enforcement Rules
These rules are MANDATORY — not suggestions. Violation halts the pipeline.
- NEVER skip tests because they "take too long"
- NEVER mark testing as "done" with unfixed failures (unless max retries exceeded)
- ALL diff-scoped test files MUST be executed
- Fix-before-continue is MANDATORY — failed batch enters fix loop before proceeding
- Testing plan MUST exist before any execution begins
- Budget exhaustion is the ONLY valid skip reason; log explicitly as `skipped_budget_exhausted`
Completeness Check
After all batches complete, verify:
- No batches with status "pending" remain (all executed or explicitly skipped)
- Skipped batches have skip_reason logged
- Warning emitted if any batch failed after max retries
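A sketch of the completeness check over the batches in the testing plan. Only the `"pending"` status and the `skip_reason` field are named in the text; the other status values and the report shape are assumptions, and testing-plan-schema.md is the authoritative schema.

```typescript
// Assumption: status values other than "pending" are illustrative.
type BatchStatus = "pending" | "passed" | "failed" | "skipped";

interface Batch {
  id: string;
  status: BatchStatus;
  skip_reason?: string;
}

interface CompletenessReport {
  complete: boolean;
  problems: string[]; // violations of the completeness rules
  warnings: string[]; // batches that failed after max retries
}

function checkCompleteness(batches: Batch[]): CompletenessReport {
  const problems: string[] = [];
  const warnings: string[] = [];
  for (const batch of batches) {
    if (batch.status === "pending") {
      problems.push(`${batch.id}: still pending`);
    }
    if (batch.status === "skipped" && !batch.skip_reason) {
      problems.push(`${batch.id}: skipped without skip_reason`);
    }
    if (batch.status === "failed") {
      warnings.push(`${batch.id}: failed after max retries`);
    }
  }
  return { complete: problems.length === 0, problems, warnings };
}
```

A failed batch produces a warning rather than a problem, matching the rule that exceeding max retries is reported but does not make the run incomplete.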
Test Scenario Schema
See scenario-schema.md for the YAML test scenario format.
Summary:
- Scenarios live in `.rune/test-scenarios/*.yml`
- Required fields: `name`, `tier` (unit/pbt/integration/e2e/extended/contract)
- Discovered in STEP 0.5, merged into strategy in STEP 1.5
- Capped at `testing.scenarios.max_per_run` (default 50)
- Gate: `testing.scenarios.enabled` (default true)
Extended Tier Checkpoint/Resume
See checkpoint-protocol.md for the checkpoint/resume protocol.
Summary:
- Extended scenarios write progress to `tmp/arc/{id}/extended-checkpoint.json`
- On resume: orchestrator reads checkpoint and passes as `extendedResumeState`
- Checkpoint interval: `testing.extended_tier.checkpoint_interval_ms` (default 300_000 ms)
- Budget: `testing.extended_tier.timeout_ms` (default 3_600_000 ms)
- Gate: `testing.extended_tier.enabled` AND extended scenarios exist
Test Data Fixtures
See fixture-protocol.md for test data fixture execution.
Summary:
- Fixtures define seed data for integration and E2E tiers
- Applied before scenario steps, within the test runner agent (STEPs 5/6/7)
- Teardown runs after each scenario completes (regardless of pass/fail), not per-tier
- Gate: `testing.fixtures.enabled`
Visual Regression
See visual-regression.md for the visual regression protocol.
Summary:
- E2E browser tester captures screenshots during STEP 7
- Inline comparison against baselines in `testing.visual_regression.baseline_dir`
- Comparison tool: `agent-browser compare --baseline <path> --current <path> --format json`
- Metric: similarity score (higher = better; 1.0 = identical)
- Similarity threshold: `testing.visual_regression.threshold` (default 0.95 = 95% similarity)
- Fail condition: `diffData.similarity < threshold` (below 95% similarity)
- Failures appended as WARN section in `test-results-e2e.md` (non-blocking)
- Gate: `testing.visual_regression.enabled`
- Canonical implementation: arc-phase-test.md lines 381–407
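The fail condition above reduces to a single comparison; the sketch below also shows how failures could be turned into non-blocking WARN lines. Only the `similarity` field is named in the text. The rest of the comparison output, the `route` field, and the message format are assumptions.

```typescript
// Only `similarity` is taken from the text; other fields of the JSON output are unknown.
interface DiffData {
  similarity: number; // 1.0 = identical
}

function visualRegressionFailed(diffData: DiffData, threshold = 0.95): boolean {
  return diffData.similarity < threshold;
}

// Hypothetical helper: build WARN lines for the E2E report section.
function toWarnLines(
  results: Array<{ route: string; diffData: DiffData }>,
  threshold = 0.95,
): string[] {
  return results
    .filter((result) => visualRegressionFailed(result.diffData, threshold))
    .map(
      (result) =>
        `WARN visual regression on ${result.route}: similarity ${result.diffData.similarity} < ${threshold}`,
    );
}
```

A similarity exactly equal to the threshold passes, because the comparison is strict.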
Design Token Compliance
See design-token-check.md for design token compliance checks.
Summary:
- Validates that changed frontend files use token-based values (not hardcoded colors/spacing)
- Runs inline after E2E tier (team lead only)
- Findings appended to test report as WARN
- Gate: `testing.design_tokens.enabled`
Accessibility Validation
See accessibility-check.md for accessibility validation protocol.
Summary:
- WCAG 2.1 AA compliance checks on rendered routes
- Runs via e2e-browser-tester (injected instructions)
- Findings appended to `test-results-e2e.md`
- Gate: `testing.accessibility.enabled`
Test History Persistence
See history-protocol.md for test history persistence format.
Summary:
- Written to `.rune/test-history/test-history.jsonl` (JSONL rolling window)
- Includes: pass/fail counts, durations, tier breakdown, flaky scores, PR number
- Rolling window: `testing.history.max_entries` (default 50)
- Gate: `testing.history.enabled` (default true)
- Inline in STEP 9.5 (no agent spawn)
- Canonical implementation: arc-phase-test.md STEP 9.5 (lines 580–635)
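The rolling-window behavior can be sketched as a pure function over the JSONL lines. File I/O is left out, and `appendHistory` is a hypothetical name. The entry fields are whatever history-protocol.md specifies, so the sketch accepts any object.

```typescript
// Appends one serialized entry and keeps only the newest `maxEntries` lines.
// `lines` is the current content of the JSONL file, one JSON document per line.
function appendHistory(lines: string[], entry: object, maxEntries = 50): string[] {
  const next = [...lines, JSON.stringify(entry)];
  return next.slice(Math.max(0, next.length - maxEntries));
}
```

Writing the result back with one line per element (joined by `\n`) preserves the JSONL format, and the oldest entries fall off once the window is full.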
Regression Detection
See regression-detection.md for regression signal detection.
Two complementary regression signals are evaluated in STEP 9.5. They use different config keys, different algorithms, and different data granularities — they are NOT the same check:
Signal 1 — Global pass-rate drop (arc-phase-test.md STEP 9.5, inline):
- Compares current run pass rate against the immediately preceding history entry
- Config: `testing.history.pass_rate_drop_threshold` (float, 0.0–1.0, default `0.05` = 5% drop)
- Algorithm: `passRateDrop = previousPassRate - currentPassRate; if passRateDrop > threshold → warn`
- On detection: `updateCheckpoint({ test_regression_detected: true, regression_pass_rate_drop: passRateDrop })` + warn
- Gate: history must have ≥ 2 entries
Signal 2 — Per-test historical series (regression-detection.md, per-test algorithm):
- Evaluates each currently-failing test against its pass/fail history over last 10 runs
- Config: `testing.history.regression_threshold` (integer, default `7`): minimum recent passing runs out of last 10 to classify as a regression
- Algorithm: `passCount = recentRuns.filter(passed).length; if passCount >= threshold → regression`
- On detection: test listed in regression report with confidence score
- Gate: history must have ≥ 2 entries; test must exist in history (skips new tests)
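To make the difference between the two signals concrete, here is a sketch of both algorithms side by side. The function names and the boolean-array representation of a test's history are assumptions. The thresholds and comparisons follow the formulas above, and the history gates (≥ 2 entries, test present in history) are assumed to be checked by the caller.

```typescript
// Signal 1: global pass-rate drop against the immediately preceding run.
function passRateDropSignal(
  previousPassRate: number,
  currentPassRate: number,
  threshold = 0.05,
): { passRateDrop: number; regression: boolean } {
  const passRateDrop = previousPassRate - currentPassRate;
  return { passRateDrop, regression: passRateDrop > threshold };
}

// Signal 2: per-test series. `recentRuns` holds pass (true) / fail (false)
// results for one currently failing test, oldest first.
// A test that passed in at least `threshold` of its last 10 runs and now
// fails is classified as a regression rather than a long-standing failure.
function perTestRegressionSignal(recentRuns: boolean[], threshold = 7): boolean {
  const lastTen = recentRuns.slice(-10);
  const passCount = lastTen.filter((passed) => passed).length;
  return passCount >= threshold;
}
```

Signal 1 works on one aggregate number per run, while Signal 2 needs per-test results across runs, which is why the two use separate config keys.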
Flaky Test Identification
See flaky-detection.md for flaky test identification.
Summary:
- Computes per-test flaky scores from history: `pass_in_some_runs AND fail_in_others`
- Scores persisted in history entries as `flaky_scores` map
- High-flaky tests surfaced in test report for human review
- Gate: `testing.flaky_detection.enabled` (default true)
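The summary gives the flakiness criterion (passes in some runs, fails in others) but not the scoring formula, so the score below is purely an illustrative assumption: the share of runs in the minority outcome, scaled to 0..1. flaky-detection.md defines the real computation.

```typescript
// A test is flaky when its history contains both outcomes.
function isFlaky(runs: boolean[]): boolean {
  return runs.some((passed) => passed) && runs.some((passed) => !passed);
}

// Hypothetical score: 0 for a fully consistent test, 1 for a 50/50 split.
function flakyScore(runs: boolean[]): number {
  if (runs.length === 0) return 0;
  const passes = runs.filter((passed) => passed).length;
  const fails = runs.length - passes;
  return (2 * Math.min(passes, fails)) / runs.length;
}

// Builds a map keyed by test name, mirroring the `flaky_scores` map in history entries.
function flakyScores(history: Record<string, boolean[]>): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const test of Object.keys(history)) {
    if (isFlaky(history[test])) scores[test] = flakyScore(history[test]);
  }
  return scores;
}
```

Under this assumed formula, a test that always fails scores 0, which keeps consistently broken tests out of the flaky list.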
Security Patterns
SAFE_TEST_COMMAND_PATTERN
/^[a-zA-Z0-9._\-\/ ]+$/
Validates test runner commands. Blocks semicolons, pipes, backticks, $().
Applied to ALL commands parsed from project config files (package.json, pytest.ini).
SAFE_PATH_PATTERN
/^[a-zA-Z0-9._\-\/]+$/
Validates all file paths. Rejects .. traversal (this needs a check beyond the character class, which permits dots). Always quote: "$file".
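A short sketch of how the two patterns might be applied. The regexes are copied from above; the wrapper functions and the segment-based `..` check are assumptions, added because the path character class by itself also matches `..`.

```typescript
const SAFE_TEST_COMMAND_PATTERN = /^[a-zA-Z0-9._\-\/ ]+$/;
const SAFE_PATH_PATTERN = /^[a-zA-Z0-9._\-\/]+$/;

// Rejects any command containing shell metacharacters such as ; | ` $ ( )
function isSafeTestCommand(command: string): boolean {
  return SAFE_TEST_COMMAND_PATTERN.test(command);
}

// The character class allows dots, so `..` segments are rejected separately.
function isSafePath(path: string): boolean {
  if (!SAFE_PATH_PATTERN.test(path)) return false;
  return !path.split("/").includes("..");
}
```

One consequence of the command pattern as written: a script name containing a colon, such as `npm run test:unit`, is rejected as well, since `:` is outside the character class.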
E2E URL Scope Restriction
E2E URLs MUST be scoped to localhost or the testing E2E base URL host (v3.x default http://localhost:3000; see plugins/rune/references/v3-defaults.md).
External URLs are rejected to prevent agent-browser from navigating to untrusted sites.
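The scope restriction can be sketched with the standard `URL` class. Comparing hostnames only (ignoring port) and limiting schemes to http/https are assumptions of this sketch, as is the function name; the default base URL is the v3.x default quoted above.

```typescript
// Returns true only for URLs on localhost or on the configured E2E base URL host.
function isAllowedE2eUrl(candidate: string, baseUrl = "http://localhost:3000"): boolean {
  let url: URL;
  try {
    url = new URL(candidate);
  } catch {
    return false; // not an absolute URL
  }
  // Assumption: only http(s) targets make sense for browser E2E runs.
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const allowedHosts = new Set(["localhost", new URL(baseUrl).hostname]);
  return allowedHosts.has(url.hostname);
}
```

Parsing before comparing avoids prefix tricks: `http://localhost.evil.com` has hostname `localhost.evil.com`, which is not in the allowed set.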
Output Truncation
- 500-line ceiling for AI agent context
- Full output written to artifact file
- Summary (last 20-50 lines) extracted for agent context
- Secret scrubbing: `AWS_*`, `*_KEY`, `*_SECRET`, `*_TOKEN`, `Bearer`, `sk-*`, `ghp_*`, JWT tokens, emails redacted before agent ingestion. See secret-scrubbing.md for regex patterns and `scrubSecrets()` implementation
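A minimal sketch of the tail-summary step only. Secret scrubbing is deliberately left out because its patterns live in secret-scrubbing.md. `summarizeForAgent` is a hypothetical name, and clamping the tail length to the 500-line ceiling is one possible reading of the rules above.

```typescript
const CONTEXT_LINE_CEILING = 500;

// Returns the last `tailLines` lines of the output for agent context.
// The full output is assumed to have been written to an artifact file already.
function summarizeForAgent(fullOutput: string, tailLines = 50): string {
  const lines = fullOutput.split("\n");
  const keep = Math.max(1, Math.min(tailLines, CONTEXT_LINE_CEILING));
  return lines.slice(-keep).join("\n");
}
```

In the pipeline described above, the scrubbed version of this summary, not the raw one, would be what reaches the agent.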