name: ego-field-test
description: Run automated field tests across all registered GNOME extensions, diff against baselines, classify findings, and produce regression reports.
ego-field-test — Automated Field Test Pipeline
Run batch ego-lint across all field test extensions with baseline diffing, finding classification, and regression reporting.
Modes
- Default (no args): Lint-only batch + classify + synthesize (fast, no API cost beyond this session)
- --review: Run ego-review on ALL extensions via headless claude -p (~$0.50-2.00/ext, ~5-10 min each)
- --review-changed: Run ego-review only on extensions with changed lint results
- --review-dry-run: Hydrate review prompts without invoking claude -p (for testing templates)
- --parallel N: Max concurrent claude -p sessions (default: 3)
- --update-baselines: Save current lint results as new baselines after review
Workflow
Step 1: Run the Orchestrator
bash scripts/field-test-runner.sh --no-fetch
If extensions aren't cached yet:
bash scripts/field-test-runner.sh
This runs ego-lint on every extension listed in field-tests/manifest.yaml, writes JSON results per extension, diffs them against baselines, and appends the run to field-tests/history.jsonl.
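The baseline diff can be pictured as set arithmetic over finding IDs. The following is a minimal sketch, not the logic of scripts/diff-baselines.py: the function name `diff_findings`, the `new` and `resolved` keys, and the idea that findings are compared by ID string are assumptions made for illustration (only `unannotated_findings` is named in this document).

```python
def diff_findings(baseline_ids, current_ids, annotated_ids):
    """Compare two lint runs by finding ID (hypothetical helper)."""
    baseline = set(baseline_ids)
    current = set(current_ids)
    annotated = set(annotated_ids)
    return {
        # Present now but absent from the baseline.
        "new": sorted(current - baseline),
        # Present in the baseline but gone now.
        "resolved": sorted(baseline - current),
        # Present now with no entry in the annotations file.
        "unannotated_findings": sorted(current - annotated),
    }


# "R-EXAMPLE-01::placeholder" is a made-up ID used only for this demo.
result = diff_findings(
    baseline_ids=["R-SEC-22::dconf CLI spawn", "init/shell-modification::constructor"],
    current_ids=["R-SEC-22::dconf CLI spawn", "R-EXAMPLE-01::placeholder"],
    annotated_ids=["R-SEC-22::dconf CLI spawn"],
)
print(result)
```

Comparing by ID keeps the diff stable when line numbers shift, provided the IDs themselves do not encode positions.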
To also run ego-review (headless Claude):
# Review all extensions (~$5-20 total, ~20-30 min with parallel 3)
bash scripts/field-test-runner.sh --no-fetch --review
# Review only changed extensions (cheaper, faster)
bash scripts/field-test-runner.sh --no-fetch --review-changed
# Test prompt hydration without invoking Claude
bash scripts/field-test-runner.sh --no-fetch --review-dry-run --extension hara-hachi-bu
# Control concurrency
bash scripts/field-test-runner.sh --no-fetch --review --parallel 5
Step 2: Review Results
Read the summary:
field-tests/results/<latest-timestamp>/summary.json
For each extension, read:
- field-tests/results/<latest-timestamp>/<name>.lint.json: full results
- field-tests/results/<latest-timestamp>/<name>.diff.json: diff from baseline (if baseline exists)
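Resolving `<latest-timestamp>` by hand gets tedious, so a small helper can pick the newest run directory. This is a sketch under one loud assumption: run directories are named with timestamps that sort lexicographically. The helper name `latest_results_dir` and the directory names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def latest_results_dir(results_root):
    """Return the newest run directory, or None if there are none.

    Assumes directory names are timestamps that sort lexicographically.
    """
    runs = [p for p in Path(results_root).iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.name) if runs else None


# Demo against a throwaway directory; the timestamp names are made up.
with tempfile.TemporaryDirectory() as root:
    for name in ("20240101-120000", "20240315-093000", "20240202-080000"):
        (Path(root) / name).mkdir()
    latest_name = latest_results_dir(root).name
    print(latest_name)
```

If the runner uses a different naming scheme, sorting by modification time (`p.stat().st_mtime`) would be the safer key.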
Step 3: Classify New Findings
For each extension with unannotated_findings in the diff:
- Read the finding details from the lint JSON
- Read the extension source code to determine if each finding is TP, FP, borderline, or expected
- Update field-tests/annotations/<name>.yaml with the classification
Classification guide:
- tp: True positive — real issue that a reviewer would flag
- fp: False positive — ego-lint is wrong, should be fixed in the tool
- borderline: Arguable — reasonable people could disagree
- expected: Correct detection but expected for this extension type (e.g., shell overrides in V-Shell)
Step 4: Selective ego-review (if --review or --review-changed)
The runner handles this automatically when --review or --review-changed is passed:
- For each extension (all or changed-only), hydrates scripts/review-prompt.md with lint JSON, diff, and annotations
- Launches claude -p with the hydrated prompt, --plugin-dir, --add-dir, 10-minute timeout, $4 budget cap
- Throttles concurrency to --parallel N (default 3) using background subshells + poll loop
- Saves output to field-tests/results/<timestamp>/<name>.review.md
- Tracks review status per extension: ok, timeout, error, skipped, excluded, dry-run, no-report, none
Use --review-dry-run to test prompt hydration without invoking Claude. Hydrated prompts are saved to <name>.review-prompt.md.
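To illustrate the throttling and timeout behavior, here is a minimal sketch of bounded concurrency with a per-job timeout. It deliberately swaps in a Python thread pool for the runner's background subshells and poll loop, so it shows the idea rather than the runner's implementation. The helper names `run_one` and `run_all` are hypothetical, the commands are harmless stand-ins for the real review command, and only three of the status values listed above (`ok`, `timeout`, `error`) are modeled.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(name, cmd, timeout_s):
    """Run one command and map the outcome to a status string."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return name, "timeout"
    status = "ok" if proc.returncode == 0 else "error"
    return name, status


def run_all(jobs, parallel=3, timeout_s=600):
    """Run jobs (name -> argv list) with at most `parallel` running at once."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = [pool.submit(run_one, n, c, timeout_s) for n, c in jobs.items()]
        return dict(f.result() for f in futures)


# Stand-in commands; extension names here are placeholders.
jobs = {
    "ext-a": [sys.executable, "-c", "print('reviewed')"],
    "ext-b": [sys.executable, "-c", "import sys; sys.exit(2)"],
    "ext-c": [sys.executable, "-c", "import time; time.sleep(60)"],
}
statuses = run_all(jobs, parallel=3, timeout_s=3)
print(statuses)
```

The pool size plays the role of --parallel N, and the default `timeout_s=600` mirrors the 10-minute limit; the demo shortens it so the slow job times out quickly.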
Step 5: Synthesize Regression Report
Produce field-tests/reports/<date>-regression.md with:
- Summary table: Extension name × PASS/FAIL/WARN/SKIP with deltas from baseline
- Trend data: From field-tests/history.jsonl, FP count on approved extensions over last N runs
- New unannotated findings grouped by rule ID (cross-extension patterns)
- Resolved findings (things that got fixed)
- High-priority FP candidates: Rules that fire as FP on 2+ approved extensions
- Gaps: Findings ego-review caught that ego-lint missed (only if --review)
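The "high-priority FP candidates" rule (FP on 2+ approved extensions) can be sketched as a small aggregation. This is an illustration only: the function name `fp_candidates`, the in-memory input shape, the placeholder extension names, and the reading of a rule ID as the part of a finding ID before `::` are all assumptions, the last one inferred from the annotation examples in this document.

```python
from collections import defaultdict


def fp_candidates(annotations, approved, min_extensions=2):
    """Return rules classified `fp` on at least `min_extensions` approved extensions."""
    hits = defaultdict(set)
    for ext, findings in annotations.items():
        if ext not in approved:
            continue  # Only approved extensions count toward the threshold.
        for finding in findings:
            if finding["classification"] == "fp":
                # Assumed convention: rule ID is the text before "::".
                rule_id = finding["id"].split("::", 1)[0]
                hits[rule_id].add(ext)
    return {
        rule: sorted(exts)
        for rule, exts in sorted(hits.items())
        if len(exts) >= min_extensions
    }


# Placeholder extensions; "ext-d" is treated as not approved.
annotations = {
    "ext-a": [
        {"id": "init/shell-modification::constructor", "classification": "fp"},
        {"id": "R-SEC-22::dconf CLI spawn", "classification": "tp"},
    ],
    "ext-b": [{"id": "init/shell-modification::constructor", "classification": "fp"}],
    "ext-c": [{"id": "R-SEC-22::dconf CLI spawn", "classification": "fp"}],
    "ext-d": [{"id": "R-SEC-22::dconf CLI spawn", "classification": "fp"}],
}
candidates = fp_candidates(annotations, approved={"ext-a", "ext-b", "ext-c"})
print(candidates)
```

Collecting extensions into a set per rule means several FPs of the same rule inside one extension still count once.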
Step 6: Update field-tests/README.md
Update the "Latest Lint Results" and "Annotation Coverage" tables in field-tests/README.md to reflect the current run:
- Latest Lint Results: Replace the ego-lint version, per-extension PASS/FAIL/WARN/SKIP counts, and totals row with values from the latest history.jsonl entries
- Annotation Coverage: Update TP/FP/borderline/expected/classified/unannotated counts from annotations/*.yaml and the diff output
Keep the "Extension Catalog" and "Code Metrics" sections unchanged unless a new extension was added to the manifest.
Step 7: Issue Creation (if FPs confirmed)
For new false positives on EGO-approved extensions that are confirmed FP (not borderline):
Create a GitHub issue:
- Label: false-positive
- Title: False positive: R-XXXX-NN on <extension>
- Body: Rule ID, file:line, why it's FP, which other extensions are affected, suggested fix
Step 8: Update Baselines (if --update-baselines)
bash scripts/field-test-runner.sh --update-baselines --no-fetch
File Layout
field-tests/
├── manifest.yaml # Extension sources (committed)
├── cache/ # Downloaded extensions (gitignored)
├── baselines/ # Golden JSON snapshots (committed)
├── annotations/ # Finding classifications (committed)
├── results/ # Timestamped run output (gitignored)
├── history.jsonl # Trend data (committed)
└── reports/ # Regression reports (committed)
scripts/
├── field-test-runner.sh # Bash orchestrator
├── parse-manifest.py # Manifest YAML → JSON
├── parse-lint-results.py # ego-lint output → JSON
├── diff-baselines.py # Baseline comparison
├── review-prompt.md # Review prompt template ({{PLACEHOLDER}} tokens)
└── hydrate-review-prompt.py # Template hydrator (lint JSON + diff + annotations)
Annotation Format
findings:
  - id: "R-SEC-22::dconf CLI spawn"
    classification: tp
    notes: "dconf import/export - legitimate but needs disclosure"
  - id: "init/shell-modification::constructor"
    classification: fp
    notes: "Constructor called from enable(). Fixed in PR #21"
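A quick consistency check over annotation entries can catch typos in the classification field before they skew the coverage counts. The sketch below works on already-parsed entries (plain dicts) to stay within the standard library; the names `VALID` and `summarize` are hypothetical, and the allowed labels are the four from the classification guide.

```python
from collections import Counter

# Labels taken from the classification guide in this document.
VALID = {"tp", "fp", "borderline", "expected"}


def summarize(findings):
    """Count entries per classification, rejecting unknown labels."""
    counts = Counter()
    for entry in findings:
        label = entry["classification"]
        if label not in VALID:
            raise ValueError(f"unknown classification {label!r} for {entry['id']}")
        counts[label] += 1
    return counts


# The two entries from the format example, as parsed data.
findings = [
    {"id": "R-SEC-22::dconf CLI spawn", "classification": "tp"},
    {"id": "init/shell-modification::constructor", "classification": "fp"},
]
counts = summarize(findings)
print(dict(counts))
```

Failing loudly on an unknown label is a design choice for this sketch; a reporting tool might prefer to collect such entries under a separate bucket instead.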
Iteration Cycle
- Make a code change (guard pattern, threshold tweak, new rule)
- Run /ego-field-test to see immediate impact across all extensions
- Classify new unannotated findings
- Update field-tests/README.md results and annotation tables
- If FPs found, create issues and fix them
- Run /ego-field-test --update-baselines to snapshot improved state
- Repeat