name: benchmark-regression-triage
description: Use this skill whenever VibeGuard hook latency, benchmark-action output, GitHub Actions bench-output artifacts, or a suspected performance regression needs investigation. It preserves the PR-to-CI-artifact workflow for comparing recent PRs, older anchors such as PR #350, and current main without mixing local machine noise into CI trend evidence.
Benchmark Regression Triage
Overview
This is a repo-local maintainer skill for VibeGuard contributors. It is intentionally stored under .claude/skills/ so maintainers can use it while working in this repository, but it is not an installable product skill and must not be added to schemas/install-modules.json unless it is promoted into skills/ or workflows/ with a user-facing support contract.
This skill diagnoses VibeGuard hook latency regressions by comparing GitHub Actions bench-output artifacts across PR runs, merge runs, and mainline runs. It is meant for non-obvious cases where the current benchmark is under budget but slower than a previous design, such as a post-write-guard path losing its post-write-fast-check fast path after a runtime migration.
Treat CI artifacts as the trend source. Local benchmarks are useful for reproduction after a hypothesis exists, but they are not comparable to GitHub runner history because machine load, shell startup, cache state, and --runs count can dominate P95.
When to Activate
- A user asks why VibeGuard benchmark numbers are slower than before.
- A PR appears to change hook latency, benchmark output, or benchmark-action reporting.
- You need to compare recent PRs with older anchors such as PR #350 or a known fast-path implementation.
- A hook is still below the absolute latency budget but may have lost a faster design.
- You need a reusable workflow for downloading and comparing GitHub Actions bench-output artifacts.
Inputs
Collect these before drawing conclusions:
- Repository full name from gh repo view.
- Candidate PR numbers or commits, including one recent run and one older anchor.
- GitHub Actions run IDs tied to the exact PR head, merge commit, or main commit.
- bench-output.json from each run, downloaded into separate directories.
- The relevant budget contract from docs/reference/hook-latency-contract.md.
Workflow
1. Search Existing Context
Search before adding a new hypothesis or artifact path:
rg -n "benchmark|bench-output|hook latency|post-write-fast|post-write-fast-check|post-write-guard|github-action-benchmark" \
.github docs tests hooks scripts skills workflows README.md CHANGELOG.md
Open the latency contract and CI workflow:
sed -n '1,120p' docs/reference/hook-latency-contract.md
sed -n '270,420p' .github/workflows/ci.yml
Confirm whether the benchmark artifact is uploaded from Linux only. In this repo, the Upload benchmark results step is guarded by runner.os == 'Linux', so artifact comparisons should use the Linux bench-output artifact unless the workflow changes.
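One way to check that guard without scrolling through the workflow is to print the lines around the upload step. This is a minimal sketch: show_upload_guard is a hypothetical helper name, and the context window of two lines before and six after is an assumption about how the step is laid out.

```shell
# show_upload_guard [WORKFLOW_FILE]
# Hypothetical helper: prints the "Upload benchmark results" step with a
# little context so the `if:` condition on the step is visible.
# Defaults to the CI workflow path named in this document.
show_upload_guard() {
  grep -n -B2 -A6 'Upload benchmark results' "${1:-.github/workflows/ci.yml}"
}
```

If the printed context does not include an if: line, widen the window or open the file directly before assuming the artifact is Linux-only.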
2. Identify PR and Mainline Runs
Start with metadata, not assumptions:
REPO="$(gh repo view --json owner,name --jq '.owner.login + "/" + .name')"
gh pr view 417 --repo "$REPO" \
--json number,title,state,mergedAt,headRefName,baseRefName,mergeCommit,url
gh pr view 350 --repo "$REPO" \
--json number,title,state,mergedAt,headRefName,baseRefName,mergeCommit,url
gh run list --repo "$REPO" --branch main --limit 30 \
--json databaseId,workflowName,event,displayTitle,headSha,createdAt,status,conclusion,url
For a PR head branch, resolve the branch and list its runs:
PR=417
HEAD_BRANCH="$(gh pr view "$PR" --repo "$REPO" --json headRefName --jq '.headRefName')"
gh run list --repo "$REPO" --branch "$HEAD_BRANCH" --limit 20 \
--json databaseId,workflowName,event,displayTitle,headSha,createdAt,status,conclusion,url
Record whether each run is a pull_request run, a merge-to-main run, or a later mainline run. Do not compare an unrelated later main commit against a PR head without stating the extra changes in between.
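The labeling can be made mechanical with a small helper. The sketch below is an assumption, not part of the repository: classify_run and its output labels are hypothetical, and it assumes the mainline branch is named main. It only uses fields that gh already returns (event, headBranch, and headSha from gh run list, and mergeCommit.oid from gh pr view).

```shell
# classify_run EVENT HEAD_BRANCH HEAD_SHA MERGE_SHA
# Hypothetical helper that labels one workflow run.
#   EVENT, HEAD_BRANCH, HEAD_SHA: fields from `gh run list --json event,headBranch,headSha`
#   MERGE_SHA: `gh pr view "$PR" --repo "$REPO" --json mergeCommit --jq '.mergeCommit.oid'`
classify_run() {
  local event="$1" head_branch="$2" head_sha="$3" merge_sha="$4"
  if [ "$event" = "pull_request" ]; then
    echo "pull_request"
  elif [ "$head_branch" = "main" ] && [ -n "$merge_sha" ] && [ "$head_sha" = "$merge_sha" ]; then
    echo "merge-to-main"
  elif [ "$head_branch" = "main" ]; then
    # Some other main commit; compare createdAt to tell earlier from later.
    echo "mainline"
  else
    echo "other"
  fi
}
```

A run labeled mainline still needs its createdAt and the intervening commits stated before it is compared against a PR head.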
3. Download Benchmark Artifacts
For each chosen run, confirm the artifact exists:
RUN_ID=27085248846
gh api "repos/$REPO/actions/runs/$RUN_ID/artifacts" \
--jq '.artifacts[] | select(.name == "bench-output") | [.id, .name, .expired, .size_in_bytes] | @tsv'
Download into a per-run directory so files do not overwrite each other:
LABEL="pr-417-head"
OUT="/tmp/vg-bench/$LABEL"
rm -rf "$OUT"
mkdir -p "$OUT"
gh run download "$RUN_ID" --repo "$REPO" -n bench-output -D "$OUT"
test -f "$OUT/bench-output.json"
If an old PR artifact has expired, do not present reconstructed local numbers as historical CI truth. Use the old run logs or the benchmark-action report if either is available. Otherwise, label the old comparison as a code-path reconstruction and state the evidence gap.
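The expiry decision can be read straight from the TSV row printed by the artifact listing above (id, name, expired, size). A minimal sketch, assuming that column order; artifact_status is a hypothetical helper name.

```shell
# artifact_status
# Hypothetical helper: reads TSV rows (id, name, expired, size_in_bytes)
# on stdin and prints one of: available, expired, missing.
artifact_status() {
  awk -F '\t' '
    $2 == "bench-output" { found = 1; if ($3 == "false") live = 1 }
    END {
      if (!found)    print "missing"
      else if (live) print "available"
      else           print "expired"
    }
  '
}
```

Pipe the gh api command into artifact_status and only attempt the download when it prints available; the other two results belong in the gaps section of the report.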
4. Extract Comparable P95 Values
Extract only P95 rows when checking the latency contract:
jq -r '
.[]
| select(.name | endswith("(P95)"))
| [.name, .value]
| @tsv
' "$OUT/bench-output.json"
For side-by-side comparisons, keep run IDs and labels visible:
for file in /tmp/vg-bench/*/bench-output.json; do
label="$(basename "$(dirname "$file")")"
jq -r --arg label "$label" '
.[]
| select(.name | endswith("(P95)"))
| [$label, .name, .value]
| @tsv
' "$file"
done
Match fixture names and metrics exactly. Compare post-write-guard (100) (P95) to the same fixture in another run, not to post-write-guard (5000) or a P50/P99 row.
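To enforce same-fixture comparison, the label/name/value rows from the side-by-side loop can be joined on the exact fixture name. This is a sketch under that assumption about the input shape; compare_labels is a hypothetical helper, and the percentage column is added for convenience.

```shell
# compare_labels OLD_LABEL NEW_LABEL
# Hypothetical helper: reads TSV rows (label, fixture name, value) on stdin
# and prints fixture, old value, new value, and percent change.
# Fixtures missing from either label are skipped, never guessed.
compare_labels() {
  awk -F '\t' -v old="$1" -v new="$2" '
    $1 == old { o[$2] = $3 }
    $1 == new { n[$2] = $3 }
    END {
      for (name in o) {
        if (!(name in n)) continue   # fixture absent from the newer run
        if (o[name] == 0) continue   # avoid dividing by zero
        printf "%s\t%s\t%s\t%+.1f%%\n", name, o[name], n[name],
               (n[name] - o[name]) / o[name] * 100
      }
    }
  ' | sort
}
```

Pipe the loop output into it with two directory labels, for example compare_labels pr-350-anchor pr-417-head, where pr-350-anchor is an illustrative label and not one defined elsewhere in this document. A fixture that appears under only one label produces no row, which is itself worth recording as a gap.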
5. Compare Against Budgets and Prior Design
Use the contract to decide whether CI is failing:
bash tests/bench_hook_latency.sh --runs=3 --fail-on-regression
Use artifact deltas to decide whether a design got slower:
- Under budget means acceptable by the current CI gate.
- Slower than a known fast path can still be a real regression worth fixing.
- A one-run delta needs code evidence before claiming root cause.
For the post-write-guard fast-path incident, the useful comparison was:
- Recent PR and mainline artifacts showed post-write-guard P95 stayed under budget.
- Older code around PR #350 had a post-write-fast-check front path that allowed clean writes to exit before full duplicate and quality scans.
- Later runtime consolidation preserved correctness but the wrapper path no longer called the fast check first, so clean writes paid the full post-write-check cost.
6. Locate the Root Cause
Do not stop at a chart. Tie the numbers to a code path:
rg -n "post-write-fast-check|post-write-check|post-write-guard|FALLBACK|NEEDS_FULL_CHECK" hooks tests vibeguard-runtime scripts
git log --oneline --decorate -- hooks/post-write-guard.sh hooks/post-write-fast-check.sh tests/bench_hook_latency.sh
git show <old-commit>:hooks/post-write-guard.sh | sed -n '1,220p'
git show <new-commit>:hooks/post-write-guard.sh | sed -n '1,220p'
Classify the finding carefully:
- budget failure: the benchmark gate exceeds the contract.
- design regression: the benchmark is still green, but a previously cheap safe path became more expensive.
- measurement gap: old artifacts expired or the compared runs are not equivalent.
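As a rough first pass, the three classes can be derived from an old P95, a new P95, and the budget for one fixture. The helper below is a sketch only: classify_finding is a hypothetical name, and the 10 percent default noise threshold is an assumption, not a value from the latency contract. It flags a design regression as a candidate because timing alone does not establish root cause.

```shell
# classify_finding OLD_P95 NEW_P95 BUDGET [MIN_PCT]
# Hypothetical helper. MIN_PCT (default 10) is an assumed noise threshold,
# not a number taken from docs/reference/hook-latency-contract.md.
classify_finding() {
  local old="$1" new="$2" budget="$3" min_pct="${4:-10}"
  if [ -z "$old" ] || [ -z "$new" ]; then
    # A missing side means the artifact expired or the runs do not match.
    echo "measurement gap"
    return
  fi
  awk -v old="$old" -v new="$new" -v budget="$budget" -v min_pct="$min_pct" 'BEGIN {
    if (new > budget)                         print "budget failure"
    else if (new > old * (1 + min_pct / 100)) print "design regression candidate"
    else                                      print "no regression indicated"
  }'
}
```

Treat the output as a prompt for the code-path comparison above, not as the final classification in the report.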
7. Verify a Fix or Recommendation
If code is changed, run the focused checks for the touched surface:
bash tests/test_hook_perf_contract.sh
bash tests/bench_hook_latency.sh --runs=3 --fail-on-regression
bash scripts/ci/validate-hook-perf.sh
git diff --check
When the change touches Rust runtime code, add:
cargo check --manifest-path vibeguard-runtime/Cargo.toml
cargo test --manifest-path vibeguard-runtime/Cargo.toml
When the work is only a skill or document, run skill format validation instead:
python3 scripts/skill_validate.py --format-only --proposed-skill path/to/SKILL.md
Report Format
Use this compact report shape:
facts:
- repo:
- runs compared:
- artifact paths:
comparison:
| fixture | old anchor | recent PR | merge/main | budget |
|---|---:|---:|---:|---:|
root_cause:
- classification:
- evidence:
- affected path:
pitfalls_checked:
- same fixture and metric:
- CI artifact vs local benchmark separated:
- run IDs tied to commits:
verification:
- commands:
- result:
gaps:
- expired artifacts or non-equivalent runs:
Red Flags
- Comparing local benchmark output to CI artifact history as if they came from the same environment.
- Reporting a benchmark regression without the run ID, commit SHA, event type, and downloaded artifact path.
- Treating a green budget result as proof that no design regression happened.
- Comparing different fixtures, such as post-write-guard (100) against post-write-guard (5000).
- Guessing root cause from timing alone without checking the hook wrapper, runtime path, or relevant git history.
Checklist
- Search existing benchmark docs, CI workflow, tests, and hook paths before adding a new hypothesis.
- Download bench-output artifacts into separate per-run directories.
- Compare the same P95 fixture names across PR, merge, and mainline runs.
- Tie every run to a PR number, commit SHA, event type, and Actions URL.
- Separate CI trend evidence from local reproduction measurements.
- Check the code path and git history before naming root cause.
- Run focused verification or clearly state why validation is unavailable.
Boundaries
This skill does not change benchmark budgets by itself. Budget changes need separate justification in docs/reference/hook-latency-contract.md, CI updates, and focused tests.
This skill does not authorize editing hooks, runtime code, or GC scripts. Use it to investigate and report. Implementation requires a separate explicit task with file ownership and verification.