name: bencher
description: Maintain and troubleshoot Invowk's Bencher benchmark infrastructure. Use when working on Bencher, BMF benchmark output, benchmark GitHub Actions, PGO benchstat comparisons, dedicated/bare-metal runner handoff, benchmark image packaging, Bencher thresholds, "No thresholds found" dashboard warnings, benchmark alerts, or files such as .github/workflows/benchmarks*.yml, .github/workflows/pgo-benchstat.yml, scripts/bench-bmf.mjs, scripts/bencher-threshold-args.sh, scripts/bencher-registry-login.sh, and build/bencher/Dockerfile.
Bencher
Overview
Use this skill to keep Invowk's Bencher integration coherent: GitHub Actions packages the benchmark image, Bencher runs the measurement job on the configured Spec/Testbed, and BMF output plus thresholds determine the dashboard, alerts, and PR status.
Source Files
.github/workflows/benchmarks.yml: base-repo PR andmainbenchmark flow..github/workflows/benchmarks-upload.yml: trusted workflow for fork PR benchmark upload..github/workflows/pgo-benchstat.yml: scheduled/manual GitHub-hosted comparison of-pgo=offversus default PGO usingbenchstat; this is not a Bencher upload path, but shares benchmark workflow hygiene..github/workflows/release.yml: release-tag performance tracking against the previous tag.build/bencher/Dockerfile: reproducible benchmark image, Go/Node versions, vendoring, warmup.scripts/bench-bmf.mjs: emits BMF JSON and controls the public tracked suite throughTRACKED_GO_BENCHMARKSandSHORT_BENCH_REGEX.scripts/bencher-threshold-args.sh: defines every threshold model passed tobencher run.scripts/bencher-registry-login.sh: derives the Docker registry user from the Bencher JWTsubclaim.- Benchmark workloads currently come from
internal/benchmark,internal/app/modulesync,internal/app/moduleops, andcmd/invowk.
Runner Model
- Preserve the dedicated Bencher runner design. GitHub Actions should package and push the benchmark image only; it must not measure benchmarks on GitHub-hosted runners.
- Keep
BENCHER_SPECandBENCHER_TESTBEDin sync across both benchmark workflows and release performance tracking. Current values areintel-v1andbencher-intel-v1-go-1-26. - For base-repo PRs, use branch
pr-<number>, hash the PR head SHA, and pass the base branch and base SHA as the start point. - For PR branches, keep
--start-point-clone-thresholdsand--start-point-resetso PR history is anchored to the base branch. - For fork PRs, use the trusted upload workflow: checkout trusted packaging separately from untrusted source, build the image with trusted workflow/scripts, then ask Bencher to run the source image.
- For releases, benchmark the tag against the previous tag as the start point so release history is anchored to shipped versions rather than PR branches.
- PR benchmark alerts fail by default through
--error-on-alert. If a maintainer has deliberately accepted an expected feature cost, add thebenchmarks: accepted-regressionPR label; workflows setBENCHER_ERROR_ON_ALERT=falsefor that PR, so alerts remain visible in Bencher while the Actions job stays advisory. Push and manual benchmark runs are also advisory: they publish main history to Bencher without--error-on-alertand without a Bencher GitHub check when alerts would otherwise fail already-merged history. Release-tag benchmark alerts remain blocking. Do not use advisory mode for benchmark script bugs, missing thresholds, or unexplained regressions.
Threshold Model
- Treat thresholds as linked to
branch + testbed + measure, not to individual benchmarks. - Use measure slugs as the stable identifiers in
--threshold-measure. Bencher may display title-cased names such asLatency,File Size, andBuild Time, while their slugs arelatency,file-size, andbuild-time. - If
scripts/bench-bmf.mjsemits a new measure and alerts are expected, add the matching threshold model inscripts/bencher-threshold-args.shin the same patch. - If
scripts/bench-bmf.mjsadds only new benchmarks using existing measures, no new threshold is needed. - Keep
--thresholds-resetso stale threshold models do not survive after measures are removed or renamed. - Use upper-bound-only thresholds for regressions. Lower values are improvements for latency, memory, allocations, build time, and file size.
Current emitted measures and threshold defaults:
| Measure slug | Display name | Threshold |
|---|---|---|
latency |
Latency |
percentage, min sample 5, max sample 64, upper 0.10 |
memory |
memory |
percentage, min sample 5, max sample 64, upper 0.10 |
allocations |
allocations |
percentage, min sample 5, max sample 64, upper 0.10 |
build-time |
Build Time |
percentage, min sample 5, max sample 32, upper 0.20 |
file-size |
File Size |
percentage, min sample 2, max sample 16, upper 0.10 |
Diagnostics
When the dashboard says No thresholds found, check the report JSON first. A successful bencher run can still have one unthresholded measure.
project=ddfe58db-e86d-49b8-a6c7-60fc46eabf0b
branch=pr-<number>
testbed=bencher-intel-v1-go-1-26
api=https://api.bencher.dev/v0
curl -fsS "$api/projects/$project/reports?branch=$branch&testbed=$testbed&sort=date_time&direction=desc&per_page=1" |
jq -r '.[0].results[][] as $bench |
$bench.measures[] |
select(.threshold == null) |
[$bench.benchmark.name, .measure.slug, .measure.name, .metric.value] |
@tsv'
Summarize threshold coverage by measure:
curl -fsS "$api/projects/$project/reports?branch=$branch&testbed=$testbed&sort=date_time&direction=desc&per_page=1" |
jq -r '[.[0].results[][] as $bench |
$bench.measures[] |
{measure:.measure.slug, has_threshold:(.threshold != null)}] |
group_by(.measure)[] |
[.[0].measure, length, (map(select(.has_threshold)) | length), (map(select(.has_threshold | not)) | length)] |
@tsv'
List the active threshold models:
curl -fsS "$api/projects/$project/thresholds?branch=$branch&testbed=$testbed" |
jq -r '.[] |
[.measure.slug, .measure.name, .model.test, .model.upper_boundary, .model.min_sample_size, .model.max_sample_size] |
@tsv' |
sort
Inspect Actions logs for the actual bencher run request and report URL:
log_file=$(mktemp)
gh run view <run-id> --repo invowk/invowk --log > "$log_file"
rg -n 'Bencher New Report|thresholds|View report|No thresholds|Job status|BENCHER_|--threshold|--start-point' "$log_file"
Known Pitfalls
latencyversusLatencyis not a threshold bug by itself. A prior threshold incident had sluglatencycorrectly matching displayedLatency; the warning came from missingbuild-time.boundary.baseline: nullusually means the threshold exists but Bencher does not yet have enough samples for that model. This is different fromthreshold: null.- A Bencher JWT or registry failure often looks like broken benchmark code. Check
BENCHER_API_TOKEN, the registry login script, and direct API access before rewriting benchmark logic. benchmarks: accepted-regressionis not the only advisory-alert path.scripts/bencher-threshold-args.shalso drops--error-on-alertwhen the start-point branch has no latest benchmark result, because Bencher cannot compare against an absent baseline.- Context7 may resolve
Bencherto unrelated benchmark projects. When that happens, use official Bencher docs plus direct API, CLI output, and report JSON as the source of truth:https://bencher.dev/docs/explanation/bencher-run/https://bencher.dev/docs/explanation/thresholds/https://bencher.dev/docs/api/projects/reports/
- Do not treat
SHORT_BENCH_REGEXas internal cleanup. It controls what becomes long-lived public benchmark history. - In
pgo-benchstat.yml, keepBENCH_*settings available to every step that references them. Step-levelenv:values do not carry into the later report step; promote shared benchmark variables to job-levelenv:or duplicate them on each step that uses them. - After renaming a Go benchmark function, update both
TRACKED_GO_BENCHMARKSandbenchmarkNameMap. Keep the public benchmark slug stable when the workload is the same behavior under a renamed implementation, so Bencher history and thresholds continue across code renames. - Large feature branches can trigger broad, legitimate alerts across latency, memory, allocations, and
binary/bin-invowkfile size. First verify the BMF script still emits the intended tracked suite and every measure has thresholds. If the regression is an intentional tradeoff, use the explicitbenchmarks: accepted-regressionlabel instead of weakening thresholds or deleting--error-on-alertglobally.
Verification
After Bencher infrastructure changes:
- Recheck Bencher constants:
rg -n 'BENCHER_(PROJECT|SPEC|TESTBED)' .github/workflows/benchmarks.yml .github/workflows/benchmarks-upload.yml .github/workflows/release.yml. - Run focused script checks:
bash -n scripts/bencher-threshold-args.sh,node scripts/test_bench_bmf.mjs, andbash scripts/test_bencher_registry_login.sh. - Run
make test-scriptsfor broad script coverage. - Run
make check-agent-docsif.agents/skills/bencher/SKILL.mdchanged. - Push and watch the
Benchmarks / Track Benchmarkscheck. - Verify the latest Bencher report has zero
threshold: nullmeasures if the task involved thresholds. - Check PR status with
gh pr view <number> --json mergeStateStatus,mergeable,statusCheckRollup.