name: rota-bench-regression-analysis
description: Analyze recent GraalPy benchmark regressions on master as part of the weekly rota. Use when asked to analyze benchmarks for rota.
Bench Regression Analysis
Use This Skill For
- Recent regression summaries from scripts/compare_bench_regressions.py --rota.
- Follow-up inspection of unattributed plausible change points.
Core Workflow
- Run:
scripts/compare_bench_regressions.py --rota --json-out /tmp/compare_bench_regressions_rota.json
- Use the text output for the human summary and the JSON for precise inspection.
- Focus on plausible regressions. Ignore flaky and inconclusive items unless they help explain a plausible shift.
- Split the summary into: Attributed, To bisect, To watch.
- Show the current summary.
- Execute the bisect script for each "to bisect" entry in parallel, then wait for all of them to finish.
The builds can take many hours without the script showing any output, so make sure you wait for them with a long timeout.
If running in codex: round-robin poll the processes with write_stdin and a 1 hour timeout (the configuration might cap this at a lower timeout in practice).
- Collect the bisect results and move any benchmarks that the bisections attributed from "To bisect" to "Attributed".
- Show the final summary. Note any failed bisects.
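The parallel bisect step above could be driven roughly as follows. This is a minimal sketch, not part of the rota scripts: `run_bisects`, `MAX_BISECT_JOBS`, and the 24 hour default timeout are assumptions made for illustration, and the demo commands are stand-ins because the real arguments of the bisect script are not spelled out here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_BISECT_JOBS = 5  # mirrors the guardrail on the number of bisect jobs


def run_bisects(commands, timeout_s=24 * 3600):
    """Run each command (an argv list) in parallel and wait for all of them.

    Returns one entry per command, in order: the exit code, or None if the
    command hit the timeout. The 24 hour default is an assumed value.
    """
    if len(commands) > MAX_BISECT_JOBS:
        raise ValueError(f"refusing to start more than {MAX_BISECT_JOBS} bisects")
    if not commands:
        return []

    def run_one(argv):
        try:
            done = subprocess.run(argv, capture_output=True, text=True,
                                  timeout=timeout_s)
            return done.returncode
        except subprocess.TimeoutExpired:
            return None

    # One worker per job: the processes are long-running and mostly idle.
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run_one, commands))


# Stand-in commands; a real run would pass the bisect invocations instead.
demo_commands = [
    [sys.executable, "-c", "import sys; sys.exit(0)"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
]
demo_results = run_bisects(demo_commands)
```

A non-zero or `None` entry in the result marks a bisect to list as failed in the final summary.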
Useful JSON Queries
jq '.direct_suspects[] | {good_commit, bad_commit, bad_author_email, bad_subject}' \
/tmp/compare_bench_regressions_rota.json
jq '[.change_points[] | select(.classification == "plausible")]' \
/tmp/compare_bench_regressions_rota.json
Attributed Regressions
- Start from direct_suspects in /tmp/compare_bench_regressions_rota.json.
- For each suspect, keep the abbreviated bad commit ID, full author email, full commit subject, and the worst example benchmarks per suite, not the full list.
- Prefer one worst example per affected suite such as micro, meso, macro...
Unattributed Regressions
- Start from plausible change_points whose (good_commit, bad_commit] pair is not already covered by direct_suspects.
- Inspect the range with:
git log --first-parent --reverse --format='%H%x09%ae%x09%s' GOOD..BAD
git show --stat --summary --format=fuller BAD
git diff --stat GOOD..BAD
- If needed, inspect individual commits in the range with git show --stat --summary --format=fuller COMMIT.
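The starting selection for this section can be sketched in Python. This assumes each `change_points` entry carries `good_commit` and `bad_commit` keys like the `direct_suspects` entries do, and it treats "covered" as an exact match of the commit pair; both are assumptions about the JSON layout, and `uncovered_plausible` is a hypothetical helper name. The sample data is invented for the example.

```python
def uncovered_plausible(report):
    """Return plausible change points whose commit pair no direct suspect covers."""
    covered = {
        (s["good_commit"], s["bad_commit"])
        for s in report.get("direct_suspects", [])
    }
    return [
        cp for cp in report.get("change_points", [])
        if cp.get("classification") == "plausible"
        and (cp.get("good_commit"), cp.get("bad_commit")) not in covered
    ]


# Invented sample; a real run would json.load() the file written by --json-out.
sample_report = {
    "direct_suspects": [{"good_commit": "aaa", "bad_commit": "bbb"}],
    "change_points": [
        {"good_commit": "aaa", "bad_commit": "bbb", "classification": "plausible"},
        {"good_commit": "ccc", "bad_commit": "ddd", "classification": "plausible"},
        {"good_commit": "eee", "bad_commit": "fff", "classification": "flaky"},
    ],
}
remaining = uncovered_plausible(sample_report)
```

Here `remaining` holds only the `ccc`..`ddd` entry, which is the range to inspect with the git commands above.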
Attribution Rules
- If the change point is an exact single-parent GraalPy commit and mx.graalpython/suite.py imports did not change, it can usually be attributed to that commit.
- Changes to imports in mx.graalpython/suite.py can never be confidently attributed without bisecting Graal. Keep those unattributed and say so explicitly, including the Graal commit range.
- If an unattributed first-parent range contains one plausible GraalPy code change and the rest are documentation, tests, retags, or other non-performance changes, attribute it to that one code change.
- If the series is already shifted by an earlier attributed commit and a later unattributed range only preserves the new level, fold the later item into the earlier attribution.
- Cross-configuration correlation matters.
- If native shows an exact jump on one commit and jvm later shows the same benchmark shifted upward through a range containing that commit, treat them as likely the same cause unless the later range has a better candidate.
- If both jvm and native jump at or immediately after a Graal import update, keep both under the same unattributed Graal-side cause.
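The first three rules above can be read as a small decision procedure. The sketch below is one possible encoding, not the logic of the rota scripts: `attribution_hint` and the `perf_relevant` flag are hypothetical, the flag has to be set by whoever inspected the range, and the single-parent check, the folding rule, and the cross-configuration rules are left out.

```python
def attribution_hint(commits, suite_imports_changed):
    """Suggest how to treat one (good_commit, bad_commit] range.

    commits: first-parent commits in the range, each a dict with the
    hypothetical keys 'sha' and 'perf_relevant' (False for documentation,
    tests, retags, and other non-performance changes).
    Returns a (decision, sha_or_None) pair.
    """
    if suite_imports_changed:
        # Graal import updates cannot be attributed without bisecting Graal.
        return ("unattributed", None)
    if len(commits) == 1:
        # Exact change point on a single GraalPy commit.
        return ("attributed", commits[0]["sha"])
    candidates = [c["sha"] for c in commits if c["perf_relevant"]]
    if len(candidates) == 1:
        # One plausible code change, the rest are non-performance changes.
        return ("attributed", candidates[0])
    return ("to bisect", None)


example_range = [
    {"sha": "1111", "perf_relevant": False},
    {"sha": "2222", "perf_relevant": True},
    {"sha": "3333", "perf_relevant": False},
]
hint = attribution_hint(example_range, suite_imports_changed=False)
```

For `example_range` the hint is `("attributed", "2222")`; the same range with a changed import list stays unattributed.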
Flakiness Check
- Use Bench Server data when the unattributed item is small or suspicious.
- Query the benchmark series with bench-cli run - and check whether the change is a clean step up that stays high, a one-point spike that immediately falls back, or already present before the reported range.
- A stable step change is a real regression candidate.
- An isolated last-point bump with no supporting related regressions is usually watch-and-rerun material, not a strong attribution.
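The three shapes described above can be told apart with a simple heuristic once the series has been reduced to one average per commit. The following is a rough sketch under stated assumptions: values are in chronological order, higher means slower (a time metric), and the 5% threshold, the minimum of four earlier points, and the `classify_shift` name are all choices made for the example.

```python
from statistics import median


def classify_shift(values, bad_index, threshold=0.05):
    """Classify the series shape around the first point of the reported range.

    values: chronological per-commit averages, higher meaning slower.
    bad_index: index of the first point at or after the reported bad commit.
    """
    before, after = values[:bad_index], values[bad_index:]
    if len(before) < 4 or not after:
        return "unclear"
    half = len(before) // 2
    # Baseline from the older half, so an earlier shift is not hidden.
    limit = median(before[:half]) * (1 + threshold)
    if median(before[half:]) > limit:
        return "pre-existing"
    high = [v > limit for v in after]
    if len(after) == 1:
        return "last-point bump" if high[0] else "unclear"
    if all(high):
        return "step"
    if high[0] and not any(high[1:]):
        return "spike"
    return "unclear"


step_series = [100, 101, 99, 100, 120, 121, 119]
spike_series = [100, 101, 99, 100, 120, 100, 101]
early_series = [100, 100, 120, 121, 120, 122]
bump_series = [100, 101, 99, 100, 120]
```

With `bad_index=4`, these four series come out as "step", "spike", "pre-existing", and "last-point bump" respectively. Only the first is a regression candidate on its own; the last is watch-and-rerun material.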
Bench Server Checks
- Prefer querying only the specific benchmark and configuration under investigation.
- Typical filters: branch = master, target benchmark, host-vm = graalvm-ee, target host-vm-config, guest-vm = graalpython, target guest-vm-config, metric.name = time, commit.committer-ts last-n 30d.
- Reduce output to commit.rev and average metric value so the step pattern is easy to inspect.
- bench-cli sometimes fails with 404 when the server is overloaded. If that happens, wait for a minute and try again.
Output Contract
- List findings first, not process notes.
- Keep three top-level sections (if not empty): Attributed, To bisect, and To watch.
- In the attributed section, use this header format:
abcd1234efgh | author@oracle.com | Full subject
- Unattributed changes that look plausible go to "to bisect"; flaky ones go to "to watch".
- In the "to bisect" section, add an invocation (don't execute yet) of scripts/bisect_benchmark_regression.py that can bisect it (use unabbreviated commits in this case).
- In the "to watch" section, say whether the item looks flaky or likely has the same cause as another attributed item.
- Do not abbreviate commit subjects.
- Keep author emails.
- Abbreviate commit IDs to 12 characters.
- Do not list every benchmark if there are many; only the worst examples from each affected suite. If you didn't list all, say "and X others".
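The formatting rules for an attributed entry can be illustrated as follows. The suspect keys match the jq query earlier in this document; the benchmark records with `suite`, `name`, and `slowdown` keys, the helper names, and the sample values are hypothetical and only serve the example.

```python
def worst_per_suite(benchmarks):
    """Pick the worst benchmark per suite; return (shown, number_omitted)."""
    worst = {}
    for b in benchmarks:
        current = worst.get(b["suite"])
        if current is None or b["slowdown"] > current["slowdown"]:
            worst[b["suite"]] = b
    shown = [worst[suite] for suite in sorted(worst)]
    return shown, len(benchmarks) - len(shown)


def format_attributed(suspect, benchmarks):
    """Render one attributed entry: header line, worst examples, remainder count."""
    lines = [
        # Commit ID cut to 12 characters; email and subject kept in full.
        f"{suspect['bad_commit'][:12]} | {suspect['bad_author_email']} | "
        f"{suspect['bad_subject']}"
    ]
    shown, omitted = worst_per_suite(benchmarks)
    lines += [f"- {b['suite']}: {b['name']} ({b['slowdown']:+.1%})" for b in shown]
    if omitted:
        lines.append(f"and {omitted} others")
    return "\n".join(lines)


entry = format_attributed(
    {
        "bad_commit": "0123456789abcdef0123456789abcdef01234567",
        "bad_author_email": "author@oracle.com",
        "bad_subject": "Full subject",
    },
    [
        {"suite": "micro", "name": "bench-a", "slowdown": 0.10},
        {"suite": "micro", "name": "bench-b", "slowdown": 0.25},
        {"suite": "meso", "name": "bench-c", "slowdown": 0.05},
    ],
)
```

The resulting `entry` starts with `0123456789ab | author@oracle.com | Full subject`, lists `bench-c` for meso and `bench-b` for micro, and ends with `and 1 others`.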
Guardrails
- If you or the script can't find bench-cli, ask the user to provide it from the bench-server repo.
- You may offer to clone the repo and create the CLI for the user. The repo is on the same Bitbucket as graalpython, the project is INFRA, and the repo is called bench-server.
- Don't submit more than 5 bisect jobs. If there are more in the "to bisect" list, pick 5 that look the most serious and leave the rest as "to bisect".