name: audit-ci description: Audit recent CI runs for anomalies (warnings, uncached docker builds, flaky/slow tests, regressions).
Find the things that are subtly degrading CI -- the anomalies a human won't notice without poring over job logs: warnings, wasteful rebuilds, flaky tests that recovered on retry, slow steps, and failures quietly recurring across many PRs. Identify and report; don't fix. If you fan out with subagents, ensure they follow the guidelines below (they don't inherit this file), and verify load-bearing claims at the source yourself.
Where results live
In .github/workflows/ci.yml, the release tests run only on release-branch pushes, so a normal PR/main audit never sees them.
Test results are not in the workflow jobs. Each test job ends by POSTing a separate check-run (the report-flaky-aware-tests action) named Unit + Integration Tests / Acceptance Tests. They show as in 0s with /runs/<id> URLs (the PR "Checks" format, not /actions/runs/.../job/). Fetch the retry/flaky table:
gh api repos/imbue-ai/mngr/check-runs/<id> --jq '{name,conclusion,summary:.output.summary}'
- Conclusion:
success= clean;neutral= passed only after retries (the flaky signal);failure= hard fail. - Summary has a
| Test | Runs | Final | @flaky |table;Finalreadsflaked N, passed Morpassed. A high run-count withFinal = passedis just offload's by-design reruns, not a flake.
Getting the data
Start cheap: gh run view <run_id> lists jobs + durations and has an ANNOTATIONS section that surfaces warnings and failure messages without pulling logs. Drop to gh run view --job <id> --log only when annotations don't explain something.
What to look for
- Warnings (the ANNOTATIONS section, including on otherwise-passing runs).
- Uncached image rebuilds: offload records each checkpoint commit's base-image ID in a git note (
refs/notes/offload-images); the image itself lives in Modal and is evicted after ~48h, so occasional rebuilds are expected. Flag frequent rebuilds with no triggering change, or a missingcontents: writeperm / failedgit fetch refs/notes/*(loses the notes -> rebuild every run). Grep logs for base-image build /cache miss. - Flaky tests (
neutral/Flaky-recovered > 0): record the test,@flakystatus, run-count, and the failure reason from the## Failurestraceback. - Slow jobs/tests: flag anything egregiously slow for its value, or that bottlenecks the pipeline regardless of value (e.g. a single 5-min test). Investigate every slow job to find what caused the extra time -- break it down per-step (log timestamps) before blaming tests; e.g.
actions/checkoutwithfetch-depth: 0does a full-history, all-branches fetch that can dominate wall-clock. - Failures recurring across PRs: one hard failure is a human's problem, but the same signature on multiple unrelated branches is a systemic issue a human looking at a single PR won't see.
- Coverage:
test-offloadprints a coverage-delivery diagnostic; aMISMATCHline = dropped.coveragedata. - Repeated log noise: the same warning recurring many times within a job (even a passing one) usually signals a misconfigured environment.
- Anything else that looks off: you won't have a rule for every anomaly -- flag whatever seems wrong and dig in. First rule out the known-weird-but-expected (see traps below), e.g. offload running a given test a nondeterministic number of times due to speculative retries and cancellations.
Don't over-claim (common traps)
- Always read the actual failure output before attributing a cause. Hard failures are usually branch-specific gates (ratchet sync, coverage, stale generated CLI docs, snapshot mismatches), not infra; Modal acceptance flakiness is usually absorbed by retries (-> neutral), so it rarely causes the hard failure.
- Cross-branch duration differences are mostly test-phase variance; the cached base build (
45s) and env-prep (60s) are near-constant. Only flag a regression if setup/cache steps grew or the same branch slowed across runs. - Normal Modal host-creation output (not errors/cache-misses):
building from mngr default Dockerfile(cheap COPY layer, ~2s) and<pkg> ... Installing at runtime(onlytest_mngr_create_with_dockerfile_on_modal, not every host). - A single branch failing repeatedly the same way is its own bug; only the same failure across unrelated branches is systemic.
Report
Group by category, most important first. Per finding: what, where (run/check URL + test/job), frequency across sampled runs, and the diagnostic detail (a flake's failure reason; which step made a job slow). Separate infra noise from real regressions, and already-being-fixed (check gh pr list) from open. State how many runs you sampled and over what window.