audit-ci - SKILL.md Agent Skill

name: audit-ci description: Audit recent CI runs for anomalies (warnings, uncached docker builds, flaky/slow tests, regressions).

Find the things that are subtly degrading CI -- the anomalies a human won't notice without poring over job logs: warnings, wasteful rebuilds, flaky tests that recovered on retry, slow steps, and failures quietly recurring across many PRs. Identify and report; don't fix. If you fan out with subagents, ensure they follow the guidelines below (they don't inherit this file), and verify load-bearing claims at the source yourself.

Where results live

In .github/workflows/ci.yml, the release tests run only on release-branch pushes, so a normal PR/main audit never sees them.

Test results are not in the workflow jobs. Each test job ends by POSTing a separate check-run (the report-flaky-aware-tests action) named Unit + Integration Tests / Acceptance Tests. They show as in 0s with /runs/<id> URLs (the PR "Checks" format, not /actions/runs/.../job/). Fetch the retry/flaky table: gh api repos/imbue-ai/mngr/check-runs/<id> --jq '{name,conclusion,summary:.output.summary}'

Conclusion: success = clean; neutral = passed only after retries (the flaky signal); failure = hard fail.
Summary has a | Test | Runs | Final | @flaky | table; Final reads flaked N, passed M or passed. A high run-count with Final = passed is just offload's by-design reruns, not a flake.

Getting the data

Start cheap: gh run view <run_id> lists jobs + durations and has an ANNOTATIONS section that surfaces warnings and failure messages without pulling logs. Drop to gh run view --job <id> --log only when annotations don't explain something.

What to look for

Warnings (the ANNOTATIONS section, including on otherwise-passing runs).
Uncached image rebuilds: offload records each checkpoint commit's base-image ID in a git note (refs/notes/offload-images); the image itself lives in Modal and is evicted after ~48h, so occasional rebuilds are expected. Flag frequent rebuilds with no triggering change, or a missing contents: write perm / failed git fetch refs/notes/* (loses the notes -> rebuild every run). Grep logs for base-image build / cache miss.
Flaky tests (neutral / Flaky-recovered > 0): record the test, @flaky status, run-count, and the failure reason from the ## Failures traceback.
Slow jobs/tests: flag anything egregiously slow for its value, or that bottlenecks the pipeline regardless of value (e.g. a single 5-min test). Investigate every slow job to find what caused the extra time -- break it down per-step (log timestamps) before blaming tests; e.g. actions/checkout with fetch-depth: 0 does a full-history, all-branches fetch that can dominate wall-clock.
Failures recurring across PRs: one hard failure is a human's problem, but the same signature on multiple unrelated branches is a systemic issue a human looking at a single PR won't see.
Coverage: test-offload prints a coverage-delivery diagnostic; a MISMATCH line = dropped .coverage data.
Repeated log noise: the same warning recurring many times within a job (even a passing one) usually signals a misconfigured environment.
Anything else that looks off: you won't have a rule for every anomaly -- flag whatever seems wrong and dig in. First rule out the known-weird-but-expected (see traps below), e.g. offload running a given test a nondeterministic number of times due to speculative retries and cancellations.

Don't over-claim (common traps)

Always read the actual failure output before attributing a cause. Hard failures are usually branch-specific gates (ratchet sync, coverage, stale generated CLI docs, snapshot mismatches), not infra; Modal acceptance flakiness is usually absorbed by retries (-> neutral), so it rarely causes the hard failure.
Cross-branch duration differences are mostly test-phase variance; the cached base build (~~45s) and env-prep (~~60s) are near-constant. Only flag a regression if setup/cache steps grew or the same branch slowed across runs.
Normal Modal host-creation output (not errors/cache-misses): building from mngr default Dockerfile (cheap COPY layer, ~2s) and <pkg> ... Installing at runtime (only test_mngr_create_with_dockerfile_on_modal, not every host).
A single branch failing repeatedly the same way is its own bug; only the same failure across unrelated branches is systemic.

Report

Group by category, most important first. Per finding: what, where (run/check URL + test/job), frequency across sampled runs, and the diagnostic detail (a flake's failure reason; which step made a job slow). Separate infra noise from real regressions, and already-being-fixed (check gh pr list) from open. State how many runs you sampled and over what window.