name: fix-flaky
description: Triage and FIX flaky tests on a Semaphore project end-to-end — find the worst offenders, pull history, locate the test in the code, diagnose the root cause, write a fix or justified quarantine, and verify by re-running. Use whenever the user wants to fix/investigate flaky tests, de-flake CI, reduce intermittent failures, asks "why is CI randomly red", wants to quarantine a flake, or after sem-ai flaky list surfaces offenders. Goes BEYOND detection (test-intelligence) to root-cause + code change + verification.
user-invocable: true
Fix Flaky Tests
Detection tells you a test is flaky; this skill closes the loop to a fix. The hard part is never "is it flaky" — it's tying the failure to the actual code.
The loop
1. Discover + rank (ranked, denoised, compact — no jq needed)
sem-ai flaky list --project <name> \
--sort-field total_disruptions_count --sort-dir desc \
--disruptions ">1"
--disruptions ">1" drops one-off noise (single-failure pass_rate:50 rows);
the sort ranks by recurrence. Output omits the per-test disruption_history
histogram by default (rarely needed — --full restores it; no diagnosis path
below requires it). Pick a test that recurs across many commits and whose
test_file you can read.
2. Get the per-context history
sem-ai flaky show <test_id> --project <name> # POSITIONAL test_id (NOT --file). Returns per-context pass_rate, p95, disruptions_count.
For the real failure, run flaky failure <test_id> (see Pull the actual
failure) — don't hand-chase run ids. (latest_disruption_run_id is on the
flaky list row, not show.) Contexts whose stats are all-null simply have no
disruptions recorded on that branch; ignore them and read the non-null ones.
3. Locate in the code (paths are app-relative)
flaky failure (step 2) already hands you the failing file+line — no need to
derive them from the test name. But the reported path is app-relative, not
repo-root: in a monorepo test/foo/bar_test.exs lives under an app/service dir
(e.g. apps/api/test/…, apps/web/test/…, services/worker/test/…). Resolve
the on-disk path:
git -C <repo> ls-files | grep -F "$(echo <test_file> | sed 's/:[0-9]*$//')"
If that returns matches in several apps, disambiguate with the test_group/suite
from flaky show (e.g. a group like MyApp.Web.WidgetTest → the web app).
Read the test AND the code it exercises — flakes live in the seam.
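The `grep -F` above matches the path anywhere in a line, so `test/foo/bar_test.exs` would also hit a path such as `old_test/foo/bar_test.exs`. A stricter variant, as a minimal sketch: keep only candidates that end with the app-relative path at a directory boundary. `resolve_test_path` is a hypothetical helper name, not part of sem-ai; it reads candidate paths on stdin so it can be fed from `git ls-files`.

```shell
# Hypothetical helper (not part of sem-ai): print the candidate paths read
# from stdin that end with the given app-relative test path.
# A trailing ":<line>" on the argument is stripped first.
resolve_test_path() {
  rel=$(printf '%s\n' "$1" | sed 's/:[0-9]*$//')
  awk -v rel="$rel" '
    # Exact match: the path is already repo-root relative.
    $0 == rel { print; next }
    {
      n = length($0); m = length(rel)
      # Suffix match at a directory boundary: the last m+1 chars are "/" + rel.
      if (n > m && substr($0, n - m) == ("/" rel)) print
    }'
}
```

Usage, with the same placeholders as above: `git -C <repo> ls-files | resolve_test_path <test_file>`. Several lines of output mean the path exists in more than one app, which is the case the test_group disambiguation covers.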
4. Classify from the real failure — the table names the class, it does NOT hand you the fix
Pull the real failure first (see Pull the actual failure below): the
left:/right: + stacktrace are the diagnosis. The table only names the
class of nondeterminism and the direction a fix usually takes, so you
know what you're chasing — the actual fix comes from the test in front of you,
never from a cell.
| signal | likely class | fix direction |
|---|---|---|
| asserts strict </> (or compare == :lt/:gt) on two timestamps taken close together; passes most runs | clock-tie / nondeterministic time | allow the tie (inclusive bound), or freeze/inject time so the values are deterministic |
| element acted on after an async re-render; StaleReferenceError | stale-element after async render | retry the lookup+action on stale; a presence-assert does NOT fix it (the node goes stale after lookup) |
| in-test wait/sleep budget shorter than the work's failure tail | timeout too short for async work | raise the wait budget to match a non-flaky sibling; make the predicate nil-safe |
| asserts order of a query/collection with no explicit ordering | nondeterministic ordering | add deterministic ordering at the source |
| passes alone but fails after other tests (leftover rows/keys/processes) | shared/global state | isolate setup/teardown; unique fixtures |
| asserts a count of live processes/children that's off by one+ | leaked process from a prior test (shared named supervisor / registered process) | terminate/drain the named process in setup/on_exit, not just the DB |
| calls a real external service | external dependency | stub/mock, or mark + isolate |
p95 (from flaky show) is relevant only to the timeout row; for
clock-tie/stale-element/ordering it's a red herring. Even for the timeout class,
compare the wait budget to the failure tail rather than to p95 (a ~95%-pass flake's
p95 sits under the budget); the real ceiling is wait-helper fan-out ×
per-wait budget. Two high-value moves: grep the repo for other callers of
that wait helper and diff their budgets (a non-flaky sibling is both proof and
fix template); and before writing any retry/wait machinery, grep for an
existing helper (retry_on_stale, assert_eventually, a shared Wait util)
and reuse it.
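The "grep for an existing helper" move can be sketched as one recursive search. The helper names in the pattern are the examples from the paragraph above, not a claim about any particular repo. The function name `find_wait_helpers` and the `*.ex`/`*.exs` filter are assumptions for an Elixir codebase; adjust both to the repo in front of you.

```shell
# Hypothetical helper: list every line under a directory (default: ".")
# that mentions one of the example wait/retry helper names.
# Output format is path:line:text, one hit per line.
find_wait_helpers() {
  # /dev/null is passed as an extra file so grep always prints the filename,
  # even when find hands it a single match.
  find "${1:-.}" -type f \( -name '*.ex' -o -name '*.exs' \) \
    -exec grep -nE 'retry_on_stale|assert_eventually' /dev/null {} +
}
```

Each hit shows the call site together with whatever budget it passes, so a non-flaky sibling's value can be read off next to the flaky caller's.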
5. Fix or quarantine
Smallest change that removes the nondeterminism. A justified quarantine (skip/tag + linked ticket) is acceptable if a true fix is out of scope — say why. Match repo conventions; no comments unless the repo uses them.
6. Verify by RE-RUNNING (one green proves nothing)
Use the testbox skill to run the single test many times against your change, or a targeted rerun, and check the pass rate moved. Can't verify (no local toolchain, can't push, or testbox unavailable — e.g. an org that blocks debug sessions)? Say so and mark the fix provisional — that's an acceptable outcome, not a failure.
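Where a local toolchain is available, the rerun can be sketched as a plain loop that counts passes. This is an assumption-laden stand-in, not the testbox skill itself: `rerun_pass_rate` is a hypothetical helper, and the command you hand it (for an ExUnit project, something like `mix test <test_file>:<line>`) depends on the repo.

```shell
# Hypothetical helper: run a command N times and report how many runs passed.
# Usage: rerun_pass_rate <runs> <command> [args...]
# Exits non-zero if any run failed, so it can gate a script.
rerun_pass_rate() {
  runs=$1
  shift
  passed=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # A run counts as a pass when the command exits 0; its output is discarded.
    if "$@" >/dev/null 2>&1; then
      passed=$((passed + 1))
    fi
    i=$((i + 1))
  done
  echo "$passed/$runs passed"
  [ "$passed" -eq "$runs" ]
}
```

Run it once on the unfixed branch and once on the fix, with the same count, and compare the two numbers. A clean run on the fix only means something if the unfixed branch fails at a visible rate over the same number of runs.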
Pull the actual failure (flaky failure)
sem-ai flaky failure <test_id> --project <name>
One call resolves the latest disruption's job, fetches its log, and returns the
failing test's real assertion as JSON: {test_name, run_id, framework, summary, matched, failures:[{file, line, message}]}. message is the actual
code:/left:/right:/stacktrace — not a guess. It works for ExUnit (which
test report can't parse), and filters to your test. Pin a specific occurrence
with --run-id <job_id>.
- matched:false → the failure block didn't match your test name (the job ran it but it may have passed that run, or the name differs); it returns all failures in that job, so eyeball them.
- log_unavailable → the disruption's job log aged out (retention); diagnose from source + the playbook above.
- Timeout-class flakes show a raised exception (e.g. Timeout: ...), not an assertion diff: message carries the exception, not a failing assert.
- For ExUnit, message often includes the full process Logger output after the assertion. Read all of it; the event ordering there is frequently the decisive evidence (e.g. an async consumer firing after the step you tested), not just left/right.
Manual fallback (older binaries without flaky failure): a run_id from
flaky disruptions <test_id> (.run_id; skip null-padding rows) is a job id →
job log <run_id> (takes NO --project) → grep the failure block:
sem-ai job log <run_id> | jq -r '.[].output // empty' \
| grep -nE '[0-9]+\) (test|doctest)|match \(=\) failed|left:|right:|stacktrace:'
Composes with
- test-intelligence: sem-ai test report|summary failure detail (when retrievable).
- debug-pipeline: sem-ai diagnose <run-id> for the broader failed run.
- testbox: step 6 verification.
Gotchas
- flaky show/disruptions/failure take the test_id positionally (the args field via MCP); --file returns empty silently.
- flaky disruptions can return null-timestamp padding rows; ignore them.
- flaky show per-context pass_rate/disruptions_count can be null even when disruptions exist; trust the disruption rows.
- Don't sem-ai context switch mid-task if one is set; pass --project.