ci-debugging - SKILL.md Agent Skill

name: ci-debugging
description: Systematic CI/CD failure diagnosis using hypothesis-first investigation, local reproduction, and environment delta analysis. Use when a CI pipeline, GitHub Actions workflow, or build job fails; when tests pass locally but fail in CI; when diagnosing flaky tests, timeouts, or red pipelines; or when the user says "CI is failing", "the build is broken", or "works on my machine".

CI Debugging

Every CI failure is real until proven otherwise. Never assume flakiness.

Getting the Data

Pull the actual failure output before forming hypotheses:

gh run list --branch <branch> --limit 5        # find the failing run
gh run view <run-id> --log-failed              # only the failed steps' logs
gh run view <run-id> --job <job-id> --log      # one job's full log
gh run download <run-id>                       # artifacts (coverage, reports, screenshots)
gh run rerun <run-id> --failed                 # re-run only failed jobs (for evidence, not hope)

For step-level detail, re-run with debug logging: set the `ACTIONS_STEP_DEBUG` secret or variable to `true`, or run `gh run rerun <run-id> --debug`. Compare the failing run against the last green run on the same branch (`gh run list`). The differences in commits, dependency lockfiles, and workflow files between those two runs are the primary suspect list.
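The comparison against the last green run can be sketched as follows. This is a minimal sketch, not a definitive script: it assumes an authenticated `gh`, a local clone that contains both commits, and an npm project (the `package-lock.json` path is an assumption; substitute your own lockfile). The helper names `last_green_sha`, `run_sha`, and `suspect_diff` are hypothetical.

```shell
# Print the head commit of the most recent successful run on a branch.
last_green_sha() {
  gh run list --branch "$1" --status success --limit 1 \
    --json headSha --jq '.[0].headSha'
}

# Print the head commit of one specific run (the failing one).
run_sha() {
  gh run view "$1" --json headSha --jq '.headSha'
}

# Show the commits between two SHAs, then which suspect files changed.
# The paths after "--" are assumptions about the repository layout.
suspect_diff() {
  git log --oneline "$1..$2"
  git diff --stat "$1" "$2" -- package-lock.json .github/workflows
}

# Usage (placeholders, not real values):
#   good=$(last_green_sha <branch>)
#   bad=$(run_sha <run-id>)
#   suspect_diff "$good" "$bad"
```

An empty `git diff --stat` here means neither the lockfile nor the workflows changed, which shifts suspicion toward the commits listed by `git log`.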

Hypothesis-First Diagnosis

Before investigating, list at least 3 possible root causes. Investigate each systematically rather than jumping to the first guess.

Example hypotheses for a test timeout:

Test relies on network access unavailable in CI
Parallel test execution causes resource contention
CI runner has less memory/CPU than local machine

Local Reproduction

Always reproduce the failure locally before pushing fixes.

Run the exact failing command, not a close equivalent
Match the CI environment as closely as possible (runtime version, OS, env vars, a clean dependency install)
If it passes locally, the delta between environments IS the bug
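One way to narrow the environment gap is to run the failing command in a Linux container pinned to the CI runtime version. The sketch below assumes Docker is installed and that the official `node` image is close enough to the runner; it matches the runtime and OS family, not the runner's exact image. The function name `repro_in_container` and the `DRY_RUN` switch are hypothetical.

```shell
# Run a command inside a container that mirrors the CI Node version.
# $1 is the Node image tag (for example the version in the CI config);
# the remaining arguments are the exact failing command.
repro_in_container() {
  node_version="$1"
  shift
  # Rebuild the positional parameters as the full docker command line.
  set -- docker run --rm -e CI=true -v "$PWD:/work" -w /work \
    "node:$node_version" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # print the command instead of running it
  else
    "$@"
  fi
}

# Usage (version and command are placeholders):
#   repro_in_container 20 sh -c 'npm ci && npm test'
```

Setting `DRY_RUN=1` prints the command first, which makes it easy to check the mount and image tag before spending time on a full run.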

Environment Delta Analysis

Compare CI vs local:

| Factor | Check |
| --- | --- |
| Node/runtime version | CI config vs `node -v` locally |
| OS | Linux CI vs macOS local |
| Dependency resolution | Fresh `npm ci` vs cached `node_modules` |
| Env vars | CI secrets/config vs local `.env` |
| Parallelism | CI may run tests with a different worker count or ordering |
| Memory/CPU | CI runners often have fewer resources |
| Network | CI may block external network access |
| File system | Case-sensitive (Linux) vs case-insensitive by default (macOS) |
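Several of these factors can be captured as a small fingerprint and diffed between the two environments. This is a rough sketch that assumes a Node project and a POSIX shell on both sides; the function name `env_fingerprint` and the output format are hypothetical, and it covers only a subset of the factors listed.

```shell
# Print one key=value line per environment factor.
# Missing tools are reported as "missing" instead of failing.
env_fingerprint() {
  echo "os=$(uname -s)"
  echo "arch=$(uname -m)"
  echo "cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo unknown)"
  echo "node=$(node -v 2>/dev/null || echo missing)"
  echo "npm=$(npm -v 2>/dev/null || echo missing)"
  echo "ci=${CI:-unset}"
}

# Usage: run it locally and in a CI step, then compare the two files.
#   env_fingerprint > local.txt
#   diff local.txt ci.txt
```

Each line that `diff` reports is a candidate explanation for a failure that only appears in CI.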

Read the Full Error

Read the complete error output, not just the last line
Check preceding log lines and warnings — they often contain the real cause
Look at stack traces to identify the actual failure point
Check for earlier failures that may cascade into the visible error
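To find the earliest failure instead of the last line, a saved log can be searched from the top. The sketch below assumes the log was saved to a file, for example with `gh run view <run-id> --log-failed > failed.log`. The function name `first_error` and the search pattern are assumptions; tune the pattern to your toolchain, since a line such as "0 errors" would also match.

```shell
# Print the first log line that mentions an error, with surrounding context.
# $1 is the log file, $2 the number of context lines (default 5).
first_error() {
  log_file="$1"
  context="${2:-5}"
  # Line number of the first case-insensitive match.
  line=$(grep -n -i -m 1 -E 'error|fatal|exception' "$log_file" | cut -d: -f1)
  if [ -z "$line" ]; then
    echo "no match in $log_file" >&2
    return 1
  fi
  start=$((line - context))
  if [ "$start" -lt 1 ]; then
    start=1
  fi
  sed -n "${start},$((line + context))p" "$log_file"
}

# Usage:
#   first_error failed.log 10
```

The context lines before the match matter as much as the match itself, because warnings that precede the first error often name the real cause.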

Fix Verification

After identifying a fix:

Explain why it addresses the root cause (not just the symptom)
Run the exact failing command locally and confirm it now passes
Verify the fix doesn't mask the real issue (e.g., adding a retry hides a race condition)

Anti-Patterns

| Anti-Pattern | Why It's Wrong | Instead |
| --- | --- | --- |
| "It's flaky, re-run it" | Masks real issues | Investigate the failure |
| Adding retries/sleeps | Hides timing bugs | Fix the race condition |
| Pushing speculative fixes | Wastes CI cycles | Reproduce and verify locally |
| Reading only the last error line | Misses root cause | Read full output from the top |
| Fixing symptoms | Problem will recur | Trace to root cause |

Proving Flakiness

A failure is only flaky if you have evidence:

Multiple independent runs with identical environment showing different results
AND you can identify the non-deterministic source (race condition, time-dependent test, external service)

Without this evidence, treat every failure as a real bug.
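Gathering the first kind of evidence can be as simple as repeating the same command in the same environment and counting outcomes. This is a minimal sketch; the function name `repeat_run` is hypothetical, and it discards command output, so re-run a failing iteration by hand to read its log.

```shell
# Run a command N times and report how often it passed and failed.
# $1 is the number of runs; the remaining arguments are the command.
repeat_run() {
  runs="$1"
  shift
  pass=0
  fail=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    if "$@" >/dev/null 2>&1; then
      pass=$((pass + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done
  echo "pass=$pass fail=$fail"
}

# Usage (the test command is a placeholder):
#   repeat_run 20 npm test
```

A mixed result satisfies only the first condition above. The non-deterministic source still has to be identified before the failure can be called flaky.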

Handoff

Once the root cause is identified, write a failing test that reproduces it before fixing — load the tdd skill (or characterisation-tests if the broken code has no tests). A CI fix without a pinning test is a recurrence waiting to happen.