ci-debug - SKILL.md Agent Skill

name: ci-debug
license: MIT
compatibility: "Claude Code 2.1.183+; uses gh CLI for GitHub Actions log inspection. Works in interactive sessions and headless claude -p --bare invocations (e.g. /ork:ci-sentinel)."
description: "Diagnose a failing CI run against an 11-pattern playbook. Classifies the failure, cites the relevant memory entry, proposes the exact fix command, but NEVER applies it without explicit user approval. Use when a specific PR check or GitHub Actions run failed and you want a diagnosis instead of speculation. Don't use for org-wide CI sweeps (that's /status) or for app-level test failures (the playbook is CI-infra-specific)."
argument-hint: "<PR-number | run-URL | job-URL>"
context: fork
version: 0.2.0
disable-model-invocation: true
author: OrchestKit
tags: [ci, github-actions, debugging, classification, propose-dont-apply]
user-invocable: true
allowed-tools: [Bash, Read, Grep, Glob]
skills: [github-operations, memory]
complexity: medium
persuasion-type: guidance
model: sonnet
metadata:
  category: workflow-automation
  origin: "/insights audit 2026-05-11: recurring CI-debug pattern across 12 sessions in 3 weeks"
triggers:
  keywords: ["ci failure", "build red", "actions failed", "workflow failed", "PR check failed", "CI broken"]
  examples:
    - "PR #1842 build is red, what happened?"
    - "/ci-debug 822"
    - "/ci-debug https://github.com/owner/repo/actions/runs/12345"
  anti-triggers: ["what's failing across the org", "test broke in my code"]
paths:
  - ".github/workflows/**/*.yml"

/ci-debug — classify a failing CI run

Direct response to the recurring CI-debug pattern surfaced by /insights: ~12 sessions in 3 weeks doing the same classification dance. This skill encodes the 11 patterns so the dance becomes a lookup.

Input

User invokes with one of:

PR number: /ci-debug 822 (default repo from context; ask if ambiguous)
Run URL: /ci-debug https://github.com/owner/repo/actions/runs/12345
Job URL: /ci-debug https://github.com/owner/repo/actions/runs/X/job/Y
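
The three input forms can be told apart mechanically before any gh call is made. A minimal sketch in POSIX shell, assuming a hypothetical helper named parse_ci_ref (it is not part of the skill) and URLs shaped like the examples above:

```shell
# Hypothetical helper: classify the /ci-debug argument.
# Prints "pr <n>", "run <run-id>" or "job <run-id> <job-id>";
# returns 1 for anything else. Variables are global (POSIX sh has no local).
parse_ci_ref() {
  ref=$1
  case $ref in
    */actions/runs/*/job/*)
      rest=${ref#*/actions/runs/}      # "<run-id>/job/<job-id>..."
      run_id=${rest%%/*}
      job_id=${rest#*/job/}
      job_id=${job_id%%[/?#]*}         # drop any trailing path or query
      printf 'job %s %s\n' "$run_id" "$job_id"
      ;;
    */actions/runs/*)
      run_id=${ref#*/actions/runs/}
      run_id=${run_id%%[/?#]*}
      printf 'run %s\n' "$run_id"
      ;;
    ''|*[!0-9]*)
      return 1                         # neither a URL nor a bare number
      ;;
    *)
      printf 'pr %s\n' "$ref"
      ;;
  esac
}
```

The job-URL pattern is tested first because every job URL also matches the run-URL pattern.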

Execution

1. Resolve the failing job

# From PR number:
gh pr checks <n> --repo <owner>/<repo> --json bucket,link,name \
  --jq '.[] | select(.bucket=="fail") | "\(.name)|\(.link)"'

# From run URL:
gh api repos/<owner>/<repo>/actions/runs/<run-id>/jobs \
  --jq '.jobs[] | select(.conclusion=="failure")
                | {id, name, runner_name, started_at, completed_at,
                   steps: [.steps[] | select(.conclusion=="failure") | {name, number}]}'

If multiple jobs failed, pick the one with the shortest duration (completed_at minus started_at): the root cause is usually the first job to fail, and later failures tend to cascade from it.
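
One way to apply the shortest-duration rule, sketched under assumptions: the hypothetical shortest_job helper expects one "seconds<TAB>job name" row per failed job on stdin. The commented gh api call that could produce those rows relies on jq's fromdateiso8601 and has not been checked against live data.

```shell
# Hypothetical helper (not part of the skill): given rows of
# "<duration-seconds><TAB><job name>" on stdin, print the name of the
# failed job with the shortest duration.
shortest_job() {
  sort -n | head -n 1 | cut -f 2-
}

# Assumed way to produce the rows (untested sketch):
# gh api repos/<owner>/<repo>/actions/runs/<run-id>/jobs \
#   --jq '.jobs[] | select(.conclusion=="failure")
#         | "\((.completed_at | fromdateiso8601) - (.started_at | fromdateiso8601))\t\(.name)"' \
#   | shortest_job
```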

No job in the fail bucket but a check won't settle? If gh pr checks shows zero fail-bucket entries yet a status sits in pending and never resolves (and gh pr view <n> --json mergeStateStatus,mergeable returns mergeStateStatus=UNSTABLE with mergeable=MERGEABLE), this is a stuck external status, not a failure: jump straight to Pattern #11. There is no failing log to fetch, so classify on the commit-status metadata instead (gh api repos/<o>/<r>/commits/<sha>/status).
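
The stuck-status check can be sketched as a filter over the commit-status metadata. This assumes a hypothetical helper, find_orphaned_statuses, fed with tab-separated rows of context, state, created_at and updated_at; the gh api call in the trailing comment is one assumed way to produce them.

```shell
# Hypothetical helper: read "context<TAB>state<TAB>created_at<TAB>updated_at"
# rows on stdin and print the context of every status that is still pending
# and was never updated after it was created.
find_orphaned_statuses() {
  tab=$(printf '\t')
  while IFS=$tab read -r context state created updated; do
    if [ "$state" = "pending" ] && [ "$created" = "$updated" ]; then
      printf '%s\n' "$context"
    fi
  done
}

# Assumed usage:
# gh api repos/<o>/<r>/commits/<sha>/status \
#   --jq '.statuses[] | [.context, .state, .created_at, .updated_at] | @tsv' \
#   | find_orphaned_statuses
```

An empty result means no status matches the orphan signature, so Pattern #11 should not be claimed on that evidence alone.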

2. Fetch the failing log

gh api repos/<owner>/<repo>/actions/jobs/<job_id>/logs 2>&1 \
  | grep -iE '(error|fail|ERR_|CONFLICT|Process completed with exit code)' \
  | head -30

Capture the FIRST distinct error message (later lines often echo).
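
Because later lines often echo the first error, collapsing repeats makes the first distinct message easier to spot. A small sketch, assuming each log line starts with an ISO-8601 timestamp followed by a space; distinct_errors is a hypothetical name, not part of the skill.

```shell
# Hypothetical filter: strip the leading timestamp, keep only error-looking
# lines (same grep as above), and print each message the first time it
# appears, preserving the original order.
distinct_errors() {
  sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z //' \
    | grep -iE '(error|fail|ERR_|CONFLICT|Process completed with exit code)' \
    | awk '!seen[$0]++'
}

# Assumed usage:
# gh api repos/<owner>/<repo>/actions/jobs/<job_id>/logs 2>&1 | distinct_errors | head -30
```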

3. Classify against the playbook

Walk the patterns in order. First match wins.

#	Pattern	Signature in logs	Memory ref	Proposed fix
1	Billing block	runner_name empty + steps[] empty + ~3s duration + annotation: "recent account payments have failed or your spending limit needs to be increased"	`billing-surface-hosted-vs-self-hosted.md`	Org admin → Settings → Billing & plans → raise limit / update card. No code change.
2	Root-lockfile drift	`ERR_PNPM_OUTDATED_LOCKFILE` mentioning `<ROOT>/typescript/<pkg>/package.json`	`pnpm-lock-root-vs-workspace-duality.md`	`pnpm install --lockfile-only && git add pnpm-lock.yaml && git commit && git push`.
3	uv.lock drift	`error: The lockfile at uv.lock needs to be updated`	`changeset-release-uv-lock-drift.md`	`cd python && uv lock` then commit.
4	ci-shared.yml missing permissions	startup_failure pattern (empty runner_name + steps[]=[] + ~3s) BUT billing is resolved	`ci-shared-permissions-block-required.md`	Add `permissions: { contents: read, packages: read }` to the caller workflow.
5	YAML python embed	YAML parse error pointing at a multi-line block scalar with `python -c`	`yaml-python-embed.md`	Rewrite `python -c` as a separate shell script invocation; never inline multi-line python in YAML.
6	actionlint shellcheck false-positive	audit/actionlint job failing with SC2086/SC2046 on workflow YAMLs you didn't touch	`audit-actionlint-triggers-on-workflow-edit.md`	Not a required check; safe to merge past if the warnings predate your change. Optional: add shellcheck disable comments.
7	macOS BSD date %3N	`%3N` printed literally in CI output / arithmetic fails	`macos-bsd-date-no-percent-3N.md`	Replace `date +%s%3N` with `node -e 'console.log(Date.now())'` or `python3 -c 'import time; print(int(time.time()*1000))'`.
8	Runner pnpm Rosetta arch drift	pnpm install fails with "wrong-arch native bin" / dlopen error on a self-hosted runner	`runner-pnpm-rosetta-arch-drift.md`	Restart the affected runner pool; root cause is node x64↔arm64 flips storing wrong-arch native bins in shared cache.
9	Shallow clone false divergence	`git status` reports diverged but PR was actually merged	`shallow-clone-false-divergence.md`	`git fetch origin <branch> --unshallow` then `gh pr view <n> --json state,mergeCommit` to verify.
10	Publish run cancelled	Publish-tag workflow run shows `conclusion=cancelled`; artifact never lands	`publish-runs-cancelled-need-redrive.md`	Re-fire via `gh workflow run publish-python.yml -f tag=<tag>` (adjust for your publish workflow).
11	Vercel status orphaned (path-skip)	No job in the `fail` bucket, but `Vercel` appears as a commit status (not a check-run) stuck `state=pending` with `created_at == updated_at` and no terminal update; all GitHub Actions checks green; `mergeStateStatus=UNSTABLE` + `mergeable=MERGEABLE` on an unprotected base branch	`vercel-pending-orphaned-on-path-skip.md`	Not a failure — cosmetic. Vercel posted a `pending` status then skipped the build (project-root path filter, e.g. a docs-only change that never touches `apps/web`), orphaning the status. Safe to merge: `gh pr merge <n> --repo <owner>/<repo> --squash`. Permanent fix: the Vercel project's Ignored Build Step must exit `0` AND report success for skipped paths so the status flips instead of dangling.

The memory references point at user-curated memory files (~/.claude/projects/<project>/memory/*.md). If your memory doesn't have them yet, the signature column is enough to classify — the memory citation is a nice-to-have, not required.
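
The "first match wins" walk maps naturally onto a shell case statement, which also stops at the first matching branch. A partial sketch, assuming a hypothetical classify_log_line helper and covering only the patterns whose signature is a plain log line (#2, #3, #7); it simplifies #2 by ignoring the package.json path.

```shell
# Hypothetical first-match-wins classifier. Prints the pattern number, or
# NOVEL when nothing matches. Metadata-based patterns (#1, #4, #10, #11)
# need the job or commit-status JSON and are out of scope here.
classify_log_line() {
  case $1 in
    *ERR_PNPM_OUTDATED_LOCKFILE*)
      echo 2 ;;    # simplification: the package.json path is not checked
    *"The lockfile at"*uv.lock*"needs to be updated"*)
      echo 3 ;;    # matched around the filename in case the tool quotes it
    *%3N*)
      echo 7 ;;
    *)
      echo NOVEL ;;
  esac
}
```

Branch order mirrors the table order, so adding a pattern means adding a branch at the matching position.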

4. Report

For a matched pattern:

## CI Debug: <repo> · <pr-or-run-ref>

**Failing job:** `<job name>` (<duration>s) on runner `<runner_name>`
**Failing step:** <step name> (#<step number>)
**Error excerpt:**
\`\`\`
<first 3 lines of grep'd error>
\`\`\`

**Classification:** Pattern #<n> — <pattern name>
**Reference:** memory `<memory-file.md>`

**Proposed fix:**
<exact commands, one per line>

**Will I apply this?** No — awaiting your approval. Reply "go" to ship.

For an UNMATCHED failure:

## CI Debug: <repo> · <pr-or-run-ref> · NOVEL

**Failing step:** <step name>
**Unique log lines:**
\`\`\`
<top 10 distinct error lines>
\`\`\`

This doesn't match any of the 11 playbook patterns. Surfacing the raw
evidence for your read. Once you identify the root cause, consider
adding it to the playbook (in this SKILL.md) so the next run catches
it automatically.

Headless invocation (permission mode)

/ci-debug REQUIRES Bash tool access — gh pr checks, gh run view, and gh api .../jobs/.../logs are how it fetches evidence. In headless claude -p runs (ci-sentinel, cron):

Use --permission-mode acceptEdits — the headless "use tools without prompting" mode. The skill is read-only by design, so this grants nothing risky.
NEVER use dontAsk: it silently REFUSES permission-requiring tools (including Bash), so every analysis returns empty output with no error (#1862 Bug C).
claude -p reports auth/permission failures as JSON on stdout, not stderr — capture and log both streams on failure (see shared/rules/cc-bare-auth-gotcha.md for the auth side).
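
The two failure modes above (errors reported on stdout, and empty output with no error) suggest wrapping headless runs so that both streams are kept and an empty result counts as a failure. A sketch under those assumptions; run_and_capture is a hypothetical wrapper, not something the skill ships.

```shell
# Hypothetical wrapper: run a command, capture stdout and stderr separately,
# and dump both to stderr when the run fails or produces no output.
run_and_capture() {
  _out=$(mktemp)
  _err=$(mktemp)
  _rc=0
  "$@" >"$_out" 2>"$_err" || _rc=$?
  if [ "$_rc" -eq 0 ] && [ -s "$_out" ]; then
    cat "$_out"
  else
    # Exit 0 with empty stdout is the silent-refusal case, so flag it too.
    if [ "$_rc" -eq 0 ]; then _rc=1; fi
    {
      printf 'headless run failed (exit %s)\n' "$_rc"
      printf '%s\n' '--- stdout ---'
      cat "$_out"
      printf '%s\n' '--- stderr ---'
      cat "$_err"
    } >&2
  fi
  rm -f "$_out" "$_err"
  return "$_rc"
}

# Assumed usage:
# run_and_capture claude -p "/ci-debug 822" --permission-mode acceptEdits
```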

CRITICAL guardrails

NEVER auto-apply a fix. This skill proposes; the user approves. Wasted cycles from misclassification are exactly the failure mode we're hardening against.
NEVER skip the fetch-log step. Without the actual error text, classification is guessing. If logs are gone (>90 days old, run deleted), say so and stop.
Cite the exact memory entry (filename + section heading where relevant) so the user can verify the analogy.
Each classification must be falsifiable: quote the log line or metadata that matches the signature, and if the signature doesn't match, the pattern doesn't apply. Don't pattern-match on hope.

When to invoke

A specific CI run / job / PR check has failed and the user wants the diagnosis.
The user pastes a run URL or "PR #N is failing — why?".

When NOT to invoke

For broad "what's failing across the org?" — use /status instead.
For application-level test failures (the playbook is CI-infra-specific, not e.g. pytest failures).
Before reading the actual log — don't speculate.

Adding patterns

When a novel CI failure surfaces (the UNMATCHED report fires), add a row to the table above:

Signature: the unique-enough log line / metadata combination. Must be falsifiable.
Memory ref: name the memory file you wrote with the full incident write-up.
Proposed fix: the EXACT command line. No prose, no "you might want to". The skill produces commands the user runs; ambiguity defeats the purpose.

Commit the SKILL.md change as docs(ci-debug): add pattern #N (<short-name>). The /ork:ci-sentinel workflow picks up new patterns automatically on its next sweep — no separate plumbing needed.

Related Skills

Composes with — /ork:ci-sentinel (the autonomous hourly trigger for this skill against open red PRs).
Anti-pattern — manually re-running gh pr checks and eyeballing the logs across N repos. That's exactly the toil this skill kills.
Upstream — /status for org-wide sweeps that surface WHICH PRs are red; this skill answers WHY.