name: ci-debug
license: MIT
compatibility: "Claude Code 2.1.183+ — uses gh CLI for GitHub Actions log inspection. Works in interactive sessions and headless claude -p --bare invocations (e.g. /ork:ci-sentinel)."
description: "Diagnose a failing CI run against an 11-pattern playbook. Classifies the failure, cites the relevant memory entry, proposes the exact fix command — but NEVER applies without explicit user approval. Use when a specific PR check or GitHub Actions run failed and you want a diagnosis instead of speculation. Don't use for org-wide CI sweeps (that's /status) or for app-level test failures (the playbook is CI-infra-specific)."
argument-hint: "<PR-number | run-URL | job-URL>"
context: fork
version: 0.2.0
disable-model-invocation: true
author: OrchestKit
tags: [ci, github-actions, debugging, classification, propose-dont-apply]
user-invocable: true
allowed-tools: [Bash, Read, Grep, Glob]
skills: [github-operations, memory]
complexity: medium
persuasion-type: guidance
model: sonnet
metadata:
category: workflow-automation
origin: "/insights audit 2026-05-11 — recurring CI-debug pattern across 12 sessions in 3 weeks"
triggers:
keywords: ["ci failure", "build red", "actions failed", "workflow failed", "PR check failed", "CI broken"]
examples:
- "PR #1842 build is red, what happened?"
- "/ci-debug 822"
- "/ci-debug https://github.com/owner/repo/actions/runs/12345"
anti-triggers: ["what's failing across the org", "test broke in my code"]
paths:
- ".github/workflows/**/*.yml"
/ci-debug — classify a failing CI run
Direct response to the recurring CI-debug pattern surfaced by /insights: ~12 sessions in 3 weeks doing the same classification dance. This skill encodes the 11 patterns so the dance becomes a lookup.
Input
User invokes with one of:
- PR number:
/ci-debug 822(default repo from context; ask if ambiguous) - Run URL:
/ci-debug https://github.com/owner/repo/actions/runs/12345 - Job URL:
/ci-debug https://github.com/owner/repo/actions/runs/X/job/Y
Execution
1. Resolve the failing job
# From PR number:
gh pr checks <n> --repo <owner>/<repo> --json bucket,link,name \
--jq '.[] | select(.bucket=="fail") | "\(.name)|\(.link)"'
# From run URL:
gh api repos/<owner>/<repo>/actions/runs/<run-id>/jobs \
--jq '.jobs[] | select(.conclusion=="failure")
| {id, name, runner_name, started_at, completed_at,
steps: [.steps[] | select(.conclusion=="failure") | {name, number}]}'
If multiple jobs failed, pick the one with the shortest duration — root cause is usually the first failure; later jobs cascade.
No job in the fail bucket but a check won't settle? If gh pr checks shows zero fail-bucket entries yet a status sits in pending that never resolves (and gh pr view --json mergeStateStatus returns UNSTABLE while mergeable=MERGEABLE), this is a stuck external status, not a failure — jump straight to Pattern #11. There is no failing log to fetch; classify on the commit-status metadata (gh api repos/<o>/<r>/commits/<sha>/status).
2. Fetch the failing log
gh api repos/<owner>/<repo>/actions/jobs/<job_id>/logs 2>&1 \
| grep -iE '(error|fail|ERR_|CONFLICT|Process completed with exit code)' \
| head -30
Capture the FIRST distinct error message (later lines often echo).
3. Classify against the playbook
Walk the patterns in order. First match wins.
| # | Pattern | Signature in logs | Memory ref | Proposed fix |
|---|---|---|---|---|
| 1 | Billing block | runner_name empty + steps[] empty + ~3s duration + annotation: "recent account payments have failed or your spending limit needs to be increased" | billing-surface-hosted-vs-self-hosted.md |
Org admin → Settings → Billing & plans → raise limit / update card. No code change. |
| 2 | Root-lockfile drift | ERR_PNPM_OUTDATED_LOCKFILE mentioning <ROOT>/typescript/<pkg>/package.json |
pnpm-lock-root-vs-workspace-duality.md |
pnpm install --lockfile-only && git add pnpm-lock.yaml && git commit && git push. |
| 3 | uv.lock drift | error: The lockfile at uv.lock needs to be updated |
changeset-release-uv-lock-drift.md |
cd python && uv lock then commit. |
| 4 | ci-shared.yml missing permissions | startup_failure pattern (empty runner_name + steps[]=[] + ~3s) BUT billing is resolved | ci-shared-permissions-block-required.md |
Add permissions: { contents: read, packages: read } to the caller workflow. |
| 5 | YAML python embed | YAML parse error pointing at a multi-line block scalar with python -c |
yaml-python-embed.md |
Rewrite python -c as a separate shell script invocation; never inline multi-line python in YAML. |
| 6 | actionlint shellcheck false-positive | audit/actionlint job failing with SC2086/SC2046 on workflow YAMLs you didn't touch | audit-actionlint-triggers-on-workflow-edit.md |
Not required check; safe to merge past if the warnings predate your change. Optional: add shellcheck disable comments. |
| 7 | macOS BSD date %3N | %3N printed literally in CI output / arithmetic fails |
macos-bsd-date-no-percent-3N.md |
Replace date +%s%3N with node -e 'console.log(Date.now())' or python3 -c 'import time; print(int(time.time()*1000))'. |
| 8 | Runner pnpm Rosetta arch drift | pnpm install fails with "wrong-arch native bin" / dlopen error on a self-hosted runner | runner-pnpm-rosetta-arch-drift.md |
Restart the affected runner pool; root cause is node x64↔arm64 flips storing wrong-arch native bins in shared cache. |
| 9 | Shallow clone false divergence | git status reports diverged but PR was actually merged |
shallow-clone-false-divergence.md |
git fetch origin <branch> --unshallow then gh pr view --merge-commit to verify. |
| 10 | Publish run cancelled | Publish-tag workflow run shows conclusion=cancelled; artifact never lands |
publish-runs-cancelled-need-redrive.md |
Re-fire via gh workflow run publish-python.yml -f tag=<tag> (adjust for your publish workflow). |
| 11 | Vercel status orphaned (path-skip) | No job in the fail bucket, but Vercel appears as a commit status (not a check-run) stuck state=pending with created_at == updated_at and no terminal update; all GitHub Actions checks green; mergeStateStatus=UNSTABLE + mergeable=MERGEABLE on an unprotected base branch |
vercel-pending-orphaned-on-path-skip.md |
Not a failure — cosmetic. Vercel posted a pending status then skipped the build (project-root path filter, e.g. a docs-only change that never touches apps/web), orphaning the status. Safe to merge: gh pr merge <n> --repo <owner>/<repo> --squash. Permanent fix: the Vercel project's Ignored Build Step must exit 0 AND report success for skipped paths so the status flips instead of dangling. |
The memory references point at user-curated memory files (~/.claude/projects/<project>/memory/*.md). If your memory doesn't have them yet, the signature column is enough to classify — the memory citation is a nice-to-have, not required.
4. Report
For a matched pattern:
## CI Debug: <repo> · <pr-or-run-ref>
**Failing job:** `<job name>` (<duration>s) on runner `<runner_name>`
**Failing step:** <step name> (#<step number>)
**Error excerpt:**
\`\`\`
<first 3 lines of grep'd error>
\`\`\`
**Classification:** Pattern #<n> — <pattern name>
**Reference:** memory `<memory-file.md>`
**Proposed fix:**
<exact commands, one per line>
**Will I apply this?** No — awaiting your approval. Reply "go" to ship.
For an UNMATCHED failure:
## CI Debug: <repo> · <pr-or-run-ref> · NOVEL
**Failing step:** <step name>
**Unique log lines:**
\`\`\`
<top 10 distinct error lines>
\`\`\`
This doesn't match any of the 11 playbook patterns. Surfacing the raw
evidence for your read. Once you identify the root cause, consider
adding it to the playbook (in this SKILL.md) so the next run catches
it automatically.
Headless invocation (permission mode)
/ci-debug REQUIRES Bash tool access — gh pr checks, gh run view, and gh api .../jobs/.../logs are how it fetches evidence. In headless claude -p runs (ci-sentinel, cron):
- Use
--permission-mode acceptEdits— the headless "use tools without prompting" mode. The skill is read-only by design, so this grants nothing risky. - NEVER
dontAsk— it silently REFUSES permission-requiring tools (including Bash), so every analysis returns empty output with no error (#1862 Bug C). claude -preports auth/permission failures as JSON on stdout, not stderr — capture and log both streams on failure (seeshared/rules/cc-bare-auth-gotcha.mdfor the auth side).
CRITICAL guardrails
- NEVER auto-apply a fix. This skill proposes; the user approves. Wasted cycles from misclassification are exactly the failure mode we're hardening against.
- NEVER skip the fetch-log step. Without the actual error text, classification is guessing. If logs are gone (>90 days old, run deleted), say so and stop.
- Cite the exact memory entry (filename + section heading where relevant) so the user can verify the analogy.
- Each classification must have falsifying evidence — the signature must match. Don't pattern-match on hope.
When to invoke
- A specific CI run / job / PR check has failed and the user wants the diagnosis.
- The user pastes a run URL or "PR #N is failing — why?".
When NOT to invoke
- For broad "what's failing across the org?" — use
/statusinstead. - For application-level test failures (the playbook is CI-infra-specific, not e.g. pytest failures).
- Before reading the actual log — don't speculate.
Adding patterns
When a novel CI failure surfaces (the UNMATCHED report fires), add a row to the table above:
- Signature: the unique-enough log line / metadata combination. Must be falsifiable.
- Memory ref: name the memory file you wrote with the full incident write-up.
- Proposed fix: the EXACT command line. No prose, no "you might want to". The skill produces commands the user runs; ambiguity defeats the purpose.
Commit the SKILL.md change as docs(ci-debug): add pattern #N (<short-name>). The /ork:ci-sentinel workflow picks up new patterns automatically on its next sweep — no separate plumbing needed.
Related Skills
- Composes with —
/ork:ci-sentinel(the autonomous hourly trigger for this skill against open red PRs). - Anti-pattern — manually re-running
gh pr checksand eyeballing the logs across N repos. That's exactly the toil this skill kills. - Upstream —
/statusfor org-wide sweeps that surface WHICH PRs are red; this skill answers WHY.