michel-monitor-pull-request-github-actions

name: michel-monitor-pull-request-github-actions
description: Diagnose a failed, stuck, or never-triggered CI run on a GitHub PR, apply a local fix if possible, push it, and document the result in a single running PR comment. Invoke whenever Michel's CI monitor loop triggers with `any_failure`, `stuck`, or `not_triggered`. The bash loop already handles `pending` and `all_green` silently, so this skill never sees those states.

Monitor a pull request's GitHub Actions CI

Called during one iteration of the CI monitor loop. The harness prompt already contains everything needed to start:

{{CHECKS_JSON}} — snapshot of all check statuses at loop entry
{{REASON}} — any_failure, stuck, or not_triggered
{{PR_NUMBER}}, {{BRANCH}}, {{WORKDIR}}, {{ITER}}, {{MAX_ITER}}

Decision flow

Read CHECKS_JSON in prompt
  → if REASON == "stuck":          document timeout in running comment → done
  → if REASON == "not_triggered":  investigate why CI never started, document → done
  → if REASON == "any_failure":
      1. Identify failing run(s) via gh run list
      2. Fetch failed logs with gh run view --log-failed
      3. Match pattern → apply fix locally
      4. Commit + push
      5. Create or update running comment
      (if unfixable: document reason → done, no push)
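The flow above can be sketched as a small dispatch on the reason value. This is only an illustration, not the harness's actual code: the function name `next_action` and the action labels it prints are hypothetical.

```shell
# Hypothetical helper: maps a REASON value to the matching branch of the
# decision flow. The printed labels are illustrative names only.
next_action() {
  local reason="$1"
  case "$reason" in
    stuck)         echo "document-timeout" ;;
    not_triggered) echo "investigate-and-document" ;;
    any_failure)   echo "diagnose-and-fix" ;;
    *)             echo "unexpected reason: $reason" >&2; return 1 ;;
  esac
}

next_action "any_failure"
```

Any other value (for example `pending`) is rejected, since the outer loop is expected to handle those states itself.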

1. Classify: `any_failure`, `stuck`, or `not_triggered`

The bash outer loop has already classified the state; you are invoked only in these three cases:

| `{{REASON}}` | Meaning | Action |
| --- | --- | --- |
| `any_failure` | At least one completed check has `conclusion` ∈ `{failure, timed_out, cancelled, action_required}` | Diagnose and fix (sections 2–3) |
| `stuck` | A workflow started but has been in-progress for >20 min | Document timeout in running comment (section 5); no code changes |
| `not_triggered` | Zero checks ever registered for the whole monitoring window, so CI never started | Investigate why (section 5) and document; no code changes |

conclusion values to recognise: success / failure / timed_out / cancelled / action_required / startup_failure / neutral / skipped / stale / null.

A completed check with conclusion = null counts as failure.

gh CLI contract — re-fetch with `gh run list`, NOT `gh pr checks`

Do not use gh pr checks here. It resolves the commit's statusCheckRollup over GraphQL, and the worker's fine-grained PAT has no permission that can read the Checks API (check-runs). It always partial-fails with Resource not accessible by personal access token (…statusCheckRollup.contexts.nodes.N), returns an empty body, and makes a healthy PR look like zero checks — the bug that produced false not_triggered loops. There is no fine-grained permission that fixes this; only a classic PAT (repo scope) or a GitHub App can read check-runs.

Re-fetch CI state with gh run list instead — it hits GET /actions/runs (Actions:read, which the token has) and reports each workflow run's status and conclusion directly, no GraphQL:

gh run list --branch <BRANCH> --repo <OWNER/REPO> \
  --json databaseId,status,conclusion,workflowName,startedAt,headSha --limit 50

Classify only the runs whose .headSha equals the PR head commit (gh pr view <PR_NUMBER> --repo <OWNER/REPO> --json headRefOid --jq .headRefOid); runs on the branch with a different SHA are stale from earlier pushes.

| Run state | Meaning |
| --- | --- |
| `status != "completed"` | pending |
| completed + `conclusion` ∈ `{success, neutral, skipped}` | pass |
| completed + `conclusion` ∈ `{failure, timed_out, cancelled, action_required, startup_failure, null}` | fail |
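As a minimal sketch, the table and the head-commit rule could be expressed as two small shell helpers. The names `classify_run` and `is_head_run` are hypothetical, and the `unknown` fallback for values outside the table (such as `stale`) is an assumption rather than documented behaviour.

```shell
# Hypothetical helper: classify one workflow run as pending, pass, or fail.
# Arguments: <status> <conclusion>
classify_run() {
  local status="$1" conclusion="${2:-}"
  if [ "$status" != "completed" ]; then
    echo "pending"
    return 0
  fi
  case "$conclusion" in
    success|neutral|skipped)
      echo "pass" ;;
    failure|timed_out|cancelled|action_required|startup_failure|""|null)
      echo "fail" ;;
    *)
      # Assumption: values outside the table are surfaced, not guessed.
      echo "unknown" ;;
  esac
}

# Hypothetical helper: succeeds only when a run belongs to the PR head commit.
# Arguments: <run headSha> <PR headRefOid>
is_head_run() {
  [ "$1" = "$2" ]
}
```

A caller would skip every run for which `is_head_run` fails and classify the rest.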

2. Diagnose a failure

Step 1 — Start from the snapshot

Read {{CHECKS_JSON}} in the prompt first — it is the snapshot of the head commit's runs (name, status, conclusion, bucket) at loop entry. Only re-fetch if you need data fresher than the snapshot, and use gh run list to do it (see the gh CLI contract above) — never gh pr checks, which the fine-grained PAT cannot read.

Step 2 — Find the failing run

gh run list --branch <BRANCH> --repo <OWNER/REPO> \
  --json databaseId,name,status,conclusion,workflowName,headSha,url \
  --limit 10 | jq '.[] | select(.conclusion | IN("failure", "timed_out", "cancelled", "action_required", "startup_failure"))'

Step 3 — Read failed step logs

gh run view <RUN_ID> --repo <OWNER/REPO> --log-failed

--log-failed returns only the failing steps' output — much smaller than full logs.

Step 4 — Get job/step names if needed

gh run view <RUN_ID> --repo <OWNER/REPO> --json jobs
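To reduce the jobs JSON to just the failing job and step names, a filter along these lines may help. It is a sketch that assumes the `jobs` payload carries `name`, `conclusion`, and a `steps` array with the same two fields; the function name `failed_steps` is hypothetical, and `jq` must be installed.

```shell
# Hypothetical helper: reads `gh run view --json jobs` output on stdin and
# prints one "job / step" line per failed step.
failed_steps() {
  jq -r '
    .jobs[]
    | select(.conclusion == "failure")
    | .name as $job
    | (.steps // [])[]
    | select(.conclusion == "failure")
    | "\($job) / \(.name)"
  '
}

# Usage, with the same placeholders as the commands above:
#   gh run view <RUN_ID> --repo <OWNER/REPO> --json jobs | failed_steps
```

The resulting lines are in a shape that can be pasted into the **Failure:** field of the running comment.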

Step 5 — Match the root cause

| Log pattern | Likely cause | Fix |
| --- | --- | --- |
| ESLint / Lint errors | Code violates ESLint rules | Fix the lint errors in the affected files |
| Jest / test failure | Assertion failed or file error | Fix the test or the production code it covers |
| TypeScript / tsc error | Type mismatch, bad import | Fix the type error in the source |
| Build error (nx build) | Missing dep, bad import path | Fix import or tsconfig |
| Missing secret / env var | Required CI env not set | Document as unfixable: you cannot add CI secrets |
| Repeated identical failure | Same error as prior iteration | Document as unfixable: cycling wastes iterations |
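A first-pass triage of the log text could be sketched as below. The function name `match_cause`, the category labels, and every regular expression are illustrative assumptions about typical tool output, not patterns taken from this repository's CI, so treat the result as a hint to confirm by reading the log.

```shell
# Hypothetical helper: reads failed-step log text on stdin and prints a coarse
# category. The patterns are guesses at common tool output.
match_cause() {
  local log
  log="$(cat)"
  if printf '%s\n' "$log" | grep -Eq 'error TS[0-9]+'; then
    echo "typescript"
  elif printf '%s\n' "$log" | grep -Eqi 'eslint'; then
    echo "lint"
  elif printf '%s\n' "$log" | grep -Eq '(^|[[:space:]])FAIL [^[:space:]]'; then
    echo "test"
  elif printf '%s\n' "$log" | grep -Eqi '(secret|env(ironment)? ?var(iable)?).*(not set|missing|undefined)'; then
    echo "missing-secret"
  elif printf '%s\n' "$log" | grep -Eqi 'nx build|cannot find module'; then
    echo "build"
  else
    echo "unknown"
  fi
}

# Usage, with the same placeholders as the commands above:
#   gh run view <RUN_ID> --repo <OWNER/REPO> --log-failed | match_cause
```

The checks run from most to least specific, so a log that mentions several tools reports the first category that matches.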

Detecting a repeated failure: fetch the running comment body, scan for the previous iteration's **Failure:** and **Cause:** lines, and compare their error signature to the current run's log output. If the signatures match, declare unfixable rather than pushing another identical attempt.
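One way to make that comparison mechanical is sketched below. It assumes a "signature" is simply the recorded text in lower case with whitespace collapsed, which is one possible definition rather than a documented one; the names `last_field`, `signature`, and `is_repeat` are hypothetical.

```shell
# Hypothetical helper: print the value of the last "**<label>:**" line found
# in a comment body read from stdin.
last_field() {
  grep -F "**$1:**" | tail -n 1 | sed "s/^.*\*\*$1:\*\* *//"
}

# Hypothetical helper: normalise text into a comparable signature
# (lower case, runs of whitespace collapsed, ends trimmed).
signature() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' ' ' \
    | sed 's/^ //; s/ $//'
}

# Succeeds when the previous iteration recorded the same failure and cause.
# Arguments: <comment body> <current failure> <current cause>
is_repeat() {
  local prev_failure prev_cause
  prev_failure="$(printf '%s\n' "$1" | last_field Failure)"
  prev_cause="$(printf '%s\n' "$1" | last_field Cause)"
  [ -n "$prev_failure" ] &&
    [ "$(signature "$prev_failure")" = "$(signature "$2")" ] &&
    [ "$(signature "$prev_cause")" = "$(signature "$3")" ]
}
```

Exact matching is deliberately strict; a looser comparison (for example on the error code alone) may suit logs whose wording varies between runs.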

3. Apply a fix and push

Edit and commit

git -C <WORKDIR> add <changed-files>
git -C <WORKDIR> commit -m "🐛 fix(ci): <short description>"

Validate the projects you touched with targeted nx test <project> / nx lint <project> before pushing, then let CI confirm — CI re-runs the full nx affected gate on the PR as the authoritative check (that is what you are monitoring).

Push

git push -u origin <BRANCH> --force-with-lease --no-verify

--force-with-lease overwrites the remote tip only if it still matches your last-known remote ref, so the push is rejected rather than clobbering commits someone else pushed in the meantime. Push with --no-verify: the full gate already ran in the validation phase before the first push and CI re-runs it as the authoritative check, so re-running the local pre-push hook here only duplicates CI's work, and a non-zero hook exit would abort the push before the fix reaches CI.

4. Single running PR comment

There is exactly one running comment per PR, identified by the hidden marker `<!-- michel-ci-monitor -->` on its first line. Posting a second comment fragments the history and makes the PR timeline hard to follow, so always find the existing one and patch it in place.

Look up the comment ID

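gh api --paginate "repos/<OWNER>/<REPO>/issues/<PR_NUMBER>/comments" \
  --jq '.[] | select(.body | contains("<!-- michel-ci-monitor -->")) | .id' | head -n 1

--paginate ensures the comment is found even if the PR has more than 30 comments. gh applies the --jq filter to each page separately, so the filter emits only matching IDs (nothing for pages without the marker) and head -n 1 keeps the first match. An empty result means no running comment exists yet.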

Create the comment (first failure, no comment exists yet)

gh pr comment <PR_NUMBER> --repo <OWNER>/<REPO> --body "$(cat <<'BODY'
<!-- michel-ci-monitor -->
## Michel — CI Monitor

### Iteration 1

**Failure:** <workflow / job / step>
**Cause:** <1-2 sentences>
**Fix applied:** <what was changed and why>
**Commit:** <git sha>
BODY
)"

Update the comment (subsequent iterations)

Write the new body to a tempfile to avoid shell-quoting issues with $, backticks, or multi-line content:

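# Emit only matching IDs so that pages without the marker print nothing.
COMMENT_ID=$(gh api --paginate "repos/<OWNER>/<REPO>/issues/<PR_NUMBER>/comments" \
  --jq '.[] | select(.body | contains("<!-- michel-ci-monitor -->")) | .id' | head -n 1)

EXISTING=$(gh api "repos/<OWNER>/<REPO>/issues/comments/${COMMENT_ID}" --jq '.body')

# Copy the existing body verbatim, then append the new section through a quoted
# heredoc so that $ and backticks in the new text are not expanded by the shell.
printf '%s\n\n' "$EXISTING" > /tmp/comment-body.md
cat >> /tmp/comment-body.md <<'BODY'
### Iteration <N>

**Failure:** <details>
**Cause:** <details>
**Fix applied:** <details or 'Unfixable (see below)'>
BODY

gh api -X PATCH "repos/<OWNER>/<REPO>/issues/comments/${COMMENT_ID}" \
  -F body=@/tmp/comment-body.md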

Comment body structure — cumulative, one ### Iteration N heading per invocation:

What workflow / job / step failed
Root cause (from the logs)
Fix applied — or "Unfixable" with reason
Git commit SHA if a fix was pushed
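The four items above could be rendered by a small helper such as the one below. This is a sketch only: `render_iteration` is a hypothetical name, and passing every value through `printf '%s'` is one way to keep `$` and backticks in log excerpts from being interpreted by the shell.

```shell
# Hypothetical helper: print one "### Iteration N" section.
# Arguments: <n> <failure> <cause> <fix> [commit]
render_iteration() {
  printf '### Iteration %s\n\n' "$1"
  printf '**Failure:** %s\n' "$2"
  printf '**Cause:** %s\n' "$3"
  printf '**Fix applied:** %s\n' "$4"
  # The commit line is emitted only when a fix was actually pushed.
  if [ -n "${5:-}" ]; then
    printf '**Commit:** %s\n' "$5"
  fi
}

# Usage: append a new section to the tempfile that is later sent with PATCH.
#   render_iteration 2 "CI / lint" "Unused import" "Removed it" abc1234 \
#     >> /tmp/comment-body.md
```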

5. Terminal states — unfixable or timeout

When you cannot fix the problem or the workflow is stuck, update the running comment and return without committing anything.

Unfixable:

### Iteration <N> — Unfixable

**Failure:** <details>
**Reason not fixable:** <missing CI secret | repeated identical failure | infra issue | error outside codebase>
**Recommendation:** Manual intervention required.

Timeout (stuck):

### Timeout reached

Workflow **<name>** has been running for >20 minutes — likely a CI infrastructure issue, not a code problem.
No fix was attempted. Manual re-run or intervention may be needed.

CI did not trigger (not_triggered):

First confirm with gh run list --branch <BRANCH> --repo <OWNER/REPO> whether any run exists. Zero runs ⇒ the workflow never started — most often no self-hosted runner was available, a branch/path filter excluded the PR, or the run needs approval.

### CI did not trigger

No checks registered for this PR within the monitoring window. CI did not start
— likely no self-hosted runner available, a branch/path filter, or a required run
approval. No fix was attempted. Verify runner availability / workflow triggers,
then re-run.