monitor-ci

Monitor Nx Cloud CI pipeline and handle self-healing fixes. USE WHEN user says "monitor ci", "watch ci", "ci monitor", "watch ci for this branch", "track ci", "check ci status", wants to track CI status, or needs help with self-healing CI fixes. ALWAYS USE THIS SKILL instead of native CI provider tools (gh, glab, etc.) for CI monitoring.

By udayvunnam · Updated 3/8/2026

name: monitor-ci
description: Monitor Nx Cloud CI pipeline and handle self-healing fixes. USE WHEN user says "monitor ci", "watch ci", "ci monitor", "watch ci for this branch", "track ci", "check ci status", wants to track CI status, or needs help with self-healing CI fixes. ALWAYS USE THIS SKILL instead of native CI provider tools (gh, glab, etc.) for CI monitoring.

Monitor CI Command

You are the orchestrator for monitoring Nx Cloud CI pipeline executions and handling self-healing fixes. You spawn subagents to interact with Nx Cloud, run deterministic decision scripts, and take action based on the results.

Context

  • Current Branch: !git branch --show-current
  • Current Commit: !git rev-parse --short HEAD
  • Remote Status: !git status -sb | head -1

User Instructions

$ARGUMENTS

Important: If user provides specific instructions, respect them over default behaviors described below.

Configuration Defaults

| Setting | Default | Description |
| --- | --- | --- |
| --max-cycles | 10 | Maximum agent-initiated CI Attempt cycles before timeout |
| --timeout | 120 | Maximum duration in minutes |
| --verbosity | medium | Output level: minimal, medium, verbose |
| --branch | (auto-detect) | Branch to monitor |
| --fresh | false | Ignore previous context, start fresh |
| --auto-fix-workflow | false | Attempt common fixes for pre-CI-Attempt failures (e.g., lockfile updates) |
| --new-cipe-timeout | 10 | Minutes to wait for new CI Attempt after action |
| --local-verify-attempts | 3 | Max local verification + enhance cycles before pushing to CI |

Parse any overrides from $ARGUMENTS and merge with defaults.
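
The merge step can be sketched as follows. This is a minimal illustration, not the skill's actual parser: `parseOverrides` and `DEFAULTS` are hypothetical names, boolean flags are assumed to be passed bare (`--fresh`), and free-form instructions in $ARGUMENTS are deliberately left for the agent to interpret.

```javascript
// Defaults copied from the table above; keys are flag names without the leading "--".
const DEFAULTS = {
  "max-cycles": 10,
  timeout: 120, // minutes
  verbosity: "medium",
  branch: null, // null stands for "auto-detect from git"
  fresh: false,
  "auto-fix-workflow": false,
  "new-cipe-timeout": 10, // minutes
  "local-verify-attempts": 3,
};

// Hypothetical helper: overlays "--flag value" and bare "--flag" tokens on DEFAULTS.
function parseOverrides(argString) {
  const tokens = argString.trim().split(/\s+/).filter(Boolean);
  const config = { ...DEFAULTS };
  for (let i = 0; i < tokens.length; i++) {
    if (!tokens[i].startsWith("--")) continue; // free-form text is not a flag
    const key = tokens[i].slice(2);
    if (!(key in DEFAULTS)) continue; // unknown flags are ignored in this sketch
    const defaultValue = DEFAULTS[key];
    if (typeof defaultValue === "boolean") {
      config[key] = true; // assumption: boolean flags take no value
    } else {
      const value = tokens[++i];
      if (value === undefined) break; // flag given without a value
      config[key] = typeof defaultValue === "number" ? Number(value) : value;
    }
  }
  return config;
}
```

For example, `parseOverrides("--max-cycles 5 --fresh")` keeps every default except `max-cycles` and `fresh`.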

Nx Cloud Connection Check

CRITICAL: Before starting the monitoring loop, verify the workspace is connected to Nx Cloud.

Step 0: Verify Nx Cloud Connection

  1. Check nx.json at workspace root for nxCloudId or nxCloudAccessToken

  2. If nx.json missing OR neither property exists → exit with:

    Nx Cloud not connected. Unlock 70% faster CI and auto-fix broken PRs with https://nx.dev/nx-cloud
    
  3. If connected → continue to main loop
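
A minimal sketch of the Step 0 check, assuming the caller has already read nx.json and passes its text (or `null` when the file is missing). `checkNxCloudConnection` is a hypothetical helper; it only looks at top-level properties, and treating an unparsable nx.json as "not connected" is an assumption of this sketch.

```javascript
const NOT_CONNECTED_MESSAGE =
  "Nx Cloud not connected. Unlock 70% faster CI and auto-fix broken PRs with https://nx.dev/nx-cloud";

// nxJsonText: contents of nx.json at the workspace root, or null if the file does not exist.
function checkNxCloudConnection(nxJsonText) {
  const notConnected = { connected: false, message: NOT_CONNECTED_MESSAGE };
  if (nxJsonText === null) return notConnected;
  let parsed;
  try {
    parsed = JSON.parse(nxJsonText);
  } catch {
    return notConnected; // assumption: invalid JSON is handled like a missing file
  }
  const hasId = Boolean(parsed && (parsed.nxCloudId || parsed.nxCloudAccessToken));
  return hasId ? { connected: true } : notConnected;
}
```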

Architecture Overview

  1. This skill (orchestrator): spawns subagents, runs scripts, prints status, does local coding work
  2. ci-monitor-subagent (haiku): calls one MCP tool (ci_information or update_self_healing_fix), returns structured result, exits
  3. ci-poll-decide.mjs (deterministic script): takes ci_information result + state, returns action + status message
  4. ci-state-update.mjs (deterministic script): manages budget gates, post-action state transitions, and cycle classification

Status Reporting

The decision script handles message formatting based on verbosity. When printing messages to the user:

  • Prepend [monitor-ci] to every message from the script's message field
  • For your own action messages (e.g. "Applying fix via MCP..."), also prepend [monitor-ci]

Anti-Patterns (NEVER DO)

CRITICAL: The following behaviors are strictly prohibited:

| Anti-Pattern | Why It's Bad |
| --- | --- |
| Using CI provider CLIs with --watch flags (e.g., gh pr checks --watch, glab ci status -w) | Bypasses Nx Cloud self-healing entirely |
| Writing custom CI polling scripts | Unreliable, pollutes context, no self-healing |
| Cancelling CI workflows/pipelines | Destructive, loses CI progress |
| Running CI checks on main agent | Wastes main agent context tokens |
| Independently analyzing/fixing CI failures while polling | Races with self-healing, causes duplicate fixes and confused state |

If this skill fails to activate, the fallback is:

  1. Use CI provider CLI for READ-ONLY status check (single call, no watch/polling flags)
  2. Immediately delegate to this skill with gathered context
  3. NEVER continue polling on main agent

CI provider CLIs are acceptable ONLY for:

  • One-time read of PR/pipeline status
  • Getting PR/branch metadata
  • NOT for continuous monitoring or watch mode

Session Context Behavior

Important: Within a Claude Code session, conversation context persists. If you press Ctrl+C to interrupt the monitor and then re-run /monitor-ci, Claude remembers the previous state and may continue from where it left off.

  • To continue monitoring: Just re-run /monitor-ci (context is preserved)
  • To start fresh: Use /monitor-ci --fresh to ignore previous context
  • For a completely clean slate: Exit Claude Code and restart claude

MCP Tool Reference

ci_information

Input:

{
  "branch": "string (optional, defaults to current git branch)",
  "select": "string (optional, comma-separated field names)",
  "pageToken": "number (optional, 0-based pagination for long strings)"
}

Field Sets for Efficient Polling:

WAIT_FIELDS:
  'cipeUrl,commitSha,cipeStatus'
  # Minimal fields for detecting new CI Attempt

LIGHT_FIELDS:
  'cipeStatus,cipeUrl,branch,commitSha,selfHealingStatus,verificationStatus,userAction,failedTaskIds,verifiedTaskIds,selfHealingEnabled,failureClassification,couldAutoApplyTasks,shortLink,confidence,confidenceReasoning,hints,selfHealingSkippedReason,selfHealingSkipMessage'
  # Status fields for determining actionable state

HEAVY_FIELDS:
  'taskOutputSummary,suggestedFix,suggestedFixReasoning,suggestedFixDescription'
  # Large content fields - fetch only when needed for fix decisions
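
The three field sets map naturally onto a small lookup. The sketch below builds a `ci_information` input object for a given polling mode; `buildCiInformationInput` and the mode names `wait`, `light`, and `heavy` are hypothetical, while the field names are copied from the lists above.

```javascript
const WAIT_FIELDS = "cipeUrl,commitSha,cipeStatus";
const LIGHT_FIELDS = [
  "cipeStatus", "cipeUrl", "branch", "commitSha", "selfHealingStatus",
  "verificationStatus", "userAction", "failedTaskIds", "verifiedTaskIds",
  "selfHealingEnabled", "failureClassification", "couldAutoApplyTasks",
  "shortLink", "confidence", "confidenceReasoning", "hints",
  "selfHealingSkippedReason", "selfHealingSkipMessage",
].join(",");
const HEAVY_FIELDS =
  "taskOutputSummary,suggestedFix,suggestedFixReasoning,suggestedFixDescription";

// Hypothetical helper: picks the select string for a mode and omits branch when not given,
// so the tool falls back to the current git branch.
function buildCiInformationInput(mode, branch) {
  const select = { wait: WAIT_FIELDS, light: LIGHT_FIELDS, heavy: HEAVY_FIELDS }[mode];
  if (!select) throw new Error(`unknown mode: ${mode}`);
  const input = { select };
  if (branch) input.branch = branch;
  return input;
}
```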

Default Behaviors by Status

The decision script returns one of the following statuses. This table defines the default behavior for each. User instructions can override any of these.

Simple exits — just report and exit:

| Status | Default Behavior |
| --- | --- |
| ci_success | Exit with success |
| cipe_canceled | Exit, CI was canceled |
| cipe_timed_out | Exit, CI timed out |
| polling_timeout | Exit, polling timeout reached |
| circuit_breaker | Exit, no progress after 5 consecutive polls |
| environment_rerun_cap | Exit, environment reruns exhausted |
| fix_auto_applying | Do NOT call MCP; self-healing handles it. Record last_cipe_url, enter wait mode. No local git ops. |
| error | Wait 60s and loop |

Statuses requiring action — see subsections below:

| Status | Summary |
| --- | --- |
| fix_apply_ready | Fix verified (all tasks or e2e-only). Apply via MCP. |
| fix_needs_local_verify | Fix has unverified non-e2e tasks. Run locally, then apply or enhance. |
| fix_needs_review | Fix verification failed/not attempted. Analyze and decide. |
| fix_failed | Self-healing failed. Fetch heavy data, attempt local fix (gate check first). |
| no_fix | No fix available. Fetch heavy data, attempt local fix (gate check first) or exit. |
| environment_issue | Request environment rerun via MCP (gate check first). |
| self_healing_throttled | Reject old fixes, attempt local fix. |
| no_new_cipe | CI Attempt never spawned. Auto-fix workflow or exit with guidance. |
| cipe_no_tasks | CI failed with no tasks. Retry once with empty commit. |

fix_apply_ready

  • Spawn UPDATE_FIX subagent with APPLY
  • Record last_cipe_url, enter wait mode

fix_needs_local_verify

The script returns verifiableTaskIds in its output.

  1. Detect package manager: pnpm-lock.yaml → pnpm nx, yarn.lock → yarn nx, otherwise npx nx
  2. Run verifiable tasks in parallel — spawn general subagents for each task
  3. If all pass → spawn UPDATE_FIX subagent with APPLY, enter wait mode
  4. If any fail → Apply Locally + Enhance Flow (see below)
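
Steps 1 and 2 can be sketched as below. `detectNxCommand` and `buildVerifyCommands` are hypothetical helpers; the sketch assumes the caller supplies the file names found at the workspace root and that each entry in verifiableTaskIds has the `project:target` form accepted by `nx run`.

```javascript
// rootFiles: names of files present at the workspace root.
function detectNxCommand(rootFiles) {
  const files = new Set(rootFiles);
  if (files.has("pnpm-lock.yaml")) return "pnpm nx";
  if (files.has("yarn.lock")) return "yarn nx";
  return "npx nx";
}

// One command per verifiable task, so each can be handed to its own subagent.
function buildVerifyCommands(verifiableTaskIds, rootFiles) {
  const nx = detectNxCommand(rootFiles);
  return verifiableTaskIds.map((taskId) => `${nx} run ${taskId}`);
}
```

Checking pnpm before yarn mirrors the order given in step 1; a workspace containing both lockfiles would resolve to `pnpm nx` under this sketch.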

fix_needs_review

Spawn FETCH_HEAVY subagent, then analyze fix content (suggestedFixDescription, suggestedFixSummary, taskFailureSummaries):

  • If fix looks correct → apply via MCP
  • If fix needs enhancement → Apply Locally + Enhance Flow
  • If fix is wrong → run ci-state-update.mjs gate --gate-type local-fix. If not allowed, print message and exit. Otherwise → Reject + Fix From Scratch Flow

fix_failed / no_fix

  1. Spawn FETCH_HEAVY subagent for taskFailureSummaries
  2. Run ci-state-update.mjs gate --gate-type local-fix. If not allowed, print message and exit.
  3. Otherwise attempt local fix (the gate has already incremented the counter)
  4. If successful → commit, push, enter wait mode. If not → exit with failure.

environment_issue

  1. Run ci-state-update.mjs gate --gate-type env-rerun. If not allowed, print message and exit.
  2. Spawn UPDATE_FIX subagent with RERUN_ENVIRONMENT_STATE
  3. Enter wait mode with last_cipe_url set

self_healing_throttled

Spawn FETCH_HEAVY subagent for selfHealingSkipMessage.

  1. Parse throttle message for CI Attempt URLs (regex: /cipes/{id})
  2. Reject previous fixes — for each URL: spawn FETCH_THROTTLE_INFO to get shortLink, then UPDATE_FIX with REJECT
  3. Attempt local fix: Run ci-state-update.mjs gate --gate-type local-fix. If not allowed → skip to step 4. Otherwise use failedTaskIds and taskFailureSummaries for context.
  4. Fallback if local fix not possible or budget exhausted: push empty commit (git commit --allow-empty -m "ci: rerun after rejecting throttled fixes"), enter wait mode
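
Step 1 can be sketched with a single regular expression. `extractCipeUrls` is a hypothetical helper, and the character class used for the id is an assumption, since the source only gives the pattern as `/cipes/{id}`. The URLs in the usage example are made up.

```javascript
// Returns the unique CI Attempt URLs mentioned in a throttle message, in order of appearance.
function extractCipeUrls(message) {
  // Assumption: ids consist of letters, digits, "_" and "-".
  const pattern = /https?:\/\/[^\s"'<>)]+\/cipes\/[A-Za-z0-9_-]+/g;
  return [...new Set(message.match(pattern) ?? [])];
}

const sample =
  "Throttled: pending fixes at https://cloud.example.test/cipes/abc123 and " +
  "https://cloud.example.test/cipes/def456, plus https://cloud.example.test/cipes/abc123.";
console.log(extractCipeUrls(sample));
```

Deduplicating matters here because each URL triggers a FETCH_THROTTLE_INFO and an UPDATE_FIX subagent in step 2.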

no_new_cipe

  1. Report to user: no CI Attempt found, suggest checking CI provider
  2. If --auto-fix-workflow: detect package manager, run install, commit lockfile if changed, enter wait mode
  3. Otherwise: exit with guidance

cipe_no_tasks

  1. Report to user: CI failed with no tasks recorded
  2. Retry: git commit --allow-empty -m "chore: retry ci [monitor-ci]" + push, enter wait mode
  3. If retry also returns cipe_no_tasks: exit with failure

Fix Action Flows

Apply via MCP

Spawn UPDATE_FIX subagent with APPLY. New CI Attempt spawns automatically. No local git ops.

Apply Locally + Enhance Flow

  1. nx-cloud apply-locally <shortLink> (sets state to APPLIED_LOCALLY)
  2. Enhance code to fix failing tasks
  3. Run failing tasks to verify
  4. If still failing → run ci-state-update.mjs gate --gate-type local-fix. If not allowed, commit current state and push (let CI be final judge). Otherwise loop back to enhance.
  5. If passing → commit and push, enter wait mode

Reject + Fix From Scratch Flow

  1. Run ci-state-update.mjs gate --gate-type local-fix. If not allowed, print message and exit.
  2. Spawn UPDATE_FIX subagent with REJECT
  3. Fix from scratch locally
  4. Commit and push, enter wait mode

Environment vs Code Failure Recognition

When any local fix path runs a task and it fails, assess whether the failure is a code issue or an environment/tooling issue before running the gate script.

Indicators of environment/tooling failures (non-exhaustive): command not found / binary missing, OOM / heap allocation failures, permission denied, network timeouts / DNS failures, missing system libraries, Docker/container issues, disk space exhaustion.

When detected → bail immediately, do NOT run gate (no budget consumed). Report that the failure is an environment/tooling issue, not a code bug.

Code failures (compilation errors, test assertion failures, lint violations, type errors) are genuine candidates for local fix attempts and proceed normally through the gate.
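
One way to approximate this assessment is a pattern scan over the task output. The sketch below is a heuristic only: `classifyFailure` is a hypothetical helper, and the pattern list is an illustrative, non-exhaustive guess at how the indicators above tend to appear in logs. Anything that matches no pattern is treated as a code failure.

```javascript
// Illustrative log signatures for the environment/tooling indicators listed above.
const ENVIRONMENT_PATTERNS = [
  /command not found/i,
  /heap out of memory|out of memory/i,
  /permission denied|EACCES/i,
  /ETIMEDOUT|ENOTFOUND|EAI_AGAIN/,
  /error while loading shared libraries/i,
  /cannot connect to the docker daemon/i,
  /no space left on device|ENOSPC/i,
];

function classifyFailure(taskOutput) {
  const hit = ENVIRONMENT_PATTERNS.find((pattern) => pattern.test(taskOutput));
  return hit
    ? { kind: "environment", matched: hit.source } // bail, do not run the gate
    : { kind: "code" }; // proceed through the gate as usual
}
```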

Git Safety

  • NEVER use git add -A or git add . — always stage specific files by name
  • Users may have concurrent local changes that must NOT be committed
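
A small guard can make this rule hard to violate by accident. The sketch below builds the argument list for `git add` from explicit file paths and refuses the blanket forms; `buildGitAddArgs` is a hypothetical helper, and rejecting `--all` (the long form of `-A`) is an addition of this sketch.

```javascript
const BLANKET_FORMS = new Set(["-A", "--all", "."]);

// Returns argv for `git add` that stages only the named files.
function buildGitAddArgs(files) {
  if (files.length === 0) throw new Error("no files to stage");
  for (const file of files) {
    if (BLANKET_FORMS.has(file)) {
      throw new Error(`refusing blanket staging: ${file}`);
    }
  }
  // "--" separates options from paths, so a path can never be read as a flag.
  return ["add", "--", ...files];
}
```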

Commit Message Format

git commit -m "fix(<projects>): <brief description>

Failed tasks: <taskId1>, <taskId2>
Local verification: passed|enhanced|failed-pushing-to-ci"
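
The format above can be produced by a small builder. `buildCommitMessage` is a hypothetical helper, and joining several project names with a comma inside the scope is an assumption, since the template only shows `<projects>`.

```javascript
const VERIFICATION_STATES = ["passed", "enhanced", "failed-pushing-to-ci"];

function buildCommitMessage({ projects, description, failedTaskIds, verification }) {
  if (!VERIFICATION_STATES.includes(verification)) {
    throw new Error(`unknown verification state: ${verification}`);
  }
  return [
    `fix(${projects.join(",")}): ${description}`,
    "", // blank line between subject and body
    `Failed tasks: ${failedTaskIds.join(", ")}`,
    `Local verification: ${verification}`,
  ].join("\n");
}
```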

Main Loop

Step 1: Initialize Tracking

cycle_count = 0            # Only incremented for agent-initiated cycles (counted against --max-cycles)
start_time = now()
no_progress_count = 0
local_verify_count = 0
env_rerun_count = 0
last_cipe_url = null
expected_commit_sha = null
agent_triggered = false    # Set true after monitor takes an action that triggers new CI Attempt
poll_count = 0
wait_mode = false
prev_status = null
prev_cipe_status = null
prev_sh_status = null
prev_verification_status = null
prev_failure_classification = null

Step 2: Polling Loop

Repeat until done:

2a. Spawn subagent (FETCH_STATUS)

Determine select fields based on mode:

  • Wait mode: use WAIT_FIELDS (cipeUrl,commitSha,cipeStatus)
  • Normal mode (first poll or after newCipeDetected): use LIGHT_FIELDS

Task(
  agent: "ci-monitor-subagent",
  model: haiku,
  prompt: "FETCH_STATUS for branch '<branch>'.
           select: '<fields>'"
)

The subagent calls ci_information and returns a JSON object with the requested fields. This is a foreground call — wait for the result.

2b. Run decision script

node <skill_dir>/scripts/ci-poll-decide.mjs '<subagent_result_json>' <poll_count> <verbosity> \
  [--wait-mode] \
  [--prev-cipe-url <last_cipe_url>] \
  [--expected-sha <expected_commit_sha>] \
  [--prev-status <prev_status>] \
  [--timeout <timeout_seconds>] \
  [--new-cipe-timeout <new_cipe_timeout_seconds>] \
  [--env-rerun-count <env_rerun_count>] \
  [--no-progress-count <no_progress_count>] \
  [--prev-cipe-status <prev_cipe_status>] \
  [--prev-sh-status <prev_sh_status>] \
  [--prev-verification-status <prev_verification_status>] \
  [--prev-failure-classification <prev_failure_classification>]

The script outputs a single JSON line: { action, code, message, delay?, noProgressCount, envRerunCount, fields?, newCipeDetected?, verifiableTaskIds? }
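
Assembling the argument list from the tracking state might look like the sketch below. `buildDecideArgs` is a hypothetical helper. It assumes `state` uses the field names from Step 1 and `config` holds the values from Configuration Defaults. Because the placeholders are named `<timeout_seconds>` and `<new_cipe_timeout_seconds>` while the defaults are in minutes, the sketch multiplies by 60; that conversion is an inference, not something the source states.

```javascript
// Returns the arguments that follow the script path, as an array (no shell quoting needed).
function buildDecideArgs(subagentResultJson, state, config) {
  const args = [subagentResultJson, String(state.poll_count), config.verbosity];
  if (state.wait_mode) args.push("--wait-mode");
  const optional = [
    ["--prev-cipe-url", state.last_cipe_url],
    ["--expected-sha", state.expected_commit_sha],
    ["--prev-status", state.prev_status],
    ["--timeout", config.timeout * 60], // assumed minutes -> seconds
    ["--new-cipe-timeout", config["new-cipe-timeout"] * 60], // assumed minutes -> seconds
    ["--env-rerun-count", state.env_rerun_count],
    ["--no-progress-count", state.no_progress_count],
    ["--prev-cipe-status", state.prev_cipe_status],
    ["--prev-sh-status", state.prev_sh_status],
    ["--prev-verification-status", state.prev_verification_status],
    ["--prev-failure-classification", state.prev_failure_classification],
  ];
  for (const [flag, value] of optional) {
    // Flags whose value is still unset are left out entirely.
    if (value !== null && value !== undefined) args.push(flag, String(value));
  }
  return args;
}
```

Passing the array to something like Node's `child_process.execFile("node", [scriptPath, ...args])` avoids quoting problems with the JSON argument.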

2c. Process script output

Parse the JSON output and update tracking state:

  • no_progress_count = output.noProgressCount
  • env_rerun_count = output.envRerunCount
  • prev_cipe_status = subagent_result.cipeStatus
  • prev_sh_status = subagent_result.selfHealingStatus
  • prev_verification_status = subagent_result.verificationStatus
  • prev_failure_classification = subagent_result.failureClassification
  • prev_status = output.action + ":" + (output.code || subagent_result.cipeStatus)
  • poll_count++

Based on action:

  • action == "poll": Print output.message, sleep output.delay seconds, go to 2a
    • If output.newCipeDetected: clear wait mode (set wait_mode = false)
  • action == "wait": Print output.message, sleep output.delay seconds, go to 2a
  • action == "done": Proceed to Step 3 with output.code
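
The state update and dispatch in 2c can be sketched as a pure function. `applyDecision` is a hypothetical helper that returns the next tracking state plus a description of what to do next; defaulting a missing `delay` to 0 is an assumption of this sketch.

```javascript
function applyDecision(state, subagentResult, output) {
  const next = {
    ...state,
    no_progress_count: output.noProgressCount,
    env_rerun_count: output.envRerunCount,
    prev_cipe_status: subagentResult.cipeStatus,
    prev_sh_status: subagentResult.selfHealingStatus,
    prev_verification_status: subagentResult.verificationStatus,
    prev_failure_classification: subagentResult.failureClassification,
    prev_status: `${output.action}:${output.code || subagentResult.cipeStatus}`,
    poll_count: state.poll_count + 1,
  };
  switch (output.action) {
    case "poll":
      if (output.newCipeDetected) next.wait_mode = false; // new CI Attempt seen
      return { state: next, step: "sleep-then-poll", message: output.message, delaySeconds: output.delay ?? 0 };
    case "wait":
      return { state: next, step: "sleep-then-poll", message: output.message, delaySeconds: output.delay ?? 0 };
    case "done":
      return { state: next, step: "handle-code", code: output.code };
    default:
      throw new Error(`unexpected action: ${output.action}`);
  }
}
```

Returning a new object instead of mutating `state` keeps each poll's before and after values available for logging.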

Step 3: Handle Actionable Status

When decision script returns action == "done":

  1. Run cycle-check (Step 4) before handling the code
  2. Check the returned code
  3. Look up default behavior in the table above
  4. Check if user instructions override the default
  5. Execute the appropriate action
  6. If action expects new CI Attempt, update tracking (see Step 3a)
  7. If action results in looping, go to Step 2

Spawning subagents for actions

Several statuses require fetching heavy data or calling MCP:

  • fix_apply_ready: Spawn UPDATE_FIX subagent with APPLY
  • fix_needs_local_verify: Spawn FETCH_HEAVY subagent for fix details before local verification
  • fix_needs_review: Spawn FETCH_HEAVY subagent → get suggestedFixDescription, suggestedFixSummary, taskFailureSummaries
  • fix_failed / no_fix: Spawn FETCH_HEAVY subagent → get taskFailureSummaries for local fix context
  • environment_issue: Spawn UPDATE_FIX subagent with RERUN_ENVIRONMENT_STATE
  • self_healing_throttled: Spawn FETCH_HEAVY subagent → get selfHealingSkipMessage; then FETCH_THROTTLE_INFO + UPDATE_FIX for each old fix

Step 3a: Track State for New-CI-Attempt Detection

After actions that should trigger a new CI Attempt, run:

node <skill_dir>/scripts/ci-state-update.mjs post-action \
  --action <type> \
  --cipe-url <current_cipe_url> \
  --commit-sha <git_rev_parse_HEAD>

Action types: fix-auto-applying, apply-mcp, apply-local-push, reject-fix-push, local-fix-push, env-rerun, auto-fix-push, empty-commit-push

The script returns { waitMode, pollCount, lastCipeUrl, expectedCommitSha, agentTriggered }. Update all tracking state from the output, then go to Step 2.
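
A sketch of the two halves of this step: building the post-action arguments and folding the script's output back into the tracking state. Both helper names are hypothetical; the action types and output keys are the ones listed above, mapped onto the Step 1 variable names.

```javascript
const ACTION_TYPES = [
  "fix-auto-applying", "apply-mcp", "apply-local-push", "reject-fix-push",
  "local-fix-push", "env-rerun", "auto-fix-push", "empty-commit-push",
];

function buildPostActionArgs(action, cipeUrl, commitSha) {
  if (!ACTION_TYPES.includes(action)) throw new Error(`unknown action type: ${action}`);
  return ["post-action", "--action", action, "--cipe-url", cipeUrl, "--commit-sha", commitSha];
}

// scriptOutput: the parsed JSON object printed by the post-action call.
function mergePostAction(state, scriptOutput) {
  return {
    ...state,
    wait_mode: scriptOutput.waitMode,
    poll_count: scriptOutput.pollCount,
    last_cipe_url: scriptOutput.lastCipeUrl,
    expected_commit_sha: scriptOutput.expectedCommitSha,
    agent_triggered: scriptOutput.agentTriggered,
  };
}
```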

Step 4: Cycle Classification and Progress Tracking

When the decision script returns action == "done", run cycle-check before handling the code:

node <skill_dir>/scripts/ci-state-update.mjs cycle-check \
  --code <code> \
  [--agent-triggered] \
  --cycle-count <cycle_count> --max-cycles <max_cycles> \
  --env-rerun-count <env_rerun_count>

The script returns { cycleCount, agentTriggered, envRerunCount, approachingLimit, message }. Update tracking state from the output.

  • If approachingLimit → ask user whether to continue (with 5 or 10 more cycles) or stop monitoring
  • If previous cycle was NOT agent-triggered (human pushed), log that human-initiated push was detected
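
The cycle-check call has one optional flag, which is easy to get wrong when building arguments by hand. The sketch below shows the argument assembly and how the result might be folded into state; `buildCycleCheckArgs` and `mergeCycleCheck` are hypothetical helpers, and the `askUser` field is an invention of this sketch to signal the approachingLimit prompt.

```javascript
function buildCycleCheckArgs(code, state, maxCycles) {
  const args = ["cycle-check", "--code", code];
  if (state.agent_triggered) args.push("--agent-triggered"); // flag only, no value
  args.push(
    "--cycle-count", String(state.cycle_count),
    "--max-cycles", String(maxCycles),
    "--env-rerun-count", String(state.env_rerun_count),
  );
  return args;
}

// scriptOutput: the parsed JSON object printed by the cycle-check call.
function mergeCycleCheck(state, scriptOutput) {
  return {
    state: {
      ...state,
      cycle_count: scriptOutput.cycleCount,
      agent_triggered: scriptOutput.agentTriggered,
      env_rerun_count: scriptOutput.envRerunCount,
    },
    askUser: Boolean(scriptOutput.approachingLimit), // prompt for more cycles or stop
    message: scriptOutput.message,
  };
}
```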

Progress Tracking

  • no_progress_count, circuit breaker (5 polls), and backoff reset are handled by ci-poll-decide.mjs (progress = any change in cipeStatus, selfHealingStatus, verificationStatus, or failureClassification)
  • env_rerun_count reset on non-environment status is handled by ci-state-update.mjs cycle-check
  • On new CI Attempt detected (poll script returns newCipeDetected) → reset local_verify_count = 0, env_rerun_count = 0

Error Handling

| Error | Action |
| --- | --- |
| Git rebase conflict | Report to user, exit |
| nx-cloud apply-locally fails | Reject fix via MCP (action: "REJECT"), then attempt manual patch (Reject + Fix From Scratch Flow) or exit |
| MCP tool error | Retry once; if it fails again, report to user |
| Subagent spawn failure | Retry once; if it fails again, exit with error |
| Decision script error | Treat as error status, increment no_progress_count |
| No new CI Attempt detected | If --auto-fix-workflow, try lockfile update; otherwise report to user with guidance |
| Lockfile auto-fix fails | Report to user, exit with guidance to check CI logs |

User Instruction Examples

Users can override default behaviors:

| Instruction | Effect |
| --- | --- |
| "never auto-apply" | Always prompt before applying any fix |
| "always ask before git push" | Prompt before each push |
| "reject any fix for e2e tasks" | Auto-reject if failedTaskIds contains e2e |
| "apply all fixes regardless of verification" | Skip verification check, apply everything |
| "if confidence < 70, reject" | Check confidence field before applying |
| "run 'nx affected -t typecheck' before applying" | Add local verification step |
| "auto-fix workflow failures" | Attempt lockfile updates on pre-CI-Attempt failures |
| "wait 45 min for new CI Attempt" | Override new-CI-Attempt timeout (default: 10 min) |

Install via CLI

npx skills add https://github.com/udayvunnam/xng-breadcrumb --skill monitor-ci