build-monitor - SKILL.md Agent Skill

name: build-monitor description: Continuously monitors Buildkite pipeline builds, detects failures, investigates root causes, fixes issues, and pushes fixes. Runs a polling loop that checks build status at configurable intervals for a configurable duration. Use when the user says "monitor builds", "watch pipeline", "watch CI", "continuous monitoring", "keep checking builds", or wants automated build-fix cycles.

Buildkite Build Monitor

Continuously monitor the Buildkite pipeline, detect failures, investigate root causes, fix code issues, perform adversarial review, and push fixes — all in an automated loop.

Prerequisites — Authentication Check

Before starting the monitoring loop, verify ALL required authentication is in place. Stop and report any failures before proceeding.

Step 1: Buildkite CLI

which bk || echo "FAIL: bk CLI not installed — run: brew tap buildkite/buildkite && brew install buildkite/buildkite/bk"
bk auth status 2>&1 | head -5

If not authenticated, ask the user to run bk auth login in a separate terminal — it requires interactive browser OAuth.

If org is not selected:

bk auth switch mockserver

Step 2: GitHub CLI

gh auth status 2>&1 | head -3

Needed for pushing fixes and creating PRs if required.

Step 3: Git Status

git status --porcelain
git branch --show-current

Verify:

Working tree is clean (no uncommitted changes that would conflict with fixes)
On master branch (or confirm which branch to monitor)

Step 4: Report Authentication Status

Print a summary table:

Authentication Status:
  Buildkite CLI: OK (org: mockserver)
  GitHub CLI:    OK
  Git:           clean, on master

If any check fails, stop and report — do not start monitoring.

Monitoring Loop

Parameters

Parameter	Default	Description
`interval`	10 minutes	Time between checks
`duration`	90 minutes	Total monitoring duration
`branch`	`master`	Branch to monitor (ignore other branches)
`auto_fix`	`false` (report only)	When `false`, investigate and report fixes only — do not change files. When `true`, the loop fixes, runs the full commit-workflow gate chain, then commits and pushes autonomously per the DVRR operating model (gate failure ⇒ no commit). Default off so an unattended monitor does not change master unless explicitly enabled.

Calculate total checks: duration / interval (e.g., 90/10 = 9 checks).

Check Procedure

For each check iteration:

1. Fetch Recent Builds

TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/mockserver/builds?per_page=10" \
  | python3 -c "
import json, sys
builds = json.load(sys.stdin)
for b in builds:
    state = b['state']
    num = b['number']
    branch = b['branch']
    msg = b['message'].split('\n')[0][:60] if b['message'] else 'N/A'
    jobs = []
    for j in b.get('jobs', []):
        if j.get('type') == 'script' and j.get('name') != ':pipeline:':
            jobs.append(f'{j.get(\"name\",\"?\")}: {j[\"state\"]}')
    print(f'#{num} [{state:>10}] {branch[:30]:30} {msg}')
    if jobs:
        print(f'    Jobs: {\", \".join(jobs)}')
"

2. Classify Builds

For each build on the monitored branch:

State	Action
`passed`	Log as healthy. No action needed.
`running`	Log progress. Check again next interval.
`failed`	Investigate if not already investigated in this session.
`skipped`	Normal (superseded by newer commit). Ignore.
`canceled`	Log. No action.

Track investigated build numbers to avoid re-investigating the same failure.

3. Investigate Failures

For each new failed build:

a. Classify the failure type:

TOKEN=$(bk auth token)
# Get the failed build details
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/mockserver/builds/{number}" \
  | python3 -c "
import json, sys
b = json.load(sys.stdin)
for j in b.get('jobs', []):
    if j.get('state') == 'failed' and j.get('type') == 'script':
        print(f'job_id={j[\"id\"]} name={j.get(\"name\",\"?\")} exit={j.get(\"exit_status\",\"?\")}')
        agent = j.get('agent', {})
        print(f'agent_state={agent.get(\"connection_state\",\"?\")}')
"

exit_status	agent_state	Diagnosis
`1`	`connected`/`disconnected`	Build/test failure — investigate logs
`-1`	`lost`	Agent died (spot termination, OOM) — infrastructure issue, no code fix needed
`0` with failed state	any	Unusual — check pipeline config

b. For build/test failures (exit_status=1):

Launch the pipeline-investigator subagent:

Task(subagent_type="pipeline-investigator", prompt="Investigate Buildkite build #{number}...")

The investigator will return:

Exact error messages
Root cause analysis
Affected files/modules
Suggested fix

c. For infrastructure failures (exit_status=-1, agent lost):

Log the failure as infrastructure-related. Optionally trigger a rebuild:

bk build rebuild {number} -p mockserver -y

Do NOT attempt code fixes for infrastructure failures.

4. Fix Code Issues

If the investigator identifies a code issue:

a. Understand the fix:

Read the affected source files
Understand the surrounding code context and conventions
Plan the minimal fix

b. Implement the fix:

Edit only the necessary files
Follow existing code style and conventions
Do NOT add comments unless the code is genuinely confusing

c. Validate locally (if possible):

For Java changes, run the specific failing test:

cd mockserver && ./mvnw test -pl {module} -Dtest={TestClassName}#{testMethodName} -Djava.security.egd=file:/dev/./urandom

Note: Full integration tests require the Docker CI image and may not run locally. Unit tests should run.

5. Adversarial Review

Before committing, run the adversarial review defined in .opencode/rules/commit-workflow.md Step 4 (a review-cheap subagent on a different model with fresh context, applying the 8-lens review constitution):

Task(subagent_type="review-cheap", prompt="Adversarially review the following changes using .opencode/rules/review-constitution.md. Verdict PASS or BLOCK: {diff}")

If the verdict is BLOCK, address the feedback, re-verify, and re-run the review before committing.

6. Commit and Push

Run only when auto_fix is enabled (it is off by default — see Parameters). When enabled, the gate chain is the authority to ship, per the DVRR operating model (.opencode/rules/operating-model.md): follow the full pre-commit workflow in .opencode/rules/commit-workflow.md (classify → validate → changelog → adversarial review with a PASS verdict → re-verify after any fix), then commit and push to master autonomously — no separate human approval step.

Fail-closed: if any gate fails (tests red, review BLOCK, review subagent unavailable), do NOT commit — report the failure and continue monitoring. A user can interject at any time to halt or amend. Report the fix you applied:

Applied fix for build #{number}:
{git diff --stat output}

Commit workflow:

a. Classify changed files:

git diff --name-only

b. Stage by explicit path (NEVER git add .):

git add path/to/file1.java path/to/file2.java

c. Commit with descriptive message:

git commit -m "Fix {description of what was fixed}

{Brief explanation of root cause and fix}"

d. Pull and push:

git pull --rebase
git push

e. Verify new build triggered:

TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/mockserver/builds?per_page=1&branch=master" \
  | python3 -c "import json,sys; b=json.load(sys.stdin)[0]; print(f'#{b[\"number\"]} {b[\"state\"]} {b[\"commit\"][:10]}')"

7. Wait for Next Check

sleep {interval_seconds}

Status Report

After each check, print a concise status report:

=== Build Monitor Check {N}/{total} — {timestamp} ===
Build #{number}: {state} ({branch})
  {jobs summary}
Action: {none|investigating|fixing|pushed fix|waiting for result}
Next check: {timestamp}

End of Monitoring

After all checks complete, print a final summary:

=== Build Monitor Summary ===
Duration: {start} to {end}
Checks performed: {N}
Builds observed: {list of build numbers and states}
Failures investigated: {count}
Fixes pushed: {count}
Current pipeline status: {passing|failing|running}

Failure Patterns Reference

Error Pattern	Category	Typical Fix
`COMPILATION ERROR`	Build error	Fix Java source code
`Tests run:.*Failures:`	Test failure	Fix test or production code
`invalid target release`	JDK mismatch	Update Docker image or compiler config
`class file has wrong version`	JDK mismatch	Update Docker image
`OutOfMemoryError`	Resource	Increase JVM heap in build script
`exit_status: -1` + agent `lost`	Infrastructure	Rebuild (spot termination)
`Timeout`	Hanging test	Add test timeouts or fix deadlock
`Connection refused`	Port conflict	Fix parallel test isolation

Important Rules

Only fix failures on the monitored branch (default: master). Ignore dependabot PR branches unless explicitly asked.
Never amend commits that have been pushed.
Stage files by explicit path — never git add . or git add -A.
Pull before push — git pull --rebase to handle concurrent changes.
Check git status before committing — if unexpected changes appear, stop and ask the user.
Track investigated builds — don't re-investigate the same failure.
Infrastructure failures don't need code fixes — just rebuild or wait.
Rate limit rebuilds — don't trigger more than one rebuild per check interval.