pipeline-investigation - SKILL.md Agent Skill

name: pipeline-investigation
description: Investigates Buildkite pipeline failures to find root causes. Returns structured JSON to the parent for formatting. Triggers when users ask about failing pipelines, build errors, or need help debugging CI/CD issues. Accepts Buildkite build URLs or build numbers and performs deep investigation.

Buildkite Pipeline Failure Investigation

Investigate Buildkite pipeline failures systematically to identify root causes. This skill uses the Buildkite CLI (bk) as the primary interface, with REST API fallback.

Prerequisites — Buildkite CLI Authentication

The bk CLI is the preferred way to interact with Buildkite. Before investigating, ensure it is installed and authenticated.

Step 0: Check and Install the CLI

command -v bk || echo "bk CLI not installed"

If bk is not installed, install it:

brew tap buildkite/buildkite && brew install buildkite/buildkite/bk

Step 1: Check Authentication

bk auth status

If authenticated, you will see the org slug, token UUID, scopes, and user info. Proceed to investigation.

If not authenticated, you will see:

Error: you are not authenticated. Run bk auth login to authenticate, or run bk use to select a configured organization

In this case, stop and ask the user to authenticate:

Please run bk auth login in a terminal to authenticate with Buildkite via browser-based OAuth. This is a one-time setup similar to aws sso login. Once done, run bk auth switch mockserver to select the organization.

Do NOT attempt to run bk auth login yourself — it requires interactive browser OAuth that cannot be completed in a non-TTY terminal.

Step 2: Select Organization (if needed)

If bk auth status shows selected_org: "", the org hasn't been selected:

bk auth switch mockserver

Step 3: Get API Token for REST API Calls

When you need the REST API (for endpoints not covered by bk CLI):

TOKEN=$(bk auth token 2>/dev/null)
if [ -z "$TOKEN" ]; then
  echo "ERROR: bk auth token failed — run 'bk auth login' first"
  exit 1
fi
curl -sH "Authorization: Bearer $TOKEN" "https://api.buildkite.com/v2/..."

This extracts the OAuth token from the CLI's keychain storage. Always validate the token is non-empty before using it. Never ask the user to manually create API tokens — the CLI handles this automatically.

Investigation Workflow

Step 1: Parse Input

Extract organization, pipeline, and build number from the Buildkite URL or user input.

URL Pattern:

https://buildkite.com/{org}/{pipeline}/builds/{build_number}

The organization has multiple pipelines sharing the same agent pool:

mockserver — primary Java build and test
mockserver-client-node — Node.js client
mockserver-node — Node.js module
mockserver-performance-test — performance tests
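
The URL pattern above can be split into its parts with plain POSIX shell parameter expansion. The following is a minimal sketch, not part of the bk CLI: parse_build_url is a hypothetical helper name, and it assumes URLs of exactly the documented form, with an optional query string or #fragment stripped first.

```shell
# Hypothetical helper (not part of bk): prints "org pipeline build_number".
parse_build_url() {
  url=${1%%[?#]*}   # drop any query string or #job fragment
  url=${url%/}      # drop a trailing slash
  case "$url" in
    https://buildkite.com/*) rest=${url#https://buildkite.com/} ;;
    *) echo "ERROR: not a Buildkite URL: $1" >&2; return 1 ;;
  esac
  org=${rest%%/*}
  rest=${rest#*/}
  pipeline=${rest%%/*}
  rest=${rest#*/}
  case "$rest" in
    builds/*) build_number=${rest#builds/} ;;
    *) echo "ERROR: no /builds/{build_number} in: $1" >&2; return 1 ;;
  esac
  case "$build_number" in
    ''|*[!0-9]*) echo "ERROR: build number is not numeric: $build_number" >&2; return 1 ;;
  esac
  printf '%s %s %s\n' "$org" "$pipeline" "$build_number"
}

# Example with a made-up build number and fragment.
read -r ORG PIPELINE BUILD_NUMBER <<EOF
$(parse_build_url "https://buildkite.com/mockserver/mockserver-node/builds/1234#job-id")
EOF
echo "org=$ORG pipeline=$PIPELINE build=$BUILD_NUMBER"
```

The helper rejects anything that does not end in a numeric build number, so a pipeline overview URL fails loudly instead of yielding a wrong build.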

Step 2: Get Build Overview

Using bk CLI (preferred):

bk build view {build_number} -p {pipeline} --json

Using REST API (fallback):

TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/{pipeline}/builds/{build_number}"

Save the commit SHA and created_at — you will need these later to check whether fixes have already been pushed.

Step 3: Check Agent Availability

Using bk CLI:

bk agent list --json

This shows all agents across all pipelines, their connection state, and current job (if busy).

Key fields to check:

connection_state — should be connected
job — if present, agent is busy; check which pipeline/build it's running
meta_data — agent tags including queue=default
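
The key fields above can be condensed into one line per agent with jq. This is a sketch under assumptions: the sample data is made up, summarize_agents is a hypothetical helper, and it assumes bk agent list --json emits an array of agent objects shaped like the REST API's (name, connection_state, job, meta_data).

```shell
# Sample data only: a trimmed, made-up agent list in the REST API's shape.
agents_json='[
  {"name":"agent-1","connection_state":"connected","job":null,"meta_data":["queue=default"]},
  {"name":"agent-2","connection_state":"connected","job":{"id":"j-1","web_url":"https://buildkite.com/mockserver/mockserver-node/builds/7#j-1"},"meta_data":["queue=default"]},
  {"name":"agent-3","connection_state":"disconnected","job":null,"meta_data":["queue=default"]}
]'

# Hypothetical helper: prints "name connection_state idle" or
# "name connection_state busy <job url>" for each agent read from stdin.
summarize_agents() {
  jq -r '.[]
    | (if .job then "busy " + .job.web_url else "idle" end) as $status
    | "\(.name) \(.connection_state) \($status)"'
}

printf '%s\n' "$agents_json" | summarize_agents
```

In a real investigation the input would come from bk agent list --json rather than the sample variable.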

Step 4: List Builds Across All Pipelines

To understand queue contention, check builds across all pipelines sharing the agent pool:

TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/builds?state[]=scheduled&state[]=running&per_page=50"

Step 5: Identify Failed Jobs

bk build view {build_number} -p {pipeline} --json

Filter for failed jobs in the JSON output. Look at jobs[].state == "failed".
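
The filter can be written as a jq expression. The sketch below runs against made-up sample data; failed_jobs is a hypothetical helper, and the id, name, and exit_status fields are assumed to be present on each job as they are in the REST API's build object.

```shell
# Sample data only: a trimmed, made-up build object.
build_json='{
  "number": 1234,
  "state": "failed",
  "jobs": [
    {"id":"job-a","type":"script","name":"build","state":"passed","exit_status":0},
    {"id":"job-b","type":"script","name":"integration tests","state":"failed","exit_status":1},
    {"id":"job-c","type":"waiter"}
  ]
}'

# Hypothetical helper: prints "id exit_status name" for every failed job.
# Jobs without a state field (such as the sample waiter step) are skipped.
failed_jobs() {
  jq -r '.jobs[] | select(.state == "failed") | "\(.id) \(.exit_status) \(.name)"'
}

printf '%s\n' "$build_json" | failed_jobs
```

The job id printed here is the {job_id} needed when retrieving job logs.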

Step 6: Retrieve Job Logs

TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/{pipeline}/builds/{build_number}/jobs/{job_id}/log" \
  | jq -r '.content'

For large logs, save to file:

mkdir -p .tmp  # ensure the output directory exists before redirecting into it
TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/{pipeline}/builds/{build_number}/jobs/{job_id}/log" \
  | jq -r '.content' > .tmp/buildkite-{build_number}-log.txt

Step 7: Download and Analyze Build Artifacts

TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/{pipeline}/builds/{build_number}/artifacts" \
  | jq '.[] | {filename,path,download_url}'

Download an artifact:

mkdir -p .tmp  # ensure the output directory exists before curl writes to it
TOKEN=$(bk auth token)
curl -sLH "Authorization: Bearer $TOKEN" \
  "{download_url}" -o .tmp/buildkite-{build_number}-artifact.log

Step 8: Check GitHub Actions (Secondary CI)

If the Buildkite build passed but a related GitHub Actions workflow failed (for example, CodeQL):

gh run list --repo mock-server/mockserver-monorepo --limit 10 --json status,conclusion,name,headBranch,createdAt
gh run view {run_id} --repo mock-server/mockserver-monorepo --log-failed

Step 9: Additional Investigation

Get build changes (the most recent commits up to and including the build's commit):

git log --oneline -n 10 {commit}

Check recent build history for patterns:

TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/{pipeline}/builds?per_page=20&state=failed" \
  | jq '.[] | {number,state,message,created_at}'

Cancel a build (e.g., stuck or blocking):

bk build cancel {build_number} -p {pipeline} -y

Rebuild a build:

bk build rebuild {build_number} -p {pipeline} -y

Step 10: Confirm Flaky-vs-Real

If the failure looks timing-, ordering-, port-, or resource-related (e.g. Timeout, Connection refused/BindException, or an intermittent assertion on order or timing), re-run the failing build (or check recent builds of the same commit) to confirm intermittency BEFORE classifying it as real or flaky:

# Re-run only the failing build (does NOT change the commit under test)
bk build rebuild {build_number} -p {pipeline} -y

# Or check whether other builds of the SAME commit passed/failed
TOKEN=$(bk auth token)
curl -sH "Authorization: Bearer $TOKEN" \
  "https://api.buildkite.com/v2/organizations/mockserver/pipelines/{pipeline}/builds?commit={commit}&per_page=10" \
  | jq '.[] | {number,state,created_at}'

If it passes on re-run (or other builds of the same commit passed), classify it FLAKY and set reproduced: false. If it fails again deterministically, it is a real failure — set reproduced: true.
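
The decision rule above can be sketched as a small shell function that takes the states of all builds of one commit. This is illustrative only: classify_reproduction is a hypothetical helper, and the NO_FAILURE and INCONCLUSIVE outcomes are assumptions added to cover inputs the rule does not mention.

```shell
# Hypothetical helper: arguments are the build states for ONE commit,
# e.g. the .state values returned by the builds?commit={commit} query.
classify_reproduction() {
  passed=0
  failed=0
  for state in "$@"; do
    case "$state" in
      passed) passed=$((passed + 1)) ;;
      failed) failed=$((failed + 1)) ;;
    esac   # other states (running, canceled, ...) are ignored
  done
  if [ "$failed" -eq 0 ]; then
    echo "NO_FAILURE"
  elif [ "$passed" -gt 0 ]; then
    echo "FLAKY reproduced=false"    # same commit both passed and failed
  elif [ "$failed" -ge 2 ]; then
    echo "REAL reproduced=true"      # failed again on re-run
  else
    echo "INCONCLUSIVE"              # one failure and no re-run yet
  fi
}

classify_reproduction failed passed
classify_reproduction failed failed
```

A single failed build with no second data point is reported as INCONCLUSIVE so that reproduced is never set from one observation.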

Step 11: Enumerate Competing Hypotheses

Before concluding a root cause, enumerate the competing hypotheses and the evidence that rules each out (correlation is not causation — a commit landing just before the failure is not proof it caused it). Record the survivors and the ruled-out alternatives in root_cause.alternative_hypotheses.

Step 12: Classify Error

Match the identified root cause against common failure patterns:

Error Pattern	Category	Scope
`BUILD FAILURE` in Maven	`BUILD_ERROR`	ISOLATED
`Tests run:.*(Failures|Errors): [1-9]`	`TEST_FAILURE`	ISOLATED
`OutOfMemoryError`	`RESOURCE_ERROR`	MAY_BE_SYSTEMIC
`Connection refused`	`NETWORK_ERROR`	MAY_BE_SYSTEMIC
`docker: Error`	`DOCKER_ERROR`	ISOLATED
`Timeout`	`TIMEOUT`	MAY_BE_SYSTEMIC
Passes on re-run / intermittent across builds of same commit	`FLAKY`	ISOLATED (intermittent)
Agent did not connect	`AGENT_ERROR`	SYSTEMIC
Build `skipped`	`AUTO_SKIPPED`	NORMAL (newer commit superseded)
Build `scheduled` (stuck)	`QUEUE_STARVATION`	SYSTEMIC
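
The log-based rows of the table can be applied mechanically with grep. The sketch below is one possible ordering, not a definitive classifier: classify_log is a hypothetical helper, the more specific patterns are checked before the generic Maven BUILD FAILURE line (an assumption, since Maven prints that line for most failures), and the test-summary pattern only matches a non-zero failure or error count.

```shell
# Hypothetical helper: prints "CATEGORY SCOPE" for the first matching pattern.
classify_log() {
  log_file="$1"
  if   grep -q  'OutOfMemoryError' "$log_file"; then echo "RESOURCE_ERROR MAY_BE_SYSTEMIC"
  elif grep -q  'Connection refused' "$log_file"; then echo "NETWORK_ERROR MAY_BE_SYSTEMIC"
  elif grep -q  'docker: Error' "$log_file"; then echo "DOCKER_ERROR ISOLATED"
  elif grep -q  'Timeout' "$log_file"; then echo "TIMEOUT MAY_BE_SYSTEMIC"
  elif grep -Eq 'Tests run:.*(Failures|Errors): [1-9]' "$log_file"; then echo "TEST_FAILURE ISOLATED"
  elif grep -q  'BUILD FAILURE' "$log_file"; then echo "BUILD_ERROR ISOLATED"
  else
    echo "UNCLASSIFIED"
    return 1
  fi
}

# Example with a made-up Maven log excerpt.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0
[ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0
[INFO] BUILD FAILURE
EOF
classify_log "$sample_log"
rm -f "$sample_log"
```

A pattern match is only a starting point; the category still has to be confirmed against the surrounding log lines.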

Step 13: Check for Already-Pushed Fixes

CRITICAL: Before recommending a fix, check whether the root cause has already been addressed:

git fetch origin master --quiet
git log --oneline {commit}..origin/master
git diff --name-only {commit}..origin/master

Classify:

Classification	Meaning	Report Action
ALREADY_FIXED	A commit on `origin/master` addresses this	Mark as fixed, cite the commit
OPEN	No fix found	Report with recommended fix
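
To narrow the search, the git log range can be limited to the files implicated by the failure. The following is a rough sketch: candidate_fixes and fix_status are hypothetical helpers, and a CANDIDATE_FIX_FOUND result only means that later commits touched those files, so each listed commit still has to be read before the issue is reported as already fixed.

```shell
# Hypothetical helper: lists commits after BUILD_COMMIT on UPSTREAM that
# touch any of the given paths. Usage: candidate_fixes COMMIT UPSTREAM PATH...
candidate_fixes() {
  build_commit="$1"
  upstream="$2"
  shift 2
  git log --oneline "${build_commit}..${upstream}" -- "$@"
}

# Hypothetical helper: OPEN when no later commit touched the paths,
# CANDIDATE_FIX_FOUND when at least one did (manual confirmation needed).
fix_status() {
  if [ -n "$(candidate_fixes "$@")" ]; then
    echo "CANDIDATE_FIX_FOUND"
  else
    echo "OPEN"
  fi
}

# Example (placeholders, run inside the repository after the git fetch):
#   fix_status {commit} origin/master path/to/FailingTest.java
```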

Build Scheduling and Queue Behaviour

Understanding how Buildkite schedules builds is critical for diagnosing "stuck" builds:

skip_queued_branch_builds: when enabled in the Buildkite pipeline settings, a newer build for the same branch causes older queued builds to be auto-skipped. This is normal, not a failure. Check the value under pipeline.skip_queued_branch_builds in the output of bk build view {build_number} -p {pipeline} --json. The mockserver pipeline currently has this enabled.
cancel_running_branch_builds: when disabled, running builds are NOT automatically cancelled when a newer commit is pushed, so older running builds continue to consume agents. Check the value under pipeline.cancel_running_branch_builds in the same output. The mockserver pipeline currently has this disabled.
Multiple pipelines share the same agent pool (queue=default). Scheduled builds from mockserver-performance-test, mockserver-client-node, and mockserver-node compete with the primary mockserver pipeline for agents.

ASG max_size limits the total number of agents. The Lambda scaler cannot add instances beyond this limit. Check the current max with:

DYLD_LIBRARY_PATH=/opt/homebrew/opt/expat/lib aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$(cd terraform/buildkite-agents && terraform output -raw auto_scaling_group_name 2>/dev/null || aws autoscaling describe-auto-scaling-groups --profile mockserver-build --region eu-west-2 --query 'AutoScalingGroups[?contains(Tags[?Key==`Name`].Value, `buildkite-mockserver`)].AutoScalingGroupName' --output text | head -1)" \
  --region eu-west-2 --profile mockserver-build \
  --query 'AutoScalingGroups[0].{MinSize:MinSize,MaxSize:MaxSize,DesiredCapacity:DesiredCapacity}'

Output — Structured Data Return

Return this structure in your final message:

{
  "schema": "pipeline-investigation/v1",
  "build": {
    "number": 0,
    "pipeline": "<pipeline slug>",
    "branch": "<branch>",
    "commit": "<commit sha>",
    "state": "failed",
    "failed_job": "<job name>",
    "started_at": "<ISO8601>",
    "finished_at": "<ISO8601>"
  },
  "reproduced": true,
  "root_cause": {
    "summary": "<one-line description>",
    "detail": "<technical explanation>",
    "error_excerpt": "<relevant log lines>",
    "failure_category": "<category from table above (e.g. FLAKY) or null>",
    "scope": "SYSTEMIC|ISOLATED|MAY_BE_SYSTEMIC|NORMAL",
    "alternative_hypotheses": [
      { "hypothesis": "<competing explanation>", "ruled_out_by": "<evidence that excludes it>" }
    ]
  },
  "fix_status": "OPEN|ALREADY_FIXED",
  "fix_commit": "<sha or null>",
  "fix_message": "<commit message or null>",
  "github_actions": [
    { "run_id": 0, "workflow": "<name>", "conclusion": "failure|success", "root_cause": "<summary or null>" }
  ],
  "commands_run": ["<exact bk / curl / git command used to gather evidence>"],
  "recommended_fix": "<actionable steps, null if already fixed>"
}

After returning the JSON, provide a brief summary (2-3 lines).

Notes

Always investigate the deepest failure first (root cause in logs, not surface symptoms)
Build artifacts often contain more detailed logs than the console output
For recurring failures, check recent commits and pipeline definition changes
Always run Step 13 before reporting — check if the issue was already fixed
When builds are stuck/scheduled, check agent availability across ALL pipelines, not just the one being investigated
Set reproduced: true only when the failure recurs deterministically (failed again on re-run, or every build of the commit failed); set reproduced: false for FLAKY failures that passed on re-run
Always populate commands_run with the exact commands used so the conclusion can be replayed by the reader
The bk CLI uses -p {pipeline} for pipeline selection, NOT --org — the org is set via bk auth switch