aws-investigation


Investigates AWS infrastructure issues affecting Buildkite build agents (EC2, AutoScaling, Lambda). Returns structured JSON to the parent for formatting. Triggers when users ask about build agents not running, EC2 issues, ASG scaling problems, or infrastructure health.

By mock-server, updated 6/4/2026

name: aws-investigation
description: Investigates AWS infrastructure issues affecting Buildkite build agents (EC2, AutoScaling, Lambda). Returns structured JSON to the parent for formatting. Triggers when users ask about build agents not running, EC2 issues, ASG scaling problems, or infrastructure health.


AWS Infrastructure Investigation

Investigate AWS infrastructure issues affecting Buildkite build agents. Covers EC2 instances, AutoScaling Groups, the autoscaling Lambda, and related resources.

Prerequisites

  • AWS CLI installed (brew install awscli)
  • AWS SSO profile mockserver-build configured (SSO region: eu-west-2)
  • Active SSO session: aws sso login --profile mockserver-build
  • Corporate TLS proxy (if applicable): export AWS_CA_BUNDLE=$NODE_EXTRA_CA_CERTS (only if NODE_EXTRA_CA_CERTS is set)
  • macOS + Python 3.14 + Homebrew: if you get pyexpat symbol errors, export DYLD_LIBRARY_PATH=/opt/homebrew/opt/expat/lib
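
The prerequisites above can be checked before starting an investigation. The following is a minimal sketch, assuming that a failing `aws sts get-caller-identity` call means the SSO session is missing or expired; the function name `check_sso_session` is hypothetical and not part of this skill.

```shell
# Mirror the corporate TLS proxy bullet: only set AWS_CA_BUNDLE when
# NODE_EXTRA_CA_CERTS is already defined.
if [ -n "${NODE_EXTRA_CA_CERTS:-}" ]; then
  export AWS_CA_BUNDLE="$NODE_EXTRA_CA_CERTS"
fi

# Hypothetical helper: succeeds when the profile can authenticate against STS.
check_sso_session() {
  if aws sts get-caller-identity \
      --region eu-west-2 --profile mockserver-build >/dev/null 2>&1; then
    echo "SSO session active"
  else
    echo "No usable session; run: aws sso login --profile mockserver-build"
    return 1
  fi
}
```

Calling `check_sso_session` first separates credential problems from real infrastructure problems in the later steps.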

Infrastructure Overview

There are two build agent stacks. Investigate the current stack first; fall back to the legacy stack only if the current one has no agents or has not been provisioned yet.

Current: Terraform-managed (eu-west-2)

Managed by terraform/buildkite-agents/ using the official Buildkite Elastic CI Stack module.

| Property | Value |
| --- | --- |
| Region | eu-west-2 |
| Instance type | Read from terraform/buildkite-agents/terraform.tfvars (instance_types) |
| Scaling | Read from Terraform variables (min_size, max_size, on_demand_percentage) |
| Scaler version | buildkite-agent-scaler v1.11.2 |
| Scaler runtime | provided.al2023 |
| Queue | default |
| IaC | terraform/buildkite-agents/ |

Resource names are generated by Terraform with a random suffix. To find them:

# Get ASG name from Terraform state
cd terraform/buildkite-agents
terraform output auto_scaling_group_name

# Or find ASGs whose Stack tag contains buildkite-mockserver
# (filtering inside Tags avoids a JMESPath type error on ASGs that have no Stack tag)
aws autoscaling describe-auto-scaling-groups \
  --region eu-west-2 --profile mockserver-build \
  --query 'AutoScalingGroups[?Tags[?Key==`Stack` && contains(Value, `buildkite-mockserver`)]].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'

Legacy: CloudFormation-managed (us-east-1)

Being replaced by the Terraform stack above. May still be active during migration.

| Resource | Identifier | Region |
| --- | --- | --- |
| AutoScaling Group | buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q | us-east-1 |
| CloudFormation Stack | buildkite | us-east-1 |
| Instance Type | Inspect live ASG launch template via AWS CLI | us-east-1 |
| Autoscaling Lambda | Use discovery query below (name generated by CloudFormation) | us-east-1 |

AWS CLI Prefix

All commands require --region and --profile flags:

# Current stack (eu-west-2)
aws ... --region eu-west-2 --profile mockserver-build

# Legacy stack (us-east-1)
aws ... --region us-east-1 --profile mockserver-build
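
Because every command repeats the same two flags, a small wrapper can reduce copy-paste mistakes. This is only a sketch; `bk_aws` is a hypothetical name, and it assumes the profile is always mockserver-build as stated above.

```shell
# Hypothetical helper: appends the shared flags so each command stays short.
# Usage: bk_aws <region> <service> <operation> [args...]
bk_aws() {
  local region="$1"
  shift
  aws "$@" --region "$region" --profile mockserver-build
}
```

For example, `bk_aws eu-west-2 autoscaling describe-auto-scaling-groups` targets the current stack, and swapping the first argument for `us-east-1` targets the legacy stack.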

Investigation Workflow

Step 1: Determine Active Stack

Check which stack is currently running agents:

# Check current stack (eu-west-2): look for ASGs whose Stack tag contains buildkite-mockserver
# (filtering inside Tags avoids a JMESPath type error on ASGs that have no Stack tag)
aws autoscaling describe-auto-scaling-groups \
  --region eu-west-2 --profile mockserver-build \
  --query 'AutoScalingGroups[?Tags[?Key==`Stack` && contains(Value, `buildkite-mockserver`)]].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'

# Check legacy stack (us-east-1)
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q" \
  --region us-east-1 --profile mockserver-build \
  --query 'AutoScalingGroups[0].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'

Use whichever stack has instances (or non-zero desired capacity) for the remaining steps. Substitute the correct --region and ASG name accordingly.
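
The "instances or non-zero desired capacity" rule can be expressed as an exit status. A minimal sketch follows, assuming the ASG name is already known; `asg_is_active` is a hypothetical helper, and it treats a failed or empty lookup as "not active".

```shell
# Hypothetical helper: exit status 0 when the ASG has instances or a
# non-zero desired capacity, 1 otherwise (including when the lookup fails).
# Usage: asg_is_active <asg-name> <region>
asg_is_active() {
  local counts desired instances
  counts="$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --region "$2" --profile mockserver-build \
    --query 'AutoScalingGroups[0].[DesiredCapacity, length(Instances)]' \
    --output text)" || return 1
  # Text output is "<desired><TAB><instance count>", or "None" if the ASG is missing.
  read -r desired instances <<EOF
$counts
EOF
  # Treat anything that is not a plain number as zero.
  case "$desired" in ''|*[!0-9]*) desired=0 ;; esac
  case "$instances" in ''|*[!0-9]*) instances=0 ;; esac
  [ "$desired" -gt 0 ] || [ "$instances" -gt 0 ]
}
```

Running it against the Terraform-generated ASG name in eu-west-2 first, and against the legacy ASG in us-east-1 only if that fails, follows the order described in the Infrastructure Overview.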

Step 2: Quick Health Check

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "<ASG_NAME>" \
  --region <REGION> --profile mockserver-build \
  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Instances:Instances[*].{ID:InstanceId,State:LifecycleState,Health:HealthStatus}}'

Expected healthy state:

  • If queue is empty: Desired = 0 can be healthy (scale-to-zero)
  • If queue has pending jobs: desired capacity should increase above 0 within 1-2 scaler intervals
  • Active instances should be InService and Healthy

Problem indicators:

  • Desired: 0 — no agents requested (scaler not seeing jobs, or Lambda not running)
  • Desired > 0 but no instances — launch failures
  • Instances in Pending for >5 min — launch issues
  • Instances Unhealthy — failing health checks
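
The indicators above can be turned into a rough first-pass triage. The sketch below is an illustration only: `classify_asg` and its labels are hypothetical, the three counts must be read off the Step 2 output by hand, and it does not cover the "Pending for >5 min" case because that needs launch timestamps.

```shell
# Hypothetical helper mapping the Step 2 numbers to a first-pass verdict.
# Usage: classify_asg <desired> <instance-count> <unhealthy-count>
classify_asg() {
  local desired="$1" instances="$2" unhealthy="$3"
  if [ "$unhealthy" -gt 0 ]; then
    echo "unhealthy-instances"   # instances failing health checks
  elif [ "$desired" -eq 0 ]; then
    echo "scaled-to-zero"        # healthy only if the queue is empty
  elif [ "$instances" -eq 0 ]; then
    echo "launch-failures"       # capacity requested but nothing launched
  else
    echo "ok"
  fi
}
```

A result of `scaled-to-zero` while jobs are pending points at the scaler (Step 5), and `launch-failures` points at the scaling activities (Step 4).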

Step 3: Check EC2 Instance Status

aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=<ASG_NAME>" \
  --region <REGION> --profile mockserver-build \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name,Type:InstanceType,Launch:LaunchTime,AZ:Placement.AvailabilityZone}'

For running instances, check system/instance status:

aws ec2 describe-instance-status \
  --instance-ids <instance-id-1> <instance-id-2> \
  --region <REGION> --profile mockserver-build

Step 4: Check Scaling Activities

aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name "<ASG_NAME>" \
  --region <REGION> --profile mockserver-build \
  --max-items 10

Look for:

  • "user request explicitly set group desired capacity" — the Lambda scaler adjusted capacity
  • "an instance was taken out of service" — scale-in event
  • Failed status codes — launch failures (AMI issues, capacity, subnet exhaustion)
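
To surface only the failed activities, the same call can be narrowed with a JMESPath filter on `StatusCode`. This is a sketch; `failed_scaling_activities` is a hypothetical wrapper, and the selected fields are one reasonable choice rather than a required set.

```shell
# Hypothetical helper: show only failed activities from the latest ten.
# Usage: failed_scaling_activities <asg-name> <region>
failed_scaling_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1" \
    --region "$2" --profile mockserver-build \
    --max-items 10 \
    --query 'Activities[?StatusCode==`Failed`].{Start:StartTime,Message:StatusMessage,Cause:Cause}'
}
```

The `StatusMessage` field usually carries the launch error text, which helps tell AMI, capacity, and subnet problems apart.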

Step 5: Check the Autoscaling Lambda

Find the scaler Lambda by listing functions with a Buildkite-related name:

# list-functions does not return State; read it with
# aws lambda get-function-configuration --function-name <LAMBDA_FUNCTION_NAME>
aws lambda list-functions \
  --region <REGION> --profile mockserver-build \
  --query 'Functions[?contains(FunctionName, `buildkite`) && (contains(FunctionName, `scaler`) || contains(FunctionName, `caling`))].{Name:FunctionName,Runtime:Runtime,LastModified:LastModified}'

Then check its logs:

# Recent invocations (last hour)
aws logs filter-log-events \
  --log-group-name "/aws/lambda/<LAMBDA_FUNCTION_NAME>" \
  --region <REGION> --profile mockserver-build \
  --start-time $(python3 -c "import time; print(int((time.time() - 3600) * 1000))") \
  --limit 20

# Error logs (last hour)
aws logs filter-log-events \
  --log-group-name "/aws/lambda/<LAMBDA_FUNCTION_NAME>" \
  --region <REGION> --profile mockserver-build \
  --start-time $(python3 -c "import time; print(int((time.time() - 3600) * 1000))") \
  --filter-pattern "ERROR" \
  --limit 10
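
If the logs show no recent invocations, the schedule that triggers the Lambda is worth checking. The sketch below assumes the scaler is invoked by an EventBridge rule on the default event bus that targets the unqualified function ARN; `scaler_rule_states` is a hypothetical helper.

```shell
# Hypothetical helper: print "<rule-name> <state>" for each EventBridge rule
# that targets the scaler Lambda.
# Usage: scaler_rule_states <lambda-function-name> <region>
scaler_rule_states() {
  local fn="$1" region="$2" arn rule state
  arn="$(aws lambda get-function-configuration --function-name "$fn" \
    --region "$region" --profile mockserver-build \
    --query 'FunctionArn' --output text)" || return 1
  # Rule names come back whitespace-separated in text output.
  for rule in $(aws events list-rule-names-by-target --target-arn "$arn" \
      --region "$region" --profile mockserver-build \
      --query 'RuleNames' --output text); do
    state="$(aws events describe-rule --name "$rule" \
      --region "$region" --profile mockserver-build \
      --query 'State' --output text)"
    echo "$rule $state"
  done
}
```

A rule reported as `DISABLED`, or no rules at all, would match the "Lambda not invoking" symptom.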

Step 6: Check EC2 Console Output

For instances that are running but not registering as Buildkite agents:

aws ec2 get-console-output \
  --instance-id <instance-id> \
  --region <REGION> --profile mockserver-build \
  --query 'Output' --output text

Step 7: Check Suspended ASG Processes

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "<ASG_NAME>" \
  --region <REGION> --profile mockserver-build \
  --query 'AutoScalingGroups[0].SuspendedProcesses'

Note: AZRebalance is intentionally suspended to prevent killing running builds. Other suspended processes may indicate problems.
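
Since AZRebalance is expected, it can be filtered out so that only unexpected suspensions remain. A minimal sketch, with `unexpected_suspended_processes` as a hypothetical name:

```shell
# Hypothetical helper: list suspended processes other than AZRebalance.
# Usage: unexpected_suspended_processes <asg-name> <region>
unexpected_suspended_processes() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --region "$2" --profile mockserver-build \
    --query 'AutoScalingGroups[0].SuspendedProcesses[?ProcessName!=`AZRebalance`].ProcessName' \
    --output text
}
```

Any process name printed here, such as `Launch` or `Terminate`, deserves a closer look.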

Failure Patterns

| Symptom | Likely Cause | Investigation |
| --- | --- | --- |
| ASG desired=0, no instances | No Buildkite jobs pending, or Lambda not invoking | Check Step 5 (Lambda logs) |
| ASG desired>0, no instances launching | Launch template issue, AMI missing, capacity error | Check Step 4 (scaling activities for errors) |
| Instances running but builds stuck | Buildkite agent not starting on instance, token issue | Check Step 6 (console output) |
| Lambda not invoking | EventBridge rule disabled | Check Step 5 (Lambda and EventBridge) |
| Lambda invoking but not scaling | Buildkite API auth failure (expired token) | Check Step 5 (Lambda error logs) |
| Instances cycle rapidly (launch/terminate) | Health check failures, instance crashing on boot | Check Steps 3, 4, 6 |
| Agents run briefly then terminate | Normal: MIN_SIZE=0, scaler scales down when jobs finish | Not a bug |

Emergency: Manually Scale Up Agents

If the Lambda is broken and you need agents immediately:

aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "<ASG_NAME>" \
  --desired-capacity <TEMP_CAPACITY_LEQ_MAX_SIZE> \
  --region <REGION> --profile mockserver-build

Choose a temporary capacity that does not exceed the ASG MaxSize from Step 2.

Warning: The Lambda scaler may override this on its next invocation if it sees no pending jobs.
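
The MaxSize constraint can be enforced before the capacity is changed. The following is a sketch only: `scale_up_capped` is a hypothetical helper that reads MaxSize from the live ASG and refuses larger values, and it still changes real capacity when it succeeds.

```shell
# Hypothetical helper: set desired capacity only if it does not exceed MaxSize.
# Usage: scale_up_capped <asg-name> <region> <capacity>
scale_up_capped() {
  local asg="$1" region="$2" wanted="$3" max
  max="$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg" \
    --region "$region" --profile mockserver-build \
    --query 'AutoScalingGroups[0].MaxSize' --output text)" || return 1
  # Bail out if MaxSize is not a plain number (for example "None").
  case "$max" in
    ''|*[!0-9]*) echo "could not read MaxSize for $asg" >&2; return 1 ;;
  esac
  if [ "$wanted" -gt "$max" ]; then
    echo "refusing: $wanted exceeds MaxSize $max" >&2
    return 1
  fi
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name "$asg" \
    --desired-capacity "$wanted" \
    --region "$region" --profile mockserver-build
}
```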

Output — Structured Data Return

Return this structure in your final message:

{
  "schema": "aws-investigation/v1",
  "timestamp": "<ISO8601>",
  "active_stack": "terraform-eu-west-2 | legacy-us-east-1",
  "asg": {
    "name": "<ASG name>",
    "region": "<region>",
    "desired_capacity": 0,
    "min_size": 0,
    "max_size": "<max_size>",
    "instances": [
      {
        "instance_id": "<id>",
        "state": "InService|Pending|Terminating",
        "health": "Healthy|Unhealthy",
        "availability_zone": "<az>"
      }
    ],
    "suspended_processes": ["<process names>"]
  },
  "lambda": {
    "function_name": "<name>",
    "state": "Active|Inactive",
    "runtime": "<runtime>",
    "recent_errors": ["<error messages>"],
    "last_invocation": "<ISO8601 or null>"
  },
  "root_cause": {
    "summary": "<one-line description>",
    "detail": "<technical explanation>",
    "category": "<category from failure patterns>",
    "evidence": "<relevant log lines or CLI output>"
  },
  "recommended_fix": "<actionable steps>",
  "warnings": ["<deprecation notices, capacity concerns, etc.>"]
}

After returning the JSON, provide a brief summary (2-3 lines).

Notes

  • Always run Step 1 first to determine which stack is active
  • The Lambda scaler logs are the most valuable data source for understanding scaling decisions
  • Always check if the Buildkite agent token is still valid if agents start but don't register
Install via CLI
npx skills add https://github.com/mock-server/mockserver-monorepo --skill aws-investigation