ops-logs-query - SKILL.md Agent Skill

name: ops-logs-query description: Internal — for Boundless team members only. Query AWS CloudWatch logs for Boundless services (provers, slasher, distributor, order stream, order generator, indexer, signal) on prod/staging environments. Use when the user asks to look at service logs, debug service behavior from log output, search logs for a request ID, or investigate errors using CloudWatch. Do NOT use for debugging local code changes, reviewing PRs, or investigating issues in the codebase itself.

Logs Query

Query AWS CloudWatch Logs for Boundless services on prod/staging.

Prerequisites

Read network_secrets.toml from the repo root. Extract the AWS credentials for the target environment from [aws.prod] or [aws.staging] (access_key_id, secret_access_key). If the file is not present, recommend the user create it -- instructions and credentials are in the Boundless runbook.
Export credentials before running any queries:

export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-west-2"

Finding Log Groups

Prover log groups

Prover log groups follow the pattern /boundless/bento/<hostname>. The hostnames are defined in the Pulumi config files:

Staging: infra/cw-monitoring/Pulumi.staging.yaml
Prod: infra/cw-monitoring/Pulumi.production.yaml

Read the relevant Pulumi config to find current hostnames. As of now:

Log Group	Environment	Chain ID	Label (approx, in `network_address_labels`)	Description
`/boundless/bento/prover-84532-staging-nightly`	staging	84532 (Base Sepolia)		Staging nightly prover
`/boundless/bento/prover-8453-prod-release`	prod	8453 (Base)	`BPLatitudeRelease`	Prod release prover. Legacy name: `/boundless/bento/base-mainnet-prover-release`
`/boundless/bento/prover-8453-prod-nightly`	prod	8453 (Base)	`BPLatitudeNightly`	Prod nightly prover. Legacy name: `/boundless/bento/base-mainnet-prover-nightly`
`/boundless/bento/prover-84532-prod-nightly`	prod	84532 (Base Sepolia)		Prod Base Sepolia nightly prover
`/boundless/bento/prover-11155111-prod-nightly`	prod	11155111 (Eth Sepolia)		Prod Eth Sepolia nightly prover
`/boundless/bento/prover-01`	prod	8453 (Base)	`BPProver01DC`	Prod datacenter prover 01
`/boundless/bento/prover-02`	prod	8453 (Base)	`BPProver02DC`	Prod datacenter prover 02

The label column lists the network_address_labels.json label that the log group is believed to correspond to. Names may not match exactly -- always confirm against network_address_labels.json and Pulumi config before relying on the mapping.

If these are out of date, check the Pulumi config files for the current list.

We only have prover log groups for provers we operate. External provers do not have queryable logs.

Provers we operate are labeled with a BP prefix in network_address_labels.json (e.g. BP1, BP2, BPNightlyAWS). When investigating any issue, always highlight what our BP provers are doing -- did they skip, fail, drop, or fulfill? This should be called out explicitly even when the investigation is not specifically about our provers.

Discovering log groups for other services

Other services (slasher, distributor, order stream, order generator, indexer backend, indexer API, etc.) have log groups that follow a naming convention but may change. Discover them dynamically rather than hardcoding.

Log group naming convention: l-<staging|prod>-<chain_id>-<service-name>-<chain_id>-<resource>

Example: l-staging-167000-indexer-api-167000-lambda

Known chain IDs:

84532 = Base Sepolia
8453 = Base Mainnet
167000 = Taiko Mainnet
11155111 = Eth Sepolia
1 = Eth Mainnet

To find log groups for a specific service, search by prefix. Use the environment (staging/prod) and optionally the chain ID and service name:

aws logs describe-log-groups \
  --log-group-name-prefix "l-staging-84532" \
  --query 'logGroups[].logGroupName' --output table

To find log groups for a specific service across all chains in an environment:

aws logs describe-log-groups \
  --query 'logGroups[?contains(logGroupName, `staging`) && contains(logGroupName, `indexer`)].logGroupName' \
  --output table

To list all log groups in an environment:

aws logs describe-log-groups \
  --log-group-name-prefix "l-staging" \
  --query 'logGroups[].logGroupName' --output table

Also check bento prover log groups:

aws logs describe-log-groups \
  --log-group-name-prefix "/boundless/bento" \
  --query 'logGroups[].logGroupName' --output table

Some services have multiple log groups for different components (e.g. an indexer may have separate groups for the backend worker and the API lambda). When investigating an issue, check all matching log groups.

Service name patterns

Common service name fragments to search for:

Service	Search fragments
Indexer API	`indexer-api`
Indexer backend	`indexer`, `market-indexer`, `rewards-indexer`
Order stream	`order-stream`
Order generator	`order-generator`, `og`
Slasher	`slasher`
Distributor	`distributor`
Signal	`prod-8453-signal` (no `l-` prefix)
Prover (bento)	`/boundless/bento/prover` or `/boundless/bento/-prover-`

Querying Logs

Always filter by time range. Log groups are high-volume and queries without time bounds will be slow or hit limits.

Use aws logs filter-log-events for searching. Key parameters:

--log-group-name: required
--start-time / --end-time: Unix milliseconds (required -- always set these)
--filter-pattern: CloudWatch filter syntax for searching log content
--output json: pipe through jq for readability

Computing timestamps

Convert human-readable times to Unix milliseconds:

START=$(date -j -u -f "%Y-%m-%dT%H:%M:%SZ" "2026-03-30T00:00:00Z" +%s 2>/dev/null || date -d "2026-03-30T00:00:00Z" +%s)
START_MS=$((START * 1000))

END=$(date -j -u -f "%Y-%m-%dT%H:%M:%SZ" "2026-03-31T00:00:00Z" +%s 2>/dev/null || date -d "2026-03-31T00:00:00Z" +%s)
END_MS=$((END * 1000))

For relative times:

NOW_MS=$(date +%s)000
ONE_HOUR_AGO_MS=$(( ($(date +%s) - 3600) * 1000 ))
SIX_HOURS_AGO_MS=$(( ($(date +%s) - 21600) * 1000 ))
ONE_DAY_AGO_MS=$(( ($(date +%s) - 86400) * 1000 ))

Searching by request ID

The most common query pattern. Request IDs appear in log messages as hex values (e.g. 0x2a):

aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --start-time "$ONE_HOUR_AGO_MS" \
  --end-time "$NOW_MS" \
  --filter-pattern '"0xREQUEST_ID"' \
  --output json | jq '.events[] | {timestamp: (.timestamp / 1000 | todate), message: .message}'

Searching by request digest

aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --start-time "$ONE_HOUR_AGO_MS" \
  --end-time "$NOW_MS" \
  --filter-pattern '"0xDIGEST"' \
  --output json | jq '.events[] | {timestamp: (.timestamp / 1000 | todate), message: .message}'

Searching for errors

aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --start-time "$ONE_HOUR_AGO_MS" \
  --end-time "$NOW_MS" \
  --filter-pattern '"ERROR"' \
  --output json | jq '.events[] | {timestamp: (.timestamp / 1000 | todate), message: .message}'

Searching across multiple log groups

When a service has multiple log groups, query each one:

for LG in "l-staging-84532-indexer-api-84532-lambda" "l-staging-84532-market-indexer-84532-task"; do
  echo "=== $LG ==="
  aws logs filter-log-events \
    --log-group-name "$LG" \
    --start-time "$ONE_HOUR_AGO_MS" \
    --end-time "$NOW_MS" \
    --filter-pattern '"ERROR"' \
    --output json | jq '.events[] | {timestamp: (.timestamp / 1000 | todate), message: .message}'
  sleep 1
done

Pagination

filter-log-events returns a nextToken when there are more results:

TOKEN=""
while true; do
  if [ -n "$TOKEN" ]; then
    RESP=$(aws logs filter-log-events \
      --log-group-name "$LOG_GROUP" \
      --start-time "$START_MS" \
      --end-time "$END_MS" \
      --filter-pattern '"0xREQUEST_ID"' \
      --next-token "$TOKEN" \
      --output json)
  else
    RESP=$(aws logs filter-log-events \
      --log-group-name "$LOG_GROUP" \
      --start-time "$START_MS" \
      --end-time "$END_MS" \
      --filter-pattern '"0xREQUEST_ID"' \
      --output json)
  fi

  echo "$RESP" | jq '.events[] | {timestamp: (.timestamp / 1000 | todate), message: .message}'
  TOKEN=$(echo "$RESP" | jq -r '.nextToken // empty')
  [ -z "$TOKEN" ] && break
  sleep 1
done

CloudWatch Filter Pattern Syntax

"exact phrase" -- matches logs containing the exact phrase (quotes required)
?term1 ?term2 -- OR: matches logs containing either term
"term1" "term2" -- AND: matches logs containing both terms
"ERROR" "request_id" -- combine filters

Checking for Recent Deployments

When investigating fulfillment rate drops, prover downtime, or success rate alarms for provers we operate, always check for recent deployments first. Nightly deployments restart the bento Docker Compose stack and can cause extended outages if the new image is broken.

Deployment events appear in the bento prover log groups (e.g. /boundless/bento/prover-11155111-prod-nightly). Look for these patterns:

# Find recent deployments in a time window
aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --start-time "$START_MS" \
  --end-time "$END_MS" \
  --filter-pattern '?"Stopping Docker Compose" ?"Starting Docker Compose"' \
  --output json | jq '.events[] | {timestamp: (.timestamp / 1000 | todate), message: .message}'

A deployment cycle looks like:

"Stopping Docker Compose services" — old containers torn down
"Image ghcr.io/boundless-xyz/boundless/broker:<tag> Pulling" — new image pulled (the tag contains the git commit, e.g. nightly-3b8a71f)
"Container bento-broker-1 Created" / "Starting" — new containers come up
Optionally: "dependency failed to start: container ... is unhealthy" — a container failed its healthcheck, cascading to broker failure
If "Starting Docker Compose" is missing or far behind "Stopping Docker Compose", the broker may be in graceful shutdown drain — search ?"starting graceful shutdown" ?"in-progress orders to complete" ?"Cancelling critical tasks". The broker waits up to 2h (SHUTDOWN_GRACE_PERIOD_SECS) for committed orders before exiting; during this window bento_active=0 and channel closed errors from the chain monitor are expected, not an outage.

Deployments are significant events -- they restart the broker (causing a brief gap in telemetry and fulfillments even when healthy) and deploy new code that could introduce bugs or behavior changes. Always note when a deployment occurred relative to the issue being investigated.

If the broker stopped fulfilling shortly after a deployment, check for:

Healthcheck failures: ?"unhealthy" ?"failed to start" ?"Error dependency" — a dependency container (often rest_api) failed, preventing the broker from starting
Container crashes: ?"exit" ?"Exited" ?"Restarting" — the broker or a dependency crashed after startup
Image tag: compare the deployed image tag (git commit hash) against the git log to identify what changed

# Check for container failures after a deployment
aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --start-time "$START_MS" \
  --end-time "$END_MS" \
  --filter-pattern '?"unhealthy" ?"failed to start" ?"Exited" ?"Error dependency"' \
  --output json | jq '.events[] | {timestamp: (.timestamp / 1000 | todate), message: .message}'

Secondary Fulfillment in Logs

When a prover locks an order but fails to fulfill it before the lock expires, the order becomes available for secondary fulfillment by any other prover, who earns the slash collateral as reward. In broker logs, secondary fulfillment attempts appear as FulfillAfterLockExpire entries. When investigating expired or slashed requests, search our BP prover logs for the request ID to see if they evaluated the secondary fulfillment opportunity:

# Search for secondary fulfillment activity on a specific request
aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --start-time "$START_MS" \
  --end-time "$END_MS" \
  --filter-pattern '"0xREQUEST_ID" "FulfillAfterLockExpire"' \
  --output json | jq '.events[] | {timestamp: (.timestamp / 1000 | todate), message: .message}'

If the request ID doesn't appear at all, the prover never saw the secondary opportunity. If it appears with skip or error messages, note the reason -- common issues include the order being unprofitable at the slash collateral price, insufficient remaining deadline, or the prover being at capacity. Always check whether our BP provers attempted secondary fulfillment on orders that expired after being locked.

Tips

Keep time windows as narrow as possible (minutes or hours, not days)
Start with a request ID filter, then broaden if needed
Log messages are typically structured (JSON or key=value), so jq is useful for parsing
If the output is very large, add | head -50 or pipe to a file
Use the --limit flag to cap results per API call (default 10000)
When unsure which log group to query, discover them first with describe-log-groups
Some services span multiple log groups -- check all matching groups when investigating