opensearch-logs - SKILL.md Agent Skill

name: opensearch-logs
description: >-
  OpenSearch observability toolkit: search logs, create dashboards, deploy
  automated monitors with Google Chat alerts. Works with any OpenSearch 2.x
  cluster. Use when building observability for a service, debugging production
  issues, creating monitoring dashboards, or setting up automated alerting.
user_invocable: true
args: Optional flags like "--mode report --hours 6" or "--service deal_structure --level ERROR"

OpenSearch Observability Toolkit

What This Skill Does

A complete observability toolkit for any OpenSearch 2.x cluster. The primary workflow is zero-code: discover what's in your logs, create dashboards and alerts from YAML configs, and get Google Chat notifications when things go wrong.

Core Flow: Discover → Dashboard → Alert

1. DISCOVER    os-search --query '"your-pattern"' --hours 24
               os-search --mode errors --hours 4
               → Understand what's in your logs

2. DASHBOARD   Write a YAML config (metrics, charts, queries)
               os-dash --mode create --config my-dashboard
               → OSD dashboard appears with live visualizations

3. ALERT       Write a YAML config (monitors, thresholds, webhooks)
               os-alert --mode create --profile my-alerts
               → OSD monitors run on schedule, fire Google Chat cards

4. REPORT      os-alert --mode report --hours 1
               → Rich health card sent to Google Chat on demand

No TypeScript required for steps 2-4 — just YAML files and CLI commands.

Setup

1. Install and build

cd .claude/skills/opensearch-logs/scripts && pnpm install && pnpm build

2. Configure credentials

cp .env.example .env   # then fill in your values

Credentials are loaded automatically in this order:

1. Local .env (in the scripts/ directory), recommended
2. Home-dir fallback: ~/.opensearch-env
3. Environment variables: OPENSEARCH_HOST, OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD

See .env.example for all available variables.
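
The loading order above can be pictured as a list of key/value sources that is walked from first to last. The sketch below is illustrative only: `parseEnvFile` and `resolveCredentials` are hypothetical names, and it assumes the first source that defines a key wins, which the skill's actual config.ts may or may not do.

```typescript
// Hypothetical sketch; not the skill's actual config.ts.
type Credentials = { host: string; username: string; password: string };

// Parse KEY=VALUE lines from a dotenv-style file.
// Blank lines and lines starting with "#" are skipped; matching outer quotes are stripped.
function parseEnvFile(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    const key = line.slice(0, eq).trim();
    const value = line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, "$2");
    out[key] = value;
  }
  return out;
}

// Walk the sources in priority order; the first non-empty value for a key wins (assumed).
function resolveCredentials(
  sources: Array<Record<string, string | undefined>>
): Credentials {
  const pick = (key: string): string => {
    for (const src of sources) {
      const v = src[key];
      if (v !== undefined && v !== "") return v;
    }
    throw new Error(`Missing required credential: ${key}`);
  };
  return {
    host: pick("OPENSEARCH_HOST"),
    username: pick("OPENSEARCH_USERNAME"),
    password: pick("OPENSEARCH_PASSWORD"),
  };
}
```

In a real run the three sources would be the parsed scripts/.env, the parsed ~/.opensearch-env, and the process environment, passed in that order.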

Optional config for non-default clusters (e.g., AWS ConveyorCloud)

Env Var	Default	Purpose
`OPENSEARCH_DATA_PORT`	`25060`	Data API port (AWS OpenSearch uses `443`)
`OPENSEARCH_DASHBOARDS_PORT`	`443`	Dashboards API port
`OPENSEARCH_INDEX_PREFIX`	`python-services`	Index name prefix (override per team)
`OPENSEARCH_TENANT`	`global`	Security tenant for FGAC-enabled clusters
`GCHAT_WEBHOOK_URL` or `PROD_GCHAT_WEBHOOK`	—	Google Chat webhook for alerts/reports
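
As a rough illustration of how the defaults in this table could be applied, the sketch below reads the variables from a plain object and falls back to the documented defaults. `resolveClusterConfig` and `ClusterConfig` are hypothetical names, and preferring `GCHAT_WEBHOOK_URL` over `PROD_GCHAT_WEBHOOK` is an assumption.

```typescript
// Hypothetical sketch; defaults are taken from the table above.
type ClusterConfig = {
  dataPort: number;
  dashboardsPort: number;
  indexPrefix: string;
  tenant: string;
  webhookUrl: string | undefined;
};

function resolveClusterConfig(env: Record<string, string | undefined>): ClusterConfig {
  // Read a port variable, falling back to the default when it is unset or empty.
  const port = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw === "") return fallback;
    const n = Number(raw);
    if (!Number.isInteger(n) || n < 1 || n > 65535) {
      throw new Error(`${name} must be a port number, got "${raw}"`);
    }
    return n;
  };
  return {
    dataPort: port("OPENSEARCH_DATA_PORT", 25060),
    dashboardsPort: port("OPENSEARCH_DASHBOARDS_PORT", 443),
    indexPrefix: env["OPENSEARCH_INDEX_PREFIX"] || "python-services",
    tenant: env["OPENSEARCH_TENANT"] || "global",
    // Assumption: GCHAT_WEBHOOK_URL is checked first, PROD_GCHAT_WEBHOOK second.
    webhookUrl: env["GCHAT_WEBHOOK_URL"] || env["PROD_GCHAT_WEBHOOK"],
  };
}
```

For an AWS-hosted cluster, setting `OPENSEARCH_DATA_PORT=443` is the only port override the table calls for.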

3. Set the scripts path

export OS_SCRIPTS=.claude/skills/opensearch-logs/scripts

4. Verify

node $OS_SCRIPTS/dist/os-search.js --mode count --hours 1

YAML-Driven Dashboards (os-dash)

Create OpenSearch Dashboards (OSD) saved dashboards from declarative YAML, with no code required.

Dashboard YAML schema

name: my-dashboard              # Used for OSD saved object ID
title: My Service Dashboard     # Displayed in OSD UI
description: Monitoring for...
time_from: now-24h              # Default time range
refresh_interval: 60000         # Auto-refresh (ms)

header: |                       # Optional markdown header panel
  ## My Dashboard
  Tracking key metrics for my service.

metrics:                        # Metric tiles (auto-layout, 6 per row)
  - title: Total Requests
    query: '"my-service"'
    label: Requests

  - title: Errors
    query: '"my-service" AND level:"ERROR"'
    label: Errors

charts:                         # Line charts (half or full width)
  - title: Requests Over Time
    width: half                 # half (default) or full
    series:
      All: '"my-service"'
      Errors: '"my-service" AND level:"ERROR"'

  - title: Errors by Type
    width: half
    series:
      Timeout: '"my-service" AND "timeout"'
      Connection: '"my-service" AND "connection refused"'
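
To make the "auto-layout" comments in the schema concrete, here is one way the panel grid could be computed. This is a sketch under stated assumptions, not os-dash's actual algorithm: it assumes a 48-column dashboard grid (worth verifying for your OSD version), and the tile and chart heights are invented for illustration. `layoutPanels` is a hypothetical name, and the optional header panel is left out.

```typescript
// Hypothetical sketch of the metric/chart auto-layout.
type Panel = { title: string; x: number; y: number; w: number; h: number };
type ChartSpec = { title: string; width?: "half" | "full" };

const GRID_COLUMNS = 48; // assumed dashboard grid width
const TILES_PER_ROW = 6; // from the schema comment: 6 metric tiles per row
const TILE_HEIGHT = 6;   // assumed
const CHART_HEIGHT = 12; // assumed

function layoutPanels(metricTitles: string[], charts: ChartSpec[]): Panel[] {
  const panels: Panel[] = [];
  const tileWidth = GRID_COLUMNS / TILES_PER_ROW;

  // Metric tiles fill rows left to right, six per row.
  metricTitles.forEach((title, i) => {
    panels.push({
      title,
      x: (i % TILES_PER_ROW) * tileWidth,
      y: Math.floor(i / TILES_PER_ROW) * TILE_HEIGHT,
      w: tileWidth,
      h: TILE_HEIGHT,
    });
  });

  // Charts start below the last tile row; "half" is the default width.
  let y = Math.ceil(metricTitles.length / TILES_PER_ROW) * TILE_HEIGHT;
  let x = 0;
  for (const chart of charts) {
    const w = chart.width === "full" ? GRID_COLUMNS : GRID_COLUMNS / 2;
    if (x + w > GRID_COLUMNS) {
      // Not enough room left on this row: wrap to the next one.
      x = 0;
      y += CHART_HEIGHT;
    }
    panels.push({ title: chart.title, x, y, w, h: CHART_HEIGHT });
    x += w;
    if (x >= GRID_COLUMNS) {
      x = 0;
      y += CHART_HEIGHT;
    }
  }
  return panels;
}
```

With the example config above (two metrics, two half-width charts), this places both tiles on the first row and both charts side by side on the row beneath them.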

Commands

# Create/update a dashboard from YAML
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-dashboard

# Preview what would be created (no OSD calls)
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-dashboard --dry-run

# Delete a dashboard and its visualizations
node $OS_SCRIPTS/dist/os-dash.js --mode delete --config my-dashboard

YAML files live in scripts/dashboards/. The --config flag accepts a name (looked up in that directory) or a path to any YAML file.

Built-in dashboard configs

Config	Description
`langfuse-usage`	LangFuse LLM tracing: validations, timeouts, API key issues, per-service breakdown

Automated Alerts (os-alert)

Deploy OSD monitors that run on a schedule and send rich Google Chat Card v2 messages when thresholds are breached.

Alert profile YAML schema

name: my-alerts
description: Monitors for my service

destination:
  name: gchat-my-team          # Notification channel name
  type: gchat                  # Webhook URL read from GCHAT_WEBHOOK_URL or PROD_GCHAT_WEBHOOK

card_templates:
  default: |                   # Mustache template (OSD renders at alert time)
    { "cardsV2": [{ "cardId": "alert-{{ctx.monitor.name}}", "card": {
      "header": { "title": "{{ctx.trigger.name}}" },
      "sections": [{ "widgets": [
        { "decoratedText": { "topLabel": "Hits", "text": "{{ctx.results.0.hits.total.value}}" } }
      ]}]
    }}]}

monitors:
  error_spike:
    description: "Error rate exceeds threshold"
    schedule: { interval: 5, unit: MINUTES }
    query:
      query_string:
        query: '"my-service" AND level:"ERROR"'
    trigger:
      name: error-spike
      severity: 2
      condition: "ctx.results[0].hits.total.value > {{thresholds.errors}}"
    card_template: default
    throttle: { value: 10, unit: MINUTES }

  # Health check monitor — always fires, sends a summary card
  hourly_health:
    description: "Hourly status with aggregated metrics"
    schedule: { interval: 1, unit: HOURS }
    query:
      bool:
        must:
          - query_string: { query: '"my-service"' }
        aggs:                 # Aggregations hoisted to top level automatically
          errors:
            filter:
              term: { level.keyword: ERROR }
    trigger:
      name: hourly-health
      severity: 5
      condition: "ctx.results[0].hits.total.value >= 0"  # Always fires
    card_template: health_check  # Must match a key under card_templates (only `default` is shown above)
    throttle: { value: 55, unit: MINUTES }

thresholds:
  errors: 50                  # Pre-flight resolved before sending to OSD

defaults:
  env: production
  dashboard_url: "https://..."
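
For orientation, the sketch below shows roughly how one `monitors` entry from the profile might be turned into a monitor request body, including the "aggregations hoisted to top level" step noted in the YAML comment. It is an illustration, not os-alert's actual code: `hoistAggs` and `buildMonitor` are hypothetical names, the field layout follows the OpenSearch Alerting plugin's create-monitor request and should be checked against your cluster version, and the index list and destination ID are placeholders supplied by the caller.

```typescript
// Hypothetical sketch; not the skill's actual implementation.
type Json = { [key: string]: unknown };

type MonitorSpec = {
  description: string;
  schedule: { interval: number; unit: string };
  query: Json;
  trigger: { name: string; severity: number; condition: string };
  throttle: { value: number; unit: string };
};

// Pull an `aggs` block out of the profile query (either directly under `query`
// or nested inside `bool`) so it can sit beside `query` in the search body.
function hoistAggs(query: Json): { query: Json; aggs: Json | undefined } {
  const { aggs: topAggs, ...rest } = query;
  let aggs = topAggs as Json | undefined;
  const bool = rest["bool"];
  if (aggs === undefined && typeof bool === "object" && bool !== null) {
    const { aggs: nested, ...boolRest } = bool as Json;
    aggs = nested as Json | undefined;
    rest["bool"] = boolRest;
  }
  return { query: rest, aggs };
}

function buildMonitor(
  name: string,
  spec: MonitorSpec,
  indices: string[],
  destinationId: string,
  message: string
): Json {
  const { query, aggs } = hoistAggs(spec.query);
  const searchBody: Json = { size: 0, query };
  if (aggs !== undefined) searchBody["aggs"] = aggs;
  return {
    type: "monitor",
    name,
    monitor_type: "query_level_monitor",
    enabled: true,
    schedule: { period: spec.schedule },
    inputs: [{ search: { indices, query: searchBody } }],
    triggers: [
      {
        name: spec.trigger.name,
        severity: String(spec.trigger.severity),
        condition: { script: { source: spec.trigger.condition, lang: "painless" } },
        actions: [
          {
            name: `${name}-notify`,
            destination_id: destinationId,
            message_template: { source: message, lang: "mustache" },
            throttle_enabled: true,
            throttle: spec.throttle,
          },
        ],
      },
    ],
  };
}
```

Running os-alert with `--dry-run` prints the real payloads, which is the reliable way to see what is actually sent.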

Template layers

Pre-flight ({{thresholds.*}}, {{defaults.*}}): resolved by the skill's scripts before the payload is sent to OSD
OSD Mustache ({{ctx.*}}): resolved by OSD at alert time (monitor name, trigger, results)
Collision avoidance: pre-flight resolution never touches {{ctx.*}} expressions
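
The two layers can be sketched as a single substitution pass that skips anything under `ctx`. This is a minimal illustration, not the contents of template.ts: `resolvePreflight` and `lookup` are hypothetical names, and throwing on an unknown expression is an assumption about how unresolved placeholders are handled.

```typescript
// Hypothetical sketch of pre-flight template resolution.
type TemplateContext = { [key: string]: unknown };

// Follow a dotted path such as "thresholds.errors" through nested objects.
function lookup(context: TemplateContext, path: string): unknown {
  let current: unknown = context;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as TemplateContext)[key];
  }
  return current;
}

// Replace {{thresholds.*}} and {{defaults.*}} style expressions now;
// leave {{ctx.*}} untouched so OSD can render it at alert time.
function resolvePreflight(template: string, context: TemplateContext): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match: string, path: string) => {
    if (path === "ctx" || path.startsWith("ctx.")) return match;
    const value = lookup(context, path);
    if (value === undefined) {
      throw new Error(`Unresolved template expression: ${match}`);
    }
    return String(value);
  });
}
```

Applied to the `error_spike` condition with `thresholds.errors` set to 50, the result is `ctx.results[0].hits.total.value > 50`, which is what OSD would then evaluate.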

Commands

# Create all monitors from a profile
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts

# Dry-run (resolve templates, print payloads, don't call API)
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts --dry-run

# Create a single monitor
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts --monitor error_spike

# List deployed monitors
node $OS_SCRIPTS/dist/os-alert.js --mode list
node $OS_SCRIPTS/dist/os-alert.js --mode list --profile my-alerts

# Test: execute a monitor (dry-run on OSD)
node $OS_SCRIPTS/dist/os-alert.js --mode test --profile my-alerts --monitor error_spike

# Test: send a test card to the webhook
node $OS_SCRIPTS/dist/os-alert.js --mode test --profile my-alerts

# Send a live health report card to Google Chat
node $OS_SCRIPTS/dist/os-alert.js --mode report --hours 1

# Delete all monitors for a profile
node $OS_SCRIPTS/dist/os-alert.js --mode delete --profile my-alerts
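
For readers unfamiliar with the Card v2 format that `--mode report` and the alert templates produce, the sketch below builds a small card with the same `cardsV2` / `decoratedText` structure used in the profile template. The card contents are invented for illustration: `buildHealthCard` and `HealthStats` are hypothetical, and the real report card may contain different sections.

```typescript
// Hypothetical sketch; the actual report card layout is not documented here.
type HealthStats = { total: number; errors: number; windowHours: number };

function buildHealthCard(service: string, stats: HealthStats) {
  // Guard against division by zero when the window contains no requests.
  const errorRate = stats.total === 0 ? 0 : (stats.errors / stats.total) * 100;
  return {
    cardsV2: [
      {
        cardId: `health-${service}`,
        card: {
          header: { title: `${service} health`, subtitle: `Last ${stats.windowHours}h` },
          sections: [
            {
              widgets: [
                { decoratedText: { topLabel: "Requests", text: String(stats.total) } },
                { decoratedText: { topLabel: "Errors", text: String(stats.errors) } },
                { decoratedText: { topLabel: "Error rate", text: `${errorRate.toFixed(2)}%` } },
              ],
            },
          ],
        },
      },
    ],
  };
}
```

The returned object is what would be sent as the JSON body of a POST to the Google Chat webhook URL.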

Built-in alert profiles

Profile	Monitors	Description
`production-incidents`	10	500s, 503s, browser pool, Playwright, connection refused, worker shutdowns, OOM, ReadTimeout, unsupported PDS, hourly health check
`service-health`	3	Per-service 5xx errors, high latency, 4xx spikes (use with `--service`)
`langfuse-usage`	4	LangFuse health check, timeout spike, API key missing, no-activity detector

Log Search (os-search)

Query logs directly for ad-hoc investigation. Nine modes cover search, aggregation, and reporting.

Quick start

# Adaptive report — discovers what's interesting automatically
node $OS_SCRIPTS/dist/os-search.js --mode report --hours 12

# Health dashboard
node $OS_SCRIPTS/dist/os-search.js --mode dashboard --hours 1

# Per-service health table
node $OS_SCRIPTS/dist/os-search.js --mode services --hours 1

# Free-text search (works with any log pattern)
node $OS_SCRIPTS/dist/os-search.js --query '"your-search-term"' --hours 4

# Error breakdown (by service, status code, endpoint)
node $OS_SCRIPTS/dist/os-search.js --mode errors --hours 1

# Latency analysis (p50/p95/p99)
node $OS_SCRIPTS/dist/os-search.js --mode latency --hours 4

# Histogram of events over time
node $OS_SCRIPTS/dist/os-search.js --query '"Connection refused"' --mode histogram --hours 24

# Trace a request by correlation ID
node $OS_SCRIPTS/dist/os-search.js --correlation-id "abc-123-def" --full

Flags

Flag	Short	Description
`--env`		`production` (default) or `staging`
`--query`	`-q`	Free-text OpenSearch query string
`--service`	`-s`	Filter by service alias
`--level`	`-l`	Filter by log level: `ERROR`, `WARNING`, `INFO`
`--status`		HTTP status code filter: `500`, `5xx`, `4xx`, `>=400`
`--correlation-id`		Trace a correlation ID
`--hours`		Look back N hours (default: 1)
`--from` / `--to`		Explicit time range (ISO 8601)
`--mode`	`-m`	`search`, `count`, `histogram`, `timeline`, `dashboard`, `latency`, `errors`, `services`, `report`
`--limit`	`-n`	Max results (default: 20)
`--interval`		Histogram bucket size (default: `1h`)
`--json`		JSON output
`--full`		No truncation
`--link`		Print shareable OSD URL
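
To show how flags such as `--hours`, `--level`, and `--status` might map onto OpenSearch filter clauses, here is a minimal sketch. It is not os-search's actual query builder: `parseStatusFilter` and `buildFilters` are hypothetical names, and the `@timestamp` and `status_code` field names are assumptions (`level.keyword` appears in the alert profile example).

```typescript
// Hypothetical sketch of flag-to-filter translation.
type RangeFilter = { gte?: number; lte?: number };

// Accepts the forms listed for --status: "500", "5xx", "4xx", ">=400".
function parseStatusFilter(status: string): RangeFilter {
  const s = status.trim();
  let m = /^([1-5])xx$/i.exec(s);
  if (m) {
    const base = Number(m[1]) * 100;
    return { gte: base, lte: base + 99 };
  }
  m = /^>=(\d{3})$/.exec(s);
  if (m) return { gte: Number(m[1]) };
  if (/^\d{3}$/.test(s)) {
    const code = Number(s);
    return { gte: code, lte: code };
  }
  throw new Error(`Unsupported --status value: ${status}`);
}

// Build the filter clauses for a bool query; --hours defaults to 1 as documented.
function buildFilters(flags: { hours?: number; level?: string; status?: string }): object[] {
  const filters: object[] = [
    { range: { "@timestamp": { gte: `now-${flags.hours ?? 1}h`, lte: "now" } } },
  ];
  if (flags.level) filters.push({ term: { "level.keyword": flags.level } });
  if (flags.status) filters.push({ range: { status_code: parseStatusFilter(flags.status) } });
  return filters;
}
```

The resulting array would sit under `bool.filter`, alongside a `query_string` clause for any `--query` text.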

Modes

Mode	Description
`search`	Return matching log lines (default)
`count`	Count matching documents
`histogram`	Event counts bucketed by time
`timeline`	Chronological events (ascending)
`dashboard`	Health overview: global checks + per-service errors + latency
`latency`	p50/p95/p99 overall, per-service, over time
`errors`	Breakdown by status code, service, endpoint, error type + samples
`services`	Per-service health table
`report`	Adaptive report — discovers what's interesting, only shows relevant findings
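
The `histogram` and `latency` modes correspond naturally to two standard OpenSearch aggregations, `date_histogram` and `percentiles`. The sketch below shows what such request fragments could look like; the aggregation names and the `@timestamp` and `duration_ms` fields are assumptions, and `buildAggregation` is a hypothetical helper rather than part of os-search.

```typescript
// Hypothetical sketch; field names are assumed, not taken from the skill.
function buildAggregation(mode: "histogram" | "latency", interval: string = "1h"): object {
  if (mode === "histogram") {
    // Event counts bucketed by time; --interval defaults to 1h as documented.
    return {
      over_time: { date_histogram: { field: "@timestamp", fixed_interval: interval } },
    };
  }
  // p50/p95/p99 latency over the matching documents.
  return {
    latency: { percentiles: { field: "duration_ms", percents: [50, 95, 99] } },
  };
}
```

Either fragment would be placed under `aggs` in a search body with `size: 0`, so only the aggregation results are returned.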

Monitoring (os-monitor)

Pre-built monitoring views for deploys and incidents.

# Post-deploy health check (compares current vs baseline)
node $OS_SCRIPTS/dist/os-monitor.js --mode post-deploy --minutes 15

# Incident watchboard
node $OS_SCRIPTS/dist/os-monitor.js --mode incident-watch --hours 1

# Service spotlight
node $OS_SCRIPTS/dist/os-monitor.js --mode spotlight --service deal_structure --hours 1

# Compare two time windows
node $OS_SCRIPTS/dist/os-monitor.js --mode compare \
  --from1 2026-02-25T13:00Z --to1 2026-02-25T17:00Z \
  --from2 2026-02-25T17:30Z --to2 2026-02-25T18:00Z

# Continuous monitoring (re-runs every N seconds)
node $OS_SCRIPTS/dist/os-monitor.js --mode post-deploy --watch 60
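
The "current vs baseline" comparison behind `post-deploy` and `compare` can be reduced to comparing error rates across two windows. The sketch below is one plausible rule, not os-monitor's actual logic: `compareWindows` is a hypothetical name and the 1.5x tolerance is an invented default.

```typescript
// Hypothetical sketch of a baseline comparison.
type WindowStats = { requests: number; errors: number };
type Verdict = { baselineRate: number; currentRate: number; regressed: boolean };

function errorRate(w: WindowStats): number {
  return w.requests === 0 ? 0 : w.errors / w.requests;
}

// Flag a regression when the current error rate exceeds the baseline rate
// by more than `tolerance` times. With an error-free baseline, any error counts.
function compareWindows(
  baseline: WindowStats,
  current: WindowStats,
  tolerance: number = 1.5
): Verdict {
  const baselineRate = errorRate(baseline);
  const currentRate = errorRate(current);
  const regressed =
    baselineRate === 0 ? currentRate > 0 : currentRate > baselineRate * tolerance;
  return { baselineRate, currentRate, regressed };
}
```

Comparing rates rather than raw counts keeps the check meaningful when the two windows differ in length, as they do in the `--from1`/`--from2` example above.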

Playbooks (os-playbook)

DAG-based multi-step query orchestration. Independent steps run in parallel; dependent steps wait for their prerequisites and receive their results, and failed steps are retried.

# Broad incident triage
node $OS_SCRIPTS/dist/os-playbook.js --playbook incident-triage --hours 1

# Service deep dive
node $OS_SCRIPTS/dist/os-playbook.js --playbook service-deep-dive --service deal_structure --hours 2

# Create an OSD dashboard from playbook results
node $OS_SCRIPTS/dist/os-playbook.js --playbook incident-triage --hours 1 --dashboard

Built-in playbooks

Playbook	Steps	Description
`incident-triage`	5	Patterns, errors, services, latency, error samples
`service-deep-dive`	6	Count, errors, latency, timeline, 500 samples, histogram
`post-deploy-validation`	4	Patterns, errors, latency, services
`error-investigation`	6	Patterns, error breakdown, histogram, latency, timeline, samples
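
The DAG execution model described for os-playbook can be sketched as grouping steps into "waves": every step in a wave has all of its dependencies satisfied by earlier waves, so the wave can run in parallel. The code below is a minimal illustration, not os-playbook's implementation; `planWaves`, `runPlaybook`, the `dependsOn` field, and the retry count are all assumptions.

```typescript
// Hypothetical sketch of dependency-ordered, parallel step execution.
type Step = { id: string; dependsOn?: string[] };
type RunnableStep = Step & {
  run: (results: Record<string, unknown>) => Promise<unknown>;
};

// Group step IDs into waves that can each run in parallel.
function planWaves(steps: Step[]): string[][] {
  const ids = steps.map((s) => s.id);
  for (const s of steps) {
    for (const d of s.dependsOn ?? []) {
      if (ids.indexOf(d) === -1) {
        throw new Error(`Step "${s.id}" depends on unknown step "${d}"`);
      }
    }
  }
  const done: string[] = [];
  let pending = steps.slice();
  const waves: string[][] = [];
  while (pending.length > 0) {
    const ready = pending.filter((s) =>
      (s.dependsOn ?? []).every((d) => done.indexOf(d) !== -1)
    );
    if (ready.length === 0) throw new Error("Cycle detected among playbook steps");
    waves.push(ready.map((s) => s.id));
    for (const s of ready) done.push(s.id);
    pending = pending.filter((s) => ready.indexOf(s) === -1);
  }
  return waves;
}

// Run each wave with Promise.all, retrying a failed step up to maxAttempts times.
// Each step receives the results collected from earlier waves.
async function runPlaybook(
  steps: RunnableStep[],
  maxAttempts: number = 3
): Promise<Record<string, unknown>> {
  const byId: Record<string, RunnableStep> = {};
  for (const s of steps) byId[s.id] = s;
  const results: Record<string, unknown> = {};
  for (const wave of planWaves(steps)) {
    await Promise.all(
      wave.map(async (id) => {
        for (let attempt = 1; ; attempt++) {
          try {
            results[id] = await byId[id].run(results);
            return;
          } catch (err) {
            if (attempt >= maxAttempts) throw err;
          }
        }
      })
    );
  }
  return results;
}
```

For example, if an "error samples" step depended on an "errors" step while "patterns" had no dependencies, the first wave would run "patterns" and "errors" together and the second wave would run "error samples" with the earlier results available.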

Debugging Decision Tree

1. Something is broken — start broad

node $OS_SCRIPTS/dist/os-search.js --mode report --hours 12

2. Narrow to a service

node $OS_SCRIPTS/dist/os-search.js --mode errors --service <service> --hours 1
node $OS_SCRIPTS/dist/os-search.js --service <service> --level ERROR --hours 1 --full

3. Build a timeline

node $OS_SCRIPTS/dist/os-search.js --service <service> --mode timeline --from "<start>" --to "<end>"
node $OS_SCRIPTS/dist/os-search.js --correlation-id "<id>" --full

4. Check latency

node $OS_SCRIPTS/dist/os-search.js --mode latency --service <service> --hours 4

5. Collect evidence and alert

# Create a dashboard from YAML for the issue
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-investigation

# Set up monitors so it doesn't happen again
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-service-alerts

# Send a health report to the team
node $OS_SCRIPTS/dist/os-alert.js --mode report --hours 1

Use Cases Beyond Debugging

Feature readiness: "Create a monitor that tracks error rates for my new endpoint — when can we safely enable the feature flag?"
Post-deploy validation: "Watch error rates for 15 minutes after deploy, compare to baseline"
SLA monitoring: "Alert when p95 latency exceeds 5s for 3 consecutive checks"
Capacity planning: "Dashboard showing request volume trends by service over 7 days"
Incident response: "Triage playbook → evidence dashboard → Google Chat alert to the team"
Integration health: "Monitor LangFuse/Salesforce/BigQuery integration health with dedicated dashboards"

Architecture

The skill provides 6 CLI tools backed by a shared repository layer:

os-search     Search, count, histogram, timeline, dashboard, errors, latency, services, report
os-monitor    Post-deploy, incident-watch, spotlight, compare (with --watch for continuous)
os-playbook   DAG-based multi-step queries with parallel execution + dashboard generation
os-alert      YAML-driven OSD monitor management + Google Chat Card v2 alerts + health reports
os-dash       YAML-driven OSD dashboard generator (metrics, charts, markdown — auto-layout)
create-dashboards   Static production health dashboard (hardcoded for Python microservices)

All tools share:

OpenSearchRepository — data API (search, alerting, notifications) + dashboards API (saved objects)
config.ts — credential loading, index resolution, time range parsing
template.ts — {{expression}} resolution with ctx.* pass-through for OSD Mustache

Compatible with OpenSearch 2.x clusters using basic auth. Tested on DigitalOcean Managed OpenSearch and AWS ConveyorCloud (ServiceCatalog) deployments.