opensearch-logs



By dipseth. Updated 3/2/2026

name: opensearch-logs
description: >-
  OpenSearch observability toolkit: search logs, create dashboards, deploy
  automated monitors with Google Chat alerts. Works with any OpenSearch 2.x
  cluster. Use when building observability for a service, debugging production
  issues, creating monitoring dashboards, or setting up automated alerting.
user_invocable: true
args: Optional flags like "--mode report --hours 6" or "--service deal_structure --level ERROR"

OpenSearch Observability Toolkit

What This Skill Does

A complete observability toolkit for any OpenSearch 2.x cluster. The primary workflow is zero-code: discover what's in your logs, create dashboards and alerts from YAML configs, and get Google Chat notifications when things go wrong.

Core Flow: Discover → Dashboard → Alert

1. DISCOVER    os-search --query '"your-pattern"' --hours 24
               os-search --mode errors --hours 4
               → Understand what's in your logs

2. DASHBOARD   Write a YAML config (metrics, charts, queries)
               os-dash --mode create --config my-dashboard
               → OSD dashboard appears with live visualizations

3. ALERT       Write a YAML config (monitors, thresholds, webhooks)
               os-alert --mode create --profile my-alerts
               → OSD monitors run on schedule, fire Google Chat cards

4. REPORT      os-alert --mode report --hours 1
               → Rich health card sent to Google Chat on demand

No TypeScript required for steps 2-4 — just YAML files and CLI commands.

Setup

1. Install and build

cd .claude/skills/opensearch-logs/scripts && pnpm install && pnpm build

2. Configure credentials

cp .env.example .env   # then fill in your values

Credentials are loaded automatically in this order:

  1. Local .env (in the scripts/ directory) — recommended
  2. Home-dir fallback: ~/.opensearch-env
  3. Environment variables: OPENSEARCH_HOST, OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD

See .env.example for all available variables.
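The loading order above can be modelled as a small pure function. This is a minimal TypeScript sketch, not the skill's actual config.ts: `parseEnvText`, `resolveCredentials`, and `ClusterCredentials` are hypothetical names, and it assumes that the first source in the list that supplies a non-empty value wins and that the .env files use the same variable names as the environment.

```typescript
// Hypothetical sketch; names are illustrative and not taken from config.ts.
interface ClusterCredentials {
  host: string;
  username: string;
  password: string;
}

type EnvMap = Record<string, string | undefined>;

// Parse KEY=VALUE lines from .env-style text, skipping blanks and # comments.
function parseEnvText(text: string): EnvMap {
  const result: EnvMap = {};
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    const key = line.slice(0, eq).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const value = line.slice(eq + 1).trim().replace(/^(["'])(.*)\1$/, "$2");
    result[key] = value;
  }
  return result;
}

// Walk the sources in the documented order; the first non-empty value wins
// (assumption: the documented order is a precedence order).
function resolveCredentials(sources: EnvMap[]): ClusterCredentials {
  const pick = (key: string): string => {
    for (const source of sources) {
      const value = source[key];
      if (value !== undefined && value !== "") return value;
    }
    throw new Error("Missing required setting: " + key);
  };
  return {
    host: pick("OPENSEARCH_HOST"),
    username: pick("OPENSEARCH_USERNAME"),
    password: pick("OPENSEARCH_PASSWORD"),
  };
}
```

A caller would pass the parsed local .env first, then the parsed ~/.opensearch-env, then the process environment.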

Optional config for non-default clusters (e.g., AWS ConveyorCloud)

| Env Var | Default | Purpose |
| --- | --- | --- |
| OPENSEARCH_DATA_PORT | 25060 | Data API port (AWS OpenSearch uses 443) |
| OPENSEARCH_DASHBOARDS_PORT | 443 | Dashboards API port |
| OPENSEARCH_INDEX_PREFIX | python-services | Index name prefix (override per team) |
| OPENSEARCH_TENANT | global | Security tenant for FGAC-enabled clusters |
| GCHAT_WEBHOOK_URL or PROD_GCHAT_WEBHOOK | | Google Chat webhook for alerts/reports |
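To illustrate how these optional variables and their defaults combine, here is a hedged TypeScript sketch. `readClusterSettings` and `ClusterSettings` are hypothetical names, not part of the skill; only the variable names and default values come from the list above, and checking GCHAT_WEBHOOK_URL before PROD_GCHAT_WEBHOOK is an assumption.

```typescript
// Hypothetical sketch; only the variable names and defaults come from the document.
interface ClusterSettings {
  dataPort: number;
  dashboardsPort: number;
  indexPrefix: string;
  tenant: string;
  webhookUrl: string | undefined;
}

function readClusterSettings(env: Record<string, string | undefined>): ClusterSettings {
  // Parse a port, falling back to the default when the variable is unset.
  const toPort = (raw: string | undefined, fallback: number): number => {
    if (raw === undefined || raw === "") return fallback;
    const port = parseInt(raw, 10);
    if (isNaN(port) || port < 1 || port > 65535) {
      throw new Error("Invalid port: " + raw);
    }
    return port;
  };
  return {
    dataPort: toPort(env["OPENSEARCH_DATA_PORT"], 25060),
    dashboardsPort: toPort(env["OPENSEARCH_DASHBOARDS_PORT"], 443),
    indexPrefix: env["OPENSEARCH_INDEX_PREFIX"] || "python-services",
    tenant: env["OPENSEARCH_TENANT"] || "global",
    // Assumption: GCHAT_WEBHOOK_URL is preferred when both are set.
    webhookUrl: env["GCHAT_WEBHOOK_URL"] || env["PROD_GCHAT_WEBHOOK"],
  };
}
```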

3. Set the scripts path

# Run from the repository root; the path below is relative to it
export OS_SCRIPTS=.claude/skills/opensearch-logs/scripts

4. Verify

node $OS_SCRIPTS/dist/os-search.js --mode count --hours 1

YAML-Driven Dashboards (os-dash)

Create OSD saved dashboards from declarative YAML — no code required.

Dashboard YAML schema

name: my-dashboard              # Used for OSD saved object ID
title: My Service Dashboard     # Displayed in OSD UI
description: Monitoring for...
time_from: now-24h              # Default time range
refresh_interval: 60000         # Auto-refresh (ms)

header: |                       # Optional markdown header panel
  ## My Dashboard
  Tracking key metrics for my service.

metrics:                        # Metric tiles (auto-layout, 6 per row)
  - title: Total Requests
    query: '"my-service"'
    label: Requests

  - title: Errors
    query: '"my-service" AND level:"ERROR"'
    label: Errors

charts:                         # Line charts (half or full width)
  - title: Requests Over Time
    width: half                 # half (default) or full
    series:
      All: '"my-service"'
      Errors: '"my-service" AND level:"ERROR"'

  - title: Errors by Type
    width: half
    series:
      Timeout: '"my-service" AND "timeout"'
      Connection: '"my-service" AND "connection refused"'
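The auto-layout mentioned in the schema comments (six metric tiles per row, charts at half or full width) can be sketched as follows. This is an illustrative TypeScript sketch and not os-dash's actual code: the 48-column grid, the panel heights, and the names `layoutPanels`, `GridPanel`, and `ChartSpec` are assumptions.

```typescript
// Hypothetical layout sketch; grid width and heights are assumed values.
interface GridPanel {
  title: string;
  x: number;
  y: number;
  w: number;
  h: number;
}

interface ChartSpec {
  title: string;
  width?: "half" | "full"; // half is the default
}

const GRID_COLUMNS = 48;   // assumed dashboard grid width
const METRICS_PER_ROW = 6; // from the schema comment
const METRIC_HEIGHT = 6;   // illustrative
const CHART_HEIGHT = 12;   // illustrative

function layoutPanels(metricTitles: string[], charts: ChartSpec[]): GridPanel[] {
  const panels: GridPanel[] = [];
  const metricWidth = GRID_COLUMNS / METRICS_PER_ROW;

  // Metric tiles fill rows left to right, six per row.
  metricTitles.forEach((title, i) => {
    panels.push({
      title: title,
      x: (i % METRICS_PER_ROW) * metricWidth,
      y: Math.floor(i / METRICS_PER_ROW) * METRIC_HEIGHT,
      w: metricWidth,
      h: METRIC_HEIGHT,
    });
  });

  // Charts start below the last metric row and wrap when a row is full.
  let x = 0;
  let y = Math.ceil(metricTitles.length / METRICS_PER_ROW) * METRIC_HEIGHT;
  for (const chart of charts) {
    const w = chart.width === "full" ? GRID_COLUMNS : GRID_COLUMNS / 2;
    if (x + w > GRID_COLUMNS) {
      x = 0;
      y += CHART_HEIGHT;
    }
    panels.push({ title: chart.title, x: x, y: y, w: w, h: CHART_HEIGHT });
    x += w;
    if (x >= GRID_COLUMNS) {
      x = 0;
      y += CHART_HEIGHT;
    }
  }
  return panels;
}
```

With the two metrics and two half-width charts from the example config, the tiles share the first row and the charts sit side by side on the row beneath them.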

Commands

# Create/update a dashboard from YAML
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-dashboard

# Preview what would be created (no OSD calls)
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-dashboard --dry-run

# Delete a dashboard and its visualizations
node $OS_SCRIPTS/dist/os-dash.js --mode delete --config my-dashboard

YAML files live in scripts/dashboards/. The --config flag accepts a name (looked up in that directory) or a path to any YAML file.
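One way the name-or-path behaviour of --config could work is sketched here. The rule is an assumption, not the tool's confirmed logic: anything containing a path separator or ending in .yaml/.yml is treated as a path, and anything else is a bare name with a .yaml suffix appended. `resolveConfigPath` is a hypothetical name.

```typescript
// Hypothetical sketch of name-or-path resolution for --config.
function resolveConfigPath(configArg: string, dashboardsDir: string): string {
  // Assumed rule: a separator or a YAML extension means "this is a path".
  const looksLikePath =
    configArg.indexOf("/") !== -1 || /\.ya?ml$/i.test(configArg);
  if (looksLikePath) return configArg;
  // Otherwise look the bare name up in the dashboards directory.
  return dashboardsDir.replace(/\/+$/, "") + "/" + configArg + ".yaml";
}
```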

Built-in dashboard configs

Config Description
langfuse-usage LangFuse LLM tracing: validations, timeouts, API key issues, per-service breakdown

Automated Alerts (os-alert)

Deploy OSD monitors that run on a schedule and send rich Google Chat Card v2 messages when thresholds are breached.

Alert profile YAML schema

name: my-alerts
description: Monitors for my service

destination:
  name: gchat-my-team          # Notification channel name
  type: gchat                  # Webhook URL from env var

card_templates:
  default: |                   # Mustache template (OSD renders at alert time)
    { "cardsV2": [{ "cardId": "alert-{{ctx.monitor.name}}", "card": {
      "header": { "title": "{{ctx.trigger.name}}" },
      "sections": [{ "widgets": [
        { "decoratedText": { "topLabel": "Hits", "text": "{{ctx.results.0.hits.total.value}}" } }
      ]}]
    }}]}

monitors:
  error_spike:
    description: "Error rate exceeds threshold"
    schedule: { interval: 5, unit: MINUTES }
    query:
      query_string:
        query: '"my-service" AND level:"ERROR"'
    trigger:
      name: error-spike
      severity: 2
      condition: "ctx.results[0].hits.total.value > {{thresholds.errors}}"
    card_template: default
    throttle: { value: 10, unit: MINUTES }

  # Health check monitor — always fires, sends a summary card
  hourly_health:
    description: "Hourly status with aggregated metrics"
    schedule: { interval: 1, unit: HOURS }
    query:
      bool:
        must:
          - query_string: { query: '"my-service"' }
        aggs:                 # Aggregations hoisted to top level automatically
          errors:
            filter:
              term: { level.keyword: ERROR }
    trigger:
      name: hourly-health
      severity: 5
      condition: "ctx.results[0].hits.total.value >= 0"  # Always fires
    card_template: health_check   # Template body not shown in this example
    throttle: { value: 55, unit: MINUTES }

thresholds:
  errors: 50                  # Pre-flight resolved before sending to OSD

defaults:
  env: production
  dashboard_url: "https://..."
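The hourly_health monitor above nests aggs inside the query, and the comment says they are hoisted to the top level. A minimal TypeScript sketch of such a hoisting step follows. It is an assumption about how this could be done, not the tool's code, and `hoistAggs`, `SearchBody`, and `JsonObject` are hypothetical names.

```typescript
// Hypothetical sketch of moving nested aggs to the top of a search body.
type JsonObject = { [key: string]: unknown };

interface SearchBody {
  query: JsonObject;
  aggs?: JsonObject;
}

function isJsonObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Accepts the monitor's `query` mapping and returns a body where an `aggs`
// block found directly under the query, or one level down (for example under
// `bool`), sits next to `query` instead. The input is not mutated.
function hoistAggs(query: JsonObject): SearchBody {
  const cleanedQuery: JsonObject = {};
  let aggs: JsonObject | undefined;
  for (const key of Object.keys(query)) {
    const clause = query[key];
    if (!isJsonObject(clause)) {
      cleanedQuery[key] = clause;
      continue;
    }
    if (key === "aggs") {
      aggs = clause;
      continue;
    }
    const nested = clause["aggs"];
    if (isJsonObject(nested)) {
      // Copy every key except `aggs` back into the query clause.
      const rest: JsonObject = {};
      for (const innerKey of Object.keys(clause)) {
        if (innerKey !== "aggs") rest[innerKey] = clause[innerKey];
      }
      aggs = nested;
      cleanedQuery[key] = rest;
    } else {
      cleanedQuery[key] = clause;
    }
  }
  const body: SearchBody = { query: cleanedQuery };
  if (aggs !== undefined) body.aggs = aggs;
  return body;
}
```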

Template layers

  1. Pre-flight ({{thresholds.*}}, {{defaults.*}}): resolved by our code before sending to OSD
  2. OSD Mustache ({{ctx.*}}): resolved by OSD at alert time (monitor name, trigger, results)
  3. Collision avoidance: {{ctx.*}} expressions are never touched by pre-flight resolution
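The two layers can be illustrated with a short TypeScript sketch. This is not the skill's template.ts: `resolvePreflight` and `lookupPath` are hypothetical names, and throwing on an unknown thresholds/defaults path is an assumed design choice. The key point is that only expressions rooted at thresholds or defaults are substituted, while everything else, including {{ctx.*}}, is returned unchanged for OSD to render later.

```typescript
// Hypothetical sketch of pre-flight resolution with ctx.* pass-through.
type TemplateScope = { [key: string]: unknown };

// Look up a dotted path such as "thresholds.errors" in a nested object.
function lookupPath(scope: TemplateScope, path: string): unknown {
  let current: unknown = scope;
  for (const part of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as TemplateScope)[part];
  }
  return current;
}

function resolvePreflight(template: string, scope: TemplateScope): string {
  return template.replace(
    /\{\{\s*([\w.]+)\s*\}\}/g,
    (match: string, path: string): string => {
      const root = path.split(".")[0];
      // Leave ctx.* and any other non-pre-flight expression exactly as written.
      if (root !== "thresholds" && root !== "defaults") return match;
      const value = lookupPath(scope, path);
      if (value === undefined) {
        throw new Error("Unresolved template expression: " + match);
      }
      return String(value);
    }
  );
}
```

Applied to the error_spike condition with thresholds.errors set to 50, the expression becomes `ctx.results[0].hits.total.value > 50`, while a card title such as `{{ctx.trigger.name}}` passes through untouched.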

Commands

# Create all monitors from a profile
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts

# Dry-run (resolve templates, print payloads, don't call API)
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts --dry-run

# Create a single monitor
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts --monitor error_spike

# List deployed monitors
node $OS_SCRIPTS/dist/os-alert.js --mode list
node $OS_SCRIPTS/dist/os-alert.js --mode list --profile my-alerts

# Test: execute a monitor (dry-run on OSD)
node $OS_SCRIPTS/dist/os-alert.js --mode test --profile my-alerts --monitor error_spike

# Test: send a test card to the webhook
node $OS_SCRIPTS/dist/os-alert.js --mode test --profile my-alerts

# Send a live health report card to Google Chat
node $OS_SCRIPTS/dist/os-alert.js --mode report --hours 1

# Delete all monitors for a profile
node $OS_SCRIPTS/dist/os-alert.js --mode delete --profile my-alerts
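For orientation, a health report card has the same cardsV2 shape as the card template shown in the profile schema. The TypeScript sketch below builds such a payload from a few counters. `buildHealthCard` and `HealthSummary` are hypothetical, and the widget labels are illustrative rather than what os-alert actually sends.

```typescript
// Hypothetical sketch of a Google Chat cardsV2 payload for a health report.
interface HealthSummary {
  title: string;
  totalEvents: number;
  errorCount: number;
}

function buildHealthCard(summary: HealthSummary): object {
  // Guard against division by zero when the window has no events.
  const errorRate =
    summary.totalEvents === 0
      ? 0
      : (summary.errorCount / summary.totalEvents) * 100;
  return {
    cardsV2: [
      {
        cardId: "health-report",
        card: {
          header: { title: summary.title },
          sections: [
            {
              widgets: [
                { decoratedText: { topLabel: "Events", text: String(summary.totalEvents) } },
                { decoratedText: { topLabel: "Errors", text: String(summary.errorCount) } },
                { decoratedText: { topLabel: "Error rate", text: errorRate.toFixed(1) + "%" } },
              ],
            },
          ],
        },
      },
    ],
  };
}
```

The resulting object would be serialized with JSON.stringify and sent as the body of a POST to the webhook URL.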

Built-in alert profiles

| Profile | Monitors | Description |
| --- | --- | --- |
| production-incidents | 10 | 500s, 503s, browser pool, Playwright, connection refused, worker shutdowns, OOM, ReadTimeout, unsupported PDS, hourly health check |
| service-health | 3 | Per-service 5xx errors, high latency, 4xx spikes (use with --service) |
| langfuse-usage | 4 | LangFuse health check, timeout spike, API key missing, no-activity detector |

Log Search (os-search)

Query logs directly for ad-hoc investigation. Nine modes cover search, aggregation, and reporting.

Quick start

# Adaptive report — discovers what's interesting automatically
node $OS_SCRIPTS/dist/os-search.js --mode report --hours 12

# Health dashboard
node $OS_SCRIPTS/dist/os-search.js --mode dashboard --hours 1

# Per-service health table
node $OS_SCRIPTS/dist/os-search.js --mode services --hours 1

# Free-text search (works with any log pattern)
node $OS_SCRIPTS/dist/os-search.js --query '"your-search-term"' --hours 4

# Error breakdown (by service, status code, endpoint)
node $OS_SCRIPTS/dist/os-search.js --mode errors --hours 1

# Latency analysis (p50/p95/p99)
node $OS_SCRIPTS/dist/os-search.js --mode latency --hours 4

# Histogram of events over time
node $OS_SCRIPTS/dist/os-search.js --query '"Connection refused"' --mode histogram --hours 24

# Trace a request by correlation ID
node $OS_SCRIPTS/dist/os-search.js --correlation-id "abc-123-def" --full

Flags

| Flag | Short | Description |
| --- | --- | --- |
| --env | | production (default) or staging |
| --query | -q | Free-text OpenSearch query string |
| --service | -s | Filter by service alias |
| --level | -l | Filter by log level: ERROR, WARNING, INFO |
| --status | | HTTP status code filter: 500, 5xx, 4xx, >=400 |
| --correlation-id | | Trace a correlation ID |
| --hours | | Look back N hours (default: 1) |
| --from / --to | | Explicit time range (ISO 8601) |
| --mode | -m | search, count, histogram, timeline, dashboard, latency, errors, services, report |
| --limit | -n | Max results (default: 20) |
| --interval | | Histogram bucket size (default: 1h) |
| --json | | JSON output |
| --full | | No truncation |
| --link | | Print shareable OSD URL |
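The --status flag accepts several shapes (500, 5xx, 4xx, >=400). A hedged TypeScript sketch of how such values could be turned into range bounds follows. `parseStatusFilter` and `StatusBounds` are hypothetical names, and only the forms listed for the flag are handled.

```typescript
// Hypothetical sketch: map a --status value to inclusive range bounds.
interface StatusBounds {
  gte?: number;
  lte?: number;
}

function parseStatusFilter(filter: string): StatusBounds {
  const text = filter.trim().toLowerCase();

  // Status class such as "5xx" covers 500 through 599.
  const classMatch = /^([1-5])xx$/.exec(text);
  if (classMatch) {
    const base = parseInt(classMatch[1], 10) * 100;
    return { gte: base, lte: base + 99 };
  }

  // Open-ended lower bound such as ">=400".
  const lowerMatch = /^>=\s*(\d{3})$/.exec(text);
  if (lowerMatch) {
    return { gte: parseInt(lowerMatch[1], 10) };
  }

  // Exact code such as "500".
  if (/^\d{3}$/.test(text)) {
    const code = parseInt(text, 10);
    return { gte: code, lte: code };
  }

  throw new Error("Unsupported status filter: " + filter);
}
```

The returned bounds have the shape of an OpenSearch range clause, so they could be placed under a range query on whichever field holds the status code in your index.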

Modes

| Mode | Description |
| --- | --- |
| search | Return matching log lines (default) |
| count | Count matching documents |
| histogram | Event counts bucketed by time |
| timeline | Chronological events (ascending) |
| dashboard | Health overview: global checks + per-service errors + latency |
| latency | p50/p95/p99 overall, per-service, over time |
| errors | Breakdown by status code, service, endpoint, error type + samples |
| services | Per-service health table |
| report | Adaptive report: discovers what's interesting, only shows relevant findings |
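To give a feel for what the histogram and latency modes might send to the cluster, the TypeScript sketch below builds two OpenSearch search bodies using the date_histogram and percentiles aggregations. It is an illustration and not the tool's query builder: the function names are hypothetical, and the field names `@timestamp` and `duration_ms` are assumptions that depend on your index mapping.

```typescript
// Hypothetical sketch; field names are assumed, adjust to your mapping.
const TIMESTAMP_FIELD = "@timestamp";
const LATENCY_FIELD = "duration_ms";

// Restrict a search to the last N hours.
function lastHoursFilter(hours: number): object {
  return { range: { [TIMESTAMP_FIELD]: { gte: "now-" + hours + "h", lte: "now" } } };
}

// Event counts bucketed by time for a free-text query.
function buildHistogramBody(queryString: string, hours: number, interval: string): object {
  return {
    size: 0, // only the buckets are needed, not the hits
    query: {
      bool: {
        must: [{ query_string: { query: queryString } }],
        filter: [lastHoursFilter(hours)],
      },
    },
    aggs: {
      events_over_time: {
        date_histogram: { field: TIMESTAMP_FIELD, fixed_interval: interval },
      },
    },
  };
}

// p50/p95/p99 over the same kind of time window.
function buildLatencyBody(hours: number): object {
  return {
    size: 0,
    query: { bool: { filter: [lastHoursFilter(hours)] } },
    aggs: {
      latency_percentiles: {
        percentiles: { field: LATENCY_FIELD, percents: [50, 95, 99] },
      },
    },
  };
}
```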

Monitoring (os-monitor)

Pre-built monitoring views for deploys and incidents.

# Post-deploy health check (compares current vs baseline)
node $OS_SCRIPTS/dist/os-monitor.js --mode post-deploy --minutes 15

# Incident watchboard
node $OS_SCRIPTS/dist/os-monitor.js --mode incident-watch --hours 1

# Service spotlight
node $OS_SCRIPTS/dist/os-monitor.js --mode spotlight --service deal_structure --hours 1

# Compare two time windows
node $OS_SCRIPTS/dist/os-monitor.js --mode compare \
  --from1 2026-02-25T13:00Z --to1 2026-02-25T17:00Z \
  --from2 2026-02-25T17:30Z --to2 2026-02-25T18:00Z

# Continuous monitoring (re-runs every N seconds)
node $OS_SCRIPTS/dist/os-monitor.js --mode post-deploy --watch 60
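The post-deploy and compare modes both boil down to comparing a current window against a baseline. A minimal TypeScript sketch of that comparison is given here. The names `compareToBaseline`, `WindowStats`, and `DeployVerdict`, and the idea of an absolute tolerance on the error rate, are assumptions for illustration and not os-monitor's actual logic.

```typescript
// Hypothetical sketch of a baseline-versus-current error-rate comparison.
interface WindowStats {
  total: number;  // all events in the window
  errors: number; // error events in the window
}

interface DeployVerdict {
  baselineRate: number;
  currentRate: number;
  regressed: boolean;
}

function errorRate(stats: WindowStats): number {
  return stats.total === 0 ? 0 : stats.errors / stats.total;
}

// Flags a regression when the current error rate exceeds the baseline rate
// by more than `tolerance` (an absolute fraction, e.g. 0.02 for two points).
function compareToBaseline(
  baseline: WindowStats,
  current: WindowStats,
  tolerance: number
): DeployVerdict {
  const baselineRate = errorRate(baseline);
  const currentRate = errorRate(current);
  return {
    baselineRate: baselineRate,
    currentRate: currentRate,
    regressed: currentRate > baselineRate + tolerance,
  };
}
```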

Playbooks (os-playbook)

DAG-based multi-step query orchestration. Steps run in parallel where possible with dependency ordering, retries, and result passing.

# Broad incident triage
node $OS_SCRIPTS/dist/os-playbook.js --playbook incident-triage --hours 1

# Service deep dive
node $OS_SCRIPTS/dist/os-playbook.js --playbook service-deep-dive --service deal_structure --hours 2

# Create an OSD dashboard from playbook results
node $OS_SCRIPTS/dist/os-playbook.js --playbook incident-triage --hours 1 --dashboard
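The DAG-based orchestration can be pictured as grouping steps into waves, where each wave contains only steps whose dependencies finished in earlier waves. The TypeScript sketch below computes such waves. It is a simplified illustration (no retries or result passing), `planWaves` and `PlaybookStep` are hypothetical names, and the step dependencies in the usage note are invented for the example.

```typescript
// Hypothetical sketch of dependency-ordered scheduling for playbook steps.
interface PlaybookStep {
  id: string;
  dependsOn: string[];
}

// Returns step ids grouped into waves. Steps inside one wave have no
// dependencies on each other, so they could run in parallel.
function planWaves(steps: PlaybookStep[]): string[][] {
  const done: { [id: string]: boolean } = {};
  let remaining = steps.slice();
  const waves: string[][] = [];

  while (remaining.length > 0) {
    // A step is ready when every dependency completed in an earlier wave.
    const ready = remaining.filter((step) =>
      step.dependsOn.every((dep) => done[dep] === true)
    );
    if (ready.length === 0) {
      throw new Error(
        "Cycle or unknown dependency among: " +
          remaining.map((step) => step.id).join(", ")
      );
    }
    for (const step of ready) done[step.id] = true;
    remaining = remaining.filter((step) => done[step.id] !== true);
    waves.push(ready.map((step) => step.id));
  }
  return waves;
}
```

For example, if a samples step depended on an errors step while patterns, services, and latency had no dependencies, the first wave would hold patterns, errors, services, and latency, and the second wave would hold samples.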

Built-in playbooks

| Playbook | Steps | Description |
| --- | --- | --- |
| incident-triage | 5 | Patterns, errors, services, latency, error samples |
| service-deep-dive | 6 | Count, errors, latency, timeline, 500 samples, histogram |
| post-deploy-validation | 4 | Patterns, errors, latency, services |
| error-investigation | 6 | Patterns, error breakdown, histogram, latency, timeline, samples |

Debugging Decision Tree

1. Something is broken — start broad

node $OS_SCRIPTS/dist/os-search.js --mode report --hours 12

2. Narrow to a service

node $OS_SCRIPTS/dist/os-search.js --mode errors --service <service> --hours 1
node $OS_SCRIPTS/dist/os-search.js --service <service> --level ERROR --hours 1 --full

3. Build a timeline

node $OS_SCRIPTS/dist/os-search.js --service <service> --mode timeline --from "<start>" --to "<end>"
node $OS_SCRIPTS/dist/os-search.js --correlation-id "<id>" --full

4. Check latency

node $OS_SCRIPTS/dist/os-search.js --mode latency --service <service> --hours 4

5. Collect evidence and alert

# Create a dashboard from YAML for the issue
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-investigation

# Set up monitors so it doesn't happen again
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-service-alerts

# Send a health report to the team
node $OS_SCRIPTS/dist/os-alert.js --mode report --hours 1

Use Cases Beyond Debugging

  • Feature readiness: "Create a monitor that tracks error rates for my new endpoint — when can we safely enable the feature flag?"
  • Post-deploy validation: "Watch error rates for 15 minutes after deploy, compare to baseline"
  • SLA monitoring: "Alert when p95 latency exceeds 5s for 3 consecutive checks"
  • Capacity planning: "Dashboard showing request volume trends by service over 7 days"
  • Incident response: "Triage playbook → evidence dashboard → Google Chat alert to the team"
  • Integration health: "Monitor LangFuse/Salesforce/BigQuery integration health with dedicated dashboards"
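The SLA example in the list above ("p95 latency exceeds 5s for 3 consecutive checks") relies on counting consecutive breaches rather than single spikes. A minimal TypeScript sketch of that check follows. `breachedConsecutively` is a hypothetical helper and not something the toolkit is known to expose.

```typescript
// Hypothetical sketch: true when `required` samples in a row exceed the threshold.
function breachedConsecutively(
  samples: number[],
  threshold: number,
  required: number
): boolean {
  let streak = 0;
  for (const sample of samples) {
    // Any sample at or below the threshold resets the streak.
    streak = sample > threshold ? streak + 1 : 0;
    if (streak >= required) return true;
  }
  return false;
}
```

With p95 readings in seconds, `breachedConsecutively([4.2, 5.1, 5.4, 6.0], 5, 3)` is true, while a series where a reading drops back under 5 before the third breach is false.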

Architecture

The skill provides 6 CLI tools backed by a shared repository layer:

os-search     Search, count, histogram, timeline, dashboard, errors, latency, services, report
os-monitor    Post-deploy, incident-watch, spotlight, compare (with --watch for continuous)
os-playbook   DAG-based multi-step queries with parallel execution + dashboard generation
os-alert      YAML-driven OSD monitor management + Google Chat Card v2 alerts + health reports
os-dash       YAML-driven OSD dashboard generator (metrics, charts, markdown — auto-layout)
create-dashboards   Static production health dashboard (hardcoded for Python microservices)

All tools share:

  • OpenSearchRepository: data API (search, alerting, notifications) + dashboards API (saved objects)
  • config.ts: credential loading, index resolution, time range parsing
  • template.ts: {{expression}} resolution with ctx.* pass-through for OSD Mustache

Compatible with OpenSearch 2.x clusters using basic auth. Tested on DigitalOcean Managed OpenSearch and AWS ConveyorCloud (ServiceCatalog) deployments.

Install via CLI
npx skills add https://github.com/dipseth/opensearch-logs --skill opensearch-logs