name: opensearch-logs
description: >-
  OpenSearch observability toolkit: search logs, create dashboards, deploy
  automated monitors with Google Chat alerts. Works with any OpenSearch 2.x
  cluster. Use when building observability for a service, debugging production
  issues, creating monitoring dashboards, or setting up automated alerting.
user_invocable: true
args: Optional flags like "--mode report --hours 6" or "--service deal_structure --level ERROR"
OpenSearch Observability Toolkit
What This Skill Does
A complete observability toolkit for any OpenSearch 2.x cluster. The primary workflow is zero-code: discover what's in your logs, create dashboards and alerts from YAML configs, and get Google Chat notifications when things go wrong.
Core Flow: Discover → Dashboard → Alert
1. DISCOVER   os-search --query '"your-pattern"' --hours 24
              os-search --mode errors --hours 4
              → Understand what's in your logs
2. DASHBOARD  Write a YAML config (metrics, charts, queries)
              os-dash --mode create --config my-dashboard
              → OSD dashboard appears with live visualizations
3. ALERT      Write a YAML config (monitors, thresholds, webhooks)
              os-alert --mode create --profile my-alerts
              → OSD monitors run on schedule, fire Google Chat cards
4. REPORT     os-alert --mode report --hours 1
              → Rich health card sent to Google Chat on demand
No TypeScript required for steps 2-4 — just YAML files and CLI commands.
Setup
1. Install and build
cd .claude/skills/opensearch-logs/scripts && pnpm install && pnpm build
2. Configure credentials
cp .env.example .env # then fill in your values
Credentials are loaded automatically in this order:
- Local .env (in the scripts/ directory), recommended
- Home-dir fallback: ~/.opensearch-env
- Environment variables: OPENSEARCH_HOST, OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD
See .env.example for all available variables.
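The lookup order above can be sketched as a small precedence resolver. This is an illustration only, not the toolkit's config.ts: the function names are hypothetical, and it assumes the first source in the list that defines a key wins.

```typescript
// Hypothetical sketch of credential precedence; not the toolkit's actual code.
type Source = Record<string, string | undefined>;

interface Credentials {
  host: string;
  username: string;
  password: string;
}

// Parse KEY=VALUE lines as found in a .env file; skips blanks and # comment lines.
function parseEnvFile(text: string): Source {
  const out: Source = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    const key = line.slice(0, eq).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const value = line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, "$2");
    out[key] = value;
  }
  return out;
}

// Walk sources in priority order; the first non-empty value wins (assumed rule).
function resolveCredentials(sources: Source[]): Credentials {
  const pick = (key: string): string => {
    for (const src of sources) {
      const v = src[key];
      if (v !== undefined && v !== "") return v;
    }
    throw new Error(`Missing credential: ${key}`);
  };
  return {
    host: pick("OPENSEARCH_HOST"),
    username: pick("OPENSEARCH_USERNAME"),
    password: pick("OPENSEARCH_PASSWORD"),
  };
}
```

A caller would pass the parsed local .env first, then the parsed ~/.opensearch-env, then the process environment. Failing loudly on a missing key keeps a half-configured setup from producing confusing authentication errors later.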
Optional config for non-default clusters (e.g., AWS ConveyorCloud)
| Env Var | Default | Purpose |
|---|---|---|
| OPENSEARCH_DATA_PORT | 25060 | Data API port (AWS OpenSearch uses 443) |
| OPENSEARCH_DASHBOARDS_PORT | 443 | Dashboards API port |
| OPENSEARCH_INDEX_PREFIX | python-services | Index name prefix (override per team) |
| OPENSEARCH_TENANT | global | Security tenant for FGAC-enabled clusters |
| GCHAT_WEBHOOK_URL or PROD_GCHAT_WEBHOOK | (none) | Google Chat webhook for alerts/reports |
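As a rough sketch of how these defaults might be applied, the function below reads the optional variables from an environment map and falls back to the documented values. The function and type names are hypothetical, and the choice to prefer GCHAT_WEBHOOK_URL when both webhook variables are set is an assumption.

```typescript
// Hypothetical sketch of applying the documented defaults; not the toolkit's code.
type EnvVars = Record<string, string | undefined>;

interface ClusterConfig {
  dataPort: number;
  dashboardsPort: number;
  indexPrefix: string;
  tenant: string;
  webhookUrl: string | undefined;
}

// Parse a TCP port, falling back when the variable is unset or blank.
function parsePort(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid port: ${raw}`);
  }
  return port;
}

function resolveClusterConfig(env: EnvVars): ClusterConfig {
  return {
    dataPort: parsePort(env["OPENSEARCH_DATA_PORT"], 25060),
    dashboardsPort: parsePort(env["OPENSEARCH_DASHBOARDS_PORT"], 443),
    indexPrefix: env["OPENSEARCH_INDEX_PREFIX"] || "python-services",
    tenant: env["OPENSEARCH_TENANT"] || "global",
    // Assumption: GCHAT_WEBHOOK_URL wins when both webhook variables are set.
    webhookUrl: env["GCHAT_WEBHOOK_URL"] || env["PROD_GCHAT_WEBHOOK"] || undefined,
  };
}
```

For an AWS-hosted cluster, setting only OPENSEARCH_DATA_PORT=443 would leave every other value at its default.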
3. Set the scripts path
export OS_SCRIPTS=.claude/skills/opensearch-logs/scripts
4. Verify
node $OS_SCRIPTS/dist/os-search.js --mode count --hours 1
YAML-Driven Dashboards (os-dash)
Create OSD saved dashboards from declarative YAML — no code required.
Dashboard YAML schema
name: my-dashboard # Used for OSD saved object ID
title: My Service Dashboard # Displayed in OSD UI
description: Monitoring for...
time_from: now-24h # Default time range
refresh_interval: 60000 # Auto-refresh (ms)
header: | # Optional markdown header panel
  ## My Dashboard
  Tracking key metrics for my service.
metrics: # Metric tiles (auto-layout, 6 per row)
  - title: Total Requests
    query: '"my-service"'
    label: Requests
  - title: Errors
    query: '"my-service" AND level:"ERROR"'
    label: Errors
charts: # Line charts (half or full width)
  - title: Requests Over Time
    width: half # half (default) or full
    series:
      All: '"my-service"'
      Errors: '"my-service" AND level:"ERROR"'
  - title: Errors by Type
    width: half
    series:
      Timeout: '"my-service" AND "timeout"'
      Connection: '"my-service" AND "connection refused"'
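The auto-layout mentioned in the schema comments (six metric tiles per row, half- or full-width charts) could work roughly as follows. This is a sketch under stated assumptions: it assumes a 48-column dashboard grid, the panel heights are arbitrary illustrative values, the optional header panel is ignored, and none of the names come from os-dash itself.

```typescript
// Hypothetical layout sketch; grid width and heights are assumptions.
const GRID_COLUMNS = 48;
const METRICS_PER_ROW = 6;
const METRIC_HEIGHT = 6;
const CHART_HEIGHT = 15;

type ChartWidth = "half" | "full";

interface Panel {
  title: string;
  x: number;
  y: number;
  w: number;
  h: number;
}

function layoutDashboard(
  metricTitles: string[],
  charts: { title: string; width?: ChartWidth }[],
): Panel[] {
  const panels: Panel[] = [];
  const metricWidth = GRID_COLUMNS / METRICS_PER_ROW; // 8 columns per tile

  // Metric tiles fill rows left to right, wrapping after six.
  metricTitles.forEach((title, i) => {
    panels.push({
      title,
      x: (i % METRICS_PER_ROW) * metricWidth,
      y: Math.floor(i / METRICS_PER_ROW) * METRIC_HEIGHT,
      w: metricWidth,
      h: METRIC_HEIGHT,
    });
  });

  // Charts start below the last metric row.
  let y = Math.ceil(metricTitles.length / METRICS_PER_ROW) * METRIC_HEIGHT;
  let x = 0;
  for (const chart of charts) {
    const w = (chart.width ?? "half") === "full" ? GRID_COLUMNS : GRID_COLUMNS / 2;
    if (x + w > GRID_COLUMNS) {
      // Not enough room left on this row: wrap to the next one.
      x = 0;
      y += CHART_HEIGHT;
    }
    panels.push({ title: chart.title, x, y, w, h: CHART_HEIGHT });
    x += w;
    if (x >= GRID_COLUMNS) {
      x = 0;
      y += CHART_HEIGHT;
    }
  }
  return panels;
}
```

With the example config above, the two metric tiles would share the first row and the two half-width charts would sit side by side beneath them.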
Commands
# Create/update a dashboard from YAML
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-dashboard
# Preview what would be created (no OSD calls)
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-dashboard --dry-run
# Delete a dashboard and its visualizations
node $OS_SCRIPTS/dist/os-dash.js --mode delete --config my-dashboard
YAML files live in scripts/dashboards/. The --config flag accepts a name (looked up in that directory) or a path to any YAML file.
Built-in dashboard configs
| Config | Description |
|---|---|
| langfuse-usage | LangFuse LLM tracing: validations, timeouts, API key issues, per-service breakdown |
Automated Alerts (os-alert)
Deploy OSD monitors that run on a schedule and send rich Google Chat Card v2 messages when thresholds are breached.
Alert profile YAML schema
name: my-alerts
description: Monitors for my service
destination:
  name: gchat-my-team # Notification channel name
  type: gchat # Webhook URL from env var
card_templates:
  default: | # Mustache template (OSD renders at alert time)
    { "cardsV2": [{ "cardId": "alert-{{ctx.monitor.name}}", "card": {
      "header": { "title": "{{ctx.trigger.name}}" },
      "sections": [{ "widgets": [
        { "decoratedText": { "topLabel": "Hits", "text": "{{ctx.results.0.hits.total.value}}" } }
      ]}]
    }}]}
monitors:
  error_spike:
    description: "Error rate exceeds threshold"
    schedule: { interval: 5, unit: MINUTES }
    query:
      query_string:
        query: '"my-service" AND level:"ERROR"'
    trigger:
      name: error-spike
      severity: 2
      condition: "ctx.results[0].hits.total.value > {{thresholds.errors}}"
      card_template: default
      throttle: { value: 10, unit: MINUTES }
  # Health check monitor: always fires, sends a summary card
  hourly_health:
    description: "Hourly status with aggregated metrics"
    schedule: { interval: 1, unit: HOURS }
    query:
      bool:
        must:
          - query_string: { query: '"my-service"' }
      aggs: # Aggregations hoisted to top level automatically
        errors:
          filter:
            term: { level.keyword: ERROR }
    trigger:
      name: hourly-health
      severity: 5
      condition: "ctx.results[0].hits.total.value >= 0" # Always fires
      card_template: health_check
      throttle: { value: 55, unit: MINUTES }
thresholds:
  errors: 50 # Pre-flight resolved before sending to OSD
defaults:
  env: production
  dashboard_url: "https://..."
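Each trigger names a card_template that must exist under card_templates. The excerpt above defines only default while hourly_health refers to health_check, so a pre-deploy check along these lines would flag it. This is a hypothetical helper, not part of os-alert, and the interface covers only the fields the check needs.

```typescript
// Hypothetical pre-deploy check; field names follow the YAML schema above.
interface AlertProfile {
  card_templates: Record<string, string>;
  monitors: Record<string, { trigger: { name: string; card_template: string } }>;
}

// Returns "monitor -> template" for every reference that has no definition.
function findMissingCardTemplates(profile: AlertProfile): string[] {
  const missing: string[] = [];
  for (const monitorName of Object.keys(profile.monitors)) {
    const monitor = profile.monitors[monitorName];
    if (!monitor) continue;
    const template = monitor.trigger.card_template;
    const defined = Object.prototype.hasOwnProperty.call(profile.card_templates, template);
    if (!defined) missing.push(`${monitorName} -> ${template}`);
  }
  return missing;
}
```

Running a check like this before calling the OSD API keeps a typo in a template name from surfacing only when an alert fires.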
Template layers
- Pre-flight ({{thresholds.*}}, {{defaults.*}}): resolved by our code before sending to OSD
- OSD Mustache ({{ctx.*}}): resolved by OSD at alert time (monitor name, trigger, results)
- Collision avoidance: {{ctx.*}} expressions are never touched by pre-flight resolution
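The two layers can be illustrated with a minimal resolver that substitutes pre-flight namespaces and passes {{ctx.*}} through verbatim. This is a sketch, not the contents of template.ts: the function name is hypothetical, and throwing on an unknown namespace or key is an assumed design choice.

```typescript
// Hypothetical sketch of pre-flight resolution with ctx.* pass-through.
type Scope = Record<string, Record<string, string | number>>;

function resolvePreflight(template: string, scope: Scope): string {
  // Matches {{namespace.key}}; the key may contain dots (e.g. results.0.hits).
  const pattern = /\{\{\s*([A-Za-z_]\w*)\.([\w.]+)\s*\}\}/g;
  return template.replace(pattern, (match: string, ns: string, key: string): string => {
    if (ns === "ctx") return match; // OSD Mustache layer: leave untouched
    const values = scope[ns];
    if (values === undefined) throw new Error(`Unknown template namespace: ${ns}`);
    const value = values[key];
    if (value === undefined) throw new Error(`Unresolved template variable: ${ns}.${key}`);
    return String(value);
  });
}
```

Applied to the error_spike condition with thresholds.errors set to 50, the pre-flight pass would produce `ctx.results[0].hits.total.value > 50`, while a card body containing {{ctx.trigger.name}} would reach OSD unchanged.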
Commands
# Create all monitors from a profile
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts
# Dry-run (resolve templates, print payloads, don't call API)
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts --dry-run
# Create a single monitor
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-alerts --monitor error_spike
# List deployed monitors
node $OS_SCRIPTS/dist/os-alert.js --mode list
node $OS_SCRIPTS/dist/os-alert.js --mode list --profile my-alerts
# Test: execute a monitor (dry-run on OSD)
node $OS_SCRIPTS/dist/os-alert.js --mode test --profile my-alerts --monitor error_spike
# Test: send a test card to the webhook
node $OS_SCRIPTS/dist/os-alert.js --mode test --profile my-alerts
# Send a live health report card to Google Chat
node $OS_SCRIPTS/dist/os-alert.js --mode report --hours 1
# Delete all monitors for a profile
node $OS_SCRIPTS/dist/os-alert.js --mode delete --profile my-alerts
Built-in alert profiles
| Profile | Monitors | Description |
|---|---|---|
| production-incidents | 10 | 500s, 503s, browser pool, Playwright, connection refused, worker shutdowns, OOM, ReadTimeout, unsupported PDS, hourly health check |
| service-health | 3 | Per-service 5xx errors, high latency, 4xx spikes (use with --service) |
| langfuse-usage | 4 | LangFuse health check, timeout spike, API key missing, no-activity detector |
Log Search (os-search)
Query logs directly for ad-hoc investigation. The nine modes cover search, aggregation, and reporting.
Quick start
# Adaptive report — discovers what's interesting automatically
node $OS_SCRIPTS/dist/os-search.js --mode report --hours 12
# Health dashboard
node $OS_SCRIPTS/dist/os-search.js --mode dashboard --hours 1
# Per-service health table
node $OS_SCRIPTS/dist/os-search.js --mode services --hours 1
# Free-text search (works with any log pattern)
node $OS_SCRIPTS/dist/os-search.js --query '"your-search-term"' --hours 4
# Error breakdown (by service, status code, endpoint)
node $OS_SCRIPTS/dist/os-search.js --mode errors --hours 1
# Latency analysis (p50/p95/p99)
node $OS_SCRIPTS/dist/os-search.js --mode latency --hours 4
# Histogram of events over time
node $OS_SCRIPTS/dist/os-search.js --query '"Connection refused"' --mode histogram --hours 24
# Trace a request by correlation ID
node $OS_SCRIPTS/dist/os-search.js --correlation-id "abc-123-def" --full
Flags
| Flag | Short | Description |
|---|---|---|
| --env | | production (default) or staging |
| --query | -q | Free-text OpenSearch query string |
| --service | -s | Filter by service alias |
| --level | -l | Filter by log level: ERROR, WARNING, INFO |
| --status | | HTTP status code filter: 500, 5xx, 4xx, >=400 |
| --correlation-id | | Trace a correlation ID |
| --hours | | Look back N hours (default: 1) |
| --from / --to | | Explicit time range (ISO 8601) |
| --mode | -m | search, count, histogram, timeline, dashboard, latency, errors, services, report |
| --limit | -n | Max results (default: 20) |
| --interval | | Histogram bucket size (default: 1h) |
| --json | | JSON output |
| --full | | No truncation |
| --link | | Print shareable OSD URL |
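The --status flag accepts several shapes (500, 5xx, 4xx, >=400). One plausible way to turn them into OpenSearch query clauses is sketched below. The function is hypothetical, and status_code is a placeholder field name, since the actual field depends on how your logs are indexed.

```typescript
// Hypothetical sketch: map a --status value to a term or range clause.
type RangeBounds = { gte?: number; gt?: number; lte?: number; lt?: number };
type StatusClause =
  | { term: Record<string, number> }
  | { range: Record<string, RangeBounds> };

function parseStatusFilter(filter: string, field: string = "status_code"): StatusClause {
  const value = filter.trim();

  // Exact code, e.g. "500"
  const exact = /^(\d{3})$/.exec(value);
  if (exact) return { term: { [field]: Number(exact[1]) } };

  // Status family, e.g. "5xx" means 500 <= code < 600
  const family = /^([1-5])xx$/i.exec(value);
  if (family) {
    const base = Number(family[1]) * 100;
    return { range: { [field]: { gte: base, lt: base + 100 } } };
  }

  // Comparison, e.g. ">=400"
  const comparison = /^(>=|<=|>|<)\s*(\d{3})$/.exec(value);
  if (comparison) {
    const n = Number(comparison[2]);
    switch (comparison[1]) {
      case ">=": return { range: { [field]: { gte: n } } };
      case "<=": return { range: { [field]: { lte: n } } };
      case ">": return { range: { [field]: { gt: n } } };
      case "<": return { range: { [field]: { lt: n } } };
    }
  }

  throw new Error(`Unsupported status filter: ${filter}`);
}
```

The resulting clause could be placed in the filter section of a bool query alongside the time range and any --service or --level filters.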
Modes
| Mode | Description |
|---|---|
| search | Return matching log lines (default) |
| count | Count matching documents |
| histogram | Event counts bucketed by time |
| timeline | Chronological events (ascending) |
| dashboard | Health overview: global checks + per-service errors + latency |
| latency | p50/p95/p99 overall, per-service, over time |
| errors | Breakdown by status code, service, endpoint, error type + samples |
| services | Per-service health table |
| report | Adaptive report: discovers what's interesting, only shows relevant findings |
Monitoring (os-monitor)
Pre-built monitoring views for deploys and incidents.
# Post-deploy health check (compares current vs baseline)
node $OS_SCRIPTS/dist/os-monitor.js --mode post-deploy --minutes 15
# Incident watchboard
node $OS_SCRIPTS/dist/os-monitor.js --mode incident-watch --hours 1
# Service spotlight
node $OS_SCRIPTS/dist/os-monitor.js --mode spotlight --service deal_structure --hours 1
# Compare two time windows
node $OS_SCRIPTS/dist/os-monitor.js --mode compare \
--from1 2026-02-25T13:00Z --to1 2026-02-25T17:00Z \
--from2 2026-02-25T17:30Z --to2 2026-02-25T18:00Z
# Continuous monitoring (re-runs every N seconds)
node $OS_SCRIPTS/dist/os-monitor.js --mode post-deploy --watch 60
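When two windows differ in length, as in the compare example above (four hours versus thirty minutes), raw counts are not directly comparable. The sketch below normalizes each window to errors per minute before computing the change. It illustrates the idea only: the names are hypothetical and os-monitor may compute its comparison differently.

```typescript
// Hypothetical sketch: compare error rates across windows of unequal length.
interface WindowStats {
  from: string; // ISO 8601 timestamp
  to: string; // ISO 8601 timestamp
  errorCount: number;
}

function errorsPerMinute(w: WindowStats): number {
  const minutes = (Date.parse(w.to) - Date.parse(w.from)) / 60000;
  // Also rejects unparseable timestamps, which yield NaN.
  if (!(minutes > 0)) throw new Error("Window must have positive duration");
  return w.errorCount / minutes;
}

function compareWindows(
  baseline: WindowStats,
  current: WindowStats,
): { before: number; after: number; changePct: number | null } {
  const before = errorsPerMinute(baseline);
  const after = errorsPerMinute(current);
  // Percentage change is undefined when the baseline rate is zero.
  const changePct = before === 0 ? null : ((after - before) / before) * 100;
  return { before, after, changePct };
}
```

For instance, 120 errors over a four-hour baseline and 30 errors over the following thirty minutes look like an improvement in absolute terms, but the per-minute rate has doubled.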
Playbooks (os-playbook)
DAG-based multi-step query orchestration. Steps run in parallel where possible with dependency ordering, retries, and result passing.
# Broad incident triage
node $OS_SCRIPTS/dist/os-playbook.js --playbook incident-triage --hours 1
# Service deep dive
node $OS_SCRIPTS/dist/os-playbook.js --playbook service-deep-dive --service deal_structure --hours 2
# Create an OSD dashboard from playbook results
node $OS_SCRIPTS/dist/os-playbook.js --playbook incident-triage --hours 1 --dashboard
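The dependency-ordered parallelism described at the top of this section can be pictured as grouping steps into waves, where every step in a wave depends only on steps from earlier waves. The sketch below covers just that scheduling part; retries and result passing are omitted. The Step shape and function name are hypothetical, not the os-playbook API.

```typescript
// Hypothetical sketch: group DAG steps into waves that can run in parallel.
interface Step {
  id: string;
  dependsOn: string[];
}

function planWaves(steps: Step[]): string[][] {
  const ids = steps.map((s) => s.id);
  for (const step of steps) {
    for (const dep of step.dependsOn) {
      if (ids.indexOf(dep) === -1) {
        throw new Error(`Step ${step.id} depends on unknown step ${dep}`);
      }
    }
  }

  const done: string[] = [];
  let pending = steps.slice();
  const waves: string[][] = [];

  while (pending.length > 0) {
    // A step is ready once all of its dependencies have completed.
    const ready = pending.filter((s) => s.dependsOn.every((d) => done.indexOf(d) !== -1));
    if (ready.length === 0) {
      throw new Error("Cycle detected among: " + pending.map((s) => s.id).join(", "));
    }
    const wave = ready.map((s) => s.id);
    waves.push(wave);
    for (const id of wave) done.push(id);
    pending = pending.filter((s) => wave.indexOf(s.id) === -1);
  }
  return waves;
}
```

An executor would then run each wave with something like Promise.all and hand the collected results to the next wave. Detecting cycles up front avoids a playbook that silently never finishes.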
Built-in playbooks
| Playbook | Steps | Description |
|---|---|---|
| incident-triage | 5 | Patterns, errors, services, latency, error samples |
| service-deep-dive | 6 | Count, errors, latency, timeline, 500 samples, histogram |
| post-deploy-validation | 4 | Patterns, errors, latency, services |
| error-investigation | 6 | Patterns, error breakdown, histogram, latency, timeline, samples |
Debugging Decision Tree
1. Something is broken — start broad
node $OS_SCRIPTS/dist/os-search.js --mode report --hours 12
2. Narrow to a service
node $OS_SCRIPTS/dist/os-search.js --mode errors --service <service> --hours 1
node $OS_SCRIPTS/dist/os-search.js --service <service> --level ERROR --hours 1 --full
3. Build a timeline
node $OS_SCRIPTS/dist/os-search.js --service <service> --mode timeline --from "<start>" --to "<end>"
node $OS_SCRIPTS/dist/os-search.js --correlation-id "<id>" --full
4. Check latency
node $OS_SCRIPTS/dist/os-search.js --mode latency --service <service> --hours 4
5. Collect evidence and alert
# Create a dashboard from YAML for the issue
node $OS_SCRIPTS/dist/os-dash.js --mode create --config my-investigation
# Set up monitors so it doesn't happen again
node $OS_SCRIPTS/dist/os-alert.js --mode create --profile my-service-alerts
# Send a health report to the team
node $OS_SCRIPTS/dist/os-alert.js --mode report --hours 1
Use Cases Beyond Debugging
- Feature readiness: "Create a monitor that tracks error rates for my new endpoint — when can we safely enable the feature flag?"
- Post-deploy validation: "Watch error rates for 15 minutes after deploy, compare to baseline"
- SLA monitoring: "Alert when p95 latency exceeds 5s for 3 consecutive checks"
- Capacity planning: "Dashboard showing request volume trends by service over 7 days"
- Incident response: "Triage playbook → evidence dashboard → Google Chat alert to the team"
- Integration health: "Monitor LangFuse/Salesforce/BigQuery integration health with dedicated dashboards"
Architecture
The skill provides 6 CLI tools backed by a shared repository layer:
os-search          Search, count, histogram, timeline, dashboard, errors, latency, services, report
os-monitor         Post-deploy, incident-watch, spotlight, compare (with --watch for continuous)
os-playbook        DAG-based multi-step queries with parallel execution + dashboard generation
os-alert           YAML-driven OSD monitor management + Google Chat Card v2 alerts + health reports
os-dash            YAML-driven OSD dashboard generator (metrics, charts, markdown; auto-layout)
create-dashboards  Static production health dashboard (hardcoded for Python microservices)
All tools share:
- OpenSearchRepository: data API (search, alerting, notifications) + dashboards API (saved objects)
- config.ts: credential loading, index resolution, time range parsing
- template.ts: {{expression}} resolution with ctx.* pass-through for OSD Mustache
Compatible with OpenSearch 2.x clusters using basic auth. Tested on DigitalOcean Managed OpenSearch and AWS ConveyorCloud (ServiceCatalog) deployments.