name: stack-health
description: Check observability stack component health, verify data ingestion, and troubleshoot common issues.
allowed-tools:
  - Bash
  - curl
Stack Health and Troubleshooting
Overview
This skill provides health check commands, data verification queries, and troubleshooting guidance for the observability stack. Use it to verify that OpenSearch, Prometheus, the OTel Collector, and Data Prepper are running correctly, and to diagnose data flow problems.
Credentials are read from the .env file (default: admin / My_password_123!@#). All OpenSearch curl commands use HTTPS with -k to skip TLS certificate verification for local development.
Connection Defaults
| Variable | Default | Description |
|---|---|---|
| OPENSEARCH_ENDPOINT | https://localhost:9200 | OpenSearch base URL |
| OPENSEARCH_USER | admin | OpenSearch username |
| OPENSEARCH_PASSWORD | My_password_123!@# | OpenSearch password |
Health Checks
OpenSearch Cluster Health
Check the overall cluster status (green, yellow, or red):
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" "$OPENSEARCH_ENDPOINT/_cluster/health?pretty"
A healthy cluster returns "status": "green" or "status": "yellow" (yellow is normal for single-node development clusters).
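The green/yellow/red rule above can be turned into a scriptable check. The sketch below is one possible approach, not part of the stack itself: `cluster_status` and `cluster_is_usable` are hypothetical helper names, and the parsing uses `sed` rather than a JSON tool so it has no extra dependencies.

```shell
# Print the "status" value from _cluster/health JSON read on stdin.
# Hypothetical helper; assumes the JSON contains a single "status" field.
cluster_status() {
  tr -d ' \n' | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# Succeed for green or yellow, fail for red or anything unexpected.
cluster_is_usable() {
  status=$(cluster_status)
  case "$status" in
    green|yellow) return 0 ;;
    *) echo "cluster status: ${status:-unknown}" >&2; return 1 ;;
  esac
}

# Example (assumes the variables from Connection Defaults are set):
#   curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
#     "$OPENSEARCH_ENDPOINT/_cluster/health" | cluster_is_usable && echo "cluster OK"
```

Treating yellow as acceptable matches the single-node development case; a production check would likely want to fail on yellow as well.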
Prometheus Health
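Verify Prometheus is running and healthy. PROMETHEUS_ENDPOINT is not listed in Connection Defaults; set it to the Prometheus base URL (Prometheus serves HTTP on port 9090 in this stack, for example http://localhost:9090):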
curl -s "$PROMETHEUS_ENDPOINT/-/healthy"
Returns "Prometheus Server is Healthy." when operational.
OTel Collector Metrics
Check the OpenTelemetry Collector's internal metrics to verify it is receiving and exporting telemetry:
curl -s http://localhost:8888/metrics
Look for otelcol_receiver_accepted_spans_total, otelcol_exporter_sent_spans_total, and otelcol_exporter_send_failed_spans_total in the output to confirm data flow. (OTel Collector metrics use the _total suffix for counters.)
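Each of these counters is usually reported once per label set (for example per receiver and transport), so a single total is easier to compare. The following is a minimal sketch, assuming the standard Prometheus text format with the sample value as the last field on each line; `sum_metric` is a hypothetical helper name.

```shell
# Sum every sample of one metric from Prometheus text-format input on stdin.
# Usage: sum_metric METRIC_NAME
# Assumes no trailing timestamp, so the value is the last field on the line.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP and TYPE comments
    {
      metric = $1
      sub(/\{.*/, "", metric)           # drop the label set, keep the bare name
      if (metric == name) total += $NF  # the sample value is the last field
    }
    END { printf "%d\n", total }
  '
}

# Example:
#   curl -s http://localhost:8888/metrics | sum_metric otelcol_receiver_accepted_spans_total
```

If accepted spans keep growing while sent spans stay flat, the problem is most likely between the collector and its exporters rather than in the application.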
OpenSearch Index Listing
List all indices to verify data ingestion has created the expected trace, log, and service map indices:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" "$OPENSEARCH_ENDPOINT/_cat/indices?v"
You should see indices matching otel-v1-apm-span-*, logs-otel-v1-*, and otel-v2-apm-service-map if data is flowing.
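Scanning the listing by eye is error-prone once there are many indices. As a rough sketch, the check can be automated by feeding index names (one per line) to a small helper that reports which expected patterns have no match; `missing_index_patterns` is a hypothetical name, and the patterns are ordinary shell globs.

```shell
# Read index names (one per line) on stdin and print each expected pattern
# that matches none of them. Patterns are shell globs passed as arguments.
missing_index_patterns() {
  indices=$(cat)
  for pattern in "$@"; do
    found=no
    for index in $indices; do
      # $pattern is intentionally unquoted so that case treats it as a glob.
      case "$index" in
        $pattern) found=yes; break ;;
      esac
    done
    [ "$found" = yes ] || printf '%s\n' "$pattern"
  done
}

# Example (h=index limits the listing to the index name column):
#   curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
#     "$OPENSEARCH_ENDPOINT/_cat/indices?h=index" |
#     missing_index_patterns 'otel-v1-apm-span-*' 'logs-otel-v1-*' 'otel-v2-apm-service-map'
```

Empty output means every expected index family is present; any printed pattern names a signal that has not been ingested yet.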
Data Verification
Trace Document Count
Verify trace data exists by counting documents in the trace index:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | stats count()"}'
Log Document Count
Verify log data exists by counting documents in the log index:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | stats count()"}'
A count of 0 in either query indicates no data has been ingested for that signal. See the Troubleshooting section below.
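To use the zero-count rule in a script, the count has to be pulled out of the PPL response. The sketch below assumes the response carries the result as `"datarows": [[N]]`, which should be confirmed against your OpenSearch version; `ppl_count` and `signal_has_data` are hypothetical helper names.

```shell
# Print the single numeric cell from a PPL "stats count()" response on stdin.
# Assumes the result is returned as "datarows": [[N]].
ppl_count() {
  tr -d ' \n' | sed -n 's/.*"datarows":\[\[\([0-9][0-9]*\)]].*/\1/p'
}

# Succeed only when the response on stdin reports a count above zero.
signal_has_data() {
  count=$(ppl_count)
  [ -n "$count" ] && [ "$count" -gt 0 ]
}

# Example:
#   curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
#     -X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
#     -H 'Content-Type: application/json' \
#     -d '{"query": "source=otel-v1-apm-span-* | stats count()"}' |
#     signal_has_data || echo "no trace data ingested"
```

An error response (for example, when the index pattern matches nothing) contains no `datarows`, so `signal_has_data` fails in that case too.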
Docker Compose Diagnostics
Check Container Status
View the status of all stack containers:
docker compose ps
All services should show Up or Up (healthy). If a service is restarting or exited, check its logs.
View Service Logs
View logs for a specific service:
docker compose logs <service-name>
Data Prepper Logs
Check Data Prepper for pipeline errors or OpenSearch connection issues:
docker compose logs data-prepper
OTel Collector Logs
Check the OTel Collector for receiver, processor, or exporter errors:
docker compose logs otel-collector
Troubleshooting Common Failures
OpenSearch Unreachable
Symptoms: Connection refused on port 9200; curl commands time out or fail.
Diagnostic steps:
- Check if the OpenSearch container is running:
  docker compose ps opensearch
- Verify port 9200 is exposed and listening:
  docker compose ps | grep 9200
- Check the OpenSearch health endpoint directly:
  curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" "$OPENSEARCH_ENDPOINT/_cluster/health?pretty"
- Check OpenSearch container logs for startup errors:
  docker compose logs opensearch
- If the container is restarting, check for memory issues: OpenSearch requires at least 512MB heap. Verify OPENSEARCH_JAVA_OPTS in docker-compose.yml.
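OpenSearch can take a while to accept connections after its container starts, so a single failed curl does not always mean the service is down. A small retry loop, sketched here under the assumption that a few seconds between attempts is acceptable, helps separate "still starting" from "unreachable"; `retry_until` is a hypothetical helper name.

```shell
# Run a command until it succeeds or the attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    [ "$n" -lt "$attempts" ] && sleep "$delay"
    n=$((n + 1))
  done
  return 1
}

# Example: poll the health endpoint up to 30 times, 2 seconds apart.
# -f makes curl exit non-zero on HTTP errors such as 401 or 503.
#   retry_until 30 2 curl -skf -o /dev/null \
#     -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" "$OPENSEARCH_ENDPOINT/_cluster/health"
```

If the loop exhausts its attempts, continue with the container status and log checks listed above.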
No Data in Indices
Symptoms: Index listing shows no otel-v1-apm-* indices, or document counts are 0.
Diagnostic steps:
- Verify the OTel Collector is receiving data by checking its metrics:
  curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans_total
- Check the Data Prepper pipeline for errors:
  docker compose logs data-prepper | grep -i error
- Verify the OTLP endpoint is reachable from your application. The OTel Collector listens on:
  - gRPC: localhost:4317
  - HTTP: localhost:4318
- Send test telemetry and verify it appears:
  curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" "$OPENSEARCH_ENDPOINT/_cat/indices?v"
- Check that Data Prepper can connect to OpenSearch. Look for authentication or TLS errors in the Data Prepper logs.
Data Prepper Pipeline Errors
Symptoms: Data reaches the OTel Collector but does not appear in OpenSearch indices.
Diagnostic steps:
- Check Data Prepper logs for pipeline processing errors:
  docker compose logs data-prepper
- Look for OpenSearch connection failures, authentication errors, or index creation failures in the logs.
- Verify Data Prepper is receiving data from the OTel Collector on port 21890.
- Restart Data Prepper if configuration was changed:
  docker compose restart data-prepper
OTel Collector Export Failures
Symptoms: Applications send telemetry but data does not reach Data Prepper or Prometheus.
Diagnostic steps:
- Check the OTel Collector's internal metrics for export failures:
  curl -s http://localhost:8888/metrics | grep otelcol_exporter_send_failed
- Check OTel Collector logs for exporter errors:
  docker compose logs otel-collector
- Verify the collector can reach Data Prepper (data-prepper:21890) and Prometheus (prometheus:9090) on the Docker network.
- Check for batch processor backpressure or memory limiter drops in the collector metrics.
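A plain grep for failure counters also prints the ones that are still zero, which makes real failures harder to spot. One way to narrow the output, sketched here with a hypothetical helper named `nonzero_samples`, is to keep only samples whose value is above zero. It assumes the value is the last field on each line.

```shell
# Print samples whose metric name starts with PREFIX and whose value is above zero.
# Usage: nonzero_samples PREFIX
nonzero_samples() {
  awk -v prefix="$1" '
    /^#/ { next }                                    # skip HELP and TYPE comments
    index($1, prefix) == 1 && $NF + 0 > 0 { print }  # name starts with prefix, value > 0
  '
}

# Example:
#   curl -s http://localhost:8888/metrics | nonzero_samples otelcol_exporter_send_failed
```

Each printed line keeps its labels, so the exporter label shows which destination is rejecting data.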
Port Reference
| Component | Port | Protocol |
|---|---|---|
| OpenSearch | 9200 | HTTPS |
| OTel Collector (gRPC) | 4317 | gRPC |
| OTel Collector (HTTP) | 4318 | HTTP |
| Data Prepper | 21890 | HTTP |
| Prometheus | 9090 | HTTP |
| OpenSearch Dashboards | 5601 | HTTP |
PPL Diagnostic Commands
Describe Index Mappings
Use the PPL describe command to inspect the field mappings and types of an index. This is useful for verifying which fields are available for querying:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "describe otel-v1-apm-span-*"}'
Explain Query Execution Plan
Use the PPL _explain endpoint to debug query execution plans. This shows how OpenSearch will execute a PPL query without actually running it:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl/_explain" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | head 10"}'
This is useful for diagnosing slow queries, understanding how filters are applied, and verifying that field names resolve correctly.
Dynamic Index Discovery
List All Observability Indices
Discover which observability indices exist and their sizes:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
"$OPENSEARCH_ENDPOINT/_cat/indices/otel-*,logs-otel-*?format=json&h=index,health,docs.count,store.size&s=index"
Get Index Field Mappings
Discover available fields in each index dynamically instead of relying on hardcoded field names:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
"$OPENSEARCH_ENDPOINT/otel-v1-apm-span-*/_mapping?pretty"
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
"$OPENSEARCH_ENDPOINT/logs-otel-v1-*/_mapping?pretty"
PPL Describe for Field Discovery
Use PPL describe to list all fields and types in an index:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "describe otel-v1-apm-span-000001"}'
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "describe logs-otel-v1-000001"}'
References
- PPL Language Reference — Official PPL syntax documentation. Fetch this if queries fail due to OpenSearch version differences or new syntax.
AWS Managed Variants
Amazon OpenSearch Service Health Check
Replace the local endpoint and authentication with AWS SigV4:
curl -s --aws-sigv4 "aws:amz:REGION:es" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
"https://DOMAIN-ID.REGION.es.amazonaws.com/_cluster/health?pretty"
Index listing on AWS managed OpenSearch:
curl -s --aws-sigv4 "aws:amz:REGION:es" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
"https://DOMAIN-ID.REGION.es.amazonaws.com/_cat/indices?v"
- Endpoint format: https://DOMAIN-ID.REGION.es.amazonaws.com
- Auth: --aws-sigv4 "aws:amz:REGION:es" with --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY"
- No -k flag needed: AWS managed endpoints use valid TLS certificates
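The region appears in both the endpoint and the SigV4 string, and a mismatch between the two typically makes the request signature fail. As a small sketch, both values can be derived from the same inputs; `aos_endpoint` and `sigv4_provider` are hypothetical helper names, and the domain ID and region in the example are placeholders.

```shell
# Build the domain endpoint URL from a domain ID and a region.
# Usage: aos_endpoint DOMAIN_ID REGION
aos_endpoint() {
  printf 'https://%s.%s.es.amazonaws.com\n' "$1" "$2"
}

# Build the value passed to curl's --aws-sigv4 option.
# Usage: sigv4_provider REGION SERVICE   (SERVICE is "es" here, "aps" for AMP)
sigv4_provider() {
  printf 'aws:amz:%s:%s\n' "$1" "$2"
}

# Example (placeholder domain ID and region):
#   region=us-east-1
#   curl -s --aws-sigv4 "$(sigv4_provider "$region" es)" \
#     --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
#     "$(aos_endpoint my-domain-abc123 "$region")/_cluster/health?pretty"
```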
Amazon Managed Service for Prometheus Health
Check Prometheus health on Amazon Managed Service for Prometheus (AMP):
curl -s --aws-sigv4 "aws:amz:REGION:aps" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query \
--data-urlencode 'query=up'
- Endpoint format: https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query
- Auth: --aws-sigv4 "aws:amz:REGION:aps" with --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY"
- PromQL query syntax is identical to local Prometheus; only the endpoint and authentication differ