name: elasticsearch
description: "Elasticsearch and OpenSearch cluster operations and troubleshooting: covers cluster health (red/yellow/green), shard allocation failures, slow queries and DSL optimization, index lifecycle management, JVM heap pressure, circuit breakers, snapshot/restore, reindex operations, and node diagnostics."
metadata:
  author: agenticops
  version: "1.0"
  domain: data
Elasticsearch Skill
Quick Decision Trees
Cluster Health Red
- Check cluster health: `GET _cluster/health`
- If `status: red` → unassigned PRIMARY shards exist
- Identify: `GET _cluster/allocation/explain`
  - `NO_VALID_SHARD_COPY` → data node lost, check node status
  - `ALLOCATION_FAILED` → disk full, corrupt shard, incompatible mapping
  - `NODE_LEFT` → node crashed or was removed
- List unassigned shards: `GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state`
- If node lost permanently:
  - Accept data loss: `POST _cluster/reroute` with `allocate_stale_primary` or `allocate_empty_primary`
  - Restore from snapshot if available
- If disk full → free disk space or adjust watermark settings
Escalation path:
Cluster RED
|
+-- All data nodes reachable?
| +-- Yes → check disk watermarks, allocation explain
| +-- No → recover nodes first, check systemd/docker logs
|
+-- Was there a recent deployment?
| +-- Mapping conflict? → check index template + reindex
| +-- Version mismatch? → rolling restart in correct order
|
+-- Snapshot available?
+-- Yes → restore missing indices from snapshot
+-- No → allocate_stale_primary (accepts potential data loss)
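As a rough illustration of the triage above, the sketch below groups unassigned shards by type and reason. It assumes the rows come from `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`; the function names and sample rows are hypothetical, and the colour rule is a simplification that only looks at UNASSIGNED shards.

```python
from collections import defaultdict


def summarize_unassigned(shards):
    """Group unassigned shards by (primary/replica, unassigned reason).

    `shards` is assumed to be the parsed JSON list returned by _cat/shards
    with format=json (an assumption about how the data was fetched).
    """
    summary = defaultdict(list)
    for row in shards:
        if row.get("state") != "UNASSIGNED":
            continue
        kind = "primary" if row.get("prirep") == "p" else "replica"
        label = f'{row["index"]}[{row["shard"]}]'
        summary[(kind, row.get("unassigned.reason"))].append(label)
    return dict(summary)


def cluster_colour(shards):
    """red if any primary is unassigned, yellow if only replicas are, else green."""
    kinds = {kind for kind, _reason in summarize_unassigned(shards)}
    if "primary" in kinds:
        return "red"
    return "yellow" if "replica" in kinds else "green"


# Hypothetical sample rows, not taken from a real cluster.
sample = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED",
     "unassigned.reason": None},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-2", "shard": "1", "prirep": "p", "state": "UNASSIGNED",
     "unassigned.reason": "ALLOCATION_FAILED"},
]

if __name__ == "__main__":
    print(cluster_colour(sample))
    print(summarize_unassigned(sample))
```

Primaries listed under `ALLOCATION_FAILED` or `NODE_LEFT` are the ones to run through `GET _cluster/allocation/explain` first.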
Cluster Health Yellow
- Check: `GET _cluster/health` → `status: yellow` means unassigned REPLICA shards
- Common causes:
  - Single-node cluster → replicas can never allocate (set `number_of_replicas: 0`)
  - Not enough nodes → need at least N+1 nodes for N replicas
  - Disk watermark hit → replicas won't allocate on full nodes
  - Allocation filtering → check `index.routing.allocation.*` settings
- Check: `GET _cat/allocation?v` to see shard distribution per node
- Check: `GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk*`
- If transient after node restart → wait for recovery; monitor with `GET _cat/recovery?v&active_only`
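The N+1 rule can be checked with simple arithmetic, because a replica is never placed on the same node as its primary or another copy of the same shard. The helper names below are hypothetical, and the sketch ignores disk watermarks and allocation filtering.

```python
def max_assignable_replicas(data_nodes: int) -> int:
    """Each copy of a shard needs its own node, so N replicas need N+1 nodes."""
    return max(data_nodes - 1, 0)


def unassignable_replica_copies(data_nodes: int, number_of_replicas: int,
                                number_of_shards: int) -> int:
    """Replica shard copies that can never allocate with this node count."""
    extra = max(number_of_replicas - max_assignable_replicas(data_nodes), 0)
    return extra * number_of_shards


if __name__ == "__main__":
    # Single-node cluster, 5 primaries, 1 replica each: all 5 replicas stay unassigned.
    print(unassignable_replica_copies(data_nodes=1, number_of_replicas=1, number_of_shards=5))
```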
Shard Allocation Failures
- Diagnose: `GET _cluster/allocation/explain`
- Common reasons:
  - `max_retries_exceeded` → `POST _cluster/reroute?retry_failed`
  - `disk_threshold_exceeded` → increase disk or adjust watermarks: `PUT _cluster/settings {"transient": {"cluster.routing.allocation.disk.watermark.low": "85%", "cluster.routing.allocation.disk.watermark.high": "90%", "cluster.routing.allocation.disk.watermark.flood_stage": "95%"}}`
  - `too_many_shards_on_node` → increase `cluster.max_shards_per_node` or reduce shard count
  - `awareness_zone` → rack/zone awareness blocking allocation
- Rebalance stuck: `GET _cat/shards?v&s=state` → check INITIALIZING/RELOCATING count
- Force allocation (dangerous): `POST _cluster/reroute` with allocate commands
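To make the watermark settings concrete, this sketch classifies a node's disk usage against the three thresholds shown above. It handles only percentage-style values (watermarks can also be given as absolute sizes), the function names are hypothetical, and the exact boundary behaviour is an assumption.

```python
def parse_percent(value: str) -> float:
    """Turn a setting such as '85%' into 85.0 (percentage form only)."""
    return float(value.rstrip("%"))


def disk_state(used_percent: float, low: str = "85%", high: str = "90%",
               flood_stage: str = "95%") -> str:
    """Return which disk watermark, if any, a node has reached."""
    if used_percent >= parse_percent(flood_stage):
        return "flood_stage"  # indices with a shard on this node get a write block
    if used_percent >= parse_percent(high):
        return "high"         # shards are relocated away from this node
    if used_percent >= parse_percent(low):
        return "low"          # no new shards are allocated to this node
    return "ok"


if __name__ == "__main__":
    for used in (70, 87, 92, 96):
        print(used, disk_state(used))
```

The `used_percent` input could come from the `disk.percent` column of `GET _cat/allocation?v`.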
Slow Queries / DSL Optimization
- Enable slow log: `PUT my-index/_settings {"index.search.slowlog.threshold.query.warn": "5s", "index.search.slowlog.threshold.query.info": "2s", "index.search.slowlog.threshold.fetch.warn": "1s"}`
- Profile a query: `GET my-index/_search {"profile": true, "query": {"match": {"field": "value"}}}`
- Check query patterns:
  - `wildcard` on `text` fields → use `keyword` sub-field
  - Leading wildcard (`*foo`) → extremely slow, consider ngram tokenizer
  - Deep `nested` queries → flatten if possible
  - Large `terms` arrays → use `terms` lookup from another index
  - `script_score` on every doc → pre-compute and store as field
- Check fielddata usage: `GET _cat/fielddata?v`; high fielddata usually means aggregating or sorting on `text` fields
- Query clause limit (a query-size guard, not a circuit breaker): check `indices.query.bool.max_clause_count`
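One of the patterns above, the leading wildcard, can be spotted mechanically before a query reaches the cluster. The sketch below walks a query DSL body held as a Python dict and reports `wildcard` clauses whose pattern starts with `*` or `?`. The function name is hypothetical, and only the `{"field": "pattern"}` and `{"field": {"value": "pattern"}}` forms are handled.

```python
def find_leading_wildcards(query, path="query"):
    """Return (path, pattern) pairs for wildcard clauses with a leading * or ?."""
    hits = []
    if isinstance(query, dict):
        for key, value in query.items():
            if key == "wildcard" and isinstance(value, dict):
                for field, spec in value.items():
                    # Accept both the short form and the {"value": ...} form.
                    pattern = spec.get("value") if isinstance(spec, dict) else spec
                    if isinstance(pattern, str) and pattern[:1] in ("*", "?"):
                        hits.append((f"{path}.wildcard.{field}", pattern))
            else:
                hits.extend(find_leading_wildcards(value, f"{path}.{key}"))
    elif isinstance(query, list):
        for i, item in enumerate(query):
            hits.extend(find_leading_wildcards(item, f"{path}[{i}]"))
    return hits


# Hypothetical query body for illustration.
example_query = {
    "bool": {
        "must": [
            {"wildcard": {"host": {"value": "*foo"}}},
            {"wildcard": {"user": "bar*"}},
        ]
    }
}

if __name__ == "__main__":
    print(find_leading_wildcards(example_query))
```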
JVM Heap Pressure
- Check: `GET _nodes/stats/jvm`
  - `heap_used_percent` > 75% sustained → investigate
  - `heap_used_percent` > 85% → immediate action needed
- Check GC pressure: `GET _nodes/stats/jvm` → `gc.collectors.old.collection_count`
  - Frequent old GC (> 10/min) → heap too small or too much data in heap
- Common causes:
  - Too many shards → merge small indices, increase shard size
  - Fielddata on text fields → use `keyword` type for aggregations
  - Large aggregations → use `composite` aggregation with pagination
  - Parent-child/nested joins → flatten data model
  - Too many open contexts → check `GET _nodes/stats/indices/search` → `open_contexts`
- Fix strategies:
  - Increase heap (max 50% of RAM, max 31 GB for compressed oops)
  - Reduce shard count (rule of thumb: at most 20 shards per GB of heap)
  - Use `doc_values: true` (default) instead of fielddata
  - Circuit breakers: check `GET _nodes/stats/breaker`
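The heap and GC thresholds above translate into a small check over two samples of `GET _nodes/stats/jvm`. In this sketch the function names and sample payloads are hypothetical; the field paths (`jvm.mem.heap_used_percent`, `jvm.gc.collectors.old.collection_count`) are the ones the steps refer to. A single reading cannot show that usage is sustained, so treat the result as a hint only.

```python
def heap_status(heap_used_percent: float) -> str:
    """Apply the 75% / 85% thresholds to one heap reading."""
    if heap_used_percent > 85:
        return "critical"
    if heap_used_percent > 75:
        return "investigate"
    return "ok"


def old_gc_per_minute(count_before: int, count_after: int, seconds_between: float) -> float:
    """Old-generation collections per minute between two samples."""
    return (count_after - count_before) * 60 / seconds_between


def check_node(stats_before: dict, stats_after: dict, seconds_between: float) -> dict:
    """Compare two jvm sections of the same node taken seconds_between apart."""
    gc_rate = old_gc_per_minute(
        stats_before["gc"]["collectors"]["old"]["collection_count"],
        stats_after["gc"]["collectors"]["old"]["collection_count"],
        seconds_between,
    )
    return {
        "heap": heap_status(stats_after["mem"]["heap_used_percent"]),
        "old_gc_per_min": gc_rate,
        "gc_pressure": gc_rate > 10,
    }


# Hypothetical samples taken two minutes apart.
before = {"mem": {"heap_used_percent": 78}, "gc": {"collectors": {"old": {"collection_count": 100}}}}
after = {"mem": {"heap_used_percent": 88}, "gc": {"collectors": {"old": {"collection_count": 130}}}}

if __name__ == "__main__":
    print(check_node(before, after, seconds_between=120))
```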
Circuit Breakers Tripping
- Check: `GET _nodes/stats/breaker`
- Types:
  - `parent`: total heap usage across all breakers
  - `fielddata`: aggregations on text fields
  - `request`: per-request memory (large aggs, scroll contexts)
  - `inflight_requests`: incoming HTTP request data
- If `parent` trips → overall heap pressure, see JVM section
- If `fielddata` trips → switch text field aggregations to keyword
- Adjust limits (temporary): `PUT _cluster/settings {"transient": {"indices.breaker.total.limit": "85%", "indices.breaker.fielddata.limit": "50%", "indices.breaker.request.limit": "50%"}}`
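A sketch of how the breaker stats might be summarised: it reads the `breakers` section of each node from a `GET _nodes/stats/breaker` response and lists breakers that have tripped or are close to their limit. The 80% warning ratio, the function name, and the sample payload are assumptions for illustration.

```python
def breaker_report(node_stats: dict, warn_ratio: float = 0.8):
    """Return (node, breaker, tripped, usage_ratio) for breakers worth a look."""
    report = []
    for node_id, node in node_stats["nodes"].items():
        for name, breaker in node["breakers"].items():
            limit = breaker["limit_size_in_bytes"]
            ratio = breaker["estimated_size_in_bytes"] / limit if limit else 0.0
            if breaker["tripped"] > 0 or ratio >= warn_ratio:
                report.append((node.get("name", node_id), name,
                               breaker["tripped"], round(ratio, 2)))
    return report


# Hypothetical, heavily trimmed response.
sample_stats = {
    "nodes": {
        "abc123": {
            "name": "es-data-1",
            "breakers": {
                "parent": {"limit_size_in_bytes": 1000, "estimated_size_in_bytes": 900, "tripped": 3},
                "fielddata": {"limit_size_in_bytes": 400, "estimated_size_in_bytes": 100, "tripped": 0},
            },
        }
    }
}

if __name__ == "__main__":
    for row in breaker_report(sample_stats):
        print(row)
```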
Index Lifecycle Management (ILM)
- Check ILM status: `GET _ilm/status`
- Check policy: `GET _ilm/policy/my-policy`
- Check index ILM state: `GET my-index/_ilm/explain`
  - If `step: ERROR` → `GET my-index/_ilm/explain` shows error details
  - Retry: `POST my-index/_ilm/retry`
- Common lifecycle phases:
  - `hot` → active indexing, full resources
  - `warm` → read-only, can shrink/force-merge
  - `cold` → infrequent access, fully mounted searchable snapshots
  - `frozen` → rare access, partially mounted searchable snapshots
  - `delete` → remove after retention period
- Force-move index to next phase: `POST _ilm/move/my-index {"current_step": {"phase": "hot", "action": "complete", "name": "complete"}, "next_step": {"phase": "warm", "action": "shrink", "name": "shrink"}}`
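The explain-then-retry loop can be scripted once several indices share a policy. The sketch below scans a `GET <index>/_ilm/explain` response for managed indices stuck on the `ERROR` step and prints the matching retry request. The function name and sample payload are hypothetical, and the response is trimmed to the fields the sketch reads.

```python
def ilm_errors(explain_response: dict):
    """List managed indices whose current ILM step is ERROR."""
    failed = []
    for name, info in explain_response.get("indices", {}).items():
        if info.get("managed") and info.get("step") == "ERROR":
            failed.append({
                "index": name,
                "phase": info.get("phase"),
                "failed_step": info.get("failed_step"),
                "reason": info.get("step_info", {}).get("reason"),
                "retry": f"POST {name}/_ilm/retry",
            })
    return failed


# Hypothetical, trimmed explain response.
sample_explain = {
    "indices": {
        "logs-000001": {"managed": True, "phase": "hot", "step": "ERROR",
                        "failed_step": "check-rollover-ready",
                        "step_info": {"reason": "rollover alias missing"}},
        "logs-000002": {"managed": True, "phase": "warm", "step": "complete"},
        "adhoc": {"managed": False},
    }
}

if __name__ == "__main__":
    for item in ilm_errors(sample_explain):
        print(item["retry"], "-", item["reason"])
```

Fix the cause named in `reason` before retrying; otherwise the step is likely to fail again.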
Snapshot and Restore
- Check repository: `GET _snapshot/_all`
- Check snapshots: `GET _snapshot/my-repo/_all`
- Create snapshot: `PUT _snapshot/my-repo/snap-2026-02-28?wait_for_completion=true {"indices": "index-*", "ignore_unavailable": true}`
- Restore: `POST _snapshot/my-repo/snap-2026-02-28/_restore {"indices": "index-*", "rename_pattern": "(.+)", "rename_replacement": "restored_$1"}`
- Monitor progress: `GET _snapshot/my-repo/snap-2026-02-28/_status` for a running snapshot; `GET _cat/recovery?v&active_only` for a running restore
- S3 repository setup: `PUT _snapshot/s3-repo {"type": "s3", "settings": {"bucket": "my-es-backups", "region": "us-east-1"}}`
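Before running a restore, it can help to preview which names `rename_pattern` and `rename_replacement` will produce. The sketch below approximates this locally with Python's `re` module: it converts Java-style `$1` group references to Python's `\g<1>` form and supports only a single wildcard expression for `indices`. The function name is hypothetical, and Java and Python regex dialects differ in places, so treat the output as an estimate.

```python
import re
from fnmatch import fnmatchcase


def preview_restore(snapshot_indices, indices, rename_pattern, rename_replacement):
    """Map each selected index name to the name it would get after restore."""
    # Java uses $1 for group references; Python's re.sub expects \g<1>.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", rename_replacement)
    return {
        name: re.sub(rename_pattern, py_replacement, name)
        for name in snapshot_indices
        if fnmatchcase(name, indices)
    }


if __name__ == "__main__":
    names = ["index-1", "index-2", "other"]
    print(preview_restore(names, "index-*", "(.+)", "restored_$1"))
```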
Common Patterns
Node Diagnostics
# Cluster overview
GET _cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,disk.used_percent,node.role
# Hot threads (find CPU-bound operations)
GET _nodes/hot_threads
# Pending tasks
GET _cluster/pending_tasks
# Task management (find long-running tasks)
GET _tasks?actions=*search*&detailed&group_by=parents
Reindex Operations
# Reindex with updated mapping
POST _reindex
{"source": {"index": "old-index"},
"dest": {"index": "new-index"}}
# Reindex with query filter
POST _reindex
{"source": {"index": "old-index", "query": {"range": {"@timestamp": {"gte": "2026-01-01"}}}},
"dest": {"index": "new-index"}}
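# Reindex from remote cluster (in Elasticsearch the remote host must be listed in reindex.remote.whitelist on the destination cluster)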
POST _reindex
{"source": {"remote": {"host": "https://old-cluster:9200"}, "index": "old-index"},
"dest": {"index": "new-index"}}
# Monitor reindex progress
GET _tasks?actions=*reindex*&detailed
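Reindex progress can be estimated from the task status by comparing `created + updated + deleted` with `total`. The sketch below applies that to a `GET _tasks?actions=*reindex*&detailed` response; the function name and sample payload are hypothetical, and the response is trimmed to the fields used.

```python
def reindex_progress(tasks_response: dict) -> dict:
    """Return {task_id: fraction_done} for every task in the response."""
    progress = {}
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            status = task.get("status", {})
            total = status.get("total", 0)
            done = sum(status.get(key, 0) for key in ("created", "updated", "deleted"))
            progress[task_id] = done / total if total else 0.0
    return progress


# Hypothetical, trimmed tasks response.
sample_tasks = {
    "nodes": {
        "node-a": {
            "tasks": {
                "node-a:42": {
                    "action": "indices:data/write/reindex",
                    "status": {"total": 1000, "created": 200, "updated": 50, "deleted": 0},
                }
            }
        }
    }
}

if __name__ == "__main__":
    for task_id, fraction in reindex_progress(sample_tasks).items():
        print(f"{task_id}: {fraction:.0%}")
```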
Template and Mapping Management
# Check index template
GET _index_template/my-template
# Check mapping
GET my-index/_mapping
# Add field to existing mapping (non-breaking)
PUT my-index/_mapping
{"properties": {"new_field": {"type": "keyword"}}}
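# List indices by size; for mapping explosion, compare the field count in GET my-index/_mapping against index.mapping.total_fields.limit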
GET _cat/indices?v&h=index,docs.count,store.size&s=store.size:desc
AWS OpenSearch Service Specifics
Service-level Checks
# Describe domain
aws opensearch describe-domain --domain-name my-domain
# Check domain config
aws opensearch describe-domain-config --domain-name my-domain
# Check service software update
aws opensearch describe-domain --domain-name my-domain \
--query 'DomainStatus.ServiceSoftwareOptions'
# Check cluster health via endpoint
curl -XGET "https://search-my-domain-xxx.us-east-1.es.amazonaws.com/_cluster/health?pretty"
Key CloudWatch Metrics
| Metric | Warning | Critical | Notes |
|---|---|---|---|
| ClusterStatus.red | > 0 | sustained | Unassigned primary shards |
| ClusterStatus.yellow | sustained | - | Unassigned replica shards |
| FreeStorageSpace | < 25% | < 10% | Per-node free space; reported in MiB, so compare against volume size |
| JVMMemoryPressure | > 80% | > 92% | May trigger circuit breakers |
| CPUUtilization | > 80% | > 95% | Per-node CPU |
| MasterCPUUtilization | > 50% | > 80% | Dedicated master node |
| ThreadpoolSearchRejected | > 0 | > 100/5min | Search thread pool full |
| ThreadpoolWriteRejected | > 0 | > 100/5min | Write thread pool full |
| AutomatedSnapshotFailure | > 0 | sustained | Backup failure |
| KibanaHealthyNodes | < expected | 0 | Dashboard availability |
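The percentage-based rows of the table can be encoded as a simple lookup, for example when reviewing metric datapoints outside of CloudWatch alarms. The sketch covers only three metrics, and the names `THRESHOLDS` and `classify` are hypothetical.

```python
# (warning, critical) thresholds in percent, copied from the table above.
THRESHOLDS = {
    "JVMMemoryPressure": (80, 92),
    "CPUUtilization": (80, 95),
    "MasterCPUUtilization": (50, 80),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one datapoint."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify("JVMMemoryPressure", 85))
    print(classify("CPUUtilization", 96))
```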
UltraWarm and Cold Storage
# Migrate index to warm storage
POST _ultrawarm/migration/my-index/_warm
# Check migration status
GET _ultrawarm/migration/my-index/_status
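# Move from warm to cold storage
POST _ultrawarm/migration/my-index/_cold
# Queries span hot and UltraWarm indices transparently; cold indices must be migrated back to warm before they can be searched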
GET my-index/_search
{"query": {"match_all": {}}}