java-observability-metrics - SKILL.md Agent Skill

name: java-observability-metrics
description: Design and implement service metrics (counters, gauges, timers, histograms) with strict label/cardinality rules and SLO-ready signals. Use when adding KPIs/SLOs, building dashboards, or preventing metrics cardinality incidents.
license: MIT
compatibility: JDK 17+ (recommended 21), Prometheus/OpenTelemetry-compatible metrics systems, any framework
metadata:
  owner: platform
  version: "1.0"
  tags: [java, observability, metrics, prometheus, micrometer, cardinality, slo]

In scope

Out of scope

Counter: monotonic count (requests_total, errors_total).
Gauge: instantaneous value that can go up or down (queue_depth, memory_used_bytes).
Timer: duration measurements (request_duration_seconds).
Histogram/Distribution: bucketed latency/size distributions from which percentiles can be estimated.
Summary (if your backend supports it): use carefully; quantiles precomputed per instance cannot be meaningfully combined across instances.
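The difference between a counter and a gauge can be sketched in plain JDK code. This is only an illustration of the semantics, not the API of Micrometer or OpenTelemetry; the class names `MetricTypesSketch`, `Counter`, and `Gauge` are hypothetical.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

// Hypothetical illustration of counter vs. gauge semantics (not a real metrics library).
final class MetricTypesSketch {

    // Counter: monotonic, only ever incremented. The backend derives rates from it.
    static final class Counter {
        private final LongAdder value = new LongAdder();

        void increment() {
            value.increment();
        }

        long count() {
            return value.sum();
        }
    }

    // Gauge: samples the current value from a source each time it is read.
    static final class Gauge {
        private final LongSupplier source;

        Gauge(LongSupplier source) {
            this.source = source;
        }

        long value() {
            return source.getAsLong();
        }
    }
}
```

A counter would be incremented once per request, while a gauge would wrap something like `queue::size` and report whatever the queue holds at scrape time.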

Traffic: http_server_requests_total
Errors: http_server_errors_total or http_server_requests_total{outcome="error"}
Latency: http_server_request_duration_seconds (histogram)
Saturation: thread pool utilization, queue size, DB pool usage

Use snake_case.
Include the base unit as a name suffix where one applies:
- _seconds, _bytes; counters additionally end in _total.
Prefer a consistent prefix/domain:
- http_server_*, db_*, cache_*, mq_*.
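These naming rules are mechanical enough to check in a unit test. The following is a minimal sketch, assuming the four prefixes listed above form the allowed set; the class `MetricNameRules` and its methods are hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical validator for the naming conventions above.
final class MetricNameRules {

    // Lowercase snake_case: starts with a letter, segments separated by single underscores.
    private static final Pattern SNAKE_CASE =
            Pattern.compile("[a-z][a-z0-9]*(_[a-z0-9]+)*");

    // Assumption: the allowed domain prefixes are exactly the examples from the text.
    private static final List<String> PREFIXES =
            List.of("http_server_", "db_", "cache_", "mq_");

    // True if the name is snake_case and starts with a known domain prefix.
    static boolean isValid(String name) {
        return SNAKE_CASE.matcher(name).matches()
                && PREFIXES.stream().anyMatch(name::startsWith);
    }

    // Counters must additionally carry the _total suffix.
    static boolean isValidCounter(String name) {
        return isValid(name) && name.endsWith("_total");
    }
}
```

A check like this could run against the names in the metric catalog so that violations fail the build rather than reach production.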

Prefer histograms for latency:
- enable bucketed distributions suitable for SLOs.
Choose buckets that match SLO thresholds:
- e.g., 50ms, 100ms, 200ms, 500ms, 1s, 2s, 5s.
Avoid per-endpoint custom histograms unless truly needed.
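To see why bucket boundaries should match SLO thresholds, consider a minimal bucketed histogram in plain JDK code. It is a sketch of the idea, not how any particular metrics library stores histograms; `LatencyHistogram` and its methods are hypothetical. The fraction of requests within a threshold is exact only when the threshold is itself a bucket boundary.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical fixed-bucket latency histogram (illustration only).
final class LatencyHistogram {

    private final double[] upperBoundsSeconds;   // ascending bucket upper bounds
    private final AtomicLongArray bucketCounts;  // one extra slot for the overflow (+Inf) bucket

    LatencyHistogram(double... upperBoundsSeconds) {
        this.upperBoundsSeconds = upperBoundsSeconds.clone();
        this.bucketCounts = new AtomicLongArray(upperBoundsSeconds.length + 1);
    }

    // Places the observation in the first bucket whose upper bound is >= the value.
    void record(double seconds) {
        int i = 0;
        while (i < upperBoundsSeconds.length && seconds > upperBoundsSeconds[i]) {
            i++;
        }
        bucketCounts.incrementAndGet(i);
    }

    // Number of observations in buckets whose upper bound is <= boundSeconds.
    long countAtOrBelow(double boundSeconds) {
        long total = 0;
        for (int i = 0; i < upperBoundsSeconds.length; i++) {
            if (upperBoundsSeconds[i] > boundSeconds) {
                break;
            }
            total += bucketCounts.get(i);
        }
        return total;
    }

    long totalCount() {
        long total = 0;
        for (int i = 0; i < bucketCounts.length(); i++) {
            total += bucketCounts.get(i);
        }
        return total;
    }

    // Fraction of observations at or below the bound; 1.0 when nothing was recorded.
    double fractionWithin(double boundSeconds) {
        long total = totalCount();
        return total == 0 ? 1.0 : (double) countAtOrBelow(boundSeconds) / total;
    }
}
```

With the example buckets `new LatencyHistogram(0.05, 0.1, 0.2, 0.5, 1, 2, 5)`, an SLO such as "99% of requests within 200ms" maps directly onto `fractionWithin(0.2)`. A 300ms threshold would fall between boundaries and could only be estimated.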

Define the questions:
- "What is p95 latency for /orders in prod?"
- "What is error rate for dependency X?"
Define the minimal metric set to answer them.
Define label set and enforce boundedness.
Implement instrumentation:
- inbound HTTP
- outbound HTTP clients
- DB calls
- cache
- messaging consumers/producers
Validate in a staging environment:
- confirm label cardinality is bounded
- confirm naming conventions
Add dashboards and alerts:
- SLO burn rate alerts (if used)
- saturation alerts
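The "enforce boundedness" step can be sketched as a small guard placed in front of every label value. This is one possible approach under stated assumptions: the class `BoundedLabel`, the overflow value `"other"`, and the per-label limit are all hypothetical choices, not part of any metrics library.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical guard that caps the number of distinct values a label may take.
final class BoundedLabel {

    // Assumption: values beyond the cap collapse into this single overflow value.
    static final String OVERFLOW = "other";

    private final int maxValues;
    private final Set<String> seen = new HashSet<>();

    BoundedLabel(int maxValues) {
        this.maxValues = maxValues;
    }

    // Returns the raw value if it is already known or there is room for it,
    // otherwise the overflow value. Synchronized so the size check and add are atomic.
    synchronized String sanitize(String raw) {
        if (seen.contains(raw)) {
            return raw;
        }
        if (seen.size() < maxValues) {
            seen.add(raw);
            return raw;
        }
        return OVERFLOW;
    }

    synchronized int distinctValues() {
        return seen.size();
    }
}
```

The trade-off is that the first values seen win the available slots. An explicit allowlist is stricter and more predictable when the valid values are known up front.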

docs/metrics.md (metric catalog + labels + intent)
metrics/metric-catalog.yml (machine-readable inventory)
metrics/cardinality-budget.md
Code changes:
- instrumentation wrappers
- timers/histograms config
- tests for bounded labels
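A test for bounded labels could compare the number of distinct series per metric against a budget. The sketch below assumes series are available as strings of the form `name{label="value",...}`; the class `CardinalityBudgetCheck`, the budget map, and the default budget are hypothetical and would in practice come from the cardinality budget document.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical check of distinct series counts against per-metric budgets.
final class CardinalityBudgetCheck {

    // Counts distinct series per metric name; the name is everything before '{'.
    static Map<String, Integer> seriesPerMetric(Collection<String> series) {
        Map<String, Set<String>> byName = new TreeMap<>();
        for (String s : series) {
            int brace = s.indexOf('{');
            String name = brace < 0 ? s : s.substring(0, brace);
            byName.computeIfAbsent(name, k -> new HashSet<>()).add(s);
        }
        Map<String, Integer> counts = new TreeMap<>();
        byName.forEach((name, set) -> counts.put(name, set.size()));
        return counts;
    }

    // Returns the metric names (sorted) whose series count exceeds their budget.
    static List<String> overBudget(Collection<String> series,
                                   Map<String, Integer> budgets,
                                   int defaultBudget) {
        List<String> violations = new ArrayList<>();
        seriesPerMetric(series).forEach((name, count) -> {
            if (count > budgets.getOrDefault(name, defaultBudget)) {
                violations.add(name);
            }
        });
        return violations;
    }
}
```

Fed with series captured from a staging scrape, a non-empty result from `overBudget` would fail the test and name the offending metrics.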

Symptom: Prometheus/metrics backend memory spikes -> Fix: remove high-cardinality labels, normalize routes.
Symptom: metrics not aggregatable -> Fix: avoid instance-specific dimensions; use consistent naming.
Symptom: percentiles inconsistent -> Fix: use histograms with defined buckets; avoid per-instance summaries.
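The "normalize routes" fix can be sketched as follows: raw request paths carry IDs that make the label unbounded, so identifier-like segments are replaced with a placeholder before the path is used as a label value. This is a minimal sketch; `RouteNormalizer`, the `{id}` placeholder, and the choice to treat only numeric and UUID segments as identifiers are assumptions. Where the framework exposes the matched route template, that is usually the better label value.

```java
import java.util.regex.Pattern;

// Hypothetical normalizer that turns raw paths into bounded route labels.
final class RouteNormalizer {

    private static final Pattern NUMERIC = Pattern.compile("\\d+");
    private static final Pattern UUID = Pattern.compile(
            "(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}");

    // Drops the query string and replaces numeric or UUID segments with {id}.
    static String normalize(String path) {
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the leading slash and any duplicate slashes
            }
            boolean isId = NUMERIC.matcher(segment).matches()
                    || UUID.matcher(segment).matches();
            out.append('/').append(isId ? "{id}" : segment);
        }
        return out.length() == 0 ? "/" : out.toString();
    }
}
```

With this in place, `/orders/123` and `/orders/456` both report under the same label value, so the series count depends on the number of routes rather than the number of orders.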