name: vllm-observability
allowed-tools: Bash, Read, Write, Edit, Grep, Glob
description: |-
Observe production vLLM — /metrics Prometheus surface (V1 engine), SLO-driven alerting on TTFT/ITL/queue/KV/preemption/aborts/corrupted-logits, shipping Grafana dashboards in examples/observability/, OTLP tracing with --otlp-traces-endpoint and --collect-detailed-traces={model,worker,all}, diagnostic rules to triage from /metrics alone — queue-grows + TPOT-stable means capacity, queue-stable + TPOT-grows means context/model, DCGM SM_OCCUPANCY is the real GPU-saturation signal not GPU_UTIL. V1 metric names (kv_cache_usage_perc), gpu_→kv_ rename saga, DCGM-exporter pairing, dashboard-lying pitfalls.
when_to_use: |-
Trigger on "vllm metrics", "vllm observability", "vllm prometheus", "vllm grafana", "/metrics vllm", "vllm SLO", "TTFT alert", "ITL alert", "kv_cache_usage_perc", "num_requests_waiting", "request_queue_time_seconds", "prefix_cache_hits", "num_preemptions", "spec_decode metrics", "--otlp-traces-endpoint", "--collect-detailed-traces", "vllm tracing", "DCGM vllm", "SM_OCCUPANCY", "vllm KEDA", "vllm incident triage", "vllm goodput", "PromQL vllm". Building a dashboard, SLO burn-rate alerts, pairing vLLM with DCGM, diagnosing slow-TTFT / preemption-storm from /metrics alone. Also implicit — "why is TTFT high", "what should I alert on", "why are preemptions", "vllm is slow", "deploy-memo SLO", "audit observability" — question is from data vLLM already emits.
vLLM observability
Target audience: operators running production vLLM on H100/H200 fleets, usually containerized, usually on Kubernetes, on-call for latency and throughput SLOs.
Why this matters
nvidia-smi can show a perfectly healthy GPU while TTFT is 11 seconds. Raw throughput in tok/s can be rising while user-visible P99 TTFT is climbing out of SLO. Every production incident this skill exists to catch shares one structural problem: aggregate numbers and hardware counters lie, and only the vLLM-internal per-request distributions tell the truth.
Two operator-facing outcomes matter:
- Alerting that wakes the right person for the right reason — TTFT/ITL tail, queue depth, preemption rate, corrupted logits.
- Diagnosis from /metrics alone — a small number of metric patterns distinguish "out of capacity" from "stuck scheduler" from "hot long-context outlier" without SSH'ing to the pod.
The core diagnostic rule
When something feels slow, read how queue depth and TPOT/ITL move relative to each other, not either absolute value:
| Queue depth | TPOT / ITL | Most likely cause |
|---|---|---|
| Rising | Stable | Capacity shortage — scale out or increase max-num-seqs |
| Stable | Rising | Context / model-side — long-context request, CUDA graph recompile, prefix-cache miss |
| Rising | Rising | Compounding — usually preemption storm; check num_preemptions rate |
| Stable | Stable, but TTFT high | Scheduler stall — connector (LMCache/NIXL), head-of-line blocking, or engine-core descheduling (ebpf territory) |
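The table above can be sketched as a small lookup. This is a minimal illustration, not part of vLLM: the `Trend` enum, the `trend` helper, its 20% relative-growth tolerance, and the returned strings are all hypothetical names and values chosen for the example.

```python
from enum import Enum


class Trend(Enum):
    STABLE = "stable"
    RISING = "rising"


def trend(earlier: float, later: float, tolerance: float = 0.2) -> Trend:
    """Call a series rising if it grew by more than `tolerance` (relative).

    The 20% default is an illustrative assumption, not a vLLM constant.
    """
    if earlier <= 0:
        return Trend.RISING if later > 0 else Trend.STABLE
    grew = (later - earlier) / earlier > tolerance
    return Trend.RISING if grew else Trend.STABLE


def triage(queue: Trend, itl: Trend, ttft_high: bool = False) -> str:
    """Map the (queue depth, TPOT/ITL) pattern to the most likely cause."""
    if queue is Trend.RISING and itl is Trend.STABLE:
        return "capacity shortage"
    if queue is Trend.STABLE and itl is Trend.RISING:
        return "context / model-side"
    if queue is Trend.RISING and itl is Trend.RISING:
        return "compounding, check preemption rate"
    if ttft_high:
        return "scheduler stall"
    return "no pattern matched"


# Queue depth went from 2 to 10 while ITL barely moved: capacity.
print(triage(trend(2, 10), trend(0.030, 0.031)))
```

In practice the two inputs would come from comparing a recent window against an earlier one for vllm:num_requests_waiting and the ITL quantile; the comparison window is a tuning choice.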
The metric surface in one paragraph
vLLM exposes a Prometheus text-format endpoint at /metrics. All series are prefixed vllm: and carry {model_name, engine} labels. Metrics fall into queue/scheduler state, KV cache pressure, per-request latency histograms (TTFT/ITL/queue/prefill/decode/e2e), throughput counters, and request outcomes (finished_reason=stop|length|abort, plus corrupted_requests for NaN-logit page-worthy events).
Full catalog with types, buckets, labels, and emission file:line anchors in references/metrics-catalog.md. The catalog is V1-first with V0 deltas noted.
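To make the shape of the endpoint concrete, here is a minimal sketch that parses Prometheus text-format samples into (name, labels, value) tuples. The sample payload, its HELP text, and the "demo-model" label value are made up for illustration; the parser is deliberately simplified (it does not handle label values containing `}` or trailing timestamps) and is not a substitute for a real Prometheus client library.

```python
import re

# Hypothetical scrape output; real values and HELP strings will differ.
SAMPLE = '''\
# HELP vllm:num_requests_waiting Number of requests waiting.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo-model",engine="0"} 7.0
vllm:kv_cache_usage_perc{model_name="demo-model",engine="0"} 0.83
'''

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name, colons allowed
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, float_value) for each sample line."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels["engine"], value)
```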
Top signals to alert on
| # | Signal | PromQL sketch | Starter threshold |
|---|---|---|---|
| 1 | P99 TTFT | histogram_quantile(0.99, sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) | Page > 3s interactive, > 10s batch |
| 2 | P99 ITL | same pattern on vllm:inter_token_latency_seconds_bucket | Page > 200ms streaming |
| 3 | Queue wait P99 | vllm:request_queue_time_seconds_bucket | Page > 5s sustained 10m |
| 4 | KV utilization | vllm:kv_cache_usage_perc | Warn > 0.80, page > 0.95 sustained 15m |
| 5 | Preemption rate | rate(vllm:num_preemptions_total[5m]) | Warn any sustained non-zero |
| 6 | Abort fraction | rate(vllm:request_success_total{finished_reason="abort"}[5m]) / rate(vllm:request_success_total[5m]) | Warn > 1%, page > 10% |
| 7 | Corrupted logits | increase(vllm:corrupted_requests_total[5m]) | Page on any > 0 |
| 8 | Prefix-cache hit rate | rate(vllm:prefix_cache_hits_total[5m]) / rate(vllm:prefix_cache_queries_total[5m]) | Warn if WoW drops > 20% |
| 9 | Queue depth (for autoscaling) | vllm:num_requests_waiting | KEDA trigger at 2–10 per replica |
| 10 | XID errors (DCGM side) | DCGM_FI_DEV_XID_ERRORS | Page on any increment |
Full PromQL with multi-window burn-rate templates, SLO calibration notes, and goodput approximation in references/alerting.md.
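To see what the histogram_quantile(0.99, sum by (le) (...)) pattern in the table computes, the sketch below re-implements the two steps offline: sum cumulative buckets across pods, then linearly interpolate inside the bucket that contains the target rank. It mirrors the documented Prometheus behaviour as I understand it (lower bound of the first bucket taken as 0, a quantile landing in +Inf returns the highest finite bound); the bucket boundaries and counts are invented for the example and are not vLLM's real buckets.

```python
import math


def merge_buckets(per_pod):
    """Equivalent of `sum by (le)`: add cumulative counts per upper bound."""
    merged = {}
    for buckets in per_pod:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0.0) + count
    return merged


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets {le: count}."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]          # the +Inf bucket counts everything
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le):
                return prev_le           # beyond the last finite bucket
            share = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * share
        prev_le, prev_count = le, count
    return prev_le


# Invented TTFT buckets (seconds) for two pods.
pod_a = {0.5: 90, 1.0: 99, 5.0: 100, math.inf: 100}
pod_b = {0.5: 10, 1.0: 50, 5.0: 100, math.inf: 100}

print("pod A P99:", histogram_quantile(0.99, pod_a))
print("fleet P99:", histogram_quantile(0.99, merge_buckets([pod_a, pod_b])))
```

Pod A looks healthy on its own, while the fleet-level P99 is dominated by pod B; that gap is why the aggregation step belongs inside the quantile rather than after it.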
Dashboards and stacks
The repo ships three operator-ready Grafana dashboards at examples/observability/:
- prometheus_grafana/grafana.json: 12-panel all-in-one (E2E, TTFT, ITL, KV usage, scheduler, throughput, finish-reason, queue/prefill/decode times, token-length heatmaps)
- dashboards/grafana/performance_statistics.json: 20-panel SRE dashboard (latency P50/P90/P99 over time, TPS streams)
- dashboards/grafana/query_statistics.json: 18-panel product dashboard (per-model volume, token-size distributions)
Plus a working docker-compose.yaml + prometheus.yaml for local trials. Perses YAML equivalents in dashboards/perses/. Pair with DCGM exporter (Grafana dashboard 15117) for hardware-side metrics.
Do not use GPU_UTIL as the saturation signal. It can read 100% even while the GPU is severely starved for work. Use DCGM_FI_PROF_SM_OCCUPANCY. Full DCGM pairing catalog and external-dashboard inventory in references/dashboards.md.
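A minimal sketch of the comparison this implies, assuming the caller has already normalised both readings to fractions in [0, 1] (check the units your exporter reports). The function name, the verdict strings, and both thresholds are hypothetical starting points, not DCGM or vLLM constants.

```python
def gpu_saturation_verdict(gpu_util, sm_occupancy,
                           util_floor=0.9, occupancy_floor=0.3):
    """Classify a GPU from utilisation and SM occupancy, both as fractions.

    Thresholds are illustrative assumptions; calibrate them per model
    and per hardware generation.
    """
    if gpu_util >= util_floor and sm_occupancy < occupancy_floor:
        return "starved"      # util pinned high, SMs mostly idle
    if sm_occupancy >= occupancy_floor:
        return "busy"         # SMs are actually occupied
    return "idle"


# Util pinned at 100% with 18% SM occupancy reads as starved, not saturated.
print(gpu_saturation_verdict(1.0, 0.18))
```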
Tracing
vllm serve $MODEL \
--otlp-traces-endpoint=grpc://otel-collector:4317 \
--collect-detailed-traces=all # or: model, or: worker — expensive, use per-incident
Without --collect-detailed-traces, spans are still emitted, but the two most useful per-step metrics (model_forward_time_milliseconds, model_execute_time_milliseconds) are missing. The flag is designed to be enabled during an incident, not as a baseline; the docs explicitly warn about its performance impact.
The protocol defaults to gRPC; switch to HTTP/protobuf with OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf. All OTel packages are bundled with vLLM. Full stack choices (Jaeger all-in-one, OTel Collector → Tempo → Grafana, Langfuse), span catalog, and sampling patterns in references/tracing.md.
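One way to keep detailed tracing off by default is to assemble the launch command in code and add the flag only during an incident. The sketch below uses only the two flags shown above; the helper name `vllm_serve_argv`, the "demo-model" placeholder, and the collector address are assumptions for illustration.

```python
import shlex

DETAIL_LEVELS = ("model", "worker", "all")


def vllm_serve_argv(model, otlp_endpoint, detailed=None):
    """Build a `vllm serve` argv; `detailed` stays None outside incidents."""
    if detailed is not None and detailed not in DETAIL_LEVELS:
        raise ValueError(f"detailed must be one of {DETAIL_LEVELS} or None")
    argv = ["vllm", "serve", model, f"--otlp-traces-endpoint={otlp_endpoint}"]
    if detailed is not None:
        # Expensive: enable per incident, then roll back.
        argv.append(f"--collect-detailed-traces={detailed}")
    return argv


baseline = vllm_serve_argv("demo-model", "grpc://otel-collector:4317")
incident = vllm_serve_argv("demo-model", "grpc://otel-collector:4317", detailed="all")
print(shlex.join(baseline))
print(shlex.join(incident))
```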
Critical pitfalls
- Alerting on averages. sum/count hides P99 tails that are 10–50× the mean. Every latency alert must go through histogram_quantile(0.99, …).
- Forgetting sum by (le) before histogram_quantile. Without it, per-instance quantiles mix with fleet quantiles; this is the most common Grafana mistake in the Prometheus world.
- gpu_cache_usage_perc vs kv_cache_usage_perc. The new name shipped first; PR #24245 (merged 2025-09-16) then hid the old gpu_* counterparts behind --show-hidden-metrics-for-version=X.Y. The attempted revert #25392 was closed without merging (2025-09-23), so the hiding stuck: current main emits only kv_cache_usage_perc by default. Dashboards scraping pre-#24245 tags still see both; greenfield dashboards should use the new name only.
- num_requests_swapped is deprecated on V1 and always zero. Use num_preemptions_total instead. Many copy-pasted dashboards still reference swap.
- Multi-pod label collisions. Every pod emits identical {model_name, engine} labels. Without a Prometheus relabel adding pod/replica, counters sum across pods and hide per-replica pathology.
- Cardinality explosion. Never add request_id or the prompt text as a Prometheus label; that path is deliberately absent. Per-request visibility lives in OTLP traces, not metrics.
- KEDA threshold too low on num_requests_waiting. Thresholds of 1–2 per replica cause scale thrashing. Production Stack default is 5; OpenShift example is 2. Pair with cooldownPeriod: 360, since GPU pods take ~10 min to reach ready and reactive scaling fails.
- GPU_UTIL at 100% ≠ busy GPU. The ebpfchirp "11-second TTFT" incident is canonical: util pinned high, SM occupancy was 18%, the scheduler was stalled on prefix-cache head-of-line blocking. Watch SM_OCCUPANCY.
- Ray Serve deployments don't auto-expose /metrics. RayPrometheusStatLogger must be wired explicitly, or Ray 2.51+ ingests vLLM metrics through Ray's own endpoint (disable with log_engine_metrics: False to avoid double-scraping).
- --collect-detailed-traces as baseline. 5–10% overhead. Toggle per-incident; leave unset by default.
Full troubleshooting matrix (dashboard-empty, metric-gone-after-upgrade, P99 NaN, histogram buckets miscalibrated for SLO) in references/alerting.md under the "When metrics lie" section.
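The first pitfall is easy to reproduce offline. The sketch below builds a synthetic latency sample (the numbers are invented, not measured) and compares its mean against a nearest-rank P99; the `nearest_rank_percentile` helper is written for this example only.

```python
import math
import statistics


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic TTFT sample: 98% of requests at 0.2 s, 2% at 8 s.
latencies = [0.2] * 980 + [8.0] * 20

mean = statistics.fmean(latencies)
p99 = nearest_rank_percentile(latencies, 99)

print(f"mean={mean:.3f}s p99={p99:.1f}s ratio={p99 / mean:.1f}x")
```

An alert on the mean with a 3 s threshold would stay silent on this sample, while a P99 alert with the same threshold would fire.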
Verify a deployment can be observed
# Basic reachability
curl -fsS http://<endpoint>/health
curl -fsS http://<endpoint>/metrics | head -30
# Confirm the load-bearing series exist
curl -s http://<endpoint>/metrics | grep -E '^vllm:(kv_cache_usage_perc|num_requests_(waiting|running)|time_to_first_token|request_success|num_preemptions|prefix_cache_(hits|queries))'
${CLAUDE_SKILL_DIR}/scripts/metrics-smoke.sh runs the full smoke check against a deployment: confirms endpoints, greps load-bearing series, warns on deprecated metric names, cross-checks DCGM availability if configured. Output is color-coded pass/warn/fail.
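The same presence check as the grep above can be sketched in Python for use in CI or a notebook. This is an independent illustration, not the contents of metrics-smoke.sh; the `missing_series` helper and the sample scrape text are hypothetical, and the prefix list simply mirrors the grep pattern.

```python
# Prefixes mirrored from the grep pattern above.
REQUIRED_PREFIXES = (
    "vllm:kv_cache_usage_perc",
    "vllm:num_requests_waiting",
    "vllm:num_requests_running",
    "vllm:time_to_first_token",
    "vllm:request_success",
    "vllm:num_preemptions",
    "vllm:prefix_cache_hits",
    "vllm:prefix_cache_queries",
)


def missing_series(metrics_text):
    """Return required prefixes that no sample line in the scrape starts with."""
    names = set()
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue
        # Metric name ends at the first "{" or space.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return [
        prefix for prefix in REQUIRED_PREFIXES
        if not any(name.startswith(prefix) for name in names)
    ]


# Made-up partial scrape: only two of the load-bearing series are present.
scrape = (
    'vllm:kv_cache_usage_perc{model_name="demo-model",engine="0"} 0.41\n'
    'vllm:num_requests_waiting{model_name="demo-model",engine="0"} 0.0\n'
)
print(missing_series(scrape))
```

Feed it the body returned by the curl command above; an empty list means every load-bearing series was found.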
Version notes
- V1 engine is default as of late 2025. V0 metrics hidden unless --show-hidden-metrics-for-version=X.Y.
- Metric rename saga: vllm:gpu_cache_usage_perc → vllm:kv_cache_usage_perc. PR #24245 (merged 2025-09-16) hid the deprecated gpu_* names behind --show-hidden-metrics-for-version; the proposed revert PR #25392 was closed without merging (2025-09-23), so the hiding stuck. Current main emits only kv_cache_usage_perc by default.
- Deprecated on V1: num_requests_swapped, cpu_cache_usage_perc, cpu_prefix_cache_hit_rate, time_per_output_token_seconds (replaced by inter_token_latency_seconds), the model_forward_time_milliseconds / model_execute_time_milliseconds pair (now behind --collect-detailed-traces).
- New in V1: num_requests_waiting_by_reason{reason=capacity|deferred}, engine_sleep_state, prompt_tokens_by_source{source=local_compute|local_cache_hit|external_kv_transfer}, per-position spec-decode acceptance counters.
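A dashboard or alert file can be scanned for the deprecated names listed above before an upgrade. The sketch below is a plain substring check; the `find_deprecated` helper and the sample dashboard snippet are hypothetical, and the replacement mapping contains only the pairs this document states.

```python
# Deprecated name -> replacement named in this document (None if none is given).
DEPRECATED = {
    "gpu_cache_usage_perc": "kv_cache_usage_perc",
    "num_requests_swapped": "num_preemptions_total",
    "time_per_output_token_seconds": "inter_token_latency_seconds",
    "cpu_cache_usage_perc": None,
    "cpu_prefix_cache_hit_rate": None,
}


def find_deprecated(dashboard_text):
    """Return {deprecated_name: replacement_or_None} for names found in the text."""
    return {
        name: replacement
        for name, replacement in DEPRECATED.items()
        if f"vllm:{name}" in dashboard_text
    }


# Made-up fragment of a dashboard definition.
dashboard = '''
{"expr": "vllm:gpu_cache_usage_perc{model_name=\\"demo-model\\"}"}
{"expr": "rate(vllm:time_per_output_token_seconds_bucket[5m])"}
{"expr": "vllm:kv_cache_usage_perc"}
'''

for old, new in find_deprecated(dashboard).items():
    print(f"vllm:{old} -> " + (f"vllm:{new}" if new else "no direct replacement listed"))
```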
External references
- Metrics design doc (canonical): https://github.com/vllm-project/vllm/blob/main/docs/design/metrics.md
- Metrics source of truth: https://github.com/vllm-project/vllm/tree/main/vllm/v1/metrics
- Example dashboards: https://github.com/vllm-project/vllm/tree/main/examples/observability
- Production metrics docs: https://docs.vllm.ai/en/stable/usage/metrics/
- OTel POC: https://docs.vllm.ai/en/latest/examples/online_serving/opentelemetry/
- Blog — Anatomy of vLLM (defines goodput, scheduler scoring): https://vllm.ai/blog/anatomy-of-vllm
- Blog — Large-Scale Serving (DeepSeek @ 2.2k tok/s/H200): https://blog.vllm.ai/2025/12/17/large-scale-serving.html
- Blog — MorIIO disagg (bimodal-ITL, goodput framing): https://vllm.ai/blog/moriio-kv-connector
- ebpfchirp — 11-Second TTFT on a Healthy Server (canonical incident): https://ebpfchirp.substack.com/p/11-second-time-to-first-token-on
- akrisanov.com — vLLM Metrics in Production (concrete alert set): https://akrisanov.com/vllm-metrics/
- DCGM exporter + Grafana dashboard 15117: https://grafana.com/grafana/dashboards/15117-nvidia-dcgm-exporter/
- Sibling skills: vllm-caching (KV tiering), vllm-benchmarking (bench + output JSON), vllm-configuration (env vars + YAML)