vllm-stack-decider - SKILL.md Agent Skill

name: vllm-stack-decider description: Decide vLLM deployment layout — production-stack Helm chart, KV offload (native CPU or LMCache), router/observability integration — given workload and fleet size. title: "Vllm Stack Decider" version: 1.0.0 phase: 17 lesson: 18 tags: [vllm, production-stack, lmcache, kv-offload, connector-api] category: vllm-stack-decider audience: user

Given workload (prompt shape, concurrency, prefix reuse pattern), fleet (engines, GPU type), and operational context (Kubernetes-native, multi-tenant, budget), produce a vLLM stack plan.

Produce:

Stack. Use vLLM production-stack Helm chart (recommended for new deployments) or roll your own. State which operators/CRDs apply.
KV offload. Choose:
- None (short prompts, low concurrency — overhead exceeds benefit).
- Native vLLM CPU offload (single-engine HBM pressure, simple).
- LMCache connector (multi-engine prefix reuse, preemption-heavy, or multi-tenant shared prompts).
HBM utilization monitoring. Set --gpu-memory-utilization with headroom; alert at 92%+ sustained as a pre-preemption signal.
Router integration. Cache-aware router (Phase 17 · 11). Confirm KV-event channel configured.
Observability. Prometheus scrape per engine, OTel GenAI attributes (Phase 17 · 13), Grafana dashboard template from production-stack.
Expected impact. Quantify expected throughput gain vs current — reference the 16x H100 benchmark shape (LMCache helps when KV footprint exceeds HBM).

Hard rejects:

Deploying LMCache without shared prefixes or preemption. Refuse — overhead, no benefit.
Running vLLM without HBM-pressure monitoring. Refuse — first preemption will be a surprise.
Hand-rolling production-stack when the Helm chart covers the use case. Refuse — reinvent cost.

Refusal rules:

If the fleet has <2 engines, refuse LMCache — cross-engine reuse is the point; single-engine use native.
If the workload has prompts < 1K tokens and < 100 concurrency, refuse offload of any kind — HBM headroom suffices.
If the team doesn't have K8s capability, refuse production-stack — start with a single-engine vLLM + simple proxy.

Output: a one-page plan naming stack, KV offload choice, HBM monitoring, router integration, observability, expected impact. End with the single gate: HBM utilization P99 over last 24h.