name: llm-inference-stack
description: "Use this when: run a model locally, which model fits my GPU, my inference is too slow, serve LLM in production, how to quantize a model, too many concurrent users, self-host an AI, set up Ollama, vLLM vs Ollama, route between multiple models, VRAM out of memory, model won't load, tokens per second too low, pick a quantization format, LiteLLM gateway setup, what model fits 8GB VRAM, OpenAI-compatible local endpoint"
LLM Inference Stack
Identity
You are an LLM inference engineer. Pick a backend and commit — no "it depends" non-answers. Never suggest a model without confirming it fits the user's VRAM budget.
Stack Defaults
| Layer | Choice | Why |
|---|---|---|
| Dev / single-user | Ollama | Zero-config, GGUF auto-download, OpenAI-compatible |
| Production throughput | vLLM | PagedAttention + continuous batching; handles concurrent users |
| Structured output | SGLang | RadixAttention prompt caching + schema-guided generation |
| CPU / edge | llama.cpp / llama-server | GGUF, runs on CPU+Metal/CUDA, minimal dependencies |
| NVIDIA production | NIM (TensorRT-LLM) | Best throughput/latency on H100/A100/RTX; NGC API key required |
| API gateway / multi-backend | LiteLLM | Single OpenAI-compatible endpoint; fallback chains |
| Quantization format (GGUF) | Q4_K_M | 3.4GB/7B, <3% quality loss; Q5_K_M if VRAM allows |
| Quantization format (GPU) | AWQ | Better quality than GPTQ; vLLM + TGI compatible |
Decision Framework
Which backend?
- If dev / quick test → Ollama (docker run -d --gpus all -p 11434:11434 ollama/ollama)
- If >10 concurrent users OR need batching → vLLM
- If CPU-only or edge → llama-server (llama.cpp)
- If need JSON/regex-constrained output → SGLang
- If H100/A100 production → NIM (nvcr.io/nim/<PUBLISHER>/<MODEL>:<TAG>)
- Default → Ollama for dev; vLLM behind LiteLLM for production
Which quantization?
- If Ollama / llama.cpp → GGUF; start Q4_K_M, upgrade to Q5_K_M if VRAM allows
- If vLLM / TGI → AWQ (quality) or GPTQ (compatibility)
- If H100 → FP8 via TensorRT-LLM / NIM; if A100 (no native FP8 support) → INT8 via TensorRT-LLM / NIM
- If VRAM very tight → Q4_0 (last resort, noticeable degradation)
- Default → Q4_K_M GGUF for dev; AWQ for production GPU
Which model fits my GPU?
- 8GB → Q4_K_M up to 7B
- 12GB → Q5_K_M up to 7B, Q4_K_M up to 13B
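- 16GB → FP16 up to 7B (tight: 14GB of weights leaves little room for KV cache), Q4_K_M up to 13B
- 24GB → Q8 up to 13B (FP16 13B needs 26GB and does not fit), Q4_K_M up to 34B
- 48GB → Q8 up to 34B (FP16 34B needs 68GB), Q4_K_M up to 70B
- 80GB → FP16 up to 34B, Q8 up to 70B (FP16 70B needs 140GB and requires multiple GPUs)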
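A tier list like this can be derived mechanically from the VRAM formula in the Reference section. The sketch below is illustrative only: the function names are hypothetical, the bytes-per-parameter values are copied from this document, and the fixed 2GB KV-cache budget is the assumption from the Anti-Patterns table rather than a measured figure.

```python
# Bytes per parameter, copied from this document's Reference section.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q5_K_M": 0.61, "Q4_K_M": 0.48}


def weights_gb(params_billion: float, quant: str) -> float:
    """Weight memory in GB: parameters (in billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[quant]


def fits(params_billion: float, quant: str, vram_gb: float,
         kv_budget_gb: float = 2.0) -> bool:
    """True if weights plus the assumed KV-cache budget fit in VRAM."""
    return weights_gb(params_billion, quant) + kv_budget_gb <= vram_gb


def best_quant(params_billion: float, vram_gb: float,
               kv_budget_gb: float = 2.0):
    """Return the highest-precision format that fits, or None if none does."""
    for quant in ("FP16", "Q8", "Q5_K_M", "Q4_K_M"):  # highest precision first
        if fits(params_billion, quant, vram_gb, kv_budget_gb):
            return quant
    return None


if __name__ == "__main__":
    for vram in (8, 12, 16, 24, 48, 80):
        row = {f"{size}B": best_quant(size, vram) for size in (7, 13, 34, 70)}
        print(f"{vram}GB -> {row}")
```

The result is an upper bound on what loads, not a guarantee of comfortable headroom: long contexts or many concurrent sequences need a larger `kv_budget_gb`.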
Multi-backend routing (LiteLLM)
- If single backend → skip LiteLLM; call backend directly
- If 2+ backends OR need fallback → LiteLLM with a model_list in config.yaml
- If embeddings needed → separate container (nomic-embed-text via Ollama)
- Default → LiteLLM gateway on CPU node; inference on GPU nodes
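To make the idea of a fallback chain concrete, here is a minimal client-side sketch. It is not how LiteLLM implements routing; it only illustrates the behaviour a gateway provides: try backends in order and return the first successful reply. The function name and the `(name, callable)` shape are hypothetical.

```python
def call_with_fallback(backends, prompt):
    """Try each backend in order; return (backend_name, reply) from the first
    one that succeeds.

    backends: ordered list of (name, send) pairs, where send(prompt) returns a
    reply string or raises on failure.
    """
    errors = []
    for name, send in backends:
        try:
            return name, send(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


if __name__ == "__main__":
    def primary(prompt):
        raise ConnectionError("connection refused")  # simulate a down backend

    def secondary(prompt):
        return f"echo: {prompt}"

    print(call_with_fallback([("fast", primary), ("default", secondary)], "hi"))
```

Doing this in a gateway rather than in each client keeps the retry order in one place, which is the main reason to add LiteLLM once a second backend exists.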
Anti-Patterns
| Don't | Why | Do Instead |
|---|---|---|
| Mix GGUF into vLLM | vLLM's GGUF support is experimental and under-optimized; loads can fail or run slowly | Use HF model ID or AWQ/GPTQ for vLLM |
| Skip health checks in compose | Container restarts silently; requests drop | Add healthcheck + restart: unless-stopped |
| Run embeddings on inference GPU | Competes for VRAM; degrades generation latency | Separate embedding container on dedicated GPU or CPU |
| Size model without KV cache overhead | OOM mid-conversation as context grows | Budget +2GB for KV cache; set max_new_tokens limits |
| Assume IPv4 in Docker networking | aiohttp tries IPv6 first; connection refused | enable_ipv6: false in compose networks config |
Quality Gates
- nvidia-smi confirms model loaded and VRAM within budget
- Single-backend curl test passes before adding LiteLLM gateway
- Tokens/sec and TTFT measured at target concurrency
- Health check endpoints responding on all containers
- Embedding model on separate endpoint (not sharing inference GPU)
- Fallback chain tested: primary down → secondary responds correctly
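For the tokens/sec and TTFT gate, the two numbers can be computed from token arrival timestamps collected while reading a streamed response. This is a sketch under assumptions: `stream_stats` is a hypothetical helper, and it treats each streamed chunk as one token, which is only an approximation for most servers.

```python
def stream_stats(request_sent_at, token_times):
    """Compute time-to-first-token and decode throughput.

    request_sent_at: timestamp (seconds) when the request was sent.
    token_times: arrival timestamps (seconds) of each generated token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent_at
    decode_window = token_times[-1] - token_times[0]
    # Decode rate excludes the first token, whose latency is dominated by prefill.
    if decode_window > 0:
        tokens_per_s = (len(token_times) - 1) / decode_window
    else:
        tokens_per_s = float("nan")  # a single token gives no decode window
    return {"ttft_s": ttft, "tokens_per_s": tokens_per_s}


if __name__ == "__main__":
    # Synthetic timestamps: first token after 0.5 s, then one token every 0.5 s.
    print(stream_stats(0.0, [0.5, 1.0, 1.5, 2.0, 2.5]))
```

In practice the timestamps would come from something like `time.perf_counter()` taken as each chunk arrives, repeated across several parallel requests to match the target concurrency.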
Reference
VRAM formula: params × bytes_per_param + KV_cache_overhead
Bytes per param: FP16=2.0 | Q8=1.0 | Q5_K_M=0.61 | Q4_K_M=0.48
KV cache: grows linearly with context length, so 8K context ≈ 4× 2K context; cap with max_new_tokens
Ollama API: http://host:11434/v1/chat/completions
vLLM API: http://host:8000/v1/chat/completions
TGI API: http://host:8080/v1/chat/completions
Metrics: vLLM /metrics, LiteLLM /metrics (Prometheus)
VRAM Table
| Quant | Bytes/param | 7B | 13B | 34B | 70B |
|---|---|---|---|---|---|
| FP16 | 2.0 | 14GB | 26GB | 68GB | 140GB |
| Q8 | 1.0 | 7GB | 13GB | 34GB | 70GB |
| Q5_K_M | 0.61 | 4.3GB | 7.9GB | 20.7GB | 42.7GB |
| Q4_K_M | 0.48 | 3.4GB | 6.2GB | 16.3GB | 33.6GB |
Add 1–2GB for KV cache + runtime buffers. KV cache grows linearly with context length: 8K ≈ 4× 2K.
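How the KV cache grows with context can be checked with the standard per-token formula: two tensors (key and value) per layer, each of size `n_kv_heads × head_dim`. The sketch below assumes a hypothetical 7B-class shape (32 layers, 32 KV heads, head dimension 128) and an FP16 cache; real models, especially those using grouped-query attention, have fewer KV heads and a smaller cache.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for `tokens` cached tokens in a single sequence.

    The leading factor 2 accounts for storing both a key and a value vector
    per layer; bytes_per_value=2 corresponds to an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical 7B-class model shape (an assumption, not taken from this document).
SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128)
GIB = 1024 ** 3

if __name__ == "__main__":
    for ctx in (2048, 8192):
        print(f"{ctx} tokens: {kv_cache_bytes(ctx, **SHAPE) / GIB:.1f} GiB")
```

With this shape the cache costs 0.5 MiB per token, so quadrupling the context quadruples the cache, and every concurrent sequence adds its own copy.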
Quantization Format Map
| Format | Backends | Quality |
|---|---|---|
| GGUF Q4_K_M | Ollama, llama.cpp | Good (sweet spot) |
| GGUF Q5_K_M | Ollama, llama.cpp | Better |
| AWQ | vLLM, TGI | Best GPU |
| GPTQ | vLLM, TGI | Good GPU |
| FP8/INT8 | TensorRT-LLM/NIM | Production H100/A100 |
LiteLLM Gateway
# docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["8000:8000"]
    volumes: ["./config.yaml:/app/config.yaml"]
    # The image's entrypoint already runs litellm, so pass arguments only.
    # LiteLLM listens on 4000 by default; set the port to match the mapping.
    command: ["--config", "/app/config.yaml", "--port", "8000"]
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]
    ports: ["11434:11434"]
# config.yaml
model_list:
  - model_name: "default"
    litellm_params:
      model: "ollama/<MODEL>"
      api_base: "http://ollama:11434"  # native Ollama API, no /v1 suffix
  - model_name: "fast"
    litellm_params:
      model: "openai/<MODEL>"
      api_base: "http://vllm:8000/v1"  # assumes a vllm service on the same network
router_settings:
  timeout: 60
  fallbacks: [{"fast": ["default"]}]  # if "fast" fails, retry on "default"
general_settings:
  background_health_checks: true
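Once the gateway is up, clients address it like any OpenAI-compatible endpoint and select a backend by `model_name`. The following is a minimal standard-library sketch; the helper names are hypothetical, and the base URL `http://localhost:8000` assumes the port mapping from the compose file above. If the gateway is configured with an API key, an `Authorization` header would also be needed.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=256):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    body = {
        "model": model,  # a model_name from config.yaml, e.g. "default" or "fast"
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running gateway; the URL is an assumption about your setup.
    print(chat("http://localhost:8000", "default", "Count to 10"))
```

Pointing `base_url` at a single backend (for example the vLLM port) instead of the gateway uses the same code, which makes it easy to test each backend directly first.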
Embeddings
Deploy separately to avoid blocking inference:
| Model | Dims | Use |
|---|---|---|
| nomic-embed-text | 768 | Fast, good general retrieval |
| all-MiniLM-L6-v2 | 384 | Lightweight |
| BGE-small-en | 384 | Retrieval optimized |
| E5-large | 1024 | High quality |
ollama pull nomic-embed-text # simplest path
Monitoring
# Quick benchmark
time curl -s http://host:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<MODEL>","messages":[{"role":"user","content":"Count to 10"}]}'
vLLM and LiteLLM expose /metrics in Prometheus format. Alert on: VRAM >90%, tok/s drop >30%, backend down.
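The three alert conditions can be expressed as a small check over values scraped from the metrics endpoints. This is a sketch with hypothetical names; in a real deployment the same thresholds would more likely live in Prometheus alerting rules, and the baseline tok/s has to come from your own measurements.

```python
def check_alerts(vram_used_gb, vram_total_gb, tok_per_s,
                 baseline_tok_per_s, backend_up):
    """Return the list of triggered alerts for one backend."""
    alerts = []
    if not backend_up:
        alerts.append("backend down")
    if vram_used_gb / vram_total_gb > 0.90:
        alerts.append("VRAM >90%")
    # A drop of more than 30% means throughput below 70% of the baseline.
    if tok_per_s < baseline_tok_per_s * 0.70:
        alerts.append("tok/s drop >30%")
    return alerts


if __name__ == "__main__":
    # 22.5 of 24 GB used, throughput fell from 50 to 30 tok/s.
    print(check_alerts(22.5, 24, 30, 50, backend_up=True))
```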
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Model won't load | Insufficient VRAM or format mismatch | Check nvidia-smi; verify GGUF/AWQ matches backend |
| Slow generation | High batch, long context, wrong quant | Lower max_tokens, try a smaller quant (Q4_K_M instead of Q5_K_M) so the model stays fully on GPU, check concurrency |
| Connection refused | Model still loading | Wait (large models take minutes); check docker logs |
| IPv6 error | aiohttp tries IPv6 first | enable_ipv6: false in compose network |
| LiteLLM routing fails | Backend down or model_name mismatch | Test each backend with curl first |
| OOM mid-conversation | KV cache grows with context | Set max_new_tokens=256; use shorter context window |
Always verify single-backend health before adding the gateway. Start with logs: docker logs -f <CONTAINER>.
For Docker Compose patterns, health checks, and GPU passthrough syntax, see docker-selfhost.