name: vllm-v1-engine
description: V1 engine architecture, migration guide
triggers:
- When user wants to use the new V1 engine
- When user wants to understand V0 vs V1 differences
- When user needs to migrate from V0 to V1
- When user wants to use V1 features like prefix caching
vllm-v1-engine
Overview
vLLM V1 is a rewrite of vLLM's core engine that aims for lower scheduling overhead and turns on prefix caching and chunked prefill by default. This skill covers enabling V1, understanding how it differs architecturally from V0, and migrating from V0.
Prerequisites
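- vLLM 0.7.0+ (V1 is opt-in via VLLM_USE_V1=1 in the 0.7 series and became the default engine in 0.8.0; confirm against the release notes for your installed version)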
- Understanding of V0 engine behavior (for comparison)
Main Workflow
Step 1: Enable V1 Engine
# Environment variable
export VLLM_USE_V1=1
# Then run as normal
vllm serve meta-llama/Llama-2-7b-chat-hf
# Python API: set the variable before importing vllm
import os
os.environ["VLLM_USE_V1"] = "1"
from vllm import LLM
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
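Since the flag is set before vllm is imported, a small guard can make that ordering explicit. The following is a minimal sketch, not part of vLLM: the helper name select_engine is hypothetical, and it conservatively refuses to change the flag once the package is already in sys.modules.

```python
import os
import sys


def select_engine(use_v1: bool, package: str = "vllm") -> str:
    """Set VLLM_USE_V1 and return the value that was written.

    Hypothetical helper: raises if `package` has already been imported,
    because this guide sets the flag before the import.
    """
    if package in sys.modules:
        raise RuntimeError(f"{package} is already imported; set VLLM_USE_V1 first")
    value = "1" if use_v1 else "0"
    os.environ["VLLM_USE_V1"] = value
    return value
```

Call select_engine(True) at the top of the entry point, then import vllm as usual.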
Step 2: Verify V1 is Active
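import os
print(f"VLLM_USE_V1: {os.environ.get('VLLM_USE_V1')}")
# Check the engine type on a live instance (llm from Step 1).
# The import path vllm.engine.llm_engine is the same with or without the flag,
# so inspect the engine object that the LLM actually holds instead.
print(f"Engine: {type(llm.llm_engine).__module__}")  # expected under vllm.v1 when V1 is active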
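One way to turn this check into something a script can branch on is to classify an object by the module path of its class. This is a sketch under two assumptions: V1 classes live under the vllm.v1 package (as the import path used later in this skill suggests), and the LLM object exposes its engine as llm.llm_engine.

```python
def engine_generation(obj) -> str:
    """Return "v1" if obj's class is defined under vllm.v1, otherwise "v0"."""
    module = type(obj).__module__
    if module == "vllm.v1" or module.startswith("vllm.v1."):
        return "v1"
    return "v0"
```

With a live instance this would be called as engine_generation(llm.llm_engine).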
Step 3: Use V1 Features
Prefix Caching:
import os
os.environ["VLLM_USE_V1"] = "1"
from vllm import LLM
# V1 automatically enables prefix caching
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    enable_prefix_caching=True  # Explicit enable (V1 default)
)
# First request populates cache
output1 = llm.generate("Long context prefix... ")
# Second request with same prefix is faster
output2 = llm.generate("Long context prefix... different continuation")
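To see whether the second request really is faster, the two calls can be timed separately. The helper below is a hedged sketch: time_cold_and_warm is a hypothetical name, and it accepts any callable so it can be tried without a GPU. Generation time can dominate the measurement, so a small max_tokens makes the prefill difference easier to see.

```python
import time
from typing import Callable, Tuple


def time_cold_and_warm(generate: Callable[[str], object],
                       first_prompt: str,
                       second_prompt: str) -> Tuple[float, float]:
    """Time two sequential calls; the second prompt shares a prefix with the first."""
    start = time.perf_counter()
    generate(first_prompt)
    cold = time.perf_counter() - start

    start = time.perf_counter()
    generate(second_prompt)
    warm = time.perf_counter() - start
    return cold, warm
```

For example: cold, warm = time_cold_and_warm(llm.generate, prefix, prefix + "different continuation"), where prefix is a long shared context.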
Step 4: Compare V0 vs V1
import os
import sys
import time

def benchmark_engine(use_v1):
    # Set the flag before importing vllm so the engine choice takes effect
    os.environ["VLLM_USE_V1"] = "1" if use_v1 else "0"
    from vllm import LLM, SamplingParams
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
    prompts = ["Hello world"] * 100
    start = time.perf_counter()
    llm.generate(prompts, SamplingParams(max_tokens=50))
    return time.perf_counter() - start

# Run one engine per process: a second LLM in the same process would compete
# with the first for GPU memory and reuse the already-imported modules.
# Saved as benchmark.py (an illustrative file name):
#   python benchmark.py 0   # V0
#   python benchmark.py 1   # V1
# Speedup = V0 time / V1 time
if __name__ == "__main__":
    use_v1 = sys.argv[1] == "1"
    print(f"{'V1' if use_v1 else 'V0'}: {benchmark_engine(use_v1):.2f}s")
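A fair comparison runs each engine in its own process, so that neither run inherits imported modules or GPU memory from the other. The sketch below times an arbitrary command under each flag value. The function names are hypothetical, and the wall-clock figure includes model loading, so treat it as a coarse comparison.

```python
import os
import subprocess
import sys
import time
from typing import Dict, List


def time_command(cmd: List[str], use_v1: bool) -> float:
    """Run cmd in a fresh process with VLLM_USE_V1 set; return wall-clock seconds."""
    env = dict(os.environ, VLLM_USE_V1="1" if use_v1 else "0")
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)  # raises if the command fails
    return time.perf_counter() - start


def compare(cmd: List[str]) -> Dict[str, float]:
    """Time the same command under V0 and then V1 and report the ratio."""
    v0 = time_command(cmd, use_v1=False)
    v1 = time_command(cmd, use_v1=True)
    return {"v0": v0, "v1": v1, "speedup": v0 / v1}
```

Usage would look like compare([sys.executable, "my_benchmark.py"]), where my_benchmark.py is a placeholder for a script that loads the model and generates.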
V0 vs V1 Comparison
| Feature | V0 | V1 |
|---|---|---|
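| Scheduler | Continuous batching, prefill and decode scheduled separately | Continuous batching with a unified scheduler (prefill and decode mixed in one step) |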
| Prefix Caching | Optional | Default |
| Speculative Decoding | Experimental | Improved |
| Chunked Prefill | Optional | Default |
| Performance | Baseline | 10-30% faster (workload-dependent; measure with Step 4) |
| Memory Usage | Standard | Optimized |
Common Patterns
Pattern 1: Gradual Migration
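# Keep V0 as default, use V1 for specific workloads.
# VLLM_USE_V1 is process-wide and "from vllm import LLM" returns the same class
# either way, so V0 and V1 cannot be mixed by importing twice in one process.
# Give each workload its own process and set the flag there.

# chat_worker.py (illustrative file name): chat with shared system prompts, on V1
import os
os.environ["VLLM_USE_V1"] = "1"
from vllm import LLM
chat_llm = LLM(model="model")

# batch_worker.py (illustrative file name): other workloads stay on V0
import os
os.environ["VLLM_USE_V1"] = "0"
from vllm import LLM
batch_llm = LLM(model="model")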
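Because the flag applies to a whole process, a gradual migration usually means one server process per workload. The sketch below builds the command and environment for each process. The workload names and the mapping are hypothetical; vllm serve and its --port option are the only vLLM pieces used.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical workload-to-engine mapping; adjust to your deployment.
WORKLOAD_USES_V1: Dict[str, bool] = {"chat": True, "batch": False}


def launch_spec(workload: str, model: str, port: int) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, environment) for one `vllm serve` process."""
    use_v1 = WORKLOAD_USES_V1[workload]
    env = dict(os.environ, VLLM_USE_V1="1" if use_v1 else "0")
    cmd = ["vllm", "serve", model, "--port", str(port)]
    return cmd, env
```

Each pair can then be handed to subprocess.Popen(cmd, env=env), or translated into whatever process manager the deployment uses.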
Pattern 2: Feature Detection
import os

def get_engine_class():
    """Return the engine class that matches the VLLM_USE_V1 setting."""
    use_v1 = os.environ.get("VLLM_USE_V1", "0") == "1"
    if use_v1:
        from vllm.v1.engine.llm_engine import LLMEngine as V1Engine
        return V1Engine
    else:
        from vllm.engine.llm_engine import LLMEngine as V0Engine
        return V0Engine

# Usage
LLMEngine = get_engine_class()
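Feature detection can also ask whether the V1 package exists at all before trying to import from it. This is a minimal sketch using the standard library. The helper name is hypothetical, and treating the presence of vllm.v1 as "V1 is available" is an assumption.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if the module can be located.

    For a dotted name, find_spec imports the parent packages but not the
    target module itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec
        return False
```

For instance, module_available("vllm.v1") could gate the V1 branch. Because this imports the parent vllm package, set VLLM_USE_V1 before calling it.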
Pattern 3: Prefix Caching Optimization
import os
os.environ["VLLM_USE_V1"] = "1"
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
# Common system prompt
system_prompt = """You are a helpful assistant. Be concise and accurate.
Always cite sources when providing information."""
# With V1 prefix caching, this prefix is computed once
conversations = [
    f"{system_prompt}\nUser: What is Python?\nAssistant:",
    f"{system_prompt}\nUser: Explain recursion\nAssistant:",
    f"{system_prompt}\nUser: Write a sorting algorithm\nAssistant:",
]
outputs = llm.generate(conversations, SamplingParams(max_tokens=100))
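Prefix reuse depends on every prompt starting with exactly the same text, so it can help to build the prompts in one place and check how much they share. The helpers below are an illustrative sketch with hypothetical names. They measure the shared prefix in characters, which is only a proxy for the token-level prefix the engine sees.

```python
import os
from typing import List


def build_prompts(system_prompt: str, questions: List[str]) -> List[str]:
    """Prefix every question with the same system prompt, character for character."""
    return [f"{system_prompt}\nUser: {q}\nAssistant:" for q in questions]


def shared_prefix_chars(prompts: List[str]) -> int:
    """Length in characters of the prefix common to all prompts."""
    return len(os.path.commonprefix(prompts))
```

If shared_prefix_chars returns less than the length of the system prompt, something (whitespace, a typo, a per-request timestamp) is breaking the shared prefix.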
Troubleshooting
Problem: V1 not available
Solution:
# Upgrade vLLM
pip install --upgrade vllm
# Check version
python -c "import vllm; print(vllm.__version__)"
# V1 is opt-in from the 0.7 series and the default engine from 0.8.0
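The version check can also be done programmatically. The sketch below reads the installed version through importlib.metadata and reduces it to a comparable tuple. The helper names are hypothetical, and the minimum version to compare against should come from the vLLM release notes.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Leading numeric components of a version string; suffixes such as rc1 are dropped."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def installed_release(package: str = "vllm") -> Optional[Tuple[int, ...]]:
    """Installed release as a tuple, or None if the package is not installed."""
    try:
        return parse_release(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None
```

A caller would check that installed_release() is not None before comparing it with a minimum such as parse_release("X.Y.Z").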
Problem: Different behavior in V1
Solution:
Some parameters may behave differently:
- max_num_seqs: V1 uses different scheduling
- scheduler_delay_factor: May not apply in V1
Check V1 documentation for parameter compatibility.
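During migration it can be handy to flag these parameters wherever engine arguments are assembled. The sketch below is illustrative only: the table contains just the two parameters named above with the same notes, and the function name is hypothetical.

```python
from typing import Dict, List

# Notes copied from the list above; extend after checking the V1 documentation.
PARAMS_TO_REVIEW: Dict[str, str] = {
    "max_num_seqs": "V1 uses different scheduling",
    "scheduler_delay_factor": "may not apply in V1",
}


def review_engine_kwargs(kwargs: Dict[str, object]) -> List[str]:
    """Return one note per keyword argument that is listed in PARAMS_TO_REVIEW."""
    return [f"{name}: {note}" for name, note in PARAMS_TO_REVIEW.items() if name in kwargs]
```

Logging the returned notes before constructing the LLM gives a reminder to re-test those settings under V1.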
Problem: Prefix caching not working
Solution:
# Verify prefix caching is enabled
from vllm import LLM
llm = LLM(
    model="model",
    enable_prefix_caching=True,
    block_size=16  # Hits are per full block: the shared prefix must cover at least one whole block
)
# Check cache hit rate (if metrics available)
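A common reason for missing cache hits is that the shared prefix is shorter than one block, or that the prompts diverge inside the first block. The sketch below counts how many leading full blocks two token sequences have in common. It is a simplified model for reasoning about block_size, not vLLM's implementation, and the token IDs are assumed to come from the model's tokenizer.

```python
from typing import Sequence


def shared_full_blocks(tokens_a: Sequence[int],
                       tokens_b: Sequence[int],
                       block_size: int = 16) -> int:
    """Count leading blocks of block_size tokens that are identical in both sequences."""
    blocks = 0
    limit = min(len(tokens_a), len(tokens_b))
    # Only full blocks count; a trailing partial block is ignored
    for start in range(0, limit - block_size + 1, block_size):
        block_a = list(tokens_a[start:start + block_size])
        block_b = list(tokens_b[start:start + block_size])
        if block_a != block_b:
            break
        blocks += 1
    return blocks
```

A result of 0 for two prompts that look similar suggests the common prefix is too short, or differs too early, to be reused.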
References
- V1 Migration Guide - Detailed migration steps
- V1 Features - New features in V1