vllm-v1-engine

name: vllm-v1-engine
description: V1 engine architecture, migration guide
triggers:
- When user wants to use the new V1 engine
- When user wants to understand V0 vs V1 differences
- When user needs to migrate from V0 to V1
- When user wants to use V1 features like prefix caching

Overview

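vLLM V1 is a re-architected version of vLLM's core engine (scheduler, KV cache manager, and worker loop) that targets higher throughput and turns on features such as prefix caching by default. This skill covers enabling V1, understanding how it differs architecturally from V0, and migrating from V0.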

Prerequisites

vLLM 0.7.0+ (V1 is opt-in via VLLM_USE_V1=1 in the 0.7.x releases; from 0.8.0 it is the default engine)
Understanding of V0 engine behavior (for comparison)

Main Workflow

Step 1: Enable V1 Engine

# Environment variable
export VLLM_USE_V1=1

# Then run as normal
vllm serve meta-llama/Llama-2-7b-chat-hf

# Python API
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
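
Because the engine is chosen through an environment variable, the order of operations matters: the variable should be set before vLLM is imported. The sketch below wraps that in a small helper. `select_engine` is a hypothetical name, not part of vLLM, and the check against `sys.modules` is only a heuristic for "set too late".

```python
import os
import sys


def select_engine(use_v1: bool) -> bool:
    """Set VLLM_USE_V1 and report whether it was set early enough.

    Hypothetical helper, not a vLLM API. Returns False when the vllm
    package is already imported, in which case the setting may be ignored.
    """
    set_in_time = "vllm" not in sys.modules
    os.environ["VLLM_USE_V1"] = "1" if use_v1 else "0"
    return set_in_time


if not select_engine(use_v1=True):
    print("warning: vllm was imported before VLLM_USE_V1 was set")
```

Calling the helper at the very top of the entry-point script, before any `from vllm import ...` line, keeps the ordering explicit.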

Step 2: Verify V1 is Active

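import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# The environment variable only shows what was requested
print(f"V1 requested: {os.environ.get('VLLM_USE_V1')}")

# Check the engine that was actually built; V1 classes live under vllm.v1
print(f"Engine: {type(llm.llm_engine).__module__}")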
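
The module path of the engine object is what distinguishes the two engines. As a minimal sketch, the check can be turned into a function that classifies any engine object; it assumes V1 classes are defined under the `vllm.v1` package (as the import path `vllm.v1.engine.llm_engine` suggests), and `engine_generation` is a hypothetical name.

```python
def engine_generation(engine: object) -> str:
    """Return "V1" if the object's class is defined under vllm.v1, else "V0".

    Assumption: V1 engine classes live in the vllm.v1 package.
    """
    module = type(engine).__module__
    is_v1 = module == "vllm.v1" or module.startswith("vllm.v1.")
    return "V1" if is_v1 else "V0"


# Stand-in classes so the sketch runs without a GPU or a model download
FakeV1 = type("LLMEngine", (), {"__module__": "vllm.v1.engine.llm_engine"})
FakeV0 = type("LLMEngine", (), {"__module__": "vllm.engine.llm_engine"})

print(engine_generation(FakeV1()))  # V1
print(engine_generation(FakeV0()))  # V0
```

With a real model loaded, the object to pass would be the engine held by the `LLM` wrapper, for example `engine_generation(llm.llm_engine)`.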

Step 3: Use V1 Features

Prefix Caching:

import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

# V1 automatically enables prefix caching
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    enable_prefix_caching=True  # Explicit enable (V1 default)
)

# First request populates cache
output1 = llm.generate("Long context prefix... ")

# Second request with same prefix is faster
output2 = llm.generate("Long context prefix... different continuation")
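
To see why the second request can be faster, it helps to model how a block-based prefix cache decides what to reuse. The following is a simplified illustration, not vLLM's actual implementation: tokens are split into fixed-size blocks, each full block is hashed together with the hash of the block before it, and a new request reuses every leading block whose hash matches. The function names and the use of SHA-256 are choices made for this sketch.

```python
import hashlib


def block_hashes(token_ids, block_size=16):
    """Hash each full block, chaining in the previous block's hash."""
    hashes, parent = [], b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(cached_ids, new_ids, block_size=16):
    """Count tokens of new_ids covered by leading blocks shared with cached_ids."""
    hits = 0
    for old, new in zip(block_hashes(cached_ids, block_size),
                        block_hashes(new_ids, block_size)):
        if old != new:
            break
        hits += 1
    return hits * block_size


first = list(range(40)) + [100] * 10   # 40 shared tokens, then one continuation
second = list(range(40)) + [200] * 10  # same 40 tokens, different continuation
print(cached_prefix_tokens(first, second))  # 32
```

Only two full 16-token blocks fit inside the 40 shared tokens, so 32 tokens are reusable and the remaining 8 are recomputed. This is also why very short shared prefixes give little or no benefit.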

Step 4: Compare V0 vs V1

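import os
import subprocess
import sys

# Run each engine in its own process: VLLM_USE_V1 must be set before vLLM
# starts, and two engines in one process would compete for GPU memory.
BENCH_SCRIPT = """
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
prompts = ["Hello world"] * 100

start = time.perf_counter()
llm.generate(prompts, SamplingParams(max_tokens=50))
print(f"ELAPSED {time.perf_counter() - start}")
"""

def benchmark_engine(use_v1):
    env = dict(os.environ, VLLM_USE_V1="1" if use_v1 else "0")
    result = subprocess.run(
        [sys.executable, "-c", BENCH_SCRIPT],
        env=env, capture_output=True, text=True, check=True,
    )
    # vLLM may log to stdout as well, so pick out the marked timing line
    for line in result.stdout.splitlines():
        if line.startswith("ELAPSED "):
            return float(line.split()[1])
    raise RuntimeError("benchmark script printed no timing line")

v0_time = benchmark_engine(use_v1=False)
v1_time = benchmark_engine(use_v1=True)

print(f"V0: {v0_time:.2f}s")
print(f"V1: {v1_time:.2f}s")
print(f"Speedup: {v0_time/v1_time:.2f}x")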
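
A single timed run folds one-time costs (kernel compilation, cache warm-up) into the measurement. A minimal sketch of a steadier measurement is to discard a warm-up run and take the median of several repeats. `median_runtime` is a hypothetical helper, and the lambda below is a stand-in workload so the block runs without vLLM.

```python
import statistics
import time
from typing import Callable


def median_runtime(run: Callable[[], object], warmup: int = 1, repeats: int = 3) -> float:
    """Call run() warmup times untimed, then return the median of repeats timed calls."""
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; with vLLM this would be something like
# lambda: llm.generate(prompts, SamplingParams(max_tokens=50))
elapsed = median_runtime(lambda: sum(range(100_000)))
print(f"median: {elapsed:.4f}s")
```

The median is used rather than the mean so that one slow outlier run does not dominate the comparison.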

V0 vs V1 Comparison

Feature	V0	V1
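Scheduler	Continuous batching; prefill and decode scheduled separately	Continuous batching; unified scheduling of prefill and decode tokens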
Prefix Caching	Optional	Default
Speculative Decoding	Experimental	Improved
Chunked Prefill	Optional	Default
Performance	Baseline	Typically faster (about 10-30%, workload-dependent)
Memory Usage	Standard	Optimized

Common Patterns

Pattern 1: Gradual Migration

# Keep V0 as default, use V1 for specific workloads.
# VLLM_USE_V1 is a process-wide setting, so importing LLM twice yields the
# same class; choose the engine per process instead.
import os
import sys

# e.g. `python worker.py chat` runs on V1, anything else stays on V0
workload = sys.argv[1] if len(sys.argv) > 1 else "batch"

# V1 for chat applications with shared system prompts (prefix caching)
os.environ["VLLM_USE_V1"] = "1" if workload == "chat" else "0"

from vllm import LLM  # import after the variable is set

llm = LLM(model="model")
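
One way to run the two engines side by side during a migration is to start one server process per workload, each with its own environment. The sketch below only builds the command and environment; the workload labels, the `launch_plan` helper, and the port number are illustrative assumptions.

```python
import os

V1_WORKLOADS = {"chat"}  # hypothetical workload labels, chosen for illustration


def launch_plan(workload: str, model: str, port: int):
    """Return (command, environment) for a vLLM server serving one workload."""
    env = dict(os.environ)
    env["VLLM_USE_V1"] = "1" if workload in V1_WORKLOADS else "0"
    cmd = ["vllm", "serve", model, "--port", str(port)]
    return cmd, env


cmd, env = launch_plan("chat", "meta-llama/Llama-2-7b-chat-hf", 8001)
print(" ".join(cmd), "with VLLM_USE_V1 =", env["VLLM_USE_V1"])

# To actually start the server:
# import subprocess
# subprocess.Popen(cmd, env=env)
```

Keeping the engine choice in the launch environment means application code does not need to know which engine serves it, which makes it easy to move a workload from V0 to V1 and back.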

Pattern 2: Feature Detection

import os

def get_llm_class():
    """Return the engine class (not the LLM wrapper) matching VLLM_USE_V1."""
    use_v1 = os.environ.get("VLLM_USE_V1", "0") == "1"

    if use_v1:
        from vllm.v1.engine.llm_engine import LLMEngine as V1Engine
        return V1Engine
    else:
        from vllm.engine.llm_engine import LLMEngine as V0Engine
        return V0Engine

# Usage
LLMEngine = get_llm_class()
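
The environment variable says which engine was requested, not whether the installed vLLM ships it. A complementary check, sketched here with the standard library only, is to probe for the `vllm.v1` package before importing from it. `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the named module can be located.

    For a dotted name, find_spec imports the parent package first, so a
    missing parent is reported as "not available" rather than raising.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


print("V1 package present:", has_module("vllm.v1"))
```

If the probe returns False, falling back to the V0 import path avoids an ImportError at startup.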

Pattern 3: Prefix Caching Optimization

import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Common system prompt
system_prompt = """You are a helpful assistant. Be concise and accurate.
Always cite sources when providing information."""

# With V1 prefix caching, this prefix is computed once
conversations = [
    f"{system_prompt}\nUser: What is Python?\nAssistant:",
    f"{system_prompt}\nUser: Explain recursion\nAssistant:",
    f"{system_prompt}\nUser: Write a sorting algorithm\nAssistant:",
]

outputs = llm.generate(conversations, SamplingParams(max_tokens=100))
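
Prefix caching only helps if the prompts really do start with identical text; a timestamp or user name placed before the system prompt would defeat it. A quick sanity check, sketched below, measures the shared leading text of a batch. Character overlap is only a proxy for token overlap, and `shared_prefix` is a hypothetical helper.

```python
import os


def shared_prefix(prompts):
    """Return the longest leading string common to all prompts.

    os.path.commonprefix compares character by character, despite its name.
    """
    return os.path.commonprefix(list(prompts))


system_prompt = "You are a helpful assistant. Be concise and accurate."
prompts = [
    f"{system_prompt}\nUser: What is Python?\nAssistant:",
    f"{system_prompt}\nUser: Explain recursion\nAssistant:",
]

prefix = shared_prefix(prompts)
print(len(prefix), "shared characters")
print("system prompt fully shared:", prefix.startswith(system_prompt))
```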

Troubleshooting

Problem: V1 not available

Solution:

# Upgrade vLLM
pip install --upgrade vllm

# Check version
python -c "import vllm; print(vllm.__version__)"

# V1 is opt-in from 0.7.0 and the default engine from 0.8.0
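
The version check can also be done programmatically. Comparing version strings directly is unreliable ("0.10.0" sorts before "0.9.2" as text), so the sketch below compares numeric release tuples. Both helper names are hypothetical, and the minimum version to pass in is whatever the V1 documentation for your release states.

```python
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric components, e.g. "0.8.0.post1" -> (0, 8, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


try:
    installed = metadata.version("vllm")
except metadata.PackageNotFoundError:
    installed = None  # vLLM is not installed in this environment

print("installed vllm:", installed)
print(meets_minimum("0.10.0", "0.9.2"))  # True
```

This ignores pre-release ordering (an "rc" build compares equal to its final release), which is usually acceptable for a coarse availability check.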

Problem: Different behavior in V1

Solution:

Some parameters may behave differently:

max_num_seqs: V1 uses different scheduling
scheduler_delay_factor: May not apply in V1

Check V1 documentation for parameter compatibility.
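
When migrating an existing configuration, it can help to flag the parameters listed above automatically so they get reviewed. A minimal sketch: the parameter notes are taken from the list above, and `review_engine_kwargs` is a hypothetical helper, not a vLLM function.

```python
# Parameters this guide lists as behaving differently under V1
V1_SENSITIVE = {
    "max_num_seqs": "V1 uses different scheduling",
    "scheduler_delay_factor": "may not apply in V1",
}


def review_engine_kwargs(kwargs: dict) -> dict:
    """Return the subset of configured parameters that need review for V1."""
    return {name: note for name, note in V1_SENSITIVE.items() if name in kwargs}


config = {"model": "model", "max_num_seqs": 128, "scheduler_delay_factor": 0.5}
for name, note in review_engine_kwargs(config).items():
    print(f"review {name}: {note}")
```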

Problem: Prefix caching not working

Solution:

from vllm import LLM

# Verify prefix caching is enabled
llm = LLM(
    model="model",
    enable_prefix_caching=True,
    block_size=16  # Caching works on whole blocks; a shared prefix shorter than one block never hits
)

# Check cache hit rate (if metrics available)
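
If the server exposes Prometheus-style metrics, a hit rate can be derived from a pair of counters (cache hits and cache lookups). The sketch below parses the text exposition format and computes the ratio. The metric names in the sample are placeholders invented for illustration; look up the actual counter names in the metrics output of your vLLM version. The parser assumes samples carry no trailing timestamp.

```python
def sum_metric(metrics_text: str, metric_name: str) -> float:
    """Sum all samples of one metric across label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            total += float(value)
    return total


def hit_rate(metrics_text: str, hits_name: str, queries_name: str) -> float:
    """Return hits / queries, or 0.0 when nothing has been looked up yet."""
    queries = sum_metric(metrics_text, queries_name)
    return sum_metric(metrics_text, hits_name) / queries if queries else 0.0


# Placeholder metric names, not real vLLM metric names
SAMPLE = """\
# TYPE example_cache_queries_total counter
example_cache_queries_total{model_name="m"} 200.0
# TYPE example_cache_hits_total counter
example_cache_hits_total{model_name="m"} 150.0
"""

print(hit_rate(SAMPLE, "example_cache_hits_total", "example_cache_queries_total"))  # 0.75
```

A hit rate near zero on a workload with a long shared system prompt suggests that caching is disabled or that the prompts do not actually share a leading block.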

References

V1 Migration Guide - Detailed migration steps
V1 Features - New features in V1