name: vllm-omni-npu-model-runner-upgrade
description: "Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic."
vLLM-Omni NPU Model Runner Upgrade Skill
Overview
This skill guides the process of upgrading vllm-omni's NPU model runners to align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners are designed to run omni multimodal models (like Qwen3-Omni, Bagel, MiMoAudio) on Ascend NPUs.
File Structure
NPU Model Runner Files
vllm-omni/vllm_omni/platforms/npu/worker/
├── __init__.py
├── npu_model_runner.py # OmniNPUModelRunner (base class)
├── npu_ar_model_runner.py # NPUARModelRunner (autoregressive)
├── npu_ar_worker.py # AR worker
├── npu_generation_model_runner.py # NPUGenerationModelRunner (diffusion/non-AR)
└── npu_generation_worker.py # Generation worker
GPU Reference Files (for omni-specific logic sync)
vllm-omni/vllm_omni/worker/
├── __init__.py
├── gpu_model_runner.py # OmniGPUModelRunner
├── gpu_ar_model_runner.py # GPUARModelRunner
├── gpu_ar_worker.py
├── gpu_generation_model_runner.py
├── gpu_generation_worker.py
├── mixins.py
├── base.py
└── gpu_memory_utils.py
vllm-ascend Reference Files
vllm-ascend/vllm_ascend/worker/
├── model_runner_v1.py # NPUModelRunner (base class to copy from)
├── npu_input_batch.py
├── block_table.py
├── pcp_utils.py
└── worker.py
Inheritance Hierarchy
                     GPUModelRunner (vllm)
                            |
           +----------------+----------------+
           |                                 |
    OmniGPUModelRunner              NPUModelRunner (vllm-ascend)
    (vllm_omni/worker)              (vllm_ascend/worker)
           |                                 |
           +------- OmniNPUModelRunner ------+
                  (multiple inheritance)
                            |
            +---------------+---------------+
            |                               |
    NPUARModelRunner            NPUGenerationModelRunner
    (autoregressive)            (non-autoregressive/diffusion)
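The diamond in this hierarchy matters because Python's method resolution order (MRO) decides which parent's method runs when `OmniNPUModelRunner` does not override it. The sketch below uses empty stand-in classes with the same names; the order of the two bases on `OmniNPUModelRunner` is an assumption, so check the real class statement before relying on it.

```python
# Stand-in classes only: they mirror the names in the diagram, not the real code.
class GPUModelRunner:                       # stand-in for vllm's runner
    def load_model(self):
        return "GPUModelRunner.load_model"


class OmniGPUModelRunner(GPUModelRunner):   # stand-in for vllm_omni/worker
    pass


class NPUModelRunner(GPUModelRunner):       # stand-in for vllm_ascend/worker
    def load_model(self):
        return "NPUModelRunner.load_model"


# ASSUMPTION: OmniGPUModelRunner is listed first. Swapping the bases changes
# the lookup order printed below.
class OmniNPUModelRunner(OmniGPUModelRunner, NPUModelRunner):
    pass


mro_names = [cls.__name__ for cls in OmniNPUModelRunner.__mro__]
print(mro_names)
# The stand-in OmniGPUModelRunner has no load_model, so lookup falls through
# to NPUModelRunner before reaching the shared GPUModelRunner base.
print(OmniNPUModelRunner().load_model())
```

Under this assumed base order, any method that `OmniGPUModelRunner` does override would take precedence over the vllm-ascend version unless `OmniNPUModelRunner` overrides it again, which is one reason the methods listed in the next section are redefined on the NPU side.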
Omni-Specific Comment Markers
Omni-specific logic is marked with comment blocks:
# -------------------------------------- Omni-new -------------------------------------------------
# ... omni-specific code ...
# -------------------------------------- Omni-new -------------------------------------------------
Or simpler variations:
# -------------------------------------- Omni-new -------------------------------------------------
# ------------------------------------------------------------------------------------------------
Important:
- Always preserve and add these markers when modifying code.
- The reference document (`references/omni-specific-blocks.md`) may not be up to date. Always grep for `Omni-new` in the GPU implementations to find the authoritative list of omni-specific blocks.
- When you discover new omni-specific code that is not documented in the references, update the reference files.
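Because a block may close with either a second `Omni-new` marker or a plain dashed comment, a line-based `grep` shows markers but not block boundaries. The following is a minimal sketch of pairing them up; the regular expressions and the function name `find_omni_blocks` are illustrative assumptions, not part of the repository.

```python
import re
from typing import Optional

# ASSUMPTION: markers look like "# ---- Omni-new ----"; a plain closing line is
# a comment made only of dashes (at least 10, to avoid matching short rules).
OPEN_RE = re.compile(r"^\s*#\s*-+\s*Omni-new\s*-+\s*$")
DASH_RE = re.compile(r"^\s*#\s*-{10,}\s*$")


def find_omni_blocks(source: str) -> list[tuple[int, Optional[int]]]:
    """Return 1-based (start, end) line spans; end is None if never closed."""
    spans: list[tuple[int, Optional[int]]] = []
    start: Optional[int] = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        is_marker = bool(OPEN_RE.match(line))
        if start is None:
            if is_marker:
                start = lineno                # opening marker
        elif is_marker or DASH_RE.match(line):
            spans.append((start, lineno))     # closing marker of either style
            start = None
    if start is not None:
        spans.append((start, None))           # unclosed block, needs review
    return spans


SAMPLE = """\
def _model_forward(self):
    # ---------- Omni-new ----------
    extra = {}
    # ---------- Omni-new ----------
    x = 1
    # ---------- Omni-new ----------
    y = 2
    # --------------------
"""

print(find_omni_blocks(SAMPLE))
```

A span whose end is `None` usually means a marker was lost during a copy from vllm-ascend and is worth fixing before going further.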
Key Methods Requiring Attention
OmniNPUModelRunner (npu_model_runner.py)
| Method | Description | Omni-Specific Logic |
|---|---|---|
| `load_model` | Load model and initialize talker_mtp | Uses ACLGraphWrapper instead of CUDAGraphWrapper, initializes talker buffers |
| `_dummy_run` | Warmup/profiling run | talker_mtp dummy forward, extract_multimodal_outputs |
| `_model_forward` | Forward pass wrapper | Injects model_kwargs_extra, wraps with OmniOutput, NPU-specific graph updates |
| `_talker_mtp_forward` | Talker MTP forward for Qwen3-Omni | Uses set_ascend_forward_context |
NPUARModelRunner (npu_ar_model_runner.py)
| Method | Description | Omni-Specific Logic |
|---|---|---|
| `__init__` | Initialize with KV transfer manager | OmniKVTransferManager setup |
| `execute_model` | Main inference entry | KV transfer handling, _update_states override, extract_multimodal_outputs |
| `sample_tokens` | Token sampling | Hidden states extraction, multimodal outputs processing, OmniModelRunnerOutput |
| `_resolve_global_request_id` | Request ID resolution | For disaggregated inference |
NPUGenerationModelRunner (npu_generation_model_runner.py)
| Method | Description | Omni-Specific Logic |
|---|---|---|
| `_update_request_states` | Update request states for async chunk | async_chunk handling |
| `execute_model` | Generation forward | async_chunk, seq_token_counts, _run_generation_model |
| `sample_tokens` | Output processing | multimodal output packaging to OmniModelRunnerOutput |
| `_dummy_run` | Dummy run override | model_kwargs initialization, multimodal extraction |
| `_run_generation_model` | Run generation model | Calls _model_forward with sampler |
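After an upgrade it can be useful to confirm that each runner still defines the methods listed in these tables, since a method dropped during a copy would silently fall back to a parent implementation. This is a rough sketch that assumes every listed method is defined directly on the class; `missing_overrides` is a hypothetical helper and the class at the bottom is a stand-in for demonstration.

```python
# Method names are taken from the tables above.
EXPECTED_OVERRIDES = {
    "OmniNPUModelRunner": [
        "load_model", "_dummy_run", "_model_forward", "_talker_mtp_forward",
    ],
    "NPUARModelRunner": [
        "__init__", "execute_model", "sample_tokens", "_resolve_global_request_id",
    ],
    "NPUGenerationModelRunner": [
        "_update_request_states", "execute_model", "sample_tokens",
        "_dummy_run", "_run_generation_model",
    ],
}


def missing_overrides(cls: type) -> list[str]:
    """List expected methods that are not defined on cls itself."""
    expected = EXPECTED_OVERRIDES[cls.__name__]
    # vars(cls) holds only attributes defined on this class, not inherited ones.
    return [name for name in expected if name not in vars(cls)]


# Stand-in for demonstration; the real class lives in npu_ar_model_runner.py.
class NPUARModelRunner:
    def __init__(self): ...
    def execute_model(self): ...
    def sample_tokens(self): ...


print(missing_overrides(NPUARModelRunner))
```

With the real classes imported in place of the stand-in, an empty list for all three runners is the expected result.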
Upgrade Workflow
Step 1: Preparation
Identify target versions (use the gh CLI to check):
- We're using the vllm-omni main branch
- Check the last release of vllm-omni
- Target vllm-ascend version: use the latest local vllm-ascend code directly
Check GPU-side changes (since last release):
cd /root/vllm-workspace/vllm-omni
git log --oneline --since="<last-release-date>" -- vllm_omni/worker/
Read latest vllm-ascend code:
- We don't track vllm-ascend changes; use the latest code directly from `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py`
- Copy the relevant methods and re-insert omni-specific blocks
Step 2: Analyze Omni-Specific Logic
For each NPU model runner file:
Extract existing omni-specific blocks:
grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py
Document each omni block:
- Which method it belongs to
- What functionality it provides
- Dependencies on other omni code
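To record which method each block belongs to, the marker lines can be attributed to their enclosing definition automatically. The sketch below is one possible approach using the standard `ast` module plus an indentation heuristic (comments are not part of the syntax tree, so a closing marker after the last statement of a method cannot be located by node ranges alone). The function name is hypothetical.

```python
import ast


def omni_markers_by_scope(source: str) -> dict[str, list[int]]:
    """Map each function or class name to the Omni-new marker lines inside it."""
    # (start line, column of the def/class keyword, name), in file order.
    scopes = sorted(
        (node.lineno, node.col_offset, node.name)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    )
    found: dict[str, list[int]] = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "Omni-new" not in line:
            continue
        indent = len(line) - len(line.lstrip())
        owner = "<module>"
        # Heuristic: the owner is the latest scope that starts at or before the
        # marker and is indented less than the marker itself.
        for start, col, name in reversed(scopes):
            if start <= lineno and col < indent:
                owner = name
                break
        found.setdefault(owner, []).append(lineno)
    return found


SAMPLE = '''\
class Runner:
    def load_model(self):
        # ---- Omni-new ----
        self.talker = None
        # ---- Omni-new ----

    def _dummy_run(self):
        def helper():
            return 0
        # ---- Omni-new ----
        return helper()
'''

print(omni_markers_by_scope(SAMPLE))
```

The heuristic assumes space-indented class-and-method layout, which matches the runner files; markers nested inside module-level `if` blocks could be misattributed.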
Step 3: Update Base Class (OmniNPUModelRunner)
Note: Always check the GPU implementation gpu_model_runner.py for any new omni logic not yet documented in references.
Read the latest vllm-ascend `NPUModelRunner.load_model`
Copy the method, keeping the structure
Re-insert omni-specific logic (check GPU `gpu_model_runner.py` for authoritative list):
- Replace `CUDAGraphWrapper` with `ACLGraphWrapper`
- Keep talker_mtp initialization
- Preserve buffer allocations for talker
- Check for any new omni blocks added since last sync
Update `_dummy_run`:
- Copy from vllm-ascend
- Compare with GPU `_dummy_run` for omni-specific blocks
- Re-insert all `Omni-new` marked code from GPU version
Update `_model_forward`:
- Keep the omni wrapper logic
- Update NPU-specific parts (graph params, SP all-gather)
- Check GPU version for any new omni logic
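One way to check a copied method is to remove the marked omni blocks from the NPU version and compare what is left against the vllm-ascend original: anything that still differs is drift outside the markers. This is a sketch under the assumption that omni code only adds lines; where an omni block replaces upstream lines, those lines will show up as removed and need manual judgment. All names here are illustrative.

```python
import difflib
import re

MARKER_RE = re.compile(r"^\s*#\s*-+\s*Omni-new\s*-+\s*$")
DASHES_RE = re.compile(r"^\s*#\s*-{10,}\s*$")


def strip_omni_blocks(source: str) -> list[str]:
    """Return the source lines with marked omni blocks (and markers) removed."""
    kept: list[str] = []
    inside = False
    for line in source.splitlines():
        if not inside and MARKER_RE.match(line):
            inside = True        # opening marker: start dropping lines
        elif inside and (MARKER_RE.match(line) or DASHES_RE.match(line)):
            inside = False       # closing marker: drop it, resume keeping
        elif not inside:
            kept.append(line)
    return kept


def drift_from_upstream(upstream: str, local: str) -> list[str]:
    """Lines that differ between upstream and local once omni blocks are gone."""
    a = upstream.splitlines()
    b = strip_omni_blocks(local)
    drift: list[str] = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag in ("delete", "replace"):
            drift += ["- " + line for line in a[i1:i2]]
        if tag in ("insert", "replace"):
            drift += ["+ " + line for line in b[j1:j2]]
    return drift


UPSTREAM = "def load_model(self):\n    self.model = get_model()\n    return None\n"
LOCAL = (
    "def load_model(self):\n"
    "    self.model = get_model()\n"
    "    # ---------- Omni-new ----------\n"
    "    self.talker_mtp = None\n"
    "    # ---------- Omni-new ----------\n"
    "    return None\n"
)
STALE = LOCAL.replace("get_model()", "old_get_model()")

print(drift_from_upstream(UPSTREAM, LOCAL))
print(drift_from_upstream(UPSTREAM, STALE))
```

In practice the two strings would come from the method bodies in `model_runner_v1.py` and the NPU runner file, for example via `inspect.getsource`.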
Step 4: Update AR Model Runner
Compare with GPU `gpu_ar_model_runner.py` for any new omni features
Copy `execute_model` from vllm-ascend
Re-insert omni blocks (reference `references/omni-specific-blocks.md`, but note it may be incomplete):
- IMPORTANT: Always check the GPU implementation `gpu_ar_model_runner.py` for all `Omni-new` marked code blocks
- The reference doc may not include newly added omni logic; treat it as a starting point, not an exhaustive list
- When discovering new omni code blocks, please update `references/omni-specific-blocks.md`
- Common omni blocks include but are not limited to: KV transfer, multimodal outputs, sampling_metadata handling, etc.
Update `sample_tokens` (also compare with GPU implementation):
- Compare with `gpu_ar_model_runner.py`'s `sample_tokens` method
- Identify all `Omni-new` marked code blocks
- Ensure NPU version includes all omni-specific logic
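A quick first pass at "does the NPU version include all omni-specific logic" is to count `Omni-new` markers per method in the GPU and NPU files and list the methods where the NPU side has fewer. The sketch below uses a simple line scan (the most recent `def` wins), so it is only a hint for where to look; it assumes both files use the same marker style, since a block closed with a plain dashed line counts one marker instead of two.

```python
import re
from collections import Counter

DEF_RE = re.compile(r"^\s*(?:async\s+)?def\s+(\w+)\s*\(")


def marker_counts(source: str) -> Counter:
    """Count Omni-new marker lines under the most recently seen def."""
    counts: Counter = Counter()
    current = "<module>"
    for line in source.splitlines():
        match = DEF_RE.match(line)
        if match:
            current = match.group(1)
        elif "Omni-new" in line:
            counts[current] += 1
    return counts


def missing_on_npu(gpu_source: str, npu_source: str) -> dict[str, tuple[int, int]]:
    """Methods with fewer markers on the NPU side: name -> (gpu, npu) counts."""
    gpu, npu = marker_counts(gpu_source), marker_counts(npu_source)
    return {name: (n, npu[name]) for name, n in gpu.items() if npu[name] < n}


MARK = "        # ---------- Omni-new ----------\n"
GPU_SAMPLE = (
    "class GPUARModelRunner:\n"
    "    def execute_model(self):\n"
    + MARK + "        a = 1\n" + MARK
    + MARK + "        b = 2\n" + MARK
    + "    def sample_tokens(self):\n"
    + MARK + "        c = 3\n" + MARK
)
NPU_SAMPLE = (
    "class NPUARModelRunner:\n"
    "    def execute_model(self):\n"
    + MARK + "        a = 1\n" + MARK
    + "    def sample_tokens(self):\n"
    + MARK + "        c = 3\n" + MARK
)

print(missing_on_npu(GPU_SAMPLE, NPU_SAMPLE))
```

Equal counts do not prove the block contents match, so a flagged or unflagged method still needs a side-by-side read against the GPU file.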
Step 5: Update Generation Model Runner
Note: Generation model runner may have unique omni logic for diffusion/non-AR models.
Compare with GPU `gpu_generation_model_runner.py`: grep for all `Omni-new` blocks
Update `execute_model`:
- Check GPU version for all omni-specific blocks
- Keep async_chunk handling
- Keep `seq_token_counts` injection
- Update forward/context setup from vllm-ascend
- Look for any new omni logic not documented in references
Update `_dummy_run`:
- Copy from vllm-ascend base
- Compare with GPU `_dummy_run` if it exists
- Re-insert all omni-specific logic
Step 6: Update Imports
Check and update imports at the top of each file:
# Common vllm-ascend imports
from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context
from vllm_ascend.attention.attention_v1 import AscendAttentionState
from vllm_ascend.attention.utils import using_paged_attention
from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params
from vllm_ascend.ops.rotary_embedding import update_cos_sin
from vllm_ascend.utils import enable_sp, lmhead_tp_enable
from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner
# Omni-specific imports
from vllm_omni.model_executor.models.output_templates import OmniOutput
from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner
from vllm_omni.outputs import OmniModelRunnerOutput
from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager
Step 7: Sync GPU-Side Omni Changes
Check recent GPU worker changes:
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_model_runner.py
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_ar_model_runner.py
Identify new omni features that need to be ported to NPU
Apply corresponding changes to NPU runners
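When the diff is long, newly added omni blocks can be picked out by scanning for added lines that contain the marker. A minimal sketch, assuming the input is standard `git diff` output (with `diff --git` headers); the function name and the sample diff are made up for illustration.

```python
def added_omni_markers(diff_text: str) -> dict[str, int]:
    """Count added lines containing 'Omni-new', grouped by destination file."""
    counts: dict[str, int] = {}
    current = None
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False                       # a new file section begins
        elif not in_hunk and line.startswith("+++ "):
            path = line[4:].strip()
            current = path[2:] if path.startswith("b/") else path
        elif line.startswith("@@"):
            in_hunk = True                        # hunk body follows
        elif in_hunk and current and line.startswith("+") and "Omni-new" in line:
            counts[current] = counts.get(current, 0) + 1
    return counts


# Illustrative diff text, not taken from the repository history.
SAMPLE_DIFF = "\n".join([
    "diff --git a/vllm_omni/worker/gpu_ar_model_runner.py b/vllm_omni/worker/gpu_ar_model_runner.py",
    "--- a/vllm_omni/worker/gpu_ar_model_runner.py",
    "+++ b/vllm_omni/worker/gpu_ar_model_runner.py",
    "@@ -10,3 +10,6 @@ def sample_tokens(self):",
    "     x = 1",
    "+    # ---------- Omni-new ----------",
    "+    y = 2",
    "+    # ---------- Omni-new ----------",
    "     return x",
])

print(added_omni_markers(SAMPLE_DIFF))
```

Files with a non-zero count are the ones most likely to need a matching change in the NPU runners; edits inside existing blocks add no marker lines and still require reading the diff.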
Step 8: Validation
Run a syntax check (py_compile verifies that each file compiles; it does not check types):
cd /root/vllm-workspace/vllm-omni
python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py
Run import test:
python -c "from vllm_omni.platforms.npu.worker import *"
Run model serving test (if hardware available):
vllm serve <model-path> --trust-remote-code
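The `import *` check only loads whatever the package's `__init__.py` imports or lists in `__all__`, so a broken import inside one runner module could go unnoticed if that module is loaded lazily. A small sketch that imports each runner module by name and collects failures follows; the module paths are derived from the file structure above, and `import_failures` is a hypothetical helper.

```python
import importlib

# Module paths derived from the NPU file structure listed in this document.
RUNNER_MODULES = [
    "vllm_omni.platforms.npu.worker.npu_model_runner",
    "vllm_omni.platforms.npu.worker.npu_ar_model_runner",
    "vllm_omni.platforms.npu.worker.npu_generation_model_runner",
]


def import_failures(module_names: list[str]) -> dict[str, str]:
    """Try to import each module; return name -> error text for failures."""
    failures: dict[str, str] = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # report any import-time error, not only ImportError
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


if __name__ == "__main__":
    for name, error in import_failures(RUNNER_MODULES).items():
        print(f"{name}: {error}")
```

No output means all three modules imported cleanly in the current environment.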
Common Pitfalls
1. Forward Context Differences
- GPU uses `set_forward_context`
- NPU uses `set_ascend_forward_context`
- Parameters may differ slightly
2. Graph Wrapper Differences
- GPU: `CUDAGraphWrapper`
- NPU: `ACLGraphWrapper`
- Constructor parameters may differ
3. Buffer Creation
- GPU: `_make_buffer` returns a different structure
- NPU: May need numpy=True/False parameter
4. Attention Metadata
- GPU: Uses vllm attention metadata builders
- NPU: Uses `AscendCommonAttentionMetadata`
5. Sampling
- GPU: Uses vllm sampler
- NPU: Uses `AscendSampler`
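Pitfalls 1 and 2 both come down to parameter lists that differ between the GPU and NPU counterparts. Rather than reading both definitions by eye, `inspect.signature` can list the differences. In this sketch the two functions are stand-ins with invented parameters (`extra_flag` is hypothetical); in practice the real pair, such as `set_forward_context` and `set_ascend_forward_context`, would be passed in.

```python
import inspect


def parameter_diff(reference, candidate) -> dict[str, list[str]]:
    """Parameter names present in only one of the two callables."""
    ref = set(inspect.signature(reference).parameters)
    cand = set(inspect.signature(candidate).parameters)
    return {
        "only_in_reference": sorted(ref - cand),
        "only_in_candidate": sorted(cand - ref),
    }


# Stand-ins with made-up parameters; they do not reflect the real signatures.
def gpu_style_context(attn_metadata, config):
    ...


def npu_style_context(attn_metadata, config, extra_flag=False):
    ...


print(parameter_diff(gpu_style_context, npu_style_context))
```

A non-empty `only_in_candidate` list points at NPU-only arguments that omni code copied from the GPU side will not be passing yet.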
Checklist Before Commit
- All omni-specific comment markers preserved
- New omni logic from GPU side synced
- Imports updated to latest vllm-ascend
- No `CUDAGraphWrapper` references in NPU code
- `set_ascend_forward_context` used instead of `set_forward_context`
- `ACLGraphWrapper` used for talker_mtp wrapping
- Type hints match vllm-ascend signatures
- No duplicate code blocks
- Python syntax valid (py_compile passes)
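Two of the checklist items (no `CUDAGraphWrapper`, no `set_forward_context` in NPU code) can be checked mechanically. The sketch below walks the syntax tree instead of using `grep`, so mentions inside comments are ignored. The forbidden set is taken from the checklist; the function name and the sample source, including the placeholder module name, are assumptions.

```python
import ast

# Names the checklist says should not appear in NPU runner code.
FORBIDDEN_IN_NPU = {"CUDAGraphWrapper", "set_forward_context"}


def forbidden_references(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, name) pairs for imports or uses of forbidden names."""
    hits: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_IN_NPU:
            hits.append((node.lineno, node.id))
        elif isinstance(node, ast.Attribute) and node.attr in FORBIDDEN_IN_NPU:
            hits.append((node.lineno, node.attr))
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                leaf = alias.name.rsplit(".", 1)[-1]
                if leaf in FORBIDDEN_IN_NPU:
                    hits.append((node.lineno, leaf))
    return sorted(hits)


SAMPLE = (
    "from some_module import CUDAGraphWrapper  # placeholder module name\n"
    "def load_model(self):\n"
    "    # CUDAGraphWrapper mentioned in a comment is ignored\n"
    "    self.talker_mtp = CUDAGraphWrapper(self.talker_mtp)\n"
)

print(forbidden_references(SAMPLE))
```

Running it over the three NPU runner files and expecting an empty list for each covers those two items; the remaining checklist entries still need a manual pass.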
Reference Files for Comparison
When upgrading, keep these files open for reference:
- vllm-ascend NPUModelRunner: `/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py`
- vllm GPUModelRunner: `/root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py`
- vllm-omni OmniGPUModelRunner: `/root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py`