vllm-omni-npu-model-runner-upgrade - SKILL.md Agent Skill

name: vllm-omni-npu-model-runner-upgrade description: "Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic."

vLLM-Omni NPU Model Runner Upgrade Skill

Overview

This skill guides the process of upgrading vllm-omni's NPU model runners to align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners are designed to run omni multimodal models (like Qwen3-Omni, Bagel, MiMoAudio) on Ascend NPUs.

File Structure

NPU Model Runner Files

vllm-omni/vllm_omni/platforms/npu/worker/
├── __init__.py
├── npu_model_runner.py           # OmniNPUModelRunner (base class)
├── npu_ar_model_runner.py        # NPUARModelRunner (autoregressive)
├── npu_ar_worker.py              # AR worker
├── npu_generation_model_runner.py # NPUGenerationModelRunner (diffusion/non-AR)
└── npu_generation_worker.py      # Generation worker

GPU Reference Files (for omni-specific logic sync)

vllm-omni/vllm_omni/worker/
├── __init__.py
├── gpu_model_runner.py           # OmniGPUModelRunner
├── gpu_ar_model_runner.py        # GPUARModelRunner
├── gpu_ar_worker.py
├── gpu_generation_model_runner.py
├── gpu_generation_worker.py
├── mixins.py
├── base.py
└── gpu_memory_utils.py

vllm-ascend Reference Files

vllm-ascend/vllm_ascend/worker/
├── model_runner_v1.py            # NPUModelRunner (base class to copy from)
├── npu_input_batch.py
├── block_table.py
├── pcp_utils.py
└── worker.py

Inheritance Hierarchy

                    GPUModelRunner (vllm)
                         |
        +----------------+----------------+
        |                                 |
  OmniGPUModelRunner              NPUModelRunner (vllm-ascend)
  (vllm_omni/worker)              (vllm_ascend/worker)
        |                                 |
        +----------- OmniNPUModelRunner --+
                     (multiple inheritance)
                            |
            +---------------+---------------+
            |                               |
    NPUARModelRunner            NPUGenerationModelRunner
    (autoregressive)            (non-autoregressive/diffusion)

Omni-Specific Comment Markers

Omni-specific logic is marked with comment blocks:

# -------------------------------------- Omni-new -------------------------------------------------
# ... omni-specific code ...
# -------------------------------------- Omni-new -------------------------------------------------

Or simpler variations:

#  -------------------------------------- Omni-new -------------------------------------------------
#  ------------------------------------------------------------------------------------------------

Important:

Always preserve and add these markers when modifying code.
The reference documents (references/omni-specific-blocks.md) may not be up-to-date. Always grep for Omni-new in the GPU implementations to find the authoritative list of omni-specific blocks.
When you discover new omni-specific code that is not documented in the references, please update the reference files.

Key Methods Requiring Attention

OmniNPUModelRunner (npu_model_runner.py)

Method	Description	Omni-Specific Logic
`load_model`	Load model and initialize talker_mtp	Uses `ACLGraphWrapper` instead of `CUDAGraphWrapper`, initializes talker buffers
`_dummy_run`	Warmup/profiling run	talker_mtp dummy forward, `extract_multimodal_outputs`
`_model_forward`	Forward pass wrapper	Injects `model_kwargs_extra`, wraps with `OmniOutput`, NPU-specific graph updates
`_talker_mtp_forward`	Talker MTP forward for Qwen3-Omni	Uses `set_ascend_forward_context`

NPUARModelRunner (npu_ar_model_runner.py)

Method	Description	Omni-Specific Logic
`__init__`	Initialize with KV transfer manager	`OmniKVTransferManager` setup
`execute_model`	Main inference entry	KV transfer handling, `_update_states` override, `extract_multimodal_outputs`
`sample_tokens`	Token sampling	Hidden states extraction, multimodal outputs processing, `OmniModelRunnerOutput`
`_resolve_global_request_id`	Request ID resolution	For disaggregated inference

NPUGenerationModelRunner (npu_generation_model_runner.py)

Method	Description	Omni-Specific Logic
`_update_request_states`	Update request states for async chunk	async_chunk handling
`execute_model`	Generation forward	async_chunk, `seq_token_counts`, `_run_generation_model`
`sample_tokens`	Output processing	multimodal output packaging to `OmniModelRunnerOutput`
`_dummy_run`	Dummy run override	model_kwargs initialization, multimodal extraction
`_run_generation_model`	Run generation model	Calls `_model_forward` with sampler

Upgrade Workflow

Step 1: Preparation

Identify target versions(Use gh cli to check):
- We're using vllm-omni main branch
- Check the last release of vllm-omni
- Target vllm-ascend version(Just directly use the local latest vllm-ascend code)

Check GPU-side changes (since last release):

cd /root/vllm-workspace/vllm-omni
git log --oneline --since="<last-release-date>" -- vllm_omni/worker/

Read latest vllm-ascend code:
- We don't track vllm-ascend changes - just directly use the latest code from /root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py
- Copy the relevant methods and re-insert omni-specific blocks

Step 2: Analyze Omni-Specific Logic

For each NPU model runner file:

Extract existing omni-specific blocks:

grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py

Document each omni block:
- Which method it belongs to
- What functionality it provides
- Dependencies on other omni code

Step 3: Update Base Class (OmniNPUModelRunner)

Note: Always check the GPU implementation gpu_model_runner.py for any new omni logic not yet documented in references.

Read the latest vllm-ascend NPUModelRunner.load_model
Copy the method, keeping the structure
Re-insert omni-specific logic (check GPU gpu_model_runner.py for authoritative list):
- Replace CUDAGraphWrapper with ACLGraphWrapper
- Keep talker_mtp initialization
- Preserve buffer allocations for talker
- Check for any new omni blocks added since last sync
Update _dummy_run:
- Copy from vllm-ascend
- Compare with GPU _dummy_run for omni-specific blocks
- Re-insert all Omni-new marked code from GPU version
Update _model_forward:
- Keep the omni wrapper logic
- Update NPU-specific parts (graph params, SP all-gather)
- Check GPU version for any new omni logic

Step 4: Update AR Model Runner

Compare with GPU gpu_ar_model_runner.py for any new omni features
Copy execute_model from vllm-ascend
Re-insert omni blocks (reference references/omni-specific-blocks.md, but note it may be incomplete):
- IMPORTANT: Always check the GPU implementation gpu_ar_model_runner.py for all Omni-new marked code blocks
- The reference doc may not include newly added omni logic - treat it as a starting point, not exhaustive
- When discovering new omni code blocks, please update references/omni-specific-blocks.md
- Common omni blocks include but are not limited to: KV transfer, multimodal outputs, sampling_metadata handling, etc.
Update sample_tokens (also compare with GPU implementation):
- Compare with gpu_ar_model_runner.py's sample_tokens method
- Identify all Omni-new marked code blocks
- Ensure NPU version includes all omni-specific logic

Step 5: Update Generation Model Runner

Note: Generation model runner may have unique omni logic for diffusion/non-AR models.

Compare with GPU gpu_generation_model_runner.py - grep for all Omni-new blocks
Update execute_model:
- Check GPU version for all omni-specific blocks
- Keep async_chunk handling
- Keep seq_token_counts injection
- Update forward/context setup from vllm-ascend
- Look for any new omni logic not documented in references
Update _dummy_run:
- Copy from vllm-ascend base
- Compare with GPU _dummy_run if exists
- Re-insert all omni-specific logic

Step 6: Update Imports

Check and update imports at the top of each file:

# Common vllm-ascend imports
from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context
from vllm_ascend.attention.attention_v1 import AscendAttentionState
from vllm_ascend.attention.utils import using_paged_attention
from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params
from vllm_ascend.ops.rotary_embedding import update_cos_sin
from vllm_ascend.utils import enable_sp, lmhead_tp_enable
from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner

# Omni-specific imports
from vllm_omni.model_executor.models.output_templates import OmniOutput
from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner
from vllm_omni.outputs import OmniModelRunnerOutput
from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager

Step 7: Sync GPU-Side Omni Changes

Check recent GPU worker changes:

git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_model_runner.py
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_ar_model_runner.py

Identify new omni features that need to be ported to NPU
Apply corresponding changes to NPU runners

Step 8: Validation

Run type checking:

cd /root/vllm-workspace/vllm-omni
python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py

Run import test:

python -c "from vllm_omni.platforms.npu.worker import *"

Run model serving test (if hardware available):
```
vllm serve <model-path> --trust-remote-code
```

Common Pitfalls

1. Forward Context Differences

GPU uses set_forward_context
NPU uses set_ascend_forward_context
Parameters may differ slightly

2. Graph Wrapper Differences

GPU: CUDAGraphWrapper
NPU: ACLGraphWrapper
Constructor parameters may differ

3. Buffer Creation

GPU: _make_buffer returns different structure
NPU: May need numpy=True/False parameter

4. Attention Metadata

GPU: Uses vllm attention metadata builders
NPU: Uses AscendCommonAttentionMetadata

5. Sampling

GPU: Uses vllm sampler
NPU: Uses AscendSampler

Checklist Before Commit

All omni-specific comment markers preserved
New omni logic from GPU side synced
Imports updated to latest vllm-ascend
No CUDAGraphWrapper references in NPU code
set_ascend_forward_context used instead of set_forward_context
ACLGraphWrapper used for talker_mtp wrapping
Type hints match vllm-ascend signatures
No duplicate code blocks
Python syntax valid (py_compile passes)

Reference Files for Comparison

When upgrading, keep these files open for reference:

vllm-ascend NPUModelRunner: /root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py
vllm GPUModelRunner: /root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py
vllm-omni OmniGPUModelRunner: /root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py