name: veomni-new-op
description: "Use this skill when adding a new optimized kernel or operator to veomni/ops/. Covers the full lifecycle: understanding VeOmni's ops architecture (KERNEL_REGISTRY + OpSlot dispatch, with a thin function-pointer shim for a few legacy global ops), implementing the kernel, registering it, adding tests, and documenting it. Trigger: 'add op', 'new kernel', 'add attention variant', 'new fused op', 'add triton kernel', 'optimize operator'."
Before You Start
- Read .agents/knowledge/constraints.md, especially the rules about NPU guards (#19, #20).
- Read docs/design/kernel_selection.md and docs/design/unified_kernel_registry.md to understand the kernel lifecycle, the KERNEL_REGISTRY, and OpSlot dispatch.
- Familiarize yourself with the ops architecture below.
VeOmni Ops Architecture
Most VeOmni ops in v5 are registry-driven: a kernel registers itself in
veomni.ops.kernel_registry.KERNEL_REGISTRY and is dispatched at model-build
time through OpSlot instances declared in the patchgen-generated modeling
files (see veomni/ops/dispatch.py and _bind_veomni_ops() in
veomni/models/auto.py).
veomni/ops/
├── __init__.py # apply_ops_patch / apply_ops_config entry points
├── kernel_registry.py # KERNEL_REGISTRY (the single source of truth)
├── dispatch.py # OpSlot + binding helpers
├── config/ # OpsImplementationConfig + per-op registry helpers
├── kernels/ # all registry-driven kernels
│ ├── attention/ # FA2/3/4 + sequence-parallel wrappers
│ ├── cross_entropy/ # eager + liger fused CE
│ ├── load_balancing_loss/
│ ├── moe/ # fused MoE (group_gemm / quack / npu_group_gemm)
│ ├── rms_norm/ # eager / liger / batch-invariant
│ ├── rotary/ # default / triton-deterministic
│ ├── swiglu/ # eager / liger
│ └── gated_delta_rule/
├── batch_invariant_ops/ # ATen-level interception for bitwise determinism
├── liger/ # Liger kernel adapters
└── platform/ # NPU-specific helpers
Two complementary mechanisms coexist:
1. KERNEL_REGISTRY + OpSlot (preferred for new ops). Each kernel registers itself under a (slot_name, variant) pair (e.g. ("cross_entropy_loss", "causal"), ("moe_experts", "standard")). Patchgen-generated modeling code declares matching OpSlot instances; at model-build time _bind_veomni_ops() walks the generated module, finds each OpSlot, and binds it to the concrete registry entry chosen by OpsImplementationConfig (config/registry.py).
2. Legacy global function pointer shim (kept for a few global ops that are dispatched outside generated modeling). Public-API functions like fused_moe_forward and load_balancing_loss still expose a thin pointer that is rebound by apply_ops_config() so call sites in non-patchgen code (DeepSeek MLA inference paths, NPU custom forwards) can keep importing the public name without going through an OpSlot.
Pick mechanism 1 for any kernel that lives inside a patchgen-generated modeling file. Use mechanism 2 only when the kernel must be callable from unpatched (or non-Transformers) Python code.
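To make the registry-plus-slot idea concrete, here is a minimal, self-contained toy model of it. It does not use VeOmni's code: ToyKernelRegistry, ToyOpSlot, bind_slots, and the "scale" slot are all hypothetical names invented for illustration, and the real KERNEL_REGISTRY, OpSlot, and _bind_veomni_ops() may differ in signatures and behavior.

```python
from typing import Callable, Dict, Optional, Tuple


class ToyKernelRegistry:
    """Hypothetical stand-in for KERNEL_REGISTRY: maps (slot, variant) to a callable."""

    def __init__(self) -> None:
        self._kernels: Dict[Tuple[str, str], Callable] = {}

    def register(self, slot: str, variant: str) -> Callable[[Callable], Callable]:
        def decorator(fn: Callable) -> Callable:
            self._kernels[(slot, variant)] = fn
            return fn

        return decorator

    def resolve(self, slot: str, variant: str) -> Callable:
        if (slot, variant) not in self._kernels:
            raise KeyError(f"no kernel registered for slot={slot!r}, variant={variant!r}")
        return self._kernels[(slot, variant)]


class ToyOpSlot:
    """Hypothetical stand-in for OpSlot: a named placeholder bound after construction."""

    def __init__(self, slot: str, default_variant: str) -> None:
        self.slot = slot
        self.default_variant = default_variant
        self._impl: Optional[Callable] = None

    def bind(self, registry: ToyKernelRegistry, variant: Optional[str] = None) -> None:
        # Fall back to the slot's default variant when no choice is given.
        self._impl = registry.resolve(self.slot, variant or self.default_variant)

    def __call__(self, *args, **kwargs):
        if self._impl is None:
            raise RuntimeError(f"slot {self.slot!r} was called before binding")
        return self._impl(*args, **kwargs)


TOY_REGISTRY = ToyKernelRegistry()


@TOY_REGISTRY.register(slot="scale", variant="eager")
def scale_eager(values, factor):
    return [v * factor for v in values]


@TOY_REGISTRY.register(slot="scale", variant="fused")
def scale_fused(values, factor):
    # Pretend this is an optimized kernel: same signature, same result.
    return [factor * v for v in values]


def bind_slots(namespace: dict, registry: ToyKernelRegistry, choices: Dict[str, str]) -> None:
    """Bind every ToyOpSlot found in `namespace` to the variant named in `choices`."""
    for obj in namespace.values():
        if isinstance(obj, ToyOpSlot):
            obj.bind(registry, choices.get(obj.slot))


# A dict plays the role of a generated modeling module that declares one slot.
generated_module = {"veomni_scale": ToyOpSlot("scale", "eager")}
bind_slots(generated_module, TOY_REGISTRY, {"scale": "fused"})
result = generated_module["veomni_scale"]([1, 2, 3], 2)
```

The point of the sketch is that call sites only ever reference the slot; which kernel runs is decided once, at bind time, by a config-like mapping rather than by mutating a global.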
Phase 1: Design
Determine op category:
- Registry-driven kernel (the common case, used inside patchgen-generated modeling): register under a (slot_name, variant) in KERNEL_REGISTRY and add a matching OpSlot in the relevant <model>_patch_gen_config.py. No global mutation; selection is driven by OpsImplementationConfig.
- Global op with public API (e.g. fused_moe_forward, load_balancing_loss): expose a public function in veomni/ops/__init__.py and rebind it from apply_ops_config() based on the active OpsImplementationConfig. Only use this when a non-patchgen call site (NPU MLA forward, manual inference scripts, etc.) needs to import the kernel directly.
- Library op (no dispatch, called directly by model code): just create the module, no registry entry needed.
- NPU variant: add alongside the GPU implementation behind an is_torch_npu_available() guard.
Decide selection mechanism: read docs/design/kernel_selection.md and docs/design/unified_kernel_registry.md to determine if you need:
- Config field in OpsImplementationConfig (veomni/arguments/arguments_types.py)
- Environment variable
- Both
Determine binding timing:
- Model build time (default): registry entries are resolved by _bind_veomni_ops() in veomni/models/auto.py when a model is constructed. New kernels just need to register themselves at import time.
- apply_ops_config() time: legacy global ops (rebound function pointers) are wired in veomni/ops/__init__.py::apply_ops_config(ops_config).
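The rebound-function-pointer pattern for legacy global ops can be sketched as follows. This is one common way to build such a shim, not a copy of VeOmni's apply_ops_config(): toy_norm, toy_apply_ops_config, toy_active_variant, and both variants are hypothetical names, and the real shim may be structured differently.

```python
import math
from typing import Callable, Dict, Optional, Sequence


def _norm_eager(values: Sequence[float]) -> float:
    """Reference variant: explicit sum of squares."""
    return math.sqrt(sum(v * v for v in values))


def _norm_hypot(values: Sequence[float]) -> float:
    """Alternative variant with the same signature, built on math.hypot."""
    return math.hypot(*values)


_VARIANTS: Dict[str, Callable[[Sequence[float]], float]] = {
    "eager": _norm_eager,
    "hypot": _norm_hypot,
}

# Private pointer; only toy_apply_ops_config() is expected to rebind it.
_active: Optional[Callable[[Sequence[float]], float]] = None


def toy_norm(values: Sequence[float]) -> float:
    """Stable public name that delegates to whichever variant is currently bound."""
    if _active is None:
        raise RuntimeError("toy_norm called before toy_apply_ops_config()")
    return _active(values)


def toy_apply_ops_config(variant: str) -> None:
    """Rebind the private pointer to the requested variant."""
    global _active
    if variant not in _VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(_VARIANTS)}")
    _active = _VARIANTS[variant]


def toy_active_variant() -> Optional[str]:
    """Name of the bound implementation, or None when nothing is bound yet."""
    return None if _active is None else _active.__name__


# A caller that grabbed the public name early still sees later rebinds,
# because the lookup of the private pointer happens at call time.
imported_early = toy_norm
toy_apply_ops_config("hypot")
length = imported_early([3.0, 4.0])
```

Delegating through a stable wrapper, instead of reassigning the public name itself, matters because `from module import name` copies the binding at import time; a reassigned module attribute would not reach callers that imported earlier.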
Phase 2: Implement
Create the op directory under veomni/ops/kernels/<op_name>/.

Implement each kernel variant in its own file (e.g. triton_kernel.py, eager.py, npu_kernel.py). Each variant declares a concrete function with the kernel's canonical signature.

Register the kernel in veomni/ops/kernels/<op_name>/__init__.py:

    from veomni.ops.kernel_registry import KERNEL_REGISTRY

    from .eager import my_op_eager
    from .triton_kernel import my_op_triton

    KERNEL_REGISTRY.register(slot="my_op", variant="eager")(my_op_eager)
    KERNEL_REGISTRY.register(slot="my_op", variant="triton")(my_op_triton)

Then declare a matching OpSlot in the patchgen config of every model that uses it:

    from veomni.ops.dispatch import OpSlot

    veomni_my_op = OpSlot("my_op", "eager")  # default variant

_bind_veomni_ops() will swap this for the registry entry selected by OpsImplementationConfig.

Wire the config field (if the user needs to choose a variant):
- Add a field to OpsImplementationConfig in veomni/arguments/arguments_types.py.
- In veomni/ops/config/registry.py, map the new config field to the (slot, variant) tuple consumed by _bind_veomni_ops().

For legacy global ops (only when needed): add the public function to veomni/ops/__init__.py and rebind it from apply_ops_config(ops_config).

NPU support:
- Always guard NPU imports with is_torch_npu_available().
- Put NPU implementations in a separate file (e.g., npu_kernel.py).
- Register the NPU variant under the same slot with a distinct variant name.
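The NPU rules above amount to "register the backend-specific variant only when the backend is present". A rough, self-contained sketch of that shape follows. It rests on several assumptions: npu_available() is a hypothetical stand-in for is_torch_npu_available() that only checks whether a torch_npu package can be found, TOY_REGISTRY is a plain dict rather than KERNEL_REGISTRY, and my_op_npu is a placeholder, not a real NPU kernel.

```python
import importlib.util
from typing import Callable, Dict, List, Tuple

# Plain dict used as a toy registry keyed by (slot, variant).
TOY_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def npu_available() -> bool:
    """Assumed stand-in for is_torch_npu_available(): checks only that torch_npu is findable."""
    return importlib.util.find_spec("torch_npu") is not None


def my_op_eager(values: List[float]) -> List[float]:
    """Portable reference variant (a ReLU, chosen only to have something to run)."""
    return [max(v, 0.0) for v in values]


# The portable variant is always registered.
TOY_REGISTRY[("my_op", "eager")] = my_op_eager

if npu_available():
    # In a real package, the import of the separate NPU file (e.g. npu_kernel.py)
    # would sit inside this guard so GPU-only environments never execute it.
    def my_op_npu(values: List[float]) -> List[float]:
        # Placeholder body: reuses the eager math instead of calling an NPU kernel.
        return my_op_eager(values)

    # Same slot, distinct variant name.
    TOY_REGISTRY[("my_op", "npu")] = my_op_npu


def variants_for(slot: str) -> List[str]:
    """List the variant names registered for one slot, sorted for stable output."""
    return sorted(variant for (name, variant) in TOY_REGISTRY if name == slot)


registered = variants_for("my_op")
```

On a machine without torch_npu, only the eager variant appears in `registered`, and nothing NPU-specific is ever imported.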
Phase 3: Test
Add unit tests to tests/ops/:
- Test correctness: compare output against a reference implementation (eager PyTorch)
- Test numerical precision: verify tolerance for bf16/fp16
- Test edge cases: empty inputs, single-element tensors, extreme shapes

Add benchmark (optional but recommended for performance-critical ops):
- Use veomni/ops/group_gemm/utils/benchmark_utils.py as reference
- Compare against baseline implementation

Run:

    pytest tests/ops/ -v
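The three kinds of checks listed above (correctness against a reference, reduced-precision tolerance, edge cases) can be illustrated without any GPU dependency. The sketch below is a stand-in, not a VeOmni test: it uses plain Python lists instead of tensors, simulates fp16 rounding with the standard library's half-precision struct format, and every function name and tolerance value is a hypothetical choice. A real test in tests/ops/ would presumably compare the new kernel against eager PyTorch.

```python
import math
import struct
from typing import List, Sequence


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rms_norm_reference(x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Full-precision reference: x / sqrt(mean(x^2) + eps)."""
    if not x:
        return []
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def rms_norm_low_precision(x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Toy 'kernel under test': same math, with inputs and outputs rounded to fp16."""
    if not x:
        return []
    x16 = [to_fp16(v) for v in x]
    scale = 1.0 / math.sqrt(sum(v * v for v in x16) / len(x16) + eps)
    return [to_fp16(v * scale) for v in x16]


def assert_close(actual: Sequence[float], expected: Sequence[float], rtol: float, atol: float) -> None:
    """Element-wise comparison with both a relative and an absolute tolerance."""
    assert len(actual) == len(expected), "length mismatch"
    for i, (a, e) in enumerate(zip(actual, expected)):
        assert math.isclose(a, e, rel_tol=rtol, abs_tol=atol), f"index {i}: {a} vs {e}"


def test_matches_reference_within_fp16_tolerance() -> None:
    x = [0.1 * i - 1.0 for i in range(32)]
    # Tolerances are illustrative; fp16 carries roughly three decimal digits.
    assert_close(rms_norm_low_precision(x), rms_norm_reference(x), rtol=5e-3, atol=1e-3)


def test_edge_cases() -> None:
    assert rms_norm_low_precision([]) == []
    assert_close(rms_norm_low_precision([3.0]), rms_norm_reference([3.0]), rtol=5e-3, atol=1e-3)
```

pytest would collect the two `test_` functions automatically; the absolute tolerance is there so that elements near zero do not fail a purely relative comparison.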
Phase 4: Document
Update docs/design/kernel_selection.md:
- Add the new op to the Quick Reference table
- Describe the selection mechanism

Update .agents/knowledge/architecture.md if the op adds a new subdirectory to veomni/ops/.
Phase 5: Finalize
- Run the /veomni-review skill.
- Run make quality.
- Verify the new variant shows up in KERNEL_REGISTRY.dump() and that the relevant OpSlot is rebound after build_foundation_model.
Common Pitfalls
- Forgetting to register in KERNEL_REGISTRY: the variant is invisible to _bind_veomni_ops() and OpSlot will fall through to its default, so you'll silently exercise the wrong kernel.
- Forgetting to add the matching OpSlot to the patchgen config: registering a kernel alone has no effect; generated modeling code must declare an OpSlot for it to be picked up.
- Unconditional NPU imports: importing NPU modules without an is_torch_npu_available() guard crashes on GPU-only environments.
- Binding at wrong time: registry entries are resolved when build_foundation_model runs _bind_veomni_ops(). Kernels that depend on per-model config must be picked at that point, not at module-import time.
- Sequence parallel interaction: ops that touch attention or loss must handle sequence parallel correctly; use get_parallel_state().sp_enabled to check and dispatch.
- Mixed precision: fused kernels often require specific dtypes (bf16/fp16). Add assertions at the public API level to catch dtype mismatches early.
- Not exporting public APIs: if the op provides a public function (legacy global ops), export it from veomni/ops/__init__.py's __all__.