name: develop-feature description: Primus-Turbo feature development workflow — the layered architecture (ops / kernels-dispatcher / Triton / HIP-CK csrc / modules), how to wire a new operator end-to-end, which layer to touch, and which existing file to copy. Use when adding or changing a Primus-Turbo operator or module on AMD GPUs.
Developing a Feature in Primus-Turbo
Read the hub ../SKILL.md first for the architecture diagram and code map. This file describes the workflow; instead of templates it points to the canonical file to copy for each layer.
Decide which layers to touch
1. Existing op close to mine? → copy its vertical slice and adapt (fastest path).
2. Language? Triton → Python only, NO rebuild. HIP/CK → C++ in csrc/, rebuild.
3. Backends? one → call the kernel directly. several → add a kernels/ dispatcher.
4. Autograd? implement forward AND backward in a torch.autograd.Function.
5. Surface? expose a function in ops/, optionally a nn.Module in modules/.
Minimum viable op = ops/ Function + one kernel. Add the dispatcher and module only when justified.
Layer 1 — ops/ (Python API + autograd)
A torch.autograd.Function (forward/backward) wrapped by a thin user function, exported from primus_turbo/pytorch/ops/__init__.py. Copy the closest existing op:
| Your case | Copy from |
|---|---|
| Simple 2-input op | ops/gemm.py |
| Multi-backend with quant variants | ops/gemm_fp8.py |
| Grouped / variable-shape | ops/grouped_gemm.py |
| Direct C++ op (no dispatcher) | ops/normalization.py |
| Direct Triton op (no dispatcher) | ops/activation.py |
Conventions visible in those files: validate dims early; infer out_dtype with torch.result_type when None; carry forward intermediates via ctx.save_for_backward; return one grad per forward arg (None for non-tensors); grad_out = grad_out.contiguous() before feeding kernels that need contiguity. For FP8/FP4, the config.granularity selects which autograd.Function variant runs (see ops/gemm_fp8.py).
Layer 2 — kernels/ dispatcher (multi-backend ops only)
Add a dispatcher when more than one backend can serve the op. Copy kernels/gemm/gemm_impl.py, which shows the whole pattern:
- one
KernelBackendsubclass per backend, each withcan_handle+execute; - an
AutoKernelDispatchersubclass declaring_backends(BackendType → BackendEntry) andmake_key; - the
*_implentry wrapped as@torch.library.custom_op(..., device_types="cuda")with a@*_impl.register_faketwin, callingDispatcher.dispatch(default_backend, user_backend, **kwargs).
Rules: can_handle must return False, never raise, for unsupported inputs (the dispatcher uses it for fallback). make_key must capture everything that changes the best kernel (shapes, dtypes, layout, key scalars) but never id(tensor). register_fake must mirror the real output shape/dtype. To add a new backend env var / global setter, extend core/backend.py and common/constants.py following the _gemm_backend / ENV_GEMM_BACKEND pattern.
Layer 3a — Triton kernel (triton/, no rebuild)
Put the @triton.jit kernel under primus_turbo/triton/<area>/. Bind it in the kernels-layer launcher with torch._library.wrap_triton so it composes with torch.compile. Reference: kernels/attention/attention_triton_impl.py. Pushing autotune space / MFMA variant / tiles / pipelining toward the hardware limit is an optimization campaign — see ../optimize-handoff/SKILL.md.
Layer 3b — C++ / HIP kernel (csrc/, rebuild required)
Four edits to expose a HIP/CK kernel to PyTorch — trace any GEMM op end-to-end as the reference:
- Kernel under
csrc/kernels/<area>/→ compiles intolibprimus_turbo_kernels.so. Arch-specialized files use the*_gfx942.{cu,hip}/*_gfx950.{cu,hip}suffix (filtered byGPU_ARCHSinsetup.py). - Declare the host function and its
_metatwin incsrc/pytorch/extensions.h. - Register in
csrc/pytorch/bindings_pytorch.cpp: schema underTORCH_LIBRARY, real impl underTORCH_LIBRARY_IMPL(..., CUDA), shape inference underTORCH_LIBRARY_IMPL(..., Meta). - Call from Python via
torch.ops.primus_turbo_cpp_extension.<name>(seeops/normalization.py).
Rebuild after: GPU_ARCHS=<arch> pip install --no-build-isolation -e . -v. Public headers: csrc/include/primus_turbo/. CK template / pipeline tuning toward the hardware limit is an optimization campaign — see ../optimize-handoff/SKILL.md.
Layer 4 — modules/ (optional nn.Module)
Wrap the op for model code (hold nn.Parameters, reshape, call the op). Reference: modules/linear.py, modules/grouped_linear.py.
After implementing
- [ ] Correctness first → ../verify-accuracy/SKILL.md (add ref in tests/pytorch/ref/, test in tests/pytorch/ops/)
- [ ] Then performance → ../verify-performance/SKILL.md (add/extend benchmark/ops/training/bench_*.py)
- [ ] Focused test green, then `pytest tests/pytorch/ -n 8`
- [ ] Rebuilt if a C++ schema changed
- [ ] Push toward HW limit → ../optimize-handoff/SKILL.md