add-sgl-kernel

name: add-sgl-kernel
description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)

Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)

This SKILL is a step-by-step guide for adding a heavyweight CUDA/C++ kernel to sgl-kernel/.

Typical characteristics:

Depends on heavyweight components such as CUTLASS / FlashInfer / DeepGEMM / sgl-attn
Needs AOT build and distribution (wheel / torch extension), so build time, link flags, CUDA arch targets, and binary size matter
Exposed as a stable sgl_kernel API and used by higher-level code (including torch.compile)

Two rules of thumb (must follow)

Heavyweight kernels go to sgl-kernel. If it depends on CUTLASS/FlashInfer/DeepGEMM (or similarly heavy stacks), implement it in sgl-kernel/.
Lightweight kernels go to python/sglang/jit_kernel. If it is small, has few dependencies, and benefits from rapid iteration, implement it as a JIT kernel instead.

In addition, every new kernel must ship with:

Tests (pytest)
A benchmark script (triton.testing)

Goal

Add a new kernel end-to-end, including:

CUDA/C++ implementation
Torch library registration (m.def schema + m.impl dispatch)
Build system integration (CMake sources list)
Python-facing API
Correctness tests and performance benchmarks

Repository integration map

You will typically touch these files/areas:

Implementation: sgl-kernel/csrc/...
Public declarations: sgl-kernel/include/sgl_kernel_ops.h
Torch extension registration: sgl-kernel/csrc/common_extension.cc
Build: sgl-kernel/CMakeLists.txt (set(SOURCES ...))
Python API: sgl-kernel/python/sgl_kernel/... and sgl-kernel/python/sgl_kernel/__init__.py
Tests: sgl-kernel/tests/test_<op>.py
Benchmarks: sgl-kernel/benchmark/bench_<op>.py

Step 1: Implement the kernel in `csrc/`

Pick the right subdirectory:

csrc/elementwise/
csrc/gemm/
csrc/attention/
csrc/moe/

Implementation requirements:

Clearly define dtype/shape/stride/contiguity assumptions
If assumptions are violated, fail fast with a readable error (e.g. TORCH_CHECK(...))
After kernel launch, perform device error checking (follow existing project conventions)

Key points:

Prefer explicit validation over "it probably works".
If a kernel only works on certain architectures, make that restriction explicit (error/skip behavior).

Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`

Edit:

sgl-kernel/include/sgl_kernel_ops.h

Add your function declaration in the appropriate section.

Step 3: Register the op in `csrc/common_extension.cc` (schema + dispatch)

Edit:

sgl-kernel/csrc/common_extension.cc

Inside TORCH_LIBRARY_FRAGMENT(sgl_kernel, m):

Add m.def(...) with a schema.
Add m.impl(...) for CUDA dispatch.

Key points:

The schema is important for torch.compile and for consistent call signatures.
If the underlying C++ function takes native types (e.g. int, float) while the PyTorch bindings expect int64_t / double, use the project’s recommended shim approach (see sgl-kernel/README.md).

Step 4: Add the new source file to `CMakeLists.txt`

Edit:

sgl-kernel/CMakeLists.txt

Add your new .cu / .cc file to the set(SOURCES ...) list.

Key points:

Keep the list alphabetically sorted (the file explicitly requires this).
If your kernel has arch constraints, reflect that in tests/benchmarks via skip logic.
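
The alphabetical-order rule is easy to break when inserting a new file by hand. The following is a minimal sketch of a checker, not part of the repository: the helper names `extract_sources` and `first_unsorted` are hypothetical, the sample list and the `my_gemm.cu` entry are illustrative, and plain string comparison is an assumption about what "sorted" means for this file.

```python
import re


def extract_sources(cmake_text):
    """Return the entries of the first set(SOURCES ...) block, in file order."""
    match = re.search(r"set\(\s*SOURCES\b(.*?)\)", cmake_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no set(SOURCES ...) block found")
    entries = []
    for line in match.group(1).splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing CMake comments
        entries.extend(token.strip('"') for token in line.split())
    return entries


def first_unsorted(entries):
    """Return the first entry that sorts before its predecessor, or None."""
    for prev, cur in zip(entries, entries[1:]):
        if cur < prev:
            return cur
    return None


# Illustrative input only; the real list lives in sgl-kernel/CMakeLists.txt.
SAMPLE = '''
set(SOURCES
    "csrc/elementwise/activation.cu"
    "csrc/gemm/my_gemm.cu"  # hypothetical new kernel, inserted in the wrong place
    "csrc/elementwise/rope.cu"
)
'''

entries = extract_sources(SAMPLE)
print(first_unsorted(entries))  # csrc/elementwise/rope.cu
```

To check the real file, read `sgl-kernel/CMakeLists.txt` and pass its text to `extract_sources`. The regex stops at the first closing parenthesis, so it would need adjusting if the list ever contains nested parentheses.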

Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`

Goal: users can call sgl_kernel.<op>(...).

Add/extend a Python wrapper under sgl-kernel/python/sgl_kernel/ (follow existing module organization).
Export it from sgl-kernel/python/sgl_kernel/__init__.py.

Step 6: Write tests (required)

Create:

sgl-kernel/tests/test_<op>.py

Minimum coverage:

Shapes: typical + edge cases
Dtypes: whatever the kernel claims to support
Correctness: compare with a reference implementation (PyTorch / FlashInfer / another stable backend)
Negative cases: unsupported dtype/shape/arch should either raise a clear error or be explicitly skipped

Skipping by architecture:

Use @pytest.mark.skipif(..., reason="...") when compute capability requirements apply.
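
One way to keep the skip condition and its reason string consistent is to compute both from a single helper. This is a sketch under assumptions: `capability_skip_reason` is a hypothetical name, and the sm90 threshold is only an example. The helper is pure Python so it can be reasoned about without a GPU; the comments show how it might be wired to `pytest.mark.skipif`.

```python
def capability_skip_reason(capability, minimum):
    """Return a skip reason if `capability` is below `minimum`, else None.

    `capability` is a (major, minor) tuple, or None when no CUDA device exists.
    """
    if capability is None:
        return "CUDA device not available"
    if tuple(capability) < tuple(minimum):
        return "requires sm{}{}+, found sm{}{}".format(*minimum, *capability)
    return None


# Possible wiring in a test module (needs torch and pytest installed):
#
#   cap = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
#   reason = capability_skip_reason(cap, (9, 0))
#
#   @pytest.mark.skipif(reason is not None, reason=reason or "")
#   def test_my_op(): ...

print(capability_skip_reason((8, 0), (9, 0)))  # requires sm90+, found sm80
```

Tuple comparison handles the major/minor ordering, so (8, 9) correctly sorts below (9, 0).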

Run:

pytest sgl-kernel/tests/test_<op>.py -q

Step 7: Add a benchmark (required)

Create:

sgl-kernel/benchmark/bench_<op>.py

Follow the repository convention:

Use triton.testing.Benchmark + triton.testing.perf_report
Prefer triton.testing.do_bench_cudagraph for timing

Minimum benchmark requirements:

At least two providers/variants:
- Your sgl_kernel implementation
- A baseline (PyTorch / torch.compile / Triton / FlashInfer)
Quantile output (median/min/max)
CI-friendly ranges, selected via the CI / GITHUB_ACTIONS environment variables
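
The CI-friendly range requirement can be sketched as a small switch on the environment. The function names and the concrete sizes below are hypothetical, and treating the string "true" as the enabled value is an assumption (GitHub Actions does set both variables to "true"). The resulting lists would typically be passed as `x_vals` to `triton.testing.Benchmark`.

```python
import os


def is_ci(environ=None):
    """True when CI or GITHUB_ACTIONS is set to "true" (case-insensitive)."""
    environ = os.environ if environ is None else environ
    return any(
        environ.get(name, "false").lower() == "true"
        for name in ("CI", "GITHUB_ACTIONS")
    )


def benchmark_ranges(environ=None):
    """Pick a small sweep for CI and a fuller sweep for local runs."""
    if is_ci(environ):
        # Keep CI runs short: a couple of points are enough to catch breakage.
        return {"batch_size": [1, 16], "hidden_size": [1024]}
    return {"batch_size": [1, 16, 64, 256], "hidden_size": [1024, 4096, 8192]}


ranges = benchmark_ranges()
print(ranges["batch_size"], ranges["hidden_size"])
```

Accepting an optional `environ` mapping keeps the selection logic testable without mutating the real process environment.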

Run:

python sgl-kernel/benchmark/bench_<op>.py

Step 8: Build and validate

Build:

cd sgl-kernel
make build -j16

If you need to limit host resource usage:

cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"

Validate:

Tests: pytest sgl-kernel/tests/test_<op>.py -q
Benchmark: python sgl-kernel/benchmark/bench_<op>.py

Troubleshooting

Async CUDA errors: CUDA_LAUNCH_BLOCKING=1
Memory errors: compute-sanitizer --tool memcheck python ...
Build is too slow / OOM: reduce MAX_JOBS and SGL_KERNEL_COMPILE_THREADS
Binary bloat: use sgl-kernel/analyze_whl_kernel_sizes.py
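
The first two troubleshooting items can be combined into a repeatable debug run. The sketch below only builds the environment and command line; the helper names are hypothetical, and it assumes `compute-sanitizer` is on PATH.

```python
import os
import sys


def blocking_env(base=None):
    """Copy the environment and force synchronous CUDA launches."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def memcheck_command(python_args):
    """Wrap a Python invocation in compute-sanitizer's memcheck tool."""
    return ["compute-sanitizer", "--tool", "memcheck", sys.executable, *python_args]


# Example: run one test file under memcheck with blocking launches.
cmd = memcheck_command(["-m", "pytest", "sgl-kernel/tests/test_<op>.py", "-q"])
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, env=blocking_env(), check=False)
```

Blocking launches make an asynchronous error surface at the call that caused it, at the cost of much slower execution, so this setup is for debugging only.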

References

sgl-kernel/README.md
sgl-kernel/include/sgl_kernel_ops.h
sgl-kernel/csrc/common_extension.cc
sgl-kernel/CMakeLists.txt
