name: blockptr-to-tdesc description: >- Translate a Triton kernel from the deprecated block-pointer API (tl.make_block_ptr / tl.advance / tl.load(boundary_check=...)) into an equivalent kernel using the modern device-side tensor-descriptor API (tl.make_tensor_descriptor / desc.load / desc.store) for the Intel XPU backend. Use this skill whenever the user wants to migrate, convert, translate, port, modernize, or "update" a kernel from block pointers to tensor descriptors; whenever they mention tl.make_block_ptr or tl.advance and ask for a modern/non-deprecated equivalent; whenever they ask how to use tensor descriptors in a kernel that currently uses block pointers; or when they paste a kernel using block pointers and ask how to speed it up or make it use DPAS / 2D block I/O on Intel GPU (PVC/BMG). Produce the descriptor form the XPU backend can lower efficiently, not just any form that compiles.
Block Pointer → Tensor Descriptor (Intel XPU)
Translate kernels from the deprecated block-pointer API to the device-side
tensor-descriptor API. tl.make_block_ptr is deprecated; its docstring points
users to tl.make_tensor_descriptor. On Intel XPU, a well-formed descriptor
lowers to the same hardware 2D-block-I/O path the block pointer used, so a
careful translation is at worst neutral and removes the deprecation.
Prime directive: translate toward an efficiently-lowerable descriptor
Do not emit the first descriptor that compiles. There are usually several faithful ways to write a descriptor, and only some keep the fast 2D-block-I/O path on XPU. Always aim for a descriptor that is 2D, last-stride-1, locally-created, dot-feeding, PAD_ZERO. When that is achievable (the common GEMM/attention cases), produce it. When a particular load genuinely cannot be put on the fast path, still translate it (the user wants the modern API) but explicitly flag the caveat in the "Changes made" section so nobody is misled into thinking every load got block I/O.
See references/performance-notes.md for why (the verified backend gates) and
references/api-reference.md for exact signatures and constraints. See
references/examples.md for verbatim before→after pairs you can pattern-match
against.
Workflow
- Read the kernel (inline or from the given file path). Identify every
block-pointer construct:
tl.make_block_ptr,tl.advance, and thetl.load/tl.storecalls that consume block pointers. Also note any interleavedtl.load(ptrs, mask=...)tensor-of-pointer loads (Rule 9) and the rank of each block pointer (Rule 11). - Apply the transformation rules below, honoring the prime directive.
- Emit the output in the format specified at the bottom.
This is XPU-only. Never add triton.set_allocator or any host-launch change
— device-side descriptors need no global-memory workspace on XPU (that hook is
NVIDIA-TMA-specific). The migration is a pure kernel-body rewrite unless Rule 12
applies.
Transformation rules
tl.make_block_ptr(base, shape, strides, offsets, block_shape, order)→tl.make_tensor_descriptor(base, shape, strides, block_shape). Dropoffsetsandorderfrom the constructor. The initialoffsetsvalues become the starting load/store offsets (Rules 2 & 7). Hoist the descriptor to the same scope as the oldmake_block_ptr— it is loop-invariant, create it once before the loop.tl.advance(ptr, delta)→ integer offset increment. Delete thetl.advance. Introduce a running integer for each advancing dimension, initialized to that dim's initial offset (off_k = 0typically), and add the delta at the loop tail (off_k += BLOCK_K). Pass the running offset intodesc.load([...])/desc.store([...]).tl.load(ptr, boundary_check=...)→desc.load([d0, d1]). Dropboundary_check— the descriptor'sshapealready masks out-of-bounds elements to the pad value, which is exactly whatboundary_checkdid. Padding caveat:padding_optionis set at descriptor-creation time, not at load time. If the original load usedpadding_option="nan", carry it onto the constructor:tl.make_tensor_descriptor(..., padding_option="nan")."zero"/""is the default and is dropped.tl.store(ptr, val, boundary_check=...)→desc.store([d0, d1], val). Dropboundary_check; out-of-bounds writes are ignored automatically. Store has nopadding_option.Block-pointer loads/stores never carry
mask=/other=(the API forbids it). Their only masking isboundary_check, handled by Rules 3–4. Amask=on the input only appears on legacy tensor-of-pointer loads — see Rule 9. (The skill itself only emits amask=tensor-of-pointer load in the single sanctioned case of a non-unit-stride rank-1 block pointer that has no legal descriptor — see Rule 11.)Transpose (
order=(0, 1)in the block pointer). Do NOT emit a descriptor whose last stride ≠ 1 — that drops off the fast path. Instead describe the tensor in its native row-major layout (so the last stride is 1) and apply.Tto the loaded block. Concretely, if B is stored(N, K)row-major but the dot wants(K, N):b_desc = tl.make_tensor_descriptor(base=b_ptr, shape=(N, K), strides=(stride_bn, stride_bk), block_shape=(BLOCK_N, BLOCK_K))thenb = b_desc.load([pid_n*BLOCK_N, off_k]).T.Static offsets (a dim that never advances, e.g.
pid_m * BLOCK_M) → pass the expression directly in.load()/.store(); no running variable needed.Batched / 3D → pre-add the batch offset to
base(base = a_ptr + bid.to(tl.int64) * stride_az) and keep the descriptor 2D over the M×K (or K×N) slice. This is the rank-3 special case of Rule 11.Masked tensor-of-pointer loads (
tl.load(ptrs, mask=..., other=...), not block pointers — may be interleaved in the same kernel). Classify the mask:- Boundary mask — a range check vs a dimension limit, e.g.
offs_k[None,:] < K - k*BLOCK_Koroffs_m[:,None] < M. This is a bounds check: fold it into the descriptorshape(e.g.shape=(M, K)), drop the mask, and let the descriptor zero-pad. - Data-dependent / non-boundary mask — causal masks, value-dependent
predicates. Cannot be a rectangular
shape. Load the full block withdesc.load([...]), then re-apply in registers:v = tl.where(mask, v, other). - Mixed — split: boundary part →
shape, residual predicate →tl.where.
- Boundary mask — a range check vs a dimension limit, e.g.
Type-annotation hygiene → if the last stride is annotated
tl.int64/tl.int32, flag it (best practice: never annotate the last stride; prefertl.constexpror no annotation). Do not silently change a signature.Rank normalization → always aim for 2D (XPU optimizes only 2D; block pointers may be any rank):
- Rank 2 → map 1:1.
- Rank ≥ 3 → fold the leading dims into
base(one term per outer dim,base + z*stride_z + h*stride_h) and build a 2D descriptor over the inner two dims;.T/reshapethe result if the consumer needs it. Each folded outer index must be a per-program scalar (e.g.program_id, a scalar load). If an outer index is itself a vector/range it cannot be folded — flag it and keep the original. - Rank 1, unit stride (
strides=(1,)) → translate to a 1D descriptor:tl.make_tensor_descriptor(base, shape=(N,), strides=(1,), block_shape=(BLOCK,)),desc.load([off])/desc.store([off], val). Note that XPU does not give 1D descriptors 2D block I/O today (the 1D fast path is pending), so flag that this load lowers via the pointer path — the form is correct and forward-compatible. - Rank 1, non-unit stride (a strided 1D view, e.g. SGLang FLA's
p_beta=tl.make_block_ptr(beta + bos*H + i_h, (T,), (H,), (i_t*BT,), (BT,), (0,))with strideH != 1) → no legal descriptor exists: a descriptor's last (here only) stride must be 1, sostrides=(H,)fails to compile (Tensor descriptor last dim must be 1 but got H). This is the one case where aboundary_checkcannot be modernized into a descriptor at all — so do not emit a descriptor. Reproduce the access as an explicit tensor-of-pointer load/store whosemaskre-expresses the originalboundary_check(it stays on the pointer path — no 2D block I/O — which is the only correct option here, unlike Rule 9 which removes such masks by folding them into a descriptorshape):
If the offset advances in a loop, fold the running offset intoo = i_t * BT + tl.arange(0, BT) b_beta = tl.load(beta + bos*H + i_h + o*H, mask=o < T, other=0.0) # strided rank-1 store: tl.store(g + bos*H + i_h + o*H, b_val, mask=o < T)oexactly as Rule 2 does for descriptors. Report this under "Changes made" (it emits no descriptor), and flag it off the 2D-block fast path.
Block pointer received as a helper argument. A block pointer can never arrive from host launch (there is no host-side block-pointer object); it can only be a
@triton.jit→@triton.jitparameter — and arriving that way already defeats the fast path (a descriptor passed acrosstt.callloses its identity). If the kernel body uses a block pointer that was not created locally, the translation is not body-only:- Change the helper's signature to take raw
base, shape, strides, offsets(andblock_shape/orderif not alreadytl.constexprthere). - Build the descriptor inside the helper with
tl.make_tensor_descriptor. - Update every call site to pass the raw ingredients. Report this as a caller-affecting change and list the touched call sites. No-op for the common create-and-use-in-the-same-kernel case.
- Change the helper's signature to take raw
Output format
- The fully translated
@triton.jitkernel — every block-pointer construct replaced. Preserve surrounding code, comments, and signatures (except Rule 12, which legitimately changes a helper signature). - "Changes made" — a short list: counts and each transformation applied,
e.g. "Replaced 3
make_block_ptrwithmake_tensor_descriptor; removed 2tl.advance, introducedoff_k; B loaded transposed via.T; flaggedstride_ak: tl.int64". State plainly that no host-launch changes are required on XPU. - Performance note (per descriptor) — for each descriptor, say whether it is
on the efficient path (2D, last-stride-1, locally-created, dot-feeding,
OWord-pitched → same 2D-block-I/O the block pointer used). For any descriptor
that could not be put on the fast path, flag it with the reason
(untraceable across
tt.call, non-OWord pitch, not feeding atl.dot, rank>2 unfoldable, rank-1 unit-stride 1D descriptor). A non-unit-stride rank-1 block pointer (Rule 11) emits no descriptor — report it under "Changes made" as a masked tensor-of-pointer load on the pointer path, not as a per-descriptor note. Do not over-promise: the README's ">2x" figure is vs tensor-of-pointers, not vs block pointers. - Caller-affecting changes (only if Rule 12 fired) — the changed helper signature and the list of call sites to update.
Self-check before returning
- Zero
tl.make_block_ptr/tl.advanceremain (unless Rule 12 deliberately keeps a block-ptr arg pending a caller fix you've flagged). - Number of
make_tensor_descriptor≈ number of distinct block pointers, minus any non-unit-stride rank-1 pointers that became masked tensor-of-pointer loads (Rule 11). - Every descriptor has last stride == 1 (transposes via
.T, not last-stride≠1). - No descriptor was emitted for a non-unit-stride rank-1 block pointer — each
such pointer became a masked
tl.load/tl.storewhosemask=reproduces the originalboundary_check, flagged off the fast path (Rule 11). (Astrides=(S,)descriptor withS != 1would not compile.) - Descriptors are created in the function that loads from them (Rule 12).
padding_option="nan"carried to the constructor iff the source used it.- No
triton.set_allocator/ host-launch change anywhere in the output.