
name: flydsl-tile-programming
description: >
  Guided step-by-step wizard for producing a new FlyDSL GPU kernel from a requirement: classify the kernel type, pick a skeleton, fill in compute, add control flow / sync / LDS, then test on GPU. Use when the user wants to WRITE a new kernel, port a Triton kernel to FlyDSL, or learn tile programming by following a procedure. For API/layout-algebra lookups, per-op reference tables, and troubleshooting, use the flydsl-kernel-authoring skill instead.
allowed-tools: Read Edit Bash Grep Glob Agent

FlyDSL Tile Programming

Guide users through writing GPU kernels using FlyDSL's tile programming model (CuTe-style layout algebra). This skill is a step-by-step wizard that takes a kernel requirement and produces a correct, tested FlyDSL kernel.

Trigger: User wants to write a new FlyDSL kernel, port a Triton kernel to FlyDSL, or learn tile programming patterns.

Prerequisites: FlyDSL installed (editable mode via pip install -e .). GPU access required for testing.

Scope (read this first): This skill is the procedure — follow the steps in order to produce a kernel. It is the companion to the flydsl-kernel-authoring skill, which is the reference (the full layout-algebra API surface, per-op tables, environment variables, and an exhaustive troubleshooting list). When you need to look something up rather than follow a step, go to flydsl-kernel-authoring. This wizard links there instead of duplicating those tables.

Step 1: Classify the Kernel Type

Ask the user what kind of kernel they need. Map to one of these patterns:

Pattern	Examples	Key Primitives
Elementwise	vecadd, scale, relu, abs	`logical_divide` + `copy_atom_call`
Reduction	sum, max, softmax, layernorm	`buffer_load` + warp shuffle + LDS
Tiled Copy	transpose, permute, gather	`zipped_divide` + `TiledCopy`
GEMM	matmul, batched gemm	`TiledMma` + `TiledCopy` + LDS
Fused	fused attention, GEMM+epilogue	Combine GEMM + elementwise

Step 2: Generate Kernel Skeleton

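Based on the pattern, generate the appropriate skeleton. Every FlyDSL kernel has two parts: a device function decorated with @flyc.kernel, and a host-side launcher decorated with @flyc.jit that calls .launch() with the grid, block, and stream: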

import torch
import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, ...):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x
    # ... kernel body ...

@flyc.jit
def my_launch(A: fx.Tensor, B: fx.Tensor, ...,
              stream: fx.Stream = fx.Stream(None)):
    my_kernel(A, B, ...).launch(
        grid=(grid_x, grid_y, grid_z),
        block=(block_x, 1, 1),
        stream=stream
    )

Pattern A: Elementwise Kernel

The simplest pattern. Each thread processes VEC_WIDTH elements independently.

Data flow: Global -> Register -> Compute -> Register -> Global

import torch
import flydsl.compiler as flyc
import flydsl.expr as fx
from flydsl.expr.typing import Vector as Vec

BLOCK_DIM = 256
VEC_WIDTH = 4

@flyc.kernel
def elementwise_kernel(
    A: fx.Tensor,
    Out: fx.Tensor,
    BLOCK_DIM: fx.Constexpr[int],
    VEC_WIDTH: fx.Constexpr[int],
):
    bid = fx.block_idx.x
    tid = fx.thread_idx.x

    # === Step 1: Divide global tensor into block-sized tiles ===
    tile_size = BLOCK_DIM * VEC_WIDTH
    tA = fx.logical_divide(A, fx.make_layout(tile_size, 1))
    tOut = fx.logical_divide(Out, fx.make_layout(tile_size, 1))

    # === Step 2: Select this block's tile ===
    tA = fx.slice(tA, (None, bid))
    tOut = fx.slice(tOut, (None, bid))

    # === Step 3: Divide tile for per-thread vectorized access ===
    tA = fx.logical_divide(tA, fx.make_layout(VEC_WIDTH, 1))
    tOut = fx.logical_divide(tOut, fx.make_layout(VEC_WIDTH, 1))

    # === Step 4: Allocate register and set up copy atom ===
    copy_bits = VEC_WIDTH * 32  # 32 bits per Float32 element -> 128-bit copy for VEC_WIDTH = 4
    copy_atom = fx.make_copy_atom(fx.UniversalCopy(copy_bits), fx.Float32)
    rA = fx.make_rmem_tensor(VEC_WIDTH, fx.Float32)
    rOut = fx.make_rmem_tensor(VEC_WIDTH, fx.Float32)

    # === Step 5: Load -> Compute -> Store ===
    fx.copy_atom_call(copy_atom, fx.slice(tA, (None, tid)), rA)

    vA = Vec(fx.memref_load_vec(rA))
    # --- YOUR COMPUTE HERE ---
    vOut = vA * vA  # example: square
    # --- END COMPUTE ---
    fx.memref_store_vec(vOut, rOut)

    fx.copy_atom_call(copy_atom, rOut, fx.slice(tOut, (None, tid)))

@flyc.jit
def elementwise_launch(
    A: fx.Tensor, Out: fx.Tensor, N: fx.Int32,
    stream: fx.Stream = fx.Stream(None),
):
    tile_size = BLOCK_DIM * VEC_WIDTH
    grid_x = (N + tile_size - 1) // tile_size
    elementwise_kernel(A, Out, BLOCK_DIM, VEC_WIDTH).launch(
        grid=(grid_x, 1, 1), block=(BLOCK_DIM, 1, 1), stream=stream
    )

# === Test ===
# N must be a multiple of BLOCK_DIM * VEC_WIDTH (1024 here): this skeleton has no tail handling.
N = 1024
A = torch.randn(N, dtype=torch.float32, device="cuda")
Out = torch.empty(N, dtype=torch.float32, device="cuda")
elementwise_launch(A, Out, N, stream=torch.cuda.Stream())
torch.cuda.synchronize()
assert torch.allclose(Out, A * A, atol=1e-5)
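To see which elements each thread touches, the two logical_divide + slice steps can be modelled on the host in plain Python. This is a sketch, not FlyDSL code: `grid_size`, `global_index`, and `covered_indices` are hypothetical helper names, and the index formula assumes the divided layouts behave as contiguous stride-1 tiles, as the skeleton's `make_layout(..., 1)` calls suggest.

```python
BLOCK_DIM = 256
VEC_WIDTH = 4
TILE_SIZE = BLOCK_DIM * VEC_WIDTH  # elements handled by one block


def grid_size(n, tile_size=TILE_SIZE):
    """Ceiling division, the same formula elementwise_launch uses for grid_x."""
    return (n + tile_size - 1) // tile_size


def global_index(bid, tid, v):
    """Flat element index for vector lane v of thread tid in block bid."""
    return bid * TILE_SIZE + tid * VEC_WIDTH + v


def covered_indices(n):
    """Every flat index visited by a launch sized for n elements, in order."""
    return [
        global_index(bid, tid, v)
        for bid in range(grid_size(n))
        for tid in range(BLOCK_DIM)
        for v in range(VEC_WIDTH)
    ]


# A multiple of the tile size is covered exactly once, with no overshoot.
exact = covered_indices(2048)

# A non-multiple rounds the grid up, so some visited indices are >= n.
overshoot = [i for i in covered_indices(1500) if i >= 1500]
```

The second case illustrates why N should be a multiple of BLOCK_DIM * VEC_WIDTH for this skeleton: rounding the grid up visits indices past N, and nothing in the kernel masks them.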

Pattern B: Tiled 2D Copy (Transpose, Gather)

Uses zipped_divide + TiledCopy for 2D data movement with explicit thread-value mapping.

Data flow: Global[M,N] -> Fragment -> Global[M,N] (with layout change)

@flyc.kernel
def tiled_copy_kernel(A: fx.Tensor, B: fx.Tensor):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x

    block_m, block_n = 8, 24
    tile = fx.make_tile([
        fx.make_layout(block_m, 1),
        fx.make_layout(block_n, 1)
    ])

    # Wrap as buffer tensors (AMD buffer descriptors)
    A = fx.rocdl.make_buffer_tensor(A)
    B = fx.rocdl.make_buffer_tensor(B)

    # Divide into tiles, select block's tile
    bA = fx.zipped_divide(A, tile)
    bB = fx.zipped_divide(B, tile)
    bA = fx.slice(bA, (None, bid))
    bB = fx.slice(bB, (None, bid))

    # Thread-value layout: how threads cooperate on the tile
    thr_layout = fx.make_layout((4, 1), (1, 1))   # 4 threads along M
    val_layout = fx.make_layout((1, 8), (1, 1))    # each loads 8 along N
    copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), fx.Float32)
    layout_tv = fx.raked_product(thr_layout, val_layout)
    tile_mn = fx.make_tile(4, 8)

    # Build tiled copy and get thread's partition
    tiled_copy = fx.make_tiled_copy(copy_atom, layout_tv, tile_mn)
    thr_copy = tiled_copy.get_slice(tid)
    src = thr_copy.partition_S(bA)
    dst = thr_copy.partition_D(bB)
    frag = fx.make_fragment_like(src)

    # Copy: global A -> frag -> global B
    fx.copy(copy_atom, src, frag)
    fx.copy(copy_atom, frag, dst)
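The thread-value mapping can be hard to picture, so here is a plain-Python model of the ownership described in the comments above (4 threads along M, 8 values along N, repeated over the 8x24 block tile). It is only a sketch under that reading of the comments: `thread_coords` is a hypothetical helper, and the exact ordering that `raked_product` produces in FlyDSL may differ.

```python
BLOCK_M, BLOCK_N = 8, 24  # block tile, as in tiled_copy_kernel
TILE_M, TILE_N = 4, 8     # one copy tile: 4 threads x 8 values each


def thread_coords(tid):
    """(m, n) coordinates inside the block tile that thread tid would copy.

    Assumption: thread tid owns row tid of every 4x8 copy tile, and the copy
    tile is repeated 2x along M and 3x along N to cover the 8x24 block tile.
    """
    coords = []
    for rep_m in range(BLOCK_M // TILE_M):
        for rep_n in range(BLOCK_N // TILE_N):
            for v in range(TILE_N):
                coords.append((rep_m * TILE_M + tid, rep_n * TILE_N + v))
    return coords


# Collect what all 4 threads own; a correct mapping covers the tile once.
all_coords = [c for tid in range(TILE_M) for c in thread_coords(tid)]
expected = {(m, n) for m in range(BLOCK_M) for n in range(BLOCK_N)}
covers_tile_once = (
    len(all_coords) == len(set(all_coords)) and set(all_coords) == expected
)
```

Checking that the per-thread sets are disjoint and together cover the tile is a quick sanity test to run on paper before debugging a thr_layout / val_layout pair on the GPU.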

Pattern C: Tiled MMA (GEMM)

Uses TiledMma + TiledCopy for matrix multiply with AMD MFMA instructions.

Data flow: Global -> (TiledCopy) -> Fragment A,B -> (MFMA) -> Fragment C -> Global

block_m, block_n, block_k = 64, 64, 8

@flyc.kernel
def gemm_kernel(A: fx.Tensor, B: fx.Tensor, C: fx.Tensor):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x

    # Define tiles
    tileA = fx.make_tile(block_m, block_k)
    tileB = fx.make_tile(block_n, block_k)
    tileC = fx.make_tile(block_m, block_n)

    # Wrap as buffer tensors
    A = fx.rocdl.make_buffer_tensor(A)
    B = fx.rocdl.make_buffer_tensor(B)
    C = fx.rocdl.make_buffer_tensor(C)

    # Divide and select block's tile
    bA = fx.slice(fx.zipped_divide(A, tileA), (None, bid))
    bB = fx.slice(fx.zipped_divide(B, tileB), (None, bid))
    bC = fx.slice(fx.zipped_divide(C, tileC), (None, bid))

    # === MMA setup ===
    # MFMA(M, N, K, AccType) -- hardware instruction shape
    mma_atom = fx.make_mma_atom(fx.rocdl.MFMA(16, 16, 4, fx.Float32))

    # Tile the MMA atom across threads: 2x2 = 4 MMA atoms per warp
    tiled_mma = fx.make_tiled_mma(
        mma_atom,
        fx.make_layout((2, 2, 1), (1, 2, 0))  # (M_rep, N_rep, K_rep)
    )
    thr_mma = tiled_mma.thr_slice(tid)

    # === Copy setup (matched to MMA layout) ===
    copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy32b(), fx.Float32)
    tiled_copy_A = fx.make_tiled_copy_A(copy_atom, tiled_mma)
    tiled_copy_B = fx.make_tiled_copy_B(copy_atom, tiled_mma)
    tiled_copy_C = fx.make_tiled_copy_C(copy_atom, tiled_mma)

    thr_copy_A = tiled_copy_A.get_slice(tid)
    thr_copy_B = tiled_copy_B.get_slice(tid)
    thr_copy_C = tiled_copy_C.get_slice(tid)

    # === Partition data ===
    # Copy partitions (for data movement)
    copy_src_A = thr_copy_A.partition_S(bA)
    copy_src_B = thr_copy_B.partition_S(bB)
    copy_dst_C = thr_copy_C.partition_D(bC)  # C is the copy destination

    # MMA partitions (for compute)
    part_A = thr_mma.partition_A(bA)
    part_B = thr_mma.partition_B(bB)
    part_C = thr_mma.partition_C(bC)

    # === Allocate fragments (registers) ===
    frag_A = thr_mma.make_fragment_A(part_A)
    frag_B = thr_mma.make_fragment_B(part_B)
    frag_C = thr_mma.make_fragment_C(part_C)

    # Retile fragments for copy compatibility
    copy_frag_A = thr_copy_A.retile(frag_A)
    copy_frag_B = thr_copy_B.retile(frag_B)
    copy_frag_C = thr_copy_C.retile(frag_C)

    # === Execute: Load A,B -> GEMM -> Store C ===
    fx.copy(copy_atom, copy_src_A, copy_frag_A, pred=None)
    fx.copy(copy_atom, copy_src_B, copy_frag_B, pred=None)
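    # fx.gemm computes D = A @ B + C. frag_C is passed as both C and D, so it
    # must already hold the initial accumulator (e.g. zeros) before this call.
    fx.gemm(mma_atom, frag_C, frag_A, frag_B, frag_C)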
    fx.copy(copy_atom, copy_frag_C, copy_dst_C, pred=None)
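When validating a GEMM kernel, a host-side reference that mirrors the accumulate-over-K-tiles structure is handy. The sketch below is plain Python, not FlyDSL. `tiled_gemm_reference` is a hypothetical name, and it assumes A is stored as M x K and B as N x K, which is how the tileA / tileB shapes above read.

```python
def tiled_gemm_reference(A, B, C, block_k):
    """Return D = A @ B^T + C for A (M x K), B (N x K), C (M x N).

    The K dimension is walked in chunks of block_k to mirror how a tiled
    kernel accumulates one K tile at a time into the same accumulator.
    """
    M, K, N = len(A), len(A[0]), len(B)
    acc = [row[:] for row in C]  # accumulator starts from C (zeros for a plain matmul)
    for k0 in range(0, K, block_k):
        k1 = min(k0 + block_k, K)
        for m in range(M):
            for n in range(N):
                for k in range(k0, k1):
                    acc[m][n] += A[m][k] * B[n][k]
    return acc


A = [[1, 2], [3, 4]]   # M x K
B = [[5, 6], [7, 8]]   # N x K
zeros = [[0, 0], [0, 0]]
D = tiled_gemm_reference(A, B, zeros, block_k=1)
```

Because the accumulator is seeded from C, the reference also makes it obvious what happens if the accumulator is not initialized: whatever it held is added to the result.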

Pattern D: Buffer Load/Store (Low-level)

Direct AMD buffer intrinsics for maximum control. Bypasses the layout algebra.

from flydsl.expr import buffer_ops

@flyc.kernel
def buffer_kernel(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int]):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x
    gid = bid * 256 + tid  # assumes the kernel is launched with block=(256, 1, 1)

    rsrc_a = buffer_ops.create_buffer_resource(A)
    rsrc_b = buffer_ops.create_buffer_resource(B)

    # offset is in ELEMENTS (not bytes!) -- buffer_load converts internally
    data = buffer_ops.buffer_load(rsrc_a, gid * 4, vec_width=4, dtype=fx.T.f32())
    # ... compute on data ...
    buffer_ops.buffer_store(data, rsrc_b, gid * 4)
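The element-versus-byte distinction is a common source of off-by-a-factor bugs, so it can help to write the arithmetic out. The following is a plain-Python sketch of the offsets used in buffer_kernel. `element_offset` and `byte_offset` are hypothetical helpers for illustration, not part of buffer_ops.

```python
ELEM_BYTES = {"f32": 4, "f16": 2}


def element_offset(bid, tid, block_dim=256, vec_width=4):
    """Offset in ELEMENTS of the first value this thread loads (gid * vec_width)."""
    gid = bid * block_dim + tid
    return gid * vec_width


def byte_offset(elem_offset, dtype):
    """Byte address the element offset corresponds to, for reference only."""
    return elem_offset * ELEM_BYTES[dtype]


# Thread 2 of block 1 with 256 threads per block and 4 values per thread.
off_elems = element_offset(bid=1, tid=2)
off_bytes = byte_offset(off_elems, "f32")
```

Per the comment in the skeleton, the element offset is what gets passed to buffer_load; passing the byte value instead would read four times too far into an f32 buffer.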

Step 3: Fill in the Compute Logic

Common compute recipes (all work on vectors):

from flydsl.expr.typing import Vector as Vec

# Scale: C = A * scalar
scale = Vec.filled(VEC_WIDTH, 2.0, fx.Float32)
vC = Vec(vA) * scale

# Add: C = A + B
vC = Vec(vA) + Vec(vB)

# FMA: D = A * B + C
vD = Vec(vA) * Vec(vB) + Vec(vC)

# ReLU: C = max(A, 0)
zero = Vec.filled(VEC_WIDTH, 0.0, fx.Float32)
vC = Vec(vA).maximumf(zero)

# Abs: C = |A| (select -A where A < 0, else A)
v = Vec(vA)
zero = Vec.filled(VEC_WIDTH, 0.0, fx.Float32)
neg = -v
is_neg = v < zero
vC = is_neg.select(neg, v)

# Type conversion
vC = Vec(vI32).to(fx.Float32)  # int -> float
vC = Vec(vF32).to(fx.Float16)  # f32 -> f16
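For checking kernel output, it can be useful to keep host-side references that follow the same formulation as the recipes above. The functions below are a plain-Python sketch operating on lists of floats. The names are hypothetical and nothing here calls FlyDSL.

```python
def scale_ref(a, scalar):
    """C = A * scalar."""
    return [x * scalar for x in a]


def add_ref(a, b):
    """C = A + B."""
    return [x + y for x, y in zip(a, b)]


def fma_ref(a, b, c):
    """D = A * B + C."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def relu_ref(a):
    """C = max(A, 0)."""
    return [max(x, 0.0) for x in a]


def abs_ref(a):
    """C = |A|, written as a select on (A < 0) to mirror the vector recipe."""
    return [(-x if x < 0.0 else x) for x in a]


sample = [-1.5, 0.0, 2.0]
relu_out = relu_ref(sample)
abs_out = abs_ref(sample)
```

In a real test these would typically be replaced by the equivalent torch expressions on the GPU tensors; the list versions are just easy to reason about by hand.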

Step 4: Add Control Flow

from flydsl.expr import range_constexpr

# Compile-time unrolled loop (constant bounds)
for i in range_constexpr(K):
    ...

# Runtime loop (dynamic bounds)
for i in range(runtime_N):
    ...

# Loop with carried state (software pipelining)
start, stop, step = fx.Index(0), fx.Index(N - 1), fx.Index(1)
for iv, state in range(start, stop, step, init=[acc_init, ...]):
    acc = state[0]
    # ... compute ...
    results = yield [new_acc, ...]
final_acc = results[0]

# Static if (evaluated at compile time; emits no MLIR branch)
from flydsl.expr import const_expr
if const_expr(USE_FAST_PATH):
    ...

# Dynamic if (runtime, rewritten by the frontend)
if bid == 0:
    ...
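The loop-with-carried-state form is the least familiar of these, so here is a plain-Python model of the idea: state goes in through init, each iteration receives the current state and hands back the next one, and the final state comes out after the loop. `carried_loop` and `accumulate` are hypothetical names for illustration. The real FlyDSL construct uses the range(..., init=...) / yield syntax shown above, and this sketch assumes an exclusive stop bound like Python's range.

```python
def carried_loop(start, stop, step, init, body):
    """Run body(iv, state) for each iv and thread the state through.

    body must return the new state list, playing the role of `yield [...]`.
    """
    state = list(init)
    for iv in range(start, stop, step):
        state = body(iv, state)
    return state


def accumulate(iv, state):
    """Example body: add the induction variable into a single accumulator."""
    acc = state[0]
    return [acc + iv]


N = 5
full = carried_loop(0, N, 1, [0], accumulate)        # iterates iv = 0..4
peeled = carried_loop(0, N - 1, 1, [0], accumulate)  # iterates iv = 0..3
```

With an exclusive bound, a stop of N - 1 runs one iteration fewer than N, which is the shape a software-pipelined loop takes when the last step is handled outside the loop.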

Step 5: Add Synchronization (if needed)

# Workgroup barrier (__syncthreads)
fx.gpu.barrier()

# Fine-grained waitcnt (CDNA3)
fx.rocdl.s_waitcnt(0)

# Fine-grained waitcnt (CDNA4 / gfx950)
fx.rocdl.s_wait_loadcnt(0)
fx.rocdl.s_wait_storecnt(0)
fx.rocdl.s_wait_dscnt(0)

# Scheduling hints
fx.rocdl.sched_mfma(N)     # schedule N MFMA before next barrier
fx.rocdl.sched_vmem(N)     # schedule N VMEM reads
fx.rocdl.sched_dsrd(N)     # schedule N DS reads
fx.rocdl.sched_dswr(N)     # schedule N DS writes

Step 6: Add Shared Memory (if needed)

from flydsl.utils.smem_allocator import SmemAllocator
from flydsl.compiler.kernel_function import CompilationContext
from flydsl._mlir import ir

allocator = SmemAllocator(None, arch="gfx942", global_sym_name="smem0")
lds_buf = allocator.allocate_array(fx.T.f16, num_elements)

@flyc.kernel
def kernel_with_lds(A: fx.Tensor, ...):
    lds_base = allocator.get_base()
    lds_ptr = lds_buf(lds_base)

    # Write to LDS
    lds_ptr.store(value, [idx])
    fx.gpu.barrier()

    # Read from LDS
    val = lds_ptr.load([idx])

    # Finalize (inside GPU module body, before launch)
    comp_ctx = CompilationContext.get_current()
    with ir.InsertionPoint(comp_ctx.gpu_module_body):
        allocator.finalize()

LDS capacity: gfx942 (MI300X) = 64KB, gfx950 (MI350) = 160KB.
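Before allocating, it is worth checking the requested buffers against these capacities. The sketch below is plain Python using the two figures quoted above. `lds_bytes` and `fits_in_lds` are hypothetical helpers, not part of SmemAllocator.

```python
LDS_CAPACITY_BYTES = {
    "gfx942": 64 * 1024,   # MI300X
    "gfx950": 160 * 1024,  # MI350
}
ELEM_BYTES = {"f16": 2, "f32": 4}


def lds_bytes(num_elements, dtype):
    """Bytes needed for one LDS array of num_elements values of dtype."""
    return num_elements * ELEM_BYTES[dtype]


def fits_in_lds(allocations, arch):
    """allocations is a list of (num_elements, dtype) pairs; True if the sum fits."""
    total = sum(lds_bytes(n, dt) for n, dt in allocations)
    return total <= LDS_CAPACITY_BYTES[arch]


# Two double-buffered 64x64 f16 tiles: 4 * 4096 * 2 bytes = 32 KB.
double_buffered = [(64 * 64, "f16")] * 4
ok_on_gfx942 = fits_in_lds(double_buffered, "gfx942")
```

This ignores any alignment padding the allocator may add, so treat a result close to the limit as a warning rather than a guarantee.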

Step 7: Test the Kernel

Run the kernel locally or on a remote GPU:

# Run locally
PYTHONPATH=./ python my_kernel.py

# Run with IR dump for debugging
FLYDSL_DUMP_IR=1 PYTHONPATH=./ python my_kernel.py

Step 8: Debug Common Errors

If the kernel fails to compile or produces wrong results, consult the full error -> cause -> fix table in the flydsl-kernel-authoring skill (§10 Troubleshooting), which covers the common wizard pitfalls: Python int where a DSL value is expected, NameError inside extracted __then_* branches, missing arith.absf, scalar/vector mismatches, LDS overflow, buffer_load element-vs-byte offsets, range(..., init=...) being unrolled, and stale caches. For deeper kernel-debugging methodology (all-1s test, single-partition isolation, MFMA operand layout checks), use the debug-flydsl-kernel skill.

Tile Programming Mental Model

                  Layout Algebra
                  =============
   make_layout(shape, stride)  ->  Layout = mapping: coord -> index

                  Divide (Partition)
                  =================
   zipped_divide(Tensor, Tile)  ->  (tile_interior, tile_id)
   slice(divided, (None, bid))  ->  this block's tile

                  Atom (Hardware Instruction)
                  ==========================
   CopyAtom  = one hardware copy instruction (32b/64b/128b)
   MmaAtom   = one MFMA instruction (16x16x4, 16x16x16, etc.)

                  Tiled Operation (Thread Cooperation)
                  ====================================
   TiledCopy = CopyAtom x thread_layout  -> many threads cooperate on copy
   TiledMma  = MmaAtom  x atom_layout   -> many threads cooperate on MMA

                  Per-Thread View
                  ===============
   ThrCopy.partition_S/D(tensor)  ->  this thread's source/dest data
   ThrMma.partition_A/B/C(tensor) ->  this thread's operand data

                  Fragment (Register Storage)
                  ==========================
   make_fragment_like(partition)  ->  register tile
   retile(fragment)              ->  reshape for copy compatibility

                  Execute
                  =======
   fx.copy(atom, src, dst)           ->  data movement
   fx.gemm(atom, D, A, B, C)        ->  matrix multiply: D = A @ B + C

Key insight: Layout is the glue. Every operation (divide, partition, copy, gemm) is defined in terms of layouts that describe the mapping from logical coordinates to physical locations. Getting the layouts right is most of the work in FlyDSL programming.
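The "layout = mapping from coordinate to index" idea can be made concrete with a few lines of plain Python. This is a conceptual sketch only: `layout_index` and `divided_index` are hypothetical helpers that model the arithmetic, not the FlyDSL make_layout or divide APIs.

```python
def layout_index(coord, stride):
    """Map a logical coordinate to a linear index: sum of coord[i] * stride[i]."""
    return sum(c * s for c, s in zip(coord, stride))


def divided_index(i, tile_id, tile_size):
    """Index of element i inside tile tile_id after dividing a 1-D tensor.

    Models the (tile_interior, tile_id) view for a contiguous stride-1 tile.
    """
    return tile_id * tile_size + i


# The same 2x3 shape with two different strides.
coords = [(r, c) for r in range(2) for c in range(3)]
row_major = [layout_index(rc, (3, 1)) for rc in coords]
col_major = [layout_index(rc, (1, 2)) for rc in coords]
```

The shape is identical in both cases; only the stride changes which memory location each coordinate lands on, which is why most layout bugs show up as correct values in the wrong place.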

MFMA Instruction Reference (AMD CDNA3/4)

For the table of available MFMA instruction shapes (MFMA(16,16,4,Float32), MFMA(16,16,16,Float32), FP8/BF16/CDNA4-scaled variants) and how make_tiled_mma's (M_rep, N_rep, K_rep) atom_layout works, see the flydsl-kernel-authoring skill (§6 MFMA Integration). Use it when choosing the MMA atom for the Pattern C (GEMM) skeleton above.

Checklist for New Kernels

Identified kernel pattern (elementwise / reduction / tiled copy / GEMM / fused)
Chose appropriate copy atom type (Universal vs Buffer, bit width)
Set tile sizes matching MFMA instruction shape (if GEMM)
Verified VEC_WIDTH * element width in bits <= copy atom bits (e.g. 4 x 32-bit Float32 fills a 128-bit copy)
Used Constexpr[int] for compile-time constants, Int32 for runtime
Added torch.cuda.synchronize() before checking results
Verified correctness with torch.allclose()
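The copy-width item in the checklist is easy to automate. Below is a plain-Python sketch of that check. `vector_bits` and `fits_copy_atom` are hypothetical helpers, and the 32/64/128-bit widths are the ones listed for CopyAtom in the mental model above.

```python
ELEM_BITS = {"f32": 32, "f16": 16}
COPY_ATOM_BITS = (32, 64, 128)


def vector_bits(vec_width, dtype):
    """Total bits moved per thread per copy: VEC_WIDTH * bits per element."""
    return vec_width * ELEM_BITS[dtype]


def fits_copy_atom(vec_width, dtype, atom_bits):
    """True if the per-thread vector does not exceed the copy atom width."""
    if atom_bits not in COPY_ATOM_BITS:
        raise ValueError("unexpected copy atom width: %d" % atom_bits)
    return vector_bits(vec_width, dtype) <= atom_bits


# Pattern A: 4 x Float32 exactly fills a 128-bit copy.
pattern_a_ok = fits_copy_atom(4, "f32", 128)
```

A vector that is narrower than the atom passes this check but leaves bandwidth unused, so an exact match (as in Pattern A) is usually the target.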