barrier-visualization - SKILL.md Agent Skill

name: barrier-visualization description: > Produce a structured barrier report for AutoWS (automatic warp specialization) IR. Use when the user wants to visualize, audit, or debug barrier usage across warp-specialized partitions, or when debugging a GPU kernel hang (deadlock). For hangs, first dump IR using the ir-debugging skill, then run this barrier analysis to identify mismatched arrive/wait counts, missing backward barriers, or other synchronization issues that cause deadlocks. Covers mbarriers, named barriers, tcgen05 commit, TMA-implicit arrives, Aref-based synchronization, and producer/consumer barrier patterns.

Barrier Visualization Report

When the user asks for a barrier visualization report, produce a structured analysis of barrier usage in the given IR (either from a file, an IR dump, or from running a compilation with MLIR_ENABLE_DUMP). The report has five sections. Use the IR directly as input -- read the file or dump and analyze it.

Report Format

Section 1: Partition Summary

Label each partition by its key ops -- the operations that differentiate it. Use short descriptive names. When multiple partitions contain similar ops, add qualifying detail.

Format as a table:

| Partition   | Role             | Key Ops                        | Warps |
|-------------|------------------|--------------------------------|-------|
| default     | Acc correction   | tmem_load, tmem_store          | 4     |
| partition0  | MMA              | tc_gen5_mma x2                 | 4     |
| partition1  | TMA loads (Q,K,V)| async_tma_copy_global_to_local | 1     |
| partition2  | Output store     | descriptor_store               | 1     |
| partition3  | Softmax (QK_1)   | tmem_load, exp2, reduce        | 2     |

How to identify key ops:

MMA partition: contains tt.dot, warp_group_dot, tc_gen5_mma, or tc_gen5_mma_scaled
TMA load partition: contains async_tma_copy_global_to_local or descriptor_load feeding local_alloc
Store/epilogue partition: contains descriptor_store, tt.store, tmem_load at loop exit
Softmax/reduction partition: contains tt.reduce, math.exp2, arith.maxf
Accumulator correction: contains tmem_load + tmem_store (re-scaling accumulators)

When two partitions both do TMA loads, differentiate by what they load:

"TMA load (Q, K)" vs "TMA load (V, scales)"
Use loc metadata or tensor shapes to identify operand names when available

Section 2: Barrier Dependency Graph

Draw an ASCII diagram showing which partitions produce/consume through each barrier. Use arrows to show data flow direction.

Barrier Dependency Graph
========================

  Forward barriers:

  partition1 (TMA loads)
      |
      | barrier_expect + async_tma_copy (mbarrier, SMEM buffers A, B)
      v
  partition0 (MMA)
      |
      | tc_gen5_commit (mbarrier on TMEM result)
      v
  partition3/4 (Softmax)
      |
      | aref.put / aref.get  (SMEM buffer for P)
      v
  partition0 (MMA, 2nd use)
      |
      | tc_gen5_commit
      v
  partition2 (Output store)

  Backwards barriers (next-iteration dependencies):

  partition2 (Output store)
      |
      | TMEM token (backward): tmem_load token → next iter's tmem_store
      v
  partition0 (MMA, next iteration)

  partition0 (MMA)
      |
      | mbarrier phase (backward, implicit): phase tracking prevents
      |   TMA re-arrival until MMA has consumed the buffer
      v
  partition1 (TMA loads, next iteration)

For each arrow, annotate:

The barrier mechanism type (see table below)
What data flows across (buffer name or tensor shape)
The direction: forward (producer → consumer) or backward (consumer → producer, signaling resource reuse)

Backwards-Direction Barriers

In persistent kernels (those with an outer tile loop), downstream partitions often need to signal upstream partitions that shared resources can be reused. These "backwards" barriers create cycles in the dependency graph.

Common backwards barriers:

TMEM token chain: tmem_load (epilogue) produces a token consumed by tmem_store (MMA) in the next iteration — prevents zeroing the accumulator before the epilogue finishes reading it.
consumer_release (legacy WS): Consumer releases the mbarrier slot, allowing the producer to re-acquire it for the next iteration.
Phase-based mbarrier: Multi-buffered SMEM implicitly handles backwards sync — the producer can't re-arrive on a slot until the consumer has waited on it (phase flip).

Show backwards barriers as upward arrows or annotated return edges in the dependency graph. When a backwards token chain is expected but the SSA token is unused (not loop-carried), flag it as a potential issue.

Barrier Mechanism Types

Mechanism	Arrive Side	Wait Side	Notes
mbarrier (TMA)	`async_tma_copy_global_to_local` (implicit arrive)	`wait_barrier` with phase	TMA HW auto-arrives on mbarrier after copy completes. `barrier_expect` sets expected byte count.
mbarrier (explicit)	`arrive_barrier`	`wait_barrier`	Thread-side explicit arrive with count.
tcgen05 commit	`tc_gen5_commit` on barrier	`wait_barrier`	Tracks completion of prior async tcgen5 ops (MMA, tmem_copy). Arrive count = 1. Sequential ordering between commits.
tc_gen5_mma barrier arg	`tc_gen5_mma ... barriers(%bar)`	`wait_barrier`	MMA op directly arrives on given barrier(s) upon completion.
Named barrier	`arrive_barrier_named`	`wait_barrier_named`	HW barrier (index 0-15), no SMEM. Used for intra-CTA sync between warp groups.
Producer/Consumer (legacy)	`producer_acquire` + `producer_commit`	`consumer_wait` + `consumer_release`	Legacy Hopper WS. Producer acquires mbarrier slot, does copies, commits. Consumer waits then releases.
Aref (new pipeline)	`aref.put.enter` / `aref.put.exit`	`aref.get.enter` / `aref.get.exit`	Cross-partition SSA deps rewritten to SMEM multibuffers. Handles sync internally. `async_ops` attr on exit specifies what async ops to wait on.
async_copy_mbarrier_arrive	`async_copy_mbarrier_arrive`	`wait_barrier`	Arrives on mbarrier after all prior `cp.async` copies complete.

Section 3: Index and Phase Analysis

For each barrier instance, describe:

Buffer depth (number of multibuffer slots, from buffer.copy attr or memdesc shape dim 0)
Index computation (how the buffer/barrier slot index is derived -- typically iteration % num_buffers)
Phase tracking (how the phase bit flips -- typically iteration / num_buffers)
Stagger offsets (for data-partitioned barriers sharing buffer.id, each operand gets a different offset: (accumCnt + offset) % num_buffers)

Example:

Barrier: mbarrier for SMEM buffers A, B (buffer.id = 0, merged)
  Depth: 3 (triple-buffered)
  Index: accumCnt % 3
  Phase: accumCnt / 3 (1-bit: flips every 3 iterations)
  Merged: barrier_expect size = 49152 (128*64*2 + 64*256*2)

Barrier: mbarrier for data-partitioned operands a0, a1, b (buffer.id = 2)
  Depth: 3
  Index (a0): (accumCnt + 1) % 3
  Index (a1): (accumCnt + 2) % 3
  Index (b):  accumCnt % 3
  Phase: same for all, accumCnt / 3

Flag potential issues:

Mismatched arrive/wait counts
Missing phase tracking
Barriers with buffer.copy = 1 (no pipelining)
Merged barriers where byte counts don't match tensor sizes

Section 4: Shared Data Description

For each barrier, describe what logical data it protects and which partitions share it. Group by logical purpose.

Shared Data Map
===============

Buffer Group: "K tile" (SMEM)
  Storage: !ttg.memdesc<3x128x64xf16, #shared, #smem, mutable>
  buffer.id: 0 (merged with V tile)
  Writer: partition1 (TMA load)
  Reader: partition0 (MMA operand A)
  Barrier: mbarrier[buffer.id=0], merged expect=49152

Buffer Group: "V tile" (SMEM)
  Storage: !ttg.memdesc<3x64x128xf16, #shared, #smem, mutable>
  buffer.id: 0 (merged with K tile)
  Writer: partition1 (TMA load)
  Reader: partition0 (MMA operand B)
  Barrier: mbarrier[buffer.id=0], merged expect=49152

Buffer Group: "QK accumulator" (TMEM)
  Storage: !ttg.memdesc<128x128xf32, #tmem, #ttng.tensor_memory, mutable>
  buffer.id: 1
  Writer: partition0 (MMA result)
  Reader: partition3 (softmax tmem_load)
  Barrier: tc_gen5_commit

Buffer Group: "P matrix" (Aref)
  Storage: !ttg.memdesc<1x128x128xf16, #shared, #smem>
  Writer: partition3 (softmax output, via aref.put)
  Reader: partition0 (MMA 2nd operand, via aref.get)
  Barrier: Aref-internal sync

Note when:

Multiple logical buffers share the same buffer.id (merged barriers)
Data aliases exist (same physical storage, different views)
TMEM vs SMEM vs register data flows

Section 5: SSA Value to Barrier Mapping

List all SSA values that refer to the same logical barrier, tracing through block arguments, iter_args, and aliases.

Barrier Alias Map
=================

Logical barrier "mbarrier_0" (buffer.id = 0):
  %bar_alloc   = ttg.local_alloc  (line 12)    -- allocation
  %arg35       = block argument   (line 45)     -- passed into loop body
  %bar_idx     = ttg.memdesc_index %arg35[%idx] -- indexed for iteration
  Used in:
    barrier_expect %bar_idx, 49152  (partition1, line 82)
    async_tma_copy ... %bar_idx     (partition1, line 84)
    wait_barrier %bar_idx, %phase   (partition0, line 67)

Logical barrier "named_bar_1":
  %c1 = arith.constant 1 : i32
  Used in:
    arrive_barrier_named %c1, 128  (default, line 50)
    wait_barrier_named %c1, 128    (partition0, line 55)

Include:

The allocation site (local_alloc, or constant for named barriers)
All aliases through block args, loop iter_args, memdesc_index, memdesc_subview
Every use site with partition and line number
For Arefs: the aref.create site and all enter/exit pairs

How to Generate the Report

Read the IR from the file or dump the user provides.
Identify all ttg.warp_specialize ops -- these define the partition structure.
Scan each partition region for barrier-related ops (see mechanism table above).
Trace SSA values backward from barrier ops to their allocation sites. Follow block arguments and iter_args chains.
Identify buffer.id attributes on local_alloc and tmem_alloc ops to group related barriers.
Check for merged barriers -- multiple buffers sharing the same buffer.id with a single barrier_expect whose size is the sum of individual buffer sizes.
Look for loc metadata (e.g., loc("a_desc"), loc("K")) to name buffers.
Check async_task_id attributes on ops to determine partition membership when analyzing pre-code-partition IR.
Identify backwards-direction barriers in persistent kernels (outer tile loops). Check whether downstream partitions produce tokens or release barriers that upstream partitions consume in the next iteration:
- TMEM: Does tmem_load's output token feed back (via iter_arg) to the next iteration's tmem_store? If not, flag as a potential missing backward sync.
- SMEM mbarrier: Is the buffer multi-buffered (depth > 1) with phase tracking? If so, backwards sync is implicit. If single-buffered, check for explicit backward barriers.
- Legacy WS: Does consumer_release pair with the next producer_acquire?

Example Reports

See EXAMPLES.md in this skill directory for two fully worked example reports:

Blackwell GEMM with merged barriers -- @matmul_kernel_tma_persistent from ws_code_partition_merged_barrier.mlir. Demonstrates merged buffer.id, TMEM token chains, and tc_gen5_mma barrier patterns.
Hopper matmul with two consumers -- @matmul_kernel_two_consumers from ws_code_partition.mlir. Demonstrates legacy producer/consumer barriers, shared SMEM buffers consumed by multiple partitions, and pre-code-partition async_task_id analysis.

Reference Files

Barrier op definitions: include/triton/Dialect/TritonNvidiaGPU/IR/TritonNvidiaGPUOps.td
NVWS Aref ops: third_party/nvidia/include/Dialect/NVWS/IR/NVWSOps.td
Code partition (legacy): third_party/nvidia/hopper/lib/Transforms/WarpSpecialization/WSCodePartition.cpp
Code partition (new): lib/Dialect/TritonGPU/Transforms/WarpSpecialization/
Test IR examples:
- test/Hopper/WarpSpecialization/ws_code_partition.mlir -- basic producer/consumer
- test/Hopper/WarpSpecialization/ws_code_partition_merged_barrier.mlir -- merged barriers
- test/Hopper/WarpSpecialization/ws_code_partition_data_partition_barriers.mlir -- staggered indices
- test/Hopper/WarpSpecialization/blackwell_fa_code_partition.mlir -- complex multi-partition FA
- test/TritonGPU/rewrite-partition-dependencies.mlir -- Aref-based barriers