name: tt-lang
description: Comprehensive TT-Lang DSL reference including programming model, APIs, hardware constraints, and guides for translating CUDA, Triton, PyTorch, or TTNN kernels
argument-hint:
External Resources
Prerequisites
Before writing TT-Lang kernels, confirm your environment:
- Verify
ttlandttnnPython packages are available (import ttl; import ttnn) - Verify hardware access or functional simulator is working
- If you are having problems connecting to a device, use the tt-connect-remote-device skill which has a smoke test to verify everything is working, or ask the user how to proceed
Key Use Case: Fusing TTNN Operations
A common use case is taking a sequence of TTNN operations and fusing them into a single TT-Lang kernel for better performance. For example:
# Original TTNN program (multiple ops, multiple round trips)
x = ttnn.exp(input)
y = ttnn.add(x, bias)
z = ttnn.relu(y)
# Fused TT-Lang kernel (single kernel, all ops in one compute function)
@ttl.kernel(grid=(1, 1))
def fused_kernel(input, bias, out):
# ... setup CBs ...
@ttl.compute()
def compute():
inp = input_dfb.wait()
b = bias_dfb.wait()
o = out_dfb.reserve()
# All ops fuse into one compute body
result = ttl.math.relu(ttl.math.exp(inp) + b)
o.store(result)
# ... pop/push ...
When fusing TTNN ops:
- Identify the sequence of ops to fuse
- Create one DFB per input tensor
- Chain operations in a single compute function
- TT-Lang will generate optimized fused code
TT-Lang Programming Model
Kernel Structure
Every TT-Lang kernel has exactly three threads that run concurrently:
- Compute thread (
@ttl.compute()): Math operations on tiles in L1 - Reader thread (
@ttl.datamovement()): Loads data from DRAM to dataflow buffers - Writer thread (
@ttl.datamovement()): Writes data from dataflow buffers to DRAM
These threads synchronize via dataflow buffers (DFBs).
Basic Kernel Template
import ttl
@ttl.kernel(grid=(1, 1))
def add_kernel(lhs, rhs, out):
lhs_dfb = ttl.make_dataflow_buffer_like(lhs, shape=(1, 1), buffer_factor=2)
rhs_dfb = ttl.make_dataflow_buffer_like(rhs, shape=(1, 1), buffer_factor=2)
out_dfb = ttl.make_dataflow_buffer_like(out, shape=(1, 1), buffer_factor=2)
@ttl.compute()
def compute():
with lhs_dfb.wait() as l, rhs_dfb.wait() as r, out_dfb.reserve() as o:
o.store(l + r)
@ttl.datamovement()
def dm_read():
with lhs_dfb.reserve() as blk:
tx = ttl.copy(lhs[0, 0], blk)
tx.wait()
with rhs_dfb.reserve() as blk:
tx = ttl.copy(rhs[0, 0], blk)
tx.wait()
@ttl.datamovement()
def dm_write():
with out_dfb.wait() as blk:
tx = ttl.copy(blk, out[0, 0])
tx.wait()
# Call the kernel directly (no return ttl.Program)
# add_kernel(lhs_tensor, rhs_tensor, out_tensor)
Using Context Managers (Preferred)
The with statement automatically handles pop() and push():
@ttl.compute()
def compute():
with input1_dfb.wait() as a, input2_dfb.wait() as b:
with output_dfb.reserve() as o:
result = a + b
o.store(result)
# pop/push happens automatically at end of with block
@ttl.datamovement()
def dm_read():
with input1_dfb.reserve() as blk:
tx = ttl.copy(input1[0, 0], blk)
tx.wait()
# push happens automatically
Dataflow Buffer API Reference
# Create a dataflow buffer
dfb = ttl.make_dataflow_buffer_like(
tensor, # TTNN tensor to inherit dtype/layout from
shape=(R, C), # Block size in tiles (e.g., (2, 2) = 4 tiles per block)
buffer_factor=2 # Factor of extra blocks in DFB (2 = double buffering) for pipelining
)
# Consumer operations (compute thread consumes data)
blk = dfb.wait() # Block until data available, returns block
dfb.pop() # Release block back to producer
# Producer operations (datamovement thread produces data)
blk = dfb.reserve() # Block until space available, returns block
dfb.push() # Signal data is ready for consumer
# Context manager (preferred - auto pop/push)
with dfb.wait() as blk: # For consumers
# use blk...
with dfb.reserve() as blk: # For producers
# fill blk...
# Block operations
blk.store(expr) # Store result of expression into block
DFB Shape = Block Size: The shape=(R, C) parameter defines the block size in tiles. A block is the unit of data transferred between threads. For tensors larger than one block, use loops to iterate over multiple blocks:
Note: buffer factor is a pipeline hint, not a queue depth. Almost all kernels just use 2. You are able to push as many tiles into a CB as you want, it's just a datatype like array or queue, even a buffer_factor=1 dataflow buffer can support hundreds of tiles.
# 128x128 tensor = 4x4 tiles, process in 2x2 blocks (4 iterations)
dfb = ttl.make_dataflow_buffer_like(tensor, shape=(2, 2), buffer_factor=2)
@ttl.datamovement()
def dm_read():
for row in range(2): # 2 row-blocks
for col in range(2): # 2 col-blocks
with dfb.reserve() as blk:
tx = ttl.copy(tensor[row*2:(row+1)*2, col*2:(col+1)*2], blk)
tx.wait()
@ttl.compute()
def compute():
for _ in range(4): # Must match total iterations in dm_read
with dfb.wait() as blk, out_dfb.reserve() as o:
o.store(ttl.math.exp(blk))
Available Operations
Binary Operators
result = a + b # Element-wise addition
result = a - b # Element-wise subtraction
result = a * b # Element-wise multiplication
result = a / b # Element-wise division
result = a @ b # Matrix multiplication (equivalent to ttl.math.matmul(a, b))
Binary Functions
result = ttl.math.max(a, b) # Element-wise maximum
result = ttl.math.min(a, b) # Element-wise minimum
Unary Functions (ttl.math.*)
result = ttl.math.exp(x) # Exponential
result = ttl.math.log(x) # Natural logarithm
result = ttl.math.sqrt(x) # Square root
result = ttl.math.rsqrt(x) # Reciprocal square root (1/sqrt(x))
result = ttl.math.recip(x) # Reciprocal (1/x)
result = ttl.math.tanh(x) # Hyperbolic tangent
result = ttl.math.sigmoid(x) # Sigmoid (1/(1+exp(-x)))
result = ttl.math.relu(x) # ReLU (max(0, x))
result = ttl.math.abs(x) # Absolute value
result = ttl.math.neg(x) # Negation (-x)
result = ttl.math.floor(x) # Floor
result = ttl.math.ceil(x) # Ceil
result = ttl.math.sign(x) # Sign (-1, 0, or 1)
result = ttl.math.selu(x, scale, alpha) # SELU activation
result = ttl.math.fill(x, value) # Fill block with scalar value (value must be a constant!)
Matrix Multiplication
# Two equivalent ways to do matmul:
result = a @ b # @ operator
result = ttl.math.matmul(a, b) # function call
# Example usage:
with a_dfb.wait() as a_tile, b_dfb.wait() as b_tile, c_dfb.reserve() as c_out:
c_out.store(a_tile @ b_tile)
Multi-tile matmul: When CBs hold multiple tiles (e.g., shape=(2, 2)), the compiler generates loops over K dimension and accumulates automatically. The DST register persists across K iterations, enabling proper accumulation. For example, with A[1,2] @ B[2,1] = C[1,1], the K=2 tiles accumulate correctly.
Power (scalar integer exponent)
# Raises each element to an integer power (top-level, not ttl.math)
result = ttl.power(x, 2) # x^2
result = ttl.power(x, 3) # x^3
Transpose
# Transpose tiles (top-level, not ttl.math)
# Takes input block, works with multi-tile CBs
with inp_dfb.wait() as x, out_dfb.reserve() as o:
o.store(ttl.transpose(x))
Non-square example: For 4x2 tiles → 2x4 tiles:
inp_dfb = ttl.make_dataflow_buffer_like(inp, shape=(4, 2), buffer_factor=2)
out_dfb = ttl.make_dataflow_buffer_like(out, shape=(2, 4), buffer_factor=2) # Swapped!
Reductions (require scaler tensor)
# Reductions are in ttl.math and need a "scaler" tensor (1x1 DFB of all 1.0s)
# dims=[0] = collapse rows, dims=[1] = collapse columns, dims=[0, 1] = scalar
# Scaler: 32x32 tile of 1.0s in a 1x1 DFB
scaler_dfb = ttl.make_dataflow_buffer_like(scaler, shape=(1, 1), buffer_factor=2)
with inp_dfb.wait() as i, scaler_dfb.wait() as s, out_dfb.reserve() as o:
# Scalar reduction (sum/max entire DFB -> single value in output [0,0])
o.store(ttl.math.reduce_sum(i, s, dims=[0, 1]))
o.store(ttl.math.reduce_max(i, s, dims=[0, 1]))
# Collapse rows (reduce along dim 0): (N, M) -> (1, M)
o.store(ttl.math.reduce_sum(i, s, dims=[0]))
# Collapse columns (reduce along dim 1): (N, M) -> (N, 1)
o.store(ttl.math.reduce_sum(i, s, dims=[1]))
Dimension semantics match PyTorch:
dims=[0]for reduce collapses rows (dim 0) - output shape [1, M]dims=[1]for reduce collapses columns (dim 1) - output shape [N, 1]
Multi-tile reduce: Reduces across ALL tiles in the input DFB. For example, a 4x1 tile input DFB reduced with dims=[0, 1] produces a single scalar value (in a 1x1 output DFB). The reduction sums all elements across all 4 tiles into position [0,0].
Broadcast
# Broadcast expands a smaller block to match a larger output shape
# dims=[0] = expand dim 0 (rows), dims=[1] = expand dim 1 (cols), dims=[0, 1] = broadcast scalar
with scalar_dfb.wait() as s, out_dfb.reserve() as o:
# Broadcast 1x1 scalar to fill entire output block
o.store(ttl.math.broadcast(s, dims=[0, 1]))
with row_dfb.wait() as r, out_dfb.reserve() as o:
# Broadcast (1,M) row across N rows: dims=[0] expands dim 0
o.store(ttl.math.broadcast(r, dims=[0]))
with col_dfb.wait() as c, out_dfb.reserve() as o:
# Broadcast (N,1) column across M columns: dims=[1] expands dim 1
o.store(ttl.math.broadcast(c, dims=[1]))
Broadcast dimension semantics (match PyTorch):
dims=[0]for broadcast expands dim 0 (copies row to all rows) - input (1, M) -> output (N, M)dims=[1]for broadcast expands dim 1 (copies column to all columns) - input (N, 1) -> output (N, M)
Note: Reduce and broadcast use matching dims. dims=[1] reduce collapses columns to produce (N, 1), dims=[1] broadcast expands that column back to (N, M).
Conditional Select
result = ttl.where(condition, true_val, false_val)
Operation Fusion
Operations chain automatically - no need for store/reload between ops:
@ttl.compute()
def fused_compute():
with input_dfb.wait() as a, bias_dfb.wait() as b, out_dfb.reserve() as o:
# All these ops fuse into one efficient compute body
x = ttl.math.exp(a)
y = x + b
z = ttl.math.sigmoid(y)
result = ttl.math.relu(z)
o.store(result)
Limitation: Ops that take DFB arguments (matmul, reduce, transpose, broadcast) cannot be fused with each other. Each must have its own with block and store. Broadcast cannot be fused with elementwise ops either.
When fusion fails: Use sequential with blocks to break the chain - you do NOT need separate kernels:
@ttl.compute()
def compute():
# CORRECT: Break into two with blocks (still one kernel!)
with a_dfb.wait() as a, b_dfb.wait() as b, intermediate_dfb.reserve() as inter:
inter.store(a @ b)
with intermediate_dfb.wait() as inter, scaler_dfb.wait() as s, out_dfb.reserve() as o:
o.store(ttl.math.reduce_sum(inter, s, dims=[0, 1]))
The compiler fuses 20+ elementwise ops in a single compute function without issues.
Kernel Design: Minimize DRAM Traffic
Strive for one fused kernel. Multiple kernels are fine for incremental development and debugging, but each kernel boundary creates DRAM round-trips. For production:
- One kernel = one DRAM read + one DRAM write (ideal)
- Two kernels = read → compute → write → read → compute → write (2x DRAM traffic)
- N kernels = N× DRAM traffic (avoid)
# BAD: Two kernels = 2x DRAM traffic
@ttl.kernel(grid=(1, 1))
def kernel1(inp, temp):
# Read inp from DRAM, write temp to DRAM
...
@ttl.kernel(grid=(1, 1))
def kernel2(temp, out):
# Read temp from DRAM, write out to DRAM
...
# GOOD: One fused kernel = 1x DRAM traffic
@ttl.kernel(grid=(1, 1))
def fused_kernel(inp, out):
# Read inp from DRAM once, all compute in L1, write out to DRAM once
# Use intermediate CBs (L1) instead of intermediate tensors (DRAM)
...
Development workflow: Start with multiple simple kernels to verify correctness, then fuse into one kernel for performance.
Multi-Tile Processing and Streaming
Strive to always use grid="auto" with streaming loops:
grid="auto"- this automatically selects the grid size at compile time. Hardcoded grids are only for special cases (e.g., pipe topologies that require a fixed core count). Using grid="auto" will enable full core utilization from the get go.- Stream with loops in both compute and datamovement threads to handle arbitrary input sizes through DFBs.
- Compute tiles_per_core dynamically from tensor shape and grid size so kernels work on any input size.
Always strive to use the above patterns to ensure your kernels are flexible for any input size and fully utilize the cores available.
The exception: often for debugging or incremental development, it's helpful to start with a single core kernel; that is fine. You can start with a single core to isolate or debug a pattern, but strive to set it up in a way that it will naturally work with multiple cores later.
IMPORTANT: Match the User's Target Data Size
If the user provides a specific model config or tensor shape, strive to support that size. You can simplify to smaller tensors for initial testing and debugging, but the goal is a kernel that works on their actual data. Use loops and streaming to handle large inputs:
TILE_SIZE = 32
GRANULARITY = 4 # tiles per block dimension
@ttl.kernel(grid="auto")
def streaming_kernel(a, b, c, y):
row_tiles_per_block = GRANULARITY
col_tiles_per_block = GRANULARITY
grid_cols, grid_rows = ttl.grid_size(dims=2)
rows = a.shape[0] // TILE_SIZE // row_tiles_per_block
cols = a.shape[1] // TILE_SIZE // col_tiles_per_block
rows_per_core = -(-rows // grid_rows) # divceil
cols_per_core = -(-cols // grid_cols) # divceil
a_dfb = ttl.make_dataflow_buffer_like(a, shape=(row_tiles_per_block, col_tiles_per_block), buffer_factor=2)
b_dfb = ttl.make_dataflow_buffer_like(b, shape=(row_tiles_per_block, col_tiles_per_block), buffer_factor=2)
c_dfb = ttl.make_dataflow_buffer_like(c, shape=(row_tiles_per_block, col_tiles_per_block), buffer_factor=2)
y_dfb = ttl.make_dataflow_buffer_like(y, shape=(row_tiles_per_block, col_tiles_per_block), buffer_factor=2)
@ttl.compute()
def compute():
core_col, core_row = ttl.core(dims=2)
for local_row in range(rows_per_core):
row = core_row * rows_per_core + local_row
if row < rows:
for local_col in range(cols_per_core):
col = core_col * cols_per_core + local_col
if col < cols:
with a_dfb.wait() as a_blk, b_dfb.wait() as b_blk, c_dfb.wait() as c_blk, y_dfb.reserve() as y_blk:
y_blk.store(a_blk * b_blk + c_blk)
@ttl.datamovement()
def dm_read():
core_col, core_row = ttl.core(dims=2)
for local_row in range(rows_per_core):
row = core_row * rows_per_core + local_row
if row < rows:
sr = row * row_tiles_per_block
er = (row + 1) * row_tiles_per_block
for local_col in range(cols_per_core):
col = core_col * cols_per_core + local_col
if col < cols:
sc = col * col_tiles_per_block
ec = (col + 1) * col_tiles_per_block
with a_dfb.reserve() as blk:
tx = ttl.copy(a[sr:er, sc:ec], blk); tx.wait()
with b_dfb.reserve() as blk:
tx = ttl.copy(b[sr:er, sc:ec], blk); tx.wait()
with c_dfb.reserve() as blk:
tx = ttl.copy(c[sr:er, sc:ec], blk); tx.wait()
@ttl.datamovement()
def dm_write():
core_col, core_row = ttl.core(dims=2)
for local_row in range(rows_per_core):
row = core_row * rows_per_core + local_row
if row < rows:
sr = row * row_tiles_per_block
er = (row + 1) * row_tiles_per_block
for local_col in range(cols_per_core):
col = core_col * cols_per_core + local_col
if col < cols:
sc = col * col_tiles_per_block
ec = (col + 1) * col_tiles_per_block
with y_dfb.wait() as blk:
tx = ttl.copy(blk, y[sr:er, sc:ec]); tx.wait()
From examples/tutorial/multicore_grid_auto.py. Key patterns: grid="auto", dynamic tiles_per_core via divceil, bounds check with if row < rows.
Key streaming principles:
- DFB size is limited by L1 (~1.5MB per core) - you can't fit huge tensors
- Stream blocks through CBs - read a block, process it, write it, repeat
- Loop counts must match - compute iterations = dm_read iterations = dm_write iterations
- DRAM is large but slow - keep data in L1 as long as possible, stream to avoid DRAM round-trips
Pipes (Core-to-Core Communication)
Pipes are fully implemented in both the simulator and compiler. They enable core-to-core communication for patterns like gather, scatter, and ring exchanges. Get your kernel working without pipes first, then add them when needed for inter-core communication.
Pipe API
# Create pipes and wrap in a PipeNet
pipes = [ttl.Pipe((x, 0), ((x + 1) % N, 0)) for x in range(N)]
net = ttl.PipeNet(pipes)
# Send data through pipe (in dm_read on source core, inside a reserve block)
with dfb.reserve() as blk:
tx = ttl.copy(src[0, 0], blk); tx.wait()
def send(pipe):
xf = ttl.copy(blk, pipe); xf.wait()
net.if_src(send)
# Receive data from pipe (in dm_read on destination core)
with dfb.reserve() as blk:
def recv(pipe):
xf = ttl.copy(pipe, blk); xf.wait()
net.if_dst(recv)
Pipe Debugging Tips
- Pipes cause hangs when send/receive don't match - every
ttl.copy(blk, pipe)needs a correspondingttl.copy(pipe, blk) - Start without pipes - get independent multi-core working first, then add pipes
- Add pipes incrementally - test after adding each pipe
- See the CB Threading Rules in the Debugging section for common deadlock causes
Hardware Limits
- 32 CBs max per core
- ~1.5MB L1 per core
- ~100MB total SRAM across chip - utilize as much as possible for throughput
- Tile size: 32x32 elements = 2KB (bfloat16)
Prefer grid="auto" with streaming (shown above) over hardcoded grid sizes. See Reference Examples for complete working kernels.
Tensor Setup
Tensors must be:
- Tilized:
layout=ttnn.TILE_LAYOUT(32x32 element tiles) - Interleaved:
ttnn.DRAM_MEMORY_CONFIGorttnn.L1_MEMORY_CONFIG - bfloat16: Standard data type for Tenstorrent hardware
IMPORTANT: torch tensors will NOT work as kernel inputs.
import torch
import ttnn
device = ttnn.open_device(device_id=0)
# Create torch tensor (dimensions must be multiples of 32)
input_torch = torch.randn(64, 64, dtype=torch.bfloat16)
output_torch = torch.zeros(64, 64, dtype=torch.bfloat16)
# Convert to TTNN tensors
input_tensor = ttnn.from_torch(
input_torch,
dtype=ttnn.bfloat16,
layout=ttnn.TILE_LAYOUT,
device=device,
memory_config=ttnn.DRAM_MEMORY_CONFIG, # or ttnn.L1_MEMORY_CONFIG
)
output_tensor = ttnn.from_torch(
output_torch,
dtype=ttnn.bfloat16,
layout=ttnn.TILE_LAYOUT,
device=device,
memory_config=ttnn.DRAM_MEMORY_CONFIG,
)
# Run kernel
my_kernel(input_tensor, output_tensor)
# Read result back
result = ttnn.to_torch(output_tensor)
ttnn.close_device(device)
Semantic Mapping: Think at the Hardware Level
TT-Lang is a LOW-LEVEL DSL. Do not expect a 1:1 mapping from PyTorch ops. When translating:
Missing ops don't mean failure - If
conv2ddoesn't exist, don't stop. Think about what conv2d actually does at the hardware level.Decompose to primitives - Most "complex" operations are actually:
- Simple compute (matmul, elementwise ops)
- Complex data movement (gathering, reordering tiles)
Data movement is the magic - TT-Lang gives you full control over which tiles go where via
ttl.copy()and tensor slicing. If you can describe WHERE data needs to go, you can implement the operation.
Example: Conv2d
Conv2d seems like a "high-level op" but it's actually matmul with clever data arrangement:
What conv2d does:
- For each output position, gather a KxK window of input
- Flatten that window into a vector
- Dot product with filter weights
How to implement in TT-Lang:
- Reader kernel: Loop over output positions, DMA the KxK windows into CBs (im2col)
- Compute kernel: Just do matmul (window @ weights)
- Writer kernel: Write results back
The "conv2d" is in the data movement, not in a magic instruction.
Example: Softmax
No softmax op? Decompose it: max → shift → exp → sum → divide
# softmax(x) = exp(x - max(x)) / sum(exp(x - max(x)))
# Numerically stable version with max subtraction
with x_dfb.wait() as x, scaler_dfb.wait() as s:
# 1. Find max for numerical stability
with max_dfb.reserve() as mx:
mx.store(ttl.math.reduce_max(x, s, dims=[0, 1]))
# 2. Broadcast max back to full size
with max_dfb.wait() as mxv, bcast_dfb.reserve() as mxb:
mxb.store(ttl.math.broadcast(mxv, dims=[0, 1]))
# 3. Compute exp(x - max) and sum
with bcast_dfb.wait() as max_bcast:
shifted = x - max_bcast
exp_shifted = ttl.math.exp(shifted)
with sum_dfb.reserve() as sm:
sm.store(ttl.math.reduce_sum(exp_shifted, s, dims=[0, 1]))
# 4. Broadcast sum and divide
with sum_dfb.wait() as sumv, sum_bcast_dfb.reserve() as smb:
smb.store(ttl.math.broadcast(sumv, dims=[0, 1]))
with sum_bcast_dfb.wait() as sum_bcast, out_dfb.reserve() as o:
o.store(ttl.math.exp(x - max_bcast) / sum_bcast)
Key Principle
When you are re-writing a high level operation or kernel:
- What does this kernel or op do at a HW level? Think about what's actually happening in the HW when this op runs
- What primitives do we have? matmul, elementwise, DMA with indexing
- Build it from primitives. A naive O(n²) loop that works is better than giving up. The goal is NOT performance! Just correctness.
- This is not a high level DSL like pytorch or ttnn, it's low level and you have explicit control over all of the HW, memory management, and synchronization. Do not think about direct mappings for high level ops and kernels, think about the best way to represent the kernel in tt-lang at the level it is designed to operate.
Even ops that DO exist may have different semantics (write in place, different numerical behavior). Always test to verify.
IMPORTANT: the test runner will just execute your script as a python file. Don't overthink it. The ttlang-sim and the hw runner will just run the script as python (not pytest!) so just add a main block, open device, print/assert tensor values. The sim should have full compatibility with ttnn function for moving tensors, opening device and so on:
Below will work on both hw and sim:
if __name__ == "__main__":
device = ttnn.open_device(device_id=0)
# call test functions here
ttnn.close_device(device)
Translation Guide: GPU → TT-Lang
Concept Mapping
| GPU Concept | TT-Lang Equivalent |
|---|---|
| Thread block / workgroup | Grid of Tensix cores (grid=(rows, cols)) |
| Shared memory | L1 via dataflow buffers |
| Global memory | DRAM with DMA transfers |
| Warp/wave operations | Tile-level operations (32x32) |
__syncthreads() |
DFB wait()/push() synchronization |
| Kernel launch | Direct function call: my_kernel(a, b, c) |
CUDA/Triton → TT-Lang
Original CUDA pattern:
__global__ void add_kernel(float* a, float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
TT-Lang equivalent:
@ttl.kernel(grid=(1, 1)) # Or multicore for large tensors
def add_kernel(a, b, c):
a_dfb = ttl.make_dataflow_buffer_like(a, shape=(1, 1), buffer_factor=2)
b_dfb = ttl.make_dataflow_buffer_like(b, shape=(1, 1), buffer_factor=2)
c_dfb = ttl.make_dataflow_buffer_like(c, shape=(1, 1), buffer_factor=2)
@ttl.compute()
def compute():
with a_dfb.wait() as av, b_dfb.wait() as bv:
with c_dfb.reserve() as cv:
result = av + bv # Operates on entire 32x32 tile
cv.store(result)
@ttl.datamovement()
def dm_read():
with a_dfb.reserve() as blk:
tx = ttl.copy(a[0, 0], blk)
tx.wait()
with b_dfb.reserve() as blk:
tx = ttl.copy(b[0, 0], blk)
tx.wait()
@ttl.datamovement()
def dm_write():
with c_dfb.wait() as blk:
tx = ttl.copy(blk, c[0, 0])
tx.wait()
# Call: add_kernel(a, b, c)
PyTorch → TT-Lang
Original PyTorch:
def gelu(x):
return x * 0.5 * (1 + torch.tanh(0.7978845608 * (x + 0.044715 * x**3)))
TT-Lang equivalent:
@ttl.kernel(grid=(1, 1))
def gelu_kernel(x, out):
x_dfb = ttl.make_dataflow_buffer_like(x, shape=(1, 1), buffer_factor=2)
out_dfb = ttl.make_dataflow_buffer_like(out, shape=(1, 1), buffer_factor=2)
@ttl.compute()
def compute():
with x_dfb.wait() as xv:
with out_dfb.reserve() as o:
# Decompose GELU into available ops
x3 = xv * xv * xv
inner = xv + x3 * 0.044715 # Need scale tensor for constants
# ... continue decomposition
o.store(result)
# ... dm_read, dm_write ...
Note: For scalar constants like 0.5, create a full tile tensor:
scale_torch = torch.full((32, 32), 0.5, dtype=torch.bfloat16)
scale = ttnn.from_torch(scale_torch, ...)
Using TTNN to Fill Gaps
If an operation isn't available in TT-Lang, you can use TTNN ops for:
- Input preprocessing (reshaping, padding, layout conversion)
- Operations not yet supported in TT-Lang
- Output post-processing
Example: Using TTNN for padding
# TT-Lang requires tile-aligned dimensions (multiples of 32)
# Use TTNN to pad inputs that aren't tile-aligned
input_torch = torch.randn(100, 50) # Not tile-aligned
# Pad to 128x64 (multiples of 32)
padded = ttnn.pad(input_tensor, padding=((0, 28), (0, 14)), value=0.0)
# Run TT-Lang kernel on padded input
my_kernel(padded, output_tensor)
# Slice result back to original size if needed
result = ttnn.slice(output_tensor, [0, 0], [100, 50])
Rule of thumb:
- Try to implement in TT-Lang first
- Use TTNN for preprocessing (padding, reshaping) and postprocessing (slicing)
- The bulk of computation should be in TT-Lang for fusion benefits
Iteration Workflow (REQUIRED)
You MUST test every kernel you write. The workflow has two phases:
Phase 1: Iterate with the Functional Simulator (default)
The functional simulator (ttlang-sim) is the primary development tool. It catches DFB mismatches, shape errors, type errors, and functional bugs via dynamic analysis. Use it for all iteration.
1. Write kernel to file
2. Run:
# Via run-test.sh:
run-test.sh /path/to/kernel.py
# Or directly:
python /path/to/kernel.py
3. Read log output (or tail -100 /tmp/ttlang_test_output.log if using remote tools)
4. If errors: fix and go to step 2
5. If success: verify numerical output is correct
Phase 2: Validate on Real Hardware
Once the kernel passes in the simulator, do a final hardware run:
# Via run-test.sh:
run-test.sh --hw /path/to/kernel.py
# Or directly (on a machine with HW access):
python /path/to/kernel.py
NOTE: it is possible that the sim and hw diverge which may require you to either use --hw early or iterate on a program that passes in the sim but not on HW. If your program works with the sim but not on HW you can use the same iteration flow from phase 1 to debug (you may need to isolate patterns and iterate). You can also ask the user for guidance, they may care more about HW or sim working.
When to use --hw early: If the simulator has a bug or is overly conservative for your use case, you can bypass it with --hw at any point. But prefer the simulator for iteration since it gives better error diagnostics.
IMPORTANT:
- Exit code 0 does NOT mean success - always read the log
- The log can be thousands of lines - use
tail,head, orgrepto filter (e.g.,tail -100 /tmp/ttlang_test_output.logorgrep "pattern" /tmp/ttlang_test_output.log) - Look for:
AssertionError,Exception,error:,FAIL,mismatch - Never guess at fixes - always read the actual error message
- IMPORTANT: Set a low timeout for faster iteration - tests should execute in under 1 second. Hangs are common (especially with pipes or DFB mismatches) and a low timeout helps detect them quickly.
Handling Hangs:
- If a kernel hangs, the most common cause is DFB mismatch - every
wait()needs a correspondingpush()from producer, everyreserve()needs a correspondingpop()from consumer - Verify loop counts match between compute and datamovement threads
- Kill zombie processes:
pkill -9 python(or viaremote-run.sh pkill -9 pythonif using remote tools)
Compiler Errors: Workaround or Exit Early
Your goal is NOT to debug the compiler. If you hit an MLIR error or miscompile:
First: Try a workaround
- Restructure the kernel differently
- Use a different op combination
- Split into multiple simpler kernels
- Use TTNN for the problematic operation
If no workaround exists: Exit early
- Report the error clearly to the user
- Include the MLIR snippet that fails (from
/tmp/ttlang_initial.mliror/tmp/ttlang_final.mlir) - Describe what you tried
- Do NOT spend time investigating compiler internals
Signs of a compiler bug (not your fault):
- MLIR verification errors
- Assertion failures in passes
- Segfaults during compilation
- Generated code that doesn't match the input semantics
Low-Level DSL: Test Everything
This is NOT PyTorch. TT-Lang is a low-level DSL where you directly control memory management and synchronization. Operations may have unexpected semantics:
- Ops might write in place
- Ops might take dataflow buffers as arguments
- Ops might have different numerical behavior than PyTorch equivalents
- Memory layouts matter (tilized, interleaved, etc.)
Do not assume PyTorch semantics. If you're unsure how an op behaves, TEST IT.
Debug Strategy: Isolate and Print
You cannot print or assert inside kernels. Instead:
- Test ops in isolation - Write a minimal kernel with just one op
- Print tensors before/after - Use
print(ttnn.to_torch(tensor))after the kernel runs - Compare against expected - Compute the expected result in PyTorch and compare
- Build up incrementally - Once one op works, add the next
# Example: Testing an op in isolation
@ttl.kernel(grid=(1, 1))
def test_single_op(inp, out):
inp_dfb = ttl.make_dataflow_buffer_like(inp, shape=(1, 1), buffer_factor=2)
out_dfb = ttl.make_dataflow_buffer_like(out, shape=(1, 1), buffer_factor=2)
@ttl.compute()
def compute():
with inp_dfb.wait() as x:
with out_dfb.reserve() as o:
result = ttl.math.exp(x) # Test just this one op
o.store(result)
# ... dm_read, dm_write ...
# After running:
print("Input:", ttnn.to_torch(inp_tensor))
print("Output:", ttnn.to_torch(out_tensor))
print("Expected:", torch.exp(inp_torch))
Iterate as much as you need. There is no limit on test runs. If behavior is unexpected, simplify further until you understand what's happening.
Debugging Tips
- Start in isolation: Test one op at a time before combining
- Print tensors: Always print input/output to verify behavior
- Check shapes: All dimensions must be multiples of 32
- Verify DFB balance: Every
wait()needspop(), everyreserve()needspush() - Read the log: Always check
/tmp/ttlang_test_output.logafter each run - Check MLIR: Use
/tmp/ttlang_initial.mlirand/tmp/ttlang_final.mlirfor compiler issues
Debug Printing (dprint)
See TTLangSpecification.md, Section 10.2 for using print inside thread functions to inspect tensors, blocks, and dataflow buffer state.
CB Threading Rules (Deadlock Debugging)
Each DFB has exactly one producer (reserve+push) and one consumer (wait+pop). The three threads (dm_read, compute, dm_write) all start simultaneously and run until they block.
- Rule 1: One producer, one consumer per DFB. A DFB flows between two threads (dm_read->compute or compute->dm_write) or is thread-local (compute->compute).
- Rule 2: A DFB cannot have two producers. If dm_read reserves on a DFB, compute CANNOT also reserve on it. Violation causes interleaved data or deadlock.
- Rule 3: Thread-local accumulators must be initialized in compute, not DM. The first iteration uses
reserve()with an initial value; subsequent iterations usewait()+reserve()self-cycle. - Rule 4: Check every DFB appears in exactly two threads (or one if local). For each DFB, list which threads call
reserve()(producer) andwait()(consumer). If any DFB hasreserve()in two different threads, that's a bug.
If a kernel deadlocks, check for DFBs that have reserve() in both dm_read and compute. That's the most common cause.