name: ttnn
description: TTNN operations library reference for Tenstorrent hardware. Covers tensor APIs, ops catalog, model conversion from PyTorch, and memory/layout configuration.
External Resources
- TTNN Documentation
- TT-Metal Repository
- TTNN API Reference -- full operation catalog (~400+ ops organized by category)
- Tensor Reference -- shapes, layouts, memory configs, data types
- Converting PyTorch Models to TTNN -- step-by-step conversion guide
- Multi-Device Programming -- MeshDevice, tensor parallelism, data parallelism, CCL ops
- Tensor Sharding -- height, width, and block sharding strategies
Multi-Device
TTNN natively supports multi-chip execution via the MeshDevice abstraction. See multi_device.md for full details.
# Single device
device = ttnn.open_device(device_id=0, trace_region_size=100000000)
# Multi-device mesh (e.g., 4 chips in a row)
ttnn.set_fabric_config(ttnn.FabricConfig.FABRIC_1D)
mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, N_CHIPS),
trace_region_size=100000000)
# Replicate a tensor to all devices
x = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
device=mesh_device,
mesh_mapper=ttnn.ReplicateTensorToMesh(mesh_device))
# Shard a tensor across devices along a dimension (tensor parallelism)
w = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
device=mesh_device,
mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=1))
# Read back sharded results by concatenating
result = ttnn.to_torch(w, mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1))
# Tensor parallel matmul pattern: column parallel + row parallel + all_reduce
col_out = ttnn.matmul(x_replicated, w_col_sharded) # shard W along dim=1
row_out = ttnn.matmul(col_out, w_row_sharded) # shard W along dim=0
reduced = ttnn.all_reduce(row_out) # sync across chips
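The column-parallel + row-parallel pattern works because the per-chip products are partial sums of the full result. The sketch below checks that identity in plain Python on two simulated "chips". It makes no TTNN calls: the helper names (`matmul`, `split_cols`, `split_rows`, `all_reduce_sum`) are illustrative stand-ins, and lists of lists stand in for per-device tensors.

```python
# Pure-Python simulation of column-parallel + row-parallel matmul.
# Nothing here touches TTNN; nested lists stand in for per-chip tensors.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_cols(m, parts):
    """Shard a matrix along dim=1 (columns) into equal parts."""
    step = len(m[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Shard a matrix along dim=0 (rows) into equal parts."""
    step = len(m) // parts
    return [m[p * step:(p + 1) * step] for p in range(parts)]

def all_reduce_sum(partials):
    """Element-wise sum of the per-chip partial results."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

N_CHIPS = 2
x = [[1, 2, 3, 4], [5, 6, 7, 8]]                                # replicated, 2x4
w1 = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1], [1, 2, 1, 0]]   # 4x4, sharded on dim=1
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]                           # 4x2, sharded on dim=0

w1_shards = split_cols(w1, N_CHIPS)
w2_shards = split_rows(w2, N_CHIPS)

# Each chip multiplies through both layers locally; the only
# cross-chip step is the final sum.
partials = [matmul(matmul(x, w1_shards[c]), w2_shards[c]) for c in range(N_CHIPS)]
tp_result = all_reduce_sum(partials)

# Unsharded reference computed on a single "device".
reference = matmul(matmul(x, w1), w2)
print(tp_result == reference)
```

Because the column-sharded output of the first matmul already lines up with the row shards of the second weight, no communication is needed between the two matmuls.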
Sharding
Tensor sharding distributes data across cores for locality and reduced communication. See tensor_sharding.md for height, width, and block sharding strategies.
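To get a feel for how the three strategies divide a 2D tensor over a core grid, the following sketch computes the per-core shard shape. It is plain arithmetic, not a TTNN call: `shard_shape` and its parameters are hypothetical names, and it assumes an even split that rounds up. Real shard shapes may carry extra constraints such as tile alignment, so check tensor_sharding.md before relying on these numbers.

```python
import math

def shard_shape(height, width, grid_rows, grid_cols, strategy):
    """Return the (rows, cols) slice each core holds for a 2D tensor.

    Illustrative only: assumes an even split, rounding up when the
    dimension does not divide exactly.
    """
    num_cores = grid_rows * grid_cols
    if strategy == "height":   # rows split over all cores, full width kept
        return (math.ceil(height / num_cores), width)
    if strategy == "width":    # columns split over all cores, full height kept
        return (height, math.ceil(width / num_cores))
    if strategy == "block":    # rows over grid rows, columns over grid columns
        return (math.ceil(height / grid_rows), math.ceil(width / grid_cols))
    raise ValueError(f"unknown strategy: {strategy}")

for strategy in ("height", "width", "block"):
    print(strategy, shard_shape(1024, 512, grid_rows=4, grid_cols=2, strategy=strategy))
```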
Custom Program Sizes
Large fused kernels can exceed the default kernel config buffer limit (~69KB). The fix is to pass a smaller worker_l1_size when opening the device, which trades user L1 (used for circular buffers and other buffers) for more kernel config space.
# Get the default worker L1 size
default_size = ttnn.device.get_max_worker_l1_unreserved_size()
# Subtract enough for your kernel's config buffer needs
# e.g., fused kernel is ~85KB, so give 88KB (90112 bytes) more config space
device = ttnn.open_device(device_id=0, worker_l1_size=default_size - 90112)
The tradeoff: slightly less L1 available for tile buffers. Start with a small reduction (e.g., 8192) and increase if you still hit the config buffer limit.
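The "start small and increase" advice can be automated as a simple search. The sketch below is one way to do it, not part of TTNN: `find_worker_l1_size` and the `try_open` callback are hypothetical. On hardware, `try_open` would open the device with the candidate worker_l1_size, run the kernel, and report whether it fit. Here a stand-in callback and a placeholder default size are used so the logic runs without a device.

```python
def find_worker_l1_size(default_size, try_open, step=8192, max_reduction=131072):
    """Return the largest worker_l1_size that try_open accepts.

    try_open is a hypothetical callback: it receives a candidate
    worker_l1_size and returns True if the kernel fits at that size.
    """
    reduction = step
    while reduction <= max_reduction:
        candidate = default_size - reduction
        if try_open(candidate):
            return candidate
        reduction += step
    raise RuntimeError("kernel does not fit within max_reduction")

# Stand-in for real hardware: pretend the kernel needs 16 KB of extra
# config space. DEFAULT_SIZE is a placeholder, not a real device value.
DEFAULT_SIZE = 1_400_000

def fits(size):
    return DEFAULT_SIZE - size >= 16 * 1024

chosen = find_worker_l1_size(DEFAULT_SIZE, fits)
print(DEFAULT_SIZE - chosen)  # bytes given up to kernel config space
```

Stepping up from the smallest reduction keeps as much L1 as possible available for tile buffers.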
Tracing
TTNN supports captured traces for eliminating host overhead in hot loops. See the tt-enable-tracing skill for setup and usage.
Looking Up Op Documentation
Find the op name in api.rst, then fetch its full documentation:
curl https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api/ttnn.<OP>.html
For example: api/ttnn.conv2d.html, api/ttnn.matmul.html, api/ttnn.softmax.html.
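If you look up ops often, a small helper can build the URL from the op name. This is a convenience sketch that follows the URL pattern shown above; `op_doc_url` is a hypothetical name, and the function only formats a string without checking that the page exists.

```python
DOCS_BASE = "https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api"

def op_doc_url(op_name):
    """Build the docs URL for an op given as 'matmul' or 'ttnn.matmul'."""
    name = op_name.strip()
    if name.startswith("ttnn."):
        name = name[len("ttnn."):]
    if not name:
        raise ValueError("empty op name")
    return f"{DOCS_BASE}/ttnn.{name}.html"

for op in ("conv2d", "ttnn.matmul", "softmax"):
    print(op_doc_url(op))
```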
Output Tensors and Scratch Memory
Most TTNN ops accept an output_tensor or optional_output_tensor parameter that lets you write the result into a pre-allocated tensor instead of allocating a new one. This is useful for:
- Performance: avoids repeated allocation/deallocation overhead
- Tracing: required for pre-allocating all tensors before trace capture
- Scratch buffers: reuse the same tensor across ops or loop iterations
# Pre-allocate a scratch tensor
scratch = ttnn.zeros_like(x, device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG)
# Reuse it across ops
ttnn.relu(x, output_tensor=scratch)
ttnn.add(scratch, bias, output_tensor=scratch)
Look up individual ops in the API reference to check whether they support output_tensor.
Overview
TTNN is the high-level operations library for Tenstorrent hardware. It provides a PyTorch-like API for tensor creation, manipulation, and computation on TT devices. TTNN ops run individually (one kernel launch per op call). For fusing multiple ops into a single kernel, use TT-Lang.
Key Concepts
- Tensors must be moved to device before computation: ttnn.to_device(tensor, device)
- Layouts: ttnn.ROW_MAJOR_LAYOUT or ttnn.TILE_LAYOUT (32x32 tiles, required for most compute ops)
- Memory configs: ttnn.DRAM_MEMORY_CONFIG (default, large) or ttnn.L1_MEMORY_CONFIG (fast, limited ~1.5MB/core)
- Data types: ttnn.bfloat16 (standard), ttnn.float32, ttnn.bfloat8_b, ttnn.uint32
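The 32x32 tile size has a practical consequence for memory budgeting. The sketch below estimates the tile-padded shape and byte footprint of a tensor. It is plain arithmetic with hypothetical helper names, and it assumes that tile layout pads the last two dimensions up to the next multiple of 32. bfloat8_b is left out because it is a block format without a simple per-element byte count.

```python
TILE = 32  # tile edge length, from the 32x32 tiles noted above
BYTES_PER_ELEMENT = {"bfloat16": 2, "float32": 4, "uint32": 4}

def round_up(n, multiple=TILE):
    """Round n up to the next multiple."""
    return -(-n // multiple) * multiple

def tile_padded_shape(shape):
    """Pad the last two dims to tile multiples (assumed tile-layout behavior)."""
    *batch, h, w = shape
    return (*batch, round_up(h), round_up(w))

def tensor_bytes(shape, dtype):
    """Estimated footprint of a tile-layout tensor, including padding."""
    total = 1
    for dim in tile_padded_shape(shape):
        total *= dim
    return total * BYTES_PER_ELEMENT[dtype]

print(tile_padded_shape((1, 1, 50, 70)))
print(tensor_bytes((1, 1, 64, 64), "bfloat16"))
```

Comparing the estimate against the ~1.5MB/core L1 figure gives a quick first check on whether a tensor is a candidate for ttnn.L1_MEMORY_CONFIG.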
Common Patterns
import torch
import ttnn
device = ttnn.open_device(device_id=0)
# Torch -> TTNN
x_torch = torch.randn(1, 1, 64, 64, dtype=torch.bfloat16)
x = ttnn.from_torch(x_torch, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG)
# Compute
y = ttnn.relu(x)
y = ttnn.matmul(x, x)  # x is 64x64, so x @ x is shape-compatible
y = ttnn.softmax(x, dim=-1)
# TTNN -> Torch
result = ttnn.to_torch(y)
ttnn.close_device(device)