benchmark

star 3

Run matmul performance benchmarks. Use when user wants to measure TFLOPS, compare kernel performance, or verify correctness after code changes.

m96-chan

By m96-chan schedule Updated 12/30/2025

play_arrow Run Skill in Manus View GitHub

name: benchmark description: Run matmul performance benchmarks. Use when user wants to measure TFLOPS, compare kernel performance, or verify correctness after code changes.

Benchmark PyGPUkit

Run comprehensive matmul benchmarks for all supported dtypes.

Usage

# Full benchmark
python scripts/benchmark.py

# Quick mode (fewer iterations)
python scripts/benchmark.py --quick

# Specific sizes
python scripts/benchmark.py --sizes 4096,8192

# TF32 kernel version
python scripts/benchmark.py --tf32-version v2

Options

--sizes: Comma-separated matrix sizes (default: 2048,4096,8192)
--quick: Fewer warmup/iterations for faster results
--dtypes: Which dtypes to test (default: fp32,tf32,fp16,bf16)
--tf32-version: v1 (WMMA) or v2 (PTX mma.sync, default)

Instructions

Ensure the project is built before benchmarking
Run python scripts/benchmark.py [options]
Report the performance results including:
- TFLOPS for each dtype and size
- Correctness verification (PASS/FAIL)
- Comparison with theoretical peak

Expected Results (RTX 5090)

Dtype	Target TFLOPS
FP32	~18
TF32	~27
FP16	~15
BF16	~15

Environment Variables

PYGPUKIT_ALLOW_TF32=1: Enable TF32 TensorCore
PYGPUKIT_TF32_V2=1: Use PTX mma.sync kernel

Install via CLI

npx skills add https://github.com/m96-chan/PyGPUkit --skill benchmark

Repository Details

star Stars 3

call_split Forks 0

navigation Branch main

article Path SKILL.md

More from Creator

m96-chan

m96-chan Explore all skills →