name: profile-analyzer description: Analyze Sirius GPU performance from nsys profiles — runs benchmarks, generates reports with kernel occupancy, memory bandwidth, operator attribution, and compares runs for regression detection.
Sirius nsys Profile Analyzer
You are analyzing GPU performance profiles for Sirius, a GPU-accelerated SQL query engine built on DuckDB. The profiles come from NVIDIA Nsight Systems (nsys) and are stored as SQLite databases.
Profiling Overhead Warning
nsys profiling adds measurable overhead to query execution times. Timings captured during a profiled run (cold/hot in summary.json) are inflated and should NOT be used to determine whether an optimization actually improved performance. Instead:
- Profiled runs → Use for GPU analysis (kernels, operators, occupancy, memory, bottlenecks)
- Non-profiled runs → Use for accurate performance timing (cold/hot comparisons, regression detection)
Always run both when comparing performance across code changes.
Workflows
There are four workflows — choose based on what the user wants:
Workflow A: Full Performance Analysis (recommended)
This is the complete workflow: profile for GPU analysis, then run without profiling for accurate timings. Both runs should use the same queries, scale factor, and iteration count.
Step 1: Profiled run (for GPU analysis data)
bash test/tpch_performance/nsys_report.sh --sf <scale_factor> [query_numbers...]
Step 2: Non-profiled timing run (for accurate cold/hot times)
# Option A: Sirius-only timing
export SIRIUS_CONFIG_FILE=<path_to_config>
bash test/tpch_performance/run_tpch_parquet.sh sirius <scale_factor> <iterations> <query_numbers...>
# Option B: Full DuckDB vs Sirius benchmark with validation
export SIRIUS_CONFIG_FILE=<path_to_config>
bash test/tpch_performance/benchmark_and_validate.sh <scale_factor> <iterations>
The non-profiled run produces per-query timings.csv files with accurate cold/hot timings. benchmark_and_validate.sh also validates GPU results against CPU and produces a comparison table with speedup ratios.
When comparing across runs:
- Use the non-profiled timings to determine if performance actually improved or regressed
- Use the profiled data to understand why performance changed (kernel times, operator attribution, occupancy, memory patterns)
- The profiled
summary.jsontimings are useful for relative comparisons within the same profiled run (e.g., which query is slowest) but not across runs with different code
Full options for the profiled run:
bash test/tpch_performance/nsys_report.sh \
--sf <scale_factor> \
--output-dir ./reports \
--label <custom_name> \
--iterations 4 \
--query-timeout 120 \
--compare <baseline_report_dir> \
[query_numbers...]
Output directory structure (profiled):
reports/<label>_<YYYYMMDD_HHMMSS>/
report.md - Human-readable analysis with all metrics
summary.json - Machine-readable per-query metrics (profiled timings — use for analysis, not perf comparison)
metadata.json - Hardware, git commit, config, driver version
comparison.md - (if --compare used) Regression/improvement analysis
profiles/ - All raw artifacts
q1.sqlite, q1.nsys-rep, q1_timings.csv, q1_result.txt, ...
summary.txt
Output from non-profiled run:
# run_tpch_parquet.sh:
timings_sirius_sf<SF>_q<N>.csv - Per-query cold/hot timings (accurate, no profiling overhead)
# benchmark_and_validate.sh:
runs/<timestamp>_sf<SF>_<N>iter/
comparison.txt - DuckDB vs Sirius timing table with speedup ratios
timings.csv - Combined long-format timings (engine, query, iteration, runtime_s)
validation.csv - Per-query result match status (success/validation/error)
sirius/q<N>/timings.csv - Per-query Sirius timings
duckdb/q<N>/timings.csv - Per-query DuckDB timings
Workflow B: Generate Report from Existing Profiles
bash test/tpch_performance/nsys_report.sh --profile-dir <path_to_profiles>
Workflow C: Quick Analysis (no archival)
Use nsys_analyze.sh directly for quick, one-off analysis without creating a report directory.
bash test/tpch_performance/nsys_analyze.sh <path_to_profiles_or_file> [query_numbers...]
Workflow D: Compare Two Existing Reports
bash test/tpch_performance/nsys_compare.sh <baseline_report_dir> <current_report_dir> [--threshold PCT]
Default threshold is 10%. Values beyond the threshold are flagged:
- REGRESSION: current is >threshold% slower
- IMPROVED: current is >threshold% faster
- FIXED: query failed in baseline but passes now
- BROKEN: query passed in baseline but fails now
Important: The timings in summary.json (used by nsys_compare.sh) are from profiled runs and include nsys overhead. These comparisons are useful for spotting large changes but should be validated with non-profiled timing runs before concluding a real regression or improvement exists. For definitive performance comparison, compare the non-profiled timings.csv files from run_tpch_parquet.sh or benchmark_and_validate.sh.
Before Running
- Ask the user for any paths you don't know. Do NOT assume paths.
- For profiling (Workflow A without
--profile-dir), the Sirius config must be set:export SIRIUS_CONFIG_FILE=<path_to_config> - All paths in
profile_tpch_nsys.share configurable via env vars (DUCKDB,PARQUET_DIR,QUERY_DIR,OUTPUT_DIR,ITERATIONS,QUERY_TIMEOUT).
Analysis Sections
The report contains these sections per query:
All analysis is scoped to the query execution window — the time span from the first Sirius operator start to the last operator end. Init overhead (CUDA context creation, cudaHostAlloc, cuFile init) and cleanup (cudaFreeHost, pool destruction) are excluded from the main metrics and shown separately.
| Section | What it Shows |
|---|---|
| Execution Time Breakdown | Trace duration vs query execution time vs init vs cleanup |
| GPU Hardware | GPU model, SM count, VRAM, compute capability |
| NVTX Domain Summary | Time breakdown across software layers (Sirius, libkvikio, libcudf, CCCL, cuFile) |
| Sirius Physical Operators | Per-operator call counts, wall times, percentages |
| Top GPU Kernels | Hottest kernels by total GPU time |
| Kernel Occupancy Estimation | Theoretical SM occupancy per kernel, limiting factor (registers/shared_mem/warps) |
| Register Spill Analysis | Kernels using local memory (register spilling to slow memory) |
| GPU Kernel Time Summary | Aggregate kernel stats, stream count, device count |
| GPU Utilization Overview | Kernel time as % of query execution time (not total trace) |
| Memory Transfer Breakdown | H2D/D2H/D2D with Pageable vs Pinned src/dst, bandwidth in GB/s |
| CUDA Runtime API Hotspots | Slowest CUDA API calls during query execution only |
| Host Memory Allocation During Query | Only alloc calls during runtime (init allocs excluded) |
| Init/Cleanup Overhead | What was excluded — cudaHostAlloc, cudaFreeHost, context creation, etc. |
| GPU Kernel Attribution | Maps GPU kernel time back to Sirius operators via correlation chain |
| Top Kernels per Operator | Which specific kernels each operator launches |
| GPU Stream Utilization | Per-stream busy% = kernel_time / stream_active_span |
| Synchronization Analysis | GPU sync wait times by type |
| NVTX Operations by Domain | Top operations per software layer |
| Cross-Query Comparison | (multi-file only) Side-by-side query overview |
Architecture Context
Sirius Execution Model
Sirius intercepts DuckDB query plans and offloads them to GPU via cuDF:
- DuckDB parses SQL and creates a logical plan
- Sirius converts it to a physical GPU plan with operators like
sirius_physical_table_scan,sirius_physical_hash_join, etc. - cuDF (libcudf) provides GPU-accelerated DataFrame primitives
- CCCL (CUB/Thrust) provides GPU algorithm primitives underneath cuDF
- CUDA kernels execute on the GPU
NVTX Domain Hierarchy
- Domain 0 (Sirius): Physical operator execution (
sirius_physical_*::execute/sink) - Domain 1 (libkvikio): GPU-Direct Storage I/O operations (
posix_host_read,task) - Domain 2 (cuFile): cuFile handle management
- Domain 3 (libcudf): cuDF operations (
aggregate,binary_operation,materialize_all_columns, etc.) - Domain 4 (CCCL): CUB/Thrust primitives (
DeviceFor::ForEachN,thrust::transform, etc.)
Key Relationships
- Operator -> Kernel attribution: Correlates via
CUPTI_ACTIVITY_KIND_RUNTIME.correlationId(links runtime API calls to kernels; runtime call timestamps fall within NVTX operator ranges) - Cold vs Hot: First iteration is "cold" (I/O, JIT). Subsequent iterations are "hot" (cached).
- Multi-stream: Sirius uses 30-40 CUDA streams for parallelism
Sirius Physical Operators
| Operator | Purpose |
|---|---|
table_scan |
Read parquet files via cuDF |
projection |
Evaluate column expressions |
filter |
Row filtering |
hash_join |
Hash-based join |
grouped_aggregate |
Group-by aggregation |
grouped_aggregate_merge |
Merge partial aggregates across partitions |
partition |
Data partitioning |
order |
ORDER BY |
merge_sort |
Merge sorted partitions |
sort_partition / sort_sample |
Sort-based repartitioning |
top_n / top_n_merge |
LIMIT processing |
concat |
Concatenation |
materialized_collector / result_collector |
Final result materialization |
Interpretation Guide
Occupancy
- 100%: Maximum warps per SM. Block size and register usage fit perfectly.
- 50-100%: Generally acceptable. Check if the limiter can be relaxed.
- Below 50%: Potential bottleneck. Check the
limitercolumn:registers: Kernel uses too many registers per thread. Compiler flag--maxrregcountcould help, or algorithmic changes to reduce register pressure.shared_mem: Shared memory per block limits active blocks. Theshmem_bcolumn shows the driver-allocated amount (aligned to SM partition granularity, often 16KB/32KB/64KB/102KB). Reducing shared memory usage or using dynamic allocation may help.warps: Block is too large relative to SM warp capacity.hw_limit: Hit the max blocks per SM limit (24 on Ada Lovelace).
- Note: Low occupancy doesn't always mean poor performance — compute-bound kernels can achieve peak throughput at low occupancy if they have high ILP (instruction-level parallelism).
Bandwidth
- Pinned H2D: Expect 12-14 GB/s on PCIe 4.0 x16. Below 8 GB/s suggests contention or small transfers.
- Pageable H2D: Much slower (~0.05 GB/s). Indicates un-pinned host allocations — a major performance issue if significant data goes through this path.
- D2D: Internal GPU memory shuffles. Should achieve 100-400 GB/s depending on transfer patterns.
- D2H: Usually small volumes (query results). Bandwidth similar to H2D.
GPU Utilization
kernel_pct_of_query: Kernel time / query execution span (excludes init/cleanup). Values of 40-60% are typical — the remainder is CPU orchestration, sync waits, and memcpy. Below 30% suggests the GPU is starved.kernel_pct_of_ops: Kernel time / Sirius operator time. Values >100% indicate GPU kernels overlap with CPU operator orchestration (normal with async execution). Low values suggest CPU-side bottlenecks within operators.
Host Memory Overhead
- The analysis separates init-time allocations from query-runtime allocations.
cudaHostAllocduring init (10+ seconds) is a one-time cost — it's shown in the Init/Cleanup Overhead section.- If
cudaHostAllocappears in the "During Query Execution" section, that's a performance issue — synchronous allocation during active queries stalls the pipeline. cudaStreamSynchronizeis typically the dominant cost during query execution — it represents time the CPU waits for GPU work to complete.
Register Spill
local_bytes_per_thread > 0means the kernel exceeded the register file and is spilling to local memory (which actually resides in global/L2 memory — much slower).- This is a red flag for performance-critical kernels. Solutions: reduce register usage, simplify kernel logic, or use
__launch_bounds__to guide the compiler.
What to Look For
- Where is query time going? Compare operator time vs query execution span. Large gaps = sync waits, memcpy, or CPU orchestration overhead.
- GPU utilization: Low kernel_pct_of_query (<30%) = GPU underutilized during query execution (CPU-bound, sync-bound, or I/O-bound).
- Occupancy hotspots: Kernels with <50% occupancy that consume significant GPU time.
- Memory bandwidth: Pageable transfers are orders of magnitude slower than pinned.
- Host allocation cost: cudaHostAlloc often dominates CUDA API time.
- Cold vs Hot delta: Large gap = I/O/JIT overhead. Small gap = compute-dominated. Use non-profiled timings for accurate cold/hot comparison.
- Stream utilization: Low busy% with many streams = fine-grained parallelism. High busy% on few streams = load imbalance.
- Register spill: Any kernel with local_bytes_per_thread > 0 is spilling.
- Operator attribution: Which operators consume the most GPU time? Focus optimization here.
- OOM failures: Queries failing with
std::bad_allocneed memory optimization. - Profiling vs actual performance: Always validate profiled timing changes with non-profiled runs. nsys overhead can mask or exaggerate real performance differences.
Output Format
Always present findings in a structured way:
- Start with a high-level summary (pass/fail, total times, report location)
- Identify the top 3-5 bottlenecks
- For each bottleneck, explain what it means and potential causes
- Compare cold vs hot when relevant — clearly label whether timings are from profiled or non-profiled runs
- When comparing reports, highlight regressions and improvements but note that profiled timings include nsys overhead — recommend validating with non-profiled runs if not already done
- When analyzing a single query deeply, walk through the execution timeline: I/O -> operators -> kernels -> output
- When presenting performance conclusions, always distinguish between profiled timings (for analysis) and non-profiled timings (for actual performance measurement)