optimization-advisor - SKILL.md Agent Skill

name: optimization-advisor description: > Use this skill to find exactly which source code to optimize for better GPU performance. Maps nsys profile hotspots to specific Sirius source files and functions, classifies bottlenecks as GPU-bound, CPU-bound, or sync-bound, and recommends actionable code changes. Trigger when the user wants to know what to optimize, where to focus coding effort, or wants source-level optimization guidance. This skill focuses on actionable source code targets — for generating performance reports and measurements, use profile-analyzer instead.

Sirius Code Optimization Advisor

You are identifying optimization targets in Sirius, a GPU-accelerated SQL query engine built on DuckDB. You analyze nsys profile data and map hotspots back to specific source code functions so developers know exactly where to focus.

Profiling Overhead Warning

nsys profiling adds measurable overhead to query execution times. When the user wants to validate that an optimization actually improved performance, always recommend running queries both WITH and WITHOUT profiling:

Profiled run (nsys_report.sh / profile_tpch_nsys.sh) → Provides GPU analysis data (kernels, operators, occupancy, memory) to understand why performance changed
Non-profiled run (run_tpch_parquet.sh or benchmark_and_validate.sh) → Provides accurate cold/hot timings to confirm the optimization actually helped

Never claim an optimization improved or regressed performance based solely on profiled timings. The profiled data tells you what changed internally; the non-profiled timings tell you whether it actually got faster.

STEP 1: Confirm Profiles with the User (MANDATORY)

Before running any analysis, you MUST confirm with the user which profiles to analyze. Do NOT proceed to STEP 2 until the user has explicitly confirmed.

List available profiles. Run this to show the user what's available:
```
ls -d reports/*/
```
Present the options to the user. Show them the available report directories and ask:
- Which report directory (or profile path) do they want to analyze?
- Do they want all queries, or specific ones (e.g., 1 3 10)?
- Or do they want to generate new profiles first?
Wait for explicit confirmation. Do NOT run nsys_hotspots.sh until the user has told you which profiles and which queries to use. Even if there is only one report directory, confirm it with the user first.

STEP 2: Run the Analysis

Once the user has told you what to analyze, run nsys_hotspots.sh:

# Analyze all queries in a profile directory
bash test/tpch_performance/nsys_hotspots.sh <path_to_profiles>

# Analyze specific queries
bash test/tpch_performance/nsys_hotspots.sh <path_to_profiles> 1 3 10

# Analyze a single query
bash test/tpch_performance/nsys_hotspots.sh <path_to_q1.sqlite>

# If the user gave a report directory, use its profiles/ subdirectory
bash test/tpch_performance/nsys_hotspots.sh reports/<label>/profiles/

# Save the report
bash test/tpch_performance/nsys_hotspots.sh <path_to_profiles> > optimization_guide.md

Generating New Profiles (if requested)

If the user wants fresh profiles instead of existing ones:

# Step 1: Profile (requires SIRIUS_CONFIG_FILE)
bash test/tpch_performance/nsys_report.sh --sf <scale_factor>

# Step 2: Analyze the generated profiles
bash test/tpch_performance/nsys_hotspots.sh reports/<generated_dir>/profiles/

Validating Optimizations (non-profiled timing run)

After identifying and implementing an optimization, run queries WITHOUT profiling to get accurate timings:

# Sirius-only timing (accurate cold/hot without nsys overhead)
export SIRIUS_CONFIG_FILE=<path_to_config>
bash test/tpch_performance/run_tpch_parquet.sh sirius <scale_factor> <iterations> <query_numbers...>

# Full DuckDB vs Sirius benchmark with result validation
bash test/tpch_performance/benchmark_and_validate.sh <scale_factor> <iterations>

Compare these non-profiled timings against a previous non-profiled baseline to confirm the optimization actually improved wall-clock performance. Then run a new profiled analysis to understand what changed internally.

Analysis Sections

The report contains these sections per query:

Section	What It Answers
Source File Mapping	Maps every NVTX operator name to its source file path
Hottest Operators (wall time)	Which functions consume the most elapsed time?
Hottest Operators (GPU time)	Which functions consume the most GPU kernel time?
Operator Efficiency	Wall time vs GPU time — identifies CPU-bound vs GPU-bound bottlenecks
Top Kernels per Operator	Which GPU kernels dominate each operator?
Occupancy Bottlenecks	Kernels with <50% occupancy, grouped by operator with limiter info
Sync & Wait Overhead	Synchronization time attributed to operators
Memory Transfer Hotspots	Data movement volume and bandwidth per operator
Sequential Execution Chains	Back-to-back operators on the same thread — parallelism opportunities
Stream Utilization	Stream count, per-stream busy%, GPU utilization during query
Operator Concurrency	Thread-level parallelism — how many threads execute operators

For multi-query analysis, the report includes:

Cross-Query Optimization Priority Matrix — operators ranked by total wall time with source files and efficiency
Per-Query Performance Summary — exec time, GPU time, utilization, streams, sync
Top Optimization Targets — top 5 operators with actionable recommendations

Architecture Context

Operator-to-Source Mapping

All Sirius physical operators follow a deterministic naming pattern:

NVTX event name: sirius_physical_<name>::execute or sirius_physical_<name>::sink
Source file: src/op/sirius_physical_<name>.cpp
Header file: src/include/op/sirius_physical_<name>.hpp

Special cases:

sirius_physical_materialized_collector -> src/op/sirius_physical_result_collector.cpp
sirius_physical_left_delim_join / sirius_physical_right_delim_join -> src/op/sirius_physical_delim_join.cpp
sirius_physical_streaming_limit -> src/op/sirius_physical_limit.cpp

Execution Model

Sirius executes query plans as pipelines:

Pipeline executor has a thread pool and stream pool (src/pipeline/gpu_pipeline_executor.cpp)
Within each task, operators execute sequentially with stream.synchronize() after each operator (src/pipeline/gpu_pipeline_task.cpp:218)
Multiple tasks run concurrently across threads, each with its own CUDA stream
Within each operator, cuDF may use multiple streams internally

This means:

Inter-task parallelism: Multiple pipeline tasks run on different threads/streams
Intra-operator parallelism: cuDF operations may launch many kernels across streams
Inter-operator serialization: stream.synchronize() between operators within a task

Optimization Dimensions

Dimension	What to Look For	Threshold
Wall time	Operators consuming >20% of query time	Top 3 operators
GPU efficiency	Wall time vs GPU time ratio	<20% = CPU-bound
Occupancy	Theoretical SM occupancy	<50% with significant GPU time
Sync overhead	cudaStreamSynchronize / cudaDeviceSynchronize time	>30% of operator wall time
Memory	Pageable transfers, high D2D volume	Pageable = major issue
Parallelism	Sequential chains, low stream count	Gap <1ms = serialized

Interpretation Guide

Bottleneck Classification

GPU-BOUND (efficiency >= 50%): The operator spends most time in GPU kernels. Optimize kernel occupancy, memory access patterns, or algorithmic efficiency.
MIXED (efficiency 20-50%): Both GPU compute and CPU/sync overhead are significant. Check for unnecessary synchronization, small kernel launches, or memory allocation during execution.
CPU-BOUND (efficiency < 20%): The operator spends most time on CPU orchestration, sync waits, or memory operations. Look for cudaStreamSynchronize, cudaHostAlloc, or host-side data preparation.

What to Do After Identifying Targets

Read the source file listed in the Source File Mapping
Look at the execute() method — find NVTX-instrumented code section
Check Top Kernels — identify which cuDF/CCCL calls generate the dominant kernels
For CPU-bound operators — search for sync calls, memory allocation, or complex host-side logic
For GPU-bound operators — check occupancy limiters and consider kernel fusion or algorithmic changes
For sequential chains — consider whether adjacent operators could share streams or be fused

Output Format

Always present findings in a structured way:

Start with the Cross-Query Priority Matrix (the most actionable view)
Focus on the top 3-5 optimization targets
For each target, explain what makes it slow and what to investigate
Link operators to source files so the developer can navigate directly
When comparing across queries, highlight operators that consistently dominate
When presenting optimization recommendations, remind the user to validate improvements with non-profiled timing runs (run_tpch_parquet.sh or benchmark_and_validate.sh) — profiled timings include nsys overhead and do not reflect actual query performance