memory-benchmark - SKILL.md Agent Skill

name: memory-benchmark description: How to benchmark and analyze memory usage in Turso using the memory-benchmark crate and dhat heap profiler. Use this skill whenever the user mentions memory usage, memory profiling, allocation tracking, heap analysis, memory regression, memory benchmarking, dhat, or wants to understand where memory is being allocated during SQL workloads. Also use when investigating memory growth in WAL or MVCC mode. IMPORTANT - If you modify the perf/memory crate (add profiles, change CLI flags, change output format, etc.), update this skill document to reflect those changes so it stays accurate for future agents.

Memory Benchmarking & Analysis

The perf/memory crate benchmarks memory usage of SQL workloads under WAL and MVCC journal modes. It uses dhat as the global allocator to track every heap allocation, and memory-stats for process-level RSS snapshots.

It also contains a stack-report helper binary for stack-usage investigations. That binary runs a SQL payload with the stacker feature enabled and captures turso_stack tracing events in-process, aggregating structured tracing fields instead of parsing stderr log text.

Location

Benchmark crate: perf/memory/
CodSpeed bench crate: perf/memory/codspeed/ (CI allocation regression tracking)
Analysis script: perf/memory/analyze-dhat.py
dhat output: dhat-heap.json (written to CWD after each run)

The crate is split into a library and binaries. The workload engine lives in memory_benchmark::workload (run_workload, WorkloadConfig, WorkloadObserver, the JournalMode/WorkloadProfile enums and create_profile); the memory-benchmark bin is a thin CLI over it that adds dhat/RSS measurement. Randomized profiles (read-heavy, mixed) use a fixed RNG seed (profile::WORKLOAD_RNG_SEED) so workloads are identical across runs.

Running Stack Reports

Use this when investigating stack usage from SQL translation/execution probes. Run stack reports in release mode with --features stacker when comparing against server logs or CI stack-size output. Debug builds can materially overstate stack deltas and should only be used for quick local iteration.

cargo run --release -q -p memory-benchmark --features stacker --bin stack-report -- \
  --sql path/to/payload.sql \
  --top 40

Useful options:

--sql FILE|-             # SQL payload, or stdin with -
--format human|json|csv  # output format
--top N                  # aggregate/span rows per statement in human output
--statement N[,N...]     # only include reports for 1-based statement indexes
--sql-contains TEXT      # only include reports for statements containing TEXT, ASCII case-insensitive

The report is statement-oriented. For each SQL statement, it records the remaining stack before execution, the minimum remaining stack sampled while that statement ran, and stack_used = baseline_remaining_stack - min_remaining_stack. Statements are sorted by stack_used descending so the worst SQL statements are first. The human report also prints global and per-statement span aggregates sorted by total_inclusive_stack_used descending. These aggregate rows group by label plus detail and include call count, total/max self stack, total/max inclusive stack, max cumulative stack at span entry, and peak_path_hits for spans that were active at the statement's minimum remaining-stack sample.

Within each statement, raw span rows are still sorted by stack_used descending, with the original tracing emission sequence kept in the trace_sequence field (seq in human output). Raw span rows include inclusive_stack_used, which is measured from the span's parent stack level down to the deepest sampled remaining stack while the span was active. This is an inclusive profiler-style metric, so nested spans intentionally overlap; use it for ranking likely contributors, not for summing to statement total stack.

JSON and CSV formats are deterministic and intended for comparing runs. CSV uses a row_type column with global_aggregate, statement_aggregate, span, and statement rows.

Statement filters affect reporting only. The runner still executes the full SQL payload in order so schema/data setup and earlier statements remain visible to later selected statements. Multiple --statement and --sql-contains filters are allowed; when both are present, a statement must match both kinds.

stack-report splits payloads with turso_parser::parser::Parser::next_cmd(). It then executes statements with no result columns, and queries and drains row-producing statements. Do not change binding execute_batch semantics for stack reports.

The runner currently uses a fixed in-memory database and enables generated columns, custom types, and materialized views internally. There are no stack report CLI flags for selecting the database path or toggling those experimental features.

Running Benchmarks

Always run in release mode — debug builds have wildly different allocation patterns and the results are not representative of real-world usage.

# Basic: single connection, WAL mode, insert-heavy workload
cargo run --release -p memory-benchmark -- --mode wal --workload insert-heavy -i 100 -b 100

# MVCC with concurrent connections
cargo run --release -p memory-benchmark -- --mode mvcc --workload mixed -i 100 -b 100 --connections 4

# Run a final checkpoint after the workload
cargo run --release -p memory-benchmark -- --mode wal --workload read-heavy --checkpoint

# Guarantee automatic MVCC checkpoints during the run by lowering the
# logical-log threshold (default is ~4 MB, more than small workloads write)
cargo run --release -p memory-benchmark -- --mode mvcc --workload insert-heavy --mvcc-checkpoint-threshold 16384

# All CLI options
cargo run --release -p memory-benchmark -- \
  --mode wal|mvcc \
  --workload insert-heavy|read-heavy|mixed|scan-heavy|series-blob \
  -i <iterations> \
  -b <batch-size> \
  --connections <N> \
  --checkpoint \
  --mvcc-checkpoint-threshold <bytes> \
  --timeout <ms> \
  --cache-size <pages> \
  --format human|json|csv

Every run produces a dhat-heap.json in the current directory. This file contains per-allocation-site data for the entire run.

Built-in Workload Profiles

Profile	Description	Setup
`insert-heavy`	100% INSERT statements	Creates table
`read-heavy`	90% SELECT by id / 10% INSERT	Seeds 10k rows
`mixed`	50% SELECT / 50% INSERT	Seeds 10k rows
`scan-heavy`	Full table scans with LIKE	Seeds 10k rows
`series-blob`	`INSERT INTO bench(data) SELECT zeroblob(2048) FROM generate_series(1, ?)`	Creates `bench`; `batch-size` is the series length

Profiles implement the Profile trait in perf/memory/src/profile/. To add a new workload, create a new file implementing the trait and wire it into the WorkloadProfile enum in main.rs.

Understanding the Output

The benchmark reports three categories of metrics:

RSS (process-level)

Measured via memory-stats crate. Includes everything: heap, mmap'd files (WAL, DB pages pulled into OS page cache), tokio runtime, etc. Snapshots are taken at phase transitions (setup -> run) and after each batch.

Baseline: RSS before any DB work (runtime overhead)
Peak: Highest RSS observed during the run
Net growth: Final RSS minus baseline — the memory attributable to the workload

Heap (dhat)

Precise allocation tracking via the dhat global allocator. Only counts explicit heap allocations (malloc/alloc), not mmap.

Current: Bytes still allocated at measurement time
Peak: Highest simultaneous live allocation during the entire run
Total allocs: Number of individual allocation calls
Total bytes: Cumulative bytes allocated (includes freed memory) — measures allocation pressure

Disk

File sizes after the benchmark completes:

DB file: The .db file
WAL file: The .db-wal file (WAL mode only)
Log file: The .db-log file (MVCC logical log only)

Analyzing dhat Output

After running a benchmark, use the analysis script to produce a readable report from dhat-heap.json:

# Overview: top allocation sites by bytes live at global peak
python3 perf/memory/analyze-dhat.py dhat-heap.json --top 15 --modules

# Focus on a specific subsystem
python3 perf/memory/analyze-dhat.py dhat-heap.json --filter mvcc --stacks
python3 perf/memory/analyze-dhat.py dhat-heap.json --filter btree --stacks
python3 perf/memory/analyze-dhat.py dhat-heap.json --filter page_cache --stacks

# Sort by different metrics
python3 perf/memory/analyze-dhat.py dhat-heap.json --sort-by eb  # bytes at exit (leaks)
python3 perf/memory/analyze-dhat.py dhat-heap.json --sort-by tb  # total bytes (pressure)
python3 perf/memory/analyze-dhat.py dhat-heap.json --sort-by mb  # max live bytes per site

# JSON output for programmatic use
python3 perf/memory/analyze-dhat.py dhat-heap.json --json

Sort Metrics

Flag	Metric	Use when
`gb`	Bytes live at global peak (default)	Finding what dominates memory at the high-water mark
`eb`	Bytes live at exit	Finding memory leaks or things that never get freed
`tb`	Total bytes allocated	Finding allocation pressure hotspots (GC churn)
`mb`	Max bytes live per site	Finding per-site high-water marks
`tbk`	Total allocation count	Finding chatty allocators (many small allocs)

Analysis Flags

--top N — Show top N sites (default 15)
--filter PATTERN — Filter to sites/stacks containing substring (e.g. mvcc, btree, wal, pager)
--stacks — Show full callstacks for top allocation sites
--modules — Aggregate by crate/module for a high-level breakdown
--json — Machine-readable aggregated output

Typical Workflow

When investigating memory usage or a suspected regression:

Run the benchmark with parameters matching the scenario:

cargo run -p memory-benchmark -- --mode mvcc --workload mixed -i 500 -b 100 --connections 4

Get the high-level picture — which modules use the most memory:

python3 perf/memory/analyze-dhat.py dhat-heap.json --modules --top 20

Drill into the hot module — e.g. if turso_core dominates:

python3 perf/memory/analyze-dhat.py dhat-heap.json --filter turso_core --stacks --top 10

Check for leaks — anything still alive at exit that shouldn't be:

python3 perf/memory/analyze-dhat.py dhat-heap.json --sort-by eb --top 10

Compare modes — run the same workload under WAL and MVCC and compare the reports to see the memory cost of MVCC versioning.

Concurrency Details

When --connections > 1:

Setup phase (schema creation, seeding) always runs on a single connection sequentially
Run phase spawns one tokio task per connection, each executing its batch concurrently
--checkpoint adds a final single-connection PRAGMA wal_checkpoint(TRUNCATE) phase after the run phase
Each connection gets busy_timeout set (default 30s, configurable via --timeout)
WAL mode uses BEGIN, MVCC uses BEGIN CONCURRENT
The Profile trait's next_batch(connections) returns one batch per connection with non-overlapping row IDs

CodSpeed Allocation Tracking in CI

.github/workflows/codspeed-memory.yml runs every workload profile under both journal modes with CodSpeed's memory instrument (eBPF-based malloc tracking: peak memory, total allocated, allocation count) so allocation regressions show up on PRs. The bench harness is the separate crate perf/memory/codspeed/ (criterion benchmarks named <mode>/<workload>/<total-ops>, e.g. mvcc/insert-heavy/2000, with much smaller iteration counts than the CLI defaults). Each (mode, workload) pair runs at 1x/2x/4x scale — same batch size, more iterations — so comparing the sizes shows how memory grows with workload volume, plus an 8x <ops>-checkpoint variant that guarantees checkpointing is part of the measurement: it lowers mvcc_checkpoint_threshold to 16 KiB so MVCC auto-checkpoints fire mid-run (WAL's 1000-frame threshold is hardcoded in core/storage/wal.rs) and ends with an explicit PRAGMA wal_checkpoint(TRUNCATE). The workflow builds the bench binary once, then fans out one CI job per workload profile, each filtering benchmarks by name — the sharding pattern from CodSpeed's sharded-benchmarks docs.

The bench crate must stay free of [[bin]] targets: cargo builds a package's bins (panic=abort under the release profile) alongside its benches (panic=unwind), and the duplicated turso_sdk_kit cdylib/staticlib units then collide on unhashed output filenames and break the build. That is why the bench does not live in perf/memory itself.

Run locally:

# Quick correctness pass (runs each benchmark once)
cargo bench -p memory-benchmark-codspeed --bench memory_profiles -- --test

# What CI runs (requires cargo-codspeed; uninstrumented outside the CodSpeed runner)
cargo codspeed build -m memory -p memory-benchmark-codspeed --features codspeed
cargo codspeed run -m memory -p memory-benchmark-codspeed --bench memory_profiles "insert-heavy"

Do NOT run plain cargo bench -p memory-benchmark-codspeed without -- --test unless you want full criterion sampling — each sample executes an entire workload.

Adding a New Profile

Create perf/memory/src/profile/your_profile.rs implementing the Profile trait
Add pub mod your_profile; to perf/memory/src/profile/mod.rs
Add a variant to WorkloadProfile enum in src/workload.rs
Wire it into create_profile() in src/workload.rs
Add it to WORKLOADS (and base_workload_size) in perf/memory/codspeed/benches/memory_profiles.rs and to the workload matrix in .github/workflows/codspeed-memory.yml so CI tracks it

The Profile trait:

pub trait Profile {
    fn name(&self) -> &str;
    fn next_batch(&mut self, connections: usize) -> (Phase, Vec<Vec<WorkItem>>);
}

Return Phase::Setup for schema/seeding (single batch), Phase::Run for measured work (one batch per connection), Phase::Done when finished.

Keeping This Skill Up to Date

This skill document is the source of truth for how agents use the memory benchmark tooling. If you modify the perf/memory crate — adding profiles, changing CLI flags, altering output format, updating the analysis script, changing the Profile trait, etc. — update this SKILL.md to match. Specifically:

New CLI flags: add to the "Running Benchmarks" section
New profiles: add to the "Built-in Workload Profiles" table
Changed output metrics: update the "Understanding the Output" section
New analyze-dhat.py flags or sort metrics: update the "Analyzing dhat Output" section
Changed Profile trait signature: update "Adding a New Profile"

Future agents rely on this document being accurate. Stale instructions cause wasted work.