name: perf-optimize-uftrace
description: uftrace-powered iterative performance optimization for the apply-load benchmark
argument-hint:
Parse $ARGUMENTS:
- Extract the target TPS number. If missing or not a valid number, ask the user
to provide one (e.g.
/perf-optimize-uftrace 20000). - Store as
$TARGET_TPS.
uftrace Performance Optimization Workflow
Iteratively optimize henyey's apply-load single-shot benchmark to reach
$TARGET_TPS transactions per second, using uftrace function tracing for
precise, deterministic profiling at every stage.
Hard constraint: No protocol changes — observed behavior (transaction results, ledger hashes, meta) must stay identical. Only internal implementation performance is in scope.
Major refactorings are allowed and encouraged when they unlock performance gains. Don't shy away from changing data structures, reworking function signatures, or restructuring hot paths across crate boundaries. Correctness is verified by the test suite, not by minimizing diff size.
Phase 0: Prerequisites & Setup
Generate a session ID (8-char random hex). All session data goes under
~/data/<session-id>/.Verify tools exist. Check for each tool and install if missing:
Tool Check Install uftracewhich uftraceSee https://github.com/namhyung/uftrace rustfiltwhich rustfiltcargo install rustfiltinferno-flamegraphwhich inferno-flamegraphCARGO_TARGET_DIR=~/data/<session-id>/cargo-inferno cargo install infernoRust nightly rustup run nightly rustc --versionrustup toolchain install nightlyIf
uftraceis not installed, stop and tell the user — it requires system installation (not a cargo package).
Phase 1: Build Two Binaries
Build two separate release binaries with different cargo target directories.
1a. Measurement Binary (LTO)
Standard release build for accurate TPS measurement. Uses the workspace's
normal [profile.release] settings (LTO, codegen-units=1).
CARGO_TARGET_DIR=~/data/<session-id>/cargo-target \
cargo build --release -p henyey
cp ~/data/<session-id>/cargo-target/release/henyey \
~/data/<session-id>/henyey-measure
1b. Instrumented Binary (no-LTO, mcount)
Built with nightly for -Z instrument-mcount support. LTO is disabled so
that vendored crate functions (soroban-env-host, wasmi, ed25519-dalek, etc.)
are not inlined and remain visible as separate call frames in the trace.
CARGO_TARGET_DIR=~/data/<session-id>/cargo-target-uftrace \
RUSTFLAGS="-Z instrument-mcount" \
CARGO_PROFILE_RELEASE_LTO=false \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16 \
cargo +nightly build --release -p henyey
cp ~/data/<session-id>/cargo-target-uftrace/release/henyey \
~/data/<session-id>/henyey-trace
Why two builds?
- LTO inlines cross-crate functions (especially SAC/soroban-env-host), making them invisible to uftrace. The no-LTO build preserves all call boundaries.
- The mcount instrumentation adds ~4-7x overhead, so TPS numbers from the instrumented binary are not meaningful for measuring optimization impact.
- The LTO binary matches production performance characteristics.
Verify the instrumented binary has mcount call sites:
nm ~/data/<session-id>/henyey-trace | grep -c "mcount"
# Expect 15,000-35,000 sites
Phase 2: Baseline Measurement
Run the apply-load benchmark 3 times with the measurement (LTO) binary:
~/data/<session-id>/henyey-measure apply-load \
--mode single-shot --tx-count 50000 --clusters 4 --iterations 10
Parse Average TPS: NNN from each run. Record all 3 values and take the
median as $BASELINE_TPS.
If $BASELINE_TPS >= $TARGET_TPS, report success and stop.
Phase 3: Trace Recording
Record a uftrace trace using the instrumented (no-LTO) binary.
uftrace record \
-d ~/data/<session-id>/uftrace.data \
-t 10us \
--no-libcall \
~/data/<session-id>/henyey-trace apply-load \
--mode single-shot --tx-count 50000 --clusters 4 --iterations 1
Key flags and why they matter:
| Flag | Purpose |
|---|---|
-t 10us |
Filter out functions that complete in < 10 microseconds. Without this, the trace would be hundreds of gigabytes. This captures all meaningful work while keeping data to ~200-300MB. |
--no-libcall |
Skip tracing libc/PLT calls. We only care about application code. |
--iterations 1 |
Single iteration keeps trace size manageable. The trace captures function-level timing, not statistical variance — one iteration is sufficient for profiling. |
Expected output: ~/data/<session-id>/uftrace.data/ containing thread-specific
.dat files totaling 200-400MB.
Phase 4: Trace Analysis
Step 1: Identify Thread Roles
List .dat files by size to classify threads:
ls -lhS ~/data/<session-id>/uftrace.data/*.dat | head -25
Thread classification by file size:
- Main thread: Largest file (200-350MB) — runs benchmark orchestration, ledger close, commit, bucket list operations
- Worker threads: ~30-40MB each — run
execute_single_cluster, one per cluster (expect 4 for--clusters 4) - Merge threads: ~10-20MB each — run background bucket merges
(
merge_buckets_to_file_with_counters) - Small threads: < 1MB — tokio runtime, I/O, ignore these
Record the TIDs for: main thread, all worker threads, representative merge thread.
Step 2: Worker Thread Self-Time Report
The worker threads execute SAC transactions. Analyze one representative worker thread:
uftrace report -d ~/data/<session-id>/uftrace.data/ \
--tid <worker-tid> -t 10us -s self --no-pager 2>/dev/null \
| rustfilt | head -80
The self column shows exclusive time spent in each function (not
including time in its children). This is the primary metric for identifying
optimization targets — functions with high self-time are doing the most
actual work.
Extract the top 50 functions. Categorize each into one of:
| Category | Example functions |
|---|---|
| Signature verification | curve25519_dalek::*scalar_mul*, ed25519_dalek::*verify*, sqrt_ratio_i |
| Host function setup/dispatch | invoke_host_function_typed, invoke_host_function_core, call_n_internal, invoke_function, try_finish, build_storage_footprint_from_xdr, get_ledger_changes_typed |
| SAC contract logic | StellarAssetContract::transfer, receive_balance, spend_balance, is_authorized, write_contract_balance, read_asset, require_auth |
| Storage operations | load_soroban_footprint, apply_soroban_storage_changes, build_entry_changes_with_hot_archive, validate_footprint_entry, EntryStore::insert_created |
| Bucket/snapshot lookup | HotArchiveBucketList::get, BucketListSnapshot::load_keys_result, SnapshotHandle::prefetch |
| XDR serialization | Transaction::write_xdr, TransactionEnvelope::to_xdr, TransactionFrame::hash, inner_tx_size_bytes |
| Metered XDR (soroban) | LedgerEntryData::write_xdr with MeteredWrite, ScMap::write_xdr |
| Event handling | externalize (events), transfer_event, Budget::shadow_mode |
| HashMap/allocation | reserve_rehash, drop_in_place, __rust_realloc, __rust_dealloc |
| Cleanup/drop | drop_in_place::<TransactionExecutor>, drop_in_place::<InvokeHostFunctionTypedResult> |
| Wasmi overhead | Linker::clone, BTreeMap::clone::clone_subtree |
Compute the percentage of total active self-time for each category.
Step 3: Worker Thread Call Graph
Get the hierarchical call graph to understand where time accumulates:
uftrace graph -d ~/data/<session-id>/uftrace.data/ \
--tid <worker-tid> -t 100us --no-pager 2>/dev/null \
| rustfilt | head -120
This shows the call hierarchy with cumulative timing. Trace the critical path:
execute_single_cluster
-> execute_transaction_with_request
-> pre_apply_arc (signature verification, XDR size calc)
-> apply_body
-> execute_operation_with_soroban
-> execute_invoke_host_function
-> execute_host_function_p25
-> invoke_host_function_typed
-> invoke_host_function_core
-> Host::invoke_function
-> Host::call_n_internal
-> StellarAssetContract::call
-> StellarAssetContract::transfer
Step 4: Main Thread Analysis
The main thread runs commit/bucket operations. Use a higher time filter since these are coarser-grained operations:
uftrace report -d ~/data/<session-id>/uftrace.data/ \
--tid <main-tid> -t 1ms -s self --no-pager 2>/dev/null \
| rustfilt | head -50
Look for: BucketList::add_batch_impl, merge_buckets_to_file_with_counters,
LiveBucketIndex::from_entries_default_with_iter, LedgerDelta::merge,
commit, close_ledger.
Step 5: Generate Flamegraph SVGs
Generate demangled flamegraph SVGs for visualization. These are the primary visual artifacts — interactive SVGs where you can click to zoom into call stacks.
Worker threads flamegraph (all clusters combined):
for tid in <worker-tid-1> <worker-tid-2> <worker-tid-3> <worker-tid-4>; do
uftrace dump --flame-graph \
-d ~/data/<session-id>/uftrace.data/ \
--tid $tid -t 100us 2>/dev/null
done | rustfilt > ~/data/<session-id>/flamegraph-workers.txt
inferno-flamegraph \
--title "Henyey SAC Workers (<tx-count> TPS, <clusters> clusters)" \
< ~/data/<session-id>/flamegraph-workers.txt \
> ~/data/<session-id>/flamegraph-workers.svg
Main thread flamegraph:
uftrace dump --flame-graph \
-d ~/data/<session-id>/uftrace.data/ \
--tid <main-tid> -t 1ms 2>/dev/null \
| rustfilt > ~/data/<session-id>/flamegraph-main.txt
inferno-flamegraph \
--title "Henyey Main Thread (commit/bucket ops)" \
< ~/data/<session-id>/flamegraph-main.txt \
> ~/data/<session-id>/flamegraph-main.svg
Detailed single-worker flamegraph (lower filter for SAC internals):
uftrace dump --flame-graph \
-d ~/data/<session-id>/uftrace.data/ \
--tid <worker-tid> -t 10us 2>/dev/null \
| rustfilt > ~/data/<session-id>/flamegraph-worker-detail.txt
inferno-flamegraph \
--title "Henyey Worker Detail (SAC internals, 10us)" \
< ~/data/<session-id>/flamegraph-worker-detail.txt \
> ~/data/<session-id>/flamegraph-worker-detail.svg
Report the paths of all generated SVGs to the user.
Step 6: Quantify Optimization Targets
For each category from Step 2, compute:
- Total self-time across all functions in the category
- Percentage of active work time (total wall minus idle/condvar wait)
- Per-transaction average (self-time / tx-count)
Rank categories by self-time. For the top 5 categories:
- Read the source code of the highest self-time function(s) in that category
- Identify the specific inefficiency (redundant work, allocations, etc.)
- Estimate expected TPS gain:
(category_self_time / total_active_time) * current_TPS * improvement_fraction - Generate a concrete hypothesis
Phase 5: Hypothesis Document
Create or update docs/perf-hypotheses-uftrace.md:
# Performance Hypotheses (uftrace)
Baseline: <BASELINE_TPS> TPS | Target: <TARGET_TPS> TPS | Date: <today>
## uftrace Profile Summary
| Category | Self-time | % of active | Per-TX avg |
|----------|-----------|-------------|------------|
| Host function setup | X.XXs | XX.X% | XXXus |
| SAC contract logic | X.XXs | XX.X% | XXXus |
| ... | ... | ... | ... |
## Hypotheses
| # | Hypothesis | Target function(s) | Self-time before | Status | Expected gain | Measured gain | TPS after |
|---|-----------|-------------------|------------------|--------|---------------|---------------|-----------|
| 1 | <description> | <function> | X.XXs | pending | +X% | | |
Phase 6: Iterative Optimization Loop
For each hypothesis (highest expected gain first):
Step A: Read & Plan
Read the source of the target function(s). Understand why self-time is high. Look for:
- Redundant allocations or cloning
- Repeated XDR serialization/deserialization
- Lock contention or unnecessary synchronization
- Cache misses from data layout
- Unnecessary database round-trips or flushes
- Work that could be batched, cached, or parallelized
- Unnecessary work for SAC/builtin contracts (e.g., wasmi linker clone when no WASM is executed)
Compare with stellar-core if the function has a C++ counterpart.
Step B: Implement
- Implement the optimization.
- Keep changes focused on one hypothesis at a time.
- Do not change observable behavior — transaction results, ledger hashes, and emitted meta must remain identical.
Step C: Measure TPS
Rebuild the measurement (LTO) binary and run the benchmark 3 times:
CARGO_TARGET_DIR=~/data/<session-id>/cargo-target \
cargo build --release -p henyey
~/data/<session-id>/cargo-target/release/henyey apply-load \
--mode single-shot --tx-count 50000 --clusters 4 --iterations 10
Record the median TPS and compute delta from previous best.
Step D: Targeted Re-trace
Rebuild the instrumented (no-LTO) binary and record a targeted trace to confirm the optimization reduced self-time in the target function(s).
CARGO_TARGET_DIR=~/data/<session-id>/cargo-target-uftrace \
RUSTFLAGS="-Z instrument-mcount" \
CARGO_PROFILE_RELEASE_LTO=false \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16 \
cargo +nightly build --release -p henyey
Record a new trace (1 iteration, 10us filter):
uftrace record \
-d ~/data/<session-id>/uftrace-after-<hypothesis-number>.data \
-t 10us --no-libcall \
~/data/<session-id>/cargo-target-uftrace/release/henyey apply-load \
--mode single-shot --tx-count 50000 --clusters 4 --iterations 1
Query the target function's self-time on a worker thread:
uftrace report \
-d ~/data/<session-id>/uftrace-after-<N>.data/ \
--tid <worker-tid> -t 10us -s self --no-pager 2>/dev/null \
| rustfilt | grep -i "<target-function-pattern>"
Compare self-time before and after. Record both values.
Step E: Decide
- Accept if TPS gain > 1% AND the target function's self-time decreased (or the function was eliminated entirely).
- Reject if no meaningful TPS gain.
- Investigate if TPS improved but self-time didn't change — the gain may have come from a different code path. Check the full self-time report for unexpected changes.
Step F: Document & Commit
- Update the hypothesis table with measured gain, self-time before/after, and new TPS.
- If accepted: commit with message like
Optimize <what>: +X% TPS (<old> -> <new>). - If rejected: revert the change, set status to
rejected, note why. - If significant gain (>3%): regenerate the worker flamegraph SVG to show the updated profile.
- If current TPS
>= $TARGET_TPS: stop — target reached.
Step G: Discover New Hypotheses
During implementation, if you discover new optimization opportunities:
- Add them to the hypothesis table with status
pending - Include the target function and estimated self-time from the trace
- Re-sort pending hypotheses by expected gain
Repeat
Pick the next pending hypothesis and go to Step A. If all hypotheses are
exhausted and target is not reached, report the final TPS and remaining gap.
Measurement Protocol
- TPS measurement: Always use the LTO binary. Run 3 times, take the median.
- Trace recording: Always use the no-LTO instrumented binary. 1 iteration,
-t 10usfilter,--no-libcall. - Thread analysis: Always use
--tid <tid>when runninguftrace reportoruftrace graph. Never run these commands on the full trace without a thread filter — the data volume will cause timeouts. - Demangling: Always pipe uftrace output through
rustfilt. Rust symbols are mangled and unreadable without it. - Time filters for output commands:
uftrace report:-t 10usfor workers,-t 1msfor main threaduftrace graph:-t 100usfor readable hierarchyuftrace dump --flame-graph:-t 100usfor workers,-t 1msfor mainuftrace dump --flame-graph(detail):-t 10usfor single worker
- Ensure no other heavy processes are running during measurement.
- Report TPS with the transaction count and wall-clock time used to compute it.
Common Pitfalls (from experience)
Dynamic patching doesn't work on LTO'd Rust binaries. Do not attempt
uftrace -P. --forceoruftrace --patchon the LTO binary — it will only capture a handful of libc calls. The-Z instrument-mcountnightly flag is required.LTO inlines vendored crate functions. With LTO enabled, functions from
soroban-env-host,ed25519-dalek,wasmi, etc. are inlined into their callers and become invisible to uftrace. The no-LTO build (CARGO_PROFILE_RELEASE_LTO=false CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16) preserves all call boundaries.Full trace analysis times out. Running
uftrace reporton a 300MB+ trace without--tidwill exceed the 2-minute command timeout. Always filter by thread ID.uftrace graph -Fuses mangled names. The-F(function filter) flag inuftrace graphoperates on mangled symbol names, not demangled. If a filter fails with "failed to set filter", either drop the-Fflag and grep the output instead, or find the exact mangled name withnm <binary> | grep <pattern>.Idle time dominates wall time. Worker threads spend 60-70% of wall time waiting on condvars (tokio blocking pool). When computing percentages, subtract idle time to get the active work time that matters.
The no-LTO binary is 4-7x slower. Due to mcount instrumentation overhead and lack of cross-crate inlining. TPS numbers from the instrumented binary are meaningless for optimization measurement — they are only useful for relative self-time comparisons.
Summary Report
When the loop terminates (target reached or hypotheses exhausted), print:
## Performance Optimization Summary (uftrace)
Baseline: <BASELINE_TPS> TPS
Final: <FINAL_TPS> TPS
Target: <TARGET_TPS> TPS
Improvement: +<PERCENT>%
Status: <target reached | gap remaining>
Accepted optimizations:
- <hypothesis>: +X% (<old> -> <new> TPS)
Self-time: <function> <before>ms -> <after>ms (-Y%)
- ...
Rejected hypotheses:
- <hypothesis>: <reason>
- ...
Artifacts:
- Baseline flamegraph: ~/data/<session-id>/flamegraph-workers.svg
- Final flamegraph: ~/data/<session-id>/flamegraph-workers-final.svg
- Main thread: ~/data/<session-id>/flamegraph-main.svg
- Trace data: ~/data/<session-id>/uftrace.data/
- Hypothesis document: docs/perf-hypotheses-uftrace.md
Update docs/perf-hypotheses-uftrace.md with the final summary.