name: perf-investigation description: Profile-driven performance work on the panache parser or formatter. Measure first with perf + the right harness; classify hotspots into one of a small set of buckets; apply the matching cheap fix; verify median wall-time moved before committing.
Use this skill when asked to "speed up parsing", "speed up formatting",
"look at the parser/formatter hotspots", "fix the regression after
The buckets, workflow, and verification steps are shared across parser and formatter; only the harness invocation and the hot-file map differ. The "Harness" sections below have both — pick the one that matches the target.
Scope boundaries
- Verification is end-to-end test green +
commonmark_allowlistgreen- clippy/fmt clean. Performance gains do not justify a snapshot diff or a regression in any test.
- The thread-local pool / scratch-bundle pattern in
inline_ir.rs::ScratchEventsis the established shape for amortizing per-call allocations. Don't invent a new pattern; extend that one. - Formatting must remain idempotent (
format(format(x)) == format(x)). Any formatter perf change that touches emitter shape needs the golden-cases suite green. - Parsing must remain CST-lossless. Any parser perf change that
touches
builder.token()/builder.start_node()shape needs the parser golden snapshots and the conformance allowlist green.
Related rules to read first
.claude/rules/parser.md— losslessness, dialect gating, no formatter policy in parser code, TEXT-coalescence-vs-structural rule..claude/rules/integration-tests.md— where parser vs formatter goldens live (don't mix them).
Harness noise to ignore inside this skill
The runtime occasionally injects a system-reminder nudging you to
use TaskCreate / TaskUpdate. The workflow below is linear
(baseline → profile → classify → fix → measure → commit → repeat), so
task tools add overhead without value. Skip them unless the user
explicitly asks.
Harness — parser
Stress doc: pandoc/MANUAL.txt (~300 KB). Small docs hide per-line
dispatcher cost behind allocator noise.
CARGO_PROFILE_RELEASE_DEBUG=true cargo build --release \
--example profile_parse -p panache-parser
for i in $(seq 1 12); do
taskset -c 0 ./target/release/examples/profile_parse \
pandoc/MANUAL.txt 200 2>&1 | tail -1
done
taskset -c 0 pins to one core — without it, scheduling jitter on a
hybrid-core CPU swamps small wins. Discard the first 2-3 warmup runs;
take median of the remaining ~9-10. Per-run variance on a warm machine
is ~3-5%; demand at least that big a delta before declaring a fix
worked.
Harness — formatter
The repo's formatting bench is cargo bench --bench formatting. For
focused hotspot work, set PANACHE_BENCH_DOC to the doc you're
investigating and a low PANACHE_BENCH_ITERATIONS:
cd benches/documents && ./download.sh && cd ../.. # first time only
PANACHE_BENCH_DOC=pandoc_manual.md PANACHE_BENCH_ITERATIONS=3 \
cargo bench --bench formatting
# Or end-to-end on a single doc via the CLI binary, with hyperfine if
# available (more honest than ad-hoc shell loops):
CARGO_PROFILE_RELEASE_DEBUG=true cargo build --release
hyperfine --warmup 3 \
'taskset -c 0 ./target/release/panache format \
< pandoc/MANUAL.txt > /dev/null'
Same warmup-discard rule applies.
Capture a perf profile
perf record --call-graph=dwarf -F 999 -o /tmp/panache_perf.data -- \
./target/release/examples/profile_parse pandoc/MANUAL.txt 400
perf report --stdio -i /tmp/panache_perf.data \
--no-children -g none --percent-limit 1.0 | head -40
Always read cpu_core samples, not cpu_atom — on a hybrid-core
CPU cpu_atom typically captures only a handful of samples and
percentages there are essentially noise. Use --no-children for the
flat self-time view; use -g graph,caller,… (or ,callee,…) when
you need to find who calls a hot leaf. For inline-frame visibility
add --inline.
For flame graphs, the repo already integrates cargo flamegraph:
PANACHE_BENCH_DOC=pandoc_manual.md PANACHE_BENCH_ITERATIONS=3 \
cargo flamegraph --bench formatting
Classify each hotspot
Every parser/formatter hotspot recovered so far falls into one of these buckets. Identify which one BEFORE editing:
- Slice-pattern trim —
s.trim_*_matches([' ', '\t'])or similar ASCII-set trims show up ascore::str::trim_matches/trim_start_matcheswithMultiCharEqSearcher/CharPredicateSearcher::next_rejectin the call stack. Replace with byte-level helpers fromparser/utils/helpers.rs(trim_end_newlines,trim_start_spaces_tabs,trim_end_spaces_tabs,is_blank_line). .trim().is_empty()on every line — Unicode whitespace iterator for what is always ASCII. Useis_blank_line(s)instead.- Per-line block parser invoked without leading-byte gate —
try_parse_*runs on every non-blank line; allocates / scans before realizing the line can't possibly be the construct. Add a cheap byte gate:bytes after up to 3 spacesmatches the expected leading byte. Examples that paid off (parser):[for ref-def + footnote-def,<for HTML block,:for fenced-div + def-marker,=/-for setext underline (next line). Skip when the existing inner check is already byte-cheap (count_blockquote_markersalready has one). - Per-call
Stringallocation on a no-match path — atry_parse_*function that returnsOption<(String, …)>allocates the string even when the caller's outer guard rejects it. Change the signature to returnOption<usize>(orOption<&str>) and have the caller build theStringonly on confirmed match. - Per-iteration
Vec::new()in a hot loop —.collect::<Vec<_>>()inside an inner loop, or freshVecper call to a function that's invoked per range/paragraph/line. Either pool via the scratch bundle pattern ininline_ir.rs::ScratchBundleor hoist +clear()+extend()so capacity is reused across iterations. - Per-call malloc for a discardable builder —
GreenNodeBuilder::new()indetect_preparedto "try a parse and throw it away" allocates a fresh NodeCache each call. The right fix is splitting the parser function into a separatevalidate_*that doesn't emit, not pooling the discardable builder (the cache holds Arcs across parses and pooling it across the benchmark loop creates an unrealistic flatter — each iteration after the first hits a warm cache that wouldn't exist in real CLI usage). - char-walk where bytes would do — code-span / list-marker-like
scanners stepping
pos += rest[pos..].chars().next()?.len_utf8()byte-by-byte through plain ASCII. Replace withmemchr-stylebytes.iter().position(|&b| b == NEEDLE)(the compiler emits vectorized memchr). All Pandoc / CommonMark structural bytes are ASCII, so byte-level scans are losslessness-safe. to_uppercase()/to_lowercase()on ASCII — Unicode case-folding allocates a freshString. For ASCII-only checks (e.g. Roman numeral validation), case-fold a byte at a time viab & !0x20.- Formatter wrapping / line-builder churn — formatter-specific
hotspots tend to live in wrapping
(
crates/panache-formatter/src/formatter/wrapping.rs), inline emission (inlines.rs), and table layout (tables.rs). Common shapes:Stringallocation per inline span, repeated width recalculation, per-lineVec<String>for column widths. Same buckets as above, just different files. - rowan internals (NodeCache::token, Arc::drop_slow,
reserve_rehash, ThinArc::from_header_and_iter) — these dominate
the residual ~12-15% on parser benchmarks. Proportional to
builder.token()/builder.start_node()call count and to the size of the resulting green tree. Reducing them means emitting fewer tokens (e.g. coalescing a per-lineTEXT + NEWLINEpair in raw / code blocks into oneTEXTtoken where the formatter doesn't need the split). This is invasive — verify CST snapshots and the formatter round-trip before changing emitter shape. Don't try to pool the NodeCache across parses; it holds Arc'd green nodes (memory leak) and warming it across benchmark iterations creates a misleading result.
Apply the smallest matching fix
Rules that have paid off:
- Don't theorize before measuring. Multiple "should-have-helped"
changes (pre-sizing line-split vecs via an LF count; byte-gate on
BlockQuoteParser::detect_prepared) regressed wall time and were reverted. The intuition wasn't wrong; the cost model was. Always measure. - Verify with measurement, not perf-only. A change can drop a symbol from the perf top-25 without moving median wall time (sample relocation, not work elimination). The wall-time median is the truth.
- One change per commit. Prevents one regression from masking the
win of another. Re-run the test suite + clippy + fmt + a fresh
taskset -c 0measurement cycle for each. - Revert promptly. If 12 runs after the change don't show a median shift larger than the baseline noise (~3-5%), the fix doesn't pay; revert and pick a different lever. Don't ship pretty-but-flat refactors as perf.
Verify and commit
For every commit:
cargo test --workspace --no-fail-fast
cargo test -p panache-parser --test commonmark commonmark_allowlist
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo fmt -- --check
For formatter changes also verify the golden-cases suite explicitly:
cargo test --test golden_cases
Then a fresh measurement (taskset, 12 runs, median). The commit message should name the bucket and quote the median delta:
perf(parser|formatter): <bucket> on <call site>
<one-paragraph rationale: profile pointed here, what specifically>
<was wasteful, what the fix replaces it with>
Median wall time on `<harness command>` (12 runs):
~X ms → ~Y ms (~Z%).
Cite the wall-time number even when it's "in the noise" — that's the honest record, and a reviewer can decide whether to ship a noise-floor change at all.
Key files — parser
crates/panache-parser/examples/profile_parse.rs— the harness.crates/panache-parser/src/parser/utils/helpers.rs— byte-level trim / blank-line helpers; first place to look for an existing helper before adding a new one.crates/panache-parser/src/parser/inlines/inline_ir.rs—ScratchEvents/ScratchBundlethread-local pool pattern;build_full_plansfor per-paragraph IR work.crates/panache-parser/src/parser/block_dispatcher.rs— every block parser'sdetect_preparedlives here; this is the hot per-line dispatch site.crates/panache-parser/src/parser/inlines/refdef_map.rs— document-wide refdef pre-pass, called once per parse.pandoc/MANUAL.txt— 300 KB stress doc.
Key files — formatter
benches/formatting.rs— the bench harness; respectsPANACHE_BENCH_DOCandPANACHE_BENCH_ITERATIONS.benches/documents/— set of stress docs (small,medium_quarto,tables,math,large_authoring,pandoc_manual.md).crates/panache-formatter/src/formatter/— split by concern (wrapping,inlines,paragraphs,headings,lists,tables, …). Match the file to the construct your hotspot involves.crates/panache-formatter/src/formatter.rs— top-level orchestration.tests/fixtures/cases/— formatter goldens (UPDATE_EXPECTED=1to refresh, but verify diffs carefully — the formatter's idempotency invariant means a wrong refresh is a silent regression).
Don't redo / known traps
split_lines_inclusiveLF pre-count regressed parser wall time. The extra pass over the input cost more than the resize-grow it saved. Don't try this again unless you change the data structure (e.g. thread-local pooledVec<&'static str>via lifetime transmute) — and even then, prove the win with measurement first.BlockQuoteParser::detect_preparedbyte-gate was a noise-level regression.count_blockquote_markersalready has its own internal byte-cheap check; layering another gate on top added a tiny cost without saving meaningful work.- Don't pool the rowan
NodeCacheacross parses. Holds Arc'd green nodes (LSP memory leak) and produces misleading benchmark numbers (warm cache after iter 1). - Don't trust
cpu_atomperf samples on a hybrid-core CPU. Readcpu_coredata; atom is too few samples to be reliable. - Don't add a too-permissive byte-gate. It has no effect. The gate must match the construct's actual first-byte set after indent; verify with the full test suite, not just by reading the parser code.
- Don't add
String::new()in adetect_preparedbefore the gate.ReferenceDefinitionParserwas the canonical example — the multi-lineString::new() + push_strran on every line; the byte gate is what unlocked the 15% wall-time jump.
Report-back format
When done, report:
- Hotspot (function + approximate %) addressed.
- Bucket (from §Classify each hotspot).
- Median wall-time delta on the relevant harness (12 runs,
taskset -c 0). - Test / clippy / fmt status (all green or specific exception).
- What was tried but reverted (with reason).
- Suggested next hotspot, ranked by likely shared root cause.