name: inference-value-ledger
last_validated: 2026-06-05
description: >-
Render the leadership value-prop ledger for the inference effort: the deployable
wins vs the FlashInfer-TRTLLM + best-tuned-vLLM 0.21/0.22 baseline, grouped
DONE / IN-PROGRESS / NOT-DONE / CLOSED-NEGATIVE, each row data-backed by a perf-lake
campaign (live sol_rigor + verdict tier), plus the ranked GRIND FRONTIER of next levers
(the always-be-grinding performance ratchet). Joins the curated
perf-tune-report/configs/value-findings.yaml registry with the live campaigns via
perftunereport value_view. Flags any finding whose backing campaign is missing or
ungrounded. Use when you need to show value / report to leadership / answer "what have
we uncovered that we can deploy, revalidate, or pursue". Triggers on "value ledger",
"value prop", "show value", "show the value prop", "leadership view", "what wins do we
have", "vs flashinfer", "value-findings", "what can we deploy", or any combination of
"value / wins / findings / leadership / report" with "inference / vllm / flashinfer / perf".
inference-value-ledger
One command to render the always-current leadership value-prop view.
Run it
# print to stdout:
perftunereport value_view
# write a doc:
perftunereport value_view --out /path/to/VALUE-LEDGER.generated.md
# JSON envelope (finding + flag counts, e.g. for CI):
perftunereport value_view --json
perftunereport ships with the profile_and_optimize server. If it is not on PATH,
put the server on PYTHONPATH and invoke it as a module.
How it works
- Curated source-of-truth: perf-tune-report/configs/value-findings.yaml - one entry per finding (id, title, lifecycle, baseline, win, hardware, deploy_readiness, campaign_ids, next_lever [+ optional next_value]). Edit this when a finding's status changes or a new finding lands. This is the ONLY hand-maintained part.
- perftunereport value_view joins it with the LIVE perf-lake campaigns: each campaign's report_status.json (sol_rigor / dcgm_grounded) + verdict.json (tier / baseline_named), and renders the grouped table plus the ranked GRIND FRONTIER. Win numbers are human-verified in the registry. The live columns + flags are read at render time so the ledger never silently drifts from the lake.
- Flags surface when a campaign is missing locally (S3-only / unpublished), ungrounded (sol_rigor none/L1), its baseline is not named, or a finding has no next_lever - fix by publishing/tagging the campaign or naming the next lever.
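The flag rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the perftunereport implementation: the function name flag_finding, the in-memory dict shapes, the flag strings, and the example ids are hypothetical. Only the field names (campaign_ids, next_lever, sol_rigor, baseline_named) come from the description above.

```python
# Hypothetical sketch of the flag rules; not the real perftunereport code.
UNGROUNDED = {None, "none", "L1"}  # sol_rigor values treated as ungrounded


def flag_finding(finding, campaigns):
    """Return the list of flags for one registry finding.

    finding   -- dict with registry fields (campaign_ids, next_lever, ...)
    campaigns -- dict: campaign id -> merged report_status.json + verdict.json
                 fields, holding only campaigns that are present locally
    """
    flags = []
    for cid in finding.get("campaign_ids", []):
        live = campaigns.get(cid)
        if live is None:
            flags.append(f"{cid}: campaign missing locally")
            continue
        if live.get("sol_rigor") in UNGROUNDED:
            flags.append(f"{cid}: ungrounded (sol_rigor none/L1)")
        if not live.get("baseline_named"):
            flags.append(f"{cid}: baseline not named")
    if not finding.get("next_lever"):
        flags.append("no next_lever")
    return flags


# Made-up example data: one ungrounded campaign, one missing, no next lever.
finding = {"id": "example-finding", "campaign_ids": ["camp-a", "camp-b"], "next_lever": ""}
campaigns = {"camp-a": {"sol_rigor": "L1", "baseline_named": True}}
for flag in flag_finding(finding, campaigns):
    print(flag)
```

The sketch keeps every finding and only attaches flags, which mirrors the rule that a problem row is surfaced rather than dropped.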
Keep it honest
Lifecycle is evidence-state, not aspiration: a finding is done only when its campaign is
published + grounded, in_progress while revalidating, not_done for research,
closed_negative for banked dead-ends (value = prevented spend). Never mix B200 (TP=8) and
GB300 (TP=4) without the hardware column. Every row names its baseline (the
DRAFT-vs-VERDICT "fair baseline" rule). Code-under-test provenance match: a
finding's source_refs delivery+commit MUST match the cited campaign_ids' provenance -- citing
an overlay/offline-prepped campaign as the benefit of an infr-patch is a cross-tier DRAFT defect
(even when the kernels match). value_view flags this via provenance_match_problems() (delivery/
commit mismatch between source_refs and a cited campaign). Typical failure: citing an
overlay-deploy champion's number as a source patch's benefit -- always re-measure on the
patch's own deploy.
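As an illustration of the provenance rule, the check can be thought of as a field-by-field comparison. The sketch below is hypothetical and is not the real provenance_match_problems(): the function name, the single-source_ref simplification, the dict shapes, and the example delivery values ("patch", "overlay") are all assumptions.

```python
# Hypothetical sketch; the real provenance_match_problems() may differ.
def provenance_problems(source_ref, campaign_provenance):
    """Compare a finding's source_ref with each cited campaign's provenance.

    source_ref          -- dict with "delivery" and "commit"
    campaign_provenance -- dict: campaign id -> dict with "delivery" and "commit"
    Returns a list of human-readable mismatch descriptions.
    """
    problems = []
    for cid, prov in campaign_provenance.items():
        for key in ("delivery", "commit"):
            if prov.get(key) != source_ref.get(key):
                problems.append(
                    f"{cid}: {key} mismatch "
                    f"(finding={source_ref.get(key)!r}, campaign={prov.get(key)!r})"
                )
    return problems


# Made-up example: same commit, but the campaign ran on a different delivery.
source_ref = {"delivery": "patch", "commit": "abc123"}
cited = {"camp-a": {"delivery": "overlay", "commit": "abc123"}}
print(provenance_problems(source_ref, cited))
```

A matching commit alone does not clear the check here; a delivery mismatch is still reported, in line with the "even when the kernels match" rule.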
Full-context reporting (no bare numbers)
Per the methodology canon "Every performance number carries its full context (no bare
numbers)" (docs/METHODOLOGY.md, "Full-context reporting"): every number this
skill emits (throughput, latency, TPOT/ITL, BW, %SoL, speedup, efficiency, goodput, acceptance
rate, scaling efficiency, thermal/failure rate - whatever it reports) MUST carry its full
measurement-context descriptor, and every comparison MUST be matched on it. A bare number is a
defect - it cannot set a default, ship a config, or appear in a report.
- Identity: model (+HF path), hardware (exact ceiling token GB300/B200), quant, kv-cache dtype.
- Parallelism: TP, DP (replicas), PP, EP, parallel_strategy.
- Serving cfg: max-num-seqs, max-num-batched-tokens, gpu-memory-utilization, max-model-len, cudagraph_mode/enforce_eager, async_scheduling, prefix-caching.
- Workload: dataset, ISL/OSL (or mean in/out tokens), concurrency, num-prompts.
- Regime: warm vs cold. Latency vs throughput tier.
- Stack: image/vllm commit, bench backend, serving engine.
- Grounding: %SoL (+ ceiling key from configs/sol-ceilings.yaml - never inline a peak), sol_rigor (L1-L4), trials n (mean±std), same-node, baseline named. (If the metric is not roofline-bound - e.g. accuracy/acceptance - omit %SoL but keep the rest of the descriptor.)
- Per-number exact shape (no smoothing): when reporting more than one number, keep EACH with its own exact shape (ISL/OSL, concurrency, dataset, regime) - never normalize a set to one uniform descriptor that hides per-point variation (e.g. c=1 @ ISL1024/OSL256 + c=64 @ ISL4096/OSL512, NOT one shared "random").
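The two rules above (no bare numbers, comparisons matched on context) can be sketched as a small descriptor check. This is a hedged illustration only: the Context class, its reduced field set, the helper names, and the example values ("example-model", "fp8", "random") are assumptions, and the real descriptor carries more fields than shown.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Context:
    """Hypothetical subset of the measurement-context descriptor."""
    model: Optional[str] = None
    hardware: Optional[str] = None
    quant: Optional[str] = None
    tp: Optional[int] = None
    dataset: Optional[str] = None
    isl: Optional[int] = None
    osl: Optional[int] = None
    concurrency: Optional[int] = None
    regime: Optional[str] = None


def missing_fields(ctx):
    """Names of descriptor fields left unset; non-empty means a bare number."""
    return [f.name for f in fields(ctx) if getattr(ctx, f.name) is None]


def mismatched_fields(a, b):
    """Descriptor fields on which two measurements differ (unmatched comparison)."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


base = Context("example-model", "B200", "fp8", 8, "random", 1024, 256, 1, "warm")
cand = Context("example-model", "GB300", "fp8", 4, "random", 1024, 256, 1, "warm")
print(missing_fields(Context(model="example-model")))
print(mismatched_fields(base, cand))  # ['hardware', 'tp']
```

Each number keeps its own Context instance, so a set of results is never collapsed into one shared descriptor.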
Asset validation (review + FAIL LOUD)
Every asset this skill emits (value-prop ledger / GRIND FRONTIER view / table) is held to
the asset-validation canon (docs/METHODOLOGY.md, "Asset validation"). The generator FAILS
LOUDLY on missing/bad data (a finding whose backing campaign is missing or ungrounded,
unknown/null where a value is required): it flags the problem loudly and never silently drops
or fabricates a row. The agent REVIEWS the ledger for human-sense + 100% accuracy (every win
row is campaign-grounded with live sol_rigor + verdict tier, no orphaned/ungrounded claim
presented as a win) and rebuilds it if wrong -- never ships a wrong/confusing ledger with a
caveat.
PR value proposition (WHY + benefit + applicability)
This skill IS the value-prop ledger -- so when one of its findings graduates into a PR (a
downstream patch PR or any upstream/shareable PR), that PR MUST carry the same value framing,
not just the metric: the WHY (blocker removed / capability unlocked), the measured benefit
(full descriptor + %SoL/sol_rigor + a resolving perf-lake / evidence link; never a hand-waved
"no perf penalty"), and an applicability matrix -- model architecture (sparse/MoE vs dense),
size, reasoning-vs-not, quant, parallelism (incl. any TP>1-specific behavior), workload, and
concurrency regime. State what it does NOT apply to. Keep it tight (no AI-slop): lead with
SoL%/roofline% and matched numbers. Cut hedging, redundancy, and decorative charts (infra PRs don't embed
images). De-slop checklist: no em-dashes (#1 AI-slop tell), minimal bold, plain punctuation, inline code
out of narrow table cells. Source: docs/METHODOLOGY.md "Value proposition" + "De-slop".
Next lever / BREAKTHROUGH (Grind Mandate)
If this skill emits a measured result, its output MUST end by naming the next perf lever,
its expected unlock (direction + rough magnitude), and the gate that proves/refutes it,
per the Grind Mandate (docs/METHODOLOGY.md, "Always be grinding"). A
measured win is the new floor, not the finish -- so do everything we can to find the next
BREAKTHROUGH: the highest-EV unlock toward Speed-of-Light (a new champion / kernel / router /
quant / parallelism / spec-decode win, or an unblocked stack), not just the next micro-lever.
Rank the candidate breakthrough levers by value x cost (the GRIND FRONTIER, perftunereport
value_view), pursue the top, and bank the rest with evidence. Record WHY a refuted lever
loses, and update the standing frontier in the active bundle's HANDOFF.md. Never conclude
"exhausted/optimal/done" without an explicit next-lever frontier (the only exception: an empty
frontier AND a documented SoL wall). Delete this section ONLY if the skill produces no
measurements.
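The ranking step can be sketched as follows. This is one plausible reading, not the perftunereport logic: "value x cost" is interpreted here as expected value divided by cost, and the function name rank_frontier, the dict keys, and the lever entries are all hypothetical placeholders.

```python
# Hypothetical sketch: "value x cost" is read here as value divided by cost.
def rank_frontier(levers):
    """Sort candidate levers by expected value per unit cost, highest first."""
    return sorted(levers, key=lambda lv: lv["value"] / lv["cost"], reverse=True)


# Made-up levers; each names the gate that would prove or refute it.
levers = [
    {"name": "lever-a", "value": 3.0, "cost": 2.0, "gate": "gate-a"},
    {"name": "lever-b", "value": 2.0, "cost": 0.5, "gate": "gate-b"},
    {"name": "lever-c", "value": 1.0, "cost": 1.0, "gate": "gate-c"},
]
frontier = rank_frontier(levers)
print("pursue:", frontier[0]["name"])
print("bank:", [lv["name"] for lv in frontier[1:]])
```

The top entry is the lever to pursue; the rest stay on the frontier with their gates so they can be banked with evidence rather than forgotten.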