name: deterministic-metric-design
description: Inventing deterministic metrics, i.e. turning a fuzzy property like 'maintainability', 'risk', or 'how reducible this code is' into a deterministic, computable number an agent can trust and optimize. Covers the path from construct to adoption, including operationalizing the construct, confronting computability limits (Kolmogorov, Rice) with sound proxies, picking the right measurement scale, proving properties (monotonicity, invariance, the Weyuker/Briand axioms), guaranteeing determinism, establishing construct validity (not just LOC in disguise), and hardening against Goodhart-style gaming when an agent optimizes the metric. Trigger when designing, reviewing, or validating a quantitative metric, score, measure, or index, and even when the user doesn't say 'metric' but wants to quantify, score, rank, or measure code/behavior, build a deterministic optimization target, or invent a measure for something previously unquantified (e.g., behavior-preserving codebase-size reduction).
dot-skills Deterministic Metric Design Best Practices
Design metrics that are deterministic, computable, provable, and valid — measures an agent can trust and optimize against without gaming them. The 44 rules across 8 categories take a metric from a fuzzy construct to an adoptable, machine-checkable number: define the construct, confront computability limits with sound proxies, ground it in measurement theory, prove its properties, pin its determinism, validate it empirically, harden it against optimization pressure, and package it for adoption.
A running example threads through every category: a deterministic measure of behavior-preserving codebase-size reduction (shrink code without changing how the app works). It makes a demanding stress test because its ideal form is provably out of reach (Kolmogorov complexity is uncomputable; program equivalence is undecidable by Rice's theorem), so the whole craft is building a deterministic, tractable proxy with a proven guarantee.
This is the measurement-design layer that the *-algorithms skills apply (Big-O, NDCG, cyclomatic, MoJoFM) but never teach.
When to Apply
Use this skill when:
- Designing a new metric, score, or index — or reviewing someone's proposed metric for rigor
- Asked to "quantify", "measure", "score", or "rank" a property that has no agreed measure yet
- Building a deterministic optimization target an agent will push on (e.g., reduce code size without changing behavior)
- Auditing an existing metric that "feels off" — it suspiciously tracks LOC, jumps between runs, or gets gamed
- Turning a research idea or formula into something computable, reproducible, and adoptable
Workflow: Define → Make Computable → Prove → Validate → Harden
The categories are ordered by cascade severity — an upstream mistake poisons everything below it. Work top-down, and jump straight to a category using this table:
| If you are… | Start in | First rule |
|---|---|---|
| Starting from a fuzzy property | def- | def-name-the-latent-construct |
| Worried the ideal is uncomputable / undecidable | comp- | comp-do-not-define-metric-as-uncomputable-ideal |
| Unsure whether you can average or take ratios | meas- | meas-declare-the-scale-type |
| Claiming the metric behaves a certain way | prop- | prop-prove-monotonicity |
| Getting different numbers between runs | det- | det-pin-iteration-and-tie-break-order |
| Unsure it measures the real thing | valid- | valid-discriminant-not-just-loc |
| Letting an agent optimize the metric | game- | game-hard-block-construct-violating-wins |
| Publishing the metric for others | agg- | agg-ship-reference-impl-and-test-vectors |
Each reference file is a {category}-{slug}.md containing: WHY it matters, an Incorrect example with the failure annotated, a Correct example with the minimal fix, and a reference. The incorrect/correct examples are metric definitions and procedures, not application code — the contrast is a badly-designed measure versus the fixed one.
Rule Categories by Priority
| # | Category | Prefix | Impact | Rules |
|---|---|---|---|---|
| 1 | Construct Definition & Operationalization | def- | CRITICAL | 6 |
| 2 | Computability & Tractability | comp- | CRITICAL | 7 |
| 3 | Measurement-Theoretic Foundations | meas- | HIGH | 5 |
| 4 | Proof of Metric Properties | prop- | HIGH | 6 |
| 5 | Determinism & Reproducibility | det- | HIGH | 5 |
| 6 | Construct Validity & Calibration | valid- | MEDIUM-HIGH | 6 |
| 7 | Optimization Safety & Anti-Gaming | game- | MEDIUM | 5 |
| 8 | Aggregation, Reporting & Adoption | agg- | LOW-MEDIUM | 4 |
See references/_sections.md for the full ordering rationale.
Quick Reference
1. Construct Definition & Operationalization (CRITICAL)
- def-name-the-latent-construct: Name the unobservable property before writing any formula
- def-separate-construct-from-proxy: Keep construct, proxy, and their assumed link distinct
- def-write-falsifiable-operational-definition: Specify the exact procedure that yields the number
- def-fix-unit-of-analysis: Pin the unit of analysis and the measurement boundary
- def-anchor-to-the-decision: Attach the decision and action threshold the metric drives
- def-operationalize-behavior-and-size: Define "behavior" (≈) and "size" so a formatter can't move them
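As an illustration of a "size" definition that a formatter cannot move, here is a minimal sketch that assumes the measured code is Python and uses the standard `ast` module. The function name `ast_size` and the choice of "AST node" as the unit are assumptions made for this example, not something the rules prescribe. Node counts can differ between interpreter versions, so the parser version would need to be pinned alongside the number.

```python
import ast


def ast_size(source: str) -> int:
    """Hypothetical size proxy: the number of nodes in the parsed module.

    Whitespace, comments, and identifier spelling do not change the count,
    because none of them add or remove AST nodes.
    """
    tree = ast.parse(source)
    return sum(1 for _ in ast.walk(tree))


compact = "def f(x):\n    return x+1\n"
spaced = "def f( x ):\n\n    # add one\n    return x + 1\n"
renamed = "def g(y):\n    return y+1\n"

# Reformatting, commenting, and renaming leave the size unchanged.
assert ast_size(compact) == ast_size(spaced) == ast_size(renamed)

# Adding a statement does change it.
assert ast_size("x = 1\n") < ast_size("x = 1\ny = 2\n")
```

Counting raw lines or bytes would fail the first assertion, which is the failure the last rule in this category is meant to rule out.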
2. Computability & Tractability (CRITICAL)
- comp-do-not-define-metric-as-uncomputable-ideal: Don't define the metric as Kolmogorov complexity
- comp-respect-rices-theorem-for-semantic-properties: Use sound approximations for undecidable semantic facts
- comp-choose-a-decidable-observational-equivalence: Replace undecidable equivalence with a checkable ≈
- comp-design-a-proxy-with-a-proven-error-direction: Give the proxy a sound bound that never over-states
- comp-keep-the-metric-tractable: Pick a near-linear proxy, not an NP-hard optimum
- comp-bound-approximation-error-explicitly: Quantify and report the proxy↔ideal gap
- comp-prefer-monotone-confluent-transformations: Confluent, terminating rewrites give a unique fixed point
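One way to make ≈ checkable is to define it as agreement on a pinned, finite set of inputs. The sketch below assumes the functions under comparison are deterministic and side-effect free; `observe`, `equivalent_on`, and `VECTORS` are hypothetical names. This relation is weaker than true program equivalence: a failure is a definite counterexample, while a pass only means the two versions were not distinguished on these inputs.

```python
def observe(fn, arg):
    """Return a comparable observation: the value, or the exception type raised."""
    try:
        return ("ok", fn(arg))
    except Exception as exc:
        return ("err", type(exc).__name__)


def equivalent_on(inputs, before, after) -> bool:
    """Checkable stand-in for ≈: identical observations on every pinned input."""
    return all(observe(before, x) == observe(after, x) for x in inputs)


def total_before(xs):
    result = 0
    for x in xs:
        result += x
    return result


def total_after(xs):
    return sum(xs)


# Pinned inputs, including one that makes both versions raise TypeError.
VECTORS = [(), (1, 2, 3), (-5, 5), (10,), ("a",)]

assert equivalent_on(VECTORS, total_before, total_after)
```

Because the input set is fixed and finite, the check always terminates, and its error direction is known: it can accept a non-equivalent rewrite but never rejects an equivalent one.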
3. Measurement-Theoretic Foundations (HIGH)
- meas-declare-the-scale-type: Declare nominal/ordinal/interval/ratio before any statistic
- meas-only-admissible-statistics: Use only statistics invariant under the scale's transforms
- meas-establish-meaningful-zero-and-unit: Give a true zero and a named unit for ratio claims
- meas-preserve-the-empirical-relation: Verify the metric orders known anchor cases correctly
- meas-avoid-ad-hoc-weighted-sums: Don't sum incommensurable scales with arbitrary weights
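A small worked case may help show why the scale type limits the admissible statistics. The ratings, module names, and numeric codings below are invented for illustration. Both codings preserve the order of the four severity labels, so on an ordinal scale they are equally legitimate, yet the comparison of means flips between them while the comparison of medians does not.

```python
from statistics import mean, median

# Two order-preserving codings of the same ordinal labels (illustrative values).
coding_a = {"low": 1, "medium": 2, "high": 3, "critical": 4}
coding_b = {"low": 1, "medium": 2, "high": 3, "critical": 10}

module_x = ["low", "low", "critical"]
module_y = ["medium", "medium", "high"]


def compare(stat, coding) -> bool:
    """Return True if module_x scores higher than module_y under stat and coding."""
    x = stat([coding[r] for r in module_x])
    y = stat([coding[r] for r in module_y])
    return x > y


# The mean-based verdict depends on the arbitrary coding.
assert compare(mean, coding_a) != compare(mean, coding_b)

# The median-based verdict does not.
assert compare(median, coding_a) == compare(median, coding_b)
```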
4. Proof of Metric Properties (HIGH)
- prop-prove-monotonicity: Prove the score moves the right way when the construct does
- prop-prove-invariance-under-irrelevant-transforms: Prove invariance to renaming and formatting
- prop-ensure-sensitivity-to-relevant-change: Ensure it still discriminates (no saturation)
- prop-check-weyuker-briand-axioms: Check the published axioms for your measure type
- prop-prove-boundedness-and-handle-empty: Prove the range; define the empty / zero-denominator case
- prop-prove-or-disclaim-composability: Prove additivity before aggregating, or refuse to sum
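For the running example, a reduction ratio is one candidate score whose properties can be stated and checked. The sketch below is an assumption-laden illustration: `reduction_ratio` and `check_properties` are hypothetical names, and returning 0.0 for an empty "before" is one possible convention. The exhaustive check over a small grid is evidence that the claimed properties hold, not a substitute for the proof.

```python
from itertools import product


def reduction_ratio(size_before: int, size_after: int) -> float:
    """Fraction of size removed. Range: at most 1.0, negative if the code grew."""
    if size_before < 0 or size_after < 0:
        raise ValueError("sizes must be non-negative")
    if size_before == 0:
        return 0.0  # assumed convention: nothing to reduce, so no reduction
    return (size_before - size_after) / size_before


def check_properties(max_size: int = 30) -> bool:
    """Exhaustively check the upper bound and monotonicity on a small grid."""
    for before, after in product(range(max_size + 1), repeat=2):
        r = reduction_ratio(before, after)
        assert r <= 1.0  # bounded above
        if before > 0 and after < max_size:
            # A larger result must strictly lower the score.
            assert reduction_ratio(before, after + 1) < r
    return True


assert reduction_ratio(100, 75) == 0.25
assert reduction_ratio(0, 0) == 0.0
assert check_properties()
```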
5. Determinism & Reproducibility (HIGH)
- det-make-the-metric-a-pure-function: No hidden time, network, or global state
- det-pin-iteration-and-tie-break-order: Sort by a total key; seed any randomness
- det-pin-the-input-representation: Fix exactly which representation (AST stage) you measure
- det-control-floating-point-and-accumulation: Fix summation order and rounding precision
- det-version-and-record-the-toolchain: Emit metric version, tool versions, and input hash
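Several of these rules can be combined in one small aggregation step. The sketch below is a hypothetical example: `score_report`, `METRIC_VERSION`, and the six-decimal rounding are illustrative choices. It sorts by file path (a total order, since paths are unique dictionary keys), sums with `math.fsum` so the result does not depend on accumulation order, and emits a version plus a hash of the canonicalized input.

```python
import hashlib
import json
import math

METRIC_VERSION = "0.1.0"  # hypothetical version string


def score_report(file_scores: dict) -> dict:
    """Aggregate per-file scores so that input ordering cannot change the output."""
    items = sorted(file_scores.items())  # pinned iteration order
    total = math.fsum(value for _, value in items)  # order-independent summation
    canonical = json.dumps(items, separators=(",", ":")).encode("utf-8")
    return {
        "metric_version": METRIC_VERSION,
        "input_sha256": hashlib.sha256(canonical).hexdigest(),
        "total": round(total, 6),  # fixed reporting precision
    }


first = score_report({"b.py": 0.1, "a.py": 0.2, "c.py": 0.3})
second = score_report({"c.py": 0.3, "a.py": 0.2, "b.py": 0.1})

# Same inputs in a different insertion order yield an identical report.
assert first == second
```

A fuller version would also record the tool versions used to produce the per-file scores, as the last rule in this category asks.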
6. Construct Validity & Calibration (MEDIUM-HIGH)
- valid-converge-with-accepted-measure: Show convergence with a trusted measure of the construct
- valid-discriminant-not-just-loc: Prove incremental signal beyond LOC / size
- valid-predictive-validity-against-outcome: Show it predicts the real outcome out-of-sample
- valid-beat-the-trivial-baseline: Quote the lift over a dumb baseline
- valid-calibrate-thresholds-to-ground-truth: Derive thresholds from data, not round numbers
- valid-validate-out-of-sample: Use a holdout / temporal split to avoid overfitting the corpus
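A first screen for "LOC in disguise" is to correlate the candidate metric with LOC over a corpus. The sketch below uses synthetic numbers and a hand-written Pearson correlation; `pearson`, `is_size_in_disguise`, and the 0.9 cutoff are hypothetical, and the cutoff in particular is a placeholder that would itself need calibrating against data. A low correlation does not prove incremental signal; it only fails to show redundancy.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def is_size_in_disguise(loc, metric, cutoff: float = 0.9) -> bool:
    """Flag a metric whose values are almost a linear function of LOC."""
    return abs(pearson(loc, metric)) >= cutoff


# Synthetic corpus for illustration only.
loc = [100, 200, 300, 400, 500]
metric_scaled_loc = [10, 20, 30, 40, 50]  # exactly LOC / 10
metric_other = [3, 1, 4, 1, 5]            # not a linear function of LOC

assert is_size_in_disguise(loc, metric_scaled_loc)
assert not is_size_in_disguise(loc, metric_other)
```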
7. Optimization Safety & Anti-Gaming (MEDIUM)
- game-make-cheapest-improvement-the-right-one: Make the cheapest score gain the genuine one
- game-recognize-goodhart-variants: Anticipate regressional / extremal / causal Goodhart
- game-pair-with-guardrail-metrics: Add counter-metrics that veto a regressing "win"
- game-hard-block-construct-violating-wins: Gate on invariants; never use a tradable soft penalty
- game-detect-reward-hacking-with-audits: Spot-audit top scores; watch proxy↔outcome drift
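The difference between a tradable soft penalty and a hard gate can be sketched with the running example. Everything here is hypothetical: the `Candidate` fields, the 0.1 penalty, and the use of `tests_passed` as the behavior-preservation invariant. The sketch assumes `size_before` is positive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    size_before: int  # assumed > 0
    size_after: int
    tests_passed: bool  # stand-in for the behavior-preservation invariant


def soft_score(c: Candidate, penalty: float = 0.1) -> float:
    """Anti-pattern: the penalty can be outweighed by a large enough size gain."""
    gain = (c.size_before - c.size_after) / c.size_before
    return gain - (0.0 if c.tests_passed else penalty)


def gated_score(c: Candidate) -> Optional[float]:
    """Hard gate: a candidate that violates the invariant gets no score at all."""
    if not c.tests_passed:
        return None
    return (c.size_before - c.size_after) / c.size_before


honest = Candidate(size_before=1000, size_after=900, tests_passed=True)
cheat = Candidate(size_before=1000, size_after=0, tests_passed=False)

# Under the soft penalty, deleting everything beats the honest reduction.
assert soft_score(cheat) > soft_score(honest)

# Under the gate, the cheat is not rankable and the honest score is unchanged.
assert gated_score(cheat) is None
assert gated_score(honest) == 0.1
```

Returning `None` rather than a very low number keeps the invalid candidate out of any ranking or average, so an optimizer has nothing to trade against.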
8. Aggregation, Reporting & Adoption (LOW-MEDIUM)
- agg-respect-scale-in-aggregation: Aggregate the way the scale permits (no mean of ordinal)
- agg-report-uncertainty-not-false-precision: Report intervals / bounds, not false precision
- agg-version-the-metric-publicly: Semver + changelog so consumers stay comparable
- agg-ship-reference-impl-and-test-vectors: Publish test vectors so implementations agree
How to Use
- Identify where you are with the Workflow table and open the matching first rule.
- Work the categories top-down: def- and comp- are CRITICAL because a fuzzy construct or an uncomputable ideal makes everything downstream noise or unusable.
- When proposing or critiquing a metric, quote the rule by file path so reviewers can check the reasoning.
- For a new metric, produce a one-page spec naming: construct, proxy, scale + unit + zero, proven properties, determinism guarantees, validity evidence, guardrails, and version (one line per category here).
- See references/_sections.md for ordering rationale and assets/templates/_template.md when adding rules.
Reference Files
| File | Description |
|---|---|
| references/_sections.md | Category definitions, impact levels, and ordering rationale |
| assets/templates/_template.md | Template for adding new rules |
| metadata.json | Discipline, type, and source references |
Related Skills
- same-results-less-code, code-simplifier, complexity-optimizer, knip-deadcode: prescriptive code-reduction skills. This skill supplies the measurement layer they lack: a deterministic, behavior-preserving reduction metric to target and verify.
- algorithmic-complexity-review, computer-science-algorithms: apply existing measures (Big-O). This skill teaches how to design new ones.
- opensearch-function-scoring-algorithms: applied ranking metrics (NDCG, A/B tests). This skill is the foundational methodology beneath its eval- category.