deterministic-metric-design - SKILL.md Agent Skill

name: deterministic-metric-design
description: Inventing deterministic metrics by turning a fuzzy property like 'maintainability', 'risk', or 'how reducible this code is' into a deterministic, computable number an agent can trust and optimize. Covers the path from construct to adoption, including operationalizing the construct, confronting computability limits (Kolmogorov, Rice) with sound proxies, picking the right measurement scale, proving properties (monotonicity, invariance, the Weyuker/Briand axioms), guaranteeing determinism, establishing construct validity (not just LOC in disguise), and hardening against Goodhart-style gaming when an agent optimizes the metric. Trigger when designing, reviewing, or validating a quantitative metric, score, measure, or index, and even when the user doesn't say 'metric' but wants to quantify, score, rank, or measure code/behavior, build a deterministic optimization target, or invent a measure for something previously unquantified (e.g., behavior-preserving codebase-size reduction).

dot-skills Deterministic Metric Design Best Practices

Design metrics that are deterministic, computable, provable, and valid — measures an agent can trust and optimize against without gaming them. The 44 rules across 8 categories take a metric from a fuzzy construct to an adoptable, machine-checkable number: define the construct, confront computability limits with sound proxies, ground it in measurement theory, prove its properties, pin its determinism, validate it empirically, harden it against optimization pressure, and package it for adoption.

A running example threads through every category: a deterministic measure of behavior-preserving codebase-size reduction (shrink the code without changing how the app works). It makes a good stress test because its exact form is provably out of reach: Kolmogorov complexity is uncomputable, and by Rice's theorem every non-trivial semantic property of programs, including equivalence to a given program, is undecidable. The whole craft is therefore building a deterministic, tractable proxy with a proven guarantee.

This is the measurement-design layer that the *-algorithms skills apply (Big-O, NDCG, cyclomatic complexity, MoJoFM) but never teach.

When to Apply

Use this skill when:

Designing a new metric, score, or index — or reviewing someone's proposed metric for rigor
Asked to "quantify", "measure", "score", or "rank" a property that has no agreed measure yet
Building a deterministic optimization target an agent will push on (e.g., reduce code size without changing behavior)
Auditing an existing metric that "feels off" — it suspiciously tracks LOC, jumps between runs, or gets gamed
Turning a research idea or formula into something computable, reproducible, and adoptable

Workflow: Define → Make Computable → Prove → Validate → Harden

The categories are ordered by cascade severity: an upstream mistake poisons everything below it. Work top-down, or jump straight to a category using this table:

If you are…	Start in	First rule
Starting from a fuzzy property	`def-`	def-name-the-latent-construct
Worried the ideal is uncomputable / undecidable	`comp-`	comp-do-not-define-metric-as-uncomputable-ideal
Unsure whether you can average or take ratios	`meas-`	meas-declare-the-scale-type
Claiming the metric behaves a certain way	`prop-`	prop-prove-monotonicity
Getting different numbers between runs	`det-`	det-pin-iteration-and-tie-break-order
Unsure it measures the real thing	`valid-`	valid-discriminant-not-just-loc
Letting an agent optimize the metric	`game-`	game-hard-block-construct-violating-wins
Publishing the metric for others	`agg-`	agg-ship-reference-impl-and-test-vectors

Each reference file is a {category}-{slug}.md containing: WHY it matters, an Incorrect example with the failure annotated, a Correct example with the minimal fix, and a reference. The incorrect/correct examples are metric definitions and procedures, not application code — the contrast is a badly-designed measure versus the fixed one.

Rule Categories by Priority

#	Category	Prefix	Impact	Rules
1	Construct Definition & Operationalization	`def-`	CRITICAL	6
2	Computability & Tractability	`comp-`	CRITICAL	7
3	Measurement-Theoretic Foundations	`meas-`	HIGH	5
4	Proof of Metric Properties	`prop-`	HIGH	6
5	Determinism & Reproducibility	`det-`	HIGH	5
6	Construct Validity & Calibration	`valid-`	MEDIUM-HIGH	6
7	Optimization Safety & Anti-Gaming	`game-`	MEDIUM	5
8	Aggregation, Reporting & Adoption	`agg-`	LOW-MEDIUM	4

See references/_sections.md for the full ordering rationale.

Quick Reference

1. Construct Definition & Operationalization (CRITICAL)

def-name-the-latent-construct — Name the unobservable property before writing any formula
def-separate-construct-from-proxy — Keep construct, proxy, and their assumed link distinct
def-write-falsifiable-operational-definition — Specify the exact procedure that yields the number
def-fix-unit-of-analysis — Pin the unit of analysis and the measurement boundary
def-anchor-to-the-decision — Attach the decision and action threshold the metric drives
def-operationalize-behavior-and-size — Define "behavior" (≈) and "size" so a formatter can't move them
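
One way to make "size" immune to a formatter is to count lexical tokens instead of lines or characters. The sketch below assumes the measured code is Python and uses the standard `tokenize` module; the name `token_size` and the choice of ignored token types are assumptions of this example, not part of the skill. It does not normalize redundant parentheses or other semantically neutral rewrites.

```python
import io
import tokenize

# Token types treated as layout or commentary rather than content.
# This selection is an assumption of the sketch.
_IGNORED = frozenset({
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
})


def token_size(source: str) -> int:
    """Return the number of significant tokens in Python source text."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in _IGNORED)


# Same program, different whitespace and comments: the size must not move.
compact = "def f(x):\n    return x+1\n"
spaced = "def f( x ):\n\n    # add one\n    return x + 1\n"
assert token_size(compact) == token_size(spaced)
```

Counting tokens gives the measure a named unit (tokens) and a true zero (the empty file), which later scale and ratio claims depend on.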

2. Computability & Tractability (CRITICAL)

comp-do-not-define-metric-as-uncomputable-ideal — Don't define the metric as Kolmogorov complexity
comp-respect-rices-theorem-for-semantic-properties — Use sound approximations for undecidable semantic facts
comp-choose-a-decidable-observational-equivalence — Replace undecidable equivalence with a checkable ≈
comp-design-a-proxy-with-a-proven-error-direction — Give the proxy a sound bound that never over-states
comp-keep-the-metric-tractable — Pick a near-linear proxy, not an NP-hard optimum
comp-bound-approximation-error-explicitly — Quantify and report the proxy↔ideal gap
comp-prefer-monotone-confluent-transformations — Confluent, terminating rewrites give a unique fixed point
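
As an illustration of a computable proxy with a known error direction: the length of a losslessly compressed input is an upper bound on its Kolmogorov complexity, up to an additive constant for the decompressor. The sketch below uses `zlib` as one possible compressor; the function name `compressed_size` is hypothetical. The bound can overestimate the ideal but, up to that constant, cannot underestimate it.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input at the highest level."""
    return len(zlib.compress(data, 9))


# Highly repetitive input compresses far below its raw length.
repetitive = b"abc" * 1000
assert compressed_size(repetitive) < len(repetitive)

# Repeated calls on the same input and the same zlib build agree.
assert compressed_size(repetitive) == compressed_size(repetitive)
```

The compressed bytes can differ between zlib builds and compression levels, so a metric built this way would need the compressor and its version pinned and recorded.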

3. Measurement-Theoretic Foundations (HIGH)

meas-declare-the-scale-type — Declare nominal/ordinal/interval/ratio before any statistic
meas-only-admissible-statistics — Use only statistics invariant under the scale's transforms
meas-establish-meaningful-zero-and-unit — Give a true zero and a named unit for ratio claims
meas-preserve-the-empirical-relation — Verify the metric orders known anchor cases correctly
meas-avoid-ad-hoc-weighted-sums — Don't sum incommensurable scales with arbitrary weights
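
A minimal sketch of enforcing admissible statistics in code, assuming the classic four-level scale typology (mode for nominal, median for ordinal, arithmetic mean for interval, geometric mean for ratio). The names `Scale` and `summarize` are hypothetical. `median_low` is used so that an ordinal median is always one of the observed values and never an average of two ranks.

```python
from enum import IntEnum
from statistics import geometric_mean, mean, median_low, mode


class Scale(IntEnum):
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Statistic name -> (weakest scale that admits it, implementation).
_STATISTICS = {
    "mode": (Scale.NOMINAL, mode),
    "median": (Scale.ORDINAL, median_low),
    "mean": (Scale.INTERVAL, mean),
    "geometric_mean": (Scale.RATIO, geometric_mean),
}


def summarize(values, scale: Scale, statistic: str):
    """Compute a statistic only if the declared scale admits it."""
    required, func = _STATISTICS[statistic]
    if scale < required:
        raise ValueError(
            f"{statistic} is not admissible on a {scale.name.lower()} scale"
        )
    return func(values)


severity_ranks = [1, 2, 2, 3]  # ordinal data: order matters, distances do not
assert summarize(severity_ranks, Scale.ORDINAL, "median") == 2
```

Requesting `"mean"` on the same ordinal data raises `ValueError` instead of silently returning a number with no meaning.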

4. Proof of Metric Properties (HIGH)

prop-prove-monotonicity — Prove the score moves the right way when the construct does
prop-prove-invariance-under-irrelevant-transforms — Prove invariance to renaming and formatting
prop-ensure-sensitivity-to-relevant-change — Ensure it still discriminates (no saturation)
prop-check-weyuker-briand-axioms — Check the published axioms for your measure type
prop-prove-boundedness-and-handle-empty — Prove the range; define the empty / zero-denominator case
prop-prove-or-disclaim-composability — Prove additivity before aggregating, or refuse to sum
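
Properties such as boundedness, monotonicity, and the empty case can be written down as executable checks next to the definition. The sketch below assumes a hypothetical score `reduction_ratio = (before - after) / before` and checks it exhaustively over a small domain. Such a check supports a proof but does not replace one; returning 0 for an empty baseline is a convention chosen for this example.

```python
from fractions import Fraction
from itertools import product


def reduction_ratio(size_before: int, size_after: int) -> Fraction:
    """Fraction of size removed, as an exact rational in (-inf, 1]."""
    if size_before < 0 or size_after < 0:
        raise ValueError("sizes must be non-negative")
    if size_before == 0:
        # Empty baseline: defined as 0 instead of dividing by zero.
        return Fraction(0)
    return Fraction(size_before - size_after, size_before)


def check_properties(max_size: int = 30) -> bool:
    """Exhaustively check the upper bound and monotonicity on a small grid."""
    for before, after in product(range(max_size + 1), repeat=2):
        score = reduction_ratio(before, after)
        if score > 1:
            return False  # upper bound violated
        if before > 0 and after > 0:
            # Shrinking the result by one unit must strictly raise the score.
            if reduction_ratio(before, after - 1) <= score:
                return False
    return True


assert check_properties()
```

Using `Fraction` keeps the check exact, so a failure would point to the definition and not to floating-point rounding.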

5. Determinism & Reproducibility (HIGH)

det-make-the-metric-a-pure-function — No hidden time, network, or global state
det-pin-iteration-and-tie-break-order — Sort by a total key; seed any randomness
det-pin-the-input-representation — Fix exactly which representation (AST stage) you measure
det-control-floating-point-and-accumulation — Fix summation order and rounding precision
det-version-and-record-the-toolchain — Emit metric version, tool versions, and input hash
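
The determinism rules above can be sketched together in one small function, under the assumption that the input is a mapping from file path to a numeric size. The names `measure` and `METRIC_VERSION` are hypothetical. Iteration order is pinned by sorting on the key, accumulation uses `math.fsum`, and the result carries the metric version, interpreter version, and a hash of the canonical input.

```python
import hashlib
import json
import math
import platform
from typing import Mapping

METRIC_VERSION = "0.1.0"  # hypothetical version string for this sketch


def measure(sizes: Mapping[str, float]) -> dict:
    """Pure function of its argument: no clock, network, or global state."""
    # Sort by key so the result never depends on insertion order.
    ordered = sorted(sizes.items())
    # Canonical serialization of the input, hashed for the audit record.
    canonical = json.dumps(ordered, separators=(",", ":")).encode("utf-8")
    return {
        "metric_version": METRIC_VERSION,
        "python_version": platform.python_version(),
        "input_sha256": hashlib.sha256(canonical).hexdigest(),
        # fsum tracks partial sums exactly, avoiding order-dependent rounding.
        "value": math.fsum(size for _, size in ordered),
    }


a = measure({"a.py": 0.1, "b.py": 0.2, "c.py": 0.3})
b = measure({"c.py": 0.3, "a.py": 0.1, "b.py": 0.2})
assert a == b
```

Two runs that disagree on `value` can then be separated into "different input" (hash differs) and "different toolchain" (version fields differ).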

6. Construct Validity & Calibration (MEDIUM-HIGH)

valid-converge-with-accepted-measure — Show convergence with a trusted measure of the construct
valid-discriminant-not-just-loc — Prove incremental signal beyond LOC / size
valid-predictive-validity-against-outcome — Show it predicts the real outcome out-of-sample
valid-beat-the-trivial-baseline — Quote the lift over a dumb baseline
valid-calibrate-thresholds-to-ground-truth — Derive thresholds from data, not round numbers
valid-validate-out-of-sample — Use a holdout / temporal split to avoid overfitting the corpus
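
One common way to test for signal beyond LOC is a first-order partial correlation between the metric and the outcome with LOC held fixed. The sketch below is a plain-Python version under that assumption; the function names are hypothetical, the data is synthetic and only exercises the arithmetic, and a real validation would run on a held-out sample.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples of at least 2 points")
    mean_x, mean_y = math.fsum(x) / n, math.fsum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    spread = math.sqrt(math.fsum(a * a for a in dx) * math.fsum(b * b for b in dy))
    if spread == 0:
        raise ValueError("a constant sample has no defined correlation")
    return math.fsum(a * b for a, b in zip(dx, dy)) / spread


def partial_corr_beyond_loc(metric, outcome, loc) -> float:
    """Correlation of metric and outcome after removing what LOC explains."""
    r_mo = pearson(metric, outcome)
    r_ml = pearson(metric, loc)
    r_ol = pearson(outcome, loc)
    unexplained = (1 - r_ml ** 2) * (1 - r_ol ** 2)
    if unexplained < 1e-12:
        # Metric or outcome is a linear function of LOC: nothing is left over.
        return 0.0
    return (r_mo - r_ml * r_ol) / math.sqrt(unexplained)


loc = [1, 2, 3, 4]
loc_in_disguise = [2, 4, 6, 8]  # exactly 2 * LOC
outcome = [1, 3, 2, 5]
assert partial_corr_beyond_loc(loc_in_disguise, outcome, loc) == 0.0
```

A metric that is LOC in disguise scores zero here however strongly it correlates with the outcome on its own.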

7. Optimization Safety & Anti-Gaming (MEDIUM)

game-make-cheapest-improvement-the-right-one — Make the cheapest score gain the genuine one
game-recognize-goodhart-variants — Anticipate regressional / extremal / causal Goodhart
game-pair-with-guardrail-metrics — Add counter-metrics that veto a regressing "win"
game-hard-block-construct-violating-wins — Gate on invariants; never use a tradable soft penalty
game-detect-reward-hacking-with-audits — Spot-audit top scores; watch proxy↔outcome drift
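
The difference between a hard gate and a tradable soft penalty can be shown with two candidates. This is a sketch with hypothetical names (`Candidate`, `soft_score`, `best_gated`) and made-up numbers, assuming a separate check has already decided whether behavior was preserved.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    reduction: float          # fraction of size removed
    behavior_preserved: bool  # result of a separate equivalence check


def soft_score(c: Candidate, penalty: float = 0.5) -> float:
    """Soft penalty: a large enough reduction can buy back a violation."""
    return c.reduction - (0.0 if c.behavior_preserved else penalty)


def best_gated(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Hard gate: violating candidates are never ranked at all."""
    admissible = [c for c in candidates if c.behavior_preserved]
    if not admissible:
        return None
    # Tie-break on name so the winner is deterministic.
    return max(admissible, key=lambda c: (c.reduction, c.name))


candidates = [
    Candidate("delete-everything", 1.0, False),
    Candidate("remove-dead-code", 0.3, True),
]

# The soft penalty is outbid by the degenerate "win".
assert max(candidates, key=soft_score).name == "delete-everything"
# The gate cannot be traded against, so the genuine improvement wins.
assert best_gated(candidates).name == "remove-dead-code"
```

No finite penalty fixes the soft version in general; an optimizer only needs a reduction large enough to exceed it.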

8. Aggregation, Reporting & Adoption (LOW-MEDIUM)

agg-respect-scale-in-aggregation — Aggregate the way the scale permits (no mean of ordinal)
agg-report-uncertainty-not-false-precision — Report intervals / bounds, not false precision
agg-version-the-metric-publicly — Semver + changelog so consumers stay comparable
agg-ship-reference-impl-and-test-vectors — Publish test vectors so implementations agree

How to Use

Identify where you are with the Workflow table and open the matching first rule.
Work the categories top-down. def- and comp- are CRITICAL because a fuzzy construct turns everything downstream into noise, and an uncomputable ideal leaves it unusable.
When proposing or critiquing a metric, quote the rule by file path so reviewers can check the reasoning.
For a new metric, produce a one-page spec naming the construct, proxy, scale + unit + zero, proven properties, determinism guarantees, validity evidence, guardrails, and version: roughly one line per category above.
See references/_sections.md for ordering rationale and assets/templates/_template.md when adding rules.
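
The one-page spec could also be kept as a small machine-checkable record. The sketch below is one possible shape, not a format the skill defines: the `MetricSpec` class, its field names, and the example values are all hypothetical. It flags any line of the spec that is still empty.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class MetricSpec:
    """One field per line of the spec (hypothetical field names)."""
    construct: str
    proxy: str
    scale_unit_zero: str
    proven_properties: str
    determinism_guarantees: str
    validity_evidence: str
    guardrails: str
    version: str


def missing_fields(spec: MetricSpec) -> List[str]:
    """Names of spec lines that are blank and still need to be filled in."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# Illustrative values only; validity evidence is deliberately left blank.
draft = MetricSpec(
    construct="behavior-preserving codebase-size reduction",
    proxy="significant-token count before vs. after",
    scale_unit_zero="ratio scale, unit = token, zero = empty file",
    proven_properties="bounded above by 1; invariant to formatting",
    determinism_guarantees="pure function; sorted iteration; pinned parser",
    validity_evidence="",
    guardrails="hard gate on the behavior check",
    version="0.1.0",
)
assert missing_fields(draft) == ["validity_evidence"]
```

A check like this can block publication of a metric until every category has at least a one-line answer.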

Reference Files

File	Description
references/_sections.md	Category definitions, impact levels, and ordering rationale
assets/templates/_template.md	Template for adding new rules
metadata.json	Discipline, type, and source references

Related Skills

same-results-less-code, code-simplifier, complexity-optimizer, knip-deadcode — prescriptive code-reduction skills. This skill supplies the measurement layer they lack: a deterministic, behavior-preserving reduction metric to target and verify.
algorithmic-complexity-review, computer-science-algorithms — apply existing measures (Big-O). This skill teaches how to design new ones.
opensearch-function-scoring-algorithms — applied ranking metrics (NDCG, A/B tests). This skill is the foundational methodology beneath its eval- category.