ice-score - SKILL.md Agent Skill

name: ice-score description: "Use to prioritize solutions or opportunities using ICE scoring with evidence-backed confidence." metadata: instruction_budget: "35" framework_dependency: "mycelium" framework_dependency_note: "This skill is designed to run within the Mycelium framework (https://github.com/haabe/mycelium). Standalone use will skip the canvas state, theory gates, and harness behavior the skill assumes. Install: /plugin install mycelium@haabe/mycelium."

ICE Score Skill

ICE scoring with integrated confidence meter.

Preflight: Read target canvas file(s) before any Write/Edit

Hard rule. Before issuing Write or Edit against any .claude/canvas/*.yml, use the Read tool on that file in this session. Claude Code's Read-before-Write check requires the Read tool specifically — cat/head/grep via Bash do NOT satisfy it.

Edit vs Write — different cost profiles (verified 2026-05-14):

Edit (exact-string replacement): Read with limit: 1 satisfies the check at ~50 tokens. State-tracking is per-file, not per-byte — subsequent Edit calls work anywhere in the file. Use this for partial updates against large canvas files (e.g., purpose.yml at 800+ lines).
Write (full replacement): do a full Read first. Write obliterates the file; you should see what you're about to replace. The limit:1 shortcut is not appropriate here.

ID-bearing entries — scan the ID space before assigning (added 2026-05-15, v0.23.19): When adding a new component, opportunity, solution, or any other ID-bearing entry to a canvas file, run a Bash grep first to confirm the next ID in your prefix sequence is actually free:

grep "^  - id: <prefix>-" .claude/canvas/<file>.yml | sort -u

Replace <prefix> with the canvas's ID prefix (comp for landscape, opp for opportunities, sol for solutions, ht for human-tasks, etc.). Then pick the next free integer. validate_canvas.py has a duplicate-ID check (lines 230-239) that catches the failure on CI, but a duplicate can persist in the working tree for days if CI isn't run between edit and discovery — see roadmap-repo corrections.md 2026-05-15 "Duplicate canvas ID created in landscape.yml" for the worked example.

Original failure mode: anti-pattern #7 instance #5, 2026-05-09 — agent conflated Bash head with the Read tool, lost ~14k tokens to a Write-fail → remedial-full-Read → re-Write loop. The limit:1 discipline (graduated 2026-05-14, v0.23.18) prevents the second-order cost where the agent correctly follows the rule but full-Reads every time. The ID-scan discipline (graduated 2026-05-15, v0.23.19) prevents the related class where the agent reads enough of the file to satisfy the Edit check but not enough to see existing ID assignments — kin to anti-pattern #8 (Stale State Read).

If this skill writes to multiple canvas files, register each one first (limit:1 for Edit-only paths; full Read for Write paths) AND ID-scan any prefix you intend to assign.

See CLAUDE.md Canvas writes — Read before Write for the canonical rule.

Workflow

ICE scoring is applied to OST solution leaves. Each leaf must have a Four Risks assessment (Torres Product Trio) before it can be scored — the risks are the inputs, ICE is the output.

Step 1: Verify Four Risks exist

For each solution leaf in .claude/canvas/opportunities.yml, check that four_risks has been assessed:

Value (product lens): Do users want/need this?
Usability (design lens): Can users figure it out?
Feasibility (engineering lens): Can we build it within constraints?
Viability (cross-cutting): Does it align with business/legal/ethical constraints?

If any risk dimension is missing, assess it first. Each dimension must have its own evidence — a single combined statement fails (Torres Product Trio rule).

Step 2: Derive ICE from the Four Risks

For each solution, score three dimensions (1-10):

Impact — derived from value + usability + viability risks:

What does the value risk assessment say about user demand?
What does the usability assessment say about adoption likelihood?
What does the viability assessment say about business alignment?
1-3: Low value evidence, high usability risk, or poor viability fit
4-6: Moderate value evidence, some usability unknowns
7-10: Strong value evidence, low usability risk, clear viability

Confidence — how well-tested are the risk assessments?

Scale note: Mycelium uses 0.0-1.0. Gilad's original Confidence Meter uses 0-10 non-linear (0.01=opinion, 1=anecdotal, 5=market data, 8=A/B test, 10=launch data). The non-linear penalty is preserved through evidence-class weighting.

1-3: Risk assessments based on gut feel or desk research
4-6: Some direct evidence (interviews, analogues, small tests)
7-10: Strong evidence (user tests, prototypes, data). Requires test-validated evidence.

Ease — derived from feasibility risk:

What does the engineering assessment say about complexity?
What dependencies, unknowns, or technical spikes were identified?
1-3: High feasibility risk. Major unknowns, new capabilities needed.
4-6: Moderate feasibility risk. Known patterns, some complexity.
7-10: Low feasibility risk. Quick win, existing patterns.

Step 3: Calculate and rank

ICE = Impact x Confidence x Ease. Rank solution leaves by ICE score.

Step 4: Identify riskiest assumptions

For the top-ranked solutions, extract the highest-risk assumptions from the Four Risks assessment — these are what /mycelium:assumption-test should target first. Prioritize assumptions where importance is high but evidence is low.

Step 5: Bias check

Are high-scoring items benefiting from availability bias, IKEA effect, anchoring, or hyperbolic discounting (overweighting near-term Impact at the expense of longer-horizon outcomes)? Review for bias after scoring, before acting. Source for hyperbolic discounting: Laibson, "Golden Eggs and Hyperbolic Discounting" (1997) — when the Impact dimension is rewarded for "what we can ship this sprint," delayed-payoff opportunities get systematically under-scored.

Output

Lead with the recommendation (Hick's Law, per harness/design-principles.md — a flat N-row ranking manufactures decision-tax; graduated from two consecutive /framework-health 4e flags, 2026-06-05 + 2026-06-12):

**Recommended: [top solution] (ICE [score])** — [one line: why it leads + the single riskiest assumption to test first]

| Solution | Value | Usability | Feasibility | Viability | I | C | E | ICE | Riskiest Assumption |
|----------|-------|-----------|-------------|-----------|---|---|---|-----|---------------------|
| ...      | risk  | risk      | risk        | risk      | X | X | X | XXX | [what to test]      |

If the top two scores are within noise of each other (see Noise Check below), say so instead of manufacturing a false leader — "too close to call on current evidence; [X] would separate them" is a legitimate recommendation.

Rules

Every ICE score must trace back to a Four Risks assessment — no scoring without risk evaluation first.
Each risk dimension must have separate evidence (Product Trio rule).
Confidence of 7+ requires test-validated evidence, not analogy or assumption.
If all items score similarly, the risk assessments need more depth.
Review for bias after scoring, before acting.
No OST leaf advances to selected without an ICE score. A solution leaf may be sketched without ICE while still a candidate. The moment it is picked up for delivery (cycle opens, diamond progresses Define→Develop with this leaf as the chosen solution), ice_score on the solution entry in opportunities.yml must be non-zero. This is what makes the calibration dimension measurable later — a cycle that ships without a prior ICE prediction is a permanent dark cell in cycle-history.yml. See engine/cycle-learning.md#cycle-class for how this propagates into cycle-history.yml.predicted.ice_score.

Canvas Output

Update .claude/canvas/opportunities.yml — write four_risks and ice_score per solution leaf. Update .claude/canvas/gist.yml with idea ICE scores and confidence levels.

Theory Citations

Ellis: ICE scoring (Impact × Confidence × Ease — Sean Ellis invented the ICE framework as part of growth methodology)
Gilad: Evidence-Guided (Confidence Meter — the evidence-grading layer on top of Ellis's ICE, mapping evidence types to confidence levels)
Torres: Continuous Discovery Habits (OST leaves as unit of evaluation)
Torres: Product Trio (three perspectives feeding risk assessment)
Cagan: Inspired (Four Risks as structured evaluation)
Kahneman: Thinking, Fast and Slow (bias in estimation)
Kahneman, Sibony & Sunstein: Noise (2021) — unwanted variability in judgment (see Noise Check below)

Noise Check (Kahneman, Sibony & Sunstein, 2021)

ICE scores are susceptible to noise — unwanted variability where different sessions or assessors produce different scores for the same evidence. Unlike bias (systematic skew in one direction), noise is random scatter.

Detection: Re-score the same evidence independently (different session or assessor) and compare. If scores diverge by >1 point on any dimension, investigate why before proceeding.

Mitigation: Use structured assessment criteria (the Four Risks inputs above), apply scores independently before discussion, and anchor to evidence types rather than gut feel.

Noise audit procedure: Score the same evidence independently twice (different session or different assessor). If ICE scores diverge by >1 point on any dimension, the gap is noise — investigate the scoring criteria before proceeding. For solo developers: re-score after a 24h break to detect temporal noise.