name: h-verify description: | Verifies that a recorded DecisionRecord still holds — baseline-vs-measure evidence loop with drift detection per FPF Evidence Decay. Make sure to use this skill whenever the user asks "did dec-X work", "is decision Y still valid", "did the prediction come true", "check if the migration held", "is X stale", "measure that decision against reality", "did we actually fix Z", "is our caching decision still right" — or whenever a shipped decision needs a post-implementation reality-check before further work relies on it. Also use when /h-status surfaces a refresh-due decision. NOT for ad-hoc sanity checks (just run the tests directly). NOT for re-framing the underlying problem (use h-frame). when_to_use: | A shipped DecisionRecord needs reality check, OR refresh-due artifact surfaced. Skip for one-off sanity checks where you can just run the test. argument-hint: "[decision-ref or 'what's stale' for full project verification]" allowed-tools: Bash Read Grep Glob mcp__haft__haft_decision mcp__haft__haft_query mcp__haft__haft_refresh
h-verify — Verify a decision still holds
You are running the FPF verification loop: baseline → measure → evidence → record. Drift detection compares current state against baselined affected_files; evidence decay reports surface when valid_until passes; measure verdict is recorded for the predictions the decide step declared.
Step 1 — Identify the decision
If decision_ref is given, use it. Otherwise:
mcp__haft__haft_query(action="status")— surfaces stale/refresh-due decisionsmcp__haft__haft_query(action="list", kind="DecisionRecord")— full list- Ask the operator which decision to verify
Step 2 — Read the decision's predictions
mcp__haft__haft_query(action="search", query="<decision_ref>") returns the DecisionRecord including its predictions field. Each prediction has:
claim— the falsifiable statementobservable— what to measurethreshold— pass/fail boundaryverify_after— when async evidence should be available (if any)
If predictions are empty (the decision was recorded tactical with _skips: ["predictions"]), there's nothing to measure — report that to operator and recommend either:
/h-refreshaction=reopen to add predictions and re-decide properly- Just attach evidence directly via
haft_decision(action="evidence", ...)
Step 3 — Baseline (if drift detection wanted)
If the decision has affected_files and you want drift comparison:
mcp__haft__haft_decision(
action="baseline",
decision_ref="<dec-...>"
// affected_files optional — kernel uses the decision's list
)
The kernel snapshots file content hashes. Subsequent comparisons detect drift. Call once after each commit cycle if you want continuous drift signal.
Step 4 — Gather evidence per prediction
For each prediction:
- Run the observable (test, metric query, log scan, code grep)
- Compare to threshold
- Capture the actual measurement value
Tools available depending on the observable:
Bashfor test runners, metric queries, log scansRead/Grep/Globfor code-level invariant checks- For external metrics: kernel has no special integration; agent describes the source
Step 5 — Attach evidence to the artifact
For each material evidence item:
mcp__haft__haft_decision(
action="evidence",
artifact_ref="<dec-...>",
evidence_type="measurement | test | research | benchmark | audit",
evidence_content="<what you observed, with concrete numbers>",
evidence_verdict="supports | weakens | refutes",
carrier_ref="<file path or URL where the evidence lives>",
claim_refs=["<prediction id or scope label>"],
congruence_level=3, // 3=same context, 2=similar, 1=different, 0=opposed
valid_until="<RFC3339 or YYYY-MM-DD — when this evidence expires>",
causal_support_basis="observational | interventional | realized_counterfactual | identified_estimate | simulation_only"
)
congruence_level (CL) defaults per FPF B.3.5:
- 3: same-context evidence (own production system, own tests)
- 2: similar-context (related project, similar load)
- 1: different-context (external docs, vendor benchmarks)
- 0: opposed-context (rare; conflicting framework)
CL impacts R_eff per FPF B.3:3 — never average across CL.
Step 6 — Record the measurement verdict
After all evidence is attached, record the overall verdict:
mcp__haft__haft_decision(
action="measure",
decision_ref="<dec-...>",
verdict="accepted | partial | failed",
findings="<what actually happened compared to predictions>",
measurements=["p99 latency: 42ms (predicted <50ms — accepted)", "..."],
criteria_met=["<criterion that was met>"],
criteria_not_met=["<criterion that was NOT met>"]
)
Kernel ties the verdict back to the predictions and surfaces:
- Accepted → decision health remains good
- Partial → some predictions held, some didn't → consider reopen or supersede
- Failed → decision invalidated → consider supersede or rollback per the decision's rollback spec
If any verified prediction carried a probability forecast (set at /h-decide),
the measure response also appends a Calibration read: the decomposed-Brier
profile (Brier = reliability − resolution + uncertainty) over all verified
forecasts, plus a directional over/under-confidence bias. Below ~15 accumulated
forecasts it reports cold-start and is not yet actionable — surface it to the
operator but do not over-read a sparse profile.
Step 7 — Handle stale or drifted decisions
If verification reveals:
- Evidence decayed (valid_until passed):
mcp__haft__haft_refresh(action="waive", artifact_ref=..., evidence="<new evidence>", new_valid_until="...")to extend validity, ORaction=reopento start a new problem cycle - Drift detected (affected_files changed since baseline): classify drift as cosmetic / incidental / material via
haft_query(action="status")and decide whether to re-baseline or reopen - Verdict failed:
mcp__haft__haft_refresh(action="supersede", artifact_ref=<old>, new_artifact_ref=<replacement>)after recording the replacement decision via/h-decide
Step 8 — Present to operator
Surface:
- Predictions vs actual measurements
- Verdict (accepted / partial / failed)
- Evidence attached with CL
- Drift status if baseline existed
- Recommended next action (waive / reopen / supersede / nothing — decision still good)
Re-grounding discipline (FPF A.7). When you reference decision IDs
(dec-20260525-...), prediction labels, or evidence refs in the verdict
summary and recommendation paragraphs, pair each with its
human-readable title or claim — dec-20260525-abc (NATS over Kafka for ops simplicity) — verdict accepted not bare dec-20260525-abc verdict accepted. Bare IDs accumulate cognitive debt across long sessions. Keep
IDs for traceability but never let them stand alone in summaries. See
CLAUDE.md Critical Reminders for the project-wide rule.
What NOT to do
- Do NOT call
action="measure"without first gathering evidence — kernel rejects measure-from-memory (the protocol requires evidence before verdict). - Do NOT use congruence_level=3 for evidence that came from a different context (vendor benchmark, external doc). Misclassifying CL inflates R_eff and corrupts trust signal.
- Do NOT skip baseline if the decision has affected_files and drift matters — without baseline, drift is invisible.
- Do NOT silently change a decision's verify_after to "kick the can". Use
action="waive"withevidenceparameter so the extension is auditable. - Do NOT mark verdict=accepted when predictions were skipped — verify what was declared, not what was hoped.
FPF spec references
- B.3 — Trust & Assurance Calculus (F-G-R + CL)
- B.3.3 — Assurance Subtypes & Levels (L0/L1/L2 + TA/VA/LA)
- B.3.4 — Evidence Decay & Epistemic Debt
- A.10 — Evidence Graph Referring
- VER-01 (evidence graph), VER-02 (decay), VER-03 (R_eff), VER-07 (refresh triggers)
- C.27 — Temporal Claim Adequacy (state reading vs trend vs intervention claim)
Look up via mcp__haft__haft_query(action="fpf", query="B.3").