name: review-hypotheses description: "Resolves open hypotheses whose horizons have elapsed, extracts lessons from each outcome, calculates per-category accuracy stats, and generates a calibration report. Use whenever the aspirations loop hits the hypothesis-review cadence, the user says "how accurate are my predictions" or "review my hypotheses", or enough hypotheses have reached their horizon to warrant a batch resolve. Modes: --resolve, --learn, --accuracy-report, --full-cycle, --category-comparison." user-invocable: false triggers:
- "/review-hypotheses" parameters:
- name: mode description: "--resolve, --learn, --accuracy-report, --full-cycle, --category-comparison" required: false execution_history: total_invocations: 0 outcome_tracking: successful: 0 unsuccessful: 0 success_rate: 0.0 last_invocation: null known_pitfalls: [] reconsolidation_trigger: "After 10 invocations with declining success rate, trigger skill review" conventions: [pipeline, aspirations, tree-retrieval, reasoning-guardrails, pattern-signatures] minimum_mode: assistant revision_id: "skill-bootstrap-review-hypotheses-9490ec" previous_revision_id: null
/review-hypotheses — Hypothesis Review & Resolution Engine
Two-phase design: resolve (detect outcomes, move records, record results) and learn (reflect on outcomes, extract patterns). These phases are separated so that /boot can resolve without triggering learning, and /aspirations goals can learn from freshly resolved data without re-checking resolution sources.
Parameters
--resolve— Check resolution status, move active→resolved, record outcomes. Does NOT call /reflect. Setsreflected: falseon each record.--learn— Find resolved records withreflected: false, call/reflect --on-hypothesisfor each, setreflected: true.--accuracy-report— Generate accuracy statistics across all resolved hypotheses--full-cycle—--resolve+--learn+--accuracy-report+/reflect --full-cycle+ spark check--category-comparison <cat1> <cat2>— Compare accuracy between two categories--hypothesis <id>— Check a specific hypothesis- When called as a goal's skill (from aspirations loop), returns one of:
CONFIRMED— hypothesis confirmed (move pipeline file to resolved/)CORRECTED— hypothesis disconfirmed (move pipeline file to resolved/)PENDING— not enough evidence yet (goal stays pending for retry)EXPIRED— past resolves_by deadline (move pipeline file to archived/)
- When called as a goal's skill (from aspirations loop), returns one of:
- No args → default to
--resolve
Step 0: Load Conventions
Step 0: Load Conventions — Bash: load-conventions.sh with each name from the conventions: front matter. Read only the paths returned (files not yet in context). If output is empty, all conventions already loaded — proceed to next step.
Mode 1: Resolve (--resolve)
Detects which active hypotheses have resolved, records outcomes, moves records, and updates the memory tree. Does NOT trigger reflection or learning — that is --learn's job.
Step 1: Load Hypotheses to Check
# Step 1.0: Pre-scan discovered-stage records orphaned past resolves_by (g-115-1629).
# review-hypotheses historically loaded ONLY active + measurement-pending, so
# discovered records past resolves_by were invisible -- never resolved, never
# fed accuracy stats (63 of 193 orphaned at filing, oldest >2mo). This sweep
# EXPIRES clearly-unresolvable ones (short/session horizon past the observation
# window -> archived UNRESOLVABLE, gate-exempt) and PROMOTES well-formed
# evaluable ones to active so the resolution loop below catches them THIS run.
# Direct py -3 (not a bash wrapper) per rb-225/rb-247.
Bash: py -3 core/scripts/hypothesis-discovered-overdue-sweep.py --apply --output json
# Parse {expired, promoted, needs_judgment}. `promoted` records are now
# stage=active and WILL appear in the active load below (resolved this run).
# `needs_judgment` (under-formed, recently overdue) are SURFACED, not auto-
# resolved: for each, if evaluable, synthesize a claim from the position
# (guard-798: leaving discovered needs claim>=20 + a resolution method), then
# resolve/expire with judgment; else leave for the next cycle.
# Primary source: all active hypotheses
Bash: pipeline-read.sh --stage active
# Also re-probe measurement-pending hypotheses (g-115-465 / rb-754) — these are
# shelved waiting for measurement infrastructure to appear. Re-checking the channel
# every cycle is cheap and lets them resolve as soon as data shows up. A re-probed
# measurement-pending record now past its hard resolves_by deadline with the channel
# STILL empty is EXPIRED by Step 2.6a (g-115-1584 / rb-2122) — without that pass the
# stage was a one-way shelf with no auto-expiry that grew unbounded (g-001-02 had to
# hand-archive 35 past-deadline records).
Bash: pipeline-read.sh --stage measurement-pending
APPEND to candidate list
# Horizon filter: micro-hypotheses never enter this pipeline.
# Session-horizon hypotheses use self-check verification (Step 2 handles this).
# Filter OUT any records with horizon: micro (shouldn't exist in pipeline, but defensive).
Filter out records where horizon == "micro"
Sort by resolves_no_earlier_than (soonest first), then end_date for legacy records
Step 1.5: Load Domain Context
Before checking resolution, load background knowledge for each category represented.
Collect unique categories from candidate hypotheses
For each unique category:
Bash: retrieve.sh --category {batch_category} --depth medium
# Returns unified JSON with all data stores. Retrieval counters already incremented.
Cache result — reuse for all hypotheses in same category
Use retrieved context to:
- Inform resolution interpretation (understand domain before judging outcome)
- Populate context_consulted manifest on each hypothesis record during Step 2
- Record: tree_nodes_read, pattern_signatures_checked, articles_read
- Leave context_gaps_identified empty (populated later by /reflect)
Step 2: Check Each Hypothesis
For each active hypothesis:
0. Resolution time filter (token-saving skip):
current_time = current date and time (ET)
# Horizon-aware resolution timing:
# - session: resolves_by may be "session_end" — always check if current session
# - short/long: use resolves_no_earlier_than or end_date as before
# - micro: should never be here (filtered in Step 1)
IF hypothesis.horizon == "session":
IF hypothesis.resolves_by == "session_end":
threshold = null # always eligible for resolution check
ELSE:
threshold = hypothesis.resolves_by (if timestamp) or null
ELSE: # short, long, or unset (default to short behavior)
Determine resolution threshold:
IF hypothesis.resolves_no_earlier_than exists:
threshold = resolves_no_earlier_than
ELSE IF hypothesis.end_date OR hypothesis.source_data.end_date exists:
threshold = end_date + 12 hours (legacy fallback)
ELSE:
threshold = null (no date info — proceed with resolution check)
IF threshold is not null AND threshold > current_time:
SKIP this hypothesis — do NOT check resolution status
Log: "Skipped {slug} — resolves no earlier than {threshold}"
Add to resolve_result.skipped_not_due
Continue to next hypothesis
Note: --hypothesis <id> flag bypasses this filter (always checks)
1. Check resolution date:
- Past due? → Likely resolved, check outcome
- Within 48 hours? → Flag as "resolving soon"
- Session horizon with resolves_by: session_end? → Always check
2. Determine resolution status using horizon-appropriate methods:
# Session-horizon: self-check verification (lightweight)
IF hypothesis.horizon == "session":
IF hypothesis.verification == "self_check":
Evaluate the hypothesis outcome using available local state:
- File existence checks
- State file inspection
- Result of the action that was taken
- Inline reasoning about the outcome
If outcome determinable: record it and proceed to Step 3
If outcome NOT determinable: skip (will retry next session)
ELSE:
Fall through to standard methods below
# Short/long horizon: full external verification (in priority order)
a. Domain-specific sub-skill (if available):
Invoke the relevant resolution sub-skill for this hypothesis's category
Sub-skills handle API calls, data fetching, and status interpretation
b. WebSearch (general resolution check):
WebSearch: "{hypothesis question} outcome result"
WebSearch: "{hypothesis question} resolved decided"
Look for authoritative sources confirming the outcome
c. WebFetch (direct source check):
If the hypothesis record includes a resolution_url or source_url,
fetch it directly and check for outcome indicators
d. Resolution timestamp:
If resolves_no_earlier_than has passed and the hypothesis has a
known binary outcome trigger (e.g., scheduled event), check if
the triggering event has occurred
3. If resolution is confirmed:
Record outcome:
actual_outcome: "YES" | "NO" (or the relevant outcome value)
resolution_date_actual: "YYYY-MM-DD"
resolution_source: "How we verified (method + source)"
4. Compare hypothesis to outcome:
our_hypothesis: "YES"
our_confidence: 0.72
actual_outcome: "NO"
outcome: CORRECTED
surprise_level: 7 # calculated from confidence gap
Step 2.5: Source Agreement Check
When resolving a hypothesis using evidence from multiple sources:
sources = [] # collected during resolution research above
For each source consulted during Steps 1-2:
sources.append({
source_id: "{domain or identifier}",
source_type: "{web_search | web_fetch | memory | user_input}",
verdict: "{YES | NO | UNCLEAR}", # does this source support the hypothesis?
snippet: "{relevant excerpt, max 200 chars}"
})
IF len(sources) == 0:
source_validation = null # nothing to validate
ELIF len(sources) == 1:
source_validation = {
sources_consulted_count: 1,
agreement: "single_source",
agreement_note: "",
sources: sources
}
ELSE: # 2+ sources
verdicts = [s.verdict for s in sources if s.verdict != "UNCLEAR"]
IF len(verdicts) > 0 AND all verdicts same: agreement = "unanimous"
ELSE:
agreement = "contested"
agreement_note = "Sources disagree: {summarize disagreement}"
→ Append to outcome_detail: "[SOURCE DISAGREEMENT: {agreement_note}]"
source_validation = {
sources_consulted_count: len(sources),
agreement: agreement,
agreement_note: agreement_note or "",
sources: sources
}
This does NOT block resolution. Contested sources are noted but the agent still makes a judgment call on the outcome.
Step 2.6: Measurement-Pending Transition + Expiry (g-115-465 / rb-754 / g-115-1584)
Catches hypotheses that are past resolves_no_earlier_than but whose
measurement_channel produced no usable data. Without this step, such
hypotheses stay in active status indefinitely, bloating the active pool
and producing recurring INCONCLUSIVE results every cycle.
Step 2.6a (g-115-1584 / rb-2122) closes the back half of that lifecycle: a
record ALREADY shelved in measurement-pending whose hard resolves_by
deadline has now passed with the channel still empty is EXPIRED (moved to
archived, outcome=EXPIRED) instead of being re-shelved. Without it the
measurement-pending stage was a one-way shelf with no auto-expiry — it grew
unbounded (g-001-02 had to hand-archive 35 past-deadline records). The data
model already documents this transition as intended (pipeline.py VALID_STAGES
note: "measurement-pending→archived (at resolves_by absolute deadline)"); this
step wires the trigger. Expiry is symmetric to Step 4.0's active-record expiry:
EXPIRED records land in archived (never resolved), so they are excluded from
accuracy stats (compute_meta counts only CONFIRMED/CORRECTED) and are never
pulled into Mode 2 reflection (--unreflected = stage==resolved AND not reflected).
Five gap subtypes (catalog at world/knowledge/tree/system/system-constraints-loop/hypothesis-measurement-gap.md):
- infra-missing: writer/source emitter doesn't exist
- session-empty: infrastructure exists but no events produced
- schema-missing: source records exist, specific field absent
- log-inaccessible: log path not at world root, agent can't read it
- (catch-all): new shapes flagged for taxonomy update
After Steps 2 + 2.5 complete for a hypothesis:
IF actual_outcome was determined: continue to Step 3 (resolved path)
ELIF threshold (resolves_no_earlier_than) has passed AND
resolution attempt completed AND
no value was observable in measurement_channel:
# ── Step 2.6a: HARD-DEADLINE expiry for already-shelved records (g-115-1584 / rb-2122) ──
# If THIS record is already in measurement-pending (it came in via Step 1's
# measurement-pending re-probe load) AND it carries a hard resolves_by deadline
# that has now passed, the measurement will not arrive in time — expire it to
# archived instead of re-shelving. Records with no resolves_by (null) or a
# still-future resolves_by fall through to the normal (re-)shelve below, so a
# hypothesis without a hard deadline keeps re-probing exactly as before.
# resolves_by may be a date (YYYY-MM-DD) or a datetime — compare the same way
# Step 4.0 does ("resolves_by has passed"). Mirrors Step 4.0's active-record
# expiry; EXPIRED lands in archived (never resolved).
IF record.stage == "measurement-pending"
AND record.resolves_by is not null
AND resolves_by has passed (parse date/datetime; compare to current_time):
expiry_merge = {
"outcome": "EXPIRED",
"outcome_date": "<today ISO>",
"outcome_detail": "Measurement-pending expired: past resolves_by ({resolves_by}); channel never produced data (gap_subtype: {record.measurement_gap_subtype}). Auto-expired by Mode 1 Step 2.6a.",
}
echo '<expiry_merge>' | bash core/scripts/pipeline-move.sh <id> archived
Add to resolve_result.expired_measurement_pending list.
Log: "EXPIRED measurement-pending {id} — resolves_by {resolves_by} passed, channel still empty"
Continue to next hypothesis.
# ── end Step 2.6a — records below have no passed hard deadline; (re-)shelve ──
# Classify the gap subtype based on attempt failure mode
gap_subtype = (
"infra-missing" IF writer/script doesn't exist OR returned no records
"session-empty" IF source schema present but zero matching records
"schema-missing" IF source records exist but expected field absent
"log-inaccessible" IF log file not findable at expected path
"other" otherwise
)
# Build measurement-pending merge JSON
merge = {
"stage": "measurement-pending", # set by pipeline-move
"measurement_pending": true,
"measurement_pending_set_at": "<now-iso>",
"measurement_gap_subtype": gap_subtype,
"measurement_gap_detail": "<one-sentence explanation of what was probed and what came back>",
"context_consulted": <as in Step 4.1>,
}
echo '<merge-json>' | bash core/scripts/pipeline-move.sh <id> measurement-pending
# Skip Step 3, Step 4. The hypothesis is shelved until evidence appears.
Add to resolve_result.measurement_pending list (separate from newly_resolved).
Continue to next hypothesis.
ELSE: leave in active (resolution date arrived but channel returned a value
that could not be classified — judgment call still possible next cycle)
Migration: on first run after this step ships, all 5 hypotheses listed in
g-115-462 investigation report transition automatically. Subsequent runs
re-probe the channel — if data appears, the hypothesis moves through the
normal Step 2/3/4 path (now starting from measurement-pending instead of
active).
Step 3: Record Outcome
For each resolved hypothesis:
1. Determine outcome:
- Compare our_hypothesis against actual_outcome
- outcome: "CONFIRMED" if hypothesis matches outcome, "CORRECTED" if not
2. Calculate surprise level:
# Single source of truth — same formula as reflect --batch-micro Step 3.
- if CORRECTED: surprise_level = round(our_confidence * 10)
- if CONFIRMED: surprise_level = round((1 - our_confidence) * 10)
- High surprise = high confidence + wrong, or low confidence + right
3. Update the pipeline record with:
- outcome: CONFIRMED | CORRECTED
- our_confidence: (preserved from hypothesis)
- surprise_level: (calculated above)
- resolution_summary: "CONFIRMED — predicted {X} with {confidence}% confidence"
or "CORRECTED — predicted {X}, actual was {Y}"
4. Include outcome in resolution output:
"Result: CONFIRMED — predicted YES with 72% confidence"
or "Result: CORRECTED — predicted YES with 72% confidence, actual was NO"
Step 3.5: Broad Re-Retrieve on High Surprise (G3 / R5)
When a resolution carries surprise_level >= 7, the just-recorded outcome likely
falsifies one or more downstream beliefs / reasoning-bank entries / pattern
signatures that the hypothesis was implicitly endorsing. Step 1.5's batch
retrieval was shallow-to-medium — adequate for predicting outcomes, but
insufficient for finding ALL the entries a surprising correction may affect.
Per .claude/rules/retrieve-before-deciding.md decision point 3 ("resolving a
surprising hypothesis"), this step retrieves category-broadly BEFORE the
atomic move so the Tree Update Protocol (Steps 4.5 + Tree Steps 1-5) and any
downstream /reflect calls operate on a complete reconciliation candidate set.
IF surprise_level >= 7:
# Broad retrieve at deep depth — supplementary stores AND tree nodes
Bash: retrieve.sh --category {hypothesis.category} --depth deep
From the returned JSON, mark candidates for reconciliation review:
- beliefs[] whose claim predicts or depends on the just-falsified outcome
- reasoning_bank[] entries whose content endorses the (now-falsified) prediction
- pattern_signatures[] whose trigger matches the resolution shape
- tree_nodes[] whose capability_level was anchored on this outcome class
Append to the merge JSON built in Step 4.1:
reconciliation_candidates:
beliefs: [bel-NNN, ...] # IDs only — /reflect dereferences later
reasoning_bank: [rb-NNN, ...]
pattern_signatures: [sig-NNN, ...]
tree_nodes: [<node.key>, ...]
If retrieve.sh returns nothing relevant to the falsification:
reconciliation_candidates: { beliefs: [], reasoning_bank: [],
pattern_signatures: [], tree_nodes: [] }
(still record the empty manifest — proof retrieval was attempted)
# E10: Cross-reference confidence recalibration on high-surprise CORRECTED
# outcomes. Phase 8 will encode the ONE node it picks; this loop touches
# ALL cited nodes that didn't make it into the encoding bundle but were
# implicitly endorsing the falsified prediction. Without this, the cited
# nodes retain their pre-correction confidence and look authoritative on
# the next retrieve.
IF outcome == "CORRECTED" AND context_consulted.tree_nodes_read is non-empty:
For each node_key in context_consulted.tree_nodes_read:
Bash: bash core/scripts/tree-read.sh --node <node_key>
IF read failed (node missing or error): SKIP this node
Parse result JSON → old_confidence = result.confidence (default 0.0)
new_confidence = max(0.0, round(old_confidence - 0.05, 2))
IF new_confidence == old_confidence: SKIP (already at floor)
Bash: echo '{"operations": [{"op": "set", "key": "<node_key>", "field": "confidence", "value": <new_confidence>}, {"op": "set", "key": "<node_key>", "field": "last_update_trigger", "value": "surprise-recalibration"}]}' | bash core/scripts/tree-update.sh --batch
Log: "RECALIBRATED {node_key}: confidence {old_confidence} → {new_confidence} (hypothesis {hypothesis.id} surprise={surprise_level})"
# Fail-open: if a tree-update call errors, log and continue. Do NOT
# block the atomic resolve in Step 4 on a recalibration failure.
ELIF surprise_level >= 5:
# Medium-surprise — re-use Step 1.5 cached batch retrieval, no new probe
Carry forward the same context_consulted manifest; do not extend it
ELSE:
# Low surprise (≤4): hypothesis was well-calibrated, no broad re-retrieve
Skip this step entirely
This step is fail-open: if the retrieve.sh call errors, log the failure and
record reconciliation_candidates: { error: "<message>" } — proceed to Step 4.
A failed broad retrieve must NOT block the atomic resolve write.
Step 4: Move Resolved Hypotheses (ATOMIC — complete ALL steps for each hypothesis)
For each resolved hypothesis, execute this checklist IN ORDER.
□ 4.0 CHECK EXPIRATION (hypothesis goals only):
IF the record has resolves_by AND resolves_by has passed AND outcome is still undetermined:
- echo '{"outcome":"EXPIRED"}' | bash core/scripts/pipeline-move.sh <id> archived
- Return "EXPIRED" to calling skill
- SKIP remaining steps 4.1-4.2 for this record
□ 4.1 BUILD merge JSON with:
- Outcome data (outcome: CONFIRMED/CORRECTED, surprise, outcome_date)
- outcome_detail: resolution summary text
- horizon: (preserve from original — defaults to "short" if missing)
# Horizon-dependent metadata:
# short/long horizon — full metadata:
- replay_metadata:
last_replayed: null
replay_count: 0
encoding_score: null
reconsolidation_updates: []
- Preserve context_consulted from original record (populated during evaluation):
tree_nodes_read, pattern_signatures_checked, data_sources_used,
articles_read, experiential_matches, context_gaps_identified
- Initialize context_quality:
usefulness: pending
most_valuable_source: null
least_valuable_source: null
chain_note: null
# Compute process_score inline (not deferred to reflect)
IF outcome == "CONFIRMED" AND confidence >= 0.60:
dual_classification = "earned_confirmed"
ELIF outcome == "CONFIRMED" AND confidence < 0.60:
dual_classification = "lucky_confirmed"
ELIF outcome == "CORRECTED" AND confidence >= 0.60:
dual_classification = "unlucky_corrected"
ELIF outcome == "CORRECTED" AND confidence < 0.60:
dual_classification = "deserved_corrected"
process_quality = confidence if CONFIRMED else (1.0 - confidence)
# session horizon — reduced metadata:
# SKIP replay_metadata, context_quality, process_score
# SKIP context_consulted unless already present on the record
- Source cross-validation (ALL horizons):
source_validation: {source_validation object from Step 2.5}
- Learning status (ALL horizons):
reflected: false
reflected_date: null
□ 4.2 ATOMIC MOVE+UPDATE (single script call):
# The merge JSON MUST include ALL fields from Step 4.1 above.
# For short/long horizon, this means: outcome, outcome_detail, outcome_date,
# surprise, replay_metadata, context_quality, process_score, reflected: false.
# DO NOT skip fields — pipeline-move.sh merges them atomically.
# Missing fields here = missing fields forever (reflect can't backfill structure).
# RESOLUTION-EVIDENCE GATE (g-303-27): a CONFIRMED/CORRECTED move to
# resolved is REJECTED (400 resolution_evidence_required) unless
# outcome_detail (or experience_ref / evidence_for) carries >=1
# verifiable pointer -- a goal-id (g-NNN-NN), commit SHA, file:line,
# session-id, an rb-/guard-/exp-/msg- id, a canonical-script name
# (foo.sh/foo.py), or a percentage with measurement context. This is
# almost always already present in a real resolution summary. For a
# genuinely pointer-free resolution (e.g. a math proof where the
# derivation IS the evidence), add "evidence_override": "<reason>" to
# the merge JSON. EXPIRED/UNRESOLVABLE outcomes are exempt.
echo '<merge-json-with-ALL-4.1-fields>' | bash core/scripts/pipeline-move.sh <id> resolved
This atomically: updates all fields, sets stage to resolved, recounts meta.
GATE: Do NOT proceed to the next hypothesis until the move completes successfully.
If the script exits non-zero, STOP and report the error — do not partially resolve.
Step 4.5: Rate Context Quality (Retrieval Protocol Phase 5)
For each hypothesis just resolved in Step 4, rate the context loaded during Step 1.5:
For each resolved hypothesis:
Read its context_consulted section
Rate usefulness:
- Did loaded tree nodes, patterns, guardrails, or beliefs help interpret the outcome?
- Correctly predicted + context supported reasoning → "helpful"
- Context present but didn't influence resolution interpretation → "neutral"
- Context pointed toward wrong interpretation → "misleading"
- Context had no bearing on this hypothesis → "irrelevant"
Identify most_valuable_source:
Which loaded item (tree node, pattern, guardrail, belief, experiential record)
was most useful? Format: "{layer}:{id}"
Identify least_valuable_source:
Which loaded item added least value or was noise? Format: "{layer}:{id}"
Write to resolved record's context_consulted.context_quality:
usefulness: {rating}
most_valuable_source: {layer:id}
least_valuable_source: {layer:id}
chain_note: "{one-sentence explanation}"
If no context was loaded (empty context_consulted): set usefulness: "irrelevant"
If unable to judge: leave usefulness: "pending" (finalized by /reflect)
Step 5: Check for Triggered Reviews
Check if any auto-review triggers are met:
| Trigger | Condition | Action |
|---|---|---|
| Streak break | 3+ consecutive corrected | Flag for priority learning |
| High-confidence miss | Confidence > 0.80 AND corrected | Flag for priority violation analysis |
| Low-confidence hit | Confidence < 0.50 AND confirmed | Flag for underconfidence review |
| New category success | First confirmed in a category | Flag as potential strength |
| Stale hypothesis | Active > 90 days, no resolution | Check if hypothesis is still trackable |
| Category drift | 5+ hypotheses in one category, 0 in others | Flag for exploration |
If triggered: write triggered review flags to the resolve_result (consumed by callers for alerting).
Step 6: Return Resolve Result
Return a structured summary (consumed by /boot for reporting, and by --full-cycle for chaining):
resolve_result:
newly_resolved: N
already_resolved: M
still_active: K
skipped_not_due: J # hypotheses whose resolves_no_earlier_than hasn't arrived
resolving_soon: L # within 48 hours of resolving
measurement_pending: P # shelved this cycle (Step 2.6 — channel empty, no hard deadline passed)
expired_measurement_pending: Q # archived/EXPIRED this cycle (Step 2.6a — past resolves_by, channel still empty)
triggered_reviews: [...] # any flags from Step 5
resolved_hypotheses:
- id: "2026-03-15_record-slug"
question: "Will X happen?"
predicted: "YES"
confidence: 0.72
actual: "NO"
outcome: CORRECTED
surprise_level: 7
reflected: false
Tree Update Protocol (after resolution)
After resolving any hypothesis, update the memory tree:
Tree Step 1: Recalculate Category Accuracy
category = resolved_hypothesis.category # e.g., "crypto", "sports-nhl", "politics"
Recount: confirmed / total resolved in this category
Tree Step 2: Update Affected Node (Dynamic Lookup)
node=$(bash core/scripts/tree-find-node.sh --text "{category}" --leaf-only --top 1)
# Returns: {key, score, file, depth, summary, node_type}
Read node.file
Update _tree.yaml via batch (accuracy, sample_size, confidence, capability_level):
echo '{"operations": [
{"op": "set", "key": "<node.key>", "field": "accuracy", "value": <new-value>},
{"op": "set", "key": "<node.key>", "field": "sample_size", "value": <new-value>},
{"op": "set", "key": "<node.key>", "field": "confidence", "value": <new-value>},
{"op": "set", "key": "<node.key>", "field": "capability_level", "value": "<new-value>"}
]}' | bash core/scripts/tree-update.sh --batch
Tree Step 3: Update Performance Tracking Node (Dynamic Lookup)
# Use dynamic lookup — never hardcode paths at any depth.
perf_node=$(bash core/scripts/tree-find-node.sh --text "performance-tracking" --top 1)
If perf_node exists:
Read perf_node.file
Update outcome record, category outcome breakdown
Tree Step 4: Update Accuracy Tracking Node (Dynamic Lookup)
# Use dynamic lookup — never hardcode paths at any depth.
acc_node=$(bash core/scripts/tree-find-node.sh --text "hypothesis-accuracy" --top 1)
If acc_node exists:
Read acc_node.file
Update overall accuracy, category breakdown, batch data
Tree Step 5: Propagate Capability Changes
If any node's capability_level changed:
result=$(bash core/scripts/tree-propagate.sh <node.key>)
# Returns: {source_node, ancestors_updated: [...], capability_changes: [...]}
IF result.capability_changes is non-empty:
For each changed ancestor: Read ancestor.file (.md)
- Update parent node's capability map and topic summary
If root-level domain summary affected:
bash core/scripts/tree-update.sh --set root summary "<updated>"
Log: "CAPABILITY UNLOCK: {topic} → {new_level}"
Announce in output: "Category {X} unlocked {LEVEL} level!"
Mode 2: Learn (--learn)
Processes resolved hypotheses that have NOT yet been reflected on. This is the learning phase — it calls /reflect to generate ABC chains, violation tracking, and pattern updates.
Step 1: Load Unreflected Hypotheses
Bash: pipeline-read.sh --unreflected
Sort by surprise descending (learn from surprises first)
If no unreflected records: return { hypotheses_learned: 0 } and exit
Step 2: Reflect on Each Hypothesis
For each unreflected hypothesis:
1. Bash: pipeline-read.sh --id {id} (loads the full resolved record)
2. invoke /reflect --on-hypothesis {hypothesis-id}
This generates:
- ABC chain (Antecedent-Behavior-Consequence)
- Textual reflection
- Violation tracking (if wrong)
- Source assessment
- Encoding score
- Pattern signature updates
3. After /reflect completes successfully:
- Bash: pipeline-update-field.sh {id} reflected true
(auto-sets reflected_date to today)
Step 2.5: Differentiated Extraction Handoff
When calling /reflect for each unlearned resolution:
- Pass the hypothesis outcome (CONFIRMED/CORRECTED) to /reflect
- /reflect will use differentiated extraction (confirmed → strategy validation, corrected → preventive guardrails)
- After /reflect completes, verify that:
- A reasoning bank entry was created via
reasoning-bank-read.sh --id rb-NNN - For corrected outcomes: a guardrail was added via
guardrails-read.sh --id guard-NNN
- A reasoning bank entry was created via
- If /reflect did not create expected entries, create them directly:
- Reasoning bank entry: pipe JSON to
reasoning-bank-add.sh(stdin). Required JSON fields:title,content,category,type,when_to_use,tags,applies_to(one ofany|framework|domain|specific). Hypothesis-derived entries: pickapplies_toby hypothesis subject — framework-quality / pipeline-coordination hypotheses →framework; deployment-domain hypotheses (the specific services, products, or workflows this agent is deployed into) →domain; methodology insights (calibration, pattern-recognition) →any; single-record diagnostic with no transferable shape →specific. - Guardrail entry: pipe JSON to
guardrails-add.sh(stdin)
- Reasoning bank entry: pipe JSON to
Step 3: Return Learn Result
learn_result:
hypotheses_learned: N
hypotheses_already_learned: M
violations_recorded: K
patterns_updated: J
learned_hypotheses:
- id: "2026-03-15_record-slug"
outcome: CORRECTED
violation_type: "high-confidence miss"
encoding_score: 0.72
Mode 3: Accuracy Report (--accuracy-report)
Step 1: Load All Resolved Hypotheses
Bash: pipeline-read.sh --stage resolved (all resolved records)
Bash: meta-read.sh meta-knowledge/_index.yaml # existing meta-data
# Include micro-hypothesis batch stats from pipeline metadata.
# Micro stats are stored in pipeline-meta.json under micro_hypothesis_stats.
# These aggregate across sessions — each session-end consolidation appends batch totals.
Bash: pipeline-read.sh --meta → micro_hypothesis_stats (if exists)
Step 2: Calculate Metrics
accuracy:
overall:
total: 25
confirmed: 18
corrected: 7
accuracy_pct: 72.0
trend: "improving" # compare last 10 vs previous 10
by_category:
politics: {total, confirmed, accuracy, confidence_avg, calibration_error}
crypto: {total, confirmed, accuracy, confidence_avg, calibration_error}
by_hypothesis_type:
high-conviction: {total, confirmed, accuracy}
calibration: {total, confirmed, accuracy}
exploration: {total, confirmed, accuracy}
contrarian: {total, confirmed, accuracy}
by_evaluation_method:
system-1: {total, confirmed, accuracy}
system-2: {total, confirmed, accuracy}
by_research_depth:
quick: {total, confirmed, accuracy}
moderate: {total, confirmed, accuracy}
deep: {total, confirmed, accuracy}
by_time_horizon:
micro: {total, confirmed, accuracy} # same-session, seconds-minutes
session: {total, confirmed, accuracy} # this/next session, hours
short_term_7d: {total, confirmed, accuracy}
medium_term_30d: {total, confirmed, accuracy}
long_term_90d: {total, confirmed, accuracy}
confidence_calibration:
bins:
"50-59%": {predicted, actual, count, error}
"60-69%": {predicted, actual, count, error}
"70-79%": {predicted, actual, count, error}
"80-89%": {predicted, actual, count, error}
"90-100%": {predicted, actual, count, error}
source_reliability:
- source: "source-id"
times_used: 12
led_to_confirmed: 9
reliability: 0.75
streaks:
current: 3
longest_confirmed: 5
longest_corrected: 2
Step 3: Hypothesis Testing
If 20+ resolved hypotheses, test hypotheses:
H1: "Deep research outperforms quick evaluations"
H2: "Category X outperforms category Y"
H3: "Hypotheses with source X outperform those without"
H4: "Higher-confidence hypotheses are more predictable"
H5: "Shorter time horizons are more predictable (compare micro→session→short→long)"
H6: "Micro-hypothesis accuracy correlates with self-model accuracy (are we good at predicting our own behavior?)"
For each: compare groups, note significance, write to
world/knowledge/strategies/hypothesis-results.md
Step 4: Update Meta-Memory
Bash: meta-set.sh meta-knowledge/_index.yaml # update with all accuracy figures
Update aspirations meta via Bash: `aspirations-meta-update.sh --source agent <field> <value>`
Step 5: Strategy Recommendations
1. Categories to focus on (accuracy > 70%)
2. Categories to reduce (accuracy < 40%)
3. Confidence calibration adjustments
4. Evaluation weight adjustments
5. Research depth recommendations
6. Source reliability actions
Mode 4: Full Cycle (--full-cycle)
The comprehensive weekly review. Chains all modes plus deep reflection and replay.
1. invoke /review-hypotheses --resolve (detect outcomes — idempotent if already done)
2. invoke /review-hypotheses --learn (reflect on unlearned — idempotent if already done)
3. invoke /review-hypotheses --accuracy-report
4. invoke /reflect --full-cycle (pattern extraction, calibration, replay)
5. Run spark check for new aspirations/goals
6. Propose strategy adjustments to /aspirations evolve
Bash: echo "review-hypotheses phase documented"
Idempotency note: If /boot already ran --resolve this session, step 1 is a no-op (nothing new to resolve). If aspiration goals already ran --learn, step 2 is a no-op (all records already reflected). The full-cycle still adds value via steps 3-6 which operate at the aggregate level.
Return Protocol
See .claude/rules/return-protocol.md — last action must be a tool call, not text.
Chaining
- Called by:
/aspirations(as goal skill for hypothesis goals,--resolve/--learnfor batch ops),/boot(--resolveonly for catch-up) - Calls (by mode):
--resolve: None (standalone — detects outcomes, records results)--learn:/reflect --on-hypothesis(for each unreflected resolution)--full-cycle:--resolve,--learn,--accuracy-report,/reflect --full-cycle(which includes--batch-micro)
- Feeds into:
/aspirations evolve(strategy recommendations) - Updated by:
/reflect(provides patterns and strategies back)