writing-odd-state-files

name: writing-odd-state-files description: | Hardens Outcome-Driven Development (Aden ODD) execution by enforcing the Goal → Success Criteria → Constraints → Tests → Outcomes → Judgment loop and the state file layout that persists it. Use when authoring or updating an ODD state file, when the user says "ODD", "outcome-driven", "Aden ODD", "goals + success criteria", "create state file", "lifecycle DRAFT/READY/ ACTIVE/COMPLETED", or when work needs measurable acceptance (weighted threshold ≥0.90) instead of vibes-based done. Also triggers when the user asks to "loop on success criteria until ACCEPT", needs a HARD GATE before implementation, wants Tests-as-Hypotheses (one per criterion), or wants a Process Tree to bind workflow control flow. Do NOT use for plan files (use write-plan), traces (use writing-traces), domain models (use writing-ddd-domain-models), or simple TODOs.

Writing ODD State Files

%% Last Modified: 05/03/26 11:13:39 %%

State files are the persistent memory of an Aden ODD run. Without [w] writes after every transition, restarts wipe the evidence chain and the loop has to re-author Goal/Criteria/Constraints from scratch. This skill enforces the layout, the lifecycle, and the process tree that bind any ODD-conformant workflow.

Expert Vocabulary Payload

%% Last Modified: 05/03/26 11:13:39 %%

Aden ODD Spec (must-follow): Goal (intent + direction, no outcomes), SuccessCriterion (id + description + metric + target + weight ∈ [0,1]), Constraint (type ∈ {hard, soft} × category ∈ {time, cost, safety, scope, quality}), Outcome struct (success, result, error, state_changes, tokens_used, latency_ms, summary), feedback loop judgment (ACCEPT, RETRY, REPLAN, ESCALATE), lifecycle (DRAFT → READY → ACTIVE → COMPLETED | FAILED | SUSPENDED), HARD GATE [e]

Metrics: output_contains, output_equals, llm_judge, custom, weighted score, ACCEPT threshold ≥0.90, weight sum = 1.0, hypothesis ("if system correct → outcome X under condition Y")

Process Tree (van der Aalst §3.2.8): sequential (→), exclusive choice (×), parallel (∧), redo loop (↻), silent activity (τ), do-part, redo-part, canonical inline form, visual tree form, activity legend, conformant trace, re-entry rule

State Persistence: session provenance, [r] read recovery, [w] write transition, change log, lifecycle re-entry, scope-out via τ, NBA (Next Best Actions), session learnings (lost on restart)

Constraint Taxonomy: hard-scope, hard-quality, hard-safety, hard-time, hard-cost, soft-quality, soft-time, fallback behavior, no-crash guarantee

Anti-Pattern Watchlist

%% Last Modified: 05/03/26 11:13:39 %%

Scan every state file you author or modify against these. Fix violations before [w].

1. `silent-gate-bypass`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: Lifecycle advances DRAFT → READY (or READY → ACTIVE) without an explicit user confirmation message in the transcript. The HARD GATE [e] node fires without "user said yes." Resolution: STOP. Mark lifecycle SUSPENDED with reason "awaiting [e] HARD GATE." Surface DRAFT artifacts (Goal, T4, T5, T6) inline + state file path. Do NOT proceed until user types explicit approval.

2. `goal-mutation-without-escalate`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: The Goal text changes between writes without an [l] ESCALATE redo entry in the Change Log. Framework rewrote intent autonomously. Resolution: Revert Goal to last user-validated text. If the loop genuinely needs Goal change, surface as Type I decision: "Tests reveal Goal assumption flawed. ESCALATE [l] needed — please confirm new Goal."

3. `vague-success-criteria`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: A criterion lacks any of id, description, metric, target, weight, OR weights do not sum to 1.0, OR metric is freeform prose ("works well", "is fast") instead of output_contains | output_equals | llm_judge | custom. Resolution: Rewrite each row as a 7-column table: ID | Description | Metric | Target | Weight | Test (hypothesis). Verify sum(weights) == 1.0 before [w].

4. `untagged-constraints`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: Constraints written as bullets without [category] tag (e.g. - HC1: must be fast). Soft/hard split missing. Resolution: Reformat as HC<n> [<category>]: <constraint> with category ∈ {time, cost, safety, scope, quality}. Group under **Hard:** and **Soft:** headings.

5. `outcome-struct-elision`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: Test execution at [g] reports only pass/fail or only result. Missing fields from {success, result, error, state_changes, tokens_used, latency_ms, summary}. Resolution: Each test execution must emit ALL 7 fields. If a field is N/A (e.g., state_changes for pure functions), record null explicitly — never omit.

6. `implementation-before-tests`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: Code written for the system before [f] generates failing tests. Tasks table shows [j] entries with no preceding [f]/[g]. Resolution: Block [j]. Write tests first (one hypothesis per SuccessCriterion). Run [g] and confirm 100% fail (TypeError, AssertionError) BEFORE any production code.

7. `re-entry-re-authoring`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: A new session re-runs [a] [b] [c] [d] when lifecycle ∈ {ACTIVE, COMPLETED, FAILED}. State file shows duplicate Goal/Criteria sections or rewritten T3-T6 without justification. Resolution: Always run [r] FIRST. If lifecycle ≥ ACTIVE → take τ-skip on the DRAFT+gate block. Resume at the test loop or operational phase per recovered position.

8. `silent-write-elision`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: Lifecycle transition (DRAFT→READY, RETRY judgment, ACCEPT) occurred but no [w] row in the Tasks table or no entry in the Change Log. Resolution: Append [w] row + Change Log entry retroactively. Persist before next action. Continuous [w] is non-negotiable.

9. `weight-arithmetic-failure`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: Weighted score at [h] computed wrong (sum > 1.0, or arithmetic error, or missing criterion contribution). Resolution: Show the explicit math in the judgment row: C1(0.20) + C2(0.25) + ... = 1.00. ACCEPT only if computed value ≥ 0.90. Below threshold → choose RETRY/REPLAN/ESCALATE with named reason.

10. `mixed-goal-and-outcomes`

%% Last Modified: 05/03/26 11:13:39 %%

Detection: The Goal section contains measurable targets ("achieve 95% accuracy") or test descriptions. Goal has been polluted with what should be in T4/T5. Resolution: Strip Goal back to intent + direction only. Move targets to Success Criteria (T4). Move guardrails to Constraints (T5).

Behavioral Instructions

%% Last Modified: 05/03/26 11:13:39 %%

Phase 0: Anti-Pattern Scan

%% Last Modified: 05/03/26 11:13:39 %%

Before any write, scan the existing state file (if any) against the Anti-Pattern Watchlist above. Fix detected violations as the first action.

Phase 1: [r] Recover Lifecycle (MANDATORY first)

%% Last Modified: 05/03/26 11:13:39 %%

Read the state file at the canonical path (or ask user where it lives). Locate the Lifecycle section.
Classify recovered lifecycle:
- (none) or DRAFT or READY-pre-gate → take authoring branch (Phase 2).
- ACTIVE, COMPLETED, FAILED, or SUSPENDED-post-gate → τ-skip authoring. Resume at the test-outcomes loop (Phase 3) or terminal node.
Write the recovered position into a one-line status block before proceeding. WHY: makes re-entry decision auditable.

Phase 2: DRAFT Authoring (only if lifecycle requires)

%% Last Modified: 05/03/26 11:13:39 %%

Author Goal [a] — one paragraph of intent + direction. NO outcomes, NO measurable targets, NO test descriptions. If you find yourself writing "achieve X%" you are in T4 territory.
Author Success Criteria [b] — table with columns ID | Description | Metric | Target | Weight | Test (hypothesis). One row per criterion. Verify sum(weights) == 1.0. Each metric ∈ {output_contains, output_equals, llm_judge, custom}.
Author Constraints [c] — split into **Hard:** and **Soft:** sections. Each entry: <H|S>C<n> [<category>]: <constraint>. Categories: time | cost | safety | scope | quality. Hard constraints define failure; soft constraints prefer behavior.
Author Context [d] — durable background that shapes agent behavior. File paths, existing patterns, schema, references, links. NOT decisions or rationale (those go in Change Log).
Run [w]: persist DRAFT artifacts + initial Change Log entry.

Phase 3: HARD GATE [e]

%% Last Modified: 05/03/26 11:13:39 %%

Surface DRAFT artifacts to user inline + state file abs path. State explicitly: "DRAFT ready for review. HARD GATE — please confirm Goal + C1..Cn + constraints, or call out revisions."
STOP. Do not advance lifecycle. Wait for explicit user approval. IF user unavailable → write lifecycle = SUSPENDED with reason; halt.
Once user confirms → lifecycle DRAFT → READY → ACTIVE. Run [w].

Phase 4: Test-Outcomes Loop ↻

%% Last Modified: 05/03/26 11:13:39 %%

[f] Generate tests — one test per SuccessCriterion.id. Each test framed as a hypothesis: "if system correct → outcome X under condition Y."
[g] Run tests — emit FULL Outcome struct per execution: {success, result, error, state_changes, tokens_used, latency_ms, summary}. Use null for N/A fields, never omit.
[h] Aggregate — compute weighted score: sum(criterion.weight × pass_indicator). Show the arithmetic explicitly. Choose judgment:
- score ≥ 0.90 → ACCEPT (exit loop, advance to terminal)
- score < 0.90, system code is the issue → RETRY [j]
- score < 0.90, criteria thresholds or system structure off → REPLAN [k]
- score < 0.90, Goal assumption flawed → ESCALATE [l] (Type I — surface to user)
Run [w] after each judgment. Loop back to [f] if not ACCEPT.

Phase 5: Optional Operational + Real-World Phases

%% Last Modified: 05/03/26 11:13:39 %%

IF scope includes staging/pilot deploy → run operational phase [m]→↻[n]→[w]. Otherwise → mark scope-out [τ] in Tasks table with reason.
IF scope includes production deploy → run real-world phase [o]→↻[p]→[w]. Otherwise → mark scope-out [τ] in Tasks table with reason.

Phase 6: Terminal

%% Last Modified: 05/03/26 11:13:39 %%

[q] Emit COMPLETED — update Lifecycle section, mark all Tasks ✅, append final Change Log entry summarizing T0→Tn.
Run final [w].

Output Format — State File Layout

%% Last Modified: 05/03/26 11:13:39 %%

!echo /Users/wesleyfrederick/.claude/skills/caveman/SKILL.md
ALWAYS Use this file EXACTLY for output layout: !jact extract file ${CLAUDE_SKILL_DIR}/references/state-file-template.md, BECAUSE it contains the GOLD STANDARD for state files
WHEN writing output, ALWAYS use /caveman ultra skill, BECAUSE it saves on token usage and prevents LLM vomit that does not add value

Examples

%% Last Modified: 05/03/26 11:13:39 %%

Example 1: BAD vs GOOD Goal

%% Last Modified: 05/03/26 11:13:39 %%

BAD:

## Goal
Build a learnings cache that achieves 90% hit rate, runs in <50ms, supports
session isolation, and passes all 6 tests including TTL sweep within 7 days.

Mixes intent with measurable targets and test descriptions. Pollutes Goal with T4/T5 content. Triggers mixed-goal-and-outcomes anti-pattern.

GOOD:

## T3 — Goal
Cut repeat full-body learning injection per session. Cache by session_id +
content-hash (jact pattern) → mid-session edit auto-invalidate. Markers live
per-repo in `.learnings/` (mirror `.jact/` convention). Cold start = baseline.

**Files:** [paths]
**Ref pattern:** [path] — `{sid}_{md5(file)}`

Pure intent + direction. Targets live in T4. Constraints live in T5. Goal stays stable across the whole loop unless [l] ESCALATE.

Example 2: BAD vs GOOD Success Criteria

%% Last Modified: 05/03/26 11:13:39 %%

BAD:

- Should be fast
- Should not regress baseline
- Should support multiple sessions
- Tests should pass

No IDs. No metrics. No targets. No weights. No hypothesis. Triggers vague-success-criteria anti-pattern.

GOOD:

| ID | Description | Metric | Target | Weight | Test (hypothesis) |
|---|---|---|---|---|---|
| C1 | Cold cache emits full body per ID | output_contains | Each ID's `### {title}\n{body}` present | 0.20 | Empty cache run → diff vs baseline |
| C2 | Warm cache emits title-only on repeat | custom | 2nd-fire output: `### {title}` line + zero body lines | 0.25 | 2x run same sid → 2nd has titles only |
| C3 | Sessions isolated | custom | sid=B output identical to cold-cache regardless of sid=A history | 0.15 | sid=A then sid=B → B full bodies |
| C4 | Cold byte-identical to baseline | output_equals | Cold output == pre-change baseline byte-for-byte | 0.20 | Snapshot test |
| C5 | Mid-session edit invalidates cache | custom | After file md5 changes, next inject = full body | 0.15 | Run sid=A → edit → run sid=A → full |
| C6 | TTL sweep removes >7d markers | custom | After invocation, no marker w/ mtime older than 7d | 0.05 | touch marker mtime=-8d → run → marker gone |

**Weight sum:** 1.00. **Threshold:** ≥0.90 → ACCEPT.

Every criterion: id + description + metric + target + weight + hypothesis. Weights sum = 1.0. Each row maps directly to a [f]/[g] test execution.

Example 3: BAD vs GOOD Outcome at [g]

%% Last Modified: 05/03/26 11:13:39 %%

BAD:

- Test C1: passed
- Test C2: failed (cache not hit)
- Test C3: passed

Missing success / result / error / state_changes / tokens_used / latency_ms / summary. Triggers outcome-struct-elision.

GOOD:

- C1 Outcome: {success: true, result: "full body emitted for 3/3 IDs",
  error: null, state_changes: ["3 markers written"], tokens_used: 1240,
  latency_ms: 87, summary: "cold cache baseline-equivalent"}
- C2 Outcome: {success: false, result: "full body emitted (expected title-only)",
  error: "cache miss on warm key {sid}_{md5}", state_changes: [],
  tokens_used: 1240, latency_ms: 91, summary: "key mismatch — investigate _cache_key"}

Full struct per execution. The error field on C2 directly informs RETRY [j] target (fix _cache_key).

Questions This Skill Answers

%% Last Modified: 05/03/26 11:13:39 %%

"Create a state file for this work using ODD"
"How do I structure goals and success criteria for Aden ODD?"
"Build me an outcome-driven plan with HARD GATE"
"I want to loop on success criteria until ACCEPT"
"What goes in T3/T4/T5/T6?"
"Set up the lifecycle DRAFT → READY → ACTIVE → COMPLETED for this task"
"Add a process tree to my plan"
"Make this plan measurable instead of vibes"
"Why did my agent skip the HARD GATE?"
"How do I write tests as hypotheses for ODD?"
"What's the right shape for Outcome struct?"
"My weights don't sum to 1.0 — fix the criteria table"
"Persist state — I keep losing context on restart"
"Resume an ODD run from where it left off"
"Goal vs Success Criteria vs Constraints — which goes where?"