name: uat-suite-updater
description: Manual-only repo workflow. Use only when explicitly invoked as uat-suite-updater or by SKILL.md path; do not auto-select from related requests.
UAT Suite Updater
Two-phase workflow for updating UAT regression suites after shipping a milestone: initialize (deep research → seed file) then worker (pick up chunks → execute).
Spec-Driven Principle
These suites are spec conformance tests — "does the implementation match what was specified?" They are NOT derived from reading source code.
Allowed sources (for both seed generation and workers):
- Milestone specs: `.research/updated-spec/MILESTONE-v{X}.md` (the authoritative spec; supersedes originals)
- Per-phase documents in `.planning/phases/{N}-{name}/`:
  - `{N}-CONTEXT.md`: phase requirements and scope
  - `{N}-DISCUSSION-LOG.md`: decision resolutions ("we decided X instead of Y")
  - `{N}-VERIFICATION.md`: did we build what was planned?
  - `{N}-VALIDATION.md`: does it meet requirements?
  - `{N}-{NN}-SUMMARY.md`: what each sub-plan delivered
  - `{N}-UAT.md`: previous UAT results (if any)
- Project-level: `.planning/ROADMAP.md`, `.planning/REQUIREMENTS.md`, `.planning/STATE.md`
- Archived milestones: `.planning/milestones/v{X}-ROADMAP.md`, `v{X}-REQUIREMENTS.md`, `v{X}-phases/`
- Architecture docs: `docs/architecture.md`, `docs/model-taxonomy.md`, `docs/omnifocus-concepts.md`
- Existing UAT suite files in `.claude/skills/uat-regression/tests/`
The agent should explore these locations autonomously — the list above is guidance, not exhaustive. If other non-code docs exist (e.g., RETROSPECTIVE.md, research notes), those are fair game too. The only hard rule is: never read .py files or automated test files.
Never read: .py source files, automated test files, or any implementation code. If you derive tests by reading the code, you confirm what the code does — not what it should do. That's circular and defeats the purpose of UAT.
Warning/error assertions: if the spec or planning docs include the exact warning text, use it. If they don't, the test asserts behavioral criteria instead — "warning is present, is helpful, contains no internals (no type=, pydantic, input_value)". The self-verification step (Worker Step 4) is where spec expectations meet reality.
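The behavioral criteria above can be partly mechanized. The following is a minimal sketch, not part of the skill: `check_warning` and `LEAK_MARKERS` are hypothetical names, and the "is helpful" criterion still needs human judgment.

```python
# Substrings the text above lists as internals that must not leak into messages.
LEAK_MARKERS = ("type=", "pydantic", "input_value")


def check_warning(message: str) -> list[str]:
    """Return a list of problems; an empty list means the mechanical criteria hold."""
    problems = []
    if not message.strip():
        problems.append("warning is missing or empty")
    lowered = message.lower()
    for marker in LEAK_MARKERS:
        if marker in lowered:
            problems.append(f"leaks internal detail: {marker!r}")
    return problems
```

A suite author would still phrase the PASS criteria in prose; the function only shows which parts of the check are objective.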
Regression meaning: once a suite passes, running it later should still pass — unless there's a documented, agreed-upon breaking change visible in planning docs. The suite updater's job is to update suites when the spec evolves, not when the code changes.
Mode Detection
Always run this first. Determines which mode to enter.
- Look for `UAT-SUITE-ANALYSIS.md` at repo root
- Not found → Initialization Mode
- Found → read the `## Progress` section:
  - Unchecked content chunks exist → Worker Mode
  - All content chunks checked (only "Delete this file" unchecked) → Completion Mode
- Override: if the user names a specific suite + specific change (e.g., "just add X to edit-operations.md"), skip mode detection → Ad-hoc Override
- Re-seed override: if a seed file exists with unchecked chunks BUT the user explicitly asks to re-analyze or regenerate (e.g., "run it again", "re-seed", "fresh analysis"), ask: "Found existing seed with N unchecked chunks. Archive the old seed and generate a fresh analysis?" On confirmation, archive to `.research/uat-suite-seeds/` and enter Initialization Mode.
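The file-based part of this decision can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the checkbox syntax is taken from the Seed File Template, and the two overrides are not modeled because they depend on what the user asked for.

```python
import re
from pathlib import Path
from typing import Optional

# Matches "- [ ] label" and "- [x] label" checklist items.
CHECKBOX = re.compile(r"^- \[( |x|X)\] (.*)$")


def detect_mode(seed_text: Optional[str]) -> str:
    """Map seed-file contents (None if the file is absent) to a mode name."""
    if seed_text is None:
        return "initialization"
    in_progress = False
    unchecked_content = 0
    for line in seed_text.splitlines():
        if line.startswith("## "):
            # Only checkboxes under the "## Progress" heading count.
            in_progress = line.strip() == "## Progress"
            continue
        if not in_progress:
            continue
        match = CHECKBOX.match(line.strip())
        if not match:
            continue
        checked, label = match.group(1) != " ", match.group(2)
        if "Delete this file" in label:
            continue  # the final cleanup box is not a content chunk
        if not checked:
            unchecked_content += 1
    return "worker" if unchecked_content else "completion"


def detect_mode_at(repo_root: Path) -> str:
    """Read the seed file from the repo root, if present, and classify it."""
    seed = repo_root / "UAT-SUITE-ANALYSIS.md"
    return detect_mode(seed.read_text(encoding="utf-8") if seed.exists() else None)
```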
Idempotency: this workflow is safe to re-run. Initialization always compares the spec against the current state of existing suites. If suites are already up to date (from a previous run or manual edits), the gap analysis will find fewer or no gaps. Running it twice on the same state produces the same result.
Initialization Mode
Deep research session that produces a seed file coordinating future worker sessions.
Precondition: verify this session is running in a git worktree (git rev-parse --show-toplevel vs git worktree list). Hard stop if not — the seed file and chunk work should happen on a branch, not main.
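One way to read this precondition, sketched under assumptions: `git worktree list` always includes the main worktree (listed first), so the check has to confirm that the current top-level directory is one of the later entries. The helper name `is_linked_worktree` is hypothetical; it takes the two command outputs as strings so the logic can be inspected without running git.

```python
def is_linked_worktree(toplevel: str, porcelain: str) -> bool:
    """True if `toplevel` is a linked worktree rather than the main one.

    `toplevel` is the output of `git rev-parse --show-toplevel`;
    `porcelain` is the output of `git worktree list --porcelain`,
    which lists the main worktree first.
    """
    prefix = "worktree "
    paths = [line[len(prefix):] for line in porcelain.splitlines()
             if line.startswith(prefix)]
    if not paths:
        return False
    current = toplevel.strip()
    main, linked = paths[0], paths[1:]
    return current != main and current in linked
```

The comparison is purely textual, so paths that differ only by symlink resolution would not match; treat a negative result as a prompt to check by hand rather than as proof.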
Step 1 — Determine milestone scope
- Read `.planning/STATE.md` (frontmatter: `milestone`, `milestone_name`)
- Read `.planning/ROADMAP.md` for phase ranges
- List git tags (`git tag --list`)
- Ask user: current milestone or archived?
  - Current → `.planning/ROADMAP.md` + `.planning/phases/`
  - Archived → `.planning/milestones/v{X}-ROADMAP.md` + `v{X}-phases/`
- Determine git diff range from consecutive tags (e.g., `v1.3..v1.3.1`)
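Picking the diff range from consecutive tags can be sketched like this. It assumes tags follow a `vMAJOR.MINOR[.PATCH]` pattern (as in the `v1.3..v1.3.1` example) and ignores anything else; `version_key` and `diff_range` are hypothetical helper names.

```python
import re


def version_key(tag: str) -> tuple[int, ...]:
    """Turn 'v1.3.1' into (1, 3, 1) so tags sort numerically, not alphabetically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def diff_range(tags: list[str], target: str) -> str:
    """Return 'prev..target', where prev is the version tag immediately before target."""
    versions = sorted(
        (t for t in tags if re.fullmatch(r"v\d+(\.\d+)*", t)),
        key=version_key,
    )
    idx = versions.index(target)  # raises ValueError if target is not a version tag
    if idx == 0:
        raise ValueError(f"{target} has no earlier tag to diff against")
    return f"{versions[idx - 1]}..{target}"
```

Numeric sorting matters here: a plain string sort would place `v1.10` before `v1.3`.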
Step 2 — Deep exploration (parallel Explore agents)
Spawn four agents in parallel:
- Agent A (scope overview): run `git diff {prev}..{tag} --stat` to understand which areas changed, then read phase CONTEXT files and verification reports to understand what each phase delivered. Group into themes based on planning docs, not code.
- Agent B (warning/error inventory): read phase requirement specs, CONTEXT files, and verification reports. Extract every warning/error that was specified or confirmed: record ID, expected behavior, and trigger condition. Do NOT read source code to find warning strings.
- Agent C (existing suite review): read all suites in `.claude/skills/uat-regression/tests/`; catalog test counts and coverage domains
- Agent D (planning context): read milestone ROADMAP and phase CONTEXT files to understand what was intended
Step 3 — Gap analysis
- Per-suite: what new tests are needed, what assertions are broken (with references to spec/planning docs)
- Cross-reference every warning/error string against existing suite coverage
- Determine if new suites or composite restructuring is needed
- New suite detection: if the analysis identifies that a new suite file is needed, flag it — this affects how chunks are structured (see Step 4)
- Known gaps: distinguish between "not yet implemented" and "should work but doesn't":
- If a spec requirement belongs to a phase that hasn't been executed yet → skip it, no test needed yet
- If a feature should be working (phase completed) but has a known bug → create a test reflecting the spec's expected behavior. It will fail during UAT, which is correct — the suite caught a real gap.
Step 4 — Chunk the work
- Group by suite affinity (shared themes/setup)
- Proportional sizing: ~15 new tests + ~10 assertion fixes max per chunk, but scale down to match the actual gap. If total work is ≤15 tests + ≤10 fixes, that's 1 chunk, not 4. Don't create artificial chunk boundaries for small updates.
- New suite registration: when a chunk creates a NEW suite file, that same chunk MUST include instructions to:
- Add the suite to the uat-regression SKILL.md skill table (name, file path, test count, coverage description)
- Add the suite to the appropriate combined suite (reads-combined or writes-combined) — or flag if a new combined suite is needed or an existing one should be split
- Do NOT defer registration to a later chunk — the suite must be discoverable the moment it exists
- If composites need deeper restructuring beyond adding a row, create a separate chunk for that
- Always end with a "Delete this file" checkbox
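The proportional sizing rule can be illustrated with a greedy packing sketch. Everything here is an assumption made for illustration: the class and function names are hypothetical, the input is expected to be pre-ordered by suite affinity, and a single suite that exceeds a ceiling on its own simply becomes one oversized chunk to be split by hand.

```python
from dataclasses import dataclass, field

MAX_TESTS, MAX_FIXES = 15, 10  # per-chunk ceilings from the sizing rule


@dataclass
class SuiteWork:
    suite: str
    new_tests: int
    fixes: int


@dataclass
class Chunk:
    suites: list[str] = field(default_factory=list)
    new_tests: int = 0
    fixes: int = 0


def plan_chunks(work: list[SuiteWork]) -> list[Chunk]:
    """Greedily pack suites into chunks without exceeding either ceiling."""
    chunks: list[Chunk] = []
    current = Chunk()
    for item in work:
        fits = (current.new_tests + item.new_tests <= MAX_TESTS
                and current.fixes + item.fixes <= MAX_FIXES)
        if current.suites and not fits:
            # Close the current chunk and start a new one for this suite.
            chunks.append(current)
            current = Chunk()
        current.suites.append(item.suite)
        current.new_tests += item.new_tests
        current.fixes += item.fixes
    if current.suites:
        chunks.append(current)
    return chunks
```

With a small gap (say 11 tests and 5 fixes across two suites) this yields a single chunk, which matches the "1 chunk, not 4" guidance.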
Step 5 — Write seed file
- Write `UAT-SUITE-ANALYSIS.md` at repo root following the Seed File Template section below
- No suite editing in this mode: the seed file IS the deliverable
Step 6 — Ambiguity gate
Before committing, present all ambiguities encountered during research. The user will NOT review the seed itself — this is their only chance to catch misinterpretations before they get baked into chunk instructions.
Resolution hierarchy (when sources conflict):
- Updated spec in `.research/updated-spec/` supersedes original spec
- Phase discussion logs often contain explicit "we decided X" resolutions; check these first
- Later phase CONTEXT files supersede earlier ones on the same topic
- If the resolution is clearly documented in any of the above → not an ambiguity, just use it
- If NOT clearly documented → flag it as an ambiguity below
Present each ambiguity with:
- What was unclear (e.g., "original requirement says X, Phase N context says Y")
- What you chose and why (e.g., "went with Y because it's the later decision")
- Confidence level (high = clear resolution, medium = reasonable judgment call, low = coin flip)
- Wait for user confirmation — user reviews ambiguities, corrects any wrong calls
- If corrections needed: update the seed file, re-present the corrected items
- On confirmation: commit with the message `docs: add UAT suite gap analysis for v{version}`
If no ambiguities were found, say so explicitly ("no ambiguities — all requirements were clear and consistent") and proceed to commit.
No gaps found (expected outcome on re-runs)
If research shows suites are already up to date, say so with evidence (which spec requirements are covered, which warnings have tests). Don't produce an empty seed file — this is a successful result, not an edge case. This is the expected outcome when re-running on a milestone whose suites were already updated.
Worker Mode
Pick up the next chunk from the seed file and execute it.
Step 1 — Orient
- Read `UAT-SUITE-ANALYSIS.md`, find first unchecked content chunk
- Version check: compare version in seed title against `.planning/STATE.md`; warn if mismatched
- Read the uat-regression skill (`.claude/skills/uat-regression/SKILL.md`) silently to internalize conventions
- Read the specific suite files being updated in this chunk
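The version check in this step might look like the sketch below. It assumes the seed title follows the template (`# UAT Suite Analysis`, then `v{version}`) and that `.planning/STATE.md` carries a plain `milestone:` frontmatter line; the exact frontmatter format is not specified here, so the pattern is a guess, and all three function names are hypothetical.

```python
import re
from typing import Optional


def seed_version(seed_text: str) -> Optional[str]:
    """Pull the version out of the seed title line."""
    m = re.search(r"^# UAT Suite Analysis\W+v([0-9][\w.\-]*)", seed_text, re.MULTILINE)
    return m.group(1) if m else None


def state_milestone(state_text: str) -> Optional[str]:
    """Read `milestone:` from the frontmatter without a YAML dependency."""
    m = re.search(r"^milestone:\s*[\"']?v?([0-9][\w.\-]*)", state_text, re.MULTILINE)
    return m.group(1) if m else None


def versions_match(seed_text: str, state_text: str) -> bool:
    """False means the worker should warn before going further."""
    seed, state = seed_version(seed_text), state_milestone(state_text)
    return seed is not None and seed == state
```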
Step 2 — Targeted research
- Re-read the relevant planning docs (phase CONTEXT files, verification reports) for the specific suites in this chunk
- Don't re-research the whole milestone — the seed has that context
- If the seed references specific warning/error IDs, verify their expected behavior from planning docs — do NOT read source code
- Line number drift: verify line refs in existing suite files against current state (earlier chunks may have shifted them)
Step 3 — Execute
- Write/update suite files per chunk instructions
- Match existing suite format exactly (see Suite Conventions below)
Step 4 — Verification
After writing the suite changes, identify assumptions that need live verification (e.g., exact warning text, filter behavior, edge cases).
- Present assumptions: list each assumption with what you'd check and how
- Offer self-verification: ask the user "Want me to run these checks myself against your live OmniFocus?"
- If user approves:
  - Create minimal test tasks in inbox via MCP `add_tasks` (use a `UAT-Verify-` prefix for isolation)
  - Run the MCP tool calls that exercise the assumptions
  - Report results: confirmed or discrepancy found
  - Clean up: create `⚠️ DELETE THIS AFTER UAT` in inbox (or reuse if one exists), move all verification tasks under it, tell user to delete it. Same cleanup protocol as the main uat-regression skill.
  - If a discrepancy is found: ask the user, "Did I misunderstand the spec, or is this a real bug?" If misunderstanding → correct the suite to match the user's clarification. If real bug → keep the test reflecting the spec's expected behavior (it will fail during UAT, which is the correct outcome: the suite caught a real gap).
- If user declines (or wants to check manually): list the spot-checks so the user can run them manually, then proceed to Step 5
Never run verification autonomously. Always present assumptions first, always ask permission, always wait for explicit approval before touching OmniFocus.
Step 5 — Completion protocol
- Summarize: files modified, tests added, assertions fixed. If self-verification ran, include results.
- Wait for user sign-off — user reviews the changes (and verification results if applicable)
- On approval:
  - Commit suite changes: `test(uat): ...`
  - Mark chunk done in separate commit: `chore: mark chunk N complete in UAT suite analysis`
- If all content chunks now done: inform user, suggest triggering this skill again for Completion mode
Edge case — Concurrent workers
The checkbox mechanism isn't atomic. If a chunk was just checked by another session, move to the next unchecked chunk.
Completion Mode
All content chunks are done. Archive the seed file and wrap up.
- Create `.research/uat-suite-seeds/` directory if it doesn't exist
- Archive: `git mv UAT-SUITE-ANALYSIS.md .research/uat-suite-seeds/v{version}.md`
- Commit: `chore: archive UAT suite analysis for v{version}`
- Remind user to merge the worktree branch to main and clean up the worktree
Ad-hoc Override
If the user names a specific suite and a specific change ("just add X to edit-operations.md"), skip mode detection entirely. Read the target suite, follow uat-regression conventions, make the change. No seed file involved.
Seed File Template
The seed file must follow this exact structure. See the real UAT-SUITE-ANALYSIS.md in the repo for a concrete example.
# UAT Suite Analysis — v{version} "{milestone_name}"
## How to Use This File
This file is the output of a research session that analyzed what v{version} changed vs what existing UAT suites cover. It contains everything a fresh agent needs to update the suites without re-doing the research.
**Workflow:** Run `/uat-suite-updater` in a new session. The skill auto-detects this file and enters Worker mode — it will find the next unchecked chunk, do targeted research, execute the changes, and mark the chunk done.
**Important:** The agent still needs to do its own targeted research for the specific suites it's updating; the gap tables below are a starting point, not exhaustive. The agent should verify against the spec and planning docs, never source code, especially for expected warning text and behavior.
---
## Progress
- [ ] Chunk 1 — {title}
- [ ] Chunk 2 — {title}
- [ ] ...
- [ ] **Delete this file** (all chunks done, everything merged)
---
## Chunks — Task List
### Chunk completion protocol
After finishing the suite edits for a chunk, the agent does NOT commit. Instead:
1. **Present assumptions** — list any assumptions about live behavior that the suite relies on (exact warning text, filter results, edge cases)
2. **Offer self-verification** — "Want me to run these checks myself against your live OmniFocus?" If approved, the agent creates minimal test tasks via MCP, runs the checks, reports results, and cleans up (see Worker Mode Step 4 in the skill for the full protocol). If a discrepancy is found, the agent updates the suite before proceeding.
3. **Summarize changes** — list every file modified, tests added, assertions fixed, and verification results if applicable
4. **Wait for sign-off** — user reviews the changes
5. **On approval**: commit the suite changes, then check the box in the Progress checklist above in a separate commit
---
### Chunk 1: {title}
**Suites:** {list of suite files}
**What to do:**
- {detailed instructions per suite — new tests, assertion fixes, with line references}
**Est. scope:** ~N new tests + ~M assertion fixes.
---
{repeat for each chunk}
---
## Reference Material
Everything below is research output — the chunks above reference it.
---
## What v{version} Built
{Themes with bullet points — what changed and why}
---
## Gap Analysis by Suite
### {suite name} ({N} tests) — {NEEDS UPDATES | UP TO DATE}
**New scenarios needed:**
| Category | Test | Why |
|----------|------|-----|
| ... | ... | ... |
**Existing tests that may need assertion updates:**
- {test reference — what changed}
---
{repeat for each suite}
### Suites that DON'T need changes
| Suite | Why it's fine |
|-------|---------------|
| ... | ... |
---
## Warning/Error Inventory
Every new warning/error from v{version} that needs at least one UAT test:
### Errors
| ID | Text Pattern | Covered By |
|----|-------------|------------|
| ... | ... | ... |
### Warnings
| ID | Text Pattern | Covered By |
|----|-------------|------------|
| ... | ... | ... |
---
## Combined Suite Strategy
{If composites need restructuring — rationale and plan. Omit if no changes needed.}
---
## Summary of Work
| Suite | Action | New Tests | Assertion Fixes |
|-------|--------|-----------|-----------------|
| ... | ... | ... | ... |
| **Total** | | **~N** | **~M** |
---
## Final Cleanup
Once ALL chunks are complete and committed, and the user has validated everything:
1. Run `/uat-suite-updater` one more time — it will enter Completion mode and archive this file
2. The worktree branch is now ready for the user to review and merge to main
Suite Conventions
Workers must follow these conventions when writing or updating suite files. Read at least one existing suite in .claude/skills/uat-regression/tests/ to internalize the patterns.
Suite file structure
# [Suite Name] Test Suite
[One-line description]
## Conventions
- Inbox only, 1-item limit, plus any suite-specific rules
## Setup
### Task Hierarchy
[ASCII tree of tasks to create, with notes on pre-configured state]
### Manual Actions
[What the user needs to do in OmniFocus before tests run]
## Tests
### 1. [Category]
#### Test 1a: [Name]
1. [Step]
2. PASS if: [criteria]
## Report Table Rows
| # | Test | Description | Result |
|---|------|-------------|--------|
Key conventions
- Every test has explicit "PASS if" criteria
- Error tests say "Run INDIVIDUALLY" (Claude Code cancels sibling calls on error)
- Tests that modify shared state include cleanup steps
- Report table has one row per test (no grouping), with a Description column
- Task names use `T[N]-[ShortName]` format
Warning/error inventory cross-reference
The single most important gap-finding mechanism. Take the warning/error inventory and check each entry against existing tests. For every warning or error string with no test triggering it, add one. Warnings are first-class citizens in this codebase — every one deserves a UAT test.
Pay special attention to cross-feature interactions:
- What happens when this feature's data is on a task that undergoes other operations (lifecycle, move, etc.)?
- What happens when other operations interact with a task that has this feature active?
- If the codebase distinguishes completed vs dropped (it does — different warning strings, different code paths), make sure BOTH are tested.
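The cross-reference described in this section can be sketched as a simple lookup. The names and the matching strategy are assumptions: `inventory` maps a warning/error ID to its text pattern (as in the seed's Warning/Error Inventory tables), `suites` maps a suite filename to its contents, and a case-insensitive substring match stands in for whatever judgment a worker would actually apply.

```python
def uncovered(inventory: dict[str, str], suites: dict[str, str]) -> list[str]:
    """Return IDs whose text pattern appears in no suite file."""
    lowered_suites = [text.lower() for text in suites.values()]
    return [
        entry_id
        for entry_id, pattern in inventory.items()
        if not any(pattern.lower() in text for text in lowered_suites)
    ]
```

Tests that assert behavioral criteria rather than exact text will not contain the pattern, so a reported gap is a prompt to look, not proof that coverage is missing.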
Additional coverage patterns
After the inventory cross-reference, also consider:
- Type/variant variety: all relevant variants exercised, not just the simplest
- Round-trip verification: every write test verifies via `get_task` that data survived the round-trip
- No-op detection: sending identical data → "no changes" warning
- Error message cleanliness: no "type=", "pydantic", "input_value" leaking
- Combo scenarios: feature + field edit in same call; feature no-op + field edit (warning present but field still applied)
- Merge/partial update: if partial updates are supported, test same-type merges AND type changes
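- Edge cases from automated tests: anything the planning docs (e.g., verification reports) describe as covered by pytest that should also be verified live. Take these from the docs, not from the test files themselves.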
- Completed vs dropped: both states tested, not just one