name: evaluate
description: "Evaluate requirements, feature requests, and task intake against project DDD context. Produces a GO/DEFER/REJECT/ESCALATE recommendation with ROI scoring, scope definition, and acceptance criteria.\n TRIGGER: \"evaluate this request\", \"should we build this\", \"assess this requirement\", \"triage this\".\n NOT FOR: deep-research use cases."
consumes_artifacts:
- research
produces_artifact: evaluation
tier: always
Requirement Evaluation
The "should we?" gate for the lifecycle pipeline. Evaluates any incoming requirement, feature request, or task against the 4 DDD questions before committing pipeline resources.
Works at L0 (no DDD docs) by structuring any request with effort and impact estimates. Gains autonomous judgment at L2, where the DDD docs provide strategic alignment, feasibility, and history.
The 4 Questions
Every evaluation answers these, in order:
| # | Question | Source | Without DDD |
|---|---|---|---|
| 1 | Should we do this? | PRODUCT.md (strategic alignment) | Ask the user |
| 2 | Can we do this? | TECH.md (feasibility, constraints) | Estimate from request |
| 3 | Have we tried this? | IMPROVEMENT.md (past lessons) | No historical context |
| 4 | Should we do it now? | PROJECT.md (current priorities) | Assume yes |
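The table above can be read as a lookup: each question has a source document, and a missing document triggers the fallback in the last column. The following is a minimal sketch, assuming all four DDD docs sit directly in one directory; the `question_sources` helper and the `docs_dir` layout are hypothetical and not part of the skill.

```python
from pathlib import Path

# (question, source doc, fallback when the doc is missing), copied from the table.
QUESTIONS = [
    ("Should we do this?", "PRODUCT.md", "Ask the user"),
    ("Can we do this?", "TECH.md", "Estimate from request"),
    ("Have we tried this?", "IMPROVEMENT.md", "No historical context"),
    ("Should we do it now?", "PROJECT.md", "Assume yes"),
]


def question_sources(docs_dir: str) -> list:
    """Pair each question with its DDD doc, or with the fallback if the doc is absent.

    Assumption: the four docs live directly under docs_dir.
    """
    root = Path(docs_dir)
    plan = []
    for question, doc, fallback in QUESTIONS:
        source = doc if (root / doc).is_file() else fallback
        plan.append((question, source))
    return plan
```

When every entry resolves to a fallback, the evaluation is effectively running at L0.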
Workflow
Step 1: Parse the Request
From the user's message, extract:
- What: one-sentence description of the requirement
- Why: stated motivation or inferred business value
- Who: who benefits (end user, developer, internal team)
- Constraints: deadlines, dependencies, blockers mentioned
If the request is too vague to parse (e.g., "improve things"), ESCALATE immediately:
"I need more specifics to evaluate this. What specifically should improve, and what would success look like?"
Step 2: Score (L2 with DDD docs)
Read the DDD docs and score each dimension 1-5:
Strategic Alignment (PRODUCT.md):
- 5: Directly serves #1 priority
- 4: Serves top-3 priorities
- 3: Aligned but not priority
- 2: Tangentially related
- 1: Not aligned / conflicts with non-goals
Feasibility (TECH.md):
- 5: Trivial — existing pattern, < 1 session
- 4: Straightforward — known approach, 1-2 sessions
- 3: Moderate — some unknowns, 2-4 sessions
- 2: Hard — significant unknowns or new patterns, 4+ sessions
- 1: Very hard — architectural change, cross-cutting, weeks
Historical Lessons (IMPROVEMENT.md):
- Check "What Failed" for similar past attempts
- Check "What Worked" for applicable patterns
- Check "Known Issues" for related problems
- Score 1-5:
  - 5: Strong proven pattern — same approach succeeded before
  - 4: Related pattern exists — similar approach worked
  - 3: No history — neutral (default)
  - 2: Weak negative signal — partial failure or abandoned attempt
  - 1: Strong negative — same approach tried and failed
Current Priority (PROJECT.md):
- 5: Directly unblocks current focus
- 4: Supports current sprint
- 3: Important but not current
- 2: Nice to have
- 1: Conflicts with / distracts from current work
ROI Formula:
ROI = (Strategic * 0.35) + (Current_Priority * 0.25) + (Historical * 0.15) + (Feasibility * 0.25)
Range: [1.0, 5.0]. Higher feasibility = easier = higher ROI. All dimensions on 1-5 scale.
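The formula can be sketched as a small function. This is an illustrative sketch, not the skill's implementation: the `roi` function name and the choice to hold weights as whole percentages (so the sum is computed in integers before dividing by 100) are assumptions made to keep threshold comparisons free of floating-point drift.

```python
# Weights in whole percent, taken from the ROI formula above; they sum to 100.
WEIGHTS = {"strategic": 35, "priority": 25, "historical": 15, "feasibility": 25}


def roi(strategic: int, priority: int, historical: int, feasibility: int) -> float:
    """Weighted ROI on the 1-5 scale, computed in integer hundredths."""
    scores = {
        "strategic": strategic,
        "priority": priority,
        "historical": historical,
        "feasibility": feasibility,
    }
    for name, value in scores.items():
        if value not in (1, 2, 3, 4, 5):
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    hundredths = sum(WEIGHTS[name] * value for name, value in scores.items())
    return hundredths / 100


print(roi(strategic=4, priority=3, historical=4, feasibility=3))  # 3.5
print(roi(1, 1, 1, 1), roi(5, 5, 5, 5))  # 1.0 5.0
```

Because every weight is a multiple of 0.05, ROI values move in steps of 0.05, so scores such as 3.15 are possible and the thresholds need to cover them.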
Step 2.5: Thesis Alignment
Read Knowledge/Learned/THESIS.md and scan the Thesis Health Summary table for all Active theses.
For each thesis, ask one question: Does this requirement align with, contradict, or have no connection to this thesis?
- ALIGNS — the requirement directly serves or deepens what this thesis claims is important. Note how.
- CONTRADICTS — the requirement works against what this thesis claims. FLAG visibly — don't auto-reject, but make the conflict impossible to miss.
- N/A — no meaningful connection. Skip silently.
Most requirements hit 1-2 theses, rarely all. Only report ALIGNS and CONTRADICTS — don't list every N/A.
If a thesis tension applies (see "Known Tensions Between Theses" section in THESIS.md), surface it: "This sits at the T3 vs T6 tension — building understanding tools that are also infrastructure. Resolution principle: [quote from Tensions table]."
Output (add to evaluation):
### Thesis Alignment
- T1 (memory moat): ALIGNS — adds persistent skill-level memory
- T3 (understanding > execution): ALIGNS — helps user see patterns, not just run tasks
Skip when: No DDD docs (L0 evaluation) — thesis alignment requires enough context to judge.
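The reporting rule (show ALIGNS and CONTRADICTS, drop N/A) can be sketched as a filter. The `reportable` helper and the tuple layout are hypothetical, and listing contradictions first is only one assumed way to make a conflict hard to miss.

```python
from typing import List, Tuple

# (thesis label, verdict, note); verdict is "ALIGNS", "CONTRADICTS", or "N/A".
Verdict = Tuple[str, str, str]


def reportable(verdicts: List[Verdict]) -> List[Verdict]:
    """Drop N/A verdicts and list CONTRADICTS first so conflicts stand out."""
    kept = [v for v in verdicts if v[1] in ("ALIGNS", "CONTRADICTS")]
    # sorted() is stable, so the original order is preserved within each group.
    return sorted(kept, key=lambda v: v[1] != "CONTRADICTS")
```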
Step 3.5: Pre-mortem (Mandatory for GO candidates)
If the initial ROI >= 3.2 (GO candidate), run a pre-mortem before confirming:
"It's 2 weeks later. This feature shipped but is considered a failure. What are the 3 most likely reasons it failed?"
Rules:
- Each reason must be specific — not "it was too complex" (vague), but "the HTML structure changed and the regex parser returned 0 results for 3 days before anyone noticed" (specific)
- Check IMPROVEMENT.md "What Failed" for relevant prior failures. If one exists, at least 1 reason must reference it. If none exists, note "no prior art in IMPROVEMENT.md" and move on.
- At least 1 reason must challenge a specific scoring assumption — name which dimension relied on an unverified assumption, and what the assumption was
Output:
| # | Failure Reason | Likelihood | Mitigation |
|---|---|---|---|
| 1 | | high/med/low | |
| 2 | | high/med/low | |
| 3 | | high/med/low | |
Decision impact:
- If any reason has likelihood=HIGH and no mitigation exists → downgrade to ESCALATE, surface the risk to user
- If pre-mortem reveals a scoring assumption was unverified → reduce the specific dimension whose score relied on that assumption by 1 and recalculate ROI. (E.g., if Feasibility=4 assumed "existing pattern works" but pre-mortem shows it might not → Feasibility becomes 3.) If new ROI < 3.2 → DEFER
- If all reasons are med/low with clear mitigations → GO confirmed
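The decision rules above can be sketched as one function. Several details here are assumptions, because the text does not specify them: the `FailureReason` type, reducing a dimension at most once even if several reasons challenge it, a floor of 1 on reduced scores, checking the ESCALATE rule before the DEFER rule, and treating a HIGH reason that has a mitigation as compatible with GO.

```python
from dataclasses import dataclass
from typing import List, Optional

GO_THRESHOLD = 3.2
# Same weights as the ROI formula, in whole percent.
WEIGHTS = {"strategic": 35, "priority": 25, "historical": 15, "feasibility": 25}


@dataclass
class FailureReason:
    reason: str
    likelihood: str                  # "high", "med", or "low"
    mitigation: str = ""             # empty string means no mitigation exists
    unverified_dimension: Optional[str] = None  # dimension that relied on an unverified assumption


def apply_pre_mortem(scores: dict, reasons: List[FailureReason]) -> tuple:
    """Return (decision, adjusted_scores, roi) after applying the pre-mortem rules."""
    adjusted = dict(scores)
    # Reduce each dimension that relied on an unverified assumption by 1 (floor of 1).
    for dim in {r.unverified_dimension for r in reasons if r.unverified_dimension}:
        adjusted[dim] = max(1, adjusted[dim] - 1)
    roi = sum(WEIGHTS[d] * adjusted[d] for d in WEIGHTS) / 100

    # A HIGH-likelihood reason with no mitigation escalates to the user.
    if any(r.likelihood == "high" and not r.mitigation for r in reasons):
        return "ESCALATE", adjusted, roi
    # A recalculated ROI below the GO threshold defers.
    if roi < GO_THRESHOLD:
        return "DEFER", adjusted, roi
    return "GO", adjusted, roi
```

For example, scores of 3/3/3 with Feasibility=4 give ROI 3.25; if the pre-mortem shows the feasibility score relied on an unverified assumption, Feasibility drops to 3, ROI becomes 3.0, and the result is DEFER.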
Why this exists: EVALUATE has happy-path bias (LL09, 3 recurrences). The pre-mortem technique (Gary Klein) relies on prospective hindsight: research cited by Klein found that imagining an outcome has already happened improves the ability to correctly identify reasons for it by about 30%. "Imagine it failed" is concrete; "argue why not" is abstract. Same agent, same pass, one extra section, zero architecture change.
Skip when: ROI < 3.2 (already DEFER/REJECT — no need to argue against a NO).
Step 3: At L0 (No DDD Docs)
Skip scoring. Instead, structure the request:
## Evaluation: <requirement title>
### What
<one-sentence description>
### Effort Estimate
<T-shirt size: S/M/L/XL based on request complexity>
### Impact Estimate
<T-shirt size: S/M/L/XL based on stated motivation>
### Questions Before Proceeding
1. <what's unclear>
2. <what could go wrong>
3. <what's the success criteria>
### Recommendation
<GO / DEFER / ESCALATE with reasoning>
Step 4: Produce Recommendation
Based on ROI score (L2) or structured analysis (L0):
| Recommendation | When | Action |
|---|---|---|
| GO | ROI >= 3.2, no blockers | Define scope + acceptance criteria. Advance pipeline to THINK. |
| DEFER | 2.0 <= ROI < 3.2, or blocked by current priorities | Add to PROJECT.md backlog with reasoning. |
| REJECT | ROI < 2.0, or conflicts with non-goals | Explain why. Suggest alternative if one exists. |
| ESCALATE | Ambiguous scope, conflicting signals, or confidence < 0.6 | Surface specific questions to user. Don't guess. |
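The table can be sketched as a single mapping function. The `recommend` name and its flags are hypothetical, and two choices are assumptions: ESCALATE is checked first, and any ROI from 2.0 up to (but not including) 3.2 counts as DEFER, so that in-between scores such as 3.15 get a defined outcome.

```python
def recommend(
    roi: float,
    *,
    blocked: bool = False,                   # blocked by current priorities
    conflicts_with_non_goals: bool = False,
    ambiguous: bool = False,                 # ambiguous scope or conflicting signals
    confidence: float = 1.0,
) -> str:
    """Map an ROI score and flags to GO / DEFER / REJECT / ESCALATE."""
    # Assumption: ESCALATE takes precedence, since unclear signals should not be guessed at.
    if ambiguous or confidence < 0.6:
        return "ESCALATE"
    if conflicts_with_non_goals or roi < 2.0:
        return "REJECT"
    if blocked or roi < 3.2:
        return "DEFER"
    return "GO"
```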
Step 5: Output
Present to user:
## Evaluation: <requirement title>
### Scores (L2 only)
| Dimension | Score | Rationale |
|-----------|-------|-----------|
| Strategic Alignment | 4/5 | Serves priority #2 (self-evolution) |
| Feasibility | 3/5 | Moderate — needs new pattern, ~3 sessions |
| Historical | 4/5 | Similar approach worked for context loading |
| Current Priority | 3/5 | Important but not blocking current focus |
| **ROI** | **3.5** | |
### Thesis Alignment
- T1 (memory moat): ALIGNS — <why>
### Recommendation: GO
### Pre-mortem (GO candidates only)
| # | Failure Reason | Likelihood | Mitigation |
|---|---------------|-----------|------------|
| 1 | <specific scenario> | med | <mitigation> |
| 2 | <specific scenario> | low | <mitigation> |
| 3 | <specific scenario> | low | <mitigation> |
Score adjustment: none (no scoring assumption left unverified; no HIGH likelihood without mitigation)
### Scope
<what's included and what's excluded>
### Acceptance Criteria
1. <criterion 1>
2. <criterion 2>
3. <criterion 3>
### Suggested Pipeline
Think: research <topic>
Plan: design with 3 alternatives
Build: implement chosen approach
Test: QA against acceptance criteria
Publish as artifact (L1+):
{
  "requirement": "...",
  "scores": {
    "strategic": 4,
    "feasibility": 3,
    "historical": 4,
    "priority": 3,
    "roi": 3.5
  },
  "recommendation": "GO",
  "thesis_alignment": {"T1": "ALIGNS — adds persistent memory"},
  "pre_mortem": [
    {"reason": "...", "likelihood": "med", "mitigation": "..."},
    {"reason": "...", "likelihood": "low", "mitigation": "..."},
    {"reason": "...", "likelihood": "low", "mitigation": "..."}
  ],
  "scope": "...",
  "acceptance_criteria": ["...", "..."],
  "escalation_questions": [],
  "suggested_pipeline": ["think", "plan", "build", "test"]
}
Escalation Protocol Integration
Use the escalation protocol (backend/core/escalation.py) for transparent
decision-making at every evaluation. Three levels, three outcomes:
L0 INFORM — "Clear call, FYI"
When: ROI >= 3.2 AND confidence is high AND no blockers.
Emit an INFORM annotation and continue the pipeline:
from core.escalation import inform, build_sse_event, save_escalation

esc = inform(
    title="GO: Build payment retry",
    situation="ROI 4.1, directly serves priority #1, proven pattern from IMPROVEMENT.md.",
    trigger="CLEAR_EVALUATION",
    pipeline_stage="evaluate",
    project="<PROJECT>",
    evidence=["PRODUCT.md: checkout reliability is #1", "IMPROVEMENT.md: saga pattern worked"],
)
save_escalation(WORKSPACE_ROOT, esc)
Triggers: CLEAR_EVALUATION
L1 CONSULT — "I think X, override within 24h"
When: 2.5 <= ROI < 3.2 (borderline) OR confidence medium OR non-obvious tradeoff.
Swarm proceeds with its recommendation. Human has 24h to override via Radar todo.
from core.escalation import consult, Option, build_sse_event, save_escalation, create_radar_todo

esc = consult(
    title="DEFER recommended: UI performance investigation",
    situation="ROI 2.8 — aligned with priorities but effort is high (architectural). Historical: no prior attempt. Proceeding with DEFER unless overridden.",
    options=[
        Option(label="DEFER (recommended)", description="Add to backlog, revisit next sprint", risk="low", is_recommendation=True),
        Option(label="GO", description="Start research phase now", risk="medium"),
        Option(label="Discuss", description="I need more context from you"),
    ],
    trigger="LOW_CONFIDENCE_ROI",
    recommendation="DEFER — effort-to-impact ratio doesn't justify immediate action",
    pipeline_stage="evaluate",
    project="<PROJECT>",
    evidence=["PRODUCT.md: priority #3", "TECH.md: requires new pattern"],
    timeout_hours=24,
)
save_escalation(WORKSPACE_ROOT, esc)
create_radar_todo(esc)
Triggers: LOW_CONFIDENCE_ROI, CONFLICTING_PRIORITIES
L2 BLOCK — "I'm stuck, need your input"
When: ANY of these are true:
- Ambiguous scope: Can't determine what "done" looks like
- Conflicting requirements: PRODUCT.md says X, TECH.md says not-X
- Missing information: Can't answer 2+ of the 4 questions
- High-risk decision: Architecture change, data migration, public API change
- Resource contention: PROJECT.md shows too many open items
Pipeline PAUSES. Creates a high-priority Radar todo.
from core.escalation import block, Option, build_sse_event, save_escalation, create_radar_todo

esc = block(
    title="Cannot evaluate: ambiguous scope",
    situation="'Improve performance' — of what? API latency, UI render, or build speed? Each leads to a different recommendation.",
    options=[
        Option(label="Focus on API latency", description="Proven pattern, 2 sessions", risk="low", is_recommendation=True),
        Option(label="Focus on UI render", description="Needs research first, 4+ sessions", risk="medium"),
        Option(label="Research both first", description="1-session investigation, then decide"),
        Option(label="Discuss", description="Let me explain more context"),
    ],
    trigger="AMBIGUOUS_SCOPE",
    recommendation="API latency — if forced to choose",
    pipeline_stage="evaluate",
    project="<PROJECT>",
    evidence=["PRODUCT.md: 'performance' listed but not specified"],
)
save_escalation(WORKSPACE_ROOT, esc)
create_radar_todo(esc)
Triggers: AMBIGUOUS_SCOPE, CONFLICTING_PRIORITIES, MISSING_INFORMATION
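Choosing among the three levels can be sketched as a pure function over the "When" conditions. This does not call `core.escalation`; the `escalation_level` helper, its string confidence labels, and the fallback to CONSULT for cases the text leaves open (for example, ROI below 2.5 with high confidence) are all assumptions.

```python
def escalation_level(
    roi: float,
    confidence: str,                  # "high" or "medium" (labels are an assumption)
    *,
    blockers: bool = False,           # any of the L2 BLOCK conditions holds
    non_obvious_tradeoff: bool = False,
) -> str:
    """Pick INFORM, CONSULT, or BLOCK from the 'When' conditions above."""
    if blockers:
        return "BLOCK"
    if roi >= 3.2 and confidence == "high" and not non_obvious_tradeoff:
        return "INFORM"
    # Borderline ROI, medium confidence, or a non-obvious tradeoff map to CONSULT.
    # Assumption: cases the text does not cover also fall back to CONSULT
    # rather than proceeding silently.
    return "CONSULT"
```

The returned label would then select which of `inform`, `consult`, or `block` to call.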
Escalation to Artifact
When escalating with a project (L1/L2), also publish a partial evaluation artifact so context is preserved for async resolution:
python backend/scripts/artifact_cli.py publish \
--project <PROJECT> --type evaluation --producer s_evaluate \
--summary "ESCALATE: <reason>" \
--data '{"recommendation": "ESCALATE", "escalation_questions": ["..."], "partial_scores": {...}}'
Rules
- Never auto-GO on architectural changes — always ESCALATE for human review
- Never REJECT without explaining why — the user deserves reasoning
- DEFER is not REJECT — deferred items get logged in PROJECT.md for future triage
- L0 evaluation is still valuable — structuring a vague request IS the evaluation
- Don't over-score — 3/5 is the default. 5/5 requires strong evidence.
Artifact Operations
Discover prior research (before scoring):
python backend/scripts/artifact_cli.py discover --project <PROJECT> --types research --full
Publish evaluation (after presenting to user):
python backend/scripts/artifact_cli.py publish \
--project <PROJECT> --type evaluation --producer s_evaluate \
--summary "<GO/DEFER/REJECT/ESCALATE>: <one-line rationale>" \
--data '<JSON of evaluation output>'
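When publishing from Python rather than a shell, building the argument list directly avoids quoting problems with the JSON payload. The sketch below uses only the subcommand and flags shown above; the `publish_argv` helper is hypothetical, and it assumes the command runs from the repository root.

```python
import json
import sys
from typing import List


def publish_argv(project: str, recommendation: str, rationale: str, evaluation: dict) -> List[str]:
    """Build the argv for `artifact_cli.py publish` without going through a shell."""
    return [
        sys.executable,  # the interpreter currently running, in place of a bare `python`
        "backend/scripts/artifact_cli.py",
        "publish",
        "--project", project,
        "--type", "evaluation",
        "--producer", "s_evaluate",
        "--summary", f"{recommendation}: {rationale}",
        "--data", json.dumps(evaluation),
    ]
```

The resulting list can be passed to `subprocess.run(argv, check=True)`.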
Advance pipeline:
python backend/scripts/artifact_cli.py advance --project <PROJECT> --state think
Verification
Before marking this task complete, show evidence for each:
- ROI score calculated — numeric ROI shown (or T-shirt sizing at L0) with per-dimension scores and rationale
- Thesis alignment checked — relevant theses from THESIS.md evaluated; conflicts flagged visibly (L2 only)
- Pre-mortem completed — 3 specific failure reasons with likelihood and mitigation (for GO candidates; skip for DEFER/REJECT)
- Recommendation stated — explicit GO / DEFER / REJECT / ESCALATE with reasoning tied to scores (and pre-mortem if applicable)
- Acceptance criteria defined — numbered, testable criteria for what "done" looks like (GO) or clear rationale for deferral/rejection
- Scope boundaries set — what is included and what is explicitly excluded from the scope
- Evaluation artifact published — JSON artifact saved via artifact_cli (L1+) or structured output shown in chat (L0)
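Part of this checklist can be checked mechanically against the published artifact. The sketch below covers only the items visible in the JSON structure shown earlier (for L1+ evaluations); the `missing_evidence` helper is hypothetical, and judgment-based items such as whether criteria are testable still need a human or agent review.

```python
from typing import List


def missing_evidence(artifact: dict) -> List[str]:
    """List checklist items that the evaluation artifact does not yet satisfy."""
    missing = []
    rec = artifact.get("recommendation")
    if rec not in ("GO", "DEFER", "REJECT", "ESCALATE"):
        missing.append("recommendation")
    if "roi" not in artifact.get("scores", {}):
        missing.append("roi score")
    if rec == "GO":
        # Pre-mortem, acceptance criteria, and scope are only required for GO.
        if len(artifact.get("pre_mortem", [])) < 3:
            missing.append("pre-mortem (3 reasons)")
        if not artifact.get("acceptance_criteria"):
            missing.append("acceptance criteria")
        if not artifact.get("scope"):
            missing.append("scope")
    return missing
```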