name: auto-review-loop
description: "Autonomous multi-round research review loop. Supports two modes: (1) Plan-driven: takes an implementation plan file, executes TODO items respecting dependency DAG, uses Codex MCP to verify completion of each item. (2) Free-form: iterates review → fix → re-review until positive assessment. Use when user says 'auto review loop', 'review until it passes', or wants iterative improvement."
argument-hint: "[topic-or-scope] [--plan path/to/plan.md]"
allowed-tools: Bash(*), Read, Grep, Glob, Write, Edit, Agent, Skill, mcp__codex__codex, mcp__codex__codex-reply
Auto Review Loop: Autonomous Research Improvement
Context: $ARGUMENTS
Step 0: Environment Check
Parse $ARGUMENTS for --remote flag:
- If `--remote` is present → dispatch to the remote server (see below), then STOP
- Otherwise → run locally, proceed to Mode Detection
Remote dispatch (only when --remote is specified)
ARIS_CONFIG=".aris/project.json"
if [ -f "$ARIS_CONFIG" ]; then
  # Project settings come from the repo; the API token comes from the user's home directory
  PROJECT_ID=$(python3 -c "import json;print(json.load(open('$ARIS_CONFIG'))['projectId'])")
  API_URL=$(python3 -c "import json;print(json.load(open('$ARIS_CONFIG'))['apiUrl'])")
  API_TOKEN=$(python3 -c "import json;print(json.load(open('$HOME/.claude/aris-api.json'))['token'])" 2>/dev/null)
  # Strip --remote from arguments before forwarding
  CLEAN_ARGS=$(echo "$ARGUMENTS" | sed 's/--remote//')
  # JSON-encode the prompt so quotes and newlines survive inside the request body
  PROMPT=$(printf '%s' "$CLEAN_ARGS" | python3 -c "import sys,json;print(json.dumps(sys.stdin.read()))")
  if [ -n "$PROJECT_ID" ] && [ -n "$API_TOKEN" ]; then
    RESULT=$(curl -s -X POST "$API_URL/api/aris/runs" \
      -H "Authorization: Bearer $API_TOKEN" \
      -H "Content-Type: application/json" \
      -d "{\"projectId\":\"$PROJECT_ID\",\"workflowType\":\"auto_review_loop\",\"prompt\":$PROMPT,\"title\":\"Auto Review Loop\"}")
    # Print the run ID, or the API's error message if no run was created
    RUN_ID=$(echo "$RESULT" | python3 -c "import sys,json;d=json.load(sys.stdin);print(d.get('run',{}).get('id','') or d.get('error','UNKNOWN'))" 2>/dev/null)
    echo "Dispatched to remote server. ARIS run ID: $RUN_ID"
  fi
fi
After dispatching, STOP. Tell the user the run ID and that they can monitor on the ARIS dashboard.
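The request body that the shell snippet assembles by hand can also be built in one place. This is a minimal sketch, assuming the same four fields as the curl call above; `build_dispatch_body` is a hypothetical helper name, not part of ARIS, and unlike the `sed` call it also trims surrounding whitespace.

```python
import json

def build_dispatch_body(arguments: str, project_id: str) -> str:
    """Return the JSON body for POST /api/aris/runs (fields copied from the curl call)."""
    # Drop the first --remote token, mirroring sed 's/--remote//'
    prompt = arguments.replace("--remote", "", 1).strip()
    return json.dumps({
        "projectId": project_id,
        "workflowType": "auto_review_loop",
        "prompt": prompt,
        "title": "Auto Review Loop",
    })
```

Letting `json.dumps` produce the whole body avoids escaping quotes manually inside a shell string.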
Mode Detection
Parse $ARGUMENTS for --plan <path>:
- If `--plan` is present → Plan-Driven Mode (Section A)
- Otherwise → Free-Form Review Mode (Section B)
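The mode decision above can be sketched as a small argument scan. This assumes `$ARGUMENTS` follows shell quoting rules; `detect_mode` is a hypothetical helper, and a trailing `--plan` with no path is treated as ordinary text rather than an error.

```python
import shlex

def detect_mode(arguments: str):
    """Return (mode, plan_path, remaining_tokens) for an $ARGUMENTS string."""
    tokens = shlex.split(arguments)
    plan_path = None
    rest = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "--plan" and i + 1 < len(tokens):
            plan_path = tokens[i + 1]  # the token after --plan is the plan file
            i += 2
        else:
            rest.append(tokens[i])
            i += 1
    mode = "plan" if plan_path else "free-form"
    return mode, plan_path, rest
```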
Section A: Plan-Driven Mode
Execute an implementation plan with dependency-aware task ordering. Each TODO item is implemented, then verified by the Codex MCP reviewer before marking complete.
A.1 Initialization
Read the plan file specified by `--plan <path>`. This is a markdown file with Steps and TODO items.
Parse plan structure:
- Steps are top-level groups (`### Step N: Title`)
- TODO items within steps (`#### TODO-N.M: Title`)
- Items marked with ✅ are already completed; skip them
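The heading conventions above are regular enough to parse with two patterns. A minimal sketch, assuming the ✅ marker sits on the TODO heading line itself (the plan format does not say where it appears); `parse_plan` and the returned field names are illustrative.

```python
import re

STEP_RE = re.compile(r"^###\s+Step\s+(\d+):\s*(.+)$")
TODO_RE = re.compile(r"^####\s+(TODO-\d+\.\d+):\s*(.+)$")

def parse_plan(markdown: str):
    """Return one dict per TODO heading: key, title, step number, done flag."""
    todos, step = [], None
    for line in markdown.splitlines():
        line = line.strip()
        m = STEP_RE.match(line)
        if m:
            step = int(m.group(1))  # remember which step the next TODOs belong to
            continue
        m = TODO_RE.match(line)
        if m:
            title = m.group(2)
            todos.append({
                "key": m.group(1),
                "title": title.replace("✅", "").strip(),
                "step": step,
                "done": "✅" in title,
            })
    return todos
```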
Load or create `PLAN_STATE.json` in project root:
{
  "planFile": "docs/implementation_plan.md",
  "status": "in_progress",
  "currentNode": "TODO-1.1",
  "completedNodes": ["TODO-1.0"],
  "failedNodes": [],
  "threadId": "019cd392-...",
  "timestamp": "2026-03-17T10:00:00"
}
- If it exists with `"in_progress"` and is within 24h → resume from `currentNode`
- Otherwise → fresh start
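The resume rule reduces to a status check plus a timestamp comparison. A minimal sketch, assuming timestamps are naive ISO-8601 strings as in the example state and that "within 24h" means strictly less than 24 hours; `should_resume` is a hypothetical name.

```python
from datetime import datetime, timedelta

def should_resume(state, now):
    """True if the saved state is in progress and less than 24h old."""
    if not state or state.get("status") != "in_progress":
        return False
    saved = datetime.fromisoformat(state["timestamp"])
    return now - saved < timedelta(hours=24)
```

The same check fits `REVIEW_STATE.json` in Free-Form mode, which uses the same 24h rule.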
Register plan with ARIS API (if `~/.claude/aris-api.json` exists):
ARIS_CFG="$HOME/.claude/aris-api.json"
if [ -f "$ARIS_CFG" ]; then
  API_URL=$(python3 -c "import json;print(json.load(open('$ARIS_CFG'))['api_url'])")
  API_TOKEN=$(python3 -c "import json;print(json.load(open('$ARIS_CFG'))['token'])")
  # Register run (PROJECT_ID and PLAN_FILE are placeholders; substitute the real values)
  RESULT=$(curl -s -X POST "$API_URL/api/aris/runs/register" \
    -H "Authorization: Bearer $API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"projectId":"PROJECT_ID","prompt":"Plan: PLAN_FILE","status":"running","workflowType":"auto_review_loop","title":"Plan Execution"}')
  ARIS_RUN_ID=$(echo "$RESULT" | python3 -c "import sys,json;print(json.load(sys.stdin).get('run',{}).get('id',''))")
  # Upload plan
  PLAN_MD=$(cat PLAN_FILE)
  curl -s -X POST "$API_URL/api/aris/runs/$ARIS_RUN_ID/plan" \
    -H "Authorization: Bearer $API_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"markdown\":$(python3 -c "import json,sys;print(json.dumps(sys.argv[1]))" "$PLAN_MD")}"
fi
Build execution order: Topological sort of the TODO DAG, respecting `dependsOn` relationships.
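One way to compute that order is the standard library's `graphlib` module (Python 3.9+). This is a sketch under the assumption that the `dependsOn` relationships have already been collected into a dict; how they are written in the plan file is not specified here, and `execution_order` is a hypothetical name.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def execution_order(depends_on):
    """depends_on maps each TODO key to the list of keys it depends on."""
    sorter = TopologicalSorter(depends_on)
    # static_order() yields dependencies before dependents
    # and raises graphlib.CycleError if the plan contains a cycle
    return list(sorter.static_order())
```

A cycle in the plan surfaces as an exception before any TODO is implemented, which is preferable to discovering it mid-run.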
A.2 Execution Loop
For each TODO item in topological order:
Phase 1: Check Prerequisites
- Verify all items in `dependsOn` are in `completedNodes`
- If any dependency failed → skip this item, mark as `skipped`
- If dependencies not yet complete → this shouldn't happen in topological order, but wait/skip if it does
- If item is already in `completedNodes` (from resume or ✅ marker) → skip
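The four checks above can be sketched as a single classification function. The return labels and the `skipped` argument are illustrative; treating a skipped dependency like a failed one (so skips propagate downstream) is an assumption that goes slightly beyond the rules as written.

```python
def check_prerequisites(key, depends_on, completed, failed, skipped=()):
    """Return 'skip_done', 'skip_dependency_failed', 'wait', or 'ready'."""
    if key in completed:
        return "skip_done"  # resumed run or ✅ marker
    deps = depends_on.get(key, [])
    if any(d in failed or d in skipped for d in deps):
        return "skip_dependency_failed"
    if not all(d in completed for d in deps):
        return "wait"  # should not occur when items arrive in topological order
    return "ready"
```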
Phase 2: Implement
Read the TODO item's description from the plan file. It contains:
- What needs to be built/implemented
- File locations
- Expected behavior
- Any code snippets or references
Implement the TODO item. This may involve:
- Writing new code files
- Modifying existing code
- Running commands (pip install, tests, etc.)
- Creating data adapters
- Running experiments — check GPU availability first (see "Multi-Server Experiment Routing" section) and dispatch to the server with free GPUs
Phase 3: Verify via Codex MCP
After implementing, send the work to the Codex MCP reviewer for verification:
mcp__codex__codex (or mcp__codex__codex-reply if threadId exists):
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    I am executing an implementation plan. I just completed this TODO item:
    **{TODO_KEY}: {TODO_TITLE}**
    Description from plan:
    {TODO_DESCRIPTION}
    What I implemented:
    {SUMMARY_OF_CHANGES}
    Files changed:
    {LIST_OF_FILES}
    Test results (if any):
    {TEST_OUTPUT}
    Please verify:
    1. Does the implementation match what the plan asked for? (Yes/No)
    2. Are there any bugs, missing pieces, or quality issues? (List them)
    3. Is this TODO item COMPLETE? (Yes/No/Partial)
    4. If Partial, what specific remaining work is needed?
    Be strict. Only mark complete if the implementation fully satisfies the plan.
Phase 4: Process Verification Result
Parse the Codex response:
- "Complete: Yes" → Mark as `completed`, add to `completedNodes`, update ARIS API
- "Complete: Partial" → Implement the remaining work, then re-verify (max 2 retries)
- "Complete: No" → Fix issues, then re-verify (max 2 retries)
- After 2 failed retries → mark as `failed`, add to `failedNodes`, continue to next item
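The retry policy above amounts to a bounded verify/fix loop. A minimal sketch: `verify` and `fix` are hypothetical callables standing in for the Codex call and the follow-up implementation work, and the regular expression is only one guess at how the reviewer might phrase its answer. An unparseable response is treated as "No".

```python
import re

def parse_verdict(response: str) -> str:
    """Extract Yes / Partial / No from text such as 'Complete: Partial'."""
    m = re.search(r"complete\??\W*(yes|partial|no)\b", response, re.IGNORECASE)
    return m.group(1).capitalize() if m else "No"

def verify_with_retries(verify, fix, max_retries=2):
    """verify() returns the reviewer's text; fix(verdict) does the follow-up work."""
    verdict = parse_verdict(verify())
    retries = 0
    while verdict != "Yes" and retries < max_retries:
        fix(verdict)          # implement remaining work or fix reported issues
        retries += 1
        verdict = parse_verdict(verify())
    return ("completed" if verdict == "Yes" else "failed"), retries
```

The returned retry count maps directly onto the **Retries** field of the per-TODO report.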
Phase 5: Update State
After each TODO item:
Update `PLAN_STATE.json`:
{
  "currentNode": "NEXT_TODO_KEY",
  "completedNodes": ["TODO-1.0", "TODO-1.1", ...],
  "timestamp": "NOW"
}
Update ARIS API plan node (if registered):
curl -s -X PATCH "$API_URL/api/aris/runs/$ARIS_RUN_ID/plan/TODO_KEY" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","resultSummary":"CODEX_VERDICT_SUMMARY"}'
Write per-TODO review report to `review/` folder:
Create `review/TODO-X.Y.md` (where X.Y is the TODO key, e.g. `review/TODO-1.1.md`):

    # Review: TODO-X.Y — {TODO_TITLE}

    **Status**: ✅ completed / ❌ failed / ⚠️ needs_redo
    **Timestamp**: {ISO_TIMESTAMP}
    **Retries**: {0 / 1 / 2}

    ## Plan Description
    {TODO_DESCRIPTION verbatim from plan file}

    ## What Was Implemented
    {SUMMARY_OF_CHANGES — what you actually built/modified}

    **Files changed:**
    - `path/to/file1.py` — description of change
    - `path/to/file2.py` — description of change

    **Test / command output:**
    {TEST_OUTPUT or "N/A"}

    ## Codex Review Response
    {FULL raw Codex response verbatim — do not summarize}

    ## Verdict
    - **Match plan?**: Yes / No
    - **Complete?**: Yes / Partial / No
    - **Issues found**: {list or "None"}
    - **Remaining work** (if Partial): {list or "N/A"}

Rules for this file:
- Always write this file immediately after receiving the Codex verdict, even on failure.
- If the item was retried, overwrite the same file each time (latest verdict wins).
- If the item was skipped due to dependency failure, write a minimal report:

    # Review: TODO-X.Y — {TODO_TITLE}
    **Status**: ⏭️ skipped
    **Reason**: Dependency {DEP_KEY} failed — skipping downstream item.

- This folder is the primary human-readable audit trail. Keep reports factual and complete.
Append to `AUTO_REVIEW.md`:

    ## TODO-X.Y: Title (timestamp)
    - Status: completed/failed
    - Reviewer verdict: [Codex response summary]
    - Files changed: [list]
    - Report: [review/TODO-X.Y.md](review/TODO-X.Y.md)
    - Notes: [any observations]
Phase 6: Parallel Execution
When multiple TODO items have canParallel: true and share the same dependencies (all satisfied):
- List all ready parallel items
- Implement them sequentially (Claude can only do one at a time)
- But batch-verify: send all implementations to Codex in a single review prompt
- This is more efficient than individual reviews for independent items
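Selecting which items can share one review prompt can be sketched as grouping by dependency set. The dict layout (`key`, `dependsOn`, `canParallel`) is an assumed in-memory representation of the plan, and `parallel_batches` is a hypothetical helper.

```python
from collections import defaultdict

def parallel_batches(items, completed):
    """Group ready canParallel items that share an identical dependency set."""
    done = set(completed)
    groups = defaultdict(list)
    for item in items:
        deps = frozenset(item.get("dependsOn", []))
        # Ready means: flagged parallel, not yet done, every dependency satisfied
        if item.get("canParallel") and item["key"] not in done and deps <= done:
            groups[deps].append(item["key"])
    # Only groups with two or more members benefit from a combined review prompt
    return [keys for keys in groups.values() if len(keys) > 1]
```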
A.3 Termination
When all TODO items are processed:
Set `PLAN_STATE.json` → `"status": "completed"`
Write final summary to `AUTO_REVIEW.md`:

    ## Plan Execution Summary

    | TODO | Title | Status | Reviewer Verdict | Report |
    |------|-------|--------|------------------|--------|
    | TODO-1.0 | Environment setup | ✅ completed | Pre-verified | [report](review/TODO-1.0.md) |
    | TODO-1.1 | Run SimpleMem | ✅ completed | Matches plan | [report](review/TODO-1.1.md) |
    | TODO-2.1 | Implement C1 | ✅ completed | Code correct | [report](review/TODO-2.1.md) |
    | TODO-4.3 | Implement C6 | ❌ failed | Missing graph builder | [report](review/TODO-4.3.md) |
    | ... | ... | ... | ... | ... |

    Completed: X/Y items
    Failed: Z items (listed above with reasons)
    Individual review reports: [`review/`](review/)

Update ARIS API run status → `completed` or `failed`
Upload review reports to ARIS API so client device can retrieve them:
ARIS_CFG="$HOME/.claude/aris-api.json"
if [ -f "$ARIS_CFG" ] && [ -n "$ARIS_RUN_ID" ] && [ -d "review" ]; then
  python3 -c "
import os, sys, json, base64, urllib.request
cfg = json.load(open(os.path.expanduser('~/.claude/aris-api.json')))
api_url, token, run_id = cfg['api_url'], cfg['token'], sys.argv[1]
files = {}
for f in sorted(os.listdir('review')):
    if f.endswith('.md'):
        with open(os.path.join('review', f), 'rb') as fp:
            files[f] = base64.b64encode(fp.read()).decode()
if not files:
    sys.exit(0)
payload = json.dumps(files).encode()
req = urllib.request.Request(
    api_url + '/api/aris/runs/' + run_id + '/review-reports',
    data=payload,
    headers={'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token},
    method='POST'
)
urllib.request.urlopen(req, timeout=30)
print('[ARIS] Uploaded ' + str(len(files)) + ' review report(s)')
" "$ARIS_RUN_ID" 2>/dev/null || true
fi
Feishu notification (if configured): Send `pipeline_done` with completion stats
Section B: Free-Form Review Mode
(Original behavior when no --plan is specified)
Autonomously iterate: review → implement fixes → re-review, until the external reviewer gives a positive assessment or MAX_ROUNDS is reached.
Constants
- MAX_ROUNDS = 4
- POSITIVE_THRESHOLD: score >= 6/10, or verdict contains "accept", "sufficient", "ready for submission"
- REVIEW_DOC: `AUTO_REVIEW.md` in project root (cumulative log)
- REVIEWER_MODEL = `gpt-5.4` (model used via Codex MCP)
- HUMAN_CHECKPOINT = false. When `true`, pause after each round's review and present the score + weaknesses to the user.
Override:
/auto-review-loop "topic" — human checkpoint: true
State Persistence (Compact Recovery)
Persist state to REVIEW_STATE.json after each round:
{
  "round": 2,
  "threadId": "019cd392-...",
  "status": "in_progress",
  "last_score": 5.0,
  "last_verdict": "not ready",
  "pending_experiments": ["screen_name_1"],
  "timestamp": "2026-03-13T21:00:00"
}
Workflow
Initialization
- Check for `REVIEW_STATE.json`:
  - Does not exist or `"completed"` → fresh start
  - `"in_progress"` older than 24h → fresh start
  - `"in_progress"` within 24h → resume from `round + 1`
- Read project docs, memory files, prior reviews
- Read recent experiment results
- Initialize round counter
Loop (repeat up to MAX_ROUNDS)
Phase A: Review via Codex MCP
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N/MAX_ROUNDS]
    [Full research context: claims, methods, results, known weaknesses]
    [Changes since last round]
    Please act as a senior ML reviewer (NeurIPS/ICML level).
    1. Score this work 1-10 for a top venue
    2. List remaining critical weaknesses (ranked by severity)
    3. For each weakness, specify the MINIMUM fix
    4. State clearly: is this READY for submission? Yes/No/Almost
Round 2+: use mcp__codex__codex-reply with saved threadId.
Phase B: Parse Assessment
Save FULL raw response verbatim. Extract: Score, Verdict, Action items.
STOP CONDITION: score >= 6 AND verdict "ready"/"almost" → stop, document.
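The stop condition can be sketched as a score parser plus a predicate. Both patterns in `parse_score` are guesses at how the reviewer might format its score, and the function names are hypothetical. The predicate follows the AND rule stated in this phase, not the looser "or" wording under Constants.

```python
import re

def parse_score(response: str):
    """Return the first 'N/10' or 'score: N' value found, or None."""
    m = (re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", response)
         or re.search(r"score\W*(\d+(?:\.\d+)?)", response, re.IGNORECASE))
    return float(m.group(1)) if m else None

def should_stop(score, verdict, threshold=6.0):
    """Stop when the score meets the threshold AND the verdict is ready/almost."""
    v = verdict.lower()
    if "not ready" in v:
        return False  # 'not ready' contains 'ready', so rule it out first
    return score is not None and score >= threshold and ("ready" in v or "almost" in v)
```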
Human Checkpoint (if enabled)
Present score + weaknesses, wait for user input.
Feishu Notification (if configured)
Send review_scored notification if ~/.claude/feishu.json exists.
Phase C: Implement Fixes
For each action item (highest priority first): code changes, experiments, analysis, documentation.
When running experiments, use multi-server routing: check /gpu-status or .aris/project.json servers and dispatch to whichever has free GPUs. Run independent experiments on different servers in parallel.
Phase D: Wait for Results
Monitor experiments, collect results.
Phase E: Document Round
Append to AUTO_REVIEW.md. Write REVIEW_STATE.json.
Termination
- Update state → `"completed"`
- Write final summary with score progression table
- Feishu notification if configured
Multi-Server Experiment Routing (Both Modes)
When running experiments that require GPUs, use ALL available servers — not just one.
Before launching experiments:
Check GPU availability across all project servers:
# Read servers from .aris/project.json
python3 -c "
import json
cfg = json.load(open('.aris/project.json'))
for s in cfg['servers']:
    print(f\"{s['name']}: {s['ssh']}\")
"
Run /gpu-status (or manually SSH to each server with `nvidia-smi`) to find free GPUs.
Route experiments to servers with free GPUs:
- Pick the server(s) with the most free GPUs
- For multi-GPU jobs, prefer servers with contiguous free GPUs
- If all GPUs on one server are busy, try the next server
- Run independent experiments on different servers in parallel when possible
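The routing preferences above can be sketched as a ranking over servers. This assumes the free GPU indices per server have already been collected (for example from `nvidia-smi` output) and interprets "contiguous" as consecutive GPU indices; `pick_server` and `longest_free_run` are hypothetical helpers.

```python
def longest_free_run(free_ids):
    """Length of the longest run of consecutive GPU indices."""
    best = run = 0
    prev = None
    for i in sorted(free_ids):
        run = run + 1 if prev is not None and i == prev + 1 else 1
        best = max(best, run)
        prev = i
    return best

def pick_server(free_gpus, needed=1):
    """free_gpus maps server name -> list of free GPU indices. Returns a name or None."""
    candidates = [(name, ids) for name, ids in free_gpus.items() if len(ids) >= needed]
    if not candidates:
        return None  # every server is too busy; the caller should retry later
    # Prefer a contiguous block large enough for the job, then the most free GPUs
    return max(candidates, key=lambda c: (longest_free_run(c[1]) >= needed, len(c[1])))[0]
```

For independent experiments, calling `pick_server` again after removing the GPUs just assigned spreads jobs across servers.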
Dispatching to a remote server:
# SSH to a specific server and run an experiment
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no <SSH_COMMAND_FROM_PROJECT_JSON> \
"cd <remotePath> && CUDA_VISIBLE_DEVICES=<free_gpu_ids> <command>"
Key principles:
- Never hardcode a single server — always check availability first
- Parallel dispatch: if two experiments are independent, run them on different servers simultaneously
- Retry on a different server if one server's GPUs fill up during the run
- The prompt may include "Available experiment servers" — use ALL of them, not just the first
Key Rules (Both Modes)
- Large file handling: If Write tool fails due to file size, use Bash (`cat << 'EOF' > file`) silently.
- ALWAYS use `config: {"model_reasoning_effort": "xhigh"}` for Codex MCP calls
- Save threadId from first Codex call, use `mcp__codex__codex-reply` for subsequent calls
- Be honest: include negative results and failed experiments
- Do NOT hide weaknesses to game scores
- Implement BEFORE re-reviewing/re-verifying
- Document EVERYTHING in `AUTO_REVIEW.md`
- Plan mode: The Codex reviewer is the authority on whether a TODO is complete. Do not self-assess.
- Plan mode: Respect dependency ordering. Never implement a TODO whose dependencies haven't been verified complete.
Prompt Template for Round 2+ (Free-Form Mode)
mcp__codex__codex-reply:
  threadId: [saved from round 1]
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N update]
    Since your last review, we have:
    1. [Action 1]: [result]
    2. [Action 2]: [result]
    Updated results table: [paste metrics]
    Please re-score and re-assess.
    Same format: Score, Verdict, Remaining Weaknesses, Minimum Fixes.