auto-review-loop

name: auto-review-loop description: "Autonomous multi-round research review loop. Supports two modes: (1) Plan-driven: takes an implementation plan file, executes TODO items respecting dependency DAG, uses Codex MCP to verify completion of each item. (2) Free-form: iterates review → fix → re-review until positive assessment. Use when user says 'auto review loop', 'review until it passes', or wants iterative improvement." argument-hint: "[topic-or-scope] [--plan path/to/plan.md]" allowed-tools: Bash(*), Read, Grep, Glob, Write, Edit, Agent, Skill, mcpcodexcodex, mcpcodexcodex-reply

Auto Review Loop: Autonomous Research Improvement

Context: $ARGUMENTS

Step 0: Environment Check

Parse $ARGUMENTS for --remote flag:

If --remote is present → dispatch to remote server (see below), then STOP
Otherwise → run locally, proceed to Mode Detection

Remote dispatch (only when `--remote` is specified)

ARIS_CONFIG=".aris/project.json"
if [ -f "$ARIS_CONFIG" ]; then
  PROJECT_ID=$(python3 -c "import json;print(json.load(open('$ARIS_CONFIG'))['projectId'])")
  API_URL=$(python3 -c "import json;print(json.load(open('$ARIS_CONFIG'))['apiUrl'])")
  API_TOKEN=$(python3 -c "import json;print(json.load(open('$HOME/.claude/aris-api.json'))['token'])" 2>/dev/null)
  # Strip --remote from arguments before forwarding
  CLEAN_ARGS=$(echo "$ARGUMENTS" | sed 's/--remote//')
  PROMPT=$(printf '%s' "$CLEAN_ARGS" | python3 -c "import sys,json;print(json.dumps(sys.stdin.read()))")
  if [ -n "$PROJECT_ID" ] && [ -n "$API_TOKEN" ]; then
    RESULT=$(curl -s -X POST "$API_URL/api/aris/runs" \
      -H "Authorization: Bearer $API_TOKEN" \
      -H "Content-Type: application/json" \
      -d "{\"projectId\":\"$PROJECT_ID\",\"workflowType\":\"auto_review_loop\",\"prompt\":$PROMPT,\"title\":\"Auto Review Loop\"}")
    RUN_ID=$(echo "$RESULT" | python3 -c "import sys,json;d=json.load(sys.stdin);print(d.get('run',{}).get('id','') or d.get('error','UNKNOWN'))" 2>/dev/null)
    echo "Dispatched to remote server. ARIS run ID: $RUN_ID"
  fi
fi

After dispatching, STOP. Tell the user the run ID and that they can monitor on the ARIS dashboard.

Mode Detection

Parse $ARGUMENTS for --plan <path>:

If --plan is present → Plan-Driven Mode (Section A)
Otherwise → Free-Form Review Mode (Section B)

Section A: Plan-Driven Mode

Execute an implementation plan with dependency-aware task ordering. Each TODO item is implemented, then verified by the Codex MCP reviewer before marking complete.

A.1 Initialization

Read the plan file specified by --plan <path>. This is a markdown file with Steps and TODO items.
Parse plan structure:
- Steps are top-level groups (### Step N: Title)
- TODO items within steps (#### TODO-N.M: Title)
- Items marked with ✅ are already completed — skip them

Load or create PLAN_STATE.json in project root:

{
  "planFile": "docs/implementation_plan.md",
  "status": "in_progress",
  "currentNode": "TODO-1.1",
  "completedNodes": ["TODO-1.0"],
  "failedNodes": [],
  "threadId": "019cd392-...",
  "timestamp": "2026-03-17T10:00:00"
}

If exists with "in_progress" and within 24h → resume from currentNode
Otherwise → fresh start

Register plan with ARIS API (if ~/.claude/aris-api.json exists):

ARIS_CFG="$HOME/.claude/aris-api.json"
if [ -f "$ARIS_CFG" ]; then
  API_URL=$(python3 -c "import json;print(json.load(open('$ARIS_CFG'))['api_url'])")
  API_TOKEN=$(python3 -c "import json;print(json.load(open('$ARIS_CFG'))['token'])")
  # Register run
  RESULT=$(curl -s -X POST "$API_URL/api/aris/runs/register" \
    -H "Authorization: Bearer $API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"projectId":"PROJECT_ID","prompt":"Plan: PLAN_FILE","status":"running","workflowType":"auto_review_loop","title":"Plan Execution"}')
  ARIS_RUN_ID=$(echo "$RESULT" | python3 -c "import sys,json;print(json.load(sys.stdin).get('run',{}).get('id',''))")
  # Upload plan
  PLAN_MD=$(cat PLAN_FILE)
  curl -s -X POST "$API_URL/api/aris/runs/$ARIS_RUN_ID/plan" \
    -H "Authorization: Bearer $API_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"markdown\":$(python3 -c "import json,sys;print(json.dumps(sys.argv[1]))" "$PLAN_MD")}"
fi

Build execution order: Topological sort of the TODO DAG, respecting dependsOn relationships.

A.2 Execution Loop

For each TODO item in topological order:

Phase 1: Check Prerequisites

Verify all items in dependsOn are in completedNodes
If any dependency failed → skip this item, mark as skipped
If dependencies not yet complete → this shouldn't happen in topological order, but wait/skip if it does
If item is already in completedNodes (from resume or ✅ marker) → skip

Phase 2: Implement

Read the TODO item's description from the plan file. It contains:

What needs to be built/implemented
File locations
Expected behavior
Any code snippets or references

Implement the TODO item. This may involve:

Writing new code files
Modifying existing code
Running commands (pip install, tests, etc.)
Creating data adapters
Running experiments — check GPU availability first (see "Multi-Server Experiment Routing" section) and dispatch to the server with free GPUs

Phase 3: Verify via Codex MCP

After implementing, send the work to the Codex MCP reviewer for verification:

mcp__codex__codex (or mcp__codex__codex-reply if threadId exists):
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    I am executing an implementation plan. I just completed this TODO item:

    **{TODO_KEY}: {TODO_TITLE}**

    Description from plan:
    {TODO_DESCRIPTION}

    What I implemented:
    {SUMMARY_OF_CHANGES}

    Files changed:
    {LIST_OF_FILES}

    Test results (if any):
    {TEST_OUTPUT}

    Please verify:
    1. Does the implementation match what the plan asked for? (Yes/No)
    2. Are there any bugs, missing pieces, or quality issues? (List them)
    3. Is this TODO item COMPLETE? (Yes/No/Partial)
    4. If Partial, what specific remaining work is needed?

    Be strict. Only mark complete if the implementation fully satisfies the plan.

Phase 4: Process Verification Result

Parse the Codex response:

"Complete: Yes" → Mark as completed, add to completedNodes, update ARIS API
"Complete: Partial" → Implement the remaining work, then re-verify (max 2 retries)
"Complete: No" → Fix issues, then re-verify (max 2 retries)
After 2 failed retries → mark as failed, add to failedNodes, continue to next item

Phase 5: Update State

After each TODO item:

Update PLAN_STATE.json:

{
  "currentNode": "NEXT_TODO_KEY",
  "completedNodes": ["TODO-1.0", "TODO-1.1", ...],
  "timestamp": "NOW"
}

Update ARIS API plan node (if registered):

curl -s -X PATCH "$API_URL/api/aris/runs/$ARIS_RUN_ID/plan/TODO_KEY" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","resultSummary":"CODEX_VERDICT_SUMMARY"}'

Write per-TODO review report to review/ folder:

Create review/TODO-X.Y.md (where X.Y is the TODO key, e.g. review/TODO-1.1.md):

# Review: TODO-X.Y — {TODO_TITLE}

**Status**: ✅ completed / ❌ failed / ⚠️ needs_redo
**Timestamp**: {ISO_TIMESTAMP}
**Retries**: {0 / 1 / 2}

## Plan Description

{TODO_DESCRIPTION verbatim from plan file}

## What Was Implemented

{SUMMARY_OF_CHANGES — what you actually built/modified}

**Files changed:**
- `path/to/file1.py` — description of change
- `path/to/file2.py` — description of change

**Test / command output:**

{TEST_OUTPUT or "N/A"}


## Codex Review Response

{FULL raw Codex response verbatim — do not summarize}

## Verdict

- **Match plan?**: Yes / No
- **Complete?**: Yes / Partial / No
- **Issues found**: {list or "None"}
- **Remaining work** (if Partial): {list or "N/A"}

Rules for this file:

Always write this file immediately after receiving the Codex verdict, even on failure.
If the item was retried, overwrite the same file each time (latest verdict wins).

If the item was skipped due to dependency failure, write a minimal report:

# Review: TODO-X.Y — {TODO_TITLE}

**Status**: ⏭️ skipped
**Reason**: Dependency {DEP_KEY} failed — skipping downstream item.

This folder is the primary human-readable audit trail. Keep reports factual and complete.

Append to AUTO_REVIEW.md:

## TODO-X.Y: Title (timestamp)
- Status: completed/failed
- Reviewer verdict: [Codex response summary]
- Files changed: [list]
- Report: [review/TODO-X.Y.md](review/TODO-X.Y.md)
- Notes: [any observations]

Phase 6: Parallel Execution

When multiple TODO items have canParallel: true and share the same dependencies (all satisfied):

List all ready parallel items
Implement them sequentially (Claude can only do one at a time)
But batch-verify: send all implementations to Codex in a single review prompt
This is more efficient than individual reviews for independent items

A.3 Termination

When all TODO items are processed:

Set PLAN_STATE.json → "status": "completed"

Write final summary to AUTO_REVIEW.md:

## Plan Execution Summary

| TODO | Title | Status | Reviewer Verdict | Report |
|------|-------|--------|------------------|--------|
| TODO-1.0 | Environment setup | ✅ completed | Pre-verified | [report](review/TODO-1.0.md) |
| TODO-1.1 | Run SimpleMem | ✅ completed | Matches plan | [report](review/TODO-1.1.md) |
| TODO-2.1 | Implement C1 | ✅ completed | Code correct | [report](review/TODO-2.1.md) |
| TODO-4.3 | Implement C6 | ❌ failed | Missing graph builder | [report](review/TODO-4.3.md) |
| ... | ... | ... | ... | ... |

Completed: X/Y items
Failed: Z items (listed above with reasons)

Individual review reports: [`review/`](review/)

Update ARIS API run status → completed or failed

Upload review reports to ARIS API so client device can retrieve them:

ARIS_CFG="$HOME/.claude/aris-api.json"
if [ -f "$ARIS_CFG" ] && [ -n "$ARIS_RUN_ID" ] && [ -d "review" ]; then
  python3 -c "
import os, sys, json, base64, urllib.request
cfg = json.load(open(os.path.expanduser('~/.claude/aris-api.json')))
api_url, token, run_id = cfg['api_url'], cfg['token'], sys.argv[1]
files = {}
for f in sorted(os.listdir('review')):
    if f.endswith('.md'):
        with open(os.path.join('review', f), 'rb') as fp:
            files[f] = base64.b64encode(fp.read()).decode()
if not files:
    sys.exit(0)
payload = json.dumps(files).encode()
req = urllib.request.Request(
    api_url + '/api/aris/runs/' + run_id + '/review-reports',
    data=payload,
    headers={'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token},
    method='POST'
)
urllib.request.urlopen(req, timeout=30)
print('[ARIS] Uploaded ' + str(len(files)) + ' review report(s)')
" "$ARIS_RUN_ID" 2>/dev/null || true
fi

Feishu notification (if configured): Send pipeline_done with completion stats

Section B: Free-Form Review Mode

(Original behavior when no --plan is specified)

Autonomously iterate: review → implement fixes → re-review, until the external reviewer gives a positive assessment or MAX_ROUNDS is reached.

Constants

MAX_ROUNDS = 4
POSITIVE_THRESHOLD: score >= 6/10, or verdict contains "accept", "sufficient", "ready for submission"
REVIEW_DOC: AUTO_REVIEW.md in project root (cumulative log)
REVIEWER_MODEL = gpt-5.4 — Model used via Codex MCP
HUMAN_CHECKPOINT = false — When true, pause after each round's review and present the score + weaknesses to the user.

Override: /auto-review-loop "topic" — human checkpoint: true

State Persistence (Compact Recovery)

Persist state to REVIEW_STATE.json after each round:

{
  "round": 2,
  "threadId": "019cd392-...",
  "status": "in_progress",
  "last_score": 5.0,
  "last_verdict": "not ready",
  "pending_experiments": ["screen_name_1"],
  "timestamp": "2026-03-13T21:00:00"
}

Workflow

Initialization

Check for REVIEW_STATE.json:
- Not exist or "completed" → fresh start
- "in_progress" older than 24h → fresh start
- "in_progress" within 24h → resume from round + 1
Read project docs, memory files, prior reviews
Read recent experiment results
Initialize round counter

Loop (repeat up to MAX_ROUNDS)

Phase A: Review via Codex MCP

mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N/MAX_ROUNDS]
    [Full research context: claims, methods, results, known weaknesses]
    [Changes since last round]

    Please act as a senior ML reviewer (NeurIPS/ICML level).
    1. Score this work 1-10 for a top venue
    2. List remaining critical weaknesses (ranked by severity)
    3. For each weakness, specify the MINIMUM fix
    4. State clearly: is this READY for submission? Yes/No/Almost

Round 2+: use mcp__codex__codex-reply with saved threadId.

Phase B: Parse Assessment

Save FULL raw response verbatim. Extract: Score, Verdict, Action items.

STOP CONDITION: score >= 6 AND verdict "ready"/"almost" → stop, document.

Human Checkpoint (if enabled)

Present score + weaknesses, wait for user input.

Feishu Notification (if configured)

Send review_scored notification if ~/.claude/feishu.json exists.

Phase C: Implement Fixes

For each action item (highest priority first): code changes, experiments, analysis, documentation. When running experiments, use multi-server routing: check /gpu-status or .aris/project.json servers and dispatch to whichever has free GPUs. Run independent experiments on different servers in parallel.

Phase D: Wait for Results

Monitor experiments, collect results.

Phase E: Document Round

Append to AUTO_REVIEW.md. Write REVIEW_STATE.json.

Termination

Update state → "completed"
Write final summary with score progression table
Feishu notification if configured

Multi-Server Experiment Routing (Both Modes)

When running experiments that require GPUs, use ALL available servers — not just one.

Before launching experiments:

Check GPU availability across all project servers:

# Read servers from .aris/project.json
python3 -c "
import json
cfg = json.load(open('.aris/project.json'))
for s in cfg['servers']:
    print(f\"{s['name']}: {s['ssh']}\")
"

Run /gpu-status (or manually SSH to each server with nvidia-smi) to find free GPUs.
Route experiments to servers with free GPUs:
- Pick the server(s) with the most free GPUs
- For multi-GPU jobs, prefer servers with contiguous free GPUs
- If all GPUs on one server are busy, try the next server
- Run independent experiments on different servers in parallel when possible

Dispatching to a remote server:

# SSH to a specific server and run an experiment
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no <SSH_COMMAND_FROM_PROJECT_JSON> \
  "cd <remotePath> && CUDA_VISIBLE_DEVICES=<free_gpu_ids> <command>"

Key principles:

Never hardcode a single server — always check availability first
Parallel dispatch: if two experiments are independent, run them on different servers simultaneously
Retry on a different server if one server's GPUs fill up during the run
The prompt may include "Available experiment servers" — use ALL of them, not just the first

Key Rules (Both Modes)

Large file handling: If Write tool fails due to file size, use Bash (cat << 'EOF' > file) silently.
ALWAYS use config: {"model_reasoning_effort": "xhigh"} for Codex MCP calls
Save threadId from first Codex call, use mcp__codex__codex-reply for subsequent calls
Be honest — include negative results and failed experiments
Do NOT hide weaknesses to game scores
Implement BEFORE re-reviewing/re-verifying
Document EVERYTHING in AUTO_REVIEW.md
Plan mode: The Codex reviewer is the authority on whether a TODO is complete. Do not self-assess.
Plan mode: Respect dependency ordering. Never implement a TODO whose dependencies haven't been verified complete.

Prompt Template for Round 2+ (Free-Form Mode)

mcp__codex__codex-reply:
  threadId: [saved from round 1]
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N update]
    Since your last review, we have:
    1. [Action 1]: [result]
    2. [Action 2]: [result]

    Updated results table: [paste metrics]

    Please re-score and re-assess.
    Same format: Score, Verdict, Remaining Weaknesses, Minimum Fixes.