name: factorial-monitor
version: 1.0.0
description: >
  Multi-job factorial experiment monitoring, aggregated diagnosis, and
  selective re-launch for SkyPilot jobs. Use when running factorial
  experiments (model x loss x calibration x fold) via SkyPilot and you need
  to track all jobs, aggregate failures by root cause, and re-launch only
  failed conditions. Do NOT use for: single-job monitoring (use ralph-loop),
  local test failures (use self-learning-iterative-coder), or sequential plan
  execution (use overnight-runner).
last_updated: 2026-03-19
activation: manual
invocation: /factorial-monitor
metadata:
  category: operations
  tags: [skypilot, monitoring, factorial, cloud, batch, gcp, experiment]
  relations:
    compose_with:
      - ralph-loop
      - self-learning-iterative-coder
      - issue-creator
    depend_on:
      - ralph-loop
    similar_to: []
    belong_to:
      - overnight-runner
Factorial Monitor
Multi-job factorial experiment monitoring with aggregated diagnosis, batch error resolution, and selective re-launch. This skill is the outer loop that orchestrates ralph-loop (per-job diagnosis) and self-learning-iterative-coder (batch code fixes).
Non-Negotiable Rules
Five rules prevent ALL known anti-patterns. Read instructions/rules.md for full details with rationale. Summary:
| Rule | Name | Prevents |
|---|---|---|
| F1 | WAIT-FOR-TERMINAL | Panic fixing, premature whac-a-mole |
| F2 | AGGREGATE-BEFORE-FIX | Silent dismissal, serial fixing |
| F3 | REBUILD-BEFORE-RELAUNCH | Docker image staleness |
| F4 | MAX-TWO-CYCLES | Infinite fix-relaunch loops, cost overrun |
| F5 | FACTORIAL-MANIFEST | Partial factorial amnesia |
Kill-switch exception (Rule F1): If 3+ jobs fail with IDENTICAL error within 5 min AND remaining running jobs haven't passed the failure point → cancel same-config jobs, begin batch diagnosis. Different-config jobs continue.
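The kill-switch condition can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the `Failure` record, the `error_signature` field (a normalized error string), and the function name `tripped_errors` are all hypothetical. It only covers the "3+ identical errors within 5 min" half of the rule; checking whether the remaining running jobs have already passed the failure point needs per-job progress data and is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """Hypothetical record of one terminal failure."""
    job_id: int
    error_signature: str  # assumed: error text normalized so identical errors compare equal
    failed_at: float      # seconds since epoch


def tripped_errors(failures, min_jobs=3, window_s=300.0):
    """Return {error_signature: [job_ids]} for errors that hit at least
    `min_jobs` jobs within any `window_s`-second window."""
    by_error = defaultdict(list)
    for failure in failures:
        by_error[failure.error_signature].append(failure)

    tripped = {}
    for signature, group in by_error.items():
        group.sort(key=lambda f: f.failed_at)
        # Slide a window of `min_jobs` consecutive failures over the sorted group.
        for i in range(len(group) - min_jobs + 1):
            window = group[i : i + min_jobs]
            if window[-1].failed_at - window[0].failed_at <= window_s:
                tripped[signature] = [f.job_id for f in group]
                break
    return tripped
```

An error that appears three times but spread over more than five minutes does not trip the switch, so slow, unrelated failures still fall under the normal wait-for-terminal rule.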
Workflow (6 Phases)
Phase 1: LAUNCH → protocols/launch.md
Phase 2: MONITOR → protocols/monitor.md (polling loop, READ-ONLY)
Phase 3: DIAGNOSE → protocols/diagnose.md (batch aggregation)
Phase 4: FIX → protocols/fix.md (reviewer-backed batch fix)
Phase 5: RELAUNCH → protocols/relaunch.md (selective, max 2 cycles)
Phase 6: REPORT → protocols/report.md (final summary + issues)
Phase 1: LAUNCH
- Execute `run_factorial.sh <config.yaml>` or confirm already launched
- Create `factorial_manifest.json` mapping job_id → condition
- Verify SkyPilot YAML uses `image_id: docker:...` (Rule #17: no bare VM)
- Record experiment_id, factors, levels, expected job count
- See: protocols/launch.md
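As a rough sketch of what the launch phase records, the snippet below expands factors and levels into the full set of conditions and derives the expected job count. The field names (`jobs`, `condition`, `relaunch_batch`, and so on) and the function `build_manifest` are assumptions made for illustration; the authoritative schema is the one in templates/factorial-manifest.json.

```python
import itertools
import json


def build_manifest(experiment_id, factors):
    """Expand {factor: [levels]} into one manifest entry per condition.

    Field names here are hypothetical, not the project's real schema.
    """
    names = list(factors)
    conditions = [
        dict(zip(names, combo))
        for combo in itertools.product(*(factors[name] for name in names))
    ]
    return {
        "experiment_id": experiment_id,
        "factors": names,
        "levels": {name: list(factors[name]) for name in names},
        "expected_job_count": len(conditions),
        "jobs": [
            # job_id is filled in once SkyPilot assigns one at launch time.
            {"condition": c, "job_id": None, "status": "PENDING", "relaunch_batch": 0}
            for c in conditions
        ],
    }


manifest = build_manifest(
    "exp-001",
    {"model": ["model_a", "model_b"], "loss": ["loss_x", "loss_y", "loss_z"]},
)
manifest_json = json.dumps(manifest, indent=2)
```

Listing every condition up front, including those not yet launched, is what lets later phases notice a condition that silently never ran (Rule F5).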
Phase 2: MONITOR (polling loop)
- Poll `sky jobs queue` every 60s
- Print live status table: `| condition | job_id | status | duration |`
- For each newly-terminal failure: call `ralph_monitor.analyze_logs()`
- READ-ONLY: no code changes, no SSH, no `sky exec` while jobs run
- Continue until ALL jobs reach terminal state (Rule F1)
- See: protocols/monitor.md
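The polling loop above might look something like this sketch. How statuses are obtained is deliberately left to a caller-supplied `fetch_statuses` callable (hypothetical), since parsing `sky jobs queue` output depends on the installed SkyPilot version. The status names are my understanding of SkyPilot's managed-job states and should be checked against your version; `on_failure` stands in for the per-job `ralph_monitor.analyze_logs()` call, whose signature is not shown in this document.

```python
import time

# Assumed terminal states for SkyPilot managed jobs; verify against your version.
TERMINAL_STATUSES = frozenset({
    "SUCCEEDED", "CANCELLED", "FAILED", "FAILED_SETUP",
    "FAILED_PRECHECKS", "FAILED_NO_RESOURCE", "FAILED_CONTROLLER",
})


def poll_until_terminal(fetch_statuses, on_failure, interval_s=60.0, sleep=time.sleep):
    """Poll until every job is terminal; report each failure exactly once.

    fetch_statuses() -> {job_id: status}   (hypothetical callable)
    on_failure(job_id, status)             (e.g. a wrapper around log analysis)
    """
    reported = set()
    while True:
        statuses = fetch_statuses()
        for job_id, status in statuses.items():
            newly_failed = status.startswith("FAILED") and job_id not in reported
            if newly_failed:
                reported.add(job_id)
                on_failure(job_id, status)  # diagnosis only: the loop stays read-only
        if all(status in TERMINAL_STATUSES for status in statuses.values()):
            return statuses
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real 60-second intervals.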
Phase 3: DIAGNOSE (all jobs terminal)
- Group failures by root cause category using ralph_monitor categories
- Present ONE aggregated report (Rule F2)
- Format: `{root_cause → [job_ids], fix_strategy, affected_files, confidence}`
- If 0 failures → skip to REPORT
- See: protocols/diagnose.md
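One way to picture the aggregation step is the sketch below, which folds per-job diagnoses into one entry per root cause. The input record shape and the function name `aggregate_failures` are assumptions for illustration; the real per-job output of ralph_monitor may differ. Taking the minimum confidence across jobs is also just one conservative choice.

```python
def aggregate_failures(diagnoses):
    """Group per-job diagnoses by root cause into a single report.

    Each diagnosis is assumed to be a dict with the keys:
    job_id, root_cause, fix_strategy, affected_files, confidence.
    """
    grouped = {}
    for d in diagnoses:
        entry = grouped.setdefault(d["root_cause"], {
            "root_cause": d["root_cause"],
            "job_ids": [],
            "fix_strategy": d["fix_strategy"],  # first strategy seen for this cause
            "affected_files": set(),
            "confidence": 1.0,
        })
        entry["job_ids"].append(d["job_id"])
        entry["affected_files"].update(d["affected_files"])
        # Conservative: a group is only as certain as its least certain job.
        entry["confidence"] = min(entry["confidence"], d["confidence"])

    # Largest groups first, so the widest-impact root cause leads the report.
    report = sorted(grouped.values(), key=lambda e: len(e["job_ids"]), reverse=True)
    for entry in report:
        entry["affected_files"] = sorted(entry["affected_files"])
    return report
```

An empty input yields an empty report, which corresponds to the "0 failures → skip to REPORT" branch.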
Phase 4: FIX (with reviewer agents)
- For EACH root cause: plan batch fix strategy with reviewer agents
- If code fix needed → compose_with: self-learning-iterative-coder
- If config fix → edit YAML/env directly
- Execute Rule F3: `make test-staging` → Docker rebuild → push → verify digest
- Commit all fixes in ONE batch commit
- See: protocols/fix.md
Phase 5: RELAUNCH (max 2 cycles)
- Generate filtered re-launch command (ONLY failed conditions)
- Execute and return to MONITOR phase
- Update manifest with `relaunch_batch` number
- Rule F4: hard stop after 2 cycles → escalate to user
- See: protocols/relaunch.md
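The selection and cycle cap could be sketched as below. The manifest entry shape (`condition`, `status`, `relaunch_batch`), the exception `RelaunchLimitReached`, and the function `plan_relaunch` are hypothetical names used only for this example. How the selected conditions are passed to the launcher is not shown, because the filtering options of `run_factorial.sh` are not documented here.

```python
MAX_CYCLES = 2  # Rule F4


class RelaunchLimitReached(RuntimeError):
    """Raised when the cycle cap is hit and the user must decide what to do."""


def plan_relaunch(jobs, completed_cycles):
    """Return the conditions to re-launch and tag them with the new batch number.

    `jobs` is a list of manifest entries (assumed keys: condition, status,
    relaunch_batch). Entries are updated in place.
    """
    if completed_cycles >= MAX_CYCLES:
        raise RelaunchLimitReached(
            f"{completed_cycles} re-launch cycles already used; escalate to user"
        )
    batch = completed_cycles + 1
    to_relaunch = []
    for job in jobs:
        if job["status"] != "SUCCEEDED":  # only failed conditions are re-run
            job["relaunch_batch"] = batch
            to_relaunch.append(job["condition"])
    return to_relaunch
```

Raising an exception at the cap, instead of returning an empty list, makes it harder for a caller to confuse "nothing left to re-launch" with "not allowed to re-launch".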
Phase 6: REPORT
- Summary: X succeeded, Y failed (root causes: A, B, C)
- Cost: total $ across all cycles
- If unrecoverable failures → compose_with: issue-creator
- Save to `outputs/factorial_run_<experiment_id>.jsonl`
- See: protocols/report.md
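A final report along these lines could be written as JSON Lines, with one summary record followed by one record per job. The record layout, the `root_cause` and `total_cost_usd` fields, and the function `write_report` are assumptions for illustration only; protocols/report.md defines the actual format.

```python
import json
from collections import Counter
from pathlib import Path


def write_report(path, experiment_id, jobs, total_cost_usd):
    """Write a summary line plus one line per job; return the summary dict.

    `jobs` entries are assumed to carry `status` and, for failures, an
    optional `root_cause`. All field names are hypothetical.
    """
    failed = [job for job in jobs if job["status"] != "SUCCEEDED"]
    summary = {
        "record": "summary",
        "experiment_id": experiment_id,
        "succeeded": len(jobs) - len(failed),
        "failed": len(failed),
        "root_causes": dict(Counter(job.get("root_cause", "unknown") for job in failed)),
        "total_cost_usd": total_cost_usd,  # summed across all cycles by the caller
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        fh.write(json.dumps(summary) + "\n")
        for job in jobs:
            fh.write(json.dumps({"record": "job", **job}) + "\n")
    return summary
```

Keeping the summary as the first line means a quick look at the head of the file answers "X succeeded, Y failed" without parsing every job record.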
Integration Points
| Skill | How It Integrates |
|---|---|
| `ralph-loop` | Per-job diagnosis via `analyze_logs()`; reuses failure pattern library |
| `self-learning-iterative-coder` | TDD loop when fixes require code changes |
| `issue-creator` | Unrecoverable failures after 2 cycles become GitHub issues |
| `overnight-runner` | Factorial runs are a type of batch execution |
Quality Evaluation
See eval/checklist.md for 5 binary pass/fail criteria.
Manifest Schema
See templates/factorial-manifest.json for the experiment state tracking schema.