factorial-monitor


Multi-job factorial experiment monitoring, aggregated diagnosis, and selective re-launch for SkyPilot jobs. Use when running factorial experiments (model x loss x calibration x fold) via SkyPilot and you need to track all jobs, aggregate failures by root cause, and re-launch only failed conditions. Do NOT use for: single-job monitoring (use ralph-loop), local test failures (use self-learning-iterative-coder), or sequential plan execution (use overnight-runner).

By petteriTeikari. Updated 3/29/2026

name: factorial-monitor
version: 1.0.0
description: >
  Multi-job factorial experiment monitoring, aggregated diagnosis, and
  selective re-launch for SkyPilot jobs. Use when running factorial
  experiments (model x loss x calibration x fold) via SkyPilot and you need
  to track all jobs, aggregate failures by root cause, and re-launch only
  failed conditions. Do NOT use for: single-job monitoring (use ralph-loop),
  local test failures (use self-learning-iterative-coder), or sequential plan
  execution (use overnight-runner).
last_updated: 2026-03-19
activation: manual
invocation: /factorial-monitor
metadata:
  category: operations
  tags: [skypilot, monitoring, factorial, cloud, batch, gcp, experiment]
  relations:
    compose_with:
      - ralph-loop
      - self-learning-iterative-coder
      - issue-creator
    depend_on:
      - ralph-loop
    similar_to: []
    belong_to:
      - overnight-runner


Factorial Monitor

Multi-job factorial experiment monitoring with aggregated diagnosis, batch error resolution, and selective re-launch. This skill is the outer loop that orchestrates ralph-loop (per-job diagnosis) and self-learning-iterative-coder (batch code fixes).

Non-Negotiable Rules

Five rules prevent ALL known anti-patterns. Read instructions/rules.md for full details with rationale. Summary:

| Rule | Name | Prevents |
|------|------|----------|
| F1 | WAIT-FOR-TERMINAL | Panic fixing, premature whac-a-mole |
| F2 | AGGREGATE-BEFORE-FIX | Silent dismissal, serial fixing |
| F3 | REBUILD-BEFORE-RELAUNCH | Docker image staleness |
| F4 | MAX-TWO-CYCLES | Infinite fix-relaunch loops, cost overrun |
| F5 | FACTORIAL-MANIFEST | Partial factorial amnesia |

Kill-switch exception (Rule F1): If 3+ jobs fail with an IDENTICAL error within 5 min AND the remaining running jobs haven't yet passed the failure point → cancel the same-config jobs and begin batch diagnosis. Different-config jobs continue.
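The kill-switch condition can be sketched as a small pure function. This is a minimal illustration, not the skill's implementation: the `Failure` record, the `running` mapping, and the choice to group failures by (config, error) are all assumptions made here, and the real definition of "same-config" and "failure point" lives in instructions/rules.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """Hypothetical record of one failed job (field names are assumptions)."""
    job_id: int
    config: str        # label identifying the job's configuration
    error: str         # normalized error signature
    failed_at: float   # seconds since the experiment started


def kill_switch_targets(failures, running, min_failures=3, window_s=300.0):
    """Return sorted job_ids of running jobs to cancel under the F1 exception.

    `running` maps job_id -> (config, passed_failure_point).
    """
    # Assumption: "identical error" is evaluated per configuration.
    times_by_key = {}
    for f in failures:
        times_by_key.setdefault((f.config, f.error), []).append(f.failed_at)

    to_cancel = set()
    for (config, _error), times in times_by_key.items():
        times.sort()
        # Triggered if any `min_failures` consecutive failures fit in the window.
        triggered = any(
            times[i + min_failures - 1] - times[i] <= window_s
            for i in range(len(times) - min_failures + 1)
        )
        if not triggered:
            continue
        for job_id, (job_config, passed) in running.items():
            # Only same-config jobs that have not passed the failure point.
            if job_config == config and not passed:
                to_cancel.add(job_id)
    return sorted(to_cancel)
```

Keeping the decision separate from the actual cancellation makes the rule easy to test, and jobs with a different configuration never appear in the returned list.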

Workflow (6 Phases)

Phase 1: LAUNCH       → protocols/launch.md
Phase 2: MONITOR      → protocols/monitor.md        (polling loop, READ-ONLY)
Phase 3: DIAGNOSE     → protocols/diagnose.md       (batch aggregation)
Phase 4: FIX          → protocols/fix.md            (reviewer-backed batch fix)
Phase 5: RELAUNCH     → protocols/relaunch.md       (selective, max 2 cycles)
Phase 6: REPORT       → protocols/report.md         (final summary + issues)

Phase 1: LAUNCH

  • Execute run_factorial.sh <config.yaml> or confirm already launched
  • Create factorial_manifest.json mapping job_id → condition
  • Verify SkyPilot YAML uses image_id: docker:... (Rule #17 — no bare VM)
  • Record experiment_id, factors, levels, expected job count
  • See: protocols/launch.md
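As a rough sketch of the manifest step, the full factorial can be enumerated with `itertools.product` and paired with job IDs. The field names used here (`expected_jobs`, `jobs`, `relaunch_batch`) are illustrative assumptions; the authoritative schema is templates/factorial-manifest.json, and the job IDs would come from the launch output.

```python
import itertools
import json


def build_manifest(experiment_id, factors, job_ids):
    """Map each job_id to one condition of the full factorial.

    `factors` maps factor name -> list of levels. Job IDs are assumed to be
    listed in the same order as itertools.product enumerates the conditions.
    """
    names = list(factors)
    conditions = [
        dict(zip(names, combo))
        for combo in itertools.product(*factors.values())
    ]
    if len(job_ids) != len(conditions):
        raise ValueError(
            f"expected {len(conditions)} jobs, got {len(job_ids)}"
        )
    return {
        "experiment_id": experiment_id,
        "factors": factors,
        "expected_jobs": len(conditions),
        # JSON object keys must be strings, so job IDs are stringified.
        "jobs": {
            str(job_id): {"condition": condition, "relaunch_batch": 0}
            for job_id, condition in zip(job_ids, conditions)
        },
    }


if __name__ == "__main__":
    manifest = build_manifest(
        "exp-demo",  # hypothetical experiment ID
        {"model": ["m1", "m2"], "loss": ["l1"], "fold": [0, 1]},
        [10, 11, 12, 13],
    )
    print(json.dumps(manifest, indent=2))
```

Checking the job count against the number of conditions at creation time catches a partially launched factorial early, which is the failure mode Rule F5 targets.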

Phase 2: MONITOR (polling loop)

  • Poll sky jobs queue every 60s
  • Print live status table: | condition | job_id | status | duration |
  • For each newly-terminal failure: call ralph_monitor.analyze_logs()
  • READ-ONLY: no code changes, no SSH, no sky exec while jobs run
  • Continue until ALL jobs reach terminal state (Rule F1)
  • See: protocols/monitor.md
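The polling loop might look roughly like the sketch below. It deliberately does not call SkyPilot itself: `fetch_rows` is a hypothetical callable that would wrap `sky jobs queue` and return one dict per job, and `on_failure` stands in for the per-job `ralph_monitor.analyze_logs()` call. The set of terminal status names is an assumption and should be checked against the installed SkyPilot version.

```python
import time

# Assumed terminal statuses for managed jobs; verify for your SkyPilot version.
TERMINAL_STATUSES = {
    "SUCCEEDED", "CANCELLED", "FAILED", "FAILED_SETUP",
    "FAILED_PRECHECKS", "FAILED_NO_RESOURCE", "FAILED_CONTROLLER",
}


def render_table(rows):
    """Render the live status table in the format used by this skill."""
    lines = ["| condition | job_id | status | duration |"]
    for r in rows:
        lines.append(
            f"| {r['condition']} | {r['job_id']} | {r['status']} | {r['duration']} |"
        )
    return "\n".join(lines)


def poll_until_terminal(fetch_rows, on_failure, interval_s=60.0, sleep=time.sleep):
    """Poll read-only until every job is terminal (Rule F1).

    `on_failure` is called exactly once per job when it is first seen failed.
    """
    reported = set()
    while True:
        rows = fetch_rows()
        print(render_table(rows))
        for row in rows:
            if row["status"].startswith("FAILED") and row["job_id"] not in reported:
                reported.add(row["job_id"])
                on_failure(row)
        if all(row["status"] in TERMINAL_STATUSES for row in rows):
            return rows
        sleep(interval_s)
```

Injecting `fetch_rows` and `sleep` keeps the loop read-only by construction and lets it be exercised with canned snapshots instead of live jobs.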

Phase 3: DIAGNOSE (all jobs terminal)

  • Group failures by root cause category using ralph_monitor categories
  • Present ONE aggregated report (Rule F2)
  • Format: {root_cause → [job_ids], fix_strategy, affected_files, confidence}
  • If 0 failures → skip to REPORT
  • See: protocols/diagnose.md
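The aggregation step can be illustrated with a short grouping function. The input shape (one dict per failed job with `root_cause`, `fix_strategy`, `affected_files`, `confidence`) is an assumption about what the per-job diagnosis returns, and taking the minimum confidence per group is a conservative choice made for this sketch rather than something the protocol specifies.

```python
from collections import defaultdict


def aggregate_failures(diagnoses):
    """Group per-job diagnoses into one report keyed by root cause."""
    groups = defaultdict(list)
    for d in diagnoses:
        groups[d["root_cause"]].append(d)

    report = {}
    # Largest groups first, ties broken alphabetically by root cause.
    for cause, items in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        report[cause] = {
            "job_ids": sorted(d["job_id"] for d in items),
            # Assumption: jobs sharing a root cause share a fix strategy.
            "fix_strategy": items[0]["fix_strategy"],
            "affected_files": sorted({f for d in items for f in d["affected_files"]}),
            # Conservative: the group is only as certain as its weakest diagnosis.
            "confidence": min(d["confidence"] for d in items),
        }
    return report
```

An empty input yields an empty report, which corresponds to the "0 failures → skip to REPORT" branch.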

Phase 4: FIX (with reviewer agents)

  • For EACH root cause: plan batch fix strategy with reviewer agents
  • If code fix needed → compose_with: self-learning-iterative-coder
  • If config fix → edit YAML/env directly
  • Execute Rule F3: make test-staging → Docker rebuild → push → verify digest
  • Commit all fixes in ONE batch commit
  • See: protocols/fix.md

Phase 5: RELAUNCH (max 2 cycles)

  • Generate filtered re-launch command (ONLY failed conditions)
  • Execute and return to MONITOR phase
  • Update manifest with relaunch_batch number
  • Rule F4: hard stop after 2 cycles → escalate to user
  • See: protocols/relaunch.md
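A minimal sketch of the selection and cycle limit follows. It assumes a manifest shaped as `{"jobs": {job_id: {"condition": ..., "relaunch_batch": ...}}}`, which is a hypothetical layout (the real schema is templates/factorial-manifest.json), and it stops at returning the failed conditions because the flags of the actual re-launch command are defined in protocols/relaunch.md.

```python
MAX_RELAUNCH_CYCLES = 2  # Rule F4


class RelaunchLimitReached(RuntimeError):
    """Raised when a further relaunch would exceed Rule F4."""


def failed_conditions(manifest, failed_job_ids, next_batch):
    """Return only the conditions whose jobs failed, enforcing the cycle cap."""
    if next_batch > MAX_RELAUNCH_CYCLES:
        raise RelaunchLimitReached(
            f"relaunch batch {next_batch} exceeds {MAX_RELAUNCH_CYCLES}; escalate to user"
        )
    jobs = manifest["jobs"]
    return [jobs[str(job_id)]["condition"] for job_id in failed_job_ids]


def record_relaunch(manifest, new_job_ids, conditions, batch):
    """Add the relaunched jobs to the manifest with their batch number."""
    for job_id, condition in zip(new_job_ids, conditions):
        manifest["jobs"][str(job_id)] = {
            "condition": condition,
            "relaunch_batch": batch,
        }
    return manifest
```

Raising an exception on the third attempt, instead of returning an empty list, makes the hard stop impossible to ignore silently.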

Phase 6: REPORT

  • Summary: X succeeded, Y failed (root causes: A, B, C)
  • Cost: total $ across all cycles
  • If unrecoverable failures → compose_with: issue-creator
  • Save to outputs/factorial_run_<experiment_id>.jsonl
  • See: protocols/report.md
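One way to produce the JSONL output is a summary record followed by one record per job, as in the sketch below. The record fields (`record`, `root_cause`, `cost_usd`) are hypothetical and chosen for illustration; protocols/report.md defines what the report actually contains.

```python
import json
from collections import Counter


def summarize(results):
    """Build the summary record: counts, root causes, and total cost."""
    failed = [r for r in results if r["status"] != "SUCCEEDED"]
    causes = Counter(r.get("root_cause") or "unknown" for r in failed)
    return {
        "record": "summary",
        "succeeded": len(results) - len(failed),
        "failed": len(failed),
        "root_causes": dict(sorted(causes.items())),
        # Cost is summed over every job from every cycle.
        "total_cost_usd": round(sum(r["cost_usd"] for r in results), 2),
    }


def to_jsonl(results):
    """Serialize the summary plus one line per job as JSON Lines text."""
    records = [summarize(results)] + [{"record": "job", **r} for r in results]
    return "\n".join(json.dumps(rec, sort_keys=True) for rec in records) + "\n"
```

The returned string can then be written to outputs/factorial_run_<experiment_id>.jsonl; keeping serialization separate from file I/O makes the format easy to check.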

Integration Points

| Skill | How It Integrates |
|-------|-------------------|
| ralph-loop | Per-job diagnosis via analyze_logs(); reuses failure pattern library |
| self-learning-iterative-coder | TDD loop when fixes require code changes |
| issue-creator | Unrecoverable failures after 2 cycles become GitHub issues |
| overnight-runner | Factorial runs are a type of batch execution |

Quality Evaluation

See eval/checklist.md for 5 binary pass/fail criteria.

Manifest Schema

See templates/factorial-manifest.json for the experiment state tracking schema.

Install via CLI
npx skills add https://github.com/petteriTeikari/vascadia --skill factorial-monitor
Repository Details
Branch: main
Path: SKILL.md