name: run-ferries description: Launch, monitor, and seal Marin canary and daily ferry runs.
Skill: Ferries (Canary + Daily)
Overview
Two ferry lanes:
canary: fast, low-cost always-on health checkdaily: higher-scale integration run with bounded changes
Both keep core data assumptions aligned and share the same monitoring/triage discipline.
Ferry Lanes
Templates:
experiments/ferries/canary_ferry.py(MoE canary, TPU and GPU viaCANARY_ACCELERATOR)experiments/ferries/daily.py
Intent:
- canary: catch infra/pretraining regressions early with a stable Grug MoE config (one TPU, one GPU)
- daily: exercise a larger run envelope and test small, explicit changes
Shared baseline:
- data: shared
nemotron_mixbaseline - default cluster:
us-central1(zoneus-central1-a) - run log:
docs/experiments/daily-ferry-log.md
Daily baseline defaults:
- model size: Llama ~150M (
llama_150m) - sequence length: 4096
- train batch size: 512
- FLOP target: ~1e19 (overrideable via env)
Inputs Before Proposing (Daily Only)
Canary runs normally do not require a proposal cycle or PR. For daily, collect:
- Last ferry references: issue URL, PR/commit URL, W&B run URL and Iris job ID
- Human objective for this interval: standard integration pass, or explicit regression investigation
- Interval boundary: use "since last ferry run", not fixed wall-clock day boundaries
If objective is ambiguous, ask before editing.
Operating Policy
General
- Hard launch gate: get explicit requester approval before launching any ferry job. Only exception: the requester explicitly says to launch without asking.
- Follow the babysit-job skill until the run reaches a terminal state (
SUCCEEDED/FAILED/STOPPED); do not stop early. Full ferry monitoring often takes 4-5 hours. - Never restart/recreate/mutate cluster without explicit human consent in-thread. Keep cluster mutation guardrails aligned with babysit-job, including the debug exception path.
- Use major-event updates (not spam): launch, first eval, major incident, terminal state.
- Seal each completed daily run with a pushed git tag pointing to the exact launch commit.
- Canonical run-closure PR labels:
ferry,ferry-daily,ferry-log-only,ferry-sealed. - Canonical seal-tag format (daily):
ferry/daily/YYYYMMDD/<run_slug>
Canary
- Keep canary stable; only change it for explicit reliability fixes, and only when diagnosing/fixing a concrete failure mode.
- Canary launches usually do not require a PR if the script/config is unchanged.
- If canary fails, treat as urgent infrastructure/training-health triage.
- Canary is run-only by default (W&B + issue updates); no sealing tag or run-closure PR in the normal path.
Daily
- Use daily for bounded evolution, usually 1-2 knobs.
- If daily fails, debug with one bounded fix attempt, then escalate.
- Run-closure PR scope is log-only: update
docs/experiments/daily-ferry-log.md, keep detailed debug/run narrative in the issue.
Workflows
Daily lane (proposal + run)
1) Build context since last ferry
Check the latest entries in docs/experiments/daily-ferry-log.md.
LAST_FERRY_SHA=<last_ferry_commit_sha>
LAST_FERRY_DATE=<YYYY-MM-DD>
git log --oneline "${LAST_FERRY_SHA}..HEAD" -- experiments/ lib/ scripts/
gh issue list \
--label experiment \
--search "updated:>=${LAST_FERRY_DATE}" \
--limit 100
Treat GitHub-tagged ferry PRs/issues as source of truth. Use "since last ferry run" rather than fixed wall-clock boundaries.
2) Edit experiments/ferries/daily.py
- Keep edits bounded (typically 1-2 knobs); propose at least one intentional modification each interval.
- If no obvious change emerges from recent commits/issues/ferry history, pick a low-risk tweak (e.g. data-mix adjustment or hyperparameter change) that may improve loss at the same FLOPs budget.
- Pattern-match from the previous daily ferry; avoid high-churn rewrites.
- Update run naming for this interval (e.g. via
FERRY_DATEin launch env:daily-125m-YYYY-MM-DDstyle).
3) Record proposal in issue and push launch commit
In the run issue, record:
- last ferry links (issue + commit + W&B/job link),
- exact config delta and rationale,
- risk level (
low/medium/high), - relaunch fallback note,
- why this run is not literally identical to the previous daily run,
- launch checklist (explicit requester approval or explicit waiver + monitoring started).
Then push the launch commit (no proposal PR by default).
4) Launch
Confirm requester approval in-thread unless they already gave explicit "launch without asking" permission.
uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=2G --extra=cpu \
-- python -m experiments.ferries.daily
After launch, capture and post to the issue:
- Iris job id (printed by
iris job run, form/<user>/iris-run-job-YYYYMMDD-HHMMSS) - cluster
- launch timestamp
- W&B link(s) when available
Optional deterministic daily rerun name:
uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=2G --extra=cpu \
-e FERRY_DATE "$(date +%Y%m%d-%H%M%S)-daily-ferry" \
-- python -m experiments.ferries.daily
5) Monitor to terminal state
Follow the babysit-job skill with job_id, cluster, experiment=<ferry script path>.
- Keep the monitoring loop active until terminal status; ferry runs commonly take 4-5 hours.
- Follow monitoring-loop restart policy for recoverable failures.
- Escalate non-trivial failures to humans.
6) Close the loop
Post in the ferry issue: final status, key metrics/regressions, Iris job ID and W&B link(s), recommendation for next ferry. Optional: post a manual Discord update for major run state changes.
For daily-log metric fields, extract canonical final keys with:
uv run python scripts/ferries/daily_analysis.py \
--run <wandb_run_url_or_path> \
--format markdown
Required terminal issue comment template:
Final status: <SUCCEEDED|FAILED|STOPPED>
Iris job id: <job_id>
W&B link: <url>
Final eval summary: <short summary + key metrics>
Experiment link: <experiment JSON/browser link>
Recommendation / victory decision: <next action>
7) Seal and open log-only PR
- Create and push a sealing tag for the exact launch commit (the commit containing the
experiments/ferries/daily.pyused for the run). - Open a PR that updates only
docs/experiments/daily-ferry-log.md, following.agents/skills/commit/SKILL.mdfor description format. - Keep all detailed launch/retry/debug narrative in the run issue, not in the PR.
- Apply canonical labels:
ferry,ferry-daily,ferry-log-only,ferry-sealed.
Canary lane (steady-state run)
Default mode: launch the existing canary script as-is and monitor. Do not run the daily proposal/PR loop unless intentionally changing canary. Even for unchanged runs, ask the requester before launch unless they explicitly waived that requirement.
Launch (TPU):
uv run iris --config=lib/iris/config/marin.yaml \
job run --memory=16G --disk=16G --cpu=1 --extra=tpu \
-- python -m experiments.ferries.canary_ferry
Launch (GPU / CoreWeave):
uv run iris --config=lib/iris/config/coreweave.yaml \
job run --memory=16G --disk=16G --cpu=1 --extra=cpu \
-e MARIN_PREFIX s3://marin-na/marin \
-e CANARY_ACCELERATOR gpu \
-- python -m experiments.ferries.canary_ferry
If canary fails: triage and identify root cause, only then open a focused PR if a canary script/config change is necessary, relaunch and monitor to terminal state.
Canary profiling triage
- The full
profile_summary.jsonis in the workflow logs (default params:--warmup-steps 5,--breakdown-mode exclusive_per_track,--hot-op-limit 25). The step summary has pointers to the raw trace artifact and W&B run. - The log summary is ephemeral and not published. To re-analyze with different parameters, fetch the raw trace via
--run-target— see.agents/skills/profile-training/. - If the canary failed early, the profile may only cover warmup steps — check
step_time.all_steps.countbefore drawing conclusions from steady-state stats. exclusive_per_track(the default) can hide device stalls that overlap across tracks. Useexclusive_globalwhen investigating stall-heavy profiles.
Promotion Rule (Daily)
If a daily variant is clearly better holistically, promote it as the new default daily recipe/template.
Promotion signals:
- eval losses are broadly better,
- LM eval soft metrics improve in aggregate,
- no reliability regressions.
When promoting: open a follow-up PR updating experiments/ferries/daily.py and this skill; include a concise before/after metrics table.
Validation Checklist
- Diff is intentional and bounded for the selected lane.
- If daily was edited, launch commit is pushed and referenced in the run issue.
- Run-closure PR only updates
docs/experiments/daily-ferry-log.md. - Ferry issue has updated launch metadata.
- Monitoring loop ran until terminal state.
See Also
docs/experiments/daily-ferry-log.md.agents/skills/babysit-job/SKILL.md.agents/projects/ferry_framework.md.agents/skills/run-research/