name: hpc-run-tracking
description: Use when submitting, monitoring, or reviewing HPC training runs to maintain a persistent log of each run's config, status, results, and next steps.
HPC Run Tracking
Overview
Use this after Phase 3 setup is complete and you are in the ongoing submit/monitor/iterate loop. Every job submission gets a run log file that tracks what was submitted, what happened, and what to do next.
There are three modes of operation:
- Replication — verifying the repo's training works on your cluster with your data. Usually a small number of runs. The goal is "does this work?" Once it does, you're done or you move to experiments.
- Experiments — iterating on the setup: trying different hyperparameters, architectures, data configurations. The goal is "which variation is best?" Many runs, compared against each other.
- Pipeline — supporting jobs that feed into training but aren't training themselves: embedding generation, data preprocessing, format conversion, dataset construction. The goal is "produce an artifact that a training run needs." Use a parenthetical to specify the kind, e.g. pipeline (data generation), pipeline (embedding extraction), pipeline (format conversion).
All three use the same run log format. Experiments add a comparison summary.
When to Use
- Submitting a training, eval, or pipeline job
- Checking on a running or completed job
- Resuming a run after walltime interruption
- Comparing experiment results to decide what to try next
Cluster Access
When checking remote run status, use the cluster profile and try direct SSH yourself before asking the user for command output. For Isambard, the user may already have run clifton auth; try ssh isambard "<command>" first and only ask the user to refresh Clifton auth after a current SSH attempt fails.
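The SSH-first check above can be sketched as follows. This is a minimal illustration, assuming an `isambard` host alias exists in your SSH config; the function names are hypothetical and not part of any cluster profile. `BatchMode=yes` makes SSH fail immediately instead of prompting, so a failed attempt is a clear signal to ask the user to refresh Clifton auth.

```python
import subprocess


def build_ssh_command(host: str, remote_cmd: str, timeout_s: int = 10) -> list[str]:
    """Build a non-interactive SSH invocation that fails instead of prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt for a password or passphrase
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly if the host is unreachable
        host,
        remote_cmd,
    ]


def run_remote(host: str, remote_cmd: str) -> tuple[bool, str]:
    """Return (ok, output); ok is False when SSH or the remote command fails."""
    proc = subprocess.run(
        build_ssh_command(host, remote_cmd),
        capture_output=True,
        text=True,
        timeout=60,  # raises subprocess.TimeoutExpired if the call hangs
    )
    ok = proc.returncode == 0
    return ok, (proc.stdout if ok else proc.stderr)
```

A typical call would be `ok, out = run_remote("isambard", "squeue --me")`. Only when `ok` is false on a fresh attempt should the user be asked to re-authenticate.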
Agent Algorithm
Use this lifecycle algorithm. The sections below define the log format and reference details.
Preflight
- Read the target repo's existing run_logs/, slurm/, configs, and recent run log timelines.
- Identify run type: replication, experiment, or pipeline.
- Identify cluster profile and remote paths if Slurm/cloud state is involved.
On submit
- Create or update the run log before/at submission.
- Record objective, script/command, config, dataset, key settings, code state, and submission time.
- Submit using the repo's normal training/eval command.
- Record execution ID immediately.
On poll
- Query scheduler/VM state directly; do not ask the user for output unless access fails.
- Check real progress, not just "running": output log step counts, loss/metric movement, error log, disk usage, and GPU telemetry where available.
- Append timestamped status with concrete evidence.
On complete
- Capture runtime, final status/exit code, final step/metrics, checkpoint path, config snapshot, W&B local dir and synced URL if available.
- Before syncing offline W&B runs, infer the intended entity/project from the run log/config, tell the user which project will receive the sync, and ask if unclear or if the existing project name is generic/stale. Pass the project explicitly to wandb sync.
- Fill Results, W&B notes, and Next.
- Do not recommend eval/upload before checkpoint passport/signoff.
On resume
- If resuming the same objective/config after walltime or interruption, append a Job (resumed) block to the same run log.
- Only create a new run log for a meaningfully different config/data/task.
On publish or handoff
- Follow checkpoint-passport/SKILL.md first.
- Sanitize any public TRAINING_LOG.md by removing cluster-specific paths, node names, job IDs, and secret-adjacent details while keeping training dynamics and W&B links.
Status updates
- During long remote jobs, report what is done, what is running, what remains, and the last observed scheduler/log evidence.
Run Logs
Run logs live in run_logs/ in the target repo. One markdown file per run.
Directory structure
Group run logs into subdirectories when there are distinct groupings: by component (encoder, decoder), by task variant (arx5_multitask, libero_subtask), or by pipeline stage. Discuss the grouping with the user, since the right split depends on the project. Create subdirs proactively when you know variants are coming; don't wait for the flat list to get messy.
run_logs/
  arx5_multitask/
    timeline.md
    2026-03-22_train.md
    ...
  libero_subtask/
    timeline.md
    ...
For simpler repos with a single training target, a flat run_logs/ is fine.
Each subdirectory gets a timeline.md — a chronological index of all runs in that group with human-readable dates. The timeline header should match the subdirectory:
# ARX5 Multitask Timeline
1. `2026-02-20_train.md` — Friday, Feb 20th
2. `2026-02-21_train.md` — Saturday, Feb 21st
3. `2026-02-25_eval.md` — Wednesday, Feb 25th
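A timeline in the format above can be regenerated from the filenames alone. The sketch below is illustrative and assumes every run log name starts with a `YYYY-MM-DD` date; the helper names are hypothetical. Day and month names are hardcoded so the output does not depend on the system locale, and the separator is written as `\u2014` to match the existing timeline entries.

```python
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def ordinal(day: int) -> str:
    """Return the day with its English ordinal suffix: 1st, 2nd, 3rd, 4th, 11th."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def human_date(d: date) -> str:
    """Format a date as 'Friday, Feb 20th'."""
    return f"{DAYS[d.weekday()]}, {MONTHS[d.month - 1]} {ordinal(d.day)}"


def timeline_entries(filenames: list[str]) -> list[str]:
    """Number run logs chronologically; each name must start with YYYY-MM-DD."""
    lines = []
    for i, name in enumerate(sorted(filenames), start=1):
        d = date.fromisoformat(name[:10])
        lines.append(f"{i}. `{name}` \u2014 {human_date(d)}")
    return lines
```

Pass only run log filenames; exclude `timeline.md` itself before calling `timeline_entries`.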
Creating a run log
Create the file when you submit the job. Name it <date>_<task>.md with the date written as YYYY-MM-DD; the date prefix keeps files in chronological order in the file explorer. Job IDs go inside the file, not in the filename, since they change on every resubmit.
Examples: 2026-03-11_train_front_cam.md, 2026-03-15_eval.md.
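A small helper can keep the naming consistent. This is a sketch only: `run_log_name` is a hypothetical function, and normalising the task to lowercase words joined by underscores is an assumption based on the example filenames.

```python
import re
from datetime import date


def run_log_name(run_date: date, task: str) -> str:
    """Return '<date>_<task>.md' with an ISO date so names sort chronologically."""
    # Assumed convention: lowercase, non-alphanumeric runs collapsed to underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", task.lower()).strip("_")
    if not slug:
        raise ValueError("task must contain at least one letter or digit")
    return f"{run_date.isoformat()}_{slug}.md"
```

For example, `run_log_name(date(2026, 3, 11), "train front cam")` yields the first example filename.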
Include at minimum:
# <task> — <short description>
## Mode
- run_type: <replication, experiment, or pipeline (subtype)>
- objective: <one line — what this run is trying to verify or test>
## Config
- script: `slurm/<script>.sh` or `docker run` command
- config: `configurations/<config>.yaml`
- dataset: `<actual filename>` at `<path>` (if hosted online, link: `<URL>`)
- key settings: <whatever matters for this run — learning rate, batch size, resume, etc.>
## Job
- execution_id: <Slurm job_id, or instance_name/zone for cloud VMs>
- submitted/start: `<ISO timestamp>`
- start_human: `<Wednesday, Feb 25th, 2026>`
- end: `<ISO timestamp>`
- end_human: `<Thursday, Feb 26th, 2026>`
- runtime: `<HH:MM:SS>`
- node: <from squeue/logs> (Slurm only — for cloud VMs the instance is already in execution_id)
## Status
## Results
## W&B
- local: `<offline run dir, e.g. wandb/offline-run-...>`
- synced: `<URL after wandb sync, e.g. https://wandb.ai/team/project/runs/abc123>`
- notes: <brief qualitative read of the curves — discuss with user after reviewing the dashboard>
## Next
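The runtime field in the template can be derived from the two ISO timestamps instead of being typed by hand. The helper below is a minimal sketch with a hypothetical name. It assumes both timestamps carry the same kind of timezone information (both with an offset, or both without), and that the hour count is allowed to exceed 24 for multi-day runs.

```python
from datetime import datetime


def runtime_hms(start_iso: str, end_iso: str) -> str:
    """Return elapsed time as HH:MM:SS; hours may exceed 24 for multi-day runs."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    total = int((end - start).total_seconds())
    if total < 0:
        raise ValueError("end is before start")
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

Note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward; on older versions write the offset as `+00:00`.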
Updating a run log
When checking on a job, append evidence-backed entries to Status. A running process with GPU activity is not proof of progress; check logs for advancing steps/metrics and errors before recording health.
## Status
- 2026-03-11 15:00 — running, step 2400, train_loss 0.085
- 2026-03-12 14:30 — completed, exit code 0
When the job finishes, fill in Results and W&B:
## Results
- runtime: `<HH:MM:SS>` (start `<ISO timestamp>`, end `<ISO timestamp>`)
- final step: 100000
- start_train_loss: `<first logged value>`
- end_train_loss: `<last logged value>`
- start_val_loss: `<first logged value or n/a>`
- end_val_loss: `<last logged value or n/a>`
- loss_one_liner: <one-sentence qualitative summary of the loss progression>
- checkpoint: `<path to checkpoint on remote storage>`
- config_snapshot: `<path to resolved config from run output>`
## W&B
- local: `wandb/offline-run-20260311_150000-abc123`
- synced: `https://wandb.ai/team/project/runs/abc123`
- notes: <brief qualitative read of the curves>
Use the exact config snapshot from the run output; repo config files can drift.
Make loss_one_liner a human-readable takeaway, not a restatement of numbers.
If W&B is not yet synced, write pending — run wandb sync <local>.
W&B sync
Use the bounded AutoHPC sync helper instead of composing ad hoc wandb sync
commands. Install/run the helper in the same runtime environment that has
wandb available. For remote clusters, check the cluster profile for writable
scratch environment and launch details.
uv pip install -e ../autohpc/wandb-sync
Before syncing, choose the intended W&B destination deliberately. Ask the user
for both --entity and --project unless both are already explicit in the run
log/config. Do not rely on stale defaults embedded in the offline run.
Also inspect the offline run size before syncing. W&B offline directories can
quietly accumulate huge transient artifact payloads, especially under
wandb_data/artifacts/staging. If the run directory is unexpectedly large,
identify the bloated staging/artifact files and remove only transient W&B
staging data before sync. Do not delete the run history, config, summary, media
needed for the dashboard, or any checkpoint artifacts that still need to be
published separately.
autohpc-wandb-sync sync \
--entity <wandb-entity> \
--project <wandb-project> \
--wandb-token-file ~/.wandb_token \
--dry-run \
<offline-run-dir>
After the dry run looks correct, remove --dry-run and add --yes. The helper
also checks ~/.wandb_token and ~/.wandb_key if --wandb-token-file is not
provided. If no token file exists, ask the user to create one. Never hardcode
or print the key in scripts, commands, or run logs.
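The size inspection described earlier in this section can be done with a few lines of standard-library Python before running the helper. This sketch only reports sizes and deletes nothing; the function names are hypothetical.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def largest_files(root: Path, top_n: int = 5) -> list[tuple[int, Path]]:
    """Return the top_n largest files under root as (size_bytes, path)."""
    sized = [(p.stat().st_size, p) for p in root.rglob("*") if p.is_file()]
    return sorted(sized, key=lambda item: item[0], reverse=True)[:top_n]
```

If the largest entries sit under a staging directory, review them with the user before removing anything; the decision about what counts as transient stays manual.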
Per-job-block W&B URLs
When a run log has multiple job blocks (original + resumptions), record the synced URL in each job block it belongs to, not only in the W&B section at the bottom. This way you can find the right dashboard from whichever block you're reading without scrolling.
Suggest next steps in Next:
## Next
- resume for more steps: edit `LOAD_CKPT_PATH` in the sbatch script (or pass checkpoint path to `docker run`), then resubmit
- or generate `MODEL_PASSPORT.json` + `SIGNOFF.json` for the checkpoint (see `checkpoint-passport/SKILL.md`), then eval / upload / hand off
Do not suggest "eval this checkpoint" or "upload this checkpoint" as the next step without passport generation in between — the passport is a hard prerequisite for both. The eval harness reads the passport's input_contract to drive how the model is fed; uploading without a passport puts an unsigned snapshot on HF that downstream consumers cannot verify.
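The prerequisite can be enforced mechanically before a Next suggestion is written. The sketch below assumes `MODEL_PASSPORT.json` and `SIGNOFF.json` sit at the top level of the checkpoint package directory; the function names are hypothetical. It checks existence only and does not replace the validation step in checkpoint-passport/SKILL.md.

```python
from pathlib import Path

REQUIRED_FILES = ("MODEL_PASSPORT.json", "SIGNOFF.json")


def missing_signoff_files(package_dir: Path) -> list[str]:
    """List required passport files that are absent from the package directory."""
    return [name for name in REQUIRED_FILES if not (package_dir / name).is_file()]


def next_step(package_dir: Path) -> str:
    """Suggest eval/upload only once both files exist; validation stays separate."""
    missing = missing_signoff_files(package_dir)
    if missing:
        return "generate " + " + ".join(missing) + " (see checkpoint-passport/SKILL.md)"
    return "validate passport and signoff, then eval / upload / hand off"
```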
Resumptions
When resuming a walltime-interrupted run, don't create a new file. Append to the same run log:
## Job (resumed)
- execution_id: 12346
- submitted: 2026-03-12 15:00 UTC
- resumed from: `<path to checkpoint>`
Only create a new file when the run represents a meaningfully different experiment (different config, different data, different task).
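Appending the resumed block can be scripted so that the original log is never overwritten. This is a minimal sketch operating on the log text; the function name is hypothetical, and it appends the block at the end of the file.

```python
def append_resumed_job(log_text: str, execution_id: str,
                       submitted: str, checkpoint: str) -> str:
    """Return the run log text with a new '## Job (resumed)' block at the end."""
    block = (
        "## Job (resumed)\n"
        f"- execution_id: {execution_id}\n"
        f"- submitted: {submitted}\n"
        f"- resumed from: `{checkpoint}`\n"
    )
    # Exactly one blank line between the existing content and the new block.
    return log_text.rstrip("\n") + "\n\n" + block
```

Read the existing file, pass its contents through `append_resumed_job`, and write the result back to the same path.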
Replication Runs
For replication, a single run log per training task is usually enough. The goal is confirming the setup works end-to-end. Once results look reasonable, you're done — note the outcome and move on.
Experiment Runs
For experiments, each variation gets its own run log. Additionally, maintain a run_logs/experiments.md summary that compares results across runs:
# Experiments
| run | description | val_loss | status | notes |
|-----|-------------|----------|--------|-------|
| 2026-03-11_train_baseline | front cam, default LR | 0.028 | keep | baseline |
| 2026-03-12_train_lr_sweep | LR 1e-3 -> 5e-4 | 0.025 | keep | small improvement |
| 2026-03-13_train_both_cams | front + wrist | 0.031 | discard | worse than single cam |
Before submitting a new experiment, read run_logs/experiments.md and recent run logs to understand what's been tried and what worked. Use this to decide what to try next — don't repeat failed variations.
Each experiment run should be on its own git branch or tagged commit so you can recover the exact code that produced a given result.
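Reading the comparison summary before a new submission can be partly automated. The sketch below parses a pipe table with the column names used in the example (`run`, `val_loss`, `status`) and returns the best run still marked `keep`. The function names are hypothetical, and the parser assumes no cell contains a literal `|`.

```python
from typing import Optional


def parse_experiments(markdown: str) -> list[dict]:
    """Parse the pipe table in experiments.md into one dict per data row."""
    rows = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(rows) < 2:
        return []
    header = [cell.strip() for cell in rows[0].strip("|").split("|")]
    records = []
    for line in rows[2:]:  # rows[1] is the |-----| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        records.append(dict(zip(header, cells)))
    return records


def _loss(record: dict) -> Optional[float]:
    """Return val_loss as a float, or None when it is missing or not numeric."""
    try:
        return float(record.get("val_loss", ""))
    except ValueError:
        return None


def best_kept_run(records: list[dict]) -> Optional[dict]:
    """Return the 'keep' row with the lowest numeric val_loss, or None."""
    kept = [r for r in records if r.get("status") == "keep" and _loss(r) is not None]
    return min(kept, key=_loss, default=None)
```

The result is a starting point for the discussion with the user, not a decision: a lower `val_loss` on a discarded run may still have been rejected for reasons recorded in the notes column.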
Publishing Checkpoints
When a checkpoint is worth sharing or backing up, follow
checkpoint-passport/SKILL.md first. Do not upload, copy, eval, or hand off a
checkpoint before MODEL_PASSPORT.json and SIGNOFF.json exist and validate.
The public package normally contains:
README.md # Model card: description, config, loss table, usage
TRAINING_LOG.md # Sanitized run log (see below)
MODEL_PASSPORT.json # Feeding contract + integrity manifest (generated by checkpoint-passport)
SIGNOFF.json # Verdict + sha256 of passport and weight files (generated by sign-checkpoint)
assets/ # Norm stats, dataset list, valid indices, etc.
checkpoints/<step>/params/ # Model weights (inference + fine-tuning)
Create TRAINING_LOG.md from the run log, stripped of cluster-specific details: Slurm job IDs, node names, scratch paths, wandb_id.txt references. Keep the training dynamics, config, loss progression, and W&B link.
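A first sanitisation pass can be scripted, with a manual review afterwards. The patterns below are illustrative assumptions only: the directory names `scratch`, `lustre` and `projects` and the field names `execution_id` and `node` must be adapted to your cluster and log layout. The function name is hypothetical.

```python
import re

# Lines to drop entirely: job IDs and node names from the Job blocks.
DROP_LINE = re.compile(r"^\s*-\s*(execution_id|node)\s*:", re.IGNORECASE)

# Substrings to redact. The lookbehind stops the path pattern from matching
# inside URLs such as https://wandb.ai/...; root names are assumptions.
REDACTIONS = [
    (re.compile(r"(?<![\w.:/])/(scratch|lustre|projects)/[^\s`]+"), "<path removed>"),
    (re.compile(r"[^\s`]*wandb_id\.txt"), "<removed>"),
]


def sanitize_run_log(text: str) -> str:
    """Drop job-ID and node lines and redact cluster paths; keep everything else."""
    kept = []
    for line in text.splitlines():
        if DROP_LINE.match(line):
            continue
        for pattern, replacement in REDACTIONS:
            line = pattern.sub(replacement, line)
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Always read the output before publishing; a regex pass will miss cluster details that appear in free-text notes.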
W&B visibility
If the main W&B project is private but you want to share training curves, sync the run to a separate public project:
wandb sync --project <public-project-name> <offline-run-dir>
Then set the project to public in the W&B UI. Link this public URL in the README, not the private project URL.
Recording in the run log
After uploading, add a ## HuggingFace section to the run log:
## HuggingFace
- repo: https://huggingface.co/<user>/<repo>
- uploaded checkpoints: <which steps, params only or full>
- includes: README, TRAINING_LOG, MODEL_PASSPORT.json, SIGNOFF.json, assets, ...
- signoff verdict: `pass` | `soft_signal` (if soft_signal, copy the verdict_reason here verbatim)
Common Mistakes
- Not creating a run log — then you forget what config produced which checkpoint
- Creating a new file for every resumption of the same run
- Logging status without the step count or loss — timestamps alone aren't useful
- Forgetting to record the checkpoint path in results — makes resumption a guessing game
- Not recording run_type (replication vs experiment vs pipeline) — makes intent unclear when reviewing later
- Running experiments without updating the comparison summary — then you lose track of what's been tried
- Not branching/tagging experiment code — then you can't recover what produced a good result
- Not recording the W&B synced URL — then you have to hunt through wandb/ dirs or re-sync to find training curves
- Putting the synced URL only in the W&B section when there are multiple job blocks — then you can't find it from the block you're reading
- Naming files <task>_<date>.md instead of <date>_<task>.md — breaks chronological sorting in file explorers
- Dumping all run logs flat in run_logs/ instead of grouping by project — becomes unreadable past ~10 files
- Only recording ISO timestamps without human-readable dates — forces mental parsing every time you open a file
- Not maintaining timeline.md per subdirectory — then you have to open individual files to reconstruct order
- Uploading a checkpoint to HF (or copying it to an eval box) without first generating MODEL_PASSPORT.json + SIGNOFF.json — the snapshot is permanently unsigned, downstream consumers can't verify the feeding contract, and any cached copies people pulled in the interim will never get a passport
- Rolling a custom sha256-of-sha256s manifest in the README instead of using sign-checkpoint — loses the verdict, loses the per-file breakdown, and the consumer-side validate-checkpoint --require-signoff gate doesn't work against it