name: fine-tuning
description: Complete reference for the fine-tuning pipeline (SFT, KTO, GRPO), cloud HF Jobs workflows, autonomous experiment search, checkpoint evaluation, and LoRA surgery. Covers training CLI flags, YAML configuration, model presets, dataset requirements, LoRA settings, training monitoring, hyperparameter search, and post-training optimization. Use when training models, configuring training runs, choosing hyperparameters, running cloud experiments, inspecting HF jobs, or troubleshooting training issues. This skill is about USING the training system via CLI and YAML, never modifying source code.
allowed-tools: Read, Bash, Write, Grep, Glob
Fine-Tuning Pipeline
Train language models with SFT, KTO, and GRPO locally or on supported cloud providers. This skill also covers the Karpathy-style experiment loop, checkpoint evaluation, LoRA surgery, and the current HF Jobs operational path.
Quick Reference
| Task | Command |
|---|---|
| Interactive menu | ./run.sh → Train |
| Local Docker config run | python tuner.py local-run --job-config Trainers/recipes/<recipe>.yaml --yes |
| SFT training | cd Trainers/sft && python train_sft.py --model-size 7b |
| KTO training | cd Trainers/kto && python train_kto.py --model-size 7b |
| GRPO training | cd Trainers/grpo && python train_grpo.py |
| Pivot-profile GRPO dataset | cd Trainers/grpo && python train_grpo.py --config configs/pivot_config.yaml --pivot-profile-only |
| GRPO with pivot filtering | cd Trainers/grpo && python train_grpo.py --config configs/pivot_config.yaml |
| Env-backed GRPO | cd Trainers/grpo && python train_env_grpo.py --config ./configs/env_config.yaml --dry-run |
| Experiment loop | python tuner.py experiment-loop --experiment-config configs/flywheel/experiment_loop.yaml |
| LoRA surgery | python tuner.py surgery --surgery-config configs/lora_surgery.yaml |
| HF custom job | python tuner.py cloud-run --job-config Trainers/recipes/<recipe>.yaml |
| Canonical HF train+eval | python tuner.py cloud-pipeline --method sft --preset full |
| Full experiment bundle | python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/<spec>.yaml --yes |
| Evolutionary SFT smoke test | python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/<evolutionary-spec>.yaml --yes |
| Staggered experiment batch | python3 scripts/launch_experiment_batch.py Trainers/cloud/experiments/<spec1>.yaml Trainers/cloud/experiments/<spec2>.yaml --yes |
| Blind hardware plan | python tuner.py plan-hardware --experiment-spec Trainers/cloud/experiments/<spec>.yaml |
| Analyze finished experiment | python tuner.py analyze-experiment --experiment-id latest |
| Analyze/prune dataset from loss | python3 scripts/prune_dataset_from_loss.py --dataset-path ... --experiment-id ... --analyze-only |
| Standalone prompt optimization | python tuner.py prompt-optimize --prompt-opt-config configs/prompt_optimization/NAME.yaml |
| Prompt-optimize SynthChat generation | python -m SynthChat.run generate --prompt-opt-config configs/prompt_optimization/NAME.yaml [options] |
| Analyze bucket-backed run | python tuner.py bucket analyze --path runs/hf_jobs/sft/<run-prefix>/ |
| Read bucket artifact | python tuner.py bucket read --path runs/.../logs/training_latest.jsonl --jsonl-latest --pretty |
| List bucket prefix | python tuner.py bucket list --path runs/hf_jobs/sft/<run-prefix>/ --limit 20 |
| Pull bucket prefix locally | python tuner.py bucket pull --path runs/hf_jobs/sft/<run-prefix>/ --dest . |
| Push local artifact to bucket | python tuner.py bucket push --path local/results.json --dest runs/manual_uploads/ |
| Live HF job list | python tuner.py cloud-jobs list |
| Live HF job logs | python tuner.py cloud-jobs logs --job professorsynapse/<job-id> --tail 200 |
| Cloud eval against a run | python tuner.py cloud-eval --run latest --preset full |
| HF gym against trained model | python tuner.py cloud-gym --run latest --method sft |
| Warm Space scaffold | python3 Trainers/cloud/scripts/manage_space.py render --template vllm_warm --output-dir /tmp/my-space --base-image ghcr.io/<org>/<image>:<tag> |
| Warm Space deploy | python3 Trainers/cloud/scripts/manage_space.py deploy --space-id <user>/<space> --template vllm_warm --base-image ghcr.io/<org>/<image>:<tag> --hardware a10g-small --sleep-time 3600 --var BASE_MODEL=<model> |
| ML training | python tuner.py ml train --config Trainers/ml/configs/templates/regression.yaml |
Training Methods at a Glance
| Method | Purpose | LR | Epochs | Dataset | When to Use |
|---|---|---|---|---|---|
| SFT | Teach format and behavior | 2e-4 | 3 | Positive examples only | First stage |
| KTO | Refine with preferences | 1e-6 | 1 | Interleaved True/False | Second stage |
| GRPO | Optimize against rewards | 5e-6 | 1 | Prompts + ground truth | Final online stage |
| Embedding | Train a retrieval bi-encoder | 2e-5 | 1 | Triplets / pairs | Retrieval / RAG embedders |
Recommended pipeline: SFT → KTO → GRPO
Embedding training (SentenceTransformer bi-encoders for retrieval) is a separate method with its own registry, dual loader, adapter modes (full/lora/frozen_head), and corpus-level retrieval evaluation. It rides the same local-run/recipe path as SFT. For the full surface use the dedicated embedding-training skill; the triplet/retrieval data shapes are in reference/dataset-formats.md below.
Complexity Tiers
Use --tier on the local SFT and KTO trainers when you want a preset instead of hand-tuning LoRA rank and LR.
| Tier | LoRA Rank | LR | Time | Use Case |
|---|---|---|---|---|
| quick | r=8 | 5e-4 | ~5 min | Prototyping and smoke runs |
| standard | r=64 | 2e-4 | ~30-60 min | Normal training |
| thorough | r=128 | 1e-4 | ~2-4 hrs | Quality-focused final runs |
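The tier table can be read as a plain preset lookup. The sketch below is illustrative only: `TIER_PRESETS` and `resolve_tier` are hypothetical names that mirror the table, not the trainers' actual preset structure.

```python
# Hypothetical mirror of the tier table; the real trainers may store
# presets differently and may set more fields than these two.
TIER_PRESETS = {
    "quick": {"lora_r": 8, "learning_rate": 5e-4},
    "standard": {"lora_r": 64, "learning_rate": 2e-4},
    "thorough": {"lora_r": 128, "learning_rate": 1e-4},
}


def resolve_tier(tier: str) -> dict:
    """Return a copy of the LoRA rank and learning rate for a tier name."""
    if tier not in TIER_PRESETS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(TIER_PRESETS)}")
    return dict(TIER_PRESETS[tier])
```

Returning a copy keeps a caller's per-run overrides from leaking back into the shared preset table.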
Key Directories
- Trainers/sft/: active SFT trainer
- Trainers/recipes/: unified training recipe configs (local Docker + HF Jobs); each recipe declares target: local|cloud|both and method: sft|kto|grpo
- Evaluator/recipes/: unified evaluation recipe configs
- Trainers/kto/: KTO trainer
- Trainers/grpo/: GRPO and env-GRPO trainer
- Trainers/embedding/: embedding (SentenceTransformer bi-encoder) trainer, registry, and dual loader; see the embedding-training skill
- Trainers/archive/legacy_rtx3090/: archived legacy RTX3090 trainer snapshots and outputs; do not use for new runs
- Datasets/: JSONL training datasets
- SynthChat/scenarios/: synthetic data and environment-backed scenarios
- shared/flywheel/: autonomous experiment-loop logic and LightGBM surrogate search
- Trainers/ml/: traditional ML / LightGBM training for tabular experiment analysis
- shared/experiment_tracking/: unified run and experiment tracking
CLI Discipline
- Never cancel a job, delete bucket artifacts, remove files, or relaunch a cost-incurring cloud run unless the user has explicitly approved that exact action in the current conversation.
- Treat cancel/delete/relaunch as irreversible or materially destructive operator actions. Do not infer permission from surrounding context or from a user's broader goal.
- Do not guess command names or flags from memory.
- Before giving command guidance, check tuner/cli/parser.py, tuner/cli/router.py, or the real --help output.
- Prefer repo CLIs and checked-in scripts over ad hoc Python snippets.
- After benchmark runs complete, treat the checked-in benchmark ledger as part of the workflow:
  - Treat stage lineage as the source of truth for automation:
    - training: training_lineage.json
    - evaluation: evaluation_lineage.json
    - exact loss: loss_lineage.json
  - Treat loss_summary.json as a supporting artifact, not the canonical final loss metadata file.
  - The ledger should accumulate real model-size / hardware / timing / cost data so future hardware planning can optimize against observed evidence instead of memory.
- For local trainer iteration, use the checked-in train_sft.py, train_kto.py, and train_grpo.py entrypoints.
- For repeatable local GPU training, prefer python tuner.py local-run --job-config Trainers/recipes/<recipe>.yaml --yes over ad hoc docker run commands. Put the model, dataset, Docker image, package overrides, LoRA settings, training knobs, and artifact paths in YAML.
- For Windows Docker Desktop with GPU, prefer job.transfer: auto or copy in local-run configs. The runner chooses copy mode on Windows because GPU bind mounts can fail with access denied.
- Keep newly released model support in local-run setup.pip pins or image fields. Do not leave one-off package installs in shell history.
- For canonical HF experiments, prefer python tuner.py cloud-pipeline ... over cloud-run.
- For full train → eval → exact loss → analysis → recommendation runs, prefer python tuner.py run-experiment ....
- Evolutionary SFT is experimental but now first-class in the cloud experiment path. Prefer a checked-in experiment spec or cloud-pipeline --train-evolutionary-* overrides over editing trainer YAMLs by hand.
- Use evolutionary training for technical validation or targeted experiments first. It adds significant per-step compute overhead, so smoke tests should usually cap max_steps before you commit to a long run.
- For multi-spec benchmark launches, prefer python3 scripts/launch_experiment_batch.py ... over back-to-back manual submissions. It defaults to a 5-second stagger.
- For blind stage hardware selection before launch, use python tuner.py plan-hardware ....
- For live HF status and traceback inspection, use python tuner.py cloud-jobs ....
- For finished experiment bundles and next-run suggestions, use python tuner.py analyze-experiment ....
- For loss-driven dataset cleanup, start with python3 scripts/prune_dataset_from_loss.py ... --analyze-only; only apply a pruning rule after checking which families are actually enriched in the high-loss slice.
- Prefer the generic pruning strategies first (loss_threshold, top_percent). Use repo-specific presets only when the analysis output shows that exact family is genuinely overrepresented.
- For in-flight cloud-run health checks, inspect the bucket-backed artifacts first (training_latest.jsonl, stage_summary.json, training_lineage.json, eval/loss partials). Use raw HF logs only as a fallback when the bucket prefix has not started writing yet.
- For quick bucket spot checks, use python tuner.py bucket read ... or python tuner.py bucket list ... instead of manual hf buckets cp commands.
- For local inspection or offline diffing, use python tuner.py bucket pull ... to sync a bucket-relative path into the current workspace while preserving its relative path.
- For one-off uploads back into the HF artifact bucket, use python tuner.py bucket push ... instead of ad hoc sync_bucket snippets.
- For a100-large or larger tiers, bias toward aggressive packing. Do not lower batch just because the adapter recipe changed. Start from the highest known-good packed shape for the same model family and only back off after a real OOM or clear instability signal.
- Treat large unused VRAM on a100-large as a mistake, not a comfort margin. If training_lineage.json shows tens of GB of reserved headroom, the run is underpacked and the next iteration should push batch size harder even if that risks OOM.
- For vLLM eval on multi-GPU hardware, prequantized BitsAndBytes base models (for example *-bnb-4bit) cannot use tensor parallelism. Do not assume x4 means vLLM will shard generation across all GPUs; in this path, eval may need to fall back to single-GPU while exact loss still fans out across all visible GPUs afterward.
- For experiment specs in this repo, set evaluation.runtime: vllm and evaluation.image_profile: fast_vllm by default. Do not silently switch an experiment to unsloth eval just because the model family is new. If vLLM compatibility is uncertain, verify current official support and ask the user before deviating from the vLLM path.
- When the user says "latest run" or "latest experiment", distinguish latest attempt from latest completed baseline. Check .tracking/experiments/*/experiment.json sorted by created_at first, then state explicitly which interpretation you are using before copying hyperparameters forward.
- When choosing an A100 packed shape, prefer the nearest latest attempt that actually exercised the hardware over an older completed baseline that clearly underpacked the card.
- If run-experiment refuses to launch because the tracked worktree is dirty, prefer creating a clean temporary git worktree and launching from there over asking the user to stash or cleaning their checkout.
- If a cloud run fails before bucket artifacts appear, treat it as a bootstrap/runtime problem first. Inspect cloud-jobs logs before changing training hyperparameters.
- For newly released architectures or day-zero model launches, verify official Docker Hub tags for unsloth/unsloth and vllm/vllm-openai before trusting the repo's pinned image profiles. As of 2026-04-02, Docker Hub shows unsloth/unsloth:latest updated 1 day ago and vllm/vllm-openai:latest / v0.17.1 updated about 17 hours ago.
- As of 2026-04-22, local docker pull unsloth/unsloth:latest resolved to digest sha256:9be56babef4efc330316cff3a65f9f911b9e7709bce4114fa7817ba3ffd8565d. That image still reports transformers 4.57.1, so Qwen3.5 local runs need config-level package overrides such as transformers==5.5.0, trl==0.22.2, and current unsloth/unsloth_zoo.
- If a named training image profile is broken, prefer an explicit training.cloud_image override to a currently verified official image tag over changing unrelated parts of the experiment such as evaluation backend.
- If a run needs newer package versions but the right base image is otherwise close, use stage-local experiment-spec pip_packages pins under training:, evaluation:, or loss: instead of a one-off helper script or a repo-global image pin.
- For hyperparameter search, use python tuner.py experiment-loop ...; this is the built-in LLM + LightGBM surrogate path.
- For tabular post-hoc models, use python tuner.py ml ... and the configs under Trainers/ml/configs/templates/.
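The "latest attempt" versus "latest completed baseline" distinction above can be made mechanical. This is a minimal sketch, not the repo's implementation: `created_at` and the tracking path come from the guidance above, while the `status` field, its `"completed"` value, and all function names are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import List, Optional


def load_experiments(tracking_root: str = ".tracking/experiments") -> List[dict]:
    """Read every <id>/experiment.json under the tracking root."""
    return [
        json.loads(path.read_text())
        for path in Path(tracking_root).glob("*/experiment.json")
    ]


def newest_first(records: List[dict]) -> List[dict]:
    """Sort by created_at, newest first.

    Assumes created_at is an ISO-8601 string, so lexical order matches time order.
    """
    return sorted(records, key=lambda r: r.get("created_at", ""), reverse=True)


def latest_attempt(records: List[dict]) -> Optional[dict]:
    """Most recent experiment, whatever its outcome."""
    ordered = newest_first(records)
    return ordered[0] if ordered else None


def latest_completed(records: List[dict], done: str = "completed") -> Optional[dict]:
    """Most recent experiment whose (hypothetical) status field equals `done`."""
    return next((r for r in newest_first(records) if r.get("status") == done), None)
```

When the two functions return different experiments, that is the moment to state which interpretation is in use before copying hyperparameters forward.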
Research Note Lifecycle
Use .skills/research-reporting/SKILL.md alongside this skill whenever the user wants experiment tracking, research summaries, or reusable post-run analysis.
- When a training run or experiment is launched, create the research note immediately rather than waiting for all stages to finish.
- The initial note should capture config intent from experiment.json and spec_path, with stage_statuses set from the current known state and metrics left null or empty where evidence does not exist yet.
- After training finishes, update the same note with training provenance from training_lineage.json or stage_details.training.
- After evaluation finishes, update the same note with evaluation metrics and observed failure patterns.
- After loss finishes, update the same note with loss metrics, high-loss hashes, and any dataset-review signals.
- After analysis/recommendation finishes, update the same note with hypotheses, selected candidate rank, and recommended next action.
- Do not fork separate notes per stage unless the user explicitly asks for per-stage notes. Default is one note per experiment, updated over time.
- Treat the note as a living artifact for the run: launch state, in-flight updates, and final postmortem should all land in the same document.
- If a stage has not run yet or failed, preserve the section and leave missing values explicit instead of inventing placeholders.
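The lifecycle above amounts to one record that is created at launch and updated in place per stage, with missing evidence kept explicit. The sketch below illustrates that shape. The note schema, stage names, and status strings are hypothetical; only `spec_path` and `stage_statuses` are named in the guidance, and the real schema belongs to the research-reporting skill.

```python
from copy import deepcopy
from typing import Optional

# Hypothetical stage names for illustration.
STAGES = ("training", "evaluation", "loss", "analysis")


def new_note(experiment_id: str, spec_path: str) -> dict:
    """Create the note at launch time; metrics stay None until evidence exists."""
    return {
        "experiment_id": experiment_id,
        "spec_path": spec_path,
        "stage_statuses": {stage: "pending" for stage in STAGES},
        "metrics": {stage: None for stage in STAGES},
    }


def update_note(note: dict, stage: str, status: str, metrics: Optional[dict] = None) -> dict:
    """Return an updated copy of the same note; never fork a per-stage note."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}")
    updated = deepcopy(note)
    updated["stage_statuses"][stage] = status
    if metrics is not None:
        updated["metrics"][stage] = metrics
    return updated
```

A failed stage is recorded by status alone, so its metrics remain `None` instead of being filled with an invented placeholder.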
Progressive Reference
Load the specific reference you need:
| Reference | When to Load | Path |
|---|---|---|
| SFT Training | Running SFT, configuring SFT params | reference/sft-training.md |
| KTO Training | Running KTO, dataset interleaving, preference tuning | reference/kto-training.md |
| GRPO Training | Running GRPO, reward config, GSPO variant | reference/grpo-training.md |
| Model Presets | Choosing models, VRAM planning, LoRA settings | reference/model-presets.md |
| Dataset Formats | Preparing datasets, format requirements per method | reference/dataset-formats.md |
| Training Config | YAML config deep-dive | reference/training-config.md |
| Cloud Training | Provider-native persistence, exact-commit rules, cloud smoke tests | reference/cloud-training.md |
| Cloud Experiments | Canonical train→eval launches with --train-* overrides | reference/cloud-experiment-launching.md |
| Checkpoint Evaluation | Best-checkpoint selection via eval | reference/checkpoint-evaluation.md |
| Experiment Loop | Autonomous hyperparameter search (LLM + LightGBM) | reference/experiment-loop.md |
| LoRA Techniques | LoRA variants, init methods, config recipes | reference/lora-techniques.md |
| Evolutionary Config | Experimental gradient-selection config schema and defaults | reference/training-config.md |
| LoRA Surgery | Eval-guided post-training weight optimization | reference/lora-surgery.md |
| Troubleshooting | OOM errors, instability, platform issues | reference/troubleshooting.md |
| Env Alignment Protocol | Canonical SynthChat → SFT → merge/publish → KTO → env-GRPO flow | protocols/environment-backed-alignment-pipeline.md |
LoRA Technique Configs
Reference config templates for LoRA variants in .skills/fine-tuning/configs/:
| Template | Technique | When to Use |
|---|---|---|
| regret_free.yaml | High-rank + rsLoRA + all-linear | Full-FT quality from LoRA |
| dora.yaml | DoRA (weight-decomposed) | Drop-in quality boost |
| qlora_dora.yaml | QLoRA + DoRA | Single GPU, VRAM-limited |
| pissa.yaml | PiSSA (SVD init) | Fast convergence |
| eva.yaml | EVA (activation SVD init) | Small datasets |
| olora.yaml | OLoRA (QR init) | Fast convergence, simpler than EVA |
| loftq.yaml | LoftQ (quant-aware init) | Minimize 4-bit quality loss |
| grpo_minimal.yaml | Minimal rank for RL | GRPO/reasoning tasks |
See reference/lora-techniques.md for full details, integration status, and compatibility notes.
Common Patterns
Quick SFT test run:
cd Trainers/sft
python train_sft.py --model-size 3b --tier quick --dry-run
Config-driven local Docker SFT smoke run:
python tuner.py local-run \
--job-config Trainers/recipes/qwen35_2b_sft_smoke.yaml \
--yes
For a different local SFT run, copy a recipe under Trainers/recipes/ (one with target: local or target: both) and change model, dataset, training, lora, job.image, and setup.pip as needed. Use repo-relative local dataset paths; the runner translates them for the container.
Config-driven local Docker embedding smoke run:
python tuner.py local-run \
--job-config Trainers/recipes/embedding_bge_base_smoke.yaml \
--yes
Trains a small bge-base-en embedding adapter for a few steps. The recipe rides
the modern Unsloth image and layers sentence-transformers + faiss-cpu +
datasets via setup.pip. The pins are still TBD pending the cloud smoke; they will
be captured from the first working run, never invented. For the full embedding
surface (registry, adapter modes, retrieval eval) use the embedding-training skill.
Generate a dataset with prompt optimization provenance:
python -m SynthChat.run generate \
--prompt-opt-config configs/prompt_optimization/synthchat_smoke.yaml \
--output Datasets/synthchat/prompt_opt_dryrun.jsonl
Use prompt optimization as an opt-in dataset-generation step before training, not as an implicit trainer behavior. SynthChat records the optimizer artifact path and selected candidate ID in row metadata and leaves source YAML unchanged.
Run prompt optimization before dataset generation:
python tuner.py prompt-optimize \
--prompt-opt-config configs/prompt_optimization/labkit_epistemic_humility_evaluator_smoke.yaml
Prompt optimization is a config-first workflow. Keep subjects, operators,
evaluation scenarios, objectives, and stopping rules in
configs/prompt_optimization/*.yaml; promote an optimized prompt into canonical
scenario/config YAML only after human review. The practical evaluator threshold
default is stopping.target_score: 0.8.
The checked-in evaluator smoke config should stay dry-run safe:
configs/prompt_optimization/labkit_epistemic_humility_evaluator_smoke.yaml
uses evaluation.evaluator.dry_run: true. For a real LM Studio smoke, use an
explicit local override or temporary manual copy rather than changing the
checked-in config. Keep that real smoke minimal: population_size: 3 and
max_generations: 1.
KTO with local dataset:
cd Trainers/kto
python train_kto.py --model-size 7b --local-file ../../Datasets/my_kto_data.jsonl
GRPO continuing from SFT checkpoint:
# Edit configs/config.yaml to set model.lora_path to SFT checkpoint
cd Trainers/grpo
python train_grpo.py
Autonomous hyperparameter search:
python tuner.py experiment-loop \
--experiment-config configs/flywheel/experiment_loop.yaml \
--max-experiments 10
Post-training LoRA surgery:
python tuner.py surgery --surgery-config configs/lora_surgery.yaml
Inspect live HF jobs:
python tuner.py cloud-jobs list --limit 20
python tuner.py cloud-jobs show --job professorsynapse/<job-id>
python tuner.py cloud-jobs logs --job professorsynapse/<job-id> --tail 200
Gotcha:
cloud-jobs show is useful, but for real progress checks prefer the bucket-backed artifacts once they exist. Early raw logs are often just bootstrap noise; the bucket tells you when training/eval/loss has actually started producing useful state.
Read the latest training record from a bucket-backed run:
python tuner.py bucket read \
--path runs/hf_jobs/sft/<run-prefix>/logs/training_latest.jsonl \
--jsonl-latest \
--pretty
Analyze, tail, list, pull, or push run artifacts directly against the bucket:
python tuner.py bucket analyze \
--path runs/hf_jobs/sft/<run-prefix>/
python tuner.py bucket read \
--path runs/hf_jobs/sft/<run-prefix>/logs/training_latest.jsonl \
--tail 5
python tuner.py bucket read \
--path runs/hf_jobs/sft/<run-prefix>/logs/stage_summary.json \
--pretty
python tuner.py bucket list \
--path runs/hf_jobs/sft/<run-prefix>/ \
--limit 20
python tuner.py bucket pull \
--path runs/hf_jobs/sft/<run-prefix>/analysis/loss/ \
--dest .
python tuner.py bucket push \
--path local/analysis/notes.json \
--dest runs/manual_uploads/
Gotcha:
- Use bucket analyze first when a run has finished. It summarizes training_lineage.json, the newest evaluation_lineage.json, and loss_lineage.json in one pass.
- If the same training run has multiple eval reruns or alternate loss prefixes, pass --eval-path and/or --loss-path explicitly so the summary cannot attach to the wrong post-training artifacts.
- If schema pass rate is strong but behavior pass rate is still weak, do not assume you need more tool-call syntax SFT. That pattern usually means the next bottleneck is text-only restraint, clarification, or verify-before-action behavior.
Canonical one-off HF experiment with direct overrides:
python tuner.py cloud-pipeline --method sft --preset full \
--train-model-name Qwen/Qwen3.5-2B \
--train-image-profile next \
--train-gpu a10g-small \
--train-batch-size 8 \
--train-gradient-accumulation 4 \
--train-lora-r 128 \
--train-lora-alpha 256 \
--train-use-rslora \
--train-use-dora \
--train-no-load-in-4bit \
--train-lora-target-modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
--yes
python tuner.py cloud-pipeline --method sft \
--train-model-name Qwen/Qwen3-4B \
--train-gpu a100-large \
--train-batch-size 24 \
--train-gradient-accumulation 2 \
--train-learning-rate 1e-4 \
--train-max-steps 60 \
--train-evolutionary-enabled \
--train-evolutionary-candidates 4 \
--train-evolutionary-eval-batch-size 2 \
--train-evolutionary-validation-config configs/fitness/tool_calling.yaml \
--train-evolutionary-strategy gradient_noise \
--train-evolutionary-noise-scale 0.03 \
--train-evolutionary-max-grad-norm 1.0 \
--train-evolutionary-selection-method best \
--train-evolutionary-min-improvement 0.01 \
--train-evolutionary-eval-frequency 5 \
--train-evolutionary-warmup-steps 200 \
--yes
For SFT experiments, the cloud and experiment surfaces now accept:
- --train-save-steps: override checkpoint save frequency (steps)
- --train-save-total-limit: override max checkpoints kept
- --train-lora-r
- --train-lora-alpha
- --train-lora-dropout
- --train-use-dora
- --train-use-rslora
- --train-init-lora-weights
- --train-lora-target-modules
Gotcha:
- init_lora_weights=loftq is the only research-style init currently wired for the stable Unsloth SFT path. EVA, PiSSA, and OLoRA still need a separate PEFT-first path.
- lora_target_modules: all-linear is forwarded now, but the legacy Unsloth path is still the riskier runtime. Use an explicit module list for the stable baseline.
Cloud training + eval in one flow:
python tuner.py cloud-pipeline --method sft --preset full
Full experiment with train → eval → exact loss → analysis:
python tuner.py run-experiment \
--experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml \
--yes
Blind hardware planning before launch:
python tuner.py plan-hardware \
--experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml \
--optimize-for balanced
Run experiment with auto hardware selection:
python tuner.py run-experiment \
--experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml \
--auto-hardware \
--optimize-for cost \
--yes
Prepare a fixed-tier GPU benchmark matrix for the same model:
python tuner.py plan-hardware --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_l4x1.yaml --optimize-for cost
python tuner.py plan-hardware --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_a10g_small.yaml --optimize-for cost
python tuner.py plan-hardware --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_a100_large.yaml --optimize-for cost
Use this when you want three comparable full-pipeline runs of the same model across increasing GPU tiers. Keep the GPU pinned in the spec, leave training batch/grad accumulation unset, and let run-experiment --auto-hardware fill the batch shape for that tier.
Gotcha:
- --auto-hardware can change batch_size and gradient_accumulation per hardware tier. When comparing speed/cost, always record the resolved training shape alongside the GPU flavor. Otherwise you can misread a faster run as "better hardware" when part of the gain came from a larger planner-selected batch.
- run-experiment analysis now appends or updates the checked-in benchmark ledger automatically when analysis runs. If you are comparing ad hoc runs outside experiment orchestration, add or backfill the ledger deliberately.
- When launching several experiment specs in one sweep, do not submit them back-to-back by hand. Use python3 scripts/launch_experiment_batch.py ... --stagger-seconds 5. That avoids same-second job submission races and keeps artifact prefixes easier to track.
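Recording the resolved training shape matters because two runs on different GPU flavors may not be doing the same amount of work per optimizer step. The sketch below shows the usual arithmetic, using the batch shapes from the two cloud-pipeline examples earlier in this section. The function and field names are hypothetical, and it assumes a single data-parallel replica unless `num_gpus` is passed.

```python
def effective_batch_size(batch_size: int, gradient_accumulation: int, num_gpus: int = 1) -> int:
    """Sequences consumed per optimizer step across all data-parallel replicas."""
    return batch_size * gradient_accumulation * num_gpus


def run_record(gpu_flavor: str, batch_size: int, gradient_accumulation: int) -> dict:
    """Bundle the GPU flavor with the resolved shape so comparisons stay honest."""
    return {
        "gpu_flavor": gpu_flavor,
        "batch_size": batch_size,
        "gradient_accumulation": gradient_accumulation,
        "effective_batch_size": effective_batch_size(batch_size, gradient_accumulation),
    }


small = run_record("a10g-small", batch_size=8, gradient_accumulation=4)
large = run_record("a100-large", batch_size=24, gradient_accumulation=2)
print(small["effective_batch_size"], large["effective_batch_size"])  # 32 48
```

Here the a100-large run takes 1.5x more sequences per optimizer step, so a step-count or wall-clock comparison between the two is not like-for-like until that difference is accounted for.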
Launch a staggered experiment-spec benchmark batch:
python3 scripts/launch_experiment_batch.py \
Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_l40sx1_pruned.yaml \
Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_a100_large_pruned.yaml \
--auto-hardware \
--optimize-for cost \
--yes
Resume or slice the experiment pipeline:
python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml --from-stage evaluation --yes
python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml --only-stage loss --yes
python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml --skip-stage analysis --skip-stage recommendation --yes
Inspect the finished experiment bundle and next-run candidates:
python tuner.py analyze-experiment --experiment-id latest
python tuner.py analyze-experiment --experiment-id exp_20260321_221651 --json
Analyze a dataset against finished per-example loss before pruning:
python3 scripts/prune_dataset_from_loss.py \
--dataset-path Datasets/synthchat/my_dataset.jsonl \
--experiment-id exp_20260322_103122 \
--analyze-only
Apply a generic pruning rule:
python3 scripts/prune_dataset_from_loss.py \
--dataset-path Datasets/synthchat/my_dataset.jsonl \
--experiment-id exp_20260322_103122 \
--strategy loss_threshold \
--min-loss 2.0
Apply a capped generic prune budget instead of a hard loss threshold:
python3 scripts/prune_dataset_from_loss.py \
--dataset-path Datasets/synthchat/my_dataset.jsonl \
--experiment-id exp_20260322_103122 \
--strategy top_percent \
--top-percent 0.02
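To make the two generic strategies concrete, here is a sketch of the selection rules their names suggest. It is an illustration under stated assumptions, not the logic of prune_dataset_from_loss.py: it assumes loss_threshold selects rows at or above --min-loss, that --top-percent is a fraction (0.02 meaning 2%), and that the budget rounds up. The row IDs and function names are hypothetical.

```python
import math
from typing import Dict, List


def select_by_loss_threshold(losses: Dict[str, float], min_loss: float) -> List[str]:
    """Row IDs whose per-example loss is at or above min_loss (assumed inclusive)."""
    return sorted(row_id for row_id, loss in losses.items() if loss >= min_loss)


def select_top_percent(losses: Dict[str, float], top_percent: float) -> List[str]:
    """Row IDs for the highest-loss fraction of the dataset, highest loss first."""
    if not 0 < top_percent <= 1:
        raise ValueError("top_percent is treated as a fraction in (0, 1]")
    budget = math.ceil(len(losses) * top_percent)  # assumed rounding rule
    ranked = sorted(losses, key=lambda row_id: losses[row_id], reverse=True)
    return ranked[:budget]


losses = {"row-a": 0.5, "row-b": 2.5, "row-c": 2.0, "row-d": 1.0}
print(select_by_loss_threshold(losses, min_loss=2.0))  # ['row-b', 'row-c']
print(select_top_percent(losses, top_percent=0.25))    # ['row-b']
```

The threshold rule removes however many rows exceed the cutoff, while the top-percent rule caps the prune budget regardless of how the losses are distributed.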
Apply the current repo-specific result-echo preset and publish in one step:
python3 scripts/prune_dataset_from_loss.py \
--dataset-path Datasets/synthchat/my_dataset.jsonl \
--experiment-id exp_20260322_103122 \
--strategy result_echo_linesdelta_recommendations \
--publish-repo professorsynapse/claudesidian-synthetic-dataset
Use this only when the analysis report shows that text_only result-echo recap rows are genuinely overrepresented in the high-loss slice. Do not assume that family is always the right prune target.
Environment-backed gym against latest trained adapter on HF:
python tuner.py cloud-gym --run latest --method sft
Tabular LightGBM experiment:
python tuner.py ml train --config Trainers/ml/configs/templates/regression.yaml
Environment Variables
HF_TOKEN=hf_...
WANDB_API_KEY=...
MODAL_TOKEN_ID=...
MODAL_TOKEN_SECRET=...
RUNPOD_API_KEY=...
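The HF Jobs notes in this document advise treating blank HF_TOKEN / HF_API_KEY values as unset. A minimal sketch of that rule follows. `resolve_hf_token` is a hypothetical helper, and checking HF_TOKEN before HF_API_KEY is an assumption about precedence.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-blank token, or None when both variables are blank or missing."""
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HF_API_KEY"):  # precedence order is assumed
        value = (env.get(name) or "").strip()
        if value:
            return value
    return None
```

Returning `None` for whitespace-only values lets the caller fail early with a clear message instead of sending an empty bearer credential.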
Output Structure
Local trainers produce timestamped run directories:
{method}_output/YYYYMMDD_HHMMSS/
├── checkpoints/
├── logs/
├── final_model/
└── training_lineage.json
Cloud runs use the canonical provider-native layout:
runs/{provider}/{method}/{timestamp}-{shortsha}/
├── checkpoints/
├── capacity_features.json
├── logs/
├── final_model/
├── training_lineage.json
└── manifest.json
Provider-native storage defaults:
- hf_jobs → Hugging Face Bucket
- modal → Modal Volume
- runpod → RunPod Network Volume
HF Jobs Notes
- Launch from a clean tracked worktree and a pushed commit only; the remote container checks out that exact SHA.
- Keep the main training interpreter compatible with Unsloth and Transformers; any Buckets-only huggingface_hub upgrade must stay isolated from the trainer runtime.
- Keep the bucket-sync overlay self-contained for the Hub client stack: install huggingface_hub>=1.5.0, hf_transfer, and hf_xet into the overlay. Do not rely on the base Unsloth image's hf_xet; older system copies can fail final sync with ImportError: cannot import name 'SKIP_SHA256'.
- Pass HF_TOKEN into the cloud job explicitly; do not assume HF Jobs injects it automatically.
- Treat blank HF_TOKEN / HF_API_KEY values as unset, otherwise bucket sync can fail with an empty Authorization: Bearer header.
- For post-training cloud evaluation, prefer python tuner.py cloud-eval --run latest --preset full.
- Keep cloud-eval overlays split by responsibility. The evaluator runtime PYTHONPATH may include lightweight evaluator deps only; Buckets-only packages (huggingface_hub>=1.5.0, hf_transfer, hf_xet) must stay off the evaluator PYTHONPATH and be exposed only through HF_BUCKET_SYNC_PYTHONPATH. Otherwise base-image transformers can import an incompatible Hub client.
- run-experiment now supports stage controls: --only-stage, --from-stage, and repeated --skip-stage.
- run-experiment --auto-hardware uses a blind planner: model size, method, seq length, quantization, and live HF flavor pricing. It does not require prior telemetry.
- plan-hardware is the inspection surface for that same planner.
- When using --auto-hardware, treat the resolved batch_size / gradient_accumulation as part of the experiment definition. Compare wall-clock and cost only after checking those resolved values.
- After training finishes, read training_lineage.json before declaring the hardware/batch shape "good enough." If peak reserved VRAM is still far below device capacity, the run is underpacked and the next iteration should push batch utilization harder.
- On a100-large, the default posture should be: push it until it breaks, then back off one notch. A large A100 headroom cushion is usually wasted iteration speed.
- Finished experiments now write .tracking/experiments/<id>/analysis/ with: experiment_summary.json, run_matrix.csv, feature_dataset.{jsonl,csv}, next_run_candidates.json, draft_next_spec.yaml.
- For the common train-then-evaluate flow, prefer python tuner.py cloud-pipeline --method sft --preset full.
- cloud-pipeline is currently a two-job orchestration on HF Jobs, not a single remote composite job.
- run-experiment is the higher-level experiment loop: training stays provider-native, then evaluation and exact dataset loss run as separate sibling post-training jobs by default.
- Use evaluation.runtime: vllm in experiment specs when you want the fast eval server path. The exact loss stage still uses a post-eval transformers forward pass.
- If you explicitly want the older embedded path for a smoke run, set post_training.mode: same_job in the experiment spec. Default is parallel.
- For cloud-eval --with-loss or post_training.mode: same_job, first rely on the selected eval image's ML stack. If a truly missing package must be added, use explicit image-compatible pins or --no-deps stage overrides; do not let pip resolve a fresh Torch/Transformers stack inside the evaluator runtime overlay.
- Checkpoint-vs-checkpoint comparison is not automatic in smoke runs; you only get that if the trainer emitted multiple checkpoints and you intentionally run checkpoint evaluation / experiment-loop workflows.
- For SFT model-comparison experiments, use cloud-pipeline with --train-* overrides so the experiment lands in canonical HF training storage instead of runs/hf_jobs/custom/....
- When testing newer upstream Unsloth runtimes, switch images with --train-image-profile next instead of upgrading packages in the old stable image.
- To inspect a finished HF cloud evaluation run from the bucket, use python tuner.py cloud-inspect --run latest --eval-run latest --method sft.
- Completed SFT/KTO runs save a flat capacity_features.json artifact designed for tabular modeling or capacity prediction.
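The "underpacked" judgment in the notes above can be expressed as a headroom check. This is a sketch only: the real field names in training_lineage.json are not shown in this document, so the inputs are passed in directly, and the 25% cutoff is an illustrative assumption, not a repo default.

```python
def headroom_fraction(peak_reserved_gb: float, device_capacity_gb: float) -> float:
    """Share of device memory that was never reserved during training."""
    if device_capacity_gb <= 0:
        raise ValueError("device_capacity_gb must be positive")
    return max(0.0, device_capacity_gb - peak_reserved_gb) / device_capacity_gb


def is_underpacked(
    peak_reserved_gb: float,
    device_capacity_gb: float,
    max_headroom: float = 0.25,  # illustrative cutoff, not a repo default
) -> bool:
    """True when unused VRAM exceeds the allowed headroom fraction."""
    return headroom_fraction(peak_reserved_gb, device_capacity_gb) > max_headroom
```

For example, a run that peaks at 40 GB reserved on an 80 GB card leaves half the device idle and would be flagged, while one peaking at 72 GB would not.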
Tips
- Always run --dry-run first to verify setup without training.
- Use --tier quick for fast prototyping and --tier thorough for final quality runs.
- Use --model-size 3b for fast iteration and 7b for production-style runs.
- SFT with packing: true is much faster.
- KTO datasets must be interleaved True/False.
- GRPO rewards are YAML-driven; edit configs/rewards/, not Python.
- Env-GRPO multi-turn rollouts are token-faithful by default (env_training.token_faithful: true, requires trl>=0.28.0); the rollout emits a per-token env_mask so GRPO trains only on sampled tokens. See reference/grpo-training.md → "Token-Faithful Multi-Turn Rollout".
- fitness.yaml uses FitnessEvaluator for structural validation.
- After training, use surgery only after you have a stable baseline and evaluation scenario.
- training_lineage.json tracks full provenance for reproducibility.
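The KTO interleaving tip can be checked before a run. The sketch below assumes "interleaved" means strictly alternating labels and that each row carries a boolean `label` field alongside `prompt` and `completion`; check reference/dataset-formats.md for the exact shape the trainer expects. Both helper names are hypothetical.

```python
from typing import Iterable, List


def is_interleaved(labels: Iterable[bool]) -> bool:
    """True when no two neighboring labels are equal (strict alternation assumed)."""
    values = list(labels)
    return all(a != b for a, b in zip(values, values[1:]))


def interleave(positives: List[dict], negatives: List[dict]) -> List[dict]:
    """Pair rows as True, False, True, False; extra rows on the longer side are dropped."""
    rows = []
    for good, bad in zip(positives, negatives):
        rows.append({**good, "label": True})
        rows.append({**bad, "label": False})
    return rows
```

Running `is_interleaved` over the `label` column of a JSONL file is a cheap pre-flight check that fits alongside the --dry-run habit.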