fine-tuning

star 23

Complete reference for the fine-tuning pipeline (SFT, KTO, GRPO), cloud HF Jobs workflows, autonomous experiment search, checkpoint evaluation, and LoRA surgery. Covers training CLI flags, YAML configuration, model presets, dataset requirements, LoRA settings, training monitoring, hyperparameter search, and post-training optimization. Use when training models, configuring training runs, choosing hyperparameters, running cloud experiments, inspecting HF jobs, or troubleshooting training issues. This skill is about USING the training system via CLI and YAML — never modifying source code.

ProfSynapse By ProfSynapse schedule Updated 6/12/2026

name: fine-tuning description: Complete reference for the fine-tuning pipeline (SFT, KTO, GRPO), cloud HF Jobs workflows, autonomous experiment search, checkpoint evaluation, and LoRA surgery. Covers training CLI flags, YAML configuration, model presets, dataset requirements, LoRA settings, training monitoring, hyperparameter search, and post-training optimization. Use when training models, configuring training runs, choosing hyperparameters, running cloud experiments, inspecting HF jobs, or troubleshooting training issues. This skill is about USING the training system via CLI and YAML — never modifying source code. allowed-tools: Read, Bash, Write, Grep, Glob

Fine-Tuning Pipeline

Train language models with SFT, KTO, and GRPO locally or on supported cloud providers. This skill also covers the Karpathy-style experiment loop, checkpoint evaluation, LoRA surgery, and the current HF Jobs operational path.

Quick Reference

Task Command
Interactive menu ./run.sh → Train
Local Docker config run python tuner.py local-run --job-config Trainers/recipes/<recipe>.yaml --yes
SFT training cd Trainers/sft && python train_sft.py --model-size 7b
KTO training cd Trainers/kto && python train_kto.py --model-size 7b
GRPO training cd Trainers/grpo && python train_grpo.py
Pivot-profile GRPO dataset cd Trainers/grpo && python train_grpo.py --config configs/pivot_config.yaml --pivot-profile-only
GRPO with pivot filtering cd Trainers/grpo && python train_grpo.py --config configs/pivot_config.yaml
Env-backed GRPO cd Trainers/grpo && python train_env_grpo.py --config ./configs/env_config.yaml --dry-run
Experiment loop python tuner.py experiment-loop --experiment-config configs/flywheel/experiment_loop.yaml
LoRA surgery python tuner.py surgery --surgery-config configs/lora_surgery.yaml
HF custom job python tuner.py cloud-run --job-config Trainers/recipes/<recipe>.yaml
Canonical HF train+eval python tuner.py cloud-pipeline --method sft --preset full
Full experiment bundle python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/<spec>.yaml --yes
Evolutionary SFT smoke test python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/<evolutionary-spec>.yaml --yes
Staggered experiment batch python3 scripts/launch_experiment_batch.py Trainers/cloud/experiments/<spec1>.yaml Trainers/cloud/experiments/<spec2>.yaml --yes
Blind hardware plan python tuner.py plan-hardware --experiment-spec Trainers/cloud/experiments/<spec>.yaml
Analyze finished experiment python tuner.py analyze-experiment --experiment-id latest
Analyze/prune dataset from loss python3 scripts/prune_dataset_from_loss.py --dataset-path ... --experiment-id ... --analyze-only
Standalone prompt optimization python tuner.py prompt-optimize --prompt-opt-config configs/prompt_optimization/NAME.yaml
Prompt-optimize SynthChat generation python -m SynthChat.run generate --prompt-opt-config configs/prompt_optimization/NAME.yaml [options]
Analyze bucket-backed run python tuner.py bucket analyze --path runs/hf_jobs/sft/<run-prefix>/
Read bucket artifact python tuner.py bucket read --path runs/.../logs/training_latest.jsonl --jsonl-latest --pretty
List bucket prefix python tuner.py bucket list --path runs/hf_jobs/sft/<run-prefix>/ --limit 20
Pull bucket prefix locally python tuner.py bucket pull --path runs/hf_jobs/sft/<run-prefix>/ --dest .
Push local artifact to bucket python tuner.py bucket push --path local/results.json --dest runs/manual_uploads/
Live HF job list python tuner.py cloud-jobs list
Live HF job logs python tuner.py cloud-jobs logs --job professorsynapse/<job-id> --tail 200
Cloud eval against a run python tuner.py cloud-eval --run latest --preset full
HF gym against trained model python tuner.py cloud-gym --run latest --method sft
Warm Space scaffold python3 Trainers/cloud/scripts/manage_space.py render --template vllm_warm --output-dir /tmp/my-space --base-image ghcr.io/<org>/<image>:<tag>
Warm Space deploy python3 Trainers/cloud/scripts/manage_space.py deploy --space-id <user>/<space> --template vllm_warm --base-image ghcr.io/<org>/<image>:<tag> --hardware a10g-small --sleep-time 3600 --var BASE_MODEL=<model>
ML training python tuner.py ml train --config Trainers/ml/configs/templates/regression.yaml

Training Methods at a Glance

Method Purpose LR Epochs Dataset When to Use
SFT Teach format and behavior 2e-4 3 Positive examples only First stage
KTO Refine with preferences 1e-6 1 Interleaved True/False Second stage
GRPO Optimize against rewards 5e-6 1 Prompts + ground truth Final online stage
Embedding Train a retrieval bi-encoder 2e-5 1 Triplets / pairs Retrieval / RAG embedders

Recommended pipeline: SFT → KTO → GRPO

Embedding training (SentenceTransformer bi-encoders for retrieval) is a separate method with its own registry, dual loader, adapter modes (full/lora/frozen_head), and corpus-level retrieval evaluation. It rides the same local-run/recipe path as SFT. For the full surface use the dedicated embedding-training skill; the triplet/retrieval data shapes are in reference/dataset-formats.md below.

Complexity Tiers

Use --tier on the local SFT and KTO trainers when you want a preset instead of hand-tuning LoRA rank and LR.

Tier LoRA Rank LR Time Use Case
quick r=8 5e-4 ~5 min Prototyping and smoke runs
standard r=64 2e-4 ~30-60 min Normal training
thorough r=128 1e-4 ~2-4 hrs Quality-focused final runs

Key Directories

  • Trainers/sft/ — active SFT trainer
  • Trainers/recipes/ — unified training recipe configs (local Docker + HF Jobs); each recipe declares target: local|cloud|both and method: sft|kto|grpo
  • Evaluator/recipes/ — unified evaluation recipe configs
  • Trainers/kto/ — KTO trainer
  • Trainers/grpo/ — GRPO and env-GRPO trainer
  • Trainers/embedding/ — embedding (SentenceTransformer bi-encoder) trainer, registry, and dual loader; see the embedding-training skill
  • Trainers/archive/legacy_rtx3090/ — archived legacy RTX3090 trainer snapshots and outputs; do not use for new runs
  • Datasets/ — JSONL training datasets
  • SynthChat/scenarios/ — synthetic data and environment-backed scenarios
  • shared/flywheel/ — autonomous experiment-loop logic and LightGBM surrogate search
  • Trainers/ml/ — traditional ML / LightGBM training for tabular experiment analysis
  • shared/experiment_tracking/ — unified run and experiment tracking

CLI Discipline

  • Never cancel a job, delete bucket artifacts, remove files, or relaunch a cost-incurring cloud run unless the user has explicitly approved that exact action in the current conversation.
  • Treat cancel/delete/relaunch as irreversible or materially destructive operator actions. Do not infer permission from surrounding context or from a user's broader goal.
  • Do not guess command names or flags from memory.
  • Before giving command guidance, check tuner/cli/parser.py, tuner/cli/router.py, or the real --help output.
  • Prefer repo CLIs and checked-in scripts over ad hoc Python snippets.
  • After benchmark runs complete, treat the checked-in benchmark ledger as part of the workflow:
  • Treat stage lineage as the source of truth for automation:
    • training: training_lineage.json
    • evaluation: evaluation_lineage.json
    • exact loss: loss_lineage.json
  • Treat loss_summary.json as a supporting artifact, not the canonical final loss metadata file.
  • The ledger should accumulate real model-size / hardware / timing / cost data so future hardware planning can optimize against observed evidence instead of memory.
  • For local trainer iteration, use the checked-in train_sft.py, train_kto.py, and train_grpo.py entrypoints.
  • For repeatable local GPU training, prefer python tuner.py local-run --job-config Trainers/recipes/<recipe>.yaml --yes over ad hoc docker run commands. Put the model, dataset, Docker image, package overrides, LoRA settings, training knobs, and artifact paths in YAML.
  • For Windows Docker Desktop with GPU, prefer job.transfer: auto or copy in local-run configs. The runner chooses copy mode on Windows because GPU bind mounts can fail with access denied.
  • Keep newly released model support in local-run setup.pip pins or image fields. Do not leave one-off package installs in shell history.
  • For canonical HF experiments, prefer python tuner.py cloud-pipeline ... over cloud-run.
  • For full train → eval → exact loss → analysis → recommendation runs, prefer python tuner.py run-experiment ....
  • Evolutionary SFT is experimental but now first-class in the cloud experiment path. Prefer a checked-in experiment spec or cloud-pipeline --train-evolutionary-* overrides over editing trainer YAMLs by hand.
  • Use evolutionary training for technical validation or targeted experiments first. It adds significant per-step compute overhead, so smoke tests should usually cap max_steps before you commit to a long run.
  • For multi-spec benchmark launches, prefer python3 scripts/launch_experiment_batch.py ... over back-to-back manual submissions. It defaults to a 5-second stagger.
  • For blind stage hardware selection before launch, use python tuner.py plan-hardware ....
  • For live HF status and traceback inspection, use python tuner.py cloud-jobs ....
  • For finished experiment bundles and next-run suggestions, use python tuner.py analyze-experiment ....
  • For loss-driven dataset cleanup, start with python3 scripts/prune_dataset_from_loss.py ... --analyze-only; only apply a pruning rule after checking which families are actually enriched in the high-loss slice.
  • Prefer the generic pruning strategies first (loss_threshold, top_percent). Use repo-specific presets only when the analysis output shows that exact family is genuinely overrepresented.
  • For in-flight cloud-run health checks, inspect the bucket-backed artifacts first (training_latest.jsonl, stage_summary.json, training_lineage.json, eval/loss partials). Use raw HF logs only as a fallback when the bucket prefix has not started writing yet.
  • For quick bucket spot checks, use python tuner.py bucket read ... or python tuner.py bucket list ... instead of manual hf buckets cp commands.
  • For local inspection or offline diffing, use python tuner.py bucket pull ... to sync a bucket-relative path into the current workspace while preserving its relative path.
  • For one-off uploads back into the HF artifact bucket, use python tuner.py bucket push ... instead of ad hoc sync_bucket snippets.
  • For a100-large or larger tiers, bias toward aggressive packing. Do not lower batch just because the adapter recipe changed. Start from the highest known-good packed shape for the same model family and only back off after a real OOM or clear instability signal.
  • Treat large unused VRAM on a100-large as a mistake, not a comfort margin. If training_lineage.json shows tens of GB of reserved headroom, the run is underpacked and the next iteration should push batch size harder even if that risks OOM.
  • For vLLM eval on multi-GPU hardware, prequantized BitsAndBytes base models (for example *-bnb-4bit) cannot use tensor parallelism. Do not assume x4 means vLLM will shard generation across all GPUs; in this path, eval may need to fall back to single-GPU while exact loss still fans out across all visible GPUs afterward.
  • For experiment specs in this repo, set evaluation.runtime: vllm and evaluation.image_profile: fast_vllm by default. Do not silently switch an experiment to unsloth eval just because the model family is new. If vLLM compatibility is uncertain, verify current official support and ask the user before deviating from the vLLM path.
  • When the user says "latest run" or "latest experiment", distinguish latest attempt from latest completed baseline. Check .tracking/experiments/*/experiment.json sorted by created_at first, then state explicitly which interpretation you are using before copying hyperparameters forward.
  • When choosing an A100 packed shape, prefer the nearest latest attempt that actually exercised the hardware over an older completed baseline that clearly underpacked the card.
  • If run-experiment refuses to launch because the tracked worktree is dirty, prefer creating a clean temporary git worktree and launching from there over asking the user to stash or cleaning their checkout.
  • If a cloud run fails before bucket artifacts appear, treat it as a bootstrap/runtime problem first. Inspect cloud-jobs logs before changing training hyperparameters.
  • For newly released architectures or day-zero model launches, verify official Docker Hub tags for unsloth/unsloth and vllm/vllm-openai before trusting the repo's pinned image profiles. As of 2026-04-02, Docker Hub shows unsloth/unsloth:latest updated 1 day ago and vllm/vllm-openai:latest / v0.17.1 updated about 17 hours ago.
  • As of 2026-04-22, local docker pull unsloth/unsloth:latest resolved to digest sha256:9be56babef4efc330316cff3a65f9f911b9e7709bce4114fa7817ba3ffd8565d. That image still reports transformers 4.57.1, so Qwen3.5 local runs need config-level package overrides such as transformers==5.5.0, trl==0.22.2, and current unsloth / unsloth_zoo.
  • If a named training image profile is broken, prefer an explicit training.cloud_image override to a currently verified official image tag over changing unrelated parts of the experiment such as evaluation backend.
  • If a run needs newer package versions but the right base image is otherwise close, use stage-local experiment-spec pip_packages pins under training:, evaluation:, or loss: instead of a one-off helper script or a repo-global image pin.
  • For hyperparameter search, use python tuner.py experiment-loop ...; this is the built-in LLM + LightGBM surrogate path.
  • For tabular post-hoc models, use python tuner.py ml ... and the configs under Trainers/ml/configs/templates/.

Research Note Lifecycle

Use .skills/research-reporting/SKILL.md alongside this skill whenever the user wants experiment tracking, research summaries, or reusable post-run analysis.

  • When a training run or experiment is launched, create the research note immediately rather than waiting for all stages to finish.
  • The initial note should capture config intent from experiment.json and spec_path, with stage_statuses set from the current known state and metrics left null or empty where evidence does not exist yet.
  • After training finishes, update the same note with training provenance from training_lineage.json or stage_details.training.
  • After evaluation finishes, update the same note with evaluation metrics and observed failure patterns.
  • After loss finishes, update the same note with loss metrics, high-loss hashes, and any dataset-review signals.
  • After analysis/recommendation finishes, update the same note with hypotheses, selected candidate rank, and recommended next action.
  • Do not fork separate notes per stage unless the user explicitly asks for per-stage notes. Default is one note per experiment, updated over time.
  • Treat the note as a living artifact for the run: launch state, in-flight updates, and final postmortem should all land in the same document.
  • If a stage has not run yet or failed, preserve the section and leave missing values explicit instead of inventing placeholders.

Progressive Reference

Load the specific reference you need:

Reference When to Load Path
SFT Training Running SFT, configuring SFT params reference/sft-training.md
KTO Training Running KTO, dataset interleaving, preference tuning reference/kto-training.md
GRPO Training Running GRPO, reward config, GSPO variant reference/grpo-training.md
Model Presets Choosing models, VRAM planning, LoRA settings reference/model-presets.md
Dataset Formats Preparing datasets, format requirements per method reference/dataset-formats.md
Training Config YAML config deep-dive reference/training-config.md
Cloud Training Provider-native persistence, exact-commit rules, cloud smoke tests reference/cloud-training.md
Cloud Experiments Canonical train→eval launches with --train-* overrides reference/cloud-experiment-launching.md
Checkpoint Evaluation Best-checkpoint selection via eval reference/checkpoint-evaluation.md
Experiment Loop Autonomous hyperparameter search (LLM + LightGBM) reference/experiment-loop.md
LoRA Techniques LoRA variants, init methods, config recipes reference/lora-techniques.md
Evolutionary Config Experimental gradient-selection config schema and defaults reference/training-config.md
LoRA Surgery Eval-guided post-training weight optimization reference/lora-surgery.md
Troubleshooting OOM errors, instability, platform issues reference/troubleshooting.md
Env Alignment Protocol Canonical SynthChat → SFT → merge/publish → KTO → env-GRPO flow protocols/environment-backed-alignment-pipeline.md

LoRA Technique Configs

Reference config templates for LoRA variants in .skills/fine-tuning/configs/:

Template Technique When to Use
regret_free.yaml High-rank + rsLoRA + all-linear Full-FT quality from LoRA
dora.yaml DoRA (weight-decomposed) Drop-in quality boost
qlora_dora.yaml QLoRA + DoRA Single GPU, VRAM-limited
pissa.yaml PiSSA (SVD init) Fast convergence
eva.yaml EVA (activation SVD init) Small datasets
olora.yaml OLoRA (QR init) Fast convergence, simpler than EVA
loftq.yaml LoftQ (quant-aware init) Minimize 4-bit quality loss
grpo_minimal.yaml Minimal rank for RL GRPO/reasoning tasks

See reference/lora-techniques.md for full details, integration status, and compatibility notes.

Common Patterns

Quick SFT test run:

cd Trainers/sft
python train_sft.py --model-size 3b --tier quick --dry-run

Config-driven local Docker SFT smoke run:

python tuner.py local-run \
  --job-config Trainers/recipes/qwen35_2b_sft_smoke.yaml \
  --yes

For a different local SFT run, copy a recipe under Trainers/recipes/ (one with target: local or target: both) and change model, dataset, training, lora, job.image, and setup.pip as needed. Use repo-relative local dataset paths; the runner translates them for the container.

Config-driven local Docker embedding smoke run:

python tuner.py local-run \
  --job-config Trainers/recipes/embedding_bge_base_smoke.yaml \
  --yes

Trains a small bge-base-en embedding adapter for a few steps. The recipe rides the modern Unsloth image and layers sentence-transformers + faiss-cpu + datasets via setup.pip (pins are TBD-pending the cloud smoke — captured from the first working run, never invented). For the full embedding surface (registry, adapter modes, retrieval eval) use the embedding-training skill.

Generate a dataset with prompt optimization provenance:

python -m SynthChat.run generate \
  --prompt-opt-config configs/prompt_optimization/synthchat_smoke.yaml \
  --output Datasets/synthchat/prompt_opt_dryrun.jsonl

Use prompt optimization as an opt-in dataset-generation step before training, not as an implicit trainer behavior. SynthChat records the optimizer artifact path and selected candidate ID in row metadata and leaves source YAML unchanged.

Run prompt optimization before dataset generation:

python tuner.py prompt-optimize \
  --prompt-opt-config configs/prompt_optimization/labkit_epistemic_humility_evaluator_smoke.yaml

Prompt optimization is a config-first workflow. Keep subjects, operators, evaluation scenarios, objectives, and stopping rules in configs/prompt_optimization/*.yaml; promote an optimized prompt into canonical scenario/config YAML only after human review. The practical evaluator threshold default is stopping.target_score: 0.8.

The checked-in evaluator smoke config should stay dry-run safe: configs/prompt_optimization/labkit_epistemic_humility_evaluator_smoke.yaml uses evaluation.evaluator.dry_run: true. For a real LM Studio smoke, use an explicit local override or temporary manual copy rather than changing the checked-in config. Keep that real smoke minimal: population_size: 3 and max_generations: 1.

KTO with local dataset:

cd Trainers/kto
python train_kto.py --model-size 7b --local-file ../../Datasets/my_kto_data.jsonl

GRPO continuing from SFT checkpoint:

# Edit configs/config.yaml to set model.lora_path to SFT checkpoint
cd Trainers/grpo
python train_grpo.py

Autonomous hyperparameter search:

python tuner.py experiment-loop \
  --experiment-config configs/flywheel/experiment_loop.yaml \
  --max-experiments 10

Post-training LoRA surgery:

python tuner.py surgery --surgery-config configs/lora_surgery.yaml

Inspect live HF jobs:

python tuner.py cloud-jobs list --limit 20
python tuner.py cloud-jobs show --job professorsynapse/<job-id>
python tuner.py cloud-jobs logs --job professorsynapse/<job-id> --tail 200

Gotcha:

  • cloud-jobs show is useful, but for real progress checks prefer the bucket-backed artifacts once they exist. Early raw logs are often just bootstrap noise; the bucket tells you when training/eval/loss has actually started producing useful state.

Read the latest training record from a bucket-backed run:

python tuner.py bucket read \
  --path runs/hf_jobs/sft/<run-prefix>/logs/training_latest.jsonl \
  --jsonl-latest \
  --pretty

Tail the structured progress log or summary artifact directly from the bucket:

python tuner.py bucket analyze \
  --path runs/hf_jobs/sft/<run-prefix>/

python tuner.py bucket read \
  --path runs/hf_jobs/sft/<run-prefix>/logs/training_latest.jsonl \
  --tail 5

python tuner.py bucket read \
  --path runs/hf_jobs/sft/<run-prefix>/logs/stage_summary.json \
  --pretty

python tuner.py bucket list \
  --path runs/hf_jobs/sft/<run-prefix>/ \
  --limit 20

python tuner.py bucket pull \
  --path runs/hf_jobs/sft/<run-prefix>/analysis/loss/ \
  --dest .

python tuner.py bucket push \
  --path local/analysis/notes.json \
  --dest runs/manual_uploads/

Gotcha:

  • Use bucket analyze first when a run has finished. It summarizes training_lineage.json, the newest evaluation_lineage.json, and loss_lineage.json in one pass.
  • If the same training run has multiple eval reruns or alternate loss prefixes, pass --eval-path and/or --loss-path explicitly so the summary cannot attach to the wrong post-training artifacts.
  • If schema pass rate is strong but behavior pass rate is still weak, do not assume you need more tool-call syntax SFT. That pattern usually means the next bottleneck is text-only restraint, clarification, or verify-before-action behavior.

Canonical one-off HF experiment with direct overrides:

python tuner.py cloud-pipeline --method sft --preset full \
  --train-model-name Qwen/Qwen3.5-2B \
  --train-image-profile next \
  --train-gpu a10g-small \
  --train-batch-size 8 \
  --train-gradient-accumulation 4 \
  --train-lora-r 128 \
  --train-lora-alpha 256 \
  --train-use-rslora \
  --train-use-dora \
  --train-no-load-in-4bit \
  --train-lora-target-modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --yes

python tuner.py cloud-pipeline --method sft \
  --train-model-name Qwen/Qwen3-4B \
  --train-gpu a100-large \
  --train-batch-size 24 \
  --train-gradient-accumulation 2 \
  --train-learning-rate 1e-4 \
  --train-max-steps 60 \
  --train-evolutionary-enabled \
  --train-evolutionary-candidates 4 \
  --train-evolutionary-eval-batch-size 2 \
  --train-evolutionary-validation-config configs/fitness/tool_calling.yaml \
  --train-evolutionary-strategy gradient_noise \
  --train-evolutionary-noise-scale 0.03 \
  --train-evolutionary-max-grad-norm 1.0 \
  --train-evolutionary-selection-method best \
  --train-evolutionary-min-improvement 0.01 \
  --train-evolutionary-eval-frequency 5 \
  --train-evolutionary-warmup-steps 200 \
  --yes

For SFT experiments, the cloud and experiment surfaces now accept:

  • --train-save-steps — Override checkpoint save frequency (steps)
  • --train-save-total-limit — Override max checkpoints kept
  • --train-lora-r
  • --train-lora-alpha
  • --train-lora-dropout
  • --train-use-dora
  • --train-use-rslora
  • --train-init-lora-weights
  • --train-lora-target-modules

Gotcha:

  • init_lora_weights=loftq is the only research-style init currently wired for the stable Unsloth SFT path. EVA, PiSSA, and OLoRA still need a separate PEFT-first path.
  • lora_target_modules: all-linear is forwarded now, but the legacy Unsloth path is still the riskier runtime. Use an explicit module list for the stable baseline.

Cloud training + eval in one flow:

python tuner.py cloud-pipeline --method sft --preset full

Full experiment with train → eval → exact loss → analysis:

python tuner.py run-experiment \
  --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml \
  --yes

Blind hardware planning before launch:

python tuner.py plan-hardware \
  --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml \
  --optimize-for balanced

Run experiment with auto hardware selection:

python tuner.py run-experiment \
  --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml \
  --auto-hardware \
  --optimize-for cost \
  --yes

Prepare a fixed-tier GPU benchmark matrix for the same model:

python tuner.py plan-hardware --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_l4x1.yaml --optimize-for cost
python tuner.py plan-hardware --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_a10g_small.yaml --optimize-for cost
python tuner.py plan-hardware --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_a100_large.yaml --optimize-for cost

Use this when you want three comparable full-pipeline runs of the same model across increasing GPU tiers. Keep the GPU pinned in the spec, leave training batch/grad accumulation unset, and let run-experiment --auto-hardware fill the batch shape for that tier. Gotcha:

  • --auto-hardware can change batch_size and gradient_accumulation per hardware tier. When comparing speed/cost, always record the resolved training shape alongside the GPU flavor. Otherwise you can misread a faster run as “better hardware” when part of the gain came from a larger planner-selected batch.
  • run-experiment analysis now appends or updates the checked-in benchmark ledger automatically when analysis runs. If you are comparing ad hoc runs outside experiment orchestration, add or backfill the ledger deliberately.
  • When launching several experiment specs in one sweep, do not submit them back-to-back by hand. Use python3 scripts/launch_experiment_batch.py ... --stagger-seconds 5. That avoids same-second job submission races and keeps artifact prefixes easier to track.

Launch a staggered experiment-spec benchmark batch:

python3 scripts/launch_experiment_batch.py \
  Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_l40sx1_pruned.yaml \
  Trainers/cloud/experiments/qwen3_4b_full_cycle_benchmark_a100_large_pruned.yaml \
  --auto-hardware \
  --optimize-for cost \
  --yes

Resume or slice the experiment pipeline:

python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml --from-stage evaluation --yes
python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml --only-stage loss --yes
python tuner.py run-experiment --experiment-spec Trainers/cloud/experiments/qwen3_4b_full_cycle_full.yaml --skip-stage analysis --skip-stage recommendation --yes

Inspect the finished experiment bundle and next-run candidates:

python tuner.py analyze-experiment --experiment-id latest
python tuner.py analyze-experiment --experiment-id exp_20260321_221651 --json

Analyze a dataset against finished per-example loss before pruning:

python3 scripts/prune_dataset_from_loss.py \
  --dataset-path Datasets/synthchat/my_dataset.jsonl \
  --experiment-id exp_20260322_103122 \
  --analyze-only

Apply a generic pruning rule:

python3 scripts/prune_dataset_from_loss.py \
  --dataset-path Datasets/synthchat/my_dataset.jsonl \
  --experiment-id exp_20260322_103122 \
  --strategy loss_threshold \
  --min-loss 2.0

Apply a capped generic prune budget instead of a hard loss threshold:

python3 scripts/prune_dataset_from_loss.py \
  --dataset-path Datasets/synthchat/my_dataset.jsonl \
  --experiment-id exp_20260322_103122 \
  --strategy top_percent \
  --top-percent 0.02

Apply the current repo-specific result-echo preset and publish in one step:

python3 scripts/prune_dataset_from_loss.py \
  --dataset-path Datasets/synthchat/my_dataset.jsonl \
  --experiment-id exp_20260322_103122 \
  --strategy result_echo_linesdelta_recommendations \
  --publish-repo professorsynapse/claudesidian-synthetic-dataset

Use this only when the analysis report shows that text_only result-echo recap rows are genuinely overrepresented in the high-loss slice. Do not assume that family is always the right prune target.

Environment-backed gym against latest trained adapter on HF:

python tuner.py cloud-gym --run latest --method sft

Tabular LightGBM experiment:

python tuner.py ml train --config Trainers/ml/configs/templates/regression.yaml

Environment Variables

HF_TOKEN=hf_...
WANDB_API_KEY=...
MODAL_TOKEN_ID=...
MODAL_TOKEN_SECRET=...
RUNPOD_API_KEY=...

Output Structure

Local trainers produce timestamped run directories:

{method}_output/YYYYMMDD_HHMMSS/
├── checkpoints/
├── logs/
├── final_model/
└── training_lineage.json

Cloud runs use the canonical provider-native layout:

runs/{provider}/{method}/{timestamp}-{shortsha}/
├── checkpoints/
├── capacity_features.json
├── logs/
├── final_model/
├── training_lineage.json
└── manifest.json

Provider-native storage defaults:

  • hf_jobs → Hugging Face Bucket
  • modal → Modal Volume
  • runpod → RunPod Network Volume

HF Jobs Notes

  • Launch from a clean tracked worktree and a pushed commit only; the remote container checks out that exact SHA.
  • Keep the main training interpreter compatible with Unsloth and Transformers; any Buckets-only huggingface_hub upgrade must stay isolated from the trainer runtime.
  • Keep the bucket-sync overlay self-contained for the Hub client stack: install huggingface_hub>=1.5.0, hf_transfer, and hf_xet into the overlay. Do not rely on the base Unsloth image's hf_xet; older system copies can fail final sync with ImportError: cannot import name 'SKIP_SHA256'.
  • Pass HF_TOKEN into the cloud job explicitly; do not assume HF Jobs injects it automatically.
  • Treat blank HF_TOKEN / HF_API_KEY values as unset, otherwise bucket sync can fail with Authorization: Bearer .
  • For post-training cloud evaluation, prefer python tuner.py cloud-eval --run latest --preset full.
  • Keep cloud-eval overlays split by responsibility. The evaluator runtime PYTHONPATH may include lightweight evaluator deps only; Buckets-only packages (huggingface_hub>=1.5.0, hf_transfer, hf_xet) must stay off the evaluator PYTHONPATH and be exposed only through HF_BUCKET_SYNC_PYTHONPATH. Otherwise base-image transformers can import an incompatible Hub client.
  • run-experiment now supports stage controls: --only-stage, --from-stage, and repeated --skip-stage.
  • run-experiment --auto-hardware uses a blind planner: model size, method, seq length, quantization, and live HF flavor pricing. It does not require prior telemetry.
  • plan-hardware is the inspection surface for that same planner.
  • When using --auto-hardware, treat the resolved batch_size / gradient_accumulation as part of the experiment definition. Compare wall-clock and cost only after checking those resolved values.
  • After training finishes, read training_lineage.json before declaring the hardware/batch shape “good enough.” If peak reserved VRAM is still far below device capacity, the run is underpacked and the next iteration should push batch utilization harder.
  • On a100-large, the default posture should be: push it until it breaks, then back off one notch. A large A100 headroom cushion is usually wasted iteration speed.
  • Finished experiments now write .tracking/experiments/<id>/analysis/ with: experiment_summary.json, run_matrix.csv, feature_dataset.{jsonl,csv}, next_run_candidates.json, draft_next_spec.yaml.
  • For the common train-then-evaluate flow, prefer python tuner.py cloud-pipeline --method sft --preset full.
  • cloud-pipeline is currently a two-job orchestration on HF Jobs, not a single remote composite job.
  • run-experiment is the higher-level experiment loop: training stays provider-native, then evaluation and exact dataset loss run as separate sibling post-training jobs by default.
  • Use evaluation.runtime: vllm in experiment specs when you want the fast eval server path. The exact loss stage still uses a post-eval transformers forward pass.
  • If you explicitly want the older embedded path for a smoke run, set post_training.mode: same_job in the experiment spec. Default is parallel.
  • For cloud-eval --with-loss or post_training.mode: same_job, first rely on the selected eval image's ML stack. If a truly missing package must be added, use explicit image-compatible pins or --no-deps stage overrides; do not let pip resolve a fresh Torch/Transformers stack inside the evaluator runtime overlay.
  • Checkpoint-vs-checkpoint comparison is not automatic in smoke runs; you only get that if the trainer emitted multiple checkpoints and you intentionally run checkpoint evaluation / experiment-loop workflows.
  • For SFT model-comparison experiments, use cloud-pipeline with --train-* overrides so the experiment lands in canonical HF training storage instead of runs/hf_jobs/custom/....
  • When testing newer upstream Unsloth runtimes, switch images with --train-image-profile next instead of upgrading packages in the old stable image.
  • To inspect a finished HF cloud evaluation run from the bucket, use python tuner.py cloud-inspect --run latest --eval-run latest --method sft.
  • Completed SFT/KTO runs save a flat capacity_features.json artifact designed for tabular modeling or capacity prediction.

Tips

  • Always --dry-run first to verify setup without training.
  • Use --tier quick for fast prototyping and --tier thorough for final quality runs.
  • Use --model-size 3b for fast iteration and 7b for production-style runs.
  • SFT with packing: true is much faster.
  • KTO datasets must be interleaved True/False.
  • GRPO rewards are YAML-driven; edit configs/rewards/, not Python.
  • Env-GRPO multi-turn rollouts are token-faithful by default (env_training.token_faithful: true, requires trl>=0.28.0); it emits a per-token env_mask so GRPO trains only on sampled tokens. See reference/grpo-training.md → "Token-Faithful Multi-Turn Rollout".
  • fitness.yaml uses FitnessEvaluator for structural validation.
  • After training, use surgery only after you have a stable baseline and evaluation scenario.
  • training_lineage.json tracks full provenance for reproducibility.
Install via CLI
npx skills add https://github.com/ProfSynapse/Synaptic-Tuner --skill fine-tuning
Repository Details
star Stars 23
call_split Forks 3
navigation Branch main
article Path SKILL.md
More from Creator