ai-model-training - SKILL.md Agent Skill

name: ai-model-training
description: >
  Use this skill when fine-tuning LLMs: LoRA, QLoRA, RLHF, DPO, SFT, instruction tuning, preference tuning, PEFT, prompt tuning, adapter training, training data preparation, multi-GPU training, distributed training, hyperparameter search, full pre-training, continued pre-training.
  This skill enforces: fine-tuning strategy selection, training data preparation with chat templates, preference pair construction, evaluation before/during/after training, training configuration documentation, distributed setup, experiment tracking.
  Do NOT use for: feature store training, embedding model training (see ai-embeddings), RAG pipeline tuning, inference optimization.
version: "2.0.0"
author: "j4flmao"
license: "MIT"
type: skill
compatibility:
  claude-code: true
  cursor: true
  codex: true
  windsurf: true
tags: [ai, training, fine-tuning, distributed-training, hpo, experiment-tracking, phase-11]

Model Training Agent

Purpose

Design and execute model training plans for LLM fine-tuning, continued pre-training, and RLHF alignment: strategy selection, data pipeline, training configuration, distributed setup, hyperparameter optimization, evaluation, and production tracking.

Agent Protocol

Trigger

User request includes: fine-tuning, LoRA, QLoRA, RLHF, DPO, PPO, training LLM, model training, instruction tuning, preference tuning, SFT, prompt tuning, adapter, PEFT, Supervised Fine-Tuning, distributed training, hyperparameter search, pre-training, continued pre-training.

Protocol

Clarify: base model, task type, data volume (size + tokens), compute budget (GPU hours, dollars), hardware available.
Navigate decision tree to select training approach.
Prepare training data: format (instruction / chat / preference pairs), tokenize, split, validate.
Configure training: hyperparameters, optimizer, LR schedule, precision, batch size.
Design distributed setup: single GPU, FSDP, DeepSpeed, multi-node.
Define evaluation: pre-training baseline, in-training metrics, post-training benchmarks, forgetting checks.
Set up experiment tracking: metrics logging, checkpoint registry, hyperparameter capture.

Decision Tree: Training Approach

Q: Is this your first time training this model?
├── NO  → Go to "Fine-tuning or continued pre-training?"
└── YES → Go to "Available compute?"

Q: Available compute?
├── < 16 GB VRAM → QLoRA (4-bit NF4, double quant)
│   LoRA rank <= 32, batch size 1-2, gradient checkpointing req.
├── 16-48 GB VRAM → LoRA (BF16 base)
│   LoRA rank 16-64, batch size 2-8
├── 48-160 GB VRAM → Full fine-tune (single node)
│   BF16, gradient checkpointing, DeepSpeed ZeRO-2/3
└── > 160 GB VRAM / multi-node → Full fine-tune (distributed)
    BF16, FSDP or DeepSpeed ZeRO-3, tensor parallelism for 70B+

Q: Fine-tuning or continued pre-training?
├── Task adaptation (< 100K examples)           → LoRA
├── Domain shift / style change (> 100K examples) → Full fine-tune
├── New knowledge / continued pre-training       → Full pre-train or continued pre-train
└── Align model behavior                         → Go to "Alignment method?"

Q: Alignment method?
├── Human preference data available?
│   ├── YES → Ask: KL control importance?
│   │   ├── HIGH → PPO (3-stage: SFT → RM → PPO)
│   │   └── LOW  → DPO (single stage, no reward model)
│   └── NO  → SFT only (instruction tuning)
└── Want to avoid reward model training?
    ├── YES → DPO
    └── NO  → PPO (if compute budget allows 3 stages)

Q: Multi-task / multi-domain?
├── YES → Use LoRA adapters per task with shared base
│   Consider: AdapterFusion, LoRA ensembles
└── NO  → Single adapter or full fine-tune

Q: Data size for instruction tuning?
├── < 1K examples → Use LoRA, start with higher LR (3e-4), more epochs (5-10)
├── 1K-10K examples → LoRA, standard config, 3-5 epochs
├── 10K-100K examples → LoRA or full fine-tune, 2-3 epochs
└── > 100K examples → Full fine-tune preferred, 1-2 epochs
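The compute and data-size branches above can be sketched as two small lookup functions. This is a minimal illustration, not part of the skill itself: the function names are hypothetical, and how the exact boundary values (16, 48, 160 GB; 10K, 100K examples) are assigned is an assumption, since the tree does not say which side they fall on.

```python
def select_method_by_vram(vram_gb: float) -> str:
    """Map available VRAM (GB) to the approach named in the compute branch."""
    if vram_gb < 16:
        return "QLoRA"
    if vram_gb < 48:
        return "LoRA"
    if vram_gb <= 160:
        return "Full fine-tune (single node)"
    return "Full fine-tune (distributed)"


def instruction_tuning_plan(num_examples: int) -> dict:
    """Map dataset size to the method and epoch range from the data-size branch."""
    if num_examples < 1_000:
        return {"method": "LoRA", "epochs": (5, 10)}
    if num_examples < 10_000:
        return {"method": "LoRA", "epochs": (3, 5)}
    if num_examples <= 100_000:
        return {"method": "LoRA or full fine-tune", "epochs": (2, 3)}
    return {"method": "Full fine-tune", "epochs": (1, 2)}


print(select_method_by_vram(24))       # LoRA
print(instruction_tuning_plan(5_000))  # {'method': 'LoRA', 'epochs': (3, 5)}
```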

Workflow

Step 1: Select Training Method

Full Pre-training: Train from scratch. Requires massive data (1T+ tokens), compute, and engineering. Only when no suitable base model exists.
Continued Pre-training: Train on existing base with new domain data (code, biomedical, legal). Use same tokenizer, extend vocab if needed. LR 1e-5 to 5e-5.
Full Fine-tune: All parameters updated. Best for large distribution shifts. Requires most compute. LR 1e-5 to 5e-5.
LoRA: Low-rank adapters. ~1% of parameters. Best for task adaptation. Default choice. LR 1e-4 to 5e-4.
QLoRA: Quantized LoRA (4-bit NF4) with double quantization. ~0.5% of parameters. Best for limited GPU memory. LR 1e-4 to 3e-4.
Adapters: Bottleneck layers between transformer sublayers. Best for multi-task setups.
DPO: Direct Preference Optimization. Single-stage alignment. No reward model needed.
PPO: 3-stage RLHF. SFT → Reward Model → PPO. Most compute but best alignment control.
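To make the "no reward model needed" point concrete, here is a sketch of the per-example DPO loss computed from sequence log-probabilities under the policy and a frozen reference model. It uses plain floats for illustration; the function name and argument names are hypothetical, and a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x); written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss is log(2)
print(f"{dpo_loss(-10.0, -12.0, -10.0, -12.0):.4f}")  # 0.6931
```

The loss falls below log(2) once the policy raises the chosen response's log-probability relative to the reference more than it raises the rejected one, and beta controls how strongly that margin is rewarded.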

Step 2: Prepare Training Data

Instruction Format

from datasets import Dataset

data = [
    {"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"},
    {"instruction": "Summarize", "input": "Long text...", "output": "Short summary..."}
]
dataset = Dataset.from_list(data)  # wrap the records so they can be mapped and tokenized

Chat Template Format

from transformers import AutoTokenizer

# Hub IDs for Mistral instruct models carry a version suffix
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)

Preference Pairs (for DPO/RLHF)

preference_data = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "Paris is the capital of France.",
        "rejected": "London is the capital of France."
    }
]
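Preference pairs are easy to get subtly wrong (empty fields, identical chosen and rejected responses). A minimal validation sketch, assuming each record is a dict of strings with the three keys shown above; the function name is hypothetical:

```python
def validate_preference_pairs(pairs):
    """Split records into valid pairs and a list of (index, problem) tuples."""
    valid, problems = [], []
    for i, pair in enumerate(pairs):
        missing = [k for k in ("prompt", "chosen", "rejected")
                   if not (pair.get(k) or "").strip()]
        if missing:
            problems.append((i, "missing or empty: " + ", ".join(missing)))
        elif pair["chosen"].strip() == pair["rejected"].strip():
            problems.append((i, "chosen and rejected are identical"))
        else:
            valid.append(pair)
    return valid, problems


pairs = [
    {"prompt": "What is the capital of France?",
     "chosen": "Paris is the capital of France.",
     "rejected": "London is the capital of France."},
    {"prompt": "2 + 2?", "chosen": "4", "rejected": "4"},
    {"prompt": "", "chosen": "x", "rejected": "y"},
]
valid, problems = validate_preference_pairs(pairs)
print(len(valid), problems)
```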

Tokenization with Label Masking

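def tokenize_and_mask(examples, tokenizer, max_length=2048):
    # Expects a batch of examples (use with Dataset.map(batched=True))
    outputs = tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors=None,
    )
    # Copy input_ids to labels and set padding positions to -100 so the loss
    # ignores them. pad_token is eos_token here, so padding is identified via
    # the attention mask rather than the token id.
    # Masking user/prompt tokens additionally requires each example's prompt
    # boundary, which this function does not receive.
    outputs["labels"] = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(outputs["input_ids"], outputs["attention_mask"])
    ]
    return outputs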
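The function above mentions masking user tokens with -100. One way to do that, sketched here on plain token-id lists, is to tokenize the prompt separately and mask everything up to its length. The helper name and the idea of passing `prompt_length` explicitly are assumptions for illustration; -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def mask_prompt_labels(input_ids, prompt_length, attention_mask=None):
    """Return labels where prompt tokens and padding are replaced by IGNORE_INDEX."""
    labels = list(input_ids)
    for i in range(len(labels)):
        is_prompt = i < prompt_length
        is_padding = attention_mask is not None and attention_mask[i] == 0
        if is_prompt or is_padding:
            labels[i] = IGNORE_INDEX
    return labels


# Two prompt tokens, two response tokens, two padding tokens
print(mask_prompt_labels([5, 6, 7, 8, 0, 0], prompt_length=2,
                         attention_mask=[1, 1, 1, 1, 0, 0]))
# [-100, -100, 7, 8, -100, -100]
```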

Data Splitting & Validation

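# Split: train (80%), eval (10%), test (10%)
# Stratify by category if available
from sklearn.model_selection import train_test_split

def prepare_splits(data, stratify_col=None, seed=42):
    # stratify_col: optional list of category labels, one per item in data
    if stratify_col is None:
        train_val, test = train_test_split(data, test_size=0.1, random_state=seed)
        train, eval_set = train_test_split(train_val, test_size=1 / 9, random_state=seed)
        return train, eval_set, test
    # Split the labels alongside the data so the second split stratifies on
    # labels that match train_val (the full-length labels would not)
    train_val, test, labels_train_val, _ = train_test_split(
        data, stratify_col, test_size=0.1, stratify=stratify_col, random_state=seed
    )
    # 1/9 of the remaining 90% is 10% of the full dataset
    train, eval_set = train_test_split(
        train_val, test_size=1 / 9, stratify=labels_train_val, random_state=seed
    )
    return train, eval_set, test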

Step 3: Configure Training with LoRA

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
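As a sanity check on what `print_trainable_parameters()` should report, the LoRA parameter count can be worked out by hand: each adapted linear layer of shape (d_in, d_out) adds r * (d_in + d_out) parameters. The sketch below assumes the published Mistral-7B dimensions (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); the helper name is hypothetical.

```python
def lora_param_count(shapes, rank, num_layers):
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted linear layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# (d_in, d_out) of each target module in one decoder layer (assumed Mistral-7B shapes)
MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

trainable = lora_param_count(MISTRAL_7B_SHAPES, rank=16, num_layers=32)
print(f"{trainable:,}")  # 41,943,040
```

With r=16 on all seven modules this comes to about 42M trainable parameters, well under 1% of the base model. Doubling the rank doubles the count.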

Step 4: Training Arguments

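from datetime import datetime

model_name = "mistral-7b"  # short label used only in the run name
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")

training_args = TrainingArguments(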
    output_dir="./checkpoints",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    logging_steps=10,
    logging_strategy="steps",
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=400,  # must be a multiple of eval_steps when load_best_model_at_end=True
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,
    tf32=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="adamw_torch",
    weight_decay=0.01,
    max_grad_norm=1.0,
    report_to="wandb",
    run_name=f"lora-ft-{model_name}-{timestamp}",
    remove_unused_columns=False,
    dataloader_num_workers=4,
    ddp_find_unused_parameters=False,
)
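The batch and warmup settings above translate into concrete step counts. The sketch below approximates how those counts are derived; the exact rounding inside Trainer may differ between versions, and the dataset size of 15,000 examples on a single GPU is an illustrative assumption.

```python
import math


def estimate_schedule(num_examples, per_device_batch, grad_accum, num_gpus,
                      epochs, warmup_ratio):
    """Approximate effective batch size, optimizer steps and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = updates_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "warmup_steps": math.ceil(total_updates * warmup_ratio),
    }


plan = estimate_schedule(num_examples=15_000, per_device_batch=4, grad_accum=8,
                         num_gpus=1, epochs=3, warmup_ratio=0.03)
print(plan)
# {'effective_batch': 32, 'updates_per_epoch': 468, 'total_updates': 1404, 'warmup_steps': 43}
```

With only about 1,400 optimizer steps, evaluating every 200 steps gives roughly seven evaluation points over the run, which is worth checking before settling on eval and save intervals.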

Step 5: Trainer Setup

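from transformers import DataCollatorForSeq2Seq

# Pads input_ids, attention_mask and labels per batch; label padding uses -100
label_collator = DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100)

def collate(features):
    # Keep only model inputs: remove_unused_columns=False leaves raw text columns in place
    keep = ("input_ids", "attention_mask", "labels")
    return label_collator([{k: f[k] for k in keep} for f in features])

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=collate,
    # compute_metrics_fn and eval_task are supplied by the caller; eval_loss is reported either way
    compute_metrics=compute_metrics_fn if eval_task else None,
)
trainer.train()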
trainer.train()

Step 6: Distributed Training

FSDP Configuration

# fsdp_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer  # use your model's decoder class, e.g. MistralDecoderLayer for Mistral
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

DeepSpeed ZeRO-3 Configuration

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": true},
        "offload_param": {"device": "cpu", "pin_memory": true},
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 5e7,
        "stage3_prefetch_bucket_limit": 5e7,
        "stage3_param_persistence_threshold": 1e6,
        "sub_group_size": 1e9
    },
    "bf16": {"enabled": true},
    "fp16": {"enabled": false},
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "steps_per_print": 100,
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 4,
    "wall_clock_breakdown": false
}
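DeepSpeed requires `train_batch_size` to equal `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × number of GPUs, and refuses to start otherwise. A small pre-flight check, with a hypothetical function name, can catch a mismatch before a job is queued:

```python
def check_deepspeed_batch(train_batch_size, micro_batch_per_gpu,
                          grad_accum_steps, world_size):
    """Return (is_consistent, expected_train_batch_size)."""
    expected = micro_batch_per_gpu * grad_accum_steps * world_size
    return train_batch_size == expected, expected


# Micro-batch 4 and 8 accumulation steps on 8 GPUs
print(check_deepspeed_batch(256, 4, 8, 8))  # (True, 256)
print(check_deepspeed_batch(32, 4, 8, 8))   # (False, 256)
```

When the config is used through the Hugging Face Trainer, these three fields can instead be set to "auto" so they are filled in from TrainingArguments.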

Launch Commands

# DeepSpeed
deepspeed --num_gpus=8 train.py \
    --deepspeed ds_config.json \
    --model_name meta-llama/Llama-2-13b-hf

# FSDP via torchrun
torchrun --nproc_per_node=8 train.py \
    --fsdp full_shard \
    --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer

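# Multi-node (run on every node; MASTER_ADDR is a rendezvous host reachable from all nodes)
torchrun --nnodes=4 --nproc_per_node=8 --rdzv_id=101 --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 train.py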

Architectural Patterns

Data Pipeline Architecture

Raw Sources (JSONL, Parquet, DB)
  → Data Cleaner (PII removal, dedup, quality scoring)
    → Formatter (chat template, instruction format)
      → Tokenizer (map-style dataset with caching)
        → DataLoader (batching, shuffling, num_workers)
          → Training Loop

Key design decisions:

Use datasets library with memory mapping for large datasets (no full RAM load).
Cache tokenized datasets to disk (keep_in_memory=False) between runs.
Set dataloader_num_workers=4-8 to avoid GPU starvation.
Stream datasets larger than available disk, e.g. with load_dataset(..., streaming=True) from the datasets library or MosaicML's StreamingDataset.

Training Loop Architecture

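# Custom training loop (when Trainer is insufficient)
# Assumes model, optimizer, dataloaders, device and the step settings are already defined.
ctx = torch.autocast(device_type="cuda", dtype=torch.bfloat16)
loss_scaler = torch.cuda.amp.GradScaler(enabled=False)  # enable only for FP16; a no-op for BF16

model.train()
global_step = 0
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}

        with ctx:  # autocast for mixed precision
            outputs = model(**batch)
            loss = outputs.loss / gradient_accumulation_steps

        loss_scaler.scale(loss).backward()

        if (step + 1) % gradient_accumulation_steps == 0:
            loss_scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            loss_scaler.step(optimizer)
            loss_scaler.update()
            optimizer.zero_grad()
            global_step += 1

            if global_step % logging_steps == 0:
                # Undo the accumulation scaling so the logged loss is the true batch loss
                log_to_tracker({"train/loss": loss.item() * gradient_accumulation_steps})
            if global_step % eval_steps == 0:
                # Evaluation is expensive: run it on its own, less frequent schedule
                log_to_tracker(compute_metrics(model, eval_loader, device))
                model.train()  # evaluation may leave the model in eval mode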

Checkpointing Architecture

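import os
import shutil

import torch


class CheckpointManager:
    """Saves every N steps or on a new best eval_loss; keeps the k best checkpoints."""

    def __init__(self, output_dir, save_every_n_steps, keep_last_k):
        self.dir = output_dir
        self.save_every = save_every_n_steps
        self.keep = keep_last_k  # number of checkpoints retained, ranked by eval_loss
        self.checkpoints = []  # (step, eval_loss), best first

    def _is_best(self, metrics):
        # True when eval_loss is lower than every checkpoint saved so far
        loss = metrics.get("eval_loss")
        return loss is not None and all(loss < saved for _, saved in self.checkpoints)

    def save(self, model, optimizer, scheduler, step, metrics):
        if step % self.save_every != 0 and not self._is_best(metrics):
            return
        ckpt_path = os.path.join(self.dir, f"step_{step}")
        os.makedirs(ckpt_path, exist_ok=True)
        # Full training state, needed to resume a run
        save_args = {
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "step": step,
            "metrics": metrics,
        }
        torch.save(save_args, os.path.join(ckpt_path, "training_state.pt"))
        model.save_pretrained(ckpt_path)  # weights in the format from_pretrained expects
        self.checkpoints.append((step, metrics.get("eval_loss", float("inf"))))
        self.checkpoints.sort(key=lambda x: x[1])
        while len(self.checkpoints) > self.keep:
            stale_step = self.checkpoints.pop()[0]  # highest eval_loss goes first
            shutil.rmtree(os.path.join(self.dir, f"step_{stale_step}"), ignore_errors=True)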

Evaluation Loop Architecture

# In-training evaluation
# Run on a fixed subset of eval data (500-1000 samples) every N steps
# Track: eval_loss, perplexity, task accuracy, gradient norms
#
# Pre-training baseline: run before training starts
# In-training: every N steps on eval subset
# Post-training: full benchmark suite after training
#
# Catastrophic forgetting detection:
# - Maintain a "forgetting set" of diverse tasks
# - Track scores relative to pre-training baseline
# - Alert if any task drops > 5% from baseline

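import math
from itertools import islice

import torch

def evaluate_model(model, eval_dataloader, device, max_samples=500):
    # Evaluate on a fixed-size subset: at most max_samples examples
    max_batches = max(1, max_samples // eval_dataloader.batch_size)
    model.eval()
    total_loss = 0.0
    total_steps = 0
    with torch.no_grad():
        for batch in islice(eval_dataloader, max_batches):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            total_loss += outputs.loss.item()
            total_steps += 1
    model.train()  # restore training mode for the caller
    # Mean of per-batch losses; exact only when batches hold equal numbers of target tokens
    mean_loss = total_loss / max(total_steps, 1)
    return {"eval_loss": mean_loss, "perplexity": math.exp(mean_loss)}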
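The forgetting check described in the comments above can be sketched as a comparison of per-task scores against the pre-training baseline. Treating the 5% threshold as a relative drop is an assumption (the text does not say relative or absolute), and the task names and scores below are made up for illustration.

```python
def detect_forgetting(baseline, current, max_relative_drop=0.05):
    """Return {task: relative_drop} for tasks that fell more than the threshold."""
    regressions = {}
    for task, base_score in baseline.items():
        if task not in current or base_score <= 0:
            continue
        drop = (base_score - current[task]) / base_score
        if drop > max_relative_drop:
            regressions[task] = round(drop, 4)
    return regressions


baseline = {"mmlu": 0.63, "gsm8k": 0.40}  # hypothetical pre-training scores
current = {"mmlu": 0.61, "gsm8k": 0.35}   # hypothetical post-training scores
print(detect_forgetting(baseline, current))  # {'gsm8k': 0.125}
```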

Training Infrastructure Design

Compute

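GPU	VRAM	FP16 TFLOPS (dense)	BF16 TFLOPS (dense)	Best For
RTX 4090	24 GB	165	165	QLoRA, LoRA ≤13B
A100 80GB	80 GB	312	312	Full FT ≤13B, LoRA ≤70B
H100 (SXM)	80 GB	989	989	Full FT ≤70B, pre-training
H200	141 GB	989	989	Full FT ≤70B+
MI300X	192 GB	1307	1307	Alternative to H100
TFLOPS are peak dense tensor-core figures from vendor specifications. "Best For" assumes a multi-GPU node with sharding (FSDP / ZeRO) for the larger model sizes.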

Storage Requirements

Dataset storage: NVMe SSD recommended. Tokenized datasets benefit from fast random access.
Checkpoint storage: Large contiguous writes. A 70B checkpoint is ~140 GB for BF16 weights alone; saving FP32 AdamW moments adds ~560 GB (8 bytes per parameter), so a resumable checkpoint is roughly 700 GB or more. Budget 3-5x model size for checkpoint space.
Model registry: Object storage (S3, GCS, Blob) for versioned artifacts.
Cache: HuggingFace cache directory needs 10-100 GB for base models.
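The checkpoint sizes above follow from bytes-per-parameter arithmetic. A rough sketch, assuming BF16 weights (2 bytes per parameter) and AdamW with two FP32 moment tensors (8 bytes per parameter); the function name is hypothetical and FP32 master weights, if kept, add a further 4 bytes per parameter:

```python
def checkpoint_size_gb(num_params, weight_bytes=2, optimizer_bytes=8, master_weight_bytes=0):
    """Estimate checkpoint size in GB from per-parameter byte counts."""
    gb = 1e9
    weights = num_params * weight_bytes / gb
    optimizer = num_params * optimizer_bytes / gb
    master = num_params * master_weight_bytes / gb
    return {"weights": weights, "optimizer": optimizer,
            "total": weights + optimizer + master}


sizes = checkpoint_size_gb(70_000_000_000)
print(sizes)  # {'weights': 140.0, 'optimizer': 560.0, 'total': 700.0}
```

Keeping three such resumable checkpoints already needs about 2 TB, which is why pruning old checkpoints and saving weights-only copies for evaluation is common.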

Networking (Multi-Node)

Minimum: 25 Gbps Ethernet. Expect 30-40% scaling efficiency.
Recommended: 200-400 Gbps InfiniBand (HDR/NDR). Expect 80-90% scaling efficiency.
Topology: Fat-tree or Dragonfly for GPU clusters.
NCCL: Use NCCL_IB_HCA, NCCL_SOCKET_IFNAME, tune NCCL_IB_TIMEOUT and NCCL_IB_RETRY_CNT.

CPU/RAM Guidelines

System RAM: Minimum 64 GB per GPU (128 GB recommended for ZeRO-3 with CPU offload).
CPU cores: At least 8-16 cores per GPU for data loading and preprocessing.

Hyperparameter Optimization Strategies

Bayesian Optimization with Optuna

import optuna
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Assumes base_model_id, train_data and eval_data are defined elsewhere.

def objective(trial):
    lr = trial.suggest_float("learning_rate", 5e-5, 5e-4, log=True)
    lora_r = trial.suggest_int("lora_r", 8, 64)
    lora_alpha = trial.suggest_int("lora_alpha", 16, 128)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)
    warmup_ratio = trial.suggest_float("warmup_ratio", 0.01, 0.1)
    dropout = trial.suggest_float("lora_dropout", 0.0, 0.3)

    # Load a fresh base model per trial: get_peft_model injects adapters into the model it is given
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
    config = LoraConfig(r=lora_r, lora_alpha=lora_alpha, lora_dropout=dropout, task_type="CAUSAL_LM")
    model = get_peft_model(base_model, config)

    args = TrainingArguments(
        output_dir=f"./hpo/trial-{trial.number}",
        learning_rate=lr,
        weight_decay=weight_decay,
        warmup_ratio=warmup_ratio,
        num_train_epochs=2,
        report_to="none",
        logging_steps=50,
        save_strategy="no",
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_data, eval_dataset=eval_data)
    trainer.train()
    eval_result = trainer.evaluate()
    return eval_result["eval_loss"]

# No pruner is configured: a pruner only acts on intermediate values sent via trial.report(),
# and this objective reports just the final eval loss.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(f"Best params: {study.best_params}, best loss: {study.best_value}")

Learning Rate Range Test

# Find a good LR by running a short training pass with an exponentially increasing LR.
# Implemented manually here (fastai and PyTorch Lightning ship similar LR finders).
# The test updates the model's weights, so reload the model before real training.
def lr_range_test(model, dataloader, optimizer_cls, device, min_lr=1e-7, max_lr=1.0):
    optimizer = optimizer_cls(model.parameters(), lr=min_lr)
    num_batches = len(dataloader)
    lr_mult = (max_lr / min_lr) ** (1 / num_batches)
    losses = []
    lrs = []

    model.train()
    for step, batch in enumerate(dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        losses.append(loss.item())
        lrs.append(optimizer.param_groups[0]["lr"])

        # Stop once the loss has clearly diverged
        if loss.item() > 3 * min(losses):
            break
        for group in optimizer.param_groups:
            group["lr"] *= lr_mult

    # A good LR is near the steepest descent point of the loss curve, below where it diverges
    return lrs, losses

Key Hyperparameter Rules

Parameter	SFT LoRA	Full FT	DPO	Pre-training
Learning rate	1e-4 to 5e-4	1e-5 to 5e-5	5e-7 to 5e-6	1e-4 to 3e-4
LoRA rank	8-64	N/A	16-64	N/A
LoRA alpha	16-128	N/A	16-64	N/A
Batch size (effective)	16-128	32-512	16-64	512-4096
Warmup ratio	0.03-0.1	0.03-0.1	0.05-0.15	0.01-0.05
Weight decay	0.01-0.1	0.01-0.1	0.0-0.1	0.01-0.1
Epochs	3-10	2-5	1-3	1-3 (tokens)
LR scheduler	cosine	cosine	cosine	cosine or warmup-stable-decay
Gradient clipping	1.0	1.0	1.0	1.0
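The cosine schedule with warmup used throughout the table can be written out in a few lines. This sketch follows the usual linear-warmup-then-cosine-decay shape; the function name and the `min_lr` floor are illustrative assumptions rather than a specific library's API.

```python
import math


def cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# 1,000 steps with a 0.03 warmup ratio and a peak LR of 2e-4
for s in (0, 15, 30, 515, 1000):
    print(s, f"{cosine_lr(s, 1000, 30, 2e-4):.2e}")
```

The rate climbs linearly to the peak at step 30, passes through half the peak at the midpoint of the decay phase (step 515), and reaches the floor at the final step.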

Anti-Patterns & Troubleshooting

Anti-Pattern 1: Overfitting

Symptom: Training loss decreases but eval loss increases (divergence). Detection: Track the gap eval_loss - train_loss; flag overfitting when the gap exceeds 20% of train_loss for 3 or more consecutive evaluations. Fixes:

Reduce epochs (early stopping when eval loss plateaus)
Increase LoRA dropout (0.05 → 0.2)
Add weight decay (0.01 → 0.1)
Increase data (augmentation, synthetic data)
Reduce model capacity (lower LoRA rank)
Add replay data from original distribution
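The detection rule for overfitting can be sketched as a streak counter over paired train and eval losses. This takes the gap as eval loss minus train loss and requires consecutive evaluations, both of which are interpretive assumptions; the function name is hypothetical.

```python
def is_overfitting(train_losses, eval_losses, rel_gap=0.20, patience=3):
    """True if (eval - train) exceeds rel_gap * train for `patience` consecutive evals."""
    streak = 0
    for train_loss, eval_loss in zip(train_losses, eval_losses):
        gap = eval_loss - train_loss
        if train_loss > 0 and gap > rel_gap * train_loss:
            streak += 1
        else:
            streak = 0
        if streak >= patience:
            return True
    return False


train = [1.00, 0.80, 0.60, 0.50, 0.40]
print(is_overfitting(train, [1.05, 0.90, 0.80, 0.75, 0.70]))  # True
print(is_overfitting(train, [1.05, 0.85, 0.65, 0.55, 0.45]))  # False
```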

Anti-Pattern 2: Underfitting

Symptom: Both train and eval losses are high and flat. Detection: Eval perplexity stays above 2x the expected baseline (equivalently, eval loss more than ln 2 ≈ 0.69 above the baseline loss). Fixes:

Increase learning rate
Increase LoRA rank (16 → 64)
More training epochs
Check for data issues (wrong format, mismatched tokenizer)
Remove excessive dropout or regularization
Verify model is actually training (check gradient norms)

Anti-Pattern 3: Data Leakage

Symptom: Eval metrics are unrealistically high but model fails in production. Types:

Temporal leakage: Training on future data (time-series). Fix: split by time.
Feature leakage: Eval features present in training. Fix: ensure strict separation.
Benchmark contamination: Test examples leaked into training data. Fix: deduplicate against known benchmarks.
LLM-generated eval data: Using the same LLM for data generation and evaluation. Fix: evaluate with a different model family or a human-written eval set.
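For benchmark contamination, a common first pass is to flag training examples that share a long word n-gram with any benchmark item. The sketch below uses whitespace tokenization and lowercasing only; the function names and the default n of 8 are illustrative assumptions, and real pipelines usually normalize text more aggressively.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(train_texts, benchmark_texts, n=8):
    """Indices of training texts sharing at least one n-gram with any benchmark text."""
    benchmark_ngrams = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams(text, n)
    return [i for i, text in enumerate(train_texts) if ngrams(text, n) & benchmark_ngrams]


train_texts = [
    "the quick brown fox jumps over the lazy dog",
    "fine tuning adapts a base model",
]
benchmark_texts = ["Quick brown fox jumps high"]
print(find_contaminated(train_texts, benchmark_texts, n=3))  # [0]
```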

Anti-Pattern 4: Compute Waste

Symptoms: GPU utilization < 50%, long idle times between steps. Fixes:

Increase dataloader_num_workers to match CPU count per GPU
Enable prefetch_factor in DataLoader
Profile with nsys or PyTorch profiler to find bottlenecks
Use padding_free / unified padding for variable-length sequences
Reduce evaluation frequency if eval is slow
Save checkpoints asynchronously (save on separate thread)
Use Flash Attention if available

Anti-Pattern 5: Training Instability

Symptom: Loss spikes, NaN loss, gradient explosion. Detection: Monitor loss values, gradient norms, max/sum of gradients. Fixes:

Reduce learning rate
Enable gradient clipping (1.0)
Warm up LR more slowly
Use BF16 instead of FP16 (BF16 shares FP32's exponent range, so it is far less prone to overflow)
Check for NaN in input data
Reduce batch size (smaller per-device + fewer accumulation steps)
Ensure loss scaling is configured correctly

Anti-Pattern 6: Catastrophic Forgetting

Symptom: Performance on general benchmarks drops after fine-tuning. Fixes:

Include 10-30% general data in training mix
Use lower learning rate (1e-5 for LoRA)
Multi-task learning: train task + general data simultaneously
Elastic Weight Consolidation (EWC)
Keep LoRA rank low to limit representational shift

Anti-Pattern 7: DPO Overfitting / Reward Hacking

Symptom: Policy exploits the reward model (PPO) or overfits the preference pairs (DPO) rather than learning the actual preference. Fixes:

Increase KL penalty (beta in DPO: 0.1 → 0.5)
Limit PPO epochs per batch (3-4 max)
Use reference model for KL regularization
Monitor response diversity (distinct n-grams)
Human evaluation alongside automated metrics
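Response diversity can be monitored with a distinct-n score: the ratio of unique n-grams to total n-grams across a sample of generations. A minimal sketch with whitespace tokenization; the function name is hypothetical and the responses below are made up.

```python
def distinct_n(responses, n=2):
    """Unique n-grams divided by total n-grams across all responses (0.0 if none)."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


print(distinct_n(["a b c", "a b c"]))  # 0.5
print(distinct_n(["a b c", "d e f"]))  # 1.0
```

A score that falls steadily during preference training suggests the policy is collapsing onto a few templated responses, which is worth checking against human evaluation.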

Production: Experiment Tracking & Model Registry

Experiment Tracking Setup

# Initialize logging to multiple backends
import wandb
from torch.utils.tensorboard import SummaryWriter

# W&B setup
wandb.init(
    project="model-fine-tuning",
    config={
        "model": "Mistral-7B",
        "method": "lora",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "batch_size": 32,
        "epochs": 3,
        "dataset": "support-qa-v5",
        "dataset_size": 15000,
    },
    tags=["experiment", "lora-ft"],
)

# What to log every N steps:
# - train/loss
# - train/grad_norm
# - eval/loss
# - eval/perplexity
# - eval/task_accuracy (if available)
# - train/learning_rate
# - train/epoch

# What to log once:
# - model architecture summary
# - dataset statistics
# - hardware info (GPU type, count, RAM)
# - git commit hash
# - hyperparameter config (full yaml/json dump)

Model Registry (MLflow Style)

# Model versioning schema
model_registry_entry = {
    "model_id": "ft-mistral-7b-v3",
    "base_model": "mistralai/Mistral-7B-v0.1",
    "method": "lora",
    "hyperparameters": {
        "lora_r": 16,
        "lora_alpha": 32,
        "learning_rate": 2e-4,
        "batch_size": 32,
        "epochs": 3,
    },
    "dataset": {
        "name": "support-qa-v5",
        "size": 15000,
        "hash": "sha256:abc123...",
    },
    "metrics": {
        "eval_loss": 0.45,
        "task_accuracy": 0.94,
        "mmlu_before": 0.63,
        "mmlu_after": 0.61,
        "forgetting_delta": -0.02,
    },
    "artifacts": {
        "adapter_path": "s3://models/ft-mistral-7b-v3/adapter/",
        "tokenizer_path": "s3://models/ft-mistral-7b-v3/tokenizer/",
        "config_path": "s3://models/ft-mistral-7b-v3/config.yaml",
        "checkpoints": "s3://models/ft-mistral-7b-v3/checkpoints/",
    },
    "training_metadata": {
        "compute_hours": 48,
        "gpu_type": "A100-80GB",
        "gpu_count": 8,
        "framework": "transformers 4.38 + peft 0.9",
        "start_time": "2025-01-15T10:00:00Z",
        "end_time": "2025-01-17T10:00:00Z",
    },
    "status": "production",  # experimental, staging, production, archived
    "tags": ["fine-tuned", "support-qa"],
}

Training Monitoring Dashboard

# Key panels to include in monitoring dashboard
DASHBOARD_PANELS = [
    ("Training Loss", "train/loss", "line", "Should decrease smoothly, no spikes"),
    ("Eval Loss", "eval/loss", "line", "Should follow train loss, not diverge"),
    ("Perplexity", "eval/perplexity", "line", "exp(eval_loss). < 10 is good for most tasks"),
    ("Gradient Norm", "train/grad_norm", "line", "Should stay within ~5x of its running average. Spikes = instability"),
    ("Learning Rate", "train/learning_rate", "line", "Should follow schedule"),
    ("GPU Utilization", "system/gpu_util", "line", "Target > 80%. Lower = data bottleneck"),
    ("GPU Memory", "system/gpu_mem_alloc", "line", "Watch for OOM. Should be stable"),
    ("Throughput", "train/samples_per_sec", "line", "Samples/sec across all GPUs"),
    ("Epoch Progress", "train/epoch", "line", "Progress through epochs"),
    ("Overfitting Delta", "eval/train_eval_gap", "line", "Eval loss - train loss. Widening = overfit"),
]

Cost Estimation

def estimate_training_cost(gpu_type, gpu_count, hours, cloud_rate_per_hour=None):
    rates = {
        "A100-80GB": 3.0,   # on-demand approximate
        "H100": 5.0,
        "RTX-4090": 0.5,
        "L40S": 2.0,
    }
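    # Use an explicit rate when given (including 0.0), otherwise fall back to the table
    rate = cloud_rate_per_hour if cloud_rate_per_hour is not None else rates.get(gpu_type, 3.0)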
    return gpu_count * hours * rate

# Examples:
# LoRA 7B on 1x A100 for 4 hours: 1 * 4 * 3.0 = $12
# Full FT 70B on 8x H100 for 24 hours: 8 * 24 * 5.0 = $960
# Pre-train 8B on 256x H100 for 30 days: 256 * 720 * 5.0 = $921,600

Rules

LoRA rank 8-64 depending on task complexity. Higher rank for harder tasks.
QLoRA uses 4-bit NF4 quantization with double quantization for memory efficiency.
Chat template must match base model's training format exactly.
Training data deduplicated and filtered for quality.
Evaluation at three stages: pre-training baseline, in-training (eval loss), post-training (benchmarks).
Gradient checkpointing enabled for models > 7B parameters.
Prefer dynamic padding: set padding=False in the tokenizer and pad per batch with a data collator (DataCollatorForSeq2Seq also pads labels; DataCollatorWithPadding does not).
Learning rate: 1e-4 to 5e-4 for LoRA, 1e-5 to 5e-5 for full fine-tune.
Batch size maximized for GPU memory — larger batches improve training stability.
Watch for catastrophic forgetting — include 10-30% general data replay.
Track experiment with all hyperparameters, metrics, and model checkpoints.
Use torch.compile for 10-30% throughput improvement (PyTorch 2.0+).
Enable Flash Attention 2 (attn_implementation="flash_attention_2") when available.
For LoRA: target both attention and FFN modules (q, k, v, o, gate, up, down).
Run a single overfit batch test before full training (model should memorize one batch).
Log the training config as YAML/JSON artifact for reproducibility.
Pin environment versions: torch==2.1.0, transformers==4.36.0, peft==0.7.0.

Completion Criteria

Training approach selected using decision tree with justification.
Training data prepared with correct chat template and formatting.
Training configuration documented with full hyperparameter spec & rationale.
Distributed training setup configured for available hardware.
Evaluation plan covers baseline, in-training, and post-training metrics.
Anti-pattern checks applied: overfitting, underfitting, data leakage, compute waste.
Experiment tracking configured (W&B/MLflow/TensorBoard).
Training cost estimated (compute hours, cloud costs).
Model registry entry drafted for versioning.

References

references/fine-tuning-guide.md — Fine-Tuning Guide
references/fine-tuning-strategies.md — Fine-Tuning Strategies
references/model-training-advanced.md — Model Training: Advanced Distributed & Scaling Topics
references/model-training-data-prep.md — Model Training Data Preparation
references/model-training-evaluation.md — Model Training Evaluation
references/model-training-fundamentals.md — Model Training: Core Concepts & Training Loop Fundamentals
references/rlhf-dpo.md — RLHF & DPO
references/training-pipeline.md — Training Pipeline
references/hyperparameter-optimization.md — Hyperparameter Optimization
references/production-experiment-tracking.md — Production Experiment Tracking & Model Registry
references/distributed-training-architecture.md — Distributed Training Architecture
references/training-infrastructure-design.md — Training Infrastructure Design
references/anti-patterns-troubleshooting.md — Anti-Patterns & Troubleshooting

Handoff

For model evaluation and testing, hand off to ai-ai-testing. For serving the fine-tuned model, hand off to ml-model-serving. For embedding model training, hand off to ai-embeddings. For data labeling / curation, hand off to ai-data-engineering.