colab-unsloth-finetune - SKILL.md Agent Skill

name: colab-unsloth-finetune
description: >-
  Orchestrate autonomous LLM fine-tuning on Google Colab's free GPU tier using Unsloth, driven from VS Code (or Codex Desktop App) with sequential command execution and self-correcting error recovery.

Colab Unsloth Finetune — Autonomous LLM Fine-Tuning via Google Colab + Unsloth

Orchestrate autonomous LLM fine-tuning on Google Colab's free GPU tier using Unsloth, driven from VS Code (or Codex Desktop App) with sequential command execution and self-correcting error recovery.

Trigger

Use when the user asks to "fine-tune on Colab", "Colab Unsloth training", "free GPU fine-tuning", "autonomous Colab training", "overnight Colab fine-tune", "Colab으로 학습", "Unsloth 파인튜닝", "무료 GPU 학습", "코랩 파인튜닝", "자율 파인튜닝", "colab-unsloth-finetune", or wants to run LLM fine-tuning on Google Colab using Unsloth without paying for GPU.

Do NOT use for RunPod/HF Jobs GPU training (use runpod-pods or hf-model-trainer). Do NOT use for local GPU training without Colab (use hf-model-trainer with local setup). Do NOT use for model evaluation only (use hf-evaluation). Do NOT use for inference endpoint deployment (use hf-endpoints).

Prerequisites

Requirement	How to Verify
Google account with Drive access	Browser login at drive.google.com
VS Code + Google Colab extension	`code --list-extensions | grep -i colab` or install `googlecolab.google-colab`
Chrome with Colab extension (optional)	Check chrome://extensions for "Open in Colab"
Dataset in JSONL format	`wc -l <dataset>.jsonl` + `head -1 <dataset>.jsonl | python3 -m json.tool`
Sufficient Google Drive space	Dataset size + ~2x for model checkpoints
HuggingFace token (for gated models)	`hf whoami` or check `HF_TOKEN` env var

Workflow — 10-Phase Pipeline

Phase 1: Prepare Dataset

Validate the JSONL dataset locally:

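python3 -c "
import json, os
path = '<dataset>.jsonl'
# Read non-blank lines only, so a trailing newline is not counted as a sample
with open(path, encoding='utf-8') as f:
    lines = [line for line in f if line.strip()]
print(f'Total samples: {len(lines)}')
sample = json.loads(lines[0])
print(f'Keys: {list(sample.keys())}')
# os.path.getsize reports the on-disk size; sys.getsizeof would measure the Python string object
print(f'File size: {os.path.getsize(path) / 1024 / 1024:.1f} MB')
"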

Ensure the dataset follows one of Unsloth's supported formats:
- ShareGPT: {"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
- Alpaca: {"instruction": "...", "input": "...", "output": "..."}
- Raw text: {"text": "..."}
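
The three formats above can be told apart by their top-level keys. The following is a minimal sketch of such a check, not part of Unsloth: `detect_format` and `count_formats` are hypothetical helper names, and the rules simply mirror the key layouts listed above.

```python
import json


def detect_format(record: dict) -> str:
    """Classify one JSONL record as 'sharegpt', 'alpaca', 'text', or 'unknown'."""
    convs = record.get("conversations")
    if isinstance(convs, list) and convs and all(
        isinstance(turn, dict) and "from" in turn and "value" in turn
        for turn in convs
    ):
        return "sharegpt"
    if "instruction" in record and "output" in record:
        return "alpaca"
    if isinstance(record.get("text"), str):
        return "text"
    return "unknown"


def count_formats(path: str) -> dict:
    """Count detected formats across every non-blank line of a JSONL file."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                fmt = detect_format(json.loads(line))
                counts[fmt] = counts.get(fmt, 0) + 1
    return counts
```

A file that reports more than one format, or any `unknown` records, likely needs cleaning before Phase 8, since the formatting function there assumes a single layout.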

Phase 2: Upload Dataset to Google Drive

Upload via gws-drive skill or gws drive upload:

gws drive upload <dataset>.jsonl --folder "Colab_Training"

Or upload manually via drive.google.com
Note the Drive path for later: /content/drive/MyDrive/Colab_Training/<dataset>.jsonl

Phase 3: Create Google Colab Notebook

- Go to colab.research.google.com → New Notebook
- Runtime → Change runtime type → T4 GPU (the free-tier GPU; sufficient for ≤8B models with 4-bit LoRA)
- Name the notebook descriptively, e.g. unsloth-finetune-YYYY-MM-DD.ipynb
- Save to Google Drive
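
As a rough sanity check on the "≤8B models with 4-bit LoRA" guideline, the arithmetic below estimates the memory taken by the quantized weights alone. It is a back-of-the-envelope sketch: `weight_memory_gib` is a hypothetical helper, and the estimate deliberately ignores activations, LoRA parameters, optimizer state, quantization constants, and the CUDA context, all of which add to real usage.

```python
def weight_memory_gib(n_params: float, bits_per_param: int = 4) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Compare a few model sizes at 4-bit precision
for billions in (4, 8, 14):
    gib = weight_memory_gib(billions * 1e9)
    print(f"{billions}B params at 4-bit: ~{gib:.1f} GiB of weights")
```

The remaining VRAM has to cover everything the estimate leaves out, which is why batch size and sequence length still matter on a T4.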

Phase 4: Connect VS Code to Colab (Optional — for Codex/Agent-Driven Execution)

- In VS Code, install the Google Colab extension (googlecolab.google-colab)
- Copy the Colab notebook URL
- Paste the URL into VS Code / Codex Desktop App
- This enables the coding agent to execute cells sequentially in the Colab runtime

Alternative: Use the Chrome "Open in Colab" extension for direct browser-based execution.

Phase 5: Mount Google Drive

Execute in the first Colab cell:

from google.colab import drive
drive.mount('/content/drive')

Authorize via the OAuth popup.

Phase 6: Download Dataset from Drive

import shutil
shutil.copy(
    '/content/drive/MyDrive/Colab_Training/<dataset>.jsonl',
    '/content/<dataset>.jsonl'
)

Verify:

!wc -l /content/<dataset>.jsonl
!head -1 /content/<dataset>.jsonl
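
When an agent drives the notebook, a programmatic check can be easier to act on than reading shell output. The sketch below compares byte size and line count between the Drive copy and the local copy; `count_lines` and `verify_copy` are hypothetical helper names, not part of Colab or Unsloth.

```python
import os


def count_lines(path: str) -> int:
    """Count lines by iterating the file in binary mode."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def verify_copy(src: str, dst: str) -> bool:
    """Return True if dst has the same byte size and line count as src."""
    return (
        os.path.getsize(src) == os.path.getsize(dst)
        and count_lines(src) == count_lines(dst)
    )
```

In the notebook this would be called with the two paths from the `shutil.copy` step above.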

Phase 7: Install Unsloth

%%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

Verify installation:

import unsloth
print(f"Unsloth version: {unsloth.__version__}")

Phase 8: Configure and Run Training

from unsloth import FastLanguageModel
import torch

# 1. Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="<model_id>",  # e.g. "unsloth/Qwen3-4B-bnb-4bit"
    max_seq_length=2048,
    dtype=None,               # auto-detect
    load_in_4bit=True,
)

# 2. Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# 3. Prepare dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="/content/<dataset>.jsonl", split="train")

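# 4. Format dataset (adapt to your format)
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",
    # Map ShareGPT keys ("from"/"value", "human"/"gpt") onto the template's roles
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)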

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(
        convo, tokenize=False, add_generation_prompt=False
    ) for convo in convos]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

# 5. Train
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="/content/outputs",
        report_to="none",
    ),
)

gpu_stats = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu_stats.name}, VRAM: {gpu_stats.total_memory / 1024**3:.1f} GB")

trainer_stats = trainer.train()
print(f"Training completed in {trainer_stats.metrics['train_runtime']:.0f}s")
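
With `logging_steps=1`, the log prints one line per optimizer step, so it helps to know roughly how many steps to expect. The sketch below works through the arithmetic for the settings above (batch size 2, gradient accumulation 4, one epoch). `estimate_optimizer_steps` is a hypothetical helper, and the result is an approximation: how the final partial batch is rounded can differ between `transformers` versions.

```python
import math


def estimate_optimizer_steps(n_samples, per_device_batch=2, grad_accum=4,
                             epochs=1, n_gpus=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective, total_steps = estimate_optimizer_steps(1000)
print(f"Effective batch size: {effective}, approx. optimizer steps: {total_steps}")
```

For a 1,000-sample dataset this gives an effective batch size of 8 and about 125 steps, of which the first 5 are warmup under the configuration above.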

Phase 9: Save and Export Model

# Save to Drive
model.save_pretrained("/content/drive/MyDrive/Colab_Training/finetuned-model")
tokenizer.save_pretrained("/content/drive/MyDrive/Colab_Training/finetuned-model")

# Optional: Push to HuggingFace Hub
model.push_to_hub("<your-hf-username>/<model-name>", token="<HF_TOKEN>")
tokenizer.push_to_hub("<your-hf-username>/<model-name>", token="<HF_TOKEN>")

# Optional: Export as GGUF for local inference
model.save_pretrained_gguf(
    "/content/drive/MyDrive/Colab_Training/finetuned-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

Phase 10: Verify Results

# Quick inference test
FastLanguageModel.for_inference(model)
inputs = tokenizer(["<test_prompt>"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(outputs))

Error Recovery Patterns

Error	Cause	Fix
`CUDA out of memory`	Batch size too large or model too big for free T4 (15GB)	Reduce `per_device_train_batch_size` to 1, enable `gradient_checkpointing`, use 4-bit quantization
`Drive mount timeout`	OAuth session expired	Re-run `drive.mount('/content/drive', force_remount=True)`
`ModuleNotFoundError: No module named 'unsloth'`	Installation failed silently	Run `!pip install unsloth` again, check for version conflicts with `!pip list | grep -i unsloth`
`RuntimeError: Expected CUDA`	GPU runtime not selected	Runtime → Change runtime type → T4 GPU
`Connection to runtime lost`	Colab session timeout (free tier: ~90min idle, ~12h max)	Reconnect and resume with `trainer.train(resume_from_checkpoint=True)`; checkpoints survive a recycled runtime only if `output_dir` points to Drive
`Dataset loading error`	JSONL format mismatch	Validate with `head -1 <dataset>.jsonl | python3 -m json.tool`, check encoding (UTF-8)
`tokenizer.apply_chat_template error`	Wrong chat template for model	Check model card for correct template; try `"qwen-2.5"` or `"chatml"`
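
For agent-driven runs, the table above can be encoded as a small lookup from error text to a recovery action. This is only an illustrative sketch: the action names and the `suggest_recovery` helper are hypothetical, and the patterns are the error strings from the table plus the standard Python wording for a missing module.

```python
import re

# (regex pattern, hypothetical action name) pairs, checked in order
RECOVERY_RULES = [
    (r"CUDA out of memory", "reduce_batch_size"),
    (r"Drive mount timeout", "remount_drive"),
    (r"No module named .unsloth|Module not found: unsloth", "reinstall_unsloth"),
    (r"Expected CUDA", "select_gpu_runtime"),
    (r"Connection to runtime lost", "reconnect_and_resume"),
    (r"Dataset loading error|JSONDecodeError", "validate_jsonl"),
    (r"apply_chat_template", "check_chat_template"),
]


def suggest_recovery(stderr_text: str) -> str:
    """Return the first matching recovery action, or escalate if none match."""
    for pattern, action in RECOVERY_RULES:
        if re.search(pattern, stderr_text, re.IGNORECASE):
            return action
    return "escalate_to_user"
```

Falling back to `escalate_to_user` keeps the agent from retrying blindly on errors the table does not cover.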

Free Tier Constraints

Constraint	Limit	Workaround
GPU type	T4 (15GB VRAM)	Use 4-bit quantization + LoRA (fits ≤8B models)
Session duration	~12h continuous, ~90min idle timeout	Use checkpointing, resume from saved state
Storage	~78GB disk, 12GB RAM	Stream large datasets; note that `torch.cuda.empty_cache()` frees cached GPU memory only, not disk or system RAM
Daily GPU quota	~12h/day (varies)	Start training early, use Colab Pro for more quota
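
The checkpoint-and-resume workaround depends on finding the most recent checkpoint after a session ends. The Hugging Face `Trainer` writes checkpoints as `checkpoint-<step>` directories under `output_dir`. The sketch below locates the newest one; `latest_checkpoint` is a hypothetical helper, and any Drive path passed to it is an assumption, since the training cell above writes to `/content/outputs`, which does not survive a recycled runtime.

```python
import os
import re


def latest_checkpoint(output_dir: str):
    """Return the path of the highest-numbered checkpoint-<step> dir, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The returned path can be passed as `trainer.train(resume_from_checkpoint=...)`; a value of `None` starts training from scratch.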

Autonomous Agent Execution Pattern

When driven by a coding agent (Codex, Cursor, Claude):

- Cell-by-cell execution: The agent pastes and runs each code block sequentially
- Output inspection: After each cell, the agent reads stdout/stderr for errors
- Self-correction: On error, the agent diagnoses the issue and applies the fix from the Error Recovery table
- Progress tracking: Training logs (loss, learning_rate, epoch) are printed every logging_steps
- Completion verification: Agent confirms training finished by checking for saved model files in Drive

This pattern enables overnight autonomous fine-tuning — start the process before sleep, wake up to a trained model.
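
The completion check can be sketched as a simple file test. It assumes the model was saved as a LoRA adapter via `save_pretrained`, in which case PEFT writes an `adapter_config.json` alongside the adapter weights (`adapter_model.safetensors`, or `adapter_model.bin` in older versions). `training_artifacts_present` is a hypothetical helper name.

```python
import os


def training_artifacts_present(model_dir: str) -> bool:
    """Return True if model_dir holds a LoRA adapter config and weights file."""
    if not os.path.isdir(model_dir):
        return False
    files = set(os.listdir(model_dir))
    has_config = "adapter_config.json" in files
    # Assumption: weights are stored under one of these two PEFT file names
    has_weights = bool(files & {"adapter_model.safetensors", "adapter_model.bin"})
    return has_config and has_weights
```

An agent would call this on the Drive directory used in Phase 9 before reporting the run as finished.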

Composition with Existing Skills

Phase	Composable Skill	Purpose
Dataset prep	`hf-datasets`	Create/validate HF-format datasets
Dataset upload	`gws-drive`	Upload to Google Drive via CLI
Model search	`hf-models`	Find the right base model on Hub
Post-training eval	`hf-evaluation`	Run benchmarks on the fine-tuned model
Model upload	`hf-cli`	Push to HuggingFace Hub
GGUF conversion	`hf-model-trainer`	Reference for GGUF export patterns
Result distribution	`x-to-slack`	Post training results to Slack