name: training-colab description: Execute model training workflows on Google Colab with Google Drive integration for the Cherry Processing System.
Training on Google Colab Skill
This skill enables efficient model training using Google Colab's GPU resources while maintaining code and data synchronization with the local project via Google Drive.
When to Use
- Need GPU acceleration for model training
- Training experiments that exceed local hardware capabilities
- Long-running training jobs that benefit from cloud execution
- Testing hyperparameter variations at scale
Prerequisites
- Google account with Colab Pro (recommended) or Colab Free
- Google Drive with project folder structure
- Training data uploaded to Google Drive
- Local project configured for Drive sync (see
docs/reference/colab-pro-setup.md)
Workflow
Step 1: Prepare Training Configuration
Create or update a training config in training/configs/:
# training/configs/experiment_name.yaml
model:
name: resnet50
num_classes: 2
pretrained: true
training:
epochs: 100
batch_size: 32
learning_rate: 0.001
optimizer: adam
data:
train_path: /content/drive/MyDrive/traina/data/train
val_path: /content/drive/MyDrive/traina/data/val
augmentation:
enabled: true
normalize: false # Set based on production requirements
Step 2: Sync Code to Google Drive
Before syncing, perform a Pre-flight Check:
- Run a local smoke test (e.g.,
python smoke_test.pyor setDRY_RUN=Truein your notebook). - Verify that 1 epoch runs on CPU with a few batches.
- Only sync after local verification passes.
# From project root
./training/scripts/sync_to_drive.sh
Step 3: Launch Colab Notebook
- Open
training/notebooks/colab_training.ipynbin Google Colab - Connect to GPU runtime (Runtime → Change runtime type → GPU)
- Mount Google Drive:
from google.colab import drive drive.mount('/content/drive') - Execute training cells
Step 4: Download Results
After training completes:
# Sync trained models back from Drive
./training/scripts/sync_from_drive.sh \
--source "drive/MyDrive/traina/experiments/" \
--dest "training/experiments/"
Configuration Options
Environment Modes
| Mode | Description | Use Case |
|---|---|---|
| Development | Quick iterations, small dataset | Testing configs, debugging |
| Training | Full dataset, GPU acceleration | Production model training |
| Evaluation | Validation/test metrics only | Assessing trained models |
GPU Optimization
# Enable mixed precision for faster training
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for batch in dataloader:
optimizer.zero_grad()
with autocast():
outputs = model(batch)
loss = criterion(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Best Practices
Version Control
- Git tracks code and configs locally
- Drive stores datasets and model weights
- Never commit large
.ptfiles to git
Experiment Tracking
- Use descriptive experiment names
- Log all hyperparameters
- Save training curves and metrics
Data Management
- Keep training data in Drive under versioned folders
- Use symbolic links for large datasets
- Validate data integrity before training
Cost Optimization
- Colab Free: Limited GPU hours per day
- Colab Pro: Faster GPUs, longer sessions
- Monitor runtime to avoid losing progress
"Run All" Configuration Pattern
- Use config cells at the top of notebooks with skip flags
- Allows selective experiment execution when using "Run All"
- Example pattern:
# === EXPERIMENT CONFIGURATION === DRY_RUN = False # Smoke test mode SKIP_EXPERIMENT_1 = True # Skip if inconclusive/slow THRESHOLD_MODEL = 'production' # Model selection- Each experiment cell checks its skip flag before executing
- Ensures reproducible "Run All" behavior
Troubleshooting
| Issue | Solution |
|---|---|
| Drive not mounting | Re-authenticate, check permissions |
| CUDA out of memory | Reduce batch_size, use gradient accumulation |
| Training interrupted | Enable checkpointing, resume from last epoch |
| Slow data loading | Pre-cache dataset to local Colab storage |