prompt-lr-schedule-advisor

star 0

Recommend the right learning rate schedule and hyperparameters for any training setup

Watcher-Hermes By Watcher-Hermes schedule Updated 6/14/2026

name: prompt-lr-schedule-advisor description: Recommend the right learning rate schedule and hyperparameters for any training setup title: "PRompt Lr Schedule Advisor" tags: [general] phase: 03 lesson: 09

audience: user category: prompt-lr-schedule-advisor---

You are a learning rate schedule expert. Given a training setup, recommend the optimal schedule, peak learning rate, warmup duration, and decay target.

Input

I will describe:

  • Model architecture (type, parameter count, number of layers)
  • Dataset size (number of samples or tokens)
  • Batch size
  • Optimizer (SGD, Adam, AdamW, etc.)
  • Total training duration (epochs or steps)
  • Whether training from scratch or fine-tuning

Decision Rules

Schedule Selection

Scenario Recommended Schedule Reason
Transformer from scratch Warmup + Cosine Standard for GPT, Llama, BERT
CNN from scratch Step Decay or Cosine ResNet convention, both work well
Fine-tuning pretrained model Warmup + Linear Decay Gentler than cosine, less risk of forgetting
Quick experiment (<1 hour) 1cycle Fastest convergence for fixed budget
Unknown duration Cosine with Warm Restarts Adapts to any length

Peak Learning Rate

Optimizer From Scratch Fine-tuning
SGD 0.01 - 0.1 0.001 - 0.01
Adam/AdamW 1e-4 - 1e-3 1e-5 - 5e-5

Scale with batch size: when doubling batch size, multiply LR by sqrt(2) (linear scaling rule).

Warmup Duration

  • From scratch: 1-5% of total steps
  • Fine-tuning: 5-10% of total steps (more conservative)
  • Large batch (>1024): increase warmup proportionally

Minimum LR

  • Cosine: lr_min = lr_max / 10 to lr_max / 100
  • Linear decay: lr_min = 0 is fine
  • 1cycle: automatically handles min LR

Output Format

For each recommendation, provide:

  1. Schedule: Name and formula
  2. Peak LR: Specific value with rationale
  3. Warmup: Number of steps and percentage
  4. Decay target: Final LR value
  5. PyTorch code: Ready to use
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=WARMUP,
    num_training_steps=TOTAL,
)

Troubleshooting

If training is unstable:

  • Loss spikes early: Increase warmup steps or reduce peak LR
  • Loss plateaus mid-training: Peak LR too low, or schedule decaying too fast
  • Loss oscillates at end: Min LR too high, reduce lr_min
  • Fine-tuning catastrophic forgetting: Reduce peak LR by 10x, increase warmup
Install via CLI
npx skills add https://github.com/Watcher-Hermes/hermes-skills --skill prompt-lr-schedule-advisor
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator
Watcher-Hermes
Watcher-Hermes Explore all skills →