lionmuon-optimizer

LionMuon optimizer methodology - alternating between spectral (Muon) and sign-based (Lion) updates on a fixed period for compute-efficient large-scale training

By hiyenwong. Updated 6/3/2026

name: lionmuon-optimizer
category: deep-learning
description: LionMuon optimizer methodology - alternating between spectral (Muon) and sign-based (Lion) updates on a fixed period for compute-efficient large-scale training
trigger: LionMuon, sign-based optimizer, Muon optimizer, spectral descent, sign descent, alternating optimizer, optimizer period, dual-EMA momentum

LionMuon: Alternating Spectral and Sign Descent for Efficient Training

Methodology for combining Muon's strong spectral updates with Lion's cheap sign-based updates by alternating between them on a fixed period. The reported result is a better compute-versus-loss trade-off than either optimizer achieves alone.

Core Problem

Sign-based optimizers (Lion, Signum) produce cheap per-step updates but weaker update directions. Muon's spectral matrix-sign update gives a much stronger direction at a substantially higher per-step cost. Neither alone reaches the compute-loss Pareto frontier.

Key Methodology

1. Alternating Update Scheme

  • Alternate between Lion and Muon updates on a fixed period P
  • Lion step (cheap, fast): standard sign-based update with momentum
  • Muon step (expensive, strong): spectral matrix-sign via Newton-Schulz iteration
  • At P=2, LionMuon is reported to Pareto-dominate Muon, Lion, Signum, and AdamW at the 124M, 355M, and 720M parameter scales
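
The matrix sign used by the Muon step can be approximated without computing an SVD. The following is a minimal sketch, assuming the textbook cubic Newton-Schulz iteration X ← 1.5·X − 0.5·X·Xᵀ·X; it is not necessarily the polynomial or iteration count used by LionMuon or the reference code, and Muon implementations commonly use a tuned higher-order polynomial with only a handful of iterations. The helper names (`matmul`, `transpose`, `newton_schulz_sign`) are illustrative, and nested lists are used only to keep the example dependency-free.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix given as nested lists."""
    return [list(col) for col in zip(*a)]

def newton_schulz_sign(g, iters=30):
    """Approximate U V^T, where G = U S V^T is the SVD of a full-rank G."""
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(iters):
        # X <- 1.5 * X - 0.5 * (X X^T X): pushes each singular value toward 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

# G has singular values 2 and 1; its orthogonal factor is [[0, 1], [-1, 0]].
g = [[0.0, 2.0], [-1.0, 0.0]]
o = newton_schulz_sign(g)
```

The result keeps the singular vectors of the input and replaces every singular value with 1, which is why the step is more expensive than an elementwise sign: each iteration costs several matrix multiplications.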

2. Shared Dual-EMA Momentum Buffer

  • Both Lion and Muon share a single dual-EMA momentum buffer
  • Optimizer state memory matches Lion: one buffer per parameter, which is exactly half of AdamW's two (first and second moments)
  • No additional memory overhead despite alternating between two optimizer types
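
To make the "dual-EMA, single buffer" idea concrete, here is a minimal sketch of a Lion-style step on flat Python lists. It assumes the standard Lion rule (one EMA coefficient picks the update direction, a second one updates the stored buffer); how LionMuon's Muon step reads the same buffer is not specified here and is left out. The names `lion_step` and `state_bytes` are illustrative, and the memory figures assume state stored in 4-byte floats.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(params, grads, momentum, lr=1e-3, beta1=0.9, beta2=0.99):
    """One Lion-style step; `momentum` is the only state and is updated in place."""
    for i, (g, m) in enumerate(zip(grads, momentum)):
        # EMA 1: blend gradient and momentum only to choose the update direction.
        direction = beta1 * m + (1.0 - beta1) * g
        params[i] -= lr * sign(direction)
        # EMA 2: the stored buffer decays with a different coefficient.
        momentum[i] = beta2 * m + (1.0 - beta2) * g

def state_bytes(n_params, buffers_per_param, bytes_per_value=4):
    """Optimizer state size for a given number of per-parameter buffers."""
    return n_params * buffers_per_param * bytes_per_value

params = [0.5, -0.2, 0.0]
grads = [0.1, -0.3, 0.0]
momentum = [0.0, 0.0, 0.0]
lion_step(params, grads, momentum, lr=0.01)

lion_state = state_bytes(124_000_000, buffers_per_param=1)   # one momentum buffer
adamw_state = state_bytes(124_000_000, buffers_per_param=2)  # two moment buffers
```

Under these assumptions a 124M-parameter model needs about 0.5 GB of state with one buffer and about 1 GB with AdamW's two, which is the factor of two stated above.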

3. SignMuon Variant

  • Simpler single-EMA variant of LionMuon
  • Reported to outperform standalone Muon even in this simpler form
  • Useful as a lighter-weight alternative when dual-EMA is unnecessary

4. Complexity Theory

  • The paper proves sharp complexity bounds under heavy-tailed noise
  • The bounds are governed by period-averaged smoothness and noise
  • The resulting constants interpolate between Muon's and Lion's
  • The theory predicts a compute-optimal period P for a given architecture and dataset
  • It also gives conditions under which LionMuon outperforms both Muon and Lion

Implementation Pattern

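import torch

class LionMuon:
    """Alternating Lion-Muon optimizer skeleton with a shared momentum buffer."""
    def __init__(self, params, period=2, lr=1e-3, beta1=0.9, beta2=0.99):
        self.params = list(params)  # materialize: params may be a generator
        self.period, self.lr = period, lr
        self.beta1, self.beta2 = beta1, beta2
        # Shared momentum buffer: one tensor per parameter, so state size
        # matches Lion (half of AdamW). Both betas act on this single buffer.
        self.momentum = {p: torch.zeros_like(p) for p in self.params}
        self.step_count = 0

    def step(self):
        self.step_count += 1
        if self.step_count % self.period == 0:
            # Muon step: spectral matrix-sign (expensive but strong)
            self.muon_update()
        else:
            # Lion step: sign-based (cheap, fast)
            self.lion_update()

    def lion_update(self):
        raise NotImplementedError("sign-based update; see the linked code")

    def muon_update(self):
        raise NotImplementedError("Newton-Schulz matrix-sign update; see the linked code")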

When to Use

  • Large-scale model training where both compute efficiency and optimization quality matter
  • When AdamW's optimizer-state memory is a bottleneck (LionMuon keeps half as much state)
  • Training at 100M-1B+ parameter scale where optimizer cost becomes significant
  • When you want Muon-quality updates without paying full per-step cost

Key Parameters

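  • Period P: Controls the Lion:Muon ratio. In the pattern above, every P-th step is a Muon step and the other P-1 steps are Lion steps, so P=2 alternates every step and larger P means more Lion steps per Muon step.
  • Dual-EMA betas: (beta1, beta2) are the two EMA coefficients applied to the shared momentum buffer. In Lion, beta1 blends the current gradient with the momentum to form the update direction and beta2 sets the decay of the stored momentum, so their roles differ from AdamW's first- and second-moment betas. The defaults in the pattern above are (0.9, 0.99).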
  • Learning rate: Similar range to AdamW/Lion
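
The effect of the period on the step mix and on average optimizer cost can be worked out directly. This is a small sketch that assumes the schedule from the Implementation Pattern (Muon on every P-th step, counting from 1); the function names and the per-step cost numbers are hypothetical placeholders, not measurements.

```python
def step_kind(step, period):
    """Which update runs at a 1-indexed step: Muon on every period-th step, else Lion."""
    return "muon" if step % period == 0 else "lion"

def average_step_cost(period, lion_cost, muon_cost):
    """Mean optimizer cost per step over one period: (P - 1) Lion steps plus 1 Muon step."""
    return ((period - 1) * lion_cost + muon_cost) / period

# With P=4, steps 4 and 8 are Muon steps and the rest are Lion steps.
schedule = [step_kind(t, 4) for t in range(1, 9)]

# Hypothetical units: a Lion step costs 1.0 and a Muon step costs 5.0.
cost_p2 = average_step_cost(2, lion_cost=1.0, muon_cost=5.0)
cost_p4 = average_step_cost(4, lion_cost=1.0, muon_cost=5.0)
```

As P grows, the average cost approaches that of pure Lion while the share of spectral updates shrinks as 1/P, which is the trade-off the period is meant to tune.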

Performance

  • Reported to reach lower validation loss at lower compute than Muon, Lion, Signum, and AdamW
  • The advantage persists across the tested model scales (124M → 720M)
  • Optimizer state is 50% of AdamW's

Code

https://github.com/brain-lab-research/lion-muon

Reference

arXiv: 2605.19811v1 - "LionMuon: Alternating Spectral and Sign Descent for Efficient Training"

Install via CLI
npx skills add https://github.com/hiyenwong/ai_collection --skill lionmuon-optimizer