advantage-collapse-grpo-avspo - SKILL.md Agent Skill

name: advantage-collapse-grpo-avspo
description: "Advantage Collapse in Group Relative Policy Optimization (GRPO): Diagnosis and Mitigation via Adaptive Virtual Sample Policy Optimization (AVSPO). Introduces the Advantage Collapse Rate (ACR) metric to diagnose training stagnation, and proposes AVSPO to inject virtual reward samples guided by real-time ACR monitoring. Use when: diagnosing GRPO training failures, improving LLM reasoning RL post-training, mitigating advantage collapse, ICML 2026 accepted. Activation: advantage collapse GRPO, AVSPO, ACR metric, GRPO diagnosis, virtual sample policy optimization, RLVR training stagnation."

Advantage Collapse in GRPO: Diagnosis and Mitigation via AVSPO

Source paper: arXiv:2605.21125 (ICML 2026)
Authors: Xixiang He, Qiyao Sun, Ao Cheng, Xingming Li, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu

Core Problem

GRPO (Group Relative Policy Optimization), a key algorithm in RLVR (Reinforcement Learning from Verifiable Rewards), suffers from advantage collapse: when all responses within a group receive the same reward (all correct or all incorrect), every reward equals the group mean, so the computed advantages are zero or near-zero. Such a group contributes no useful gradient, and training stagnates when these groups dominate.

Key Contributions

1. Advantage Collapse Rate (ACR)

Presented as the first diagnostic metric for GRPO training, it quantifies how much of the training batch yields ineffective gradients:

ACR = proportion of groups where reward variance is below a critical threshold
Strongly predicts training stagnation and final performance across 0.5B–14B parameter models on math reasoning benchmarks
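
The definition above can be written out as a formula. This is a sketch of one reading of it: the symbols B (groups per batch), G (responses per group), r_{b,i} (reward of response i in group b), and τ (the variance threshold) are notation assumed here, not taken from the paper.

```latex
\[
\mathrm{ACR} \;=\; \frac{1}{B} \sum_{b=1}^{B}
  \mathbf{1}\!\left[\, \operatorname{Var}\!\left(r_{b,1}, \dots, r_{b,G}\right) < \tau \,\right]
\]
```

Under this reading, ACR = 0 means every group carries a usable learning signal, and ACR = 1 means every group in the batch has collapsed.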

2. Adaptive Virtual Sample Policy Optimization (AVSPO)

A lightweight extension of GRPO that:

Injects virtual reward samples into homogeneous groups
Guided by real-time ACR monitoring
Enables learning from homogeneous groups without additional model rollouts
Reduces advantage collapse by 58–63% relative to GRPO
Yields consistent accuracy gains of 4–6 percentage points across all model scales

Algorithm Design

Standard GRPO Advantage

Advantage_i = (reward_i - mean(group_rewards)) / std(group_rewards)

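When all rewards in a group are identical → every numerator is 0 and std(group_rewards) is 0 → with a small epsilon added to the denominator, advantage = 0 for all responses → no gradient from that group.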
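
The collapse is easy to reproduce on a single group. The sketch below uses only the Python standard library; the helper name `grpo_advantages`, the epsilon value, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    """Standardize rewards within one group (hypothetical helper).

    Uses the population standard deviation; eps avoids division by zero.
    """
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Mixed outcomes: correct and incorrect responses get opposite-signed advantages.
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Homogeneous outcomes: every reward equals the mean, so all advantages are zero.
collapsed = grpo_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:    ", mixed)
print("collapsed:", collapsed)
```

The homogeneous group yields a zero advantage for every response, which is exactly the case AVSPO targets.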

AVSPO Extension

1. Monitor ACR during training in real time
2. When ACR exceeds a threshold (too many homogeneous groups in the batch):
   - Generate virtual samples with perturbed rewards
   - Virtual reward = original reward + small Gaussian noise
   - Use confidence-weighted interpolation between real and virtual advantages
3. Apply the standard GRPO update with the augmented advantages

Training Pipeline

for each training step:
  1. Sample prompt → generate group of G responses
  2. Compute verifiable rewards for all responses
  3. Compute ACR for current batch
  4. if ACR > threshold:
       inject virtual reward samples
       compute augmented advantages
     else:
       compute standard GRPO advantages
  5. Update policy via GRPO loss
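
Steps 3 and 4 of the pipeline can be sketched without any deep-learning framework. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the thresholds, the noise scale, and the linear ramp for the interpolation weight are all hypothetical, and a hard-coded toy batch stands in for rollouts and reward computation (steps 1 and 2). The policy update (step 5) is omitted.

```python
import random
import statistics

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (population std, eps for stability)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def batch_acr(batch_rewards, var_threshold=0.01):
    """Fraction of groups whose reward variance is below var_threshold."""
    flags = [statistics.pvariance(g) < var_threshold for g in batch_rewards]
    return sum(flags) / len(flags)

def avspo_step(batch_rewards, acr_threshold=0.3, var_threshold=0.01,
               noise_std=0.1, rng=None):
    """Compute ACR, then real or augmented advantages for each group."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    acr = batch_acr(batch_rewards, var_threshold)              # step 3
    # Assumed ramp: weight on virtual advantages grows with excess ACR.
    alpha = min(1.0, max(0.0, (acr - acr_threshold) * 5.0))
    advantages = []
    for group in batch_rewards:
        adv = group_advantages(group)
        homogeneous = statistics.pvariance(group) < var_threshold
        if acr > acr_threshold and homogeneous:                # step 4
            virtual = [r + rng.gauss(0.0, noise_std) for r in group]
            virtual_adv = group_advantages(virtual)
            adv = [(1 - alpha) * a + alpha * v
                   for a, v in zip(adv, virtual_adv)]
        advantages.append(adv)
    return acr, advantages

# Toy stand-in for steps 1-2: two collapsed groups and one mixed group.
batch = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
acr, adv = avspo_step(batch)
print(f"ACR = {acr:.3f}")
for group_adv in adv:
    print([round(a, 3) for a in group_adv])
```

In this sketch the mixed group keeps its real advantages, while the two collapsed groups receive non-zero advantages derived from the perturbed rewards.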

Implementation Points

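import torch

# ACR computation
def compute_acr(rewards, threshold=0.01):
    # rewards: float tensor of shape [batch_size, group_size], one row per prompt
    variances = rewards.var(dim=1)  # unbiased variance per group
    # Fraction of groups whose reward variance falls below the threshold
    acr = (variances < threshold).float().mean()
    return acr.item()

# AVSPO virtual sample injection
def avspo_advantages(rewards, acr, acr_threshold=0.3, var_threshold=0.01, noise_std=0.1):
    # Standard GRPO advantages: standardize rewards within each group
    advantages = (rewards - rewards.mean(dim=1, keepdim=True)) / (rewards.std(dim=1, keepdim=True) + 1e-8)
    if acr > acr_threshold:
        # Inject virtual samples: real rewards plus small Gaussian noise
        noise = torch.randn_like(rewards) * noise_std
        virtual_rewards = rewards + noise
        virtual_advantages = (virtual_rewards - virtual_rewards.mean(dim=1, keepdim=True)) / (virtual_rewards.std(dim=1, keepdim=True) + 1e-8)
        # Confidence-weighted interpolation: alpha ramps from 0 to 1 as ACR rises above the threshold
        alpha = min(1.0, (acr - acr_threshold) * 5.0)
        mixed = (1 - alpha) * advantages + alpha * virtual_advantages
        # Restrict injection to homogeneous groups; groups with real reward
        # variance keep their standard advantages (var_threshold is an assumed default)
        collapsed = rewards.var(dim=1, keepdim=True) < var_threshold
        advantages = torch.where(collapsed, mixed, advantages)
    return advantages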
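
The interpolation in the snippet above can be read as the following pair of formulas. The constants 0.3 and 5 are the snippet's default threshold and ramp slope, not values confirmed by the paper, and the superscripts "real" and "virtual" are notation assumed here.

```latex
\[
\alpha = \min\bigl(1,\; 5\,(\mathrm{ACR} - 0.3)\bigr),
\qquad
A_i = (1 - \alpha)\, A_i^{\mathrm{real}} + \alpha\, A_i^{\mathrm{virtual}}
\quad \text{for } \mathrm{ACR} > 0.3 .
\]
```

With these defaults, α is 0.5 at ACR = 0.4 and saturates at 1 once ACR reaches 0.5, so the virtual advantages fully replace the real ones whenever at least half of the groups have collapsed.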

Key Results

Metric	GRPO	AVSPO	Improvement
Advantage Collapse (ACR)	Baseline	58–63% lower	58–63% reduction
Accuracy (0.5B)	Baseline	+4–6 pp	Consistent gains
Accuracy (7B)	Baseline	+4–6 pp	Consistent gains
Accuracy (14B)	Baseline	+4–6 pp	Consistent gains
OOD Generalization	Baseline	Maintained	No degradation

Application Scenarios

LLM reasoning RL post-training: Wherever GRPO is currently used (math, code)
Any GRPO-based system: Transparent plug-in for existing GRPO trainers
Diagnosing training failures: Use ACR to detect when GRPO training is stuck
Multi-task RLVR: Tasks with varying reward distributions benefit from adaptive monitoring
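
For the diagnosis scenario, one simple option is to track a rolling average of the per-step ACR and flag a run when it stays high. The class below is a hypothetical helper written for illustration: its name, the window size, and the 0.5 alert threshold are assumptions, not values from the paper.

```python
from collections import deque

class ACRMonitor:
    """Rolling mean of per-step ACR with a simple stagnation flag."""

    def __init__(self, window=50, alert_threshold=0.5):
        self.values = deque(maxlen=window)  # keeps only the latest `window` values
        self.alert_threshold = alert_threshold

    def update(self, acr):
        """Record the ACR of the latest training step and return the rolling mean."""
        self.values.append(acr)
        return self.rolling_mean()

    def rolling_mean(self):
        if not self.values:
            return 0.0
        return sum(self.values) / len(self.values)

    def is_stuck(self):
        """True once the window is full and its mean ACR exceeds the threshold."""
        window_full = len(self.values) == self.values.maxlen
        return window_full and self.rolling_mean() > self.alert_threshold

# Usage: feed in the ACR computed at each training step.
monitor = ACRMonitor(window=3, alert_threshold=0.5)
for step_acr in [0.2, 0.6, 0.7, 0.8]:
    monitor.update(step_acr)
print("stuck:", monitor.is_stuck())
```

Waiting for a full window before raising the flag avoids false alarms from a single unlucky batch early in training.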

Related Skills

[[gcpo-cooperative-policy-optimization]] - Cooperative alternative to GRPO
[[learning-zone-energy-data-selection]] - Data selection for efficient RL post-training
[[d2evo-dual-difficulty-self-evolution]] - Difficulty-aware sample selection for RL

Activation Keywords

advantage collapse, GRPO, AVSPO, ACR, RLVR, virtual sample policy optimization, GRPO diagnosis, LLM reasoning RL, group relative policy optimization, training stagnation, ICML 2026