vla-probabilistic-chunk-masking

By hiyenwong. Updated 6/3/2026

name: vla-probabilistic-chunk-masking
category: research
created: "2026-05-19"
source: "arXiv:2605.16154v1"
description: Drop-in GRPO modification that allocates gradient computation to a small, probabilistically selected subset of trajectory chunks using success-failure action variance. Achieves 2.38x wall-clock speedup while matching final performance.
tags: [rl, grpo, vla, efficient-training, chunk-masking]

Probabilistic Chunk Masking (PCM) for Efficient VLA RL

Source: arXiv:2605.16154v1 - "Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking"

Summary

Proposes a drop-in modification to GRPO that uses success-failure action variance as a proxy for per-phase gradient variance, sampling only the most informative chunks for gradient computation. Achieves 2.38x wall-clock speedup while matching standard GRPO's final success rate, backpropagating through <20% of trajectory chunks.

Core Methodology

Key Insight

In GRPO-based VLA RL, gradient computation (78% of wall-clock time) dominates rollout collection (21%). However, GRPO assigns the same advantage to every chunk of a trajectory, so compute is wasted on phases the policy already handles well. Only phases where successful and failed rollouts diverge produce a learning signal.
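
To make the uniform-advantage point concrete, here is a minimal sketch of group-relative advantages being copied to every chunk of a rollout. The function names, the binary rewards, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_chunks(advantages, chunks_per_rollout):
    """Every chunk of rollout i inherits rollout i's single scalar advantage."""
    return [[a] * n for a, n in zip(advantages, chunks_per_rollout)]

# Hypothetical group of four rollouts with binary task success as the reward.
rewards = [1.0, 0.0, 1.0, 0.0]
adv = grpo_advantages(rewards)
per_chunk = broadcast_to_chunks(adv, [3, 3, 3, 3])
```

Each inner list of `per_chunk` holds one repeated value, so a chunk the policy already handles well is weighted exactly like the chunk where success and failure diverge.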

Algorithm

  1. Success-Failure Action Variance: Compute per-phase action variance between successful and failed rollouts
    • Proxy for per-phase gradient variance
    • Identifies where learning signal is concentrated
  2. Probabilistic Chunk Selection: Sample a fixed chunk budget using online-updated phase-level keep probabilities
    • High-variance phases are more likely to be selected
    • Maintains exploration via stochastic selection
  3. Selective Backpropagation: Only compute gradients for selected chunks
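
The three steps above can be sketched as follows. This is one possible reading, not the paper's implementation. It makes several assumptions:

  • The divergence score (squared distance between the mean successful and mean failed action per phase) stands in for the success-failure action variance.
  • The keep probabilities are updated with an exponential moving average and a floor.
  • Chunks are drawn by weighted sampling without replacement.
  • All function and parameter names are hypothetical.
  • The group contains at least one successful and one failed rollout.

```python
import random

def phase_divergence(success_actions, failure_actions):
    """Per-phase squared distance between mean success and mean failure action.

    Each argument is a list of rollouts; a rollout is a list of phases; a
    phase is one action vector (list of floats). Assumed proxy, see lead-in.
    """
    num_phases = len(success_actions[0])
    scores = []
    for p in range(num_phases):
        score = 0.0
        for d in range(len(success_actions[0][p])):
            mu_s = sum(r[p][d] for r in success_actions) / len(success_actions)
            mu_f = sum(r[p][d] for r in failure_actions) / len(failure_actions)
            score += (mu_s - mu_f) ** 2
        scores.append(score)
    return scores

def update_keep_probs(keep_probs, scores, momentum=0.9, floor=0.05):
    """Move keep probabilities toward the normalized scores (EMA with a floor)."""
    total = sum(scores)
    n = len(scores)
    target = [s / total for s in scores] if total > 0 else [1.0 / n] * n
    mixed = [momentum * k + (1 - momentum) * t for k, t in zip(keep_probs, target)]
    floored = [max(m, floor) for m in mixed]  # keeps every phase selectable
    z = sum(floored)
    return [f / z for f in floored]

def sample_chunks(chunk_phases, keep_probs, budget, rng):
    """Draw `budget` distinct chunk indices, weighted by each chunk's phase."""
    # Efraimidis-Spirakis: key = u ** (1 / weight), keep the largest keys.
    keys = [(rng.random() ** (1.0 / keep_probs[p]), i)
            for i, p in enumerate(chunk_phases)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:budget])

# Toy data: 3 phases, 2-D actions; outcomes diverge only in phase 1.
success = [[[0.0, 0.0], [1.0, 0.0], [0.5, 0.5]],
           [[0.0, 0.0], [1.0, 0.2], [0.5, 0.5]]]
failure = [[[0.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]]

scores = phase_divergence(success, failure)
probs = update_keep_probs([1 / 3] * 3, scores)
chunk_phases = [0, 1, 2] * 4  # 12 chunks, each tagged with its phase
selected = sample_chunks(chunk_phases, probs, budget=2, rng=random.Random(0))
```

In a training loop, only the chunks in `selected` would be passed through the forward-and-backward pass that builds the policy loss. The remaining chunks would be skipped entirely, which is where the compute and activation-memory savings would come from. The floor in `update_keep_probs` is one simple way to keep low-scoring phases occasionally sampled.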

Results

  • 3 LIBERO benchmarks: matches standard GRPO success rate
  • 2.38x wall-clock speedup
  • 4.8x faster gradient updates
  • 60% lower peak activation memory
  • <20% of trajectory chunks backpropagated
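
As a rough consistency check, Amdahl's law relates the gradient-update speedup to the end-to-end speedup. The sketch below assumes the 78% gradient share and the 4.8x factor describe the same run, and that everything outside gradient computation is unchanged. Neither assumption is stated in the summary, so treat the result as an idealized estimate rather than a reproduction of the reported figure.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only `accelerated_fraction` of the time gets `factor`x faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# 78% of wall-clock time is gradient computation, sped up 4.8x.
estimate = amdahl_speedup(0.78, 4.8)
print(f"idealized end-to-end speedup: {estimate:.2f}x")
```

The idealized estimate is in the same range as the reported 2.38x but somewhat higher, which would be consistent with per-step overhead that this simple model ignores.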

When to Use

  • GRPO-based VLA (vision-language-action) policy training
  • Any RL training where gradient computation dominates wall-clock time
  • Scenarios needing efficient RL post-training without reward models

Implementation Considerations

  • Drop-in modification — no architectural changes needed
  • Requires no reward model or learned critic
  • Online probability updates adapt to training dynamics
  • Works with any chunked trajectory format

Activation

probabilistic chunk masking, efficient GRPO, VLA RL, gradient variance, action variance

Install via CLI
npx skills add https://github.com/hiyenwong/ai_collection --skill vla-probabilistic-chunk-masking