vla-probabilistic-chunk-masking - SKILL.md Agent Skill

name: vla-probabilistic-chunk-masking
category: research
created: "2026-05-19"
source: "arXiv:2605.16154v1"
description: Drop-in GRPO modification that allocates gradient computation to a small, probabilistically selected subset of trajectory chunks using success-failure action variance. Achieves 2.38x wall-clock speedup while matching final performance.
tags: [rl, grpo, vla, efficient-training, chunk-masking]

Probabilistic Chunk Masking (PCM) for Efficient VLA RL

Source: arXiv:2605.16154v1 - "Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking"

Summary

Proposes a drop-in modification to GRPO that uses success-failure action variance as a proxy for per-phase gradient variance, sampling only the most informative chunks for gradient computation. Achieves 2.38x wall-clock speedup while matching standard GRPO's final success rate, backpropagating through <20% of trajectory chunks.

Core Methodology

Key Insight

In GRPO-based VLA RL, gradient computation (~78% of wall-clock) dominates rollout collection (~21%). However, GRPO assigns the same advantage to every chunk, wasting compute on phases the policy already handles well. Only phases where successful and failed rollouts diverge produce learning signal.
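
To make the "same advantage to every chunk" point concrete, here is a minimal sketch of group-relative advantage computation. The function name `grpo_chunk_advantages`, the binary rewards, and the use of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def grpo_chunk_advantages(rewards, chunks_per_rollout, eps=1e-8):
    """Group-normalize rollout rewards, then copy each advantage to every chunk."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Every chunk of a rollout inherits that rollout's single scalar advantage.
    return [[a] * n for a, n in zip(advantages, chunks_per_rollout)]

# Hypothetical group of 4 rollouts with binary success rewards, 3 chunks each.
rewards = [1.0, 0.0, 1.0, 0.0]
per_chunk = grpo_chunk_advantages(rewards, [3, 3, 3, 3])
for reward, chunk_advs in zip(rewards, per_chunk):
    print(reward, [round(a, 3) for a in chunk_advs])
```

Because the advantage is constant within a rollout, nothing in the standard objective distinguishes chunks that matter from chunks that do not. That is the gap PCM targets.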

Algorithm

Success-Failure Action Variance: Compute per-phase action variance between successful and failed rollouts
- Proxy for per-phase gradient variance
- Identifies where learning signal is concentrated
Probabilistic Chunk Selection: Sample a fixed chunk budget using online-updated phase-level keep probabilities
- High-variance phases more likely selected
- Maintains exploration via stochastic selection
Selective Backpropagation: Only compute gradients for selected chunks
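
The three steps above can be sketched in plain Python. This is one possible reading, not the paper's implementation. The score used here (squared gap between mean successful and mean failed actions per phase), the equal-width mapping from chunk index to phase, the exploration `floor`, and all function names are assumptions made for illustration.

```python
import random

def phase_scores(rollouts, successes, num_phases):
    """Per-phase squared gap between mean successful and mean failed actions."""
    dim = len(rollouts[0][0])
    sums = {ok: [[0.0] * dim for _ in range(num_phases)] for ok in (True, False)}
    counts = {ok: [0] * num_phases for ok in (True, False)}
    for traj, ok in zip(rollouts, successes):
        for t, action in enumerate(traj):
            p = t * num_phases // len(traj)  # assumption: equal-width phases
            counts[ok][p] += 1
            for d, a in enumerate(action):
                sums[ok][p][d] += a
    scores = []
    for p in range(num_phases):
        if counts[True][p] == 0 or counts[False][p] == 0:
            scores.append(0.0)  # no contrast available for this phase
            continue
        gap = sum((sums[True][p][d] / counts[True][p]
                   - sums[False][p][d] / counts[False][p]) ** 2
                  for d in range(dim))
        scores.append(gap / dim)
    return scores

def keep_probabilities(scores, floor=0.05):
    """Normalize scores; the floor keeps every phase selectable."""
    n, total = len(scores), sum(scores)
    if total == 0:
        return [1.0 / n] * n
    return [floor / n + (1.0 - floor) * s / total for s in scores]

def sample_chunks(rollouts, probs, budget, rng):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys)."""
    keyed = []
    for i, traj in enumerate(rollouts):
        for t in range(len(traj)):
            w = max(probs[t * len(probs) // len(traj)], 1e-12)
            keyed.append((rng.random() ** (1.0 / w), i, t))
    keyed.sort(reverse=True)
    return {(i, t) for _, i, t in keyed[:budget]}

# Hypothetical data: 4 rollouts, 6 one-dimensional chunks each.
# Successes and failures differ only in chunks 2-3 (the middle phase).
good = [[0.1], [0.1], [0.9], [0.9], [0.1], [0.1]]
bad = [[0.1], [0.1], [0.2], [0.2], [0.1], [0.1]]
rollouts = [good, bad, good, bad]
successes = [True, False, True, False]

scores = phase_scores(rollouts, successes, num_phases=3)
probs = keep_probabilities(scores)
selected = sample_chunks(rollouts, probs, budget=4, rng=random.Random(0))
print(scores, probs, sorted(selected))
```

In a real training loop, `selected` would decide which chunks are run with gradients enabled; the rest would be skipped or run without gradient tracking. The budget of 4 out of 24 chunks is chosen only to stay under the <20% figure quoted in this document.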

Results

Matches standard GRPO success rate on 3 LIBERO benchmarks
2.38x wall-clock speedup
4.8x faster gradient updates
60% lower peak activation memory
<20% of trajectory chunks backpropagated
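
As a rough sanity check on how these numbers relate, an Amdahl's-law style estimate can combine the ~78% gradient share from Key Insight with the 4.8x gradient-update speedup. This is back-of-the-envelope arithmetic on approximate figures, not a calculation from the paper, and `amdahl_speedup` is a helper named here for illustration.

```python
def amdahl_speedup(accelerated_fraction, local_speedup):
    """End-to-end speedup when only one fraction of the runtime gets faster."""
    unchanged = 1.0 - accelerated_fraction
    return 1.0 / (unchanged + accelerated_fraction / local_speedup)

grad_fraction = 0.78  # approximate share of wall-clock spent on gradients
grad_speedup = 4.8    # reported speedup of gradient updates

estimate = amdahl_speedup(grad_fraction, grad_speedup)
print(f"estimated end-to-end speedup: {estimate:.2f}x")
```

The estimate comes out around 2.6x, in the same range as the reported 2.38x. The inputs are rounded and the estimate ignores any cost of scoring and sampling chunks, so an exact match should not be expected.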

When to Use

GRPO-based VLA (vision-language-action) policy training
Any RL training where gradient computation dominates wall-clock time
Scenarios needing efficient RL post-training without reward models

Implementation Considerations

Drop-in modification — no architectural changes needed
Requires no reward model or learned critic
Online probability updates adapt to training dynamics
Works with any chunked trajectory format
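
The source does not spell out the online update rule here. As a hedged illustration of what "online probability updates" could look like, the sketch below smooths per-phase scores with an exponential moving average and re-derives keep probabilities after each iteration. The EMA rule, `momentum`, `floor`, and both function names are assumptions.

```python
def ema_update(prev_scores, new_scores, momentum=0.9):
    """Blend the latest per-phase scores into a running average."""
    return [momentum * p + (1.0 - momentum) * s
            for p, s in zip(prev_scores, new_scores)]

def to_keep_probs(scores, floor=0.05):
    """Normalize smoothed scores; the floor keeps every phase selectable."""
    n, total = len(scores), sum(scores)
    if total == 0:
        return [1.0 / n] * n
    return [floor / n + (1.0 - floor) * s / total for s in scores]

# Hypothetical scenario: early in training the signal sits in phase 0,
# then the policy masters phase 0 and the signal moves to phase 2.
smoothed = [1.0, 0.0, 0.0]
after_one = ema_update(smoothed, [0.0, 0.0, 1.0])
probs_after_one = to_keep_probs(after_one)

for _ in range(30):
    smoothed = ema_update(smoothed, [0.0, 0.0, 1.0])
probs_after_many = to_keep_probs(smoothed)

print(probs_after_one)
print(probs_after_many)
```

A higher `momentum` makes the keep probabilities more stable across noisy batches but slower to follow the policy as it improves. The right trade-off would need to be tuned per task.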

Activation

probabilistic chunk masking, efficient GRPO, VLA RL, gradient variance, action variance