readiness-driven-pipeline-runtime


Readiness-First Pipeline (RRFP) methodology — treating pipeline schedules as non-binding hint orders rather than pre-committed execution sequences. Reduces bubbles and stage misalignment in distributed training under runtime variability. Up to 1.77x speedup on language-only, 2.77x on multimodal workloads. Activation: readiness-driven pipeline, RRFP, pipeline parallel runtime, schedule flexibility, distributed training variability, 1F1B optimization

By hiyenwong · Updated 6/4/2026

name: readiness-driven-pipeline-runtime
description: "Readiness-First Pipeline (RRFP) methodology: treating pipeline schedules as non-binding hint orders rather than pre-committed execution sequences. Reduces bubbles and stage misalignment in distributed training under runtime variability. Up to 1.77x speedup on language-only, 2.77x on multimodal workloads. Activation: readiness-driven pipeline, RRFP, pipeline parallel runtime, schedule flexibility, distributed training variability, 1F1B optimization"

Readiness-Driven Pipeline Runtime (RRFP)

Source

arXiv:2605.18750 — "A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability"

  • Authors: Ruitao Liu, Xinyang Tian, Shuo Chen, Tingrui Zhang, Guang Yang, Alan Zhao, Wei Xu
  • Tsinghua University & Scitix AI
  • Published: 18 May 2026

Core Problem

Pipeline parallelism for large-model training suffers from runtime variability — computation and communication jitter causes stages to wait for not-yet-ready work, creating idle bubbles and reduced utilization. Existing systems treat schedules as pre-committed execution orders that stages must follow strictly, even when the schedule no longer matches realized readiness.

Key Innovation: Schedule-as-Hint

RRFP fundamentally changes how pipeline schedules are consumed:

| Traditional Approach | RRFP Approach |
| --- | --- |
| Schedule = fixed execution order | Schedule = non-binding hint order |
| Stage waits for scheduled task | Stage skips unavailable tasks, dispatches ready work |
| Synchronized barriers at every step | Message-driven asynchronous communication |
| Static bubble elimination | Dynamic bubble reduction via ready-set arbitration |
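
The contrast in the comparison above can be sketched in a few lines. This is a minimal illustration, assuming a schedule is a list of task labels and readiness is a set of labels; the function names `next_task_strict` and `next_task_hint` are hypothetical and do not come from the paper.

```python
def next_task_strict(schedule, position, ready):
    """Fixed-order consumption: only the task at `position` may run."""
    task = schedule[position]
    return task if task in ready else None  # None means the stage idles


def next_task_hint(schedule, done, ready):
    """Hint-order consumption: run the earliest scheduled task that is ready."""
    for task in schedule:
        if task not in done and task in ready:
            return task
    return None


schedule = ["F0", "F1", "B0", "F2", "B1"]
done = {"F0", "F1"}
ready = {"F2"}  # B0's gradient has not arrived yet

print(next_task_strict(schedule, 2, ready))   # None: the stage waits for B0
print(next_task_hint(schedule, done, ready))  # F2: the stage skips B0 and keeps working
```

In both cases the schedule is the same list; only the way it is consumed changes.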

Architecture Components

1. Hint-Order Scheduler

  • A conventional schedule (e.g., 1F1B, BFW) serves as a priority ranking over ready work
  • Not a mandatory sequence — stages dispatch whatever is ready, ranked by hint priority
  • BFW (Backward-Forward with Warmup) hint: orders tasks to minimize bubble propagation
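
One way to turn an existing schedule into a hint is to record each task's position in it and use that position as a rank. The sketch below assumes the usual 1F1B shape (warmup forwards, alternating forward/backward, cooldown backwards) and represents tasks as `(kind, microbatch)` tuples; `one_f_one_b_order` and `hint_priorities` are illustrative names, not part of RRFP's API.

```python
def one_f_one_b_order(stage, num_stages, num_microbatches):
    """Return the usual 1F1B task order for one stage as (kind, microbatch) pairs."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    order = [("F", m) for m in range(warmup)]
    for i in range(steady):
        order.append(("F", warmup + i))
        order.append(("B", i))
    order += [("B", steady + i) for i in range(warmup)]
    return order


def hint_priorities(order):
    """Map each task to its rank in the hint; a lower rank is preferred."""
    return {task: rank for rank, task in enumerate(order)}


order = one_f_one_b_order(stage=0, num_stages=4, num_microbatches=4)
priority = hint_priorities(order)

ready = {("F", 3), ("B", 0)}
print(min(ready, key=priority.__getitem__))  # ('F', 3)
```

The rank is only consulted among tasks that are already ready, so a late task never blocks an earlier-ranked one from being skipped.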

2. Message-Driven Asynchronous Communication

  • Eliminates synchronous barriers that block execution
  • Tasks are dispatched via message-passing rather than clock synchronization
  • Reduces coordination overhead in heterogeneous environments
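
The message-driven idea can be illustrated with the standard library alone. In this sketch a `queue.Queue` stands in for the inter-stage transport and a thread stands in for a pipeline stage; a real runtime would use GPU-aware point-to-point communication instead, and `stage_worker` is a hypothetical name.

```python
import queue
import threading


def stage_worker(inbox, executed):
    """Run tasks in arrival order; block only when the inbox is empty."""
    while True:
        message = inbox.get()   # wakes as soon as any message arrives
        if message is None:     # sentinel: no more work for this stage
            break
        executed.append(message)


inbox = queue.Queue()
executed = []
worker = threading.Thread(target=stage_worker, args=(inbox, executed))
worker.start()

# Neighbouring stages post messages whenever their outputs are ready.
# B0 shows up before F1 here, and the worker simply runs it first.
for message in [("F", 0), ("B", 0), ("F", 1)]:
    inbox.put(message)
inbox.put(None)
worker.join()

print(executed)  # [('F', 0), ('B', 0), ('F', 1)]
```

There is no global step counter or barrier: the only thing that wakes the stage is the arrival of work.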

3. Lightweight Tensor-Parallel Coordination

  • Maintains collective consistency across tensor-parallel groups
  • Ensures out-of-order execution within a stage doesn't break collective operations
  • Minimal overhead — only coordinates when collectives are involved
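
This document does not describe how RRFP keeps tensor-parallel ranks consistent, so the following is only one possible approach, shown to make the constraint concrete: every rank in a group must start the same task, otherwise their collectives will not match up. The sketch assumes each rank reports its ready set and the group picks the best-ranked task that is ready everywhere; `agree_on_next_task` is a hypothetical helper.

```python
def agree_on_next_task(ready_sets, priority):
    """Pick one task that every tensor-parallel rank can run now.

    ready_sets: one set of ready tasks per rank in the group.
    priority:   dict mapping task -> hint rank (lower runs first).
    """
    common = set.intersection(*ready_sets)
    if not common:
        return None  # no task is ready everywhere; nobody starts a collective
    return min(common, key=priority.__getitem__)


priority = {"F2": 0, "B0": 1, "F3": 2}
ready_sets = [
    {"F2", "B0"},   # rank 0 has already received F2's input
    {"B0", "F3"},   # rank 1 has not
]
print(agree_on_next_task(ready_sets, priority))  # B0
```

Rank 0 would prefer F2 on its own, but running it alone would leave rank 1 out of the collective, so the group settles on B0.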

4. Ready-Set Arbitration

  • Low-overhead dispatch mechanism that selects from currently-ready tasks
  • Uses hint order as tie-breaking priority when multiple tasks are ready
  • Guarantees deadlock-free execution through buffer-size policies
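
A minimal sketch of ready-set arbitration with a buffer bound follows. The text above says only that deadlock freedom comes from "buffer-size policies"; the specific policy used here, a cap on forwards whose backward has not yet run, is an assumption chosen for illustration, and `ReadySetArbiter` is a hypothetical class.

```python
import heapq


class ReadySetArbiter:
    """Pops the best-ranked ready task while capping outstanding forwards."""

    def __init__(self, priority, max_in_flight):
        self.priority = priority          # task -> hint rank, lower is better
        self.max_in_flight = max_in_flight
        self.in_flight = 0                # forwards whose backward has not run
        self.heap = []

    def mark_ready(self, task):
        heapq.heappush(self.heap, (self.priority[task], task))

    def pick(self):
        """Return the next task to run, or None if nothing is admissible."""
        skipped = []
        chosen = None
        while self.heap:
            rank, task = heapq.heappop(self.heap)
            if task[0] == "F" and self.in_flight >= self.max_in_flight:
                skipped.append((rank, task))   # buffer full: defer this forward
                continue
            chosen = task
            break
        for item in skipped:
            heapq.heappush(self.heap, item)
        if chosen is not None:
            self.in_flight += 1 if chosen[0] == "F" else -1
        return chosen


priority = {("F", 0): 0, ("F", 1): 1, ("F", 2): 2, ("B", 0): 3}
arbiter = ReadySetArbiter(priority, max_in_flight=2)
for task in [("F", 0), ("F", 1), ("F", 2), ("B", 0)]:
    arbiter.mark_ready(task)

picks = [arbiter.pick() for _ in range(4)]
print(picks)  # [('F', 0), ('F', 1), ('B', 0), ('F', 2)]
```

The hint ranks F2 ahead of B0, but with two activations already buffered the arbiter defers F2 and runs B0, which frees a slot. Without some bound of this kind a stage could keep accepting forwards until no memory is left for the backward that would release them.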

Implementation Pattern

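```python
# Conceptual RRFP dispatch loop (per pipeline stage).
# Assumed interfaces (not specified by the source): `hint_schedule.priority(task)`
# returns a sortable rank where lower means earlier, and each task exposes
# `all_dependencies_satisfied()` and `run()`.
class RRFPStage:
    def __init__(self, stage_id, hint_schedule):
        self.stage_id = stage_id
        self.hint = hint_schedule  # Non-binding priority order
        self.ready_set = set()     # Currently ready tasks
        self.pending = set()       # Tasks waiting for dependencies

    def submit(self, task):
        """Register a task whose dependencies may not have arrived yet."""
        self.pending.add(task)

    def dispatch(self):
        """Select and execute the highest-priority ready task, if any."""
        # 1. Update ready set from message arrivals
        self._update_ready_set()

        # 2. Nothing is ready: return instead of blocking on the hinted task
        if not self.ready_set:
            return None

        # 3. Rank ready tasks by hint priority (not execution order) and run the best
        task = min(self.ready_set, key=self.hint.priority)
        self.ready_set.remove(task)
        self._execute(task)
        return task

    def _update_ready_set(self):
        """Message-driven: tasks become ready when dependencies arrive."""
        newly_ready = {t for t in self.pending if t.all_dependencies_satisfied()}
        # Move tasks out of pending so each one is dispatched at most once
        self.pending -= newly_ready
        self.ready_set |= newly_ready

    def _execute(self, task):
        """Placeholder for the stage's forward or backward computation."""
        task.run()
```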
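
To see why skipping unavailable work helps, the following toy simulation compares strict-order and readiness-first consumption of the same schedule on a single stage. It is a sketch under simplifying assumptions: each task has a fixed input arrival time and run time, dependencies between forward and backward are folded into the arrival times, and the numbers are made up for illustration rather than taken from the paper.

```python
def simulate(schedule, arrival, duration, strict):
    """Return the finish time of one stage that runs every task in `schedule`.

    arrival[task] is when the task's inputs land; duration[task] is its run time.
    """
    clock = 0.0
    remaining = list(schedule)
    while remaining:
        if strict:
            task = remaining[0]                  # must run the next scheduled task
        else:
            ready = [t for t in remaining if arrival[t] <= clock]
            # Nothing ready: jump to whichever input arrives first
            task = ready[0] if ready else min(remaining, key=arrival.__getitem__)
        clock = max(clock, arrival[task]) + duration[task]
        remaining.remove(task)
    return clock


schedule = ["F0", "B0", "F1", "B1"]
arrival = {"F0": 0.0, "B0": 5.0, "F1": 1.0, "B1": 6.0}   # B0 is delayed by jitter
duration = dict.fromkeys(schedule, 1.0)

print(simulate(schedule, arrival, duration, strict=True))   # 8.0
print(simulate(schedule, arrival, duration, strict=False))  # 7.0
```

In the strict run the stage idles from time 1 to 5 waiting for B0 and only then starts F1; in the readiness-first run F1 fills part of that gap.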

Results

| Workload | GPUs | Speedup over 1F1B |
| --- | --- | --- |
| Language-only | up to 128 | up to 1.77x |
| Multimodal | up to 128 | up to 2.77x |
| Cross-framework comparison | | up to 1.84x over fastest external system |

When to Use

  • Large-model training where pipeline parallelism is used (LLMs, multimodal)
  • Runtime variability in compute/communication (heterogeneous GPUs, network jitter)
  • Multimodal workloads with input-dependent execution times
  • Multi-GPU training (8-128+ GPUs) where bubble elimination matters

Design Principles for Distributed Systems

  1. Schedule-as-Hint: Never treat schedules as mandatory sequences in variable environments
  2. Message-Driven over Clock-Driven: Asynchronous message passing beats synchronized barriers under variability
  3. Ready-Set Arbitration: Dispatch what's ready, not what's scheduled
  4. Lightweight Coordination: Maintain correctness with minimal synchronization overhead
  5. Buffer-Size Deadlock Prevention: Guarantee progress through bounded buffer policies

Pitfalls

  • Deadlock risk: Out-of-order execution can deadlock without proper buffer-size guarantees
  • Collective consistency: Tensor-parallel groups require coordination even in async execution
  • Training correctness: Must verify gradient equivalence with fixed-order baselines
  • Hint quality: Better hint schedules (BFW > BF > 1F1B) yield better speedups
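
The training-correctness pitfall can be checked mechanically. The sketch below accumulates per-microbatch gradients in a fixed order and in a hypothetical readiness-driven order, then compares them with a tolerance, since floating-point addition is not associative and bitwise equality is too strict. The gradients are random stand-ins and the helper names are illustrative; in a real run the two vectors would come from the fixed-order baseline and the RRFP run.

```python
import math
import random


def accumulate(grads_by_microbatch, order):
    """Sum per-microbatch gradient vectors in the given execution order."""
    total = [0.0] * len(next(iter(grads_by_microbatch.values())))
    for m in order:
        total = [acc + g for acc, g in zip(total, grads_by_microbatch[m])]
    return total


def grads_match(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Element-wise comparison that tolerates floating-point reordering error."""
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b))


rng = random.Random(0)
grads = {m: [rng.uniform(-1, 1) for _ in range(4)] for m in range(8)}

fixed_order = list(range(8))                 # baseline: microbatches in schedule order
readiness_order = [2, 0, 1, 5, 3, 4, 7, 6]   # hypothetical out-of-order completion

baseline = accumulate(grads, fixed_order)
reordered = accumulate(grads, readiness_order)
print(grads_match(baseline, reordered))
```

A mismatch beyond the tolerance would indicate a missing or duplicated microbatch, not just reordering noise.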
Install via CLI
npx skills add https://github.com/hiyenwong/ai_collection --skill readiness-driven-pipeline-runtime