readiness-driven-pipeline-runtime - SKILL.md Agent Skill

name: readiness-driven-pipeline-runtime
description: "Readiness-First Pipeline (RRFP) methodology: treating pipeline schedules as non-binding hint orders rather than pre-committed execution sequences. Reduces bubbles and stage misalignment in distributed training under runtime variability. Up to 1.77x speedup on language-only, 2.77x on multimodal workloads. Activation: readiness-driven pipeline, RRFP, pipeline parallel runtime, schedule flexibility, distributed training variability, 1F1B optimization"

Readiness-Driven Pipeline Runtime (RRFP)

Source

arXiv:2605.18750 — "A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability"

Authors: Ruitao Liu, Xinyang Tian, Shuo Chen, Tingrui Zhang, Guang Yang, Alan Zhao, Wei Xu
Tsinghua University & Scitix AI
Published: 18 May 2026

Core Problem

Pipeline parallelism for large-model training suffers from runtime variability: jitter in computation and communication forces stages to wait for work that is not yet ready, which creates idle bubbles and lowers utilization. Existing systems treat schedules as pre-committed execution orders that stages must follow strictly, even when the schedule no longer matches realized readiness.

Key Innovation: Schedule-as-Hint

RRFP fundamentally changes how pipeline schedules are consumed:

Traditional Approach	RRFP Approach
Schedule = fixed execution order	Schedule = non-binding hint order
Stage waits for scheduled task	Stage skips unavailable tasks, dispatches ready work
Synchronized barriers at every step	Message-driven asynchronous communication
Static bubble elimination	Dynamic bubble reduction via ready-set arbitration
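
The contrast in the table can be illustrated with a toy single-stage simulation. This is a minimal sketch, not the paper's runtime: the task names, the `ready_at` arrival times, and the unit durations are invented for illustration, and dependencies between tasks are folded into the arrival times.

```python
def run_fixed(schedule, ready_at, duration):
    """Execute tasks strictly in schedule order, idling until each is ready."""
    clock = 0.0
    for task in schedule:
        clock = max(clock, ready_at[task]) + duration[task]
    return clock


def run_hinted(schedule, ready_at, duration):
    """Execute whichever ready task ranks first in the schedule (hint order)."""
    rank = {task: i for i, task in enumerate(schedule)}
    remaining = set(schedule)
    clock = 0.0
    while remaining:
        ready = [t for t in remaining if ready_at[t] <= clock]
        if not ready:
            # Nothing is ready yet: idle until the next arrival.
            clock = min(ready_at[t] for t in remaining)
            continue
        task = min(ready, key=rank.__getitem__)
        clock += duration[task]
        remaining.remove(task)
    return clock


# Hypothetical arrivals: F1 is delayed by jitter, B0 and F2 are ready early.
schedule = ["F0", "F1", "B0", "F2"]
ready_at = {"F0": 0.0, "F1": 3.0, "B0": 1.0, "F2": 1.0}
duration = {task: 1.0 for task in schedule}

fixed_time = run_fixed(schedule, ready_at, duration)    # 6.0
hinted_time = run_hinted(schedule, ready_at, duration)  # 4.0
```

In this toy case the fixed order idles for two time units waiting for F1, while the hinted order fills that gap with B0 and F2 and then runs F1 as soon as it arrives.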

Architecture Components

1. Hint-Order Scheduler

A pre-committed schedule (e.g., 1F1B, BFW) serves as a priority ranking for ready work
The schedule is not a mandatory sequence: stages dispatch whatever is ready, ranked by hint priority
BFW (Backward-Forward with Warmup) hint: orders tasks to minimize bubble propagation
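
One way to turn a schedule into a priority ranking is to record each task's position in the schedule and use that position as its rank. The sketch below assumes tasks are `(kind, microbatch)` tuples and uses the standard non-interleaved 1F1B order as the hint; `one_f_one_b_order` and `HintOrder` are hypothetical names, not identifiers from the paper.

```python
def one_f_one_b_order(stage_id, num_stages, num_microbatches):
    """Per-stage 1F1B order: warmup forwards, 1F1B steady state, cooldown backwards."""
    warmup = min(num_stages - stage_id - 1, num_microbatches)
    order = [("F", m) for m in range(warmup)]
    next_f, next_b = warmup, 0
    for _ in range(num_microbatches - warmup):
        order.append(("F", next_f))
        next_f += 1
        order.append(("B", next_b))
        next_b += 1
    while next_b < num_microbatches:
        order.append(("B", next_b))
        next_b += 1
    return order


class HintOrder:
    """Wraps a schedule so that position in the schedule acts as a priority."""

    def __init__(self, order):
        self._rank = {task: i for i, task in enumerate(order)}

    def priority(self, task):
        # Lower value means the hint would have run this task earlier.
        return self._rank[task]


hint = HintOrder(one_f_one_b_order(stage_id=3, num_stages=4, num_microbatches=4))
ready = {("F", 3), ("B", 0)}
best = min(ready, key=hint.priority)  # ("B", 0)
```

Because the hint only ranks tasks, a task that is missing from the ready set is simply not considered; the stage never blocks on it.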

2. Message-Driven Asynchronous Communication

Eliminates synchronous barriers that block execution
Tasks are dispatched via message-passing rather than clock synchronization
Reduces coordination overhead in heterogeneous environments
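
The message-driven idea can be sketched with an in-process queue standing in for the real inter-stage transport. This is an assumption-laden illustration: the actual runtime's communication layer is not described here, and `upstream`, `stage_loop`, and the `None` sentinel are hypothetical.

```python
import queue
import threading


def upstream(outbox, arrivals):
    """Hypothetical upstream stage: sends task messages in arrival order."""
    for task in arrivals:
        outbox.put(task)
    outbox.put(None)  # Sentinel: no more messages.


def stage_loop(inbox):
    """Runs each task when its message arrives, with no barrier between steps."""
    executed = []
    while True:
        task = inbox.get()  # Blocks only until the next message, not on a clock.
        if task is None:
            break
        executed.append(task)
    return executed


inbox = queue.Queue()
# F1 arrives before F0, for example because of network jitter.
sender = threading.Thread(target=upstream, args=(inbox, ["F1", "F0", "F2"]))
sender.start()
executed = stage_loop(inbox)
sender.join()
```

The stage processes F1 first because its message arrived first; a barrier-based loop would have stalled until F0 was available.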

3. Lightweight Tensor-Parallel Coordination

Maintains collective consistency across tensor-parallel groups
Ensures out-of-order execution within a stage doesn't break collective operations
Adds minimal overhead: coordination happens only when collectives are involved
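
The source does not spell out the coordination mechanism, so the following is only one possible approach: ranks in a tensor-parallel group run a task only if it is ready on every rank, and all of them pick the same task by hint priority. The function name and the way ready sets are gathered are assumptions; a real system would exchange this information over the network.

```python
def agree_on_next_task(ready_sets, priority):
    """Pick the best-ranked task that every tensor-parallel rank has ready."""
    if not ready_sets:
        return None
    common = set(ready_sets[0]).intersection(*ready_sets[1:])
    if not common:
        return None  # No task is ready everywhere; ranks keep waiting.
    return min(common, key=priority)


# Hypothetical hint ranks and per-rank ready sets for a 3-way TP group.
hint_rank = {"B0": 0, "F2": 1, "F3": 2}
ready_per_rank = [{"F2", "B0"}, {"F2"}, {"F2", "B0", "F3"}]
chosen = agree_on_next_task(ready_per_rank, hint_rank.__getitem__)  # "F2"
```

B0 has the better hint rank but is not ready on the second rank, so the group agrees on F2 and every rank issues its collectives in the same order.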

4. Ready-Set Arbitration

Low-overhead dispatch mechanism that selects from currently-ready tasks
Uses hint order as tie-breaking priority when multiple tasks are ready
Guarantees deadlock-free execution through buffer-size policies
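
A minimal sketch of ready-set arbitration with a buffer cap follows. The specific policy is an assumption: forwards are admitted only while fewer than `max_in_flight` activations are held, and backwards are always admitted so that buffers can drain. The source states only that buffer-size policies guarantee deadlock freedom, and `ReadySetArbiter` is a hypothetical name.

```python
import heapq


class ReadySetArbiter:
    """Selects the best-ranked ready task, subject to an activation-buffer cap."""

    def __init__(self, priority, max_in_flight):
        self._priority = priority
        self._max_in_flight = max_in_flight
        self._heap = []      # Entries are (priority, task); task = (kind, microbatch).
        self._in_flight = 0  # Forwards whose backward has not been dispatched yet.

    def mark_ready(self, task):
        heapq.heappush(self._heap, (self._priority(task), task))

    def next_task(self):
        deferred, chosen = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            kind = entry[1][0]
            if kind == "F" and self._in_flight >= self._max_in_flight:
                deferred.append(entry)  # Buffer full: skip this forward for now.
                continue
            chosen = entry[1]
            break
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        if chosen is None:
            return None
        if chosen[0] == "F":
            self._in_flight += 1
        elif chosen[0] == "B":
            self._in_flight -= 1
        return chosen


order = [("F", 0), ("F", 1), ("F", 2), ("B", 0)]
arbiter = ReadySetArbiter(priority=order.index, max_in_flight=2)
for task in order[:3]:
    arbiter.mark_ready(task)
picked = [arbiter.next_task() for _ in range(3)]  # Third pick is blocked by the cap.
arbiter.mark_ready(("B", 0))
picked += [arbiter.next_task(), arbiter.next_task()]
```

With a cap of two, the third forward is deferred until a backward frees a buffer slot, after which it is dispatched.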

Implementation Pattern

# Conceptual RRFP dispatch loop (per pipeline stage)
class RRFPStage:
    def __init__(self, stage_id, hint_schedule):
        self.stage_id = stage_id
        self.hint = hint_schedule  # Non-binding priority order
        self.ready_set = set()     # Currently ready tasks
        self.pending = set()       # Tasks waiting for dependencies

    def dispatch(self):
        """Select and execute the highest-priority ready task, if any."""
        # 1. Update ready set from message arrivals
        self._update_ready_set()
        if not self.ready_set:
            return None  # Nothing is ready; the caller waits for more messages

        # 2. Pick the ready task the hint ranks first (lowest priority value)
        task = min(self.ready_set, key=self.hint.priority)

        # 3. Remove the task before running it so it cannot be dispatched twice
        self.ready_set.remove(task)
        self._execute(task)
        return task

    def _update_ready_set(self):
        """Message-driven: tasks become ready when dependencies arrive."""
        newly_ready = {t for t in self.pending if t.all_dependencies_satisfied()}
        self.pending -= newly_ready    # Avoid re-adding already-dispatched tasks
        self.ready_set |= newly_ready

    def _execute(self, task):
        """Run the forward or backward computation for this task."""
        raise NotImplementedError  # Supplied by the training framework

Results

Workload	GPUs	Speedup over 1F1B
Language-only	up to 128	up to 1.77x
Multimodal	up to 128	up to 2.77x
Cross-framework comparison	—	up to 1.84x over fastest external system

When to Use

Large-model training where pipeline parallelism is used (LLMs, multimodal)
Runtime variability in compute/communication (heterogeneous GPUs, network jitter)
Multimodal workloads with input-dependent execution times
Multi-GPU training (8-128+ GPUs) where bubble elimination matters

Design Principles for Distributed Systems

Schedule-as-Hint: Never treat schedules as mandatory sequences in variable environments
Message-Driven over Clock-Driven: Asynchronous message passing beats synchronized barriers under variability
Ready-Set Arbitration: Dispatch what's ready, not what's scheduled
Lightweight Coordination: Maintain correctness with minimal synchronization overhead
Buffer-Size Deadlock Prevention: Guarantee progress through bounded buffer policies

Pitfalls

Deadlock risk: Out-of-order execution can deadlock without proper buffer-size guarantees
Collective consistency: Tensor-parallel groups require coordination even in async execution
Training correctness: Must verify gradient equivalence with fixed-order baselines
Hint quality: Better hint schedules (BFW > BF > 1F1B) yield better speedups
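
The training-correctness pitfall can be checked with a tolerance-based comparison, since floating-point addition is not associative and reordered accumulation may differ in the last bits. The sketch below uses plain Python lists and made-up gradient values; `accumulate` and `gradients_match` are hypothetical helpers, and the tolerances are placeholders to tune per dtype.

```python
import math


def accumulate(grads, order):
    """Sum per-microbatch gradients in the given microbatch order."""
    total = [0.0] * len(grads[0])
    for m in order:
        for i, g in enumerate(grads[m]):
            total[i] += g
    return total


def gradients_match(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two gradient vectors element-wise within a tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Made-up gradients for three microbatches, two parameters each.
grads = [[0.1, -2.0], [0.2, 0.5], [0.3, 1.5]]
baseline = accumulate(grads, [0, 1, 2])   # Fixed-order baseline
reordered = accumulate(grads, [2, 0, 1])  # Order a readiness-driven run might take
equivalent = gradients_match(baseline, reordered)
```

In practice the same comparison would be run on the real parameter gradients after one step of the fixed-order baseline and one step of the readiness-driven run on identical inputs.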