name: distributed-training
description: Use when setting up multi-GPU or multi-node training, configuring torchrun or deepspeed launchers, choosing parallelism strategy (FSDP, DeepSpeed ZeRO, tensor parallel, pipeline parallel), tuning NCCL, or enabling high-performance networking with SkyPilot

Distributed Training Patterns

Overview

Scale training beyond a single GPU using data parallelism, model parallelism, or a combination. SkyPilot automates multi-node cluster provisioning, environment variable injection, and high-performance networking setup.

Core principle: Start simple (single GPU with gradient accumulation), then scale up only when needed. Each parallelism strategy adds complexity; use the minimum that fits your model and data.

When to Use

Model does not fit in a single GPU's memory
Training is too slow on a single GPU and you need to parallelize
Need to use multiple nodes (multi-machine training)
Configuring DeepSpeed ZeRO stages for memory optimization
Setting up FSDP for PyTorch-native sharding

Do not use for:

Models that fit comfortably on one GPU (use gradient accumulation instead)
Inference-only workloads (use vLLM tensor parallelism instead)
Small fine-tuning jobs (QLoRA on single GPU is simpler)

Decision Guide: Which Parallelism?

Scenario	Strategy	Why
Model fits on 1 GPU, need speed	Data Parallel (DDP)	Simplest, near-linear scaling
Weights fit on 1 GPU, optimizer states do not	DeepSpeed ZeRO-1/2 or FSDP	Shard optimizer states (and gradients) across GPUs
Model does NOT fit on 1 GPU	DeepSpeed ZeRO-3 or FSDP (full shard)	Shard parameters across GPUs
70B+ model, single node	ZeRO-3 + CPU offload	Trade compute for memory
70B+ model, multi-node	FSDP2 + `network_tier: best`	Best PyTorch-native scaling
100B+ model (Megatron-LM)	TP + PP + DP	Maximum control for massive models
MoE model	Expert Parallel (EP)	Shard experts across GPUs
Very long sequences	Context Parallel (CP)	Split sequence across GPUs

Flowchart:

Does model fit on 1 GPU?
  |
  +-- YES: Use DDP or gradient accumulation
  |
  +-- NO: Does model fit with ZeRO-2/FSDP (shard optimizer + gradients)?
       |
       +-- YES: Use ZeRO-2 or FSDP
       |
       +-- NO: Use ZeRO-3 or FSDP full shard
            |
            +-- Still OOM? Add CPU offload or use TP+PP
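
The flowchart can be sketched as a small function. This is a rough illustration rather than a sizing tool: the function name, the `headroom` fraction (memory reserved for model state versus activations), and the example sizes are assumptions, not part of any library.

```python
def choose_strategy(weights_gb, grads_gb, optim_gb, gpu_gb, n_gpus, headroom=0.7):
    """Return the first flowchart branch whose per-GPU model state fits.

    headroom is the share of GPU memory budgeted for model state; the rest is
    left for activations and buffers (an illustrative assumption).
    """
    budget = gpu_gb * headroom
    total = weights_gb + grads_gb + optim_gb
    if total <= budget:
        return "DDP or gradient accumulation"
    # ZeRO-2 style: weights replicated, gradients and optimizer states sharded
    if weights_gb + (grads_gb + optim_gb) / n_gpus <= budget:
        return "ZeRO-2 or FSDP"
    # ZeRO-3 style: everything sharded
    if total / n_gpus <= budget:
        return "ZeRO-3 or FSDP full shard"
    return "CPU offload or TP+PP"


# Example: a 7B-parameter model in bf16 with Adam (14 GB weights, 14 GB
# gradients, 84 GB optimizer state) on 8 GPUs with 80 GB each.
print(choose_strategy(14, 14, 84, gpu_gb=80, n_gpus=8))
```

Activation memory depends heavily on batch size and sequence length, so treat the result as a starting point and confirm with a real run.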

SkyPilot Multi-Node Setup

SkyPilot provides environment variables for distributed training automatically. No manual hostfile or IP discovery needed.

num_nodes: 4
resources:
  accelerators: H100:8
  network_tier: best    # Enables InfiniBand/EFA

run: |
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)

  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=29500 \
    train.py

SkyPilot environment variables:

Variable	Value	Description
`SKYPILOT_NODE_IPS`	Newline-separated IPs	All node IPs (head node first)
`SKYPILOT_NUM_NODES`	Integer	Total number of nodes
`SKYPILOT_NODE_RANK`	0, 1, 2, ...	This node's rank (0 = head)
`SKYPILOT_NUM_GPUS_PER_NODE`	Integer	GPUs on this node
`SKYPILOT_TASK_ID`	String	Stable across preemptions
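
If a launcher script is written in Python rather than shell, the same variables can be read from the environment. A minimal sketch, assuming the variable formats listed in the table; `cluster_info` and the example values are hypothetical:

```python
import os


def cluster_info(env=None):
    """Parse the SkyPilot variables from the table into a small dict."""
    env = os.environ if env is None else env
    node_ips = env["SKYPILOT_NODE_IPS"].split()  # newline-separated, head first
    num_nodes = int(env["SKYPILOT_NUM_NODES"])
    gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    node_rank = int(env["SKYPILOT_NODE_RANK"])
    if len(node_ips) != num_nodes:
        raise ValueError("SKYPILOT_NODE_IPS does not match SKYPILOT_NUM_NODES")
    return {
        "master_addr": node_ips[0],
        "node_rank": node_rank,
        "is_head": node_rank == 0,
        "world_size": num_nodes * gpus_per_node,
    }


# Example with made-up values for a 2-node cluster with 8 GPUs per node.
example_env = {
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2",
    "SKYPILOT_NUM_NODES": "2",
    "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    "SKYPILOT_NODE_RANK": "1",
}
print(cluster_info(example_env))
```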

torchrun Multi-Node

The standard PyTorch distributed launcher.

num_nodes: 2
resources:
  accelerators: H100:8
  network_tier: best

run: |
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NUM_GPUS=$(( $SKYPILOT_NUM_NODES * $SKYPILOT_NUM_GPUS_PER_NODE ))

  # c10d rendezvous: all nodes meet at the head node's endpoint. The static
  # flags (--master_addr, --master_port, --node_rank) are not used in this
  # mode, and node ranks are assigned during rendezvous.
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py \
      --total_gpus $NUM_GPUS

In your training script:

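import os

import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each worker process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # bind this process to its GPU before NCCL init
    dist.init_process_group("nccl")
    return local_rank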

local_rank = setup_distributed()
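
For logging and checkpointing it often helps to know the global rank as well. torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker; the sketch below reads them without touching any GPU. The `WorkerInfo` class and the single-process fallback defaults are assumptions added for illustration.

```python
import os
from dataclasses import dataclass


@dataclass
class WorkerInfo:
    rank: int        # global rank across all nodes
    local_rank: int  # GPU index on this node
    world_size: int  # total number of worker processes

    @property
    def is_main(self):
        # Convention: only global rank 0 logs and writes checkpoints
        return self.rank == 0


def read_worker_info(env=None):
    """Read torchrun's variables, falling back to a single-process setup."""
    env = os.environ if env is None else env
    return WorkerInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


info = read_worker_info({"RANK": "9", "LOCAL_RANK": "1", "WORLD_SIZE": "16"})
print(info, info.is_main)
```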

DeepSpeed Multi-Node

DeepSpeed has its own launcher that handles multi-node orchestration.

num_nodes: 4
resources:
  accelerators: H100:8
  network_tier: best

run: |
  # Build hostfile from SkyPilot env vars
  echo "$SKYPILOT_NODE_IPS" | awk -v gpus=$SKYPILOT_NUM_GPUS_PER_NODE \
    '{print $1 " slots=" gpus}' > /tmp/hostfile

  # Only head node launches deepspeed
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    deepspeed \
      --hostfile /tmp/hostfile \
      --master_addr $(echo "$SKYPILOT_NODE_IPS" | head -n1) \
      --master_port 29500 \
      train.py \
        --deepspeed ds_config.json
  fi

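Head-only pattern: Only node rank 0 runs the launcher. The launcher then connects to the other nodes over SSH and starts the workers there, so those nodes must not start a launcher of their own. DeepSpeed's default multi-node launcher relies on pdsh and passwordless SSH between nodes. This is how DeepSpeed and some other launchers work.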

if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
  # Head-only commands: launch distributed training, start tensorboard, run eval
  # Start tensorboard in the background first; deepspeed blocks until training ends
  tensorboard --logdir /logs --port 6006 &
  deepspeed --hostfile /tmp/hostfile train.py --deepspeed ds_config.json
fi
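
The hostfile format that the awk one-liner produces is one `<ip> slots=<gpus>` line per node. The same transformation in Python, as a sketch; `build_hostfile` is a hypothetical helper and the IPs are made up:

```python
def build_hostfile(node_ips, gpus_per_node):
    """Turn newline-separated node IPs into DeepSpeed hostfile text."""
    lines = [f"{ip} slots={gpus_per_node}" for ip in node_ips.split()]
    return "\n".join(lines) + "\n"


# Example: two nodes with 8 GPUs each.
print(build_hostfile("10.0.0.1\n10.0.0.2", 8), end="")
```

Writing the result to `/tmp/hostfile` on the head node gives the launcher the same input as the shell version.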

FSDP Configuration

PyTorch-native Fully Sharded Data Parallel. Preferred over DeepSpeed for new projects using PyTorch 2.x.

FSDP1 (PyTorch 2.0-2.4)

import functools

import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Define wrapping policy (wrap each transformer layer)
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},  # your model's block class
)

# Mixed precision
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Wrap model
model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=mp_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3 equivalent
    device_id=local_rank,
    limit_all_gathers=True,  # reduces memory peak
)

FSDP2 (PyTorch 2.5+)

Cleaner API, better composability with other features.

import torch
# In PyTorch 2.6+, these names are also exported from torch.distributed.fsdp
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
)

# Shard each transformer block
for layer in model.layers:
    fully_shard(layer, mp_policy=mp_policy)

# Shard the entire model (root)
fully_shard(model, mp_policy=mp_policy)

FSDP sharding strategies:

Strategy	Memory Savings	Communication	Equivalent
NO_SHARD	None	Lowest	DDP
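SHARD_GRAD_OP	Moderate (gradients and optimizer states sharded)	Medium	ZeRO-2
FULL_SHARD	Highest (parameters, gradients, and optimizer states sharded; grows with GPU count)	Highest	ZeRO-3
HYBRID_SHARD	High (full sharding, but only across the GPUs in one node)	Medium	ZeRO-3 within node, DDP between nodes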
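
To make the HYBRID_SHARD row concrete, the sketch below lists which ranks would shard together and which would hold replicas of the same shard. It assumes ranks are numbered contiguously per node (ranks 0-7 on node 0, and so on); `hybrid_shard_groups` is a hypothetical helper, not a PyTorch API.

```python
def hybrid_shard_groups(num_nodes, gpus_per_node):
    """Return (shard_groups, replicate_groups) as lists of global ranks."""
    # Ranks on the same node shard parameters among themselves (ZeRO-3 style)
    shard_groups = [
        [node * gpus_per_node + gpu for gpu in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    # Ranks with the same local index on different nodes hold the same shard
    # and all-reduce gradients with each other (DDP style)
    replicate_groups = [
        [node * gpus_per_node + gpu for node in range(num_nodes)]
        for gpu in range(gpus_per_node)
    ]
    return shard_groups, replicate_groups


shard, replicate = hybrid_shard_groups(num_nodes=2, gpus_per_node=4)
print("shard groups:", shard)
print("replicate groups:", replicate)
```

The heavy all-gather traffic stays on the fast intra-node links, and only the gradient all-reduce crosses nodes.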

DeepSpeed ZeRO Stages

ZeRO-1 (Shard Optimizer States)

{
  "zero_optimization": {
    "stage": 1
  },
  "bf16": {"enabled": true},
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}

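Memory savings: up to ~4x less model-state memory (weights, gradients, optimizer states) than DDP at large GPU counts, because optimizer states are split across data-parallel ranks. Communication volume is the same as DDP.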

ZeRO-2 (Shard Optimizer + Gradients)

{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "bf16": {"enabled": true},
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}

Memory savings: up to ~8x less model-state memory than DDP at large GPU counts, with the same communication volume as DDP. Good balance of memory savings and speed.

ZeRO-3 (Shard Everything)

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "none"},
    "offload_param": {"device": "none"},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {"enabled": true},
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
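
The per-stage savings can be estimated with the usual mixed-precision Adam accounting: 2 bytes per parameter for bf16 weights, 2 for bf16 gradients, and 12 for fp32 master weights plus the two Adam moments. The sketch below applies that accounting; it covers model state only (no activations or buffers), and `zero_model_state_gb` is a hypothetical helper.

```python
def zero_model_state_gb(num_params, n_gpus, stage, optim_bytes=12):
    """Per-GPU model-state memory in GB for ZeRO stage 0 (DDP) through 3."""
    weights, grads, optim = 2.0, 2.0, float(optim_bytes)  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus    # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus    # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= n_gpus  # ZeRO-3: also shard parameters
    return num_params * (weights + grads + optim) / 1e9


# Example: 7.5B parameters on 64 GPUs.
for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

With few GPUs the ratios are smaller than 4x and 8x, since the replicated parts dominate less as the GPU count grows.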

ZeRO-3 with CPU Offload (Maximum Memory Savings)

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {"enabled": true},
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}

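Warning: CPU offload is significantly slower (roughly 2-5x), because offloaded optimizer states and parameters have to move between CPU and GPU memory during training. Use only when GPU memory is truly insufficient.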
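
The configs above use "auto" placeholders, which integrations such as the Hugging Face Trainer fill in; when calling DeepSpeed directly, concrete integers are typically needed. The sketch below generates a config with explicit batch sizes. `make_zero_config` is a hypothetical helper, and restricting offload to stage 3 simply mirrors the examples in this section.

```python
import json


def make_zero_config(stage, micro_batch, accum_steps, world_size, cpu_offload=False):
    """Build a DeepSpeed config dict with explicit batch-size values."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")
    if cpu_offload and stage != 3:
        raise ValueError("this sketch only offloads with stage 3")
    zero = {"stage": stage}
    if stage >= 2:
        zero.update({"overlap_comm": True, "contiguous_gradients": True})
    if cpu_offload:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "zero_optimization": zero,
        "bf16": {"enabled": True},
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": accum_steps,
        # The three batch settings must agree:
        # train_batch_size = micro_batch * accum_steps * world_size
        "train_batch_size": micro_batch * accum_steps * world_size,
    }


config = make_zero_config(3, micro_batch=2, accum_steps=8, world_size=32, cpu_offload=True)
print(json.dumps(config, indent=2))
```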

For complete DeepSpeed ZeRO configuration reference with all tunable parameters, see references/deepspeed-zero-config.md.

High-Performance Networking

Multi-node training is often bottlenecked by inter-node communication, because gradients or parameter shards cross the network on every step. Use SkyPilot's network_tier: best to enable high-bandwidth interconnects.

resources:
  accelerators: H100:8
  network_tier: best    # CRITICAL for multi-node

What network_tier: best enables:

AWS: Elastic Fabric Adapter (EFA) -- up to 3200 Gbps
GCP: GPUDirect-TCPX -- up to 200 Gbps per GPU
Azure: InfiniBand NDR -- up to 3200 Gbps

Without network_tier: best: Standard TCP networking (~25-100 Gbps), which becomes a severe bottleneck for multi-node training.
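
A back-of-the-envelope estimate shows why the gap matters. The sketch below computes a bandwidth-only lower bound for one ring all-reduce of the gradients across nodes (each node sends about 2(N-1)/N times the payload). It ignores latency, compute overlap, and intra-node traffic, and the 7B-parameter example and function name are assumptions for illustration.

```python
def ring_allreduce_seconds(payload_bytes, n_nodes, gbps):
    """Bandwidth-only lower bound for one ring all-reduce across nodes."""
    if n_nodes < 2:
        return 0.0  # nothing crosses the network on a single node
    bytes_on_wire = 2 * (n_nodes - 1) / n_nodes * payload_bytes
    return bytes_on_wire * 8 / (gbps * 1e9)


grad_bytes = 7e9 * 2  # 7B parameters with bf16 gradients
for gbps in (100, 3200):
    seconds = ring_allreduce_seconds(grad_bytes, n_nodes=4, gbps=gbps)
    print(f"{gbps} Gbps: {seconds:.3f} s per all-reduce")
```

Real throughput is lower than the nominal link rate, so the actual gap per step may differ from this estimate.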

NCCL Tuning

NCCL (NVIDIA Collective Communication Library) handles GPU-to-GPU communication.

# Common NCCL environment variables
export NCCL_DEBUG=INFO               # Debug logging
export NCCL_IB_DISABLE=0            # Enable InfiniBand (default)
export NCCL_NET_GDR_LEVEL=5         # GPUDirect RDMA level
export NCCL_SOCKET_IFNAME=eth0      # Network interface (name varies by machine)
export NCCL_P2P_DISABLE=0           # Enable P2P (NVLink)
export NCCL_CROSS_NIC=1             # Enable cross-NIC communication

SkyPilot typically sets these automatically when network_tier: best is configured. Only override for debugging.
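
Before overriding anything, it can help to record which NCCL variables are already set on each node. A minimal sketch; `nccl_env` is a hypothetical helper and the example values are made up:

```python
import os


def nccl_env(env=None):
    """Return all NCCL_* variables currently set, sorted by name."""
    env = os.environ if env is None else env
    return {key: env[key] for key in sorted(env) if key.startswith("NCCL_")}


# Example with a made-up environment; call nccl_env() to inspect the real one.
example = {"NCCL_DEBUG": "INFO", "PATH": "/usr/bin", "NCCL_CROSS_NIC": "1"}
for name, value in nccl_env(example).items():
    print(f"{name}={value}")
```

Logging this at startup makes it easier to compare nodes when one of them behaves differently.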

Tensor Parallel + Pipeline Parallel

For very large models (100B+) where even FSDP/ZeRO-3 is insufficient.

Tensor Parallel (TP): Split individual layers across GPUs.

Each GPU holds a slice of every layer
Requires fast inter-GPU communication (NVLink within node)
Typical: TP=8 (one node)

Pipeline Parallel (PP): Split layers sequentially across GPUs.

GPU 0 holds layers 0-7, GPU 1 holds layers 8-15, etc.
Micro-batching shrinks the pipeline bubble (idle time while stages wait for work)
Can span across nodes
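
The effect of micro-batching on the bubble can be quantified for a simple schedule. Assuming a GPipe-style schedule with equal time per stage, the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function name below is hypothetical.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a simple pipeline schedule with equal stage times."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# More micro-batches per step means less idle time with 4 pipeline stages.
for m in (1, 4, 16, 64):
    print(f"PP=4, micro-batches={m}: {pipeline_bubble_fraction(4, m):.1%} idle")
```

Interleaved schedules in frameworks such as Megatron-LM reduce the bubble further, so this formula is an upper-level guide rather than an exact prediction.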

Combined TP + PP + DP:

Example: 128 GPUs (16 nodes x 8 GPUs)
TP=8  (within each node)
PP=4  (across 4 nodes)
DP=4  (4 pipeline replicas)

Total: 8 * 4 * 4 = 128 GPUs

This level of parallelism requires frameworks like Megatron-LM or NeMo. For most models under 70B, FSDP or DeepSpeed ZeRO is sufficient.
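
The 128-GPU layout above can be sketched as a mapping from global rank to (DP, PP, TP) coordinates. The ordering used here (TP varies fastest, then PP, then DP) and contiguous rank numbering per node are illustrative assumptions; real frameworks define their own rank order.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_rank, pp_rank, tp_rank)."""
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tp_rank = rank % tp          # position inside the node's TP group
    pp_rank = (rank // tp) % pp  # pipeline stage
    dp_rank = rank // (tp * pp)  # pipeline replica
    return dp_rank, pp_rank, tp_rank


# TP=8, PP=4, DP=4 on 128 GPUs: ranks 0-7 share one node and one TP group.
for rank in (0, 7, 8, 32, 127):
    print(rank, rank_to_coords(rank, tp=8, pp=4, dp=4))
```

With this ordering, each block of 8 ranks (one node) forms a TP group, each block of 4 nodes forms one pipeline, and there are 4 such pipelines.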

Gradient Accumulation (Single GPU Alternative)

Before scaling to multiple GPUs, consider gradient accumulation.

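accumulation_steps = 8
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    # Assumes model(batch) returns the loss; scale it so the summed
    # gradients average over the accumulated micro-batches
    loss = model(batch) / accumulation_steps
    loss.backward()  # gradients accumulate in .grad across iterations

    # Update weights once every accumulation_steps micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()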

Effective batch size = micro_batch * accumulation_steps * num_gpus

This simulates a larger batch size on a single GPU. No communication overhead, no distributed complexity.
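
The batch-size arithmetic is easy to get wrong when changing GPU counts, so a small helper can make it explicit. Both function names below are hypothetical. The second one reports how many micro-batches are left over after the last full accumulation group in one pass, since the loop above only steps on full groups.

```python
def effective_batch_size(micro_batch, accumulation_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return micro_batch * accumulation_steps * num_gpus


def optimizer_steps(num_batches, accumulation_steps):
    """Return (full optimizer steps, leftover micro-batches) for one pass."""
    return divmod(num_batches, accumulation_steps)


print(effective_batch_size(micro_batch=4, accumulation_steps=8))
print(effective_batch_size(micro_batch=4, accumulation_steps=8, num_gpus=16))
print(optimizer_steps(num_batches=100, accumulation_steps=8))
```

To keep the effective batch size constant when moving from 1 GPU to 16, divide `accumulation_steps` by the same factor where possible.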

Common Mistakes

Mistake	Fix
Forgetting `network_tier: best` for multi-node	Without it, inter-node communication bottlenecks training.
Using DeepSpeed launcher on all nodes	Only head node (rank 0) runs the launcher. Others wait.
Not matching batch size to GPU count	Effective batch = micro_batch * accum_steps * world_size. Adjust accordingly.
ZeRO-3 + CPU offload by default	Start with ZeRO-2. Only use ZeRO-3 if model doesn't fit. CPU offload is last resort.
Using FSDP NO_SHARD (DDP) for large models	NO_SHARD doesn't save memory. Use FULL_SHARD for memory savings.
Not testing single-node before multi-node	Debug on 1 node first. Multi-node adds network failure modes.

Quick Reference

# SkyPilot multi-node launch
# In YAML: num_nodes: 4, resources.network_tier: best

# torchrun (in run: section)
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
torchrun --nnodes=$SKYPILOT_NUM_NODES \
  --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
  --node_rank=$SKYPILOT_NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  train.py

# DeepSpeed (head node only)
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
  echo "$SKYPILOT_NODE_IPS" | awk -v g=$SKYPILOT_NUM_GPUS_PER_NODE \
    '{print $1 " slots=" g}' > /tmp/hostfile
  deepspeed --hostfile /tmp/hostfile train.py --deepspeed ds_config.json
fi

# NCCL debug
export NCCL_DEBUG=INFO

For detailed parallelism strategies, see references/parallelism-guide.md. For complete DeepSpeed ZeRO configuration, see references/deepspeed-zero-config.md.