Distributed Training with Accelerate
Multi-GPU and multi-node training with mixed precision and DeepSpeed
TL;DR
Use HuggingFace accelerate to scale training from a single GPU to multi-GPU and multi-node setups with minimal code changes. Learn the Accelerator class, mixed precision, gradient accumulation, DeepSpeed ZeRO stages, and FSDP — all through a unified API.
Scale model training from a single GPU to multi-GPU and multi-node clusters using the HuggingFace accelerate library, with DeepSpeed and FSDP integration.
What You'll Learn
- The Accelerator class and distributed training patterns
- Mixed precision training (fp16, bf16)
- Gradient accumulation for large effective batch sizes
- DeepSpeed ZeRO stages (1, 2, 3)
- FSDP (Fully Sharded Data Parallel)
- Multi-node training setup
- Configuration with accelerate config
Tech Stack
| Component | Technology |
|---|---|
| Distributed | accelerate |
| Sharding | deepspeed, PyTorch FSDP |
| Models | transformers |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED TRAINING STRATEGIES │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ DDP (Data Parallel) — Replicate model, split data │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │
│ │ Full │ │ Full │ │ Full │ │ Full │ │
│ │ Model │ │ Model │ │ Model │ │ Model │ │
│ │ Batch 0 │ │ Batch 1 │ │ Batch 2 │ │ Batch 3 │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ Gradients synced via AllReduce after each step │
│ │
│ DeepSpeed ZeRO — Shard optimizer/gradients/params across GPUs │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │
│ │ Params 0 │ │ Params 1 │ │ Params 2 │ │ Params 3 │ │
│ │ Optim 0 │ │ Optim 1 │ │ Optim 2 │ │ Optim 3 │ │
│ │ Grads 0 │ │ Grads 1 │ │ Grads 2 │ │ Grads 3 │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ Each GPU holds 1/N of the model state (N = num GPUs) │
│ │
│ FSDP (Fully Sharded Data Parallel) — PyTorch native sharding │
│ Same concept as ZeRO-3 but built into PyTorch │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
distributed-training/
├── src/
│ ├── __init__.py
│ ├── accelerator_basics.py # Basic Accelerator usage
│ ├── mixed_precision.py # FP16/BF16 training
│ ├── grad_accumulation.py # Gradient accumulation
│ ├── deepspeed_training.py # DeepSpeed ZeRO integration
│ ├── fsdp_training.py # FSDP integration
│ └── multi_node.py # Multi-node setup
├── configs/
│ ├── deepspeed_zero2.json
│ ├── deepspeed_zero3.json
│ └── fsdp_config.yaml
├── examples/
│ └── train_distributed.py
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
accelerate>=0.30.0
transformers>=4.40.0
datasets>=2.19.0
deepspeed>=0.14.0
torch>=2.0.0

Step 2: Accelerator Basics
"""Basic distributed training with the Accelerator class."""
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
get_linear_schedule_with_warmup,
)
from datasets import load_dataset
def train_with_accelerator(
model_name: str = "gpt2",
dataset_name: str = "wikitext",
dataset_config: str = "wikitext-2-raw-v1",
epochs: int = 3,
batch_size: int = 8,
learning_rate: float = 5e-5,
max_length: int = 128,
):
"""
Train a model using HuggingFace Accelerate.
The Accelerator class handles:
1. Device placement (CPU, GPU, multi-GPU, TPU)
2. Distributed data parallel (DDP)
3. Mixed precision training
4. Gradient accumulation
The key insight: you write single-GPU code, and Accelerate
makes it work on any hardware configuration.
"""
# Initialize Accelerator
accelerator = Accelerator()
# Load model and tokenizer (standard code — no device management)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Load and tokenize dataset
dataset = load_dataset(dataset_name, dataset_config, split="train")
def tokenize(batch):
return tokenizer(
batch["text"],
truncation=True,
max_length=max_length,
padding="max_length",
return_tensors="pt",
)
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
dataset.set_format("torch")
# Create DataLoader (standard PyTorch)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
# Optimizer and scheduler (standard PyTorch)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
num_training_steps = epochs * len(dataloader)
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=num_training_steps // 10,
num_training_steps=num_training_steps,
)
# THIS IS THE KEY LINE: prepare everything for distributed training
model, optimizer, dataloader, scheduler = accelerator.prepare(
model, optimizer, dataloader, scheduler
)
# Training loop (identical to single-GPU code!)
model.train()
for epoch in range(epochs):
total_loss = 0
for batch in dataloader:
outputs = model(
input_ids=batch["input_ids"],
attention_mask=batch["attention_mask"],
labels=batch["input_ids"],
)
loss = outputs.loss
# Replace loss.backward() with accelerator.backward()
accelerator.backward(loss)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
total_loss += loss.item()
avg_loss = total_loss / len(dataloader)
# Only print from main process (avoid duplicate output)
if accelerator.is_main_process:
print(f"Epoch {epoch + 1}: loss = {avg_loss:.4f}")
# Save model (only from main process)
accelerator.wait_for_everyone()
if accelerator.is_main_process:
unwrapped = accelerator.unwrap_model(model)
unwrapped.save_pretrained("models/distributed")
        tokenizer.save_pretrained("models/distributed")

What accelerator.prepare() Does:
┌─────────────────────────────────────────────────────────────────┐
│ accelerator.prepare(model, optimizer, dataloader, scheduler) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ MODEL: │
│ • Wraps in DistributedDataParallel (DDP) │
│ • Moves to correct device │
│ • Applies mixed precision if configured │
│ │
│ OPTIMIZER: │
│ • Scales learning rate if needed │
│ • Wraps for gradient accumulation │
│ │
│ DATALOADER: │
│ • Adds DistributedSampler (each GPU gets different data) │
│ • Adjusts batch size per device │
│ │
│ SCHEDULER: │
│ • Adjusts for actual number of optimization steps │
│ │
│ After prepare(), your training loop works identically │
│ on 1 GPU, 4 GPUs, or 8 nodes × 8 GPUs. │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 3: Mixed Precision Training
"""Mixed precision training with Accelerate."""
from accelerate import Accelerator
def setup_mixed_precision(precision: str = "fp16") -> Accelerator:
"""
Configure mixed precision training.
Options:
- "no": Full fp32 (baseline, most VRAM)
- "fp16": Float16 mixed precision (50% less VRAM, faster on NVIDIA)
- "bf16": BFloat16 (better for training stability, needs Ampere+ GPU)
Mixed precision keeps master weights in fp32 but does forward/backward
in lower precision. The Accelerator handles all the casting.
"""
accelerator = Accelerator(mixed_precision=precision)
print(f"Device: {accelerator.device}")
print(f"Distributed: {accelerator.distributed_type}")
print(f"Mixed precision: {accelerator.mixed_precision}")
print(f"Num processes: {accelerator.num_processes}")
    return accelerator

Mixed Precision Comparison:
| Precision | VRAM | Speed | Stability | GPU Support |
|---|---|---|---|---|
| fp32 | Baseline | Baseline | Best | All |
| fp16 | ~50% less | ~2x faster | Good (loss scaling needed) | All NVIDIA |
| bf16 | ~50% less | ~2x faster | Best (no loss scaling) | Ampere+ (A100, RTX 30xx+) |
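Picking between these options usually comes down to two runtime checks. A minimal sketch of that decision (the `pick_precision` helper and its boolean arguments are ours, not part of the accelerate API; in practice you would feed it `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`):

```python
def pick_precision(cuda_available: bool, bf16_supported: bool) -> str:
    """Return a value suitable for Accelerator(mixed_precision=...).

    Prefers bf16 (no loss scaling needed) on Ampere+ hardware,
    falls back to fp16 on older CUDA GPUs, and fp32 on CPU.
    """
    if not cuda_available:
        return "no"    # CPU: stay in full fp32
    if bf16_supported:
        return "bf16"  # Ampere+ (A100, RTX 30xx+)
    return "fp16"      # older CUDA GPUs: fp16 with loss scaling


# In real code: pick_precision(torch.cuda.is_available(), torch.cuda.is_bf16_supported())
print(pick_precision(True, True))   # bf16
```

The returned string plugs straight into `Accelerator(mixed_precision=pick_precision(...))`.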
Step 4: Gradient Accumulation
"""Gradient accumulation for large effective batch sizes."""
from accelerate import Accelerator
def train_with_grad_accumulation(
model,
dataloader,
optimizer,
gradient_accumulation_steps: int = 8,
):
"""
Simulate larger batches by accumulating gradients.
With batch_size=4 and accumulation=8:
Effective batch size = 4 × 8 = 32
This lets you train with large effective batch sizes
even when your GPU can only fit small batches.
"""
accelerator = Accelerator(
gradient_accumulation_steps=gradient_accumulation_steps,
)
model, optimizer, dataloader = accelerator.prepare(
model, optimizer, dataloader
)
model.train()
for batch in dataloader:
# The context manager handles accumulation automatically
with accelerator.accumulate(model):
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
optimizer.step()
            optimizer.zero_grad()

Gradient Accumulation Explained:
┌─────────────────────────────────────────────────────────────────┐
│ GRADIENT ACCUMULATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Without accumulation (batch_size=4): │
│ Batch 1 ──► forward ──► backward ──► optimizer step │
│ Batch 2 ──► forward ──► backward ──► optimizer step │
│ Each step uses 4 examples. │
│ │
│ With accumulation (batch_size=4, accumulation_steps=4): │
│ Batch 1 ──► forward ──► backward ──► accumulate │
│ Batch 2 ──► forward ──► backward ──► accumulate │
│ Batch 3 ──► forward ──► backward ──► accumulate │
│ Batch 4 ──► forward ──► backward ──► optimizer step │
│ Each step uses 4 × 4 = 16 examples. │
│ │
│ With DDP + accumulation (4 GPUs, batch=4, accum=4): │
│ Effective batch = 4 GPUs × 4 batch × 4 accum = 64 │
│ │
│ Why large batches matter: │
│ • More stable gradient estimates │
│ • Better convergence for large models │
│ • LLM training often uses effective batch sizes of 256-2048 │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 5: DeepSpeed Integration
"""DeepSpeed ZeRO integration with Accelerate."""
from accelerate import Accelerator, DeepSpeedPlugin
from transformers import AutoModelForCausalLM, AutoTokenizer
def train_with_deepspeed(
model_name: str = "meta-llama/Llama-3.1-8B",
zero_stage: int = 2,
):
"""
Train with DeepSpeed ZeRO optimization.
ZeRO stages shard different parts of the training state:
Stage 1: Shard optimizer states (Adam has 2 states per param)
Memory savings: ~4x
Stage 2: + Shard gradients
Memory savings: ~8x
Stage 3: + Shard model parameters
Memory savings: ~Nx (N = num GPUs)
Enables training models larger than single GPU memory
"""
deepspeed_plugin = DeepSpeedPlugin(
zero_stage=zero_stage,
gradient_accumulation_steps=4,
gradient_clipping=1.0,
offload_optimizer_device="none", # "cpu" for ZeRO-Offload
offload_param_device="none", # "cpu" for parameter offloading
        zero3_init_flag=(zero_stage == 3),
)
accelerator = Accelerator(
deepspeed_plugin=deepspeed_plugin,
mixed_precision="fp16",
)
    # Model loading: with zero3_init_flag=True, transformers wraps
    # from_pretrained in deepspeed.zero.Init, so under ZeRO-3 the
    # parameters are sharded as they are loaded instead of
    # materializing a full copy on every rank. No per-stage
    # special-casing is needed here.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
    )
    return accelerator, model

configs/deepspeed_zero2.json:
{
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"overlap_comm": true,
"contiguous_gradients": true
}
}

configs/deepspeed_zero3.json:
{
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}

DeepSpeed ZeRO Memory Savings:
┌─────────────────────────────────────────────────────────────────┐
│ ZeRO MEMORY ANALYSIS (7B model, 4 GPUs) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Training State Memory (per GPU): │
│ ┌───────────────────┬────────┬────────┬────────┬────────┐ │
│ │ Component │ No ZeRO│ ZeRO-1 │ ZeRO-2 │ ZeRO-3 │ │
│ ├───────────────────┼────────┼────────┼────────┼────────┤ │
│ │ Parameters (fp16) │ 14 GB │ 14 GB │ 14 GB │ 3.5 GB│ │
│ │ Gradients (fp16) │ 14 GB │ 14 GB │ 3.5 GB│ 3.5 GB│ │
│ │ Optimizer (fp32) │ 56 GB │ 14 GB │ 14 GB │ 14 GB │ │
│ ├───────────────────┼────────┼────────┼────────┼────────┤ │
│ │ Total per GPU │ 84 GB │ 42 GB │ 31.5GB│ 21 GB │ │
│ └───────────────────┴────────┴────────┴────────┴────────┘ │
│ │
│ ZeRO-Offload: Move optimizer states or params to CPU RAM │
│ Enables training on fewer GPUs at cost of speed │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 6: FSDP Integration
"""FSDP (Fully Sharded Data Parallel) with Accelerate."""
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
)
def train_with_fsdp(
    model_name: str = "meta-llama/Llama-3.1-8B",
):
    """
    Train with PyTorch FSDP via Accelerate.
    FSDP is PyTorch's native equivalent of DeepSpeed ZeRO-3.
    It shards model parameters, gradients, and optimizer states
    across all GPUs.
    FSDP vs DeepSpeed:
    - FSDP: PyTorch native, simpler setup, good for most cases
    - DeepSpeed: More features (ZeRO-Offload, inference, etc.)
    """
    fsdp_plugin = FullyShardedDataParallelPlugin(
        # Gather full (unsharded) state dicts on rank 0 when saving
        state_dict_config=FullStateDictConfig(
            offload_to_cpu=True, rank0_only=True
        ),
        optim_state_dict_config=FullOptimStateDictConfig(
            offload_to_cpu=True, rank0_only=True
        ),
    )
accelerator = Accelerator(
fsdp_plugin=fsdp_plugin,
mixed_precision="bf16",
)
    return accelerator

Step 7: Multi-Node Setup
"""Multi-node distributed training configuration."""
def explain_launch_commands():
"""
How to launch distributed training with accelerate.
The `accelerate` CLI handles all the distributed setup.
"""
commands = {
"Single GPU": "accelerate launch train.py",
"Multi-GPU (single node)": "accelerate launch --num_processes 4 train.py",
"Multi-Node (2 nodes × 4 GPUs)": """
# On node 0 (master):
accelerate launch \\
--num_processes 8 \\
--num_machines 2 \\
--machine_rank 0 \\
--main_process_ip 10.0.0.1 \\
--main_process_port 29500 \\
train.py
# On node 1:
accelerate launch \\
--num_processes 8 \\
--num_machines 2 \\
--machine_rank 1 \\
--main_process_ip 10.0.0.1 \\
--main_process_port 29500 \\
train.py
""",
"With DeepSpeed": "accelerate launch --config_file configs/deepspeed.yaml train.py",
"Interactive config": "accelerate config",
}
for scenario, cmd in commands.items():
print(f"\n{scenario}:")
        print(f"  {cmd}")

Running the Project
# Install dependencies
pip install -r requirements.txt
# Interactive configuration
accelerate config
# Answer prompts about your hardware setup
# Launch on a single GPU
accelerate launch examples/train_distributed.py
# Launch on 4 GPUs
accelerate launch --num_processes 4 examples/train_distributed.py
# Launch with DeepSpeed ZeRO-2 (pass the DeepSpeed JSON directly;
# --config_file expects an accelerate YAML config, not a DeepSpeed JSON)
accelerate launch --use_deepspeed --deepspeed_config_file configs/deepspeed_zero2.json examples/train_distributed.py
# Check your config
accelerate env

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Accelerator | Unified distributed training API | Write once, run on any hardware |
| prepare() | Wraps model/optimizer/dataloader for distribution | Single line to enable multi-GPU |
| DDP | Replicate model, split data | Simplest multi-GPU strategy |
| ZeRO-1 | Shard optimizer states | ~4x memory savings |
| ZeRO-2 | + Shard gradients | ~8x memory savings |
| ZeRO-3 | + Shard model parameters | Train models larger than one GPU |
| FSDP | PyTorch native sharding (like ZeRO-3) | No external dependency needed |
| Mixed Precision | fp16/bf16 for compute, fp32 for storage | 2x speed, 50% less memory |
| Gradient Accumulation | Accumulate over N steps before update | Large effective batch on small GPUs |
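The batch-size and memory figures quoted above can be reproduced with back-of-the-envelope arithmetic. A sketch under the usual fp16-training assumptions (2 bytes/param for fp16 weights and for fp16 gradients, 8 bytes/param for Adam optimizer state, 1 GB = 1e9 bytes; the helper names are ours):

```python
def effective_batch_size(per_device_batch: int, num_gpus: int, accum_steps: int) -> int:
    """Examples consumed per optimizer step under DDP + gradient accumulation."""
    return per_device_batch * num_gpus * accum_steps


def zero_memory_per_gpu_gb(params_billions: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU training-state memory (GB) for a given ZeRO stage.

    Each ZeRO stage shards one more component across the N GPUs:
    stage 1 shards optimizer states, stage 2 adds gradients,
    stage 3 adds the parameters themselves.
    """
    p = params_billions
    params_gb = 2 * p / (num_gpus if stage >= 3 else 1)  # fp16 weights
    grads_gb = 2 * p / (num_gpus if stage >= 2 else 1)   # fp16 gradients
    optim_gb = 8 * p / (num_gpus if stage >= 1 else 1)   # Adam states (fp32)
    return params_gb + grads_gb + optim_gb


print(effective_batch_size(4, 4, 4))    # 64, as in the accumulation diagram
print(zero_memory_per_gpu_gb(7, 4, 0))  # 84.0 GB per GPU, no ZeRO
print(zero_memory_per_gpu_gb(7, 4, 2))  # 31.5 GB per GPU, ZeRO-2
print(zero_memory_per_gpu_gb(7, 4, 3))  # 21.0 GB per GPU, ZeRO-3
```

These match the 7B-model, 4-GPU table in Step 5; activation memory and fragmentation overhead are not modeled, so treat the numbers as lower bounds.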
Next Steps
- Production AI Workbench — Deploy trained models with Gradio
- Preference Alignment with TRL — Combine with DPO for aligned models