Distributed Training with Accelerate
Multi-GPU and multi-node training with mixed precision and DeepSpeed
TL;DR
Use HuggingFace accelerate to scale training from a single GPU to multi-GPU and multi-node setups with minimal code changes. Learn the Accelerator class, mixed precision, gradient accumulation, DeepSpeed ZeRO stages, and FSDP — all through a unified API.
What You'll Learn
- The `Accelerator` class and distributed training patterns
- Mixed precision training (fp16, bf16)
- Gradient accumulation for large effective batch sizes
- DeepSpeed ZeRO stages (1, 2, 3)
- FSDP (Fully Sharded Data Parallel)
- Multi-node training setup
- Configuration with `accelerate config`
Why Accelerate Simplifies Multi-GPU Training
Without Accelerate, scaling from 1 GPU to 4 GPUs requires rewriting your training script with torch.distributed, managing process groups, wrapping models in DDP, splitting data with DistributedSampler, and coordinating logging/saving across processes -- easily 100+ lines of boilerplate. The accelerate library reduces this to a single prepare() call: you write standard single-GPU PyTorch code, and Accelerate handles device placement, data sharding, gradient synchronization, and mixed precision transparently. It also provides a unified interface to DeepSpeed and FSDP, so you can switch between sharding strategies by changing a config file rather than rewriting code.
| Property | Value |
|---|---|
| Difficulty | Advanced |
| Time | ~4 days |
| Lines of Code | ~400 |
| Prerequisites | Fine-Tuning with PEFT, PyTorch training loops |
Tech Stack
| Component | Technology | Why |
|---|---|---|
| Distributed | accelerate | Unified API for DDP, DeepSpeed, and FSDP |
| Sharding | deepspeed, PyTorch FSDP | Shard model state across GPUs for memory savings |
| Models | transformers | Load and train any HuggingFace model |
| Python | 3.10+ | Type hint support |
Architecture
Distributed Training Strategies:
- DDP (Data Parallel)
- DeepSpeed ZeRO
- FSDP (Fully Sharded Data Parallel) (recommended)
Project Structure
distributed-training/
├── src/
│ ├── __init__.py
│ ├── accelerator_basics.py # Basic Accelerator usage
│ ├── mixed_precision.py # FP16/BF16 training
│ ├── grad_accumulation.py # Gradient accumulation
│ ├── deepspeed_training.py # DeepSpeed ZeRO integration
│ ├── fsdp_training.py # FSDP integration
│ └── multi_node.py # Multi-node setup
├── configs/
│ ├── deepspeed_zero2.json
│ ├── deepspeed_zero3.json
│ └── fsdp_config.yaml
├── examples/
│ └── train_distributed.py
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
accelerate>=0.30.0
transformers>=4.40.0
datasets>=2.19.0
deepspeed>=0.14.0
torch>=2.0.0

Step 2: Accelerator Basics
"""Basic distributed training with the Accelerator class."""
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
get_linear_schedule_with_warmup,
)
from datasets import load_dataset
def train_with_accelerator(
model_name: str = "gpt2",
dataset_name: str = "wikitext",
dataset_config: str = "wikitext-2-raw-v1",
epochs: int = 3,
batch_size: int = 8,
learning_rate: float = 5e-5,
max_length: int = 128,
):
"""
Train a model using HuggingFace Accelerate.
The Accelerator class handles:
1. Device placement (CPU, GPU, multi-GPU, TPU)
2. Distributed data parallel (DDP)
3. Mixed precision training
4. Gradient accumulation
The key insight: you write single-GPU code, and Accelerate
makes it work on any hardware configuration.
"""
# Initialize Accelerator
accelerator = Accelerator()
# Load model and tokenizer (standard code — no device management)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Load and tokenize dataset
dataset = load_dataset(dataset_name, dataset_config, split="train")
def tokenize(batch):
return tokenizer(
batch["text"],
truncation=True,
max_length=max_length,
padding="max_length",
return_tensors="pt",
)
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
dataset.set_format("torch")
# Create DataLoader (standard PyTorch)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
# Optimizer and scheduler (standard PyTorch)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
num_training_steps = epochs * len(dataloader)
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=num_training_steps // 10,
num_training_steps=num_training_steps,
)
# THIS IS THE KEY LINE: prepare everything for distributed training
model, optimizer, dataloader, scheduler = accelerator.prepare(
model, optimizer, dataloader, scheduler
)
# Training loop (identical to single-GPU code!)
model.train()
for epoch in range(epochs):
total_loss = 0
for batch in dataloader:
outputs = model(
input_ids=batch["input_ids"],
attention_mask=batch["attention_mask"],
labels=batch["input_ids"],
)
loss = outputs.loss
# Replace loss.backward() with accelerator.backward()
accelerator.backward(loss)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
total_loss += loss.item()
avg_loss = total_loss / len(dataloader)
# Only print from main process (avoid duplicate output)
if accelerator.is_main_process:
print(f"Epoch {epoch + 1}: loss = {avg_loss:.4f}")
# Save model (only from main process)
accelerator.wait_for_everyone()
if accelerator.is_main_process:
unwrapped = accelerator.unwrap_model(model)
unwrapped.save_pretrained("models/distributed")
tokenizer.save_pretrained("models/distributed")Understanding the Training Loop:
The training loop above is almost identical to single-GPU PyTorch -- that is the entire point of Accelerate. Three lines are different from standard code: (1) accelerator.prepare() wraps model, optimizer, dataloader, and scheduler for distributed execution. (2) accelerator.backward(loss) replaces loss.backward() to handle mixed precision loss scaling and gradient synchronization. (3) accelerator.is_main_process guards print/save operations so only one process (rank 0) writes output -- without this, 4 GPUs would print 4 copies of every log line. The accelerator.wait_for_everyone() call is a synchronization barrier that ensures all processes finish training before the main process saves weights.
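One detail worth making concrete is the learning-rate schedule used in Step 2: get_linear_schedule_with_warmup scales the base learning rate by a multiplier that ramps from 0 to 1 over the warmup steps, then decays linearly back to 0 by the final step. A pure-Python sketch of that multiplier (hypothetical `linear_warmup_multiplier` helper, mirroring the shape of the transformers schedule):

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear warmup to 1.0, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With warmup = total // 10 as in Step 2 (100 steps, 10 warmup):
assert linear_warmup_multiplier(0, 10, 100) == 0.0    # start of warmup
assert linear_warmup_multiplier(10, 10, 100) == 1.0   # warmup complete
assert linear_warmup_multiplier(55, 10, 100) == 0.5   # halfway through decay
```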
What accelerator.prepare() Does:
| Object | What prepare() does |
|---|---|
| Model | Wraps in DistributedDataParallel (DDP), moves to correct device, applies mixed precision if configured |
| Optimizer | Scales learning rate if needed, wraps for gradient accumulation |
| Dataloader | Adds DistributedSampler (each GPU gets different data), adjusts batch size per device |
| Scheduler | Adjusts for actual number of optimization steps |
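To make the dataloader row concrete: a DistributedSampler hands each process a disjoint slice of the dataset. A minimal sketch of the round-robin split (hypothetical `shard_indices` helper; the real sampler also handles shuffling and padding the last batch):

```python
def shard_indices(num_samples: int, num_processes: int, rank: int) -> list[int]:
    """Round-robin assignment of dataset indices to one process."""
    return list(range(rank, num_samples, num_processes))

# 8 samples across 4 processes: each rank sees a disjoint quarter
assert shard_indices(8, 4, 0) == [0, 4]
assert shard_indices(8, 4, 3) == [3, 7]
```

Because each process sees 1/N of the data, the per-device batch size stays the same while the global batch size grows with the number of GPUs.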
Step 3: Mixed Precision Training
"""Mixed precision training with Accelerate."""
from accelerate import Accelerator
def setup_mixed_precision(precision: str = "fp16") -> Accelerator:
"""
Configure mixed precision training.
Options:
- "no": Full fp32 (baseline, most VRAM)
- "fp16": Float16 mixed precision (50% less VRAM, faster on NVIDIA)
- "bf16": BFloat16 (better for training stability, needs Ampere+ GPU)
Mixed precision keeps master weights in fp32 but does forward/backward
in lower precision. The Accelerator handles all the casting.
"""
accelerator = Accelerator(mixed_precision=precision)
print(f"Device: {accelerator.device}")
print(f"Distributed: {accelerator.distributed_type}")
print(f"Mixed precision: {accelerator.mixed_precision}")
print(f"Num processes: {accelerator.num_processes}")
return acceleratorMixed Precision Comparison:
| Precision | VRAM | Speed | Stability | GPU Support |
|---|---|---|---|---|
| fp32 | Baseline | Baseline | Best | All |
| fp16 | ~50% less | ~2x faster | Good (loss scaling needed) | All NVIDIA |
| bf16 | ~50% less | ~2x faster | Best (no loss scaling) | Ampere+ (A100, RTX 30xx+) |
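The table's recommendation can be encoded in a small selection helper (hypothetical `pick_precision`; in practice the capability flags would come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`):

```python
def pick_precision(has_cuda: bool, bf16_supported: bool) -> str:
    """Prefer bf16 on Ampere+ GPUs, fall back to fp16, else full fp32."""
    if not has_cuda:
        return "no"
    return "bf16" if bf16_supported else "fp16"

# On an A100 (bf16-capable): bf16 for stability, no loss scaling needed
assert pick_precision(True, True) == "bf16"
# On pre-Ampere NVIDIA GPUs: fp16 with loss scaling
assert pick_precision(True, False) == "fp16"
```

The returned string can be passed directly as the `mixed_precision` argument shown above.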
Step 4: Gradient Accumulation
"""Gradient accumulation for large effective batch sizes."""
from accelerate import Accelerator
def train_with_grad_accumulation(
model,
dataloader,
optimizer,
gradient_accumulation_steps: int = 8,
):
"""
Simulate larger batches by accumulating gradients.
With batch_size=4 and accumulation=8:
Effective batch size = 4 × 8 = 32
This lets you train with large effective batch sizes
even when your GPU can only fit small batches.
"""
accelerator = Accelerator(
gradient_accumulation_steps=gradient_accumulation_steps,
)
model, optimizer, dataloader = accelerator.prepare(
model, optimizer, dataloader
)
model.train()
for batch in dataloader:
# The context manager handles accumulation automatically
with accelerator.accumulate(model):
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()Gradient Accumulation Explained:
Gradient Accumulation (effective batch sizes):
- Without accumulation: batch_size=4 → effective batch 4
- With accumulation: batch=4, accum=4 → effective batch 16
- DDP + accumulation (recommended): 4 GPUs, batch=4, accum=4 → effective batch 64

Why large batches matter: more stable gradient estimates, better convergence for large models.
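The arithmetic is simple enough to state as a one-liner (hypothetical `effective_batch_size` helper, covering both the docstring example and the DDP case):

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_gpus: int = 1) -> int:
    """Number of samples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_gpus

assert effective_batch_size(4, 8) == 32              # docstring example above
assert effective_batch_size(4, 4, num_gpus=4) == 64  # DDP + accumulation
```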
Step 5: DeepSpeed Integration
"""DeepSpeed ZeRO integration with Accelerate."""
from accelerate import Accelerator, DeepSpeedPlugin
from transformers import AutoModelForCausalLM, AutoTokenizer
def train_with_deepspeed(
model_name: str = "meta-llama/Llama-3.1-8B",
zero_stage: int = 2,
):
"""
Train with DeepSpeed ZeRO optimization.
ZeRO stages shard different parts of the training state:
Stage 1: Shard optimizer states (Adam has 2 states per param)
Memory savings: ~4x
Stage 2: + Shard gradients
Memory savings: ~8x
Stage 3: + Shard model parameters
Memory savings: ~Nx (N = num GPUs)
Enables training models larger than single GPU memory
"""
deepspeed_plugin = DeepSpeedPlugin(
zero_stage=zero_stage,
gradient_accumulation_steps=4,
gradient_clipping=1.0,
offload_optimizer_device="none", # "cpu" for ZeRO-Offload
offload_param_device="none", # "cpu" for parameter offloading
zero3_init_flag=True if zero_stage == 3 else False,
)
accelerator = Accelerator(
deepspeed_plugin=deepspeed_plugin,
mixed_precision="fp16",
)
# Model loading varies by ZeRO stage
if zero_stage == 3:
# ZeRO-3: Initialize empty, then shard
from accelerate import init_empty_weights
with init_empty_weights():
model = AutoModelForCausalLM.from_pretrained(model_name)
else:
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
)
return accelerator, model{
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"overlap_comm": true,
"contiguous_gradients": true
}
}

configs/deepspeed_zero3.json:

{
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}

DeepSpeed ZeRO Memory Savings:
ZeRO Memory Analysis (7B model, 4 GPUs), per GPU:
- No ZeRO: full training state replicated on every GPU
- ZeRO-1: shard optimizer states
- ZeRO-2: + shard gradients
- ZeRO-3: + shard parameters

ZeRO-Offload (recommended when GPU-poor): move optimizer states or params to CPU RAM. Enables training on fewer GPUs at the cost of speed.
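The per-GPU numbers can be estimated with the standard ZeRO accounting for fp16 mixed-precision Adam: 2 bytes/param for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master copy plus Adam's momentum and variance, with each term divided by the GPU count once its stage shards it. A sketch (hypothetical `zero_memory_per_gpu_gb` helper; activations and buffers are not included):

```python
def zero_memory_per_gpu_gb(params_billions: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU training-state memory (GB) under ZeRO sharding."""
    weights = 2.0 * params_billions     # fp16 parameters: 2 GB per billion params
    grads = 2.0 * params_billions       # fp16 gradients
    optimizer = 12.0 * params_billions  # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optimizer /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return weights + grads + optimizer

# 7B model on 4 GPUs:
assert zero_memory_per_gpu_gb(7, 4, 0) == 112.0  # no ZeRO: everything replicated
assert zero_memory_per_gpu_gb(7, 4, 1) == 49.0   # shard optimizer states
assert zero_memory_per_gpu_gb(7, 4, 2) == 38.5   # + shard gradients
assert zero_memory_per_gpu_gb(7, 4, 3) == 28.0   # + shard parameters
```

The ~4x and ~8x figures in the docstring above come from this same accounting at large GPU counts, where the sharded terms shrink toward zero.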
Step 6: FSDP Integration
"""FSDP (Fully Sharded Data Parallel) with Accelerate."""
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
FullOptimStateDictType,
FullStateDictType,
)
def train_with_fsdp(
model_name: str = "meta-llama/Llama-3.1-8B",
):
"""
Train with PyTorch FSDP via Accelerate.
FSDP is PyTorch's native equivalent of DeepSpeed ZeRO-3.
It shards model parameters, gradients, and optimizer states
across all GPUs.
FSDP vs DeepSpeed:
- FSDP: PyTorch native, simpler setup, good for most cases
- DeepSpeed: More features (ZeRO-Offload, inference, etc.)
"""
fsdp_plugin = FullyShardedDataParallelPlugin(
state_dict_type=FullStateDictType.FULL_STATE_DICT,
optim_state_dict_type=FullOptimStateDictType.FULL_OPTIM_STATE_DICT,
)
accelerator = Accelerator(
fsdp_plugin=fsdp_plugin,
mixed_precision="bf16",
)
return acceleratorStep 7: Multi-Node Setup
"""Multi-node distributed training configuration."""
def explain_launch_commands():
"""
How to launch distributed training with accelerate.
The `accelerate` CLI handles all the distributed setup.
"""
commands = {
"Single GPU": "accelerate launch train.py",
"Multi-GPU (single node)": "accelerate launch --num_processes 4 train.py",
"Multi-Node (2 nodes × 4 GPUs)": """
# On node 0 (master):
accelerate launch \\
--num_processes 8 \\
--num_machines 2 \\
--machine_rank 0 \\
--main_process_ip 10.0.0.1 \\
--main_process_port 29500 \\
train.py
# On node 1:
accelerate launch \\
--num_processes 8 \\
--num_machines 2 \\
--machine_rank 1 \\
--main_process_ip 10.0.0.1 \\
--main_process_port 29500 \\
train.py
""",
"With DeepSpeed": "accelerate launch --config_file configs/deepspeed.yaml train.py",
"Interactive config": "accelerate config",
}
for scenario, cmd in commands.items():
print(f"\n{scenario}:")
print(f" {cmd}")Running the Project
# Install dependencies
pip install -r requirements.txt
# Interactive configuration
accelerate config
# Answer prompts about your hardware setup
# Launch on a single GPU
accelerate launch examples/train_distributed.py
# Launch on 4 GPUs
accelerate launch --num_processes 4 examples/train_distributed.py
# Launch with DeepSpeed ZeRO-2 (the JSON is a DeepSpeed config, not an accelerate config)
accelerate launch --use_deepspeed --deepspeed_config_file configs/deepspeed_zero2.json examples/train_distributed.py
# Check your config
accelerate env

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Accelerator | Unified distributed training API | Write once, run on any hardware |
| prepare() | Wraps model/optimizer/dataloader for distribution | Single line to enable multi-GPU |
| DDP | Replicate model, split data | Simplest multi-GPU strategy |
| ZeRO-1 | Shard optimizer states | ~4x memory savings |
| ZeRO-2 | + Shard gradients | ~8x memory savings |
| ZeRO-3 | + Shard model parameters | Train models larger than one GPU |
| FSDP | PyTorch native sharding (like ZeRO-3) | No external dependency needed |
| Mixed Precision | fp16/bf16 for compute, fp32 for storage | 2x speed, 50% less memory |
| Gradient Accumulation | Accumulate over N steps before update | Large effective batch on small GPUs |
Next Steps
- Production AI Workbench — Deploy trained models with Gradio
- Preference Alignment with TRL — Combine with DPO for aligned models