Distributed Training with Accelerate
Multi-GPU and multi-node training with mixed precision and DeepSpeed
TL;DR
Use HuggingFace accelerate to scale training from a single GPU to multi-GPU and multi-node setups with minimal code changes. Learn the Accelerator class, mixed precision, gradient accumulation, DeepSpeed ZeRO stages, and FSDP — all through a unified API.
Scale model training from a single GPU to multi-GPU and multi-node clusters using the HuggingFace accelerate library, with DeepSpeed and FSDP integration.
What You'll Learn
- The Accelerator class and distributed training patterns
- Mixed precision training (fp16, bf16)
- Gradient accumulation for large effective batch sizes
- DeepSpeed ZeRO stages (1, 2, 3)
- FSDP (Fully Sharded Data Parallel)
- Multi-node training setup
- Configuration with accelerate config
Tech Stack
| Component | Technology |
|---|---|
| Distributed | accelerate |
| Sharding | deepspeed, PyTorch FSDP |
| Models | transformers |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED TRAINING STRATEGIES │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ DDP (Data Parallel) — Replicate model, split data │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │
│ │ Full │ │ Full │ │ Full │ │ Full │ │
│ │ Model │ │ Model │ │ Model │ │ Model │ │
│ │ Batch 0 │ │ Batch 1 │ │ Batch 2 │ │ Batch 3 │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ Gradients synced via AllReduce after each step │
│ │
│ DeepSpeed ZeRO — Shard optimizer/gradients/params across GPUs │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │
│ │ Params 0 │ │ Params 1 │ │ Params 2 │ │ Params 3 │ │
│ │ Optim 0 │ │ Optim 1 │ │ Optim 2 │ │ Optim 3 │ │
│ │ Grads 0 │ │ Grads 1 │ │ Grads 2 │ │ Grads 3 │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ Each GPU holds 1/N of the model state (N = num GPUs) │
│ │
│ FSDP (Fully Sharded Data Parallel) — PyTorch native sharding │
│ Same concept as ZeRO-3 but built into PyTorch │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
distributed-training/
├── src/
│ ├── __init__.py
│ ├── accelerator_basics.py # Basic Accelerator usage
│ ├── mixed_precision.py # FP16/BF16 training
│ ├── grad_accumulation.py # Gradient accumulation
│ ├── deepspeed_training.py # DeepSpeed ZeRO integration
│ ├── fsdp_training.py # FSDP integration
│ └── multi_node.py # Multi-node setup
├── configs/
│ ├── deepspeed_zero2.json
│ ├── deepspeed_zero3.json
│ └── fsdp_config.yaml
├── examples/
│ └── train_distributed.py
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
accelerate>=0.30.0
transformers>=4.40.0
datasets>=2.19.0
deepspeed>=0.14.0
torch>=2.0.0

Step 2: Accelerator Basics
"""Basic distributed training with the Accelerator class."""
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
get_linear_schedule_with_warmup,
)
from datasets import load_dataset
def train_with_accelerator(
model_name: str = "gpt2",
dataset_name: str = "wikitext",
dataset_config: str = "wikitext-2-raw-v1",
epochs: int = 3,
batch_size: int = 8,
learning_rate: float = 5e-5,
max_length: int = 128,
):
"""
Train a model using HuggingFace Accelerate.
The Accelerator class handles:
1. Device placement (CPU, GPU, multi-GPU, TPU)
2. Distributed data parallel (DDP)
3. Mixed precision training
4. Gradient accumulation
The key insight: you write single-GPU code, and Accelerate
makes it work on any hardware configuration.
"""
# Initialize Accelerator
accelerator = Accelerator()
# Load model and tokenizer (standard code — no device management)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Load and tokenize dataset
dataset = load_dataset(dataset_name, dataset_config, split="train")
def tokenize(batch):
return tokenizer(
batch["text"],
truncation=True,
max_length=max_length,
padding="max_length",
return_tensors="pt",
)
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
dataset.set_format("torch")
# Create DataLoader (standard PyTorch)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
# Optimizer and scheduler (standard PyTorch)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
num_training_steps = epochs * len(dataloader)
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=num_training_steps // 10,
num_training_steps=num_training_steps,
)
# THIS IS THE KEY LINE: prepare everything for distributed training
model, optimizer, dataloader, scheduler = accelerator.prepare(
model, optimizer, dataloader, scheduler
)
# Training loop (identical to single-GPU code!)
model.train()
for epoch in range(epochs):
total_loss = 0
for batch in dataloader:
outputs = model(
input_ids=batch["input_ids"],
attention_mask=batch["attention_mask"],
labels=batch["input_ids"],
)
loss = outputs.loss
# Replace loss.backward() with accelerator.backward()
accelerator.backward(loss)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
total_loss += loss.item()
avg_loss = total_loss / len(dataloader)
# Only print from main process (avoid duplicate output)
if accelerator.is_main_process:
print(f"Epoch {epoch + 1}: loss = {avg_loss:.4f}")
# Save model (only from main process)
accelerator.wait_for_everyone()
if accelerator.is_main_process:
unwrapped = accelerator.unwrap_model(model)
unwrapped.save_pretrained("models/distributed")
        tokenizer.save_pretrained("models/distributed")

What accelerator.prepare() Does:
┌─────────────────────────────────────────────────────────────────┐
│ accelerator.prepare(model, optimizer, dataloader, scheduler) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ MODEL: │
│ • Wraps in DistributedDataParallel (DDP) │
│ • Moves to correct device │
│ • Applies mixed precision if configured │
│ │
│ OPTIMIZER: │
│ • Scales learning rate if needed │
│ • Wraps for gradient accumulation │
│ │
│ DATALOADER: │
│ • Adds DistributedSampler (each GPU gets different data) │
│ • Adjusts batch size per device │
│ │
│ SCHEDULER: │
│ • Adjusts for actual number of optimization steps │
│ │
│ After prepare(), your training loop works identically │
│ on 1 GPU, 4 GPUs, or 8 nodes × 8 GPUs. │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 3: Mixed Precision Training
"""Mixed precision training with Accelerate."""
from accelerate import Accelerator
def setup_mixed_precision(precision: str = "fp16") -> Accelerator:
"""
Configure mixed precision training.
Options:
- "no": Full fp32 (baseline, most VRAM)
- "fp16": Float16 mixed precision (50% less VRAM, faster on NVIDIA)
- "bf16": BFloat16 (better for training stability, needs Ampere+ GPU)
Mixed precision keeps master weights in fp32 but does forward/backward
in lower precision. The Accelerator handles all the casting.
"""
accelerator = Accelerator(mixed_precision=precision)
print(f"Device: {accelerator.device}")
print(f"Distributed: {accelerator.distributed_type}")
print(f"Mixed precision: {accelerator.mixed_precision}")
print(f"Num processes: {accelerator.num_processes}")
    return accelerator

Mixed Precision Comparison:
| Precision | VRAM | Speed | Stability | GPU Support |
|---|---|---|---|---|
| fp32 | Baseline | Baseline | Best | All |
| fp16 | ~50% less | ~2x faster | Good (loss scaling needed) | All NVIDIA |
| bf16 | ~50% less | ~2x faster | Best (no loss scaling) | Ampere+ (A100, RTX 30xx+) |
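Picking between these options usually comes down to two runtime checks. A minimal sketch of that decision (the `pick_precision` helper and its boolean arguments are ours, not part of the accelerate API; in practice you would feed it `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`):

```python
def pick_precision(cuda_available: bool, bf16_supported: bool) -> str:
    """Return a value suitable for Accelerator(mixed_precision=...).

    Prefers bf16 (no loss scaling needed) on Ampere+ hardware,
    falls back to fp16 on older CUDA GPUs, and fp32 on CPU.
    """
    if not cuda_available:
        return "no"    # CPU: stay in full fp32
    if bf16_supported:
        return "bf16"  # Ampere+ (A100, RTX 30xx+)
    return "fp16"      # older CUDA GPUs: fp16 with loss scaling


# In real code: pick_precision(torch.cuda.is_available(), torch.cuda.is_bf16_supported())
print(pick_precision(True, True))   # bf16
```

The returned string plugs straight into `Accelerator(mixed_precision=pick_precision(...))`.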
Step 4: Gradient Accumulation
"""Gradient accumulation for large effective batch sizes."""
from accelerate import Accelerator
def train_with_grad_accumulation(
model,
dataloader,
optimizer,
gradient_accumulation_steps: int = 8,
):
"""
Simulate larger batches by accumulating gradients.
With batch_size=4 and accumulation=8:
Effective batch size = 4 × 8 = 32
This lets you train with large effective batch sizes
even when your GPU can only fit small batches.
"""
accelerator = Accelerator(
gradient_accumulation_steps=gradient_accumulation_steps,
)
model, optimizer, dataloader = accelerator.prepare(
model, optimizer, dataloader
)
model.train()
for batch in dataloader:
# The context manager handles accumulation automatically
with accelerator.accumulate(model):
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
optimizer.step()
            optimizer.zero_grad()

Gradient Accumulation Explained:
┌─────────────────────────────────────────────────────────────────┐
│ GRADIENT ACCUMULATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Without accumulation (batch_size=4): │
│ Batch 1 ──► forward ──► backward ──► optimizer step │
│ Batch 2 ──► forward ──► backward ──► optimizer step │
│ Each step uses 4 examples. │
│ │
│ With accumulation (batch_size=4, accumulation_steps=4): │
│ Batch 1 ──► forward ──► backward ──► accumulate │
│ Batch 2 ──► forward ──► backward ──► accumulate │
│ Batch 3 ──► forward ──► backward ──► accumulate │
│ Batch 4 ──► forward ──► backward ──► optimizer step │
│ Each step uses 4 × 4 = 16 examples. │
│ │
│ With DDP + accumulation (4 GPUs, batch=4, accum=4): │
│ Effective batch = 4 GPUs × 4 batch × 4 accum = 64 │
│ │
│ Why large batches matter: │
│ • More stable gradient estimates │
│ • Better convergence for large models │
│ • LLM training often uses effective batch sizes of 256-2048 │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 5: DeepSpeed Integration
"""DeepSpeed ZeRO integration with Accelerate."""
from accelerate import Accelerator, DeepSpeedPlugin
from transformers import AutoModelForCausalLM, AutoTokenizer
def train_with_deepspeed(
model_name: str = "meta-llama/Llama-3.1-8B",
zero_stage: int = 2,
):
"""
Train with DeepSpeed ZeRO optimization.
ZeRO stages shard different parts of the training state:
Stage 1: Shard optimizer states (Adam has 2 states per param)
Memory savings: ~4x
Stage 2: + Shard gradients
Memory savings: ~8x
Stage 3: + Shard model parameters
Memory savings: ~Nx (N = num GPUs)
Enables training models larger than single GPU memory
"""
deepspeed_plugin = DeepSpeedPlugin(
zero_stage=zero_stage,
gradient_accumulation_steps=4,
gradient_clipping=1.0,
offload_optimizer_device="none", # "cpu" for ZeRO-Offload
offload_param_device="none", # "cpu" for parameter offloading
        zero3_init_flag=(zero_stage == 3),
)
accelerator = Accelerator(
deepspeed_plugin=deepspeed_plugin,
mixed_precision="fp16",
)
    # Model loading: with zero3_init_flag=True, transformers wraps
    # from_pretrained in deepspeed.zero.Init, so under ZeRO-3 the
    # parameters are sharded as they are loaded instead of
    # materializing a full copy on every rank. No per-stage
    # special-casing is needed here.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
    )
    return accelerator, model

configs/deepspeed_zero2.json:
{
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"overlap_comm": true,
"contiguous_gradients": true
}
}

configs/deepspeed_zero3.json:
{
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": 1.0,
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}

DeepSpeed ZeRO Memory Savings:
┌─────────────────────────────────────────────────────────────────┐
│ ZeRO MEMORY ANALYSIS (7B model, 4 GPUs) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Training State Memory (per GPU): │
│ ┌───────────────────┬────────┬────────┬────────┬────────┐ │
│ │ Component │ No ZeRO│ ZeRO-1 │ ZeRO-2 │ ZeRO-3 │ │
│ ├───────────────────┼────────┼────────┼────────┼────────┤ │
│ │ Parameters (fp16) │ 14 GB │ 14 GB │ 14 GB │ 3.5 GB│ │
│ │ Gradients (fp16) │ 14 GB │ 14 GB │ 3.5 GB│ 3.5 GB│ │
│ │ Optimizer (fp32) │ 56 GB │ 14 GB │ 14 GB │ 14 GB │ │
│ ├───────────────────┼────────┼────────┼────────┼────────┤ │
│ │ Total per GPU │ 84 GB │ 42 GB │ 31.5GB│ 21 GB │ │
│ └───────────────────┴────────┴────────┴────────┴────────┘ │
│ │
│ ZeRO-Offload: Move optimizer states or params to CPU RAM │
│ Enables training on fewer GPUs at cost of speed │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 6: FSDP Integration
"""FSDP (Fully Sharded Data Parallel) with Accelerate."""
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
)
def train_with_fsdp(
    model_name: str = "meta-llama/Llama-3.1-8B",
):
    """
    Train with PyTorch FSDP via Accelerate.
    FSDP is PyTorch's native equivalent of DeepSpeed ZeRO-3.
    It shards model parameters, gradients, and optimizer states
    across all GPUs.
    FSDP vs DeepSpeed:
    - FSDP: PyTorch native, simpler setup, good for most cases
    - DeepSpeed: More features (ZeRO-Offload, inference, etc.)
    """
    fsdp_plugin = FullyShardedDataParallelPlugin(
        # Gather full (unsharded) state dicts on rank 0 when saving
        state_dict_config=FullStateDictConfig(
            offload_to_cpu=True, rank0_only=True
        ),
        optim_state_dict_config=FullOptimStateDictConfig(
            offload_to_cpu=True, rank0_only=True
        ),
    )
accelerator = Accelerator(
fsdp_plugin=fsdp_plugin,
mixed_precision="bf16",
)
    return accelerator

Step 7: Multi-Node Setup
"""Multi-node distributed training configuration."""
def explain_launch_commands():
"""
How to launch distributed training with accelerate.
The `accelerate` CLI handles all the distributed setup.
"""
commands = {
"Single GPU": "accelerate launch train.py",
"Multi-GPU (single node)": "accelerate launch --num_processes 4 train.py",
"Multi-Node (2 nodes × 4 GPUs)": """
# On node 0 (master):
accelerate launch \\
--num_processes 8 \\
--num_machines 2 \\
--machine_rank 0 \\
--main_process_ip 10.0.0.1 \\
--main_process_port 29500 \\
train.py
# On node 1:
accelerate launch \\
--num_processes 8 \\
--num_machines 2 \\
--machine_rank 1 \\
--main_process_ip 10.0.0.1 \\
--main_process_port 29500 \\
train.py
""",
"With DeepSpeed": "accelerate launch --config_file configs/deepspeed.yaml train.py",
"Interactive config": "accelerate config",
}
for scenario, cmd in commands.items():
print(f"\n{scenario}:")
        print(f"  {cmd}")

Running the Project
# Install dependencies
pip install -r requirements.txt
# Interactive configuration
accelerate config
# Answer prompts about your hardware setup
# Launch on a single GPU
accelerate launch examples/train_distributed.py
# Launch on 4 GPUs
accelerate launch --num_processes 4 examples/train_distributed.py
# Launch with DeepSpeed ZeRO-2 (pass the DeepSpeed JSON directly;
# --config_file expects an accelerate YAML config, not a DeepSpeed JSON)
accelerate launch --use_deepspeed --deepspeed_config_file configs/deepspeed_zero2.json examples/train_distributed.py
# Check your config
accelerate env

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Accelerator | Unified distributed training API | Write once, run on any hardware |
| prepare() | Wraps model/optimizer/dataloader for distribution | Single line to enable multi-GPU |
| DDP | Replicate model, split data | Simplest multi-GPU strategy |
| ZeRO-1 | Shard optimizer states | ~4x memory savings |
| ZeRO-2 | + Shard gradients | ~8x memory savings |
| ZeRO-3 | + Shard model parameters | Train models larger than one GPU |
| FSDP | PyTorch native sharding (like ZeRO-3) | No external dependency needed |
| Mixed Precision | fp16/bf16 for compute, fp32 for storage | 2x speed, 50% less memory |
| Gradient Accumulation | Accumulate over N steps before update | Large effective batch on small GPUs |
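The batch-size and memory figures quoted above can be reproduced with back-of-the-envelope arithmetic. A sketch under the usual fp16-training assumptions (2 bytes/param for fp16 weights and for fp16 gradients, 8 bytes/param for Adam optimizer state, 1 GB = 1e9 bytes; the helper names are ours):

```python
def effective_batch_size(per_device_batch: int, num_gpus: int, accum_steps: int) -> int:
    """Examples consumed per optimizer step under DDP + gradient accumulation."""
    return per_device_batch * num_gpus * accum_steps


def zero_memory_per_gpu_gb(params_billions: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU training-state memory (GB) for a given ZeRO stage.

    Each ZeRO stage shards one more component across the N GPUs:
    stage 1 shards optimizer states, stage 2 adds gradients,
    stage 3 adds the parameters themselves.
    """
    p = params_billions
    params_gb = 2 * p / (num_gpus if stage >= 3 else 1)  # fp16 weights
    grads_gb = 2 * p / (num_gpus if stage >= 2 else 1)   # fp16 gradients
    optim_gb = 8 * p / (num_gpus if stage >= 1 else 1)   # Adam states (fp32)
    return params_gb + grads_gb + optim_gb


print(effective_batch_size(4, 4, 4))    # 64, as in the accumulation diagram
print(zero_memory_per_gpu_gb(7, 4, 0))  # 84.0 GB per GPU, no ZeRO
print(zero_memory_per_gpu_gb(7, 4, 2))  # 31.5 GB per GPU, ZeRO-2
print(zero_memory_per_gpu_gb(7, 4, 3))  # 21.0 GB per GPU, ZeRO-3
```

These match the 7B-model, 4-GPU table in Step 5; activation memory and fragmentation overhead are not modeled, so treat the numbers as lower bounds.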
Next Steps
- Production AI Workbench — Deploy trained models with Gradio
- Preference Alignment with TRL — Combine with DPO for aligned models