HuggingFace Ecosystem · Advanced
Preference Alignment with TRL
Align models with human preferences using SFT, DPO, and Reward training
TL;DR
Use the HuggingFace TRL (Transformer Reinforcement Learning) library to align language models with human preferences. Learn the three-stage pipeline: Supervised Fine-Tuning (SFT), Reward Modeling, and Direct Preference Optimization (DPO), all with production-ready trainer APIs.
Build a complete preference alignment pipeline using TRL, covering SFT, reward modeling, DPO, and evaluation of aligned models.
What You'll Learn
- SFTTrainer for instruction fine-tuning
- DPOTrainer for direct preference optimization
- RewardTrainer for reward model training
- Preference dataset preparation and formatting
- PPO overview and when to use it vs DPO
- Evaluation of aligned models
Tech Stack
| Component | Technology |
|---|---|
| Alignment | trl |
| Base Models | transformers |
| Efficient Training | peft, bitsandbytes |
| Datasets | datasets |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ ALIGNMENT PIPELINE (RLHF / DPO) │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: Supervised Fine-Tuning (SFT) │
│ ┌──────────────┐ ┌─────────────┐ ┌──────────────────┐ │
│ │ Base Model │───▶│ SFTTrainer │───▶│ Instruction- │ │
│ │ (Llama 3.1) │ │ │ │ tuned Model │ │
│ └──────────────┘ │ + Instruct │ └──────────────────┘ │
│ │ Dataset │ │ │
│ └─────────────┘ │ │
│ ▼ │
│ STAGE 2a: Reward Model Training │
│ ┌──────────────┐ ┌─────────────┐ ┌──────────────────┐ │
│ │ SFT Model │───▶│ RewardTrainer│──▶│ Reward Model │ │
│ │ │ │ │ │ (scores responses)│ │
│ └──────────────┘ │ + Preference│ └──────────────────┘ │
│ │ Pairs │ │
│ └─────────────┘ │
│ │ │
│ STAGE 2b: Direct Preference Optimization (DPO) │ │
│ ┌──────────────┐ ┌─────────────┐ ┌───────────▼──────┐ │
│ │ SFT Model │───▶│ DPOTrainer │───▶│ Aligned Model │ │
│ │ │ │ │ │ (follows prefs) │ │
│ └──────────────┘ │ + Preference│ └──────────────────┘ │
│ │ Pairs │ │
│ └─────────────┘ │
│ │
│ DPO vs PPO: │
│ • DPO: No reward model needed, simpler, faster │
│ • PPO: More flexible, can use any reward signal, harder to tune │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Project Structure
alignment-trl/
├── src/
│ ├── __init__.py
│ ├── data_prep.py # Preference dataset preparation
│ ├── sft.py # Supervised fine-tuning
│ ├── reward_model.py # Reward model training
│ ├── dpo.py # Direct preference optimization
│ ├── ppo_overview.py # PPO pipeline (reference)
│ └── evaluate_alignment.py # Alignment evaluation
├── configs/
│ ├── sft_config.yaml
│ └── dpo_config.yaml
├── examples/
│ └── full_pipeline.py
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
trl>=0.9.0
transformers>=4.40.0
peft>=0.11.0
bitsandbytes>=0.43.0
datasets>=2.19.0
accelerate>=0.30.0
torch>=2.0.0
Step 2: Dataset Preparation
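Before the code, it helps to see the two record shapes this step produces. Toy records with illustrative values (not real dataset rows):

```python
# Conversational SFT record: a "messages" list of role/content turns
sft_record = {
    "messages": [
        {"role": "user", "content": "Explain overfitting."},
        {"role": "assistant", "content": "The model memorizes training data instead of generalizing."},
    ]
}

# Preference record: the column names the DPO/reward trainers expect
preference_record = {
    "prompt": "Explain overfitting.",
    "chosen": "The model memorizes training data instead of generalizing.",
    "rejected": "Overfitting is when a model is too fit.",
}
```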
"""Prepare datasets for SFT and preference alignment."""
from datasets import load_dataset, Dataset
def prepare_sft_dataset(
dataset_name: str = "tatsu-lab/alpaca",
split: str = "train",
max_samples: int | None = None,
) -> Dataset:
"""
Prepare a dataset for Supervised Fine-Tuning.
SFT datasets have instruction-response pairs.
The model learns to generate the response given the instruction.
"""
dataset = load_dataset(dataset_name, split=split)
if max_samples:
dataset = dataset.select(range(min(max_samples, len(dataset))))
def format_chat(example):
"""Format as a chat conversation."""
messages = [
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["output"]},
]
if example.get("input"):
messages[0]["content"] += f"\n\nInput: {example['input']}"
example["messages"] = messages
return example
dataset = dataset.map(format_chat)
return dataset
def prepare_preference_dataset(
dataset_name: str = "Anthropic/hh-rlhf",
split: str = "train",
max_samples: int | None = None,
) -> Dataset:
"""
Prepare a preference dataset for DPO/Reward training.
Preference datasets have (prompt, chosen, rejected) triples.
The model learns to prefer "chosen" over "rejected" responses.
"""
dataset = load_dataset(dataset_name, split=split)
if max_samples:
dataset = dataset.select(range(min(max_samples, len(dataset))))
def format_preferences(example):
"""Extract prompt, chosen, and rejected from the dataset."""
# The hh-rlhf format has full conversations
chosen = example["chosen"]
rejected = example["rejected"]
# Extract the last turn as the response
prompt = chosen.rsplit("\n\nAssistant:", 1)[0] + "\n\nAssistant:"
chosen_response = chosen.rsplit("\n\nAssistant:", 1)[-1].strip()
rejected_response = rejected.rsplit("\n\nAssistant:", 1)[-1].strip()
return {
"prompt": prompt,
"chosen": chosen_response,
"rejected": rejected_response,
}
dataset = dataset.map(format_preferences)
return dataset
def create_synthetic_preferences(
instructions: list[str],
good_responses: list[str],
bad_responses: list[str],
) -> Dataset:
"""Create a preference dataset from existing pairs."""
return Dataset.from_dict({
"prompt": instructions,
"chosen": good_responses,
"rejected": bad_responses,
})Preference Dataset Format:
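To see what `format_preferences` does, here is the same `rsplit` logic applied to a toy hh-rlhf-style transcript (illustrative strings, not real dataset rows):

```python
# hh-rlhf rows are full "Human: ... Assistant: ..." transcripts that
# share a prompt and differ only in the final assistant turn.
chosen = "\n\nHuman: What is 2+2?\n\nAssistant: 2+2 equals 4."
rejected = "\n\nHuman: What is 2+2?\n\nAssistant: I would rather not say."

# Everything up to the last assistant marker is the shared prompt
prompt = chosen.rsplit("\n\nAssistant:", 1)[0] + "\n\nAssistant:"
chosen_response = chosen.rsplit("\n\nAssistant:", 1)[-1].strip()
rejected_response = rejected.rsplit("\n\nAssistant:", 1)[-1].strip()

print(repr(prompt))             # '\n\nHuman: What is 2+2?\n\nAssistant:'
print(repr(chosen_response))    # '2+2 equals 4.'
print(repr(rejected_response))  # 'I would rather not say.'
```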
┌─────────────────────────────────────────────────────────────────┐
│ PREFERENCE DATA │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Prompt: "Explain machine learning in simple terms." │
│ │
│ Chosen (preferred): │
│ "Machine learning is when computers learn from examples │
│ instead of being explicitly programmed. Like teaching a │
│ child to recognize cats by showing many cat pictures." │
│ │
│ Rejected (dispreferred): │
│ "ML is a subset of AI utilizing statistical learning theory │
│ and optimization algorithms to minimize empirical risk │
│ functions over hypothesis spaces." │
│ │
│ The model learns: simple explanations > jargon-heavy ones │
│ │
│ Sources of preference data: │
│ • Human annotators (most reliable, expensive) │
│ • LLM-as-judge (use GPT-4 to rank responses) │
│ • Heuristic rules (length, toxicity, helpfulness) │
│ • User feedback (thumbs up/down in production) │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 3: Supervised Fine-Tuning (SFT)
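Under the hood, SFT is next-token cross-entropy, and for completion-only training the prompt tokens are masked out of the loss with the -100 ignore index (SFTTrainer can handle this for you, e.g. via its completion-only collator). A minimal sketch of that masking:

```python
IGNORE_INDEX = -100  # positions PyTorch cross_entropy skips


def build_sft_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids as labels, masking the prompt so only the
    response tokens contribute to the loss."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]


# 5-token sequence whose first 2 tokens are the prompt
labels = build_sft_labels([11, 12, 13, 14, 15], prompt_len=2)
print(labels)  # [-100, -100, 13, 14, 15]
```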
"""Stage 1: Supervised Fine-Tuning with SFTTrainer."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
def run_sft(
model_name: str = "meta-llama/Llama-3.1-8B",
dataset: Dataset = None,
output_dir: str = "models/sft",
use_qlora: bool = True,
epochs: int = 1,
batch_size: int = 4,
max_seq_length: int = 1024,
):
"""
Run supervised fine-tuning.
SFT teaches the model to follow instructions by training
on (instruction, response) pairs. This is the foundation
for alignment — you need a good SFT model before DPO.
"""
# Load model (with optional 4-bit quantization)
model_kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
if use_qlora:
from transformers import BitsAndBytesConfig
model_kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# LoRA config
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
task_type="CAUSAL_LM",
)
# Training config
training_config = SFTConfig(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True,
logging_steps=10,
save_strategy="epoch",
max_seq_length=max_seq_length,
report_to="none",
)
trainer = SFTTrainer(
model=model,
args=training_config,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
)
trainer.train()
trainer.save_model(output_dir)
return trainerStep 4: Direct Preference Optimization (DPO)
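The DPO objective can be made concrete in a few lines of plain Python. A simplified numeric sketch (not the library implementation; TRL's DPOTrainer handles tokenization, batching, and averaging):

```python
import math


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Inputs are summed log-probs of each response under the policy
    (pi_*) and the reference model (ref_*)."""
    # Implicit rewards: beta-scaled log-ratios against the reference
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Bradley-Terry loss: -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Policy identical to reference: margin 0, loss = ln(2) ~= 0.693
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy already prefers chosen: margin > 0, loss drops below ln(2)
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Note how `beta` scales the margin: a larger `beta` makes the same log-ratio gap count for more, which is the KL-penalty behavior described in the docstring above.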
"""Stage 2: Direct Preference Optimization with DPOTrainer."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig
from datasets import Dataset
def run_dpo(
model_name: str = "models/sft",
ref_model_name: str | None = None,
dataset: Dataset = None,
output_dir: str = "models/dpo",
beta: float = 0.1,
epochs: int = 1,
batch_size: int = 2,
max_length: int = 1024,
max_prompt_length: int = 512,
):
"""
Run Direct Preference Optimization.
DPO directly optimizes the model to prefer chosen over rejected
responses, without needing a separate reward model.
The beta parameter controls how much the model can deviate from
the reference model:
- Low beta (0.05): Allows more deviation (stronger alignment)
- High beta (0.5): Stays closer to reference (conservative)
- Default beta (0.1): Good balance
DPO loss: -log(sigmoid(beta * (log(π/πref)(chosen) - log(π/πref)(rejected))))
"""
# Load the SFT model
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Reference model (the SFT model before DPO)
# If None, DPO uses implicit reference from the model's initial weights
ref_model = None
if ref_model_name:
ref_model = AutoModelForCausalLM.from_pretrained(
ref_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
# LoRA for DPO training
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
task_type="CAUSAL_LM",
)
# DPO config
training_config = DPOConfig(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=5e-5,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True,
logging_steps=10,
save_strategy="epoch",
beta=beta,
max_length=max_length,
max_prompt_length=max_prompt_length,
report_to="none",
)
trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=training_config,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
)
trainer.train()
trainer.save_model(output_dir)
return trainerDPO vs PPO vs RLHF:
┌─────────────────────────────────────────────────────────────────┐
│ ALIGNMENT METHODS COMPARED │
├─────────────────────────────────────────────────────────────────┤
│ │
│ RLHF with PPO (original ChatGPT approach): │
│ SFT ──► Train Reward Model ──► PPO optimization ──► Aligned │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Pros: Most flexible, can use any reward signal │ │
│ │ Cons: 4 models in memory, unstable training │ │
│ │ Needs: policy, ref, reward, value models │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ DPO (newer, simpler approach): │
│ SFT ──► DPO with preference pairs ──► Aligned │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Pros: Only 2 models (policy + ref), stable, simple │ │
│ │ Cons: Requires paired preference data, less flexible │ │
│ │ Needs: policy and reference models │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ When to use which: │
│ • DPO: You have preference pairs, want simplicity │
│ • PPO: You have a reward model/signal, need flexibility │
│ • ORPO/SimPO: You want even simpler (no reference model) │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 5: Reward Model Training
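RewardTrainer optimizes a pairwise ranking loss over the scores of each (chosen, rejected) pair. A minimal sketch of that objective (illustrative, not TRL's code):

```python
import math


def ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): the loss shrinks as the
    reward model scores the chosen response above the rejected one."""
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))


print(ranking_loss(2.0, -1.0))  # small: the pair is already ranked correctly
print(ranking_loss(-1.0, 2.0))  # large: the pair is ranked backwards
```

This is the same Bradley-Terry form as the DPO loss above, except the scalar scores come from an explicit reward head instead of policy/reference log-ratios.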
"""Train a reward model for preference alignment."""
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
def train_reward_model(
model_name: str = "meta-llama/Llama-3.1-8B",
dataset: Dataset = None,
output_dir: str = "models/reward",
epochs: int = 1,
batch_size: int = 4,
):
"""
Train a reward model that scores response quality.
The reward model takes (prompt + response) and outputs a scalar
score. Higher score = better response.
Used in PPO-based RLHF, or for filtering/ranking during inference.
"""
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=1, # Single scalar output
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
training_config = RewardConfig(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=1e-5,
fp16=True,
logging_steps=10,
save_strategy="epoch",
max_length=512,
report_to="none",
)
trainer = RewardTrainer(
model=model,
args=training_config,
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
trainer.save_model(output_dir)
return trainerStep 6: Alignment Evaluation
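For LLM-as-judge evaluation, the judge sees both responses and picks a winner. A hypothetical prompt builder (the template wording is an assumption; pair it with whatever judge client you use):

```python
def build_judge_prompt(question: str, response_a: str, response_b: str) -> str:
    """Assemble a pairwise comparison prompt for an LLM judge."""
    return (
        "You are judging two assistant responses to the same question.\n\n"
        f"Question: {question}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response answers the question better? "
        "Reply with exactly one of: A, B, tie."
    )


prompt = build_judge_prompt("What is DPO?", "A preference-tuning method.", "idk")
print(prompt.splitlines()[0])
```

In practice you should also swap the A/B positions between calls to control for the judge's position bias.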
"""Evaluate alignment quality of trained models."""
from transformers import pipeline
def evaluate_helpfulness(
model_path: str,
test_prompts: list[str],
judge_model: str = "gpt-4o-mini",
) -> dict:
"""
Evaluate model alignment using test prompts.
Methods:
1. Human evaluation (gold standard, expensive)
2. LLM-as-judge (GPT-4 rates responses)
3. Automated metrics (toxicity, coherence, length)
"""
# Generate responses from aligned model
pipe = pipeline("text-generation", model=model_path, device=0)
results = []
for prompt in test_prompts:
response = pipe(
prompt,
max_new_tokens=256,
temperature=0.7,
do_sample=True,
)[0]["generated_text"]
results.append({
"prompt": prompt,
"response": response,
"length": len(response.split()),
})
return {
"num_responses": len(results),
"avg_length": sum(r["length"] for r in results) / len(results),
"responses": results,
}
def win_rate_comparison(
model_a_responses: list[str],
model_b_responses: list[str],
prompts: list[str],
) -> dict:
"""
Compare two models by computing win rate.
For each prompt, determine which model gave a better response.
This can be done by human annotators or LLM-as-judge.
"""
# Placeholder for LLM-as-judge evaluation
wins_a = 0
wins_b = 0
ties = 0
for prompt, resp_a, resp_b in zip(prompts, model_a_responses, model_b_responses):
# In production, you'd use GPT-4 to judge:
# "Which response better answers the question? A or B?"
# For now, use length as a simple heuristic
len_a = len(resp_a.split())
len_b = len(resp_b.split())
if abs(len_a - len_b) < 5:
ties += 1
elif len_a > len_b:
wins_a += 1
else:
wins_b += 1
total = len(prompts)
return {
"model_a_win_rate": wins_a / total,
"model_b_win_rate": wins_b / total,
"tie_rate": ties / total,
}Step 7: Complete Pipeline Example
"""Complete alignment pipeline: SFT → DPO."""
from src.data_prep import prepare_sft_dataset, prepare_preference_dataset
from src.sft import run_sft
from src.dpo import run_dpo
def main():
print("=== Stage 1: SFT ===")
sft_dataset = prepare_sft_dataset(max_samples=5000)
run_sft(
model_name="meta-llama/Llama-3.1-8B",
dataset=sft_dataset,
output_dir="models/sft",
epochs=1,
)
print("\n=== Stage 2: DPO ===")
pref_dataset = prepare_preference_dataset(max_samples=5000)
run_dpo(
model_name="models/sft",
dataset=pref_dataset,
output_dir="models/dpo",
beta=0.1,
epochs=1,
)
print("\nAlignment pipeline complete!")
print("Aligned model saved to models/dpo/")
if __name__ == "__main__":
main()Running the Project
# Install dependencies
pip install -r requirements.txt
# Run the full pipeline
python examples/full_pipeline.py
# Or run stages individually
python -c "
from src.data_prep import prepare_sft_dataset
from src.sft import run_sft
ds = prepare_sft_dataset(max_samples=1000)
run_sft(dataset=ds, output_dir='models/sft', epochs=1)
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| SFT | Train on instruction-response pairs | Foundation: teaches model to follow instructions |
| DPO | Optimize directly from preference pairs | Simpler than PPO, no reward model needed |
| RewardTrainer | Train a model to score responses | Enables PPO-based RLHF |
| Beta (DPO) | KL divergence penalty strength | Controls how far alignment can deviate from SFT |
| Preference Data | (prompt, chosen, rejected) triples | The signal that defines "good" behavior |
| TRL | HuggingFace's alignment library | Production-ready trainers for SFT, DPO, PPO |
| QLoRA + DPO | 4-bit base + LoRA for alignment | Align 8B models on consumer GPUs |
Next Steps
- Distributed Training with Accelerate — Scale alignment to multi-GPU
- Model Evaluation & Benchmarks — Evaluate alignment quality