Preference Alignment with TRL
Align models with human preferences using SFT, DPO, and Reward training
TL;DR
Use the HuggingFace TRL (Transformer Reinforcement Learning) library to align language models with human preferences. Learn the three-stage pipeline: Supervised Fine-Tuning (SFT), Reward Modeling, and Direct Preference Optimization (DPO), all with production-ready trainer APIs.
What You'll Learn
- SFTTrainer for instruction fine-tuning
- DPOTrainer for direct preference optimization
- RewardTrainer for reward model training
- Preference dataset preparation and formatting
- PPO overview and when to use it vs DPO
- Evaluation of aligned models
Why Align Models with Human Preferences?
A pre-trained language model can generate fluent text, but it has no notion of what a good response looks like. It may produce toxic content, hallucinate confidently, or give technically correct but unhelpful answers. Alignment -- the process of training models to follow human preferences -- is what transforms a raw LLM into an assistant like ChatGPT or Claude. The TRL library provides production-ready implementations of the alignment algorithms (SFT, DPO, PPO, ORPO) that would otherwise take weeks to implement correctly. Understanding alignment is essential for anyone building user-facing LLM applications.
| Property | Value |
|---|---|
| Difficulty | Advanced |
| Time | ~4 days |
| Lines of Code | ~500 |
| Prerequisites | Fine-Tuning with PEFT, Datasets Mastery |
Tech Stack
| Component | Technology | Why |
|---|---|---|
| Alignment | trl | Production-ready SFT, DPO, PPO, ORPO trainers |
| Base Models | transformers | Load and configure pretrained LLMs |
| Efficient Training | peft, bitsandbytes | QLoRA for alignment on consumer GPUs |
| Datasets | datasets | Load and process preference datasets |
| Python | 3.10+ | Type hint support |
Architecture
Alignment Pipeline — Stage 1: SFT
SFTTrainer + Instruction Data → SFT Model (follows instructions)
Alignment Pipeline — Stage 2: Preference Optimization
2a: Reward Model Training
RewardTrainer + Preference Pairs → Reward Model (scores responses)
2b: DPO (Direct Preference Optimization)
DPOTrainer + Preference Pairs → Aligned Model (follows preferences)
DPO vs PPO: DPO needs no reward model and is simpler and faster to train. PPO is more flexible and can optimize against any reward signal, but it is harder to tune.
Project Structure
alignment-trl/
├── src/
│ ├── __init__.py
│ ├── data_prep.py # Preference dataset preparation
│ ├── sft.py # Supervised fine-tuning
│ ├── reward_model.py # Reward model training
│ ├── dpo.py # Direct preference optimization
│ ├── ppo_overview.py # PPO pipeline (reference)
│ └── evaluate_alignment.py # Alignment evaluation
├── configs/
│ ├── sft_config.yaml
│ └── dpo_config.yaml
├── examples/
│ └── full_pipeline.py
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
trl>=0.9.0
transformers>=4.40.0
peft>=0.11.0
bitsandbytes>=0.43.0
datasets>=2.19.0
accelerate>=0.30.0
torch>=2.0.0

Step 2: Dataset Preparation
"""Prepare datasets for SFT and preference alignment."""
from datasets import load_dataset, Dataset
def prepare_sft_dataset(
dataset_name: str = "tatsu-lab/alpaca",
split: str = "train",
max_samples: int | None = None,
) -> Dataset:
"""
Prepare a dataset for Supervised Fine-Tuning.
SFT datasets have instruction-response pairs.
The model learns to generate the response given the instruction.
"""
dataset = load_dataset(dataset_name, split=split)
if max_samples:
dataset = dataset.select(range(min(max_samples, len(dataset))))
def format_chat(example):
"""Format as a chat conversation."""
messages = [
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["output"]},
]
if example.get("input"):
messages[0]["content"] += f"\n\nInput: {example['input']}"
example["messages"] = messages
return example
dataset = dataset.map(format_chat)
return dataset
def prepare_preference_dataset(
dataset_name: str = "Anthropic/hh-rlhf",
split: str = "train",
max_samples: int | None = None,
) -> Dataset:
"""
Prepare a preference dataset for DPO/Reward training.
Preference datasets have (prompt, chosen, rejected) triples.
The model learns to prefer "chosen" over "rejected" responses.
"""
dataset = load_dataset(dataset_name, split=split)
if max_samples:
dataset = dataset.select(range(min(max_samples, len(dataset))))
def format_preferences(example):
"""Extract prompt, chosen, and rejected from the dataset."""
# The hh-rlhf format has full conversations
chosen = example["chosen"]
rejected = example["rejected"]
# Extract the last turn as the response
prompt = chosen.rsplit("\n\nAssistant:", 1)[0] + "\n\nAssistant:"
chosen_response = chosen.rsplit("\n\nAssistant:", 1)[-1].strip()
rejected_response = rejected.rsplit("\n\nAssistant:", 1)[-1].strip()
return {
"prompt": prompt,
"chosen": chosen_response,
"rejected": rejected_response,
}
dataset = dataset.map(format_preferences)
return dataset
def create_synthetic_preferences(
instructions: list[str],
good_responses: list[str],
bad_responses: list[str],
) -> Dataset:
"""Create a preference dataset from existing pairs."""
return Dataset.from_dict({
"prompt": instructions,
"chosen": good_responses,
"rejected": bad_responses,
    })

Preference Dataset Format:
Preference Data — Prompt: "Explain machine learning in simple terms."
Chosen (preferred): a plain-language explanation. Rejected (dispreferred): a jargon-heavy one.
The model learns: simple explanations > jargon-heavy ones. Sources of preference data: human annotators (most reliable, expensive), LLM-as-judge (use GPT-4 to rank), heuristic rules (length, toxicity), user feedback (thumbs up/down).
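To make the extraction in format_preferences concrete, here is the same rsplit logic applied to a minimal hh-rlhf-style string (the conversation text below is made up for illustration, not real dataset content):

```python
# Minimal hh-rlhf-style record (illustrative text, not from the dataset).
chosen = (
    "\n\nHuman: Explain machine learning in simple terms."
    "\n\nAssistant: It means computers learn patterns from examples."
)

# Same split used by format_preferences above: everything up to the last
# "Assistant:" turn is the prompt; the final turn is the response.
prompt = chosen.rsplit("\n\nAssistant:", 1)[0] + "\n\nAssistant:"
chosen_response = chosen.rsplit("\n\nAssistant:", 1)[-1].strip()

print(chosen_response)  # It means computers learn patterns from examples.
```

The same split is applied independently to the rejected conversation, which shares the prompt but ends in a different final turn.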
Step 3: Supervised Fine-Tuning (SFT)
"""Stage 1: Supervised Fine-Tuning with SFTTrainer."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
def run_sft(
model_name: str = "meta-llama/Llama-3.1-8B",
    dataset: Dataset | None = None,
output_dir: str = "models/sft",
use_qlora: bool = True,
epochs: int = 1,
batch_size: int = 4,
max_seq_length: int = 1024,
):
"""
Run supervised fine-tuning.
SFT teaches the model to follow instructions by training
on (instruction, response) pairs. This is the foundation
for alignment — you need a good SFT model before DPO.
"""
# Load model (with optional 4-bit quantization)
model_kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
if use_qlora:
from transformers import BitsAndBytesConfig
model_kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# LoRA config
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
task_type="CAUSAL_LM",
)
# Training config
training_config = SFTConfig(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True,
logging_steps=10,
save_strategy="epoch",
max_seq_length=max_seq_length,
report_to="none",
)
trainer = SFTTrainer(
model=model,
args=training_config,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
)
trainer.train()
trainer.save_model(output_dir)
    return trainer

Step 4: Direct Preference Optimization (DPO)
"""Stage 2: Direct Preference Optimization with DPOTrainer."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig
from datasets import Dataset
def run_dpo(
model_name: str = "models/sft",
ref_model_name: str | None = None,
    dataset: Dataset | None = None,
output_dir: str = "models/dpo",
beta: float = 0.1,
epochs: int = 1,
batch_size: int = 2,
max_length: int = 1024,
max_prompt_length: int = 512,
):
"""
Run Direct Preference Optimization.
DPO directly optimizes the model to prefer chosen over rejected
responses, without needing a separate reward model.
The beta parameter controls how much the model can deviate from
the reference model:
- Low beta (0.05): Allows more deviation (stronger alignment)
- High beta (0.5): Stays closer to reference (conservative)
- Default beta (0.1): Good balance
    DPO loss: -log σ(β · [log(π(chosen)/π_ref(chosen)) - log(π(rejected)/π_ref(rejected))])
"""
# Load the SFT model
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
    # Reference model (the frozen SFT model before DPO).
    # If None and a PEFT config is given, TRL recovers the reference policy
    # by disabling the LoRA adapters, avoiding a second full model in memory.
ref_model = None
if ref_model_name:
ref_model = AutoModelForCausalLM.from_pretrained(
ref_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
# LoRA for DPO training
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
task_type="CAUSAL_LM",
)
# DPO config
training_config = DPOConfig(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=5e-5,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True,
logging_steps=10,
save_strategy="epoch",
beta=beta,
max_length=max_length,
max_prompt_length=max_prompt_length,
report_to="none",
)
trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=training_config,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
)
trainer.train()
trainer.save_model(output_dir)
    return trainer

Understanding the DPO Training Setup:
The run_dpo function takes an SFT-trained model and optimizes it to prefer chosen responses over rejected ones. The beta parameter is the most important hyperparameter: it controls the KL divergence penalty that prevents the model from deviating too far from the reference model (the SFT checkpoint). A low beta (0.05) allows aggressive alignment but risks "reward hacking" -- the model may find shortcuts that score well on the preference signal but degrade general quality. A high beta (0.5) keeps the model conservative. The ref_model is loaded separately because DPO computes log-probabilities under both the current policy and the reference policy at every step; when ref_model is None and a PEFT config is used, TRL recovers the reference policy by disabling the LoRA adapters, which avoids holding a second full model in memory.
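The loss itself is simple enough to compute by hand. A minimal sketch with made-up summed log-probabilities (the numbers are illustrative; in practice they come from the policy and reference model):

```python
import math

beta = 0.1

# Illustrative summed log-probs of each response under the policy (pi)
# and the frozen reference model (ref); real values come from the models.
logp_pi_chosen, logp_ref_chosen = -12.0, -13.0
logp_pi_rejected, logp_ref_rejected = -15.0, -13.5

# Implicit rewards: beta * log(pi / pi_ref) for each response
chosen_reward = beta * (logp_pi_chosen - logp_ref_chosen)
rejected_reward = beta * (logp_pi_rejected - logp_ref_rejected)

# DPO loss: -log sigmoid(chosen_reward - rejected_reward)
margin = chosen_reward - rejected_reward
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(round(loss, 4))  # 0.5759
```

Note how the loss only depends on the reward margin: raising the chosen response's probability relative to the reference, or lowering the rejected one's, both shrink the loss.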
DPO vs PPO vs RLHF:
Alignment Methods Compared:
- RLHF with PPO: train a reward model, then optimize the policy against it with PPO. Flexible, but complex and hard to tune.
- DPO (recommended): optimize directly on preference pairs. No reward model, simpler and more stable.
- ORPO / SimPO: drop the reference model entirely for an even simpler single-stage recipe.
When to use which: DPO if you have preference pairs and want simplicity. PPO if you have a reward model/signal and need flexibility. ORPO/SimPO if you want even simpler (no reference model).
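If you want to try the reference-model-free route mentioned above, recent TRL releases ship an ORPOTrainer whose config mirrors DPOConfig. A hedged configuration sketch (API names assume a recent TRL version; verify against your installed release before relying on it):

```python
"""Sketch: single-stage ORPO as an alternative to SFT + DPO (untested config)."""
from trl import ORPOTrainer, ORPOConfig

# ORPO folds instruction tuning and preference optimization into one stage,
# needs no reference model, and consumes the same (prompt, chosen, rejected) data.
config = ORPOConfig(
    output_dir="models/orpo",
    beta=0.1,                  # weight of the odds-ratio preference term
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    report_to="none",
)

# trainer = ORPOTrainer(
#     model=model,             # base or SFT model from the earlier steps
#     args=config,
#     train_dataset=pref_dataset,
#     tokenizer=tokenizer,
#     peft_config=peft_config,  # same LoRA config as in run_dpo
# )
# trainer.train()
```

The trainer call is commented out because it requires a loaded model and dataset; the point is that the setup is the DPO setup minus the reference model.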
Step 5: Reward Model Training
"""Train a reward model for preference alignment."""
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
def train_reward_model(
model_name: str = "meta-llama/Llama-3.1-8B",
    dataset: Dataset | None = None,
output_dir: str = "models/reward",
epochs: int = 1,
batch_size: int = 4,
):
"""
Train a reward model that scores response quality.
The reward model takes (prompt + response) and outputs a scalar
score. Higher score = better response.
Used in PPO-based RLHF, or for filtering/ranking during inference.
"""
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=1, # Single scalar output
torch_dtype=torch.float16,
device_map="auto",
)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # Sequence-classification heads need the pad token id set on the model config
    model.config.pad_token_id = tokenizer.pad_token_id
training_config = RewardConfig(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=1e-5,
fp16=True,
logging_steps=10,
save_strategy="epoch",
max_length=512,
report_to="none",
)
trainer = RewardTrainer(
model=model,
args=training_config,
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
trainer.save_model(output_dir)
    return trainer

Step 6: Alignment Evaluation
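Before the evaluation helpers below, it helps to see what an LLM-as-judge query actually looks like. A minimal sketch of building a pairwise judging prompt (the template wording is made up for illustration, not a fixed standard):

```python
def build_judge_prompt(question: str, response_a: str, response_b: str) -> str:
    """Hypothetical pairwise prompt for an LLM judge (e.g., GPT-4)."""
    return (
        f"Question: {question}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better answers the question? "
        "Answer with exactly one of: A, B, tie."
    )

prompt = build_judge_prompt(
    "Explain machine learning in simple terms.",
    "Computers learn patterns from examples.",
    "ML is parametric ERM over hypothesis classes.",
)
print(prompt.splitlines()[0])  # Question: Explain machine learning in simple terms.
```

The judge's single-token verdict ("A", "B", or "tie") is then tallied across prompts, which is exactly what win_rate_comparison below does with its placeholder heuristic.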
"""Evaluate alignment quality of trained models."""
from transformers import pipeline
def evaluate_helpfulness(
model_path: str,
test_prompts: list[str],
judge_model: str = "gpt-4o-mini",
) -> dict:
"""
Evaluate model alignment using test prompts.
Methods:
1. Human evaluation (gold standard, expensive)
2. LLM-as-judge (GPT-4 rates responses)
3. Automated metrics (toxicity, coherence, length)
"""
# Generate responses from aligned model
pipe = pipeline("text-generation", model=model_path, device=0)
results = []
for prompt in test_prompts:
response = pipe(
prompt,
max_new_tokens=256,
temperature=0.7,
do_sample=True,
)[0]["generated_text"]
results.append({
"prompt": prompt,
"response": response,
"length": len(response.split()),
})
return {
"num_responses": len(results),
"avg_length": sum(r["length"] for r in results) / len(results),
"responses": results,
}
def win_rate_comparison(
model_a_responses: list[str],
model_b_responses: list[str],
prompts: list[str],
) -> dict:
"""
Compare two models by computing win rate.
For each prompt, determine which model gave a better response.
This can be done by human annotators or LLM-as-judge.
"""
# Placeholder for LLM-as-judge evaluation
wins_a = 0
wins_b = 0
ties = 0
for prompt, resp_a, resp_b in zip(prompts, model_a_responses, model_b_responses):
# In production, you'd use GPT-4 to judge:
# "Which response better answers the question? A or B?"
# For now, use length as a simple heuristic
len_a = len(resp_a.split())
len_b = len(resp_b.split())
if abs(len_a - len_b) < 5:
ties += 1
elif len_a > len_b:
wins_a += 1
else:
wins_b += 1
total = len(prompts)
return {
"model_a_win_rate": wins_a / total,
"model_b_win_rate": wins_b / total,
"tie_rate": ties / total,
    }

Step 7: Complete Pipeline Example
"""Complete alignment pipeline: SFT → DPO."""
from src.data_prep import prepare_sft_dataset, prepare_preference_dataset
from src.sft import run_sft
from src.dpo import run_dpo
def main():
print("=== Stage 1: SFT ===")
sft_dataset = prepare_sft_dataset(max_samples=5000)
run_sft(
model_name="meta-llama/Llama-3.1-8B",
dataset=sft_dataset,
output_dir="models/sft",
epochs=1,
)
print("\n=== Stage 2: DPO ===")
pref_dataset = prepare_preference_dataset(max_samples=5000)
run_dpo(
model_name="models/sft",
dataset=pref_dataset,
output_dir="models/dpo",
beta=0.1,
epochs=1,
)
print("\nAlignment pipeline complete!")
print("Aligned model saved to models/dpo/")
if __name__ == "__main__":
    main()

Running the Project
# Install dependencies
pip install -r requirements.txt
# Run the full pipeline
python examples/full_pipeline.py
# Or run stages individually
python -c "
from src.data_prep import prepare_sft_dataset
from src.sft import run_sft
ds = prepare_sft_dataset(max_samples=1000)
run_sft(dataset=ds, output_dir='models/sft', epochs=1)
"

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| SFT | Train on instruction-response pairs | Foundation: teaches model to follow instructions |
| DPO | Optimize directly from preference pairs | Simpler than PPO, no reward model needed |
| RewardTrainer | Train a model to score responses | Enables PPO-based RLHF |
| Beta (DPO) | KL divergence penalty strength | Controls how far alignment can deviate from SFT |
| Preference Data | (prompt, chosen, rejected) triples | The signal that defines "good" behavior |
| TRL | HuggingFace's alignment library | Production-ready trainers for SFT, DPO, PPO |
| QLoRA + DPO | 4-bit base + LoRA for alignment | Align 8B models on consumer GPUs |
Next Steps
- Distributed Training with Accelerate — Scale alignment to multi-GPU
- Model Evaluation & Benchmarks — Evaluate alignment quality