Fine-Tuning with PEFT
LoRA, QLoRA, and adapter methods for efficient fine-tuning with the PEFT library
TL;DR
PEFT (Parameter-Efficient Fine-Tuning) lets you fine-tune large models by updating only 0.1-1% of parameters. Learn LoRA, QLoRA (4-bit), and Prefix Tuning using the peft library, with bitsandbytes for quantization and safetensors for safe weight storage.
What You'll Learn
- LoRA: Low-Rank Adaptation for efficient fine-tuning
- QLoRA: 4-bit quantization + LoRA for minimal VRAM
- Prefix Tuning and Prompt Tuning methods
- BitsAndBytes 4-bit and 8-bit quantization
- Adapter merging and saving with safetensors
- Publishing adapters to HuggingFace Hub
Why Parameter-Efficient Fine-Tuning?
Full fine-tuning of a 7B-parameter model requires roughly 64 GB of GPU memory (model weights, gradients, and optimizer states in mixed precision), so a single A100 80GB is the practical minimum -- costing $2-4/hour in the cloud. PEFT methods like LoRA freeze the base model and train only small adapter matrices (~0.1% of parameters); combined with 4-bit quantization (QLoRA), this shrinks the footprint to ~6 GB, enough to fine-tune Llama 3.1 8B on a free Colab T4 GPU. The adapters themselves are tiny (5-50 MB vs 14 GB for full fp16 weights), making it practical to maintain dozens of task-specific adapters that share a single base model.
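The adapter-size claim is easy to sanity-check with back-of-the-envelope arithmetic (a rough sketch; real adapter sizes vary with rank and target modules):

```python
# Rough adapter-size estimate for a 7B base model with LoRA.
base_params = 7_000_000_000
adapter_fraction = 0.001   # ~0.1% of parameters are trainable
bytes_per_param = 2        # fp16 storage

adapter_params = int(base_params * adapter_fraction)
adapter_mb = adapter_params * bytes_per_param / 1e6
full_gb = base_params * bytes_per_param / 1e9

print(f"Adapter: ~{adapter_mb:.0f} MB vs full fp16 weights: ~{full_gb:.0f} GB")
# Adapter: ~14 MB vs full fp16 weights: ~14 GB
```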
| Property | Value |
|---|---|
| Difficulty | Intermediate |
| Time | ~6 hours |
| Lines of Code | ~400 |
| Prerequisites | Pipelines & Hub, Tokenizers, basic PyTorch |
Tech Stack
| Component | Technology | Why |
|---|---|---|
| PEFT Methods | peft | Unified API for LoRA, Prefix Tuning, and more |
| Base Models | transformers | Load and configure any pretrained model |
| Quantization | bitsandbytes | 4-bit NF4 quantization for extreme memory savings |
| Storage | safetensors | Secure, fast model serialization |
| Hub | huggingface_hub | Share adapters with the community |
| Python | 3.10+ | Type hint support |
Architecture
PEFT Fine-Tuning Approaches
Full Fine-Tuning
LoRA (Low-Rank Adaptation)
QLoRA (Quantized LoRA) (recommended)
LoRA Math: W' = W + BA where B is [d, r] and A is [r, d]. Instead of updating a d x d weight matrix, learn two small low-rank factors. With r=16 and d=4096: ~131K params instead of ~16.8M (a 128x reduction).
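The parameter arithmetic can be checked directly; this sketch assumes a single square d x d projection, as in the example above:

```python
# LoRA parameter-count comparison for one d x d weight matrix.
d, r = 4096, 16

full_update = d * d                # params in a dense delta-W
lora_update = 2 * r * d            # B is [d, r], A is [r, d]

print(full_update)                 # 16777216  (~16.8M)
print(lora_update)                 # 131072    (~131K)
print(full_update // lora_update)  # 128
```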
Project Structure
fine-tuning-peft/
├── src/
│ ├── __init__.py
│ ├── lora.py # LoRA fine-tuning
│ ├── qlora.py # QLoRA (4-bit) fine-tuning
│ ├── prefix_tuning.py # Prefix and prompt tuning
│ ├── adapter_ops.py # Merge, save, push adapters
│ └── data_prep.py # Dataset preparation for fine-tuning
├── configs/
│ └── training_config.yaml
├── examples/
│ └── finetune_llama.py
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
peft>=0.11.0
transformers>=4.40.0
bitsandbytes>=0.43.0
safetensors>=0.4.0
datasets>=2.19.0
huggingface_hub>=0.23.0
torch>=2.0.0
trl>=0.9.0
accelerate>=0.30.0
Step 2: LoRA Fine-Tuning
"""LoRA fine-tuning with the PEFT library."""
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
)
from peft import (
LoraConfig,
get_peft_model,
TaskType,
PeftModel,
)
from datasets import Dataset
def setup_lora(
model_name: str = "meta-llama/Llama-3.1-8B",
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.05,
target_modules: list[str] | None = None,
):
"""
Set up a model with LoRA adapters.
Args:
model_name: Base model from HuggingFace Hub
lora_r: LoRA rank — lower = fewer params, less capacity
lora_alpha: LoRA scaling factor — controls adapter strength
lora_dropout: Dropout on LoRA layers for regularization
target_modules: Which layers to add LoRA to
"""
# Load base model
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Default target modules for transformer models
if target_modules is None:
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
]
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=lora_r,
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
target_modules=target_modules,
bias="none",
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# Example output (exact counts depend on target_modules and rank):
# trainable params: 6,553,600 || all params: 8,030,261,248 || trainable%: 0.0816
return model, tokenizer
def train_lora(
model,
tokenizer,
train_dataset: Dataset,
eval_dataset: Dataset | None = None,
output_dir: str = "models/lora-adapter",
epochs: int = 3,
batch_size: int = 4,
learning_rate: float = 2e-4,
max_length: int = 512,
):
"""Train the LoRA adapter."""
# Tokenize dataset
def tokenize(example):
result = tokenizer(
example["text"],
truncation=True,
max_length=max_length,
padding="max_length",
)
result["labels"] = result["input_ids"].copy()
return result
train_tokenized = train_dataset.map(tokenize, remove_columns=["text"])
eval_tokenized = eval_dataset.map(tokenize, remove_columns=["text"]) if eval_dataset else None
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=learning_rate,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True,
logging_steps=10,
eval_strategy="steps" if eval_tokenized else "no",
eval_steps=100,
save_strategy="steps",
save_steps=200,
save_total_limit=3,
report_to="none",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_tokenized,
eval_dataset=eval_tokenized,
)
trainer.train()
# Save the adapter (NOT the full model — just the LoRA weights)
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
return trainer
LoRA Hyperparameter Guide:
LoRA Configuration — Rank (r)
- r = 4 (0.02% params)
- r = 16 (0.08% params, recommended)
- r = 64 (0.3% params)
- r = 256 (1.2% params)
Alpha: Effective scaling = alpha / r. Rule of thumb: alpha = 2 x r (so scaling = 2.0).
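One easy-to-miss consequence of the alpha / r scaling: raising r without also raising alpha weakens the adapter's contribution, because PEFT scales the adapter output by alpha / r. A minimal illustration:

```python
# LoRA's effective update is W + (alpha / r) * B @ A,
# so the alpha/r ratio, not alpha alone, controls adapter strength.
def lora_scaling(alpha: int, r: int) -> float:
    return alpha / r

print(lora_scaling(32, 16))  # 2.0 -- the alpha = 2*r rule of thumb
print(lora_scaling(32, 64))  # 0.5 -- higher rank, same alpha: weaker update
```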
LoRA Configuration — Target Modules
- Attention only
- All attention
- All attention + MLP layers (recommended)
| Parameter | Recommended | Effect of Increasing |
|---|---|---|
| r | 16 | More adapter capacity, more VRAM |
| alpha | 32 (2×r) | Stronger adapter influence |
| dropout | 0.05 | More regularization |
| target_modules | All attention + MLP | More layers adapted, more VRAM |
Step 3: QLoRA (4-bit)
"""QLoRA: LoRA with 4-bit quantization for minimal VRAM usage."""
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training,
TaskType,
)
def setup_qlora(
model_name: str = "meta-llama/Llama-3.1-8B",
lora_r: int = 16,
lora_alpha: int = 32,
):
"""
Set up a model with QLoRA (4-bit quantized base + LoRA adapters).
QLoRA cuts VRAM to ~6 GB (4-bit base + LoRA), versus ~16 GB just to hold fp16 base weights for plain LoRA.
The base model is frozen in 4-bit, while LoRA adapters train in fp16.
"""
# BitsAndBytes 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 — better than int4
bnb_4bit_compute_dtype=torch.float16, # Compute in fp16 for LoRA
bnb_4bit_use_double_quant=True, # Quantize the quantization constants
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Prepare model for k-bit training
# (handles gradient checkpointing and layer norm casting)
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=lora_r,
lora_alpha=lora_alpha,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
return model, tokenizer
Understanding the QLoRA Setup:
The BitsAndBytesConfig is the key to QLoRA's memory savings. load_in_4bit=True quantizes every weight to 4 bits on load. bnb_4bit_quant_type="nf4" selects NormalFloat4, a data type designed for normally-distributed neural network weights (better than uniform int4). bnb_4bit_compute_dtype=torch.float16 means the 4-bit weights are dequantized to fp16 on-the-fly during computation -- the LoRA adapters then compute gradients in fp16 while the frozen base stays in 4-bit. bnb_4bit_use_double_quant=True quantizes the quantization constants themselves, saving an additional ~0.4 bits per parameter. The prepare_model_for_kbit_training() call handles two critical details: it enables gradient checkpointing (recompute activations during backward to save memory) and casts LayerNorm layers to fp32 for training stability.
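The per-parameter storage costs described above can be roughed out numerically (weights only; real usage adds activations, the LoRA adapters, and optimizer state):

```python
# Rough weight-memory estimate for an 8B-parameter base model.
params = 8_000_000_000

fp16_gb = params * 16 / 8 / 1e9           # 16 bits (2 bytes) per param
nf4_gb = params * 4 / 8 / 1e9             # 4 bits (0.5 bytes) per param
nf4_dq_gb = params * (4 - 0.4) / 8 / 1e9  # double quant saves ~0.4 bits/param

print(f"fp16 weights:       ~{fp16_gb:.1f} GB")   # ~16.0 GB
print(f"nf4 weights:        ~{nf4_gb:.1f} GB")    # ~4.0 GB
print(f"nf4 + double quant: ~{nf4_dq_gb:.1f} GB") # ~3.6 GB
```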
QLoRA Memory Savings:
QLoRA Memory Comparison (Llama 3.1 8B): full fine-tuning (fp16) needs the most VRAM, LoRA (fp16 base) substantially less, and QLoRA (4-bit base, the recommended option) the least.
nf4 (NormalFloat4): quantization format optimized for normally-distributed weights; better than uniform int4. Double quantization: quantizes the quantization constants themselves, saving ~0.4 bits per parameter.
Step 4: Prefix Tuning
"""Prefix Tuning and Prompt Tuning with PEFT."""
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType
def setup_prefix_tuning(
model_name: str = "gpt2",
num_virtual_tokens: int = 20,
):
"""
Set up Prefix Tuning.
Prefix Tuning prepends learnable "virtual tokens" to the
key and value states of every attention layer.
Unlike LoRA (which modifies weight matrices), Prefix Tuning
adds new information to the attention context.
"""
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
config = PrefixTuningConfig(
task_type=TaskType.CAUSAL_LM,
num_virtual_tokens=num_virtual_tokens,
# prefix_projection=True uses an MLP to generate the prefix
# (more parameters but often better quality)
prefix_projection=True,
encoder_hidden_size=512,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
return model, tokenizer
PEFT Method Comparison:
| Method | What It Modifies | Trainable Params | Best For |
|---|---|---|---|
| LoRA | Weight matrices (W + BA) | 0.1-1% | General fine-tuning |
| QLoRA | Same as LoRA + 4-bit base | 0.1-1% | Low-VRAM fine-tuning |
| Prefix Tuning | Attention key/value context | ~0.1% | Task-specific prompts |
| Prompt Tuning | Input embeddings only | ~0.01% | Simple task adaptation |
| IA3 | Activation scaling vectors | ~0.01% | Few-shot adaptation |
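As a rough cross-check on the trainable-parameter column, the raw Prefix Tuning parameter count (without the optional prefix_projection MLP) follows directly from the model dimensions; the figures below assume GPT-2 small (12 layers, hidden size 768):

```python
# Prefix Tuning learns virtual key/value vectors for every attention layer.
num_virtual_tokens = 20
num_layers = 12    # GPT-2 small
hidden_size = 768  # GPT-2 small

# One key vector and one value vector per virtual token per layer.
prefix_params = num_virtual_tokens * num_layers * 2 * hidden_size
print(prefix_params)  # 368640 -- about 0.3% of GPT-2's ~124M parameters
```

For larger models the fraction shrinks, since prefix parameters grow linearly with depth and width while total parameters grow roughly quadratically with width.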
Step 5: Adapter Operations
"""Adapter management: merge, save, load, and push to Hub."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
from safetensors.torch import save_file, load_file
def load_adapter(
base_model_name: str,
adapter_path: str,
) -> tuple:
"""Load a base model and apply a saved adapter."""
# Load base model
model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# Load and apply adapter
model = PeftModel.from_pretrained(model, adapter_path)
return model, tokenizer
def merge_adapter(
base_model_name: str,
adapter_path: str,
output_path: str,
):
"""
Merge LoRA adapter into the base model.
After merging, the model runs at full speed (no adapter overhead)
but you lose the ability to swap adapters.
"""
model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_path)
# Merge adapter weights into base model
model = model.merge_and_unload()
# Save merged model
model.save_pretrained(output_path, safe_serialization=True)
tokenizer.save_pretrained(output_path)
print(f"Merged model saved to {output_path}")
def push_adapter_to_hub(
adapter_path: str,
repo_id: str,
private: bool = False,
):
"""Push a trained adapter to HuggingFace Hub."""
from peft import PeftModel, PeftConfig
from huggingface_hub import HfApi
config = PeftConfig.from_pretrained(adapter_path)
api = HfApi()
# Create the repo if it doesn't exist yet, then upload the adapter folder
api.create_repo(repo_id, private=private, exist_ok=True)
api.upload_folder(
folder_path=adapter_path,
repo_id=repo_id,
repo_type="model",
)
print(f"Adapter pushed to https://huggingface.co/{repo_id}")
print(f"Base model: {config.base_model_name_or_path}")
def explain_safetensors():
"""
Why safetensors over pickle (.bin)?
1. Security: No arbitrary code execution (pickle can run malicious code)
2. Speed: Memory-mapped loading (2-5x faster for large models)
3. Size: Same as pickle (no compression difference)
4. Lazy loading: Load specific tensors without loading the full file
transformers uses safetensors by default since v4.35.
"""
# Save tensors
tensors = {
"weight": torch.randn(768, 768),
"bias": torch.randn(768),
}
save_file(tensors, "model.safetensors")
# Load tensors (memory-mapped)
loaded = load_file("model.safetensors")
print(f"Loaded keys: {list(loaded.keys())}")
Step 6: Complete Fine-tuning Example
"""Complete example: Fine-tune Llama 3.1 with QLoRA."""
from datasets import load_dataset
from src.qlora import setup_qlora
from trl import SFTTrainer, SFTConfig
def main():
# 1. Load model with QLoRA
model, tokenizer = setup_qlora(
model_name="meta-llama/Llama-3.1-8B",
lora_r=16,
lora_alpha=32,
)
# 2. Load and format dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")
def format_prompt(example):
if example["input"]:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
else:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
return {"text": text}
dataset = dataset.map(format_prompt)
# 3. Train with SFTTrainer (from TRL — simplifies the training loop)
training_config = SFTConfig(
output_dir="models/llama-qlora",
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True,
logging_steps=10,
save_strategy="steps",
save_steps=200,
max_seq_length=512,
dataset_text_field="text",
report_to="none",
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_config,
train_dataset=dataset,
)
trainer.train()
# 4. Save adapter
model.save_pretrained("models/llama-qlora/final")
print("Training complete! Adapter saved.")
if __name__ == "__main__":
main()
Running the Project
# Install dependencies
pip install -r requirements.txt
# Fine-tune with QLoRA
python examples/finetune_llama.py
# Merge adapter into base model
python -c "
from src.adapter_ops import merge_adapter
merge_adapter(
base_model_name='meta-llama/Llama-3.1-8B',
adapter_path='models/llama-qlora/final',
output_path='models/llama-merged',
)
"
# Push adapter to Hub
python -c "
from src.adapter_ops import push_adapter_to_hub
push_adapter_to_hub(
adapter_path='models/llama-qlora/final',
repo_id='your-username/llama-qlora-adapter',
)
"
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| LoRA | Low-rank decomposition of weight updates | Train 0.1% of parameters instead of 100% |
| QLoRA | 4-bit quantized base + LoRA | Fine-tune 8B models on consumer GPUs |
| PEFT | Library for parameter-efficient methods | Unified API for LoRA, Prefix, Prompt tuning |
| NF4 | NormalFloat 4-bit quantization | Optimal for normally-distributed weights |
| safetensors | Secure tensor serialization format | Safe, fast, memory-mapped model loading |
| Adapter Merging | Fold LoRA into base weights | Full speed inference after training |
| SFTTrainer | Supervised fine-tuning trainer from TRL | Simplifies instruction fine-tuning |
Next Steps
- Model Evaluation & Benchmarks — Evaluate your fine-tuned model
- Preference Alignment with TRL — Align models with human preferences