HuggingFace Ecosystem · Intermediate
Fine-Tuning with PEFT
LoRA, QLoRA, and adapter methods for efficient fine-tuning with the PEFT library
TL;DR
PEFT (Parameter-Efficient Fine-Tuning) lets you fine-tune large models by updating only 0.1-1% of parameters. Learn LoRA, QLoRA (4-bit), and Prefix Tuning using the peft library, with bitsandbytes for quantization and safetensors for safe weight storage.
Fine-tune large language models efficiently using the HuggingFace peft library, covering LoRA, QLoRA, Prefix Tuning, and adapter management.
What You'll Learn
- LoRA: Low-Rank Adaptation for efficient fine-tuning
- QLoRA: 4-bit quantization + LoRA for minimal VRAM
- Prefix Tuning and Prompt Tuning methods
- BitsAndBytes 4-bit and 8-bit quantization
- Adapter merging and saving with safetensors
- Publishing adapters to HuggingFace Hub
Tech Stack
| Component | Technology |
|---|---|
| PEFT Methods | peft |
| Base Models | transformers |
| Quantization | bitsandbytes |
| Storage | safetensors |
| Hub | huggingface_hub |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ PEFT FINE-TUNING │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ FULL FINE-TUNING (for comparison) │
│ ┌────────────────────────────────────────┐ │
│ │ Update ALL 7B parameters │ VRAM: ~28 GB (fp16) │
│ │ ████████████████████████████████████████│ Storage: 14 GB per checkpoint │
│ └────────────────────────────────────────┘ │
│ │
│ LoRA (Low-Rank Adaptation) │
│ ┌────────────────────────────────────────┐ │
│ │ Freeze base model (░░░░░░░░░░░░░░░░░) │ VRAM: ~10 GB (fp16) │
│ │ Train only LoRA adapters (██) 0.1-1% │ Storage: 5-50 MB per adapter │
│ └────────────────────────────────────────┘ │
│ │
│ QLoRA (Quantized LoRA) │
│ ┌────────────────────────────────────────┐ │
│ │ 4-bit quantized base (▒▒▒▒▒▒▒▒▒▒▒▒▒) │ VRAM: ~6 GB (4-bit) │
│ │ Train LoRA adapters (██) in fp16 │ Storage: 5-50 MB per adapter │
│ └────────────────────────────────────────┘ │
│ │
│ LoRA Math: W' = W + BA where B is [d, r] and A is [r, d] │
│ Instead of updating d×d matrix, learn two small r×d matrices │
│       With r=16 and d=4096: 131K params instead of 16M (128x reduction)      │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
fine-tuning-peft/
├── src/
│ ├── __init__.py
│ ├── lora.py # LoRA fine-tuning
│ ├── qlora.py # QLoRA (4-bit) fine-tuning
│ ├── prefix_tuning.py # Prefix and prompt tuning
│ ├── adapter_ops.py # Merge, save, push adapters
│ └── data_prep.py # Dataset preparation for fine-tuning
├── configs/
│ └── training_config.yaml
├── examples/
│ └── finetune_llama.py
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
peft>=0.11.0
transformers>=4.40.0
bitsandbytes>=0.43.0
safetensors>=0.4.0
datasets>=2.19.0
huggingface_hub>=0.23.0
torch>=2.0.0
trl>=0.9.0
accelerate>=0.30.0

Step 2: LoRA Fine-Tuning
"""LoRA fine-tuning with the PEFT library."""
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
)
from peft import (
LoraConfig,
get_peft_model,
TaskType,
PeftModel,
)
from datasets import Dataset
def setup_lora(
model_name: str = "meta-llama/Llama-3.1-8B",
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.05,
target_modules: list[str] | None = None,
):
"""
Set up a model with LoRA adapters.
Args:
model_name: Base model from HuggingFace Hub
lora_r: LoRA rank — lower = fewer params, less capacity
lora_alpha: LoRA scaling factor — controls adapter strength
lora_dropout: Dropout on LoRA layers for regularization
target_modules: Which layers to add LoRA to
"""
# Load base model
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Default target modules for transformer models
if target_modules is None:
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
]
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=lora_r,
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
target_modules=target_modules,
bias="none",
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# Example output: trainable params: 6,553,600 || all params: 8,030,261,248
# || trainable%: 0.0816
return model, tokenizer
def train_lora(
    model,
    tokenizer,
    train_dataset: Dataset,
    eval_dataset: Dataset | None = None,
    output_dir: str = "models/lora-adapter",
    epochs: int = 3,
    batch_size: int = 4,
    learning_rate: float = 2e-4,
    max_length: int = 512,
):
    """Train the LoRA adapter."""
    # Tokenize dataset
    def tokenize(example):
        result = tokenizer(
            example["text"],
            truncation=True,
            max_length=max_length,
            padding="max_length",
        )
        result["labels"] = result["input_ids"].copy()
        return result

    train_tokenized = train_dataset.map(tokenize, remove_columns=["text"])
    eval_tokenized = eval_dataset.map(tokenize, remove_columns=["text"]) if eval_dataset else None

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=4,
        learning_rate=learning_rate,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        fp16=True,
        logging_steps=10,
        eval_strategy="steps" if eval_tokenized else "no",
        eval_steps=100,
        save_strategy="steps",
        save_steps=200,
        save_total_limit=3,
        report_to="none",
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_tokenized,
        eval_dataset=eval_tokenized,
    )
    trainer.train()

    # Save the adapter (NOT the full model — just the LoRA weights)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return trainer

LoRA Hyperparameter Guide:
┌─────────────────────────────────────────────────────────────────┐
│ LoRA CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ rank (r): Controls adapter capacity │
│ ┌──────────┬────────────┬─────────────────────────────┐ │
│ │ r = 4 │ 0.02% params│ Minimal adaptation │ │
│ │ r = 16 │ 0.08% params│ Good balance (recommended) │ │
│ │ r = 64 │ 0.3% params │ High capacity, near full FT │ │
│ │ r = 256 │ 1.2% params │ Diminishing returns │ │
│ └──────────┴────────────┴─────────────────────────────┘ │
│ │
│ alpha: Scaling factor │
│ Effective scaling = alpha / r │
│ Rule of thumb: alpha = 2 × r (so scaling = 2.0) │
│ │
│ target_modules: Which layers to adapt │
│ ┌──────────────────┬──────────────────────────────────┐ │
│ │ Attention only │ q_proj, v_proj (minimum viable) │ │
│ │ All attention │ q,k,v,o_proj (recommended) │ │
│ │ + MLP layers │ + gate,up,down_proj (maximum) │ │
│ └──────────────────┴──────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

| Parameter | Recommended | Effect of Increasing |
|---|---|---|
| r | 16 | More adapter capacity, more VRAM |
| alpha | 32 (2×r) | Stronger adapter influence |
| dropout | 0.05 | More regularization |
| target_modules | All attention + MLP | More layers adapted, more VRAM |
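The rank figures above can be sanity-checked with quick arithmetic. A minimal sketch (pure Python) for a single square d×d projection, as in the LoRA math from the architecture diagram:

```python
def lora_param_count(d: int, r: int) -> tuple[int, int, float]:
    """Compare full-update vs LoRA parameter counts for one d x d weight.

    LoRA replaces the d x d update with two factors: B is [d, r] and
    A is [r, d], so the adapter trains 2 * r * d parameters instead of d * d.
    """
    full = d * d        # parameters updated by full fine-tuning
    lora = 2 * r * d    # parameters in the B and A factors
    return full, lora, full / lora

full, lora, reduction = lora_param_count(d=4096, r=16)
print(f"full={full:,}  lora={lora:,}  reduction={reduction:.0f}x")
# full=16,777,216  lora=131,072  reduction=128x
```

A real model adapts many such matrices (one per target module per layer), so total adapter size scales with the number of layers and target modules, but the per-matrix ratio stays the same.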
Step 3: QLoRA (4-bit)
"""QLoRA: LoRA with 4-bit quantization for minimal VRAM usage."""
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training,
TaskType,
)
def setup_qlora(
model_name: str = "meta-llama/Llama-3.1-8B",
lora_r: int = 16,
lora_alpha: int = 32,
):
"""
Set up a model with QLoRA (4-bit quantized base + LoRA adapters).
QLoRA reduces VRAM from ~28GB (full) to ~6GB (4-bit + LoRA).
The base model is frozen in 4-bit, while LoRA adapters train in fp16.
"""
# BitsAndBytes 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 — better than int4
bnb_4bit_compute_dtype=torch.float16, # Compute in fp16 for LoRA
bnb_4bit_use_double_quant=True, # Quantize the quantization constants
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Prepare model for k-bit training
# (handles gradient checkpointing and layer norm casting)
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=lora_r,
lora_alpha=lora_alpha,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
return model, tokenizerQLoRA Memory Savings:
┌─────────────────────────────────────────────────────────────────┐
│ MEMORY COMPARISON (Llama 3.1 8B) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Full Fine-tuning (fp16): │
│ ├── Model weights: 16 GB │
│    ├── Optimizer states:    32 GB  (Adam: momentum + variance states)       │
│ ├── Gradients: 16 GB │
│ └── Total: ~64 GB ──► Needs A100 80GB │
│ │
│ LoRA (fp16 base): │
│ ├── Model weights: 16 GB (frozen, no optimizer states) │
│ ├── LoRA adapters: 0.05 GB │
│ ├── Optimizer states: 0.1 GB (only for LoRA params) │
│ └── Total: ~18 GB ──► Needs A100 40GB │
│ │
│ QLoRA (4-bit base): │
│ ├── Model weights: 4 GB (4-bit quantized) │
│ ├── LoRA adapters: 0.05 GB (fp16) │
│ ├── Optimizer states: 0.1 GB │
│ └── Total: ~6 GB ──► Runs on RTX 3090 / Colab T4 │
│ │
│ nf4 (NormalFloat4): Quantization format optimized for │
│ normally-distributed weights. Better than uniform int4. │
│ │
│ Double quantization: Quantizes the quantization constants │
│ themselves, saving ~0.4 bits per parameter. │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 4: Prefix Tuning
"""Prefix Tuning and Prompt Tuning with PEFT."""
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType
def setup_prefix_tuning(
model_name: str = "gpt2",
num_virtual_tokens: int = 20,
):
"""
Set up Prefix Tuning.
Prefix Tuning prepends learnable "virtual tokens" to the
key and value states of every attention layer.
Unlike LoRA (which modifies weight matrices), Prefix Tuning
adds new information to the attention context.
"""
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
config = PrefixTuningConfig(
task_type=TaskType.CAUSAL_LM,
num_virtual_tokens=num_virtual_tokens,
# prefix_projection=True uses an MLP to generate the prefix
# (more parameters but often better quality)
prefix_projection=True,
encoder_hidden_size=512,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
return model, tokenizerPEFT Method Comparison:
| Method | What It Modifies | Trainable Params | Best For |
|---|---|---|---|
| LoRA | Weight matrices (W + BA) | 0.1-1% | General fine-tuning |
| QLoRA | Same as LoRA + 4-bit base | 0.1-1% | Low-VRAM fine-tuning |
| Prefix Tuning | Attention key/value context | ~0.1% | Task-specific prompts |
| Prompt Tuning | Input embeddings only | ~0.01% | Simple task adaptation |
| IA3 | Scale activations | ~0.01% | Few-shot adaptation |
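The table lists Prompt Tuning, but the code above only shows Prefix Tuning. For completeness, a hedged configuration sketch using peft's PromptTuningConfig; the init text is illustrative, and the model is applied exactly as in setup_prefix_tuning:

```python
from peft import PromptTuningConfig, PromptTuningInit, TaskType

# Prompt Tuning trains only num_virtual_tokens embedding vectors that are
# prepended to the input embeddings; the rest of the model stays frozen.
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,
    # Initialize the soft prompt from real token embeddings; this usually
    # converges faster than random initialization
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the sentiment of this review:",
    tokenizer_name_or_path="gpt2",
)
# Then: model = get_peft_model(model, config), as in setup_prefix_tuning
```

With ~0.01% trainable parameters, this is the cheapest method in the table, at the cost of less adaptation capacity than LoRA.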
Step 5: Adapter Operations
"""Adapter management: merge, save, load, and push to Hub."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
from safetensors.torch import save_file, load_file
def load_adapter(
base_model_name: str,
adapter_path: str,
) -> tuple:
"""Load a base model and apply a saved adapter."""
# Load base model
model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# Load and apply adapter
model = PeftModel.from_pretrained(model, adapter_path)
return model, tokenizer
def merge_adapter(
base_model_name: str,
adapter_path: str,
output_path: str,
):
"""
Merge LoRA adapter into the base model.
After merging, the model runs at full speed (no adapter overhead)
but you lose the ability to swap adapters.
"""
model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_path)
# Merge adapter weights into base model
model = model.merge_and_unload()
# Save merged model
model.save_pretrained(output_path, safe_serialization=True)
tokenizer.save_pretrained(output_path)
print(f"Merged model saved to {output_path}")
def push_adapter_to_hub(
adapter_path: str,
repo_id: str,
private: bool = False,
):
"""Push a trained adapter to HuggingFace Hub."""
from peft import PeftModel, PeftConfig
from huggingface_hub import HfApi
config = PeftConfig.from_pretrained(adapter_path)
api = HfApi()
api.upload_folder(
folder_path=adapter_path,
repo_id=repo_id,
repo_type="model",
)
print(f"Adapter pushed to https://huggingface.co/{repo_id}")
print(f"Base model: {config.base_model_name_or_path}")
def explain_safetensors():
"""
Why safetensors over pickle (.bin)?
1. Security: No arbitrary code execution (pickle can run malicious code)
2. Speed: Memory-mapped loading (2-5x faster for large models)
3. Size: Same as pickle (no compression difference)
4. Lazy loading: Load specific tensors without loading the full file
transformers uses safetensors by default since v4.35.
"""
# Save tensors
tensors = {
"weight": torch.randn(768, 768),
"bias": torch.randn(768),
}
save_file(tensors, "model.safetensors")
# Load tensors (memory-mapped)
loaded = load_file("model.safetensors")
print(f"Loaded keys: {list(loaded.keys())}")Step 6: Complete Fine-tuning Example
"""Complete example: Fine-tune Llama 3.1 with QLoRA."""
from datasets import load_dataset
from src.qlora import setup_qlora
from trl import SFTTrainer, SFTConfig
def main():
# 1. Load model with QLoRA
model, tokenizer = setup_qlora(
model_name="meta-llama/Llama-3.1-8B",
lora_r=16,
lora_alpha=32,
)
# 2. Load and format dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")
def format_prompt(example):
if example["input"]:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
else:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
return {"text": text}
dataset = dataset.map(format_prompt)
# 3. Train with SFTTrainer (from TRL — simplifies the training loop)
training_config = SFTConfig(
output_dir="models/llama-qlora",
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True,
logging_steps=10,
save_strategy="steps",
save_steps=200,
max_seq_length=512,
dataset_text_field="text",
report_to="none",
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_config,
train_dataset=dataset,
)
trainer.train()
# 4. Save adapter
model.save_pretrained("models/llama-qlora/final")
print("Training complete! Adapter saved.")
if __name__ == "__main__":
main()Running the Project
# Install dependencies
pip install -r requirements.txt
# Fine-tune with QLoRA
python examples/finetune_llama.py
# Merge adapter into base model
python -c "
from src.adapter_ops import merge_adapter
merge_adapter(
base_model_name='meta-llama/Llama-3.1-8B',
adapter_path='models/llama-qlora/final',
output_path='models/llama-merged',
)
"
# Push adapter to Hub
python -c "
from src.adapter_ops import push_adapter_to_hub
push_adapter_to_hub(
adapter_path='models/llama-qlora/final',
repo_id='your-username/llama-qlora-adapter',
)
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| LoRA | Low-rank decomposition of weight updates | Train 0.1% of parameters instead of 100% |
| QLoRA | 4-bit quantized base + LoRA | Fine-tune 8B models on consumer GPUs |
| PEFT | Library for parameter-efficient methods | Unified API for LoRA, Prefix, Prompt tuning |
| NF4 | NormalFloat 4-bit quantization | Optimal for normally-distributed weights |
| safetensors | Secure tensor serialization format | Safe, fast, memory-mapped model loading |
| Adapter Merging | Fold LoRA into base weights | Full speed inference after training |
| SFTTrainer | Supervised fine-tuning trainer from TRL | Simplifies instruction fine-tuning |
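The VRAM figures from the memory comparison in Step 3 can be reproduced with bytes-per-parameter arithmetic. A rough sketch, assuming an 8B-parameter model, fp16 gradients and optimizer states, and ignoring activations and CUDA overhead (so real usage runs a few GB higher):

```python
def training_memory_gb(n_params: float, weight_bytes: float,
                       trainable_frac: float, opt_state_bytes: float = 4.0) -> float:
    """Estimate training memory: weights + gradients + optimizer states.

    Adam keeps two states per trainable parameter (momentum and variance);
    opt_state_bytes = 4.0 means two fp16 states. Gradients are fp16 (2 bytes).
    """
    weights = n_params * weight_bytes
    grads = n_params * trainable_frac * 2  # fp16 gradients, trainable params only
    opt = n_params * trainable_frac * opt_state_bytes
    return (weights + grads + opt) / 1e9

N = 8e9
full = training_memory_gb(N, weight_bytes=2.0, trainable_frac=1.0)    # fp16 full FT
lora = training_memory_gb(N, weight_bytes=2.0, trainable_frac=0.001)  # fp16 base + LoRA
qlora = training_memory_gb(N, weight_bytes=0.5, trainable_frac=0.001) # 4-bit base + LoRA
print(f"full={full:.0f} GB  lora={lora:.0f} GB  qlora={qlora:.0f} GB")
# full=64 GB  lora=16 GB  qlora=4 GB
```

The estimates match the diagram's relative ordering: full fine-tuning is dominated by gradients and optimizer states, while LoRA and QLoRA pay almost nothing beyond the (possibly quantized) frozen weights.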
Next Steps
- Model Evaluation & Benchmarks — Evaluate your fine-tuned model
- Preference Alignment with TRL — Align models with human preferences