LoRA Fine-tuning
Efficient LLM fine-tuning with Low-Rank Adaptation
TL;DR
Fine-tune billion-parameter models by training small adapter matrices (0.06% of params) instead of full weights. QLoRA adds 4-bit quantization to fit 7B models on 8GB GPUs. Key parameters: rank (r=8-64), alpha (2x rank), target modules (attention projections).
Fine-tune large language models efficiently using Low-Rank Adaptation (LoRA) and the PEFT library.
Overview
| Difficulty | Intermediate |
| Time | ~6 hours |
| Prerequisites | PyTorch basics, Transformers library |
| Learning Outcomes | LoRA theory, PEFT integration, QLoRA, adapter merging |
Introduction
Fine-tuning large language models traditionally requires updating billions of parameters, demanding expensive GPU hardware and risking catastrophic forgetting. LoRA solves this by training only small adapter matrices while keeping base model weights frozen.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Traditional vs LoRA Fine-tuning │
├────────────────────────────────┬────────────────────────────────────────────┤
│ Traditional Fine-tuning │ LoRA Fine-tuning │
├────────────────────────────────┼────────────────────────────────────────────┤
│ │ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ Base Model │ │ │ Base Model │ │
│ │ 7B params │ │ │ 7B frozen │ │
│ └──────┬───────┘ │ └──────┬───────┘ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ Update All │ │ │Train Adapters│ │
│ │ 7B params │ │ │ ~4M params │ │
│ └──────┬───────┘ │ └──────┬───────┘ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ High Memory │ │ │ Low Memory │ │
│ │ 28GB+ VRAM │ │ │ 8GB VRAM │ │
│ └──────────────┘ │ └──────────────┘ │
│ │ │
└────────────────────────────────┴────────────────────────────────────────────┘

Understanding LoRA
The Low-Rank Hypothesis
Neural network weight updates during fine-tuning have low intrinsic rank. Instead of updating a full weight matrix W, we can express the update ΔW as the product of two much smaller matrices.
┌─────────────────────────────────────────────────────────────────────────────┐
│ LoRA Low-Rank Decomposition │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Original Update LoRA Decomposition │
│ ─────────────── ────────────────── │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │Weight Matrix│ │ Matrix A │──┐ │
│ │ d × k │ │ r × k │ │ │
│ └──────┬──────┘ └─────────────┘ │ │
│ │ │ │
│ ▼ ┌─────────────┐ │ │
│ ┌─────────────┐ │ Matrix B │──┼──►┌───────────┐ │
│ │ ΔW │ ─ ─ ─ ─ ─ ≈ ─ ─ ─► │ d × r │ │ │ B × A │ │
│ │ d × k │ └─────────────┘──┘ │ d × k │ │
│ └─────────────┘ └───────────┘ │
│ │
│ Full update Low-rank approximation │
│ (d × k params) (d×r + r×k params) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

For a weight matrix of dimensions (d × k), the update ΔW = B × A is decomposed as:
- Matrix A: (r × k) - Down-projection
- Matrix B: (d × r) - Up-projection
- r (rank): Typically 4-64, much smaller than d and k
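A few lines of NumPy make the bookkeeping concrete (a standalone sketch, using Llama-2's hidden size of 4096 for illustration):

```python
import numpy as np

d, k, r = 4096, 4096, 8  # one attention projection, rank 8

# LoRA trains two small matrices instead of the full d x k update.
B = np.zeros((d, r))           # up-projection, initialized to zero
A = np.random.randn(r, k)      # down-projection, random init

delta_W = B @ A                # reconstructed update, shape (d, k)
assert delta_W.shape == (d, k)

full_params = d * k            # 16,777,216
lora_params = d * r + r * k    # 65,536
print(f"reduction: {full_params // lora_params}x")  # prints "reduction: 256x"
```

Initializing B to zero means ΔW starts at zero, so training begins from the unmodified base model.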
Parameter Reduction
For a 7B parameter model targeting attention layers:
- Full fine-tuning: ~7 billion parameters
- LoRA (r=8): ~4 million parameters (0.06% of original)
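The ~4 million figure can be reproduced with back-of-the-envelope arithmetic, assuming Llama-2-7B's 32 layers and hidden size 4096 with LoRA applied to the query and value projections only (as in the original LoRA paper):

```python
layers, hidden, r = 32, 4096, 8   # Llama-2-7B dimensions
modules_per_layer = 2             # q_proj and v_proj

params_per_module = hidden * r + r * hidden        # one A plus one B matrix
lora_params = layers * modules_per_layer * params_per_module
full_params = 7_000_000_000

print(f"{lora_params:,} trainable")                  # 4,194,304 trainable
print(f"{100 * lora_params / full_params:.2f}% of the base model")  # 0.06%
```

Targeting all seven projection modules (as in the LoRASettings below) roughly triples this count, but it remains well under 1% of the base model.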
Project Setup
# Create project directory
mkdir lora-finetuning && cd lora-finetuning
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install torch transformers datasets peft accelerate bitsandbytes
pip install trl wandb evaluate scikit-learn

Project Structure
lora-finetuning/
├── config/
│ └── lora_config.py # LoRA configuration
├── data/
│ └── dataset.py # Dataset preparation
├── training/
│ ├── trainer.py # Training loop
│ └── callbacks.py # Custom callbacks
├── evaluation/
│ └── evaluate.py # Model evaluation
├── inference/
│ └── generate.py # Inference utilities
├── scripts/
│ ├── train.py # Training script
│ └── merge.py # Adapter merging
└── requirements.txt

Configuration
LoRA Configuration
# config/lora_config.py
from dataclasses import dataclass, field
from typing import Optional, List
from peft import LoraConfig, TaskType
@dataclass
class ModelConfig:
"""Base model configuration."""
model_name: str = "meta-llama/Llama-2-7b-hf"
use_4bit: bool = True
use_8bit: bool = False
trust_remote_code: bool = False
use_flash_attention: bool = True
@dataclass
class LoRASettings:
"""LoRA hyperparameters."""
# Core LoRA parameters
r: int = 16 # Rank of adaptation
lora_alpha: int = 32 # Scaling factor
lora_dropout: float = 0.05 # Dropout probability
# Target modules for different architectures
target_modules: List[str] = field(default_factory=lambda: [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj" # MLP
])
# Advanced settings
bias: str = "none" # none, all, lora_only
task_type: TaskType = TaskType.CAUSAL_LM
modules_to_save: Optional[List[str]] = None # Full fine-tuning layers
def to_peft_config(self) -> LoraConfig:
"""Convert to PEFT LoraConfig."""
return LoraConfig(
r=self.r,
lora_alpha=self.lora_alpha,
lora_dropout=self.lora_dropout,
target_modules=self.target_modules,
bias=self.bias,
task_type=self.task_type,
modules_to_save=self.modules_to_save,
)
@dataclass
class TrainingConfig:
"""Training hyperparameters."""
output_dir: str = "./outputs"
num_train_epochs: int = 3
per_device_train_batch_size: int = 4
per_device_eval_batch_size: int = 4
gradient_accumulation_steps: int = 4
# Optimizer settings
learning_rate: float = 2e-4
weight_decay: float = 0.01
warmup_ratio: float = 0.03
lr_scheduler_type: str = "cosine"
# Memory optimization
gradient_checkpointing: bool = True
max_grad_norm: float = 0.3
# Logging
logging_steps: int = 10
eval_steps: int = 100
save_steps: int = 100
# Mixed precision
fp16: bool = False
bf16: bool = True # Use bf16 on Ampere+ GPUs
@dataclass
class DataConfig:
"""Dataset configuration."""
dataset_name: str = "databricks/databricks-dolly-15k"
max_seq_length: int = 2048
validation_split: float = 0.1
# Prompt template
instruction_template: str = "### Instruction:\n{instruction}\n\n"
input_template: str = "### Input:\n{input}\n\n"
    response_template: str = "### Response:\n{output}"

Understanding LoRA Parameters
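The relationship between r and lora_alpha is easiest to see in the forward pass: the adapter output is scaled by alpha / r, so keeping alpha at 2x the rank holds the update strength roughly constant as r changes. A NumPy sketch of the forward pass (toy dimensions, not the PEFT implementation):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d, k, r, alpha = 16, 16, 4, 8           # toy sizes; alpha = 2 * r
x = rng.standard_normal((1, k))
W = rng.standard_normal((d, k))         # frozen base weight
A = rng.standard_normal((r, k))         # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, starts at zero ...

# ... so at initialization the LoRA output equals the frozen base output.
np.testing.assert_allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)
```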
┌─────────────────────────────────────────────────────────────────────────────┐
│ LoRA Key Parameters │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Key Parameters │ │
│ ├───────────────────┬───────────────────┬─────────────────────────────┤ │
│ │ r: Rank │ alpha: Scaling │ dropout: Regularization │ │
│ │ Higher = More │ Controls adapt- │ Prevents overfitting │ │
│ │ capacity │ ation strength │ │ │
│ └─────────┬─────────┴─────────┬─────────┴─────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌───────────────────┐ │
│ │ Rank Guidelines: │ │ Alpha/r ratio: │ │
│ │ • r=8: Simple │ │ • Typically 2x │ │
│ │ • r=16: General │ │ • alpha=32 for │ │
│ │ • r=64: Complex │ │ r=16 │ │
│ └─────────────────────┘ └───────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Model Loading with Quantization
QLoRA: 4-bit Quantization
QLoRA enables fine-tuning on consumer GPUs by quantizing the base model to 4-bit precision while keeping LoRA adapters in higher precision.
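The intuition behind quantization can be shown with a toy absmax scheme; note this is a uniform-grid sketch only, while NF4 uses 16 quantization levels spaced for normally distributed weights:

```python
def quantize_4bit(weights):
    """Uniform absmax 4-bit quantization: map floats to 16 integer levels."""
    scale = max(abs(w) for w in weights) / 7   # use the symmetric range -7..7
    q = [round(w / scale) for w in weights]    # each value now fits in 4 bits
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [v * scale for v in q]

w = [0.12, -0.50, 0.33, 0.07]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert max_err <= s / 2   # error is bounded by half a quantization step
```

Double quantization (bnb_4bit_use_double_quant below) additionally quantizes the per-block scale factors themselves, saving a further fraction of a bit per parameter.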
# training/model_loader.py
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
from peft import (
get_peft_model,
prepare_model_for_kbit_training,
LoraConfig,
)
from typing import Tuple, Optional
from config.lora_config import ModelConfig, LoRASettings
class ModelLoader:
"""Load and prepare models for LoRA training."""
def __init__(
self,
model_config: ModelConfig,
lora_settings: LoRASettings,
):
self.model_config = model_config
self.lora_settings = lora_settings
self.device = "cuda" if torch.cuda.is_available() else "cpu"
def get_quantization_config(self) -> Optional[BitsAndBytesConfig]:
"""Create quantization configuration."""
if self.model_config.use_4bit:
return BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # Normal Float 4-bit
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # Nested quantization
)
elif self.model_config.use_8bit:
return BitsAndBytesConfig(
load_in_8bit=True,
)
return None
def load_tokenizer(self) -> AutoTokenizer:
"""Load and configure tokenizer."""
tokenizer = AutoTokenizer.from_pretrained(
self.model_config.model_name,
trust_remote_code=self.model_config.trust_remote_code,
        padding_side="right",  # Right padding is standard for causal LM training
)
# Set padding token if not present
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
return tokenizer
def load_base_model(self) -> AutoModelForCausalLM:
"""Load base model with optional quantization."""
quant_config = self.get_quantization_config()
model_kwargs = {
"pretrained_model_name_or_path": self.model_config.model_name,
"trust_remote_code": self.model_config.trust_remote_code,
"torch_dtype": torch.bfloat16,
"device_map": "auto",
}
if quant_config:
model_kwargs["quantization_config"] = quant_config
if self.model_config.use_flash_attention:
model_kwargs["attn_implementation"] = "flash_attention_2"
model = AutoModelForCausalLM.from_pretrained(**model_kwargs)
return model
def prepare_for_training(
self,
model: AutoModelForCausalLM,
) -> AutoModelForCausalLM:
"""Prepare model for k-bit training."""
if self.model_config.use_4bit or self.model_config.use_8bit:
model = prepare_model_for_kbit_training(
model,
use_gradient_checkpointing=True,
)
return model
def apply_lora(
self,
model: AutoModelForCausalLM,
) -> AutoModelForCausalLM:
"""Apply LoRA adapters to model."""
lora_config = self.lora_settings.to_peft_config()
model = get_peft_model(model, lora_config)
return model
def load(self) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
"""Load complete model with LoRA adapters."""
print(f"Loading model: {self.model_config.model_name}")
# Load tokenizer
tokenizer = self.load_tokenizer()
# Load and prepare model
model = self.load_base_model()
model = self.prepare_for_training(model)
model = self.apply_lora(model)
# Print trainable parameters
self._print_trainable_params(model)
return model, tokenizer
def _print_trainable_params(self, model: AutoModelForCausalLM) -> None:
"""Print number of trainable parameters."""
trainable_params = sum(
p.numel() for p in model.parameters() if p.requires_grad
)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,}")
print(f"Total parameters: {total_params:,}")
        print(f"Trainable %: {100 * trainable_params / total_params:.4f}%")

Dataset Preparation
Instruction Dataset Format
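The template below turns each record into a single Alpaca-style training string. For reference, here is a standalone mirror of that formatting logic with a hypothetical example (the helper name is illustrative, not part of the project code):

```python
def format_example(instruction, context, response,
                   system="Below is an instruction that describes a task. "
                          "Write a response that appropriately completes the request."):
    """Build an Alpaca-style training string (mirrors PromptTemplate.format)."""
    parts = [f"### System:\n{system}\n\n", f"### Instruction:\n{instruction}\n\n"]
    if context:  # the Input block is omitted when there is no context
        parts.append(f"### Input:\n{context}\n\n")
    parts.append(f"### Response:\n{response}")
    return "".join(parts)

text = format_example("Translate to French.", "", "Bonjour")
assert text.endswith("### Response:\nBonjour")
assert "### Input:" not in text
```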
# data/dataset.py
from datasets import load_dataset, Dataset
from transformers import PreTrainedTokenizer
from typing import Dict, Any, Optional
from dataclasses import dataclass
@dataclass
class PromptTemplate:
"""Template for formatting instructions."""
instruction_key: str = "instruction"
input_key: str = "context"
output_key: str = "response"
system_prompt: str = (
"Below is an instruction that describes a task. "
"Write a response that appropriately completes the request."
)
def format(self, example: Dict[str, Any]) -> str:
"""Format a single example."""
parts = [f"### System:\n{self.system_prompt}\n\n"]
# Add instruction
instruction = example.get(self.instruction_key, "")
parts.append(f"### Instruction:\n{instruction}\n\n")
# Add optional input/context
input_text = example.get(self.input_key, "")
if input_text:
parts.append(f"### Input:\n{input_text}\n\n")
# Add response
output = example.get(self.output_key, "")
parts.append(f"### Response:\n{output}")
return "".join(parts)
def format_for_inference(self, instruction: str, input_text: str = "") -> str:
"""Format prompt for inference (no response)."""
parts = [f"### System:\n{self.system_prompt}\n\n"]
parts.append(f"### Instruction:\n{instruction}\n\n")
if input_text:
parts.append(f"### Input:\n{input_text}\n\n")
parts.append("### Response:\n")
return "".join(parts)
class DatasetPreparation:
"""Prepare datasets for LoRA fine-tuning."""
def __init__(
self,
tokenizer: PreTrainedTokenizer,
max_length: int = 2048,
template: Optional[PromptTemplate] = None,
):
self.tokenizer = tokenizer
self.max_length = max_length
self.template = template or PromptTemplate()
def load_dataset(
self,
dataset_name: str,
split: str = "train",
) -> Dataset:
"""Load dataset from HuggingFace Hub."""
dataset = load_dataset(dataset_name, split=split)
return dataset
def tokenize_function(self, examples: Dict[str, Any]) -> Dict[str, Any]:
"""Tokenize a batch of examples."""
texts = []
for i in range(len(examples[self.template.instruction_key])):
example = {
self.template.instruction_key: examples[self.template.instruction_key][i],
self.template.input_key: examples.get(self.template.input_key, [""])[i] if self.template.input_key in examples else "",
self.template.output_key: examples[self.template.output_key][i],
}
texts.append(self.template.format(example))
# Tokenize
tokenized = self.tokenizer(
texts,
truncation=True,
max_length=self.max_length,
padding="max_length",
return_tensors=None,
)
        # Labels mirror input_ids for causal LM; mask padded positions with
        # -100 so they are excluded from the loss
        tokenized["labels"] = [
            [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
            for ids, attn in zip(tokenized["input_ids"], tokenized["attention_mask"])
        ]
return tokenized
def prepare(
self,
dataset_name: str,
validation_split: float = 0.1,
) -> Dict[str, Dataset]:
"""Prepare train and validation datasets."""
# Load raw dataset
dataset = self.load_dataset(dataset_name)
# Split into train/validation
split_dataset = dataset.train_test_split(test_size=validation_split)
# Tokenize
tokenized_train = split_dataset["train"].map(
self.tokenize_function,
batched=True,
remove_columns=split_dataset["train"].column_names,
desc="Tokenizing training set",
)
tokenized_val = split_dataset["test"].map(
self.tokenize_function,
batched=True,
remove_columns=split_dataset["test"].column_names,
desc="Tokenizing validation set",
)
return {
"train": tokenized_train,
"validation": tokenized_val,
}
class ChatDatasetPreparation(DatasetPreparation):
"""Prepare chat/conversation datasets."""
def __init__(
self,
tokenizer: PreTrainedTokenizer,
max_length: int = 2048,
):
super().__init__(tokenizer, max_length)
def format_chat(self, messages: list) -> str:
"""Format chat messages using tokenizer's chat template."""
return self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
def tokenize_chat(self, examples: Dict[str, Any]) -> Dict[str, Any]:
"""Tokenize chat conversations."""
texts = []
for messages in examples["messages"]:
text = self.format_chat(messages)
texts.append(text)
tokenized = self.tokenizer(
texts,
truncation=True,
max_length=self.max_length,
padding="max_length",
return_tensors=None,
)
tokenized["labels"] = tokenized["input_ids"].copy()
        return tokenized

Training Implementation
Custom Trainer with LoRA
# training/trainer.py
import torch
from transformers import (
Trainer,
TrainingArguments,
DataCollatorForLanguageModeling,
)
from transformers.trainer_callback import TrainerCallback
from peft import PeftModel
from typing import Optional, Dict, Any
import wandb
from config.lora_config import TrainingConfig
class LoRATrainer:
"""Trainer for LoRA fine-tuning."""
def __init__(
self,
model: PeftModel,
tokenizer,
train_dataset,
eval_dataset,
training_config: TrainingConfig,
):
self.model = model
self.tokenizer = tokenizer
self.train_dataset = train_dataset
self.eval_dataset = eval_dataset
self.config = training_config
def get_training_args(self) -> TrainingArguments:
"""Create training arguments."""
return TrainingArguments(
output_dir=self.config.output_dir,
num_train_epochs=self.config.num_train_epochs,
per_device_train_batch_size=self.config.per_device_train_batch_size,
per_device_eval_batch_size=self.config.per_device_eval_batch_size,
gradient_accumulation_steps=self.config.gradient_accumulation_steps,
# Optimizer
learning_rate=self.config.learning_rate,
weight_decay=self.config.weight_decay,
warmup_ratio=self.config.warmup_ratio,
lr_scheduler_type=self.config.lr_scheduler_type,
optim="paged_adamw_32bit", # Memory-efficient optimizer
# Memory optimization
gradient_checkpointing=self.config.gradient_checkpointing,
max_grad_norm=self.config.max_grad_norm,
# Precision
fp16=self.config.fp16,
bf16=self.config.bf16,
# Logging
logging_steps=self.config.logging_steps,
eval_strategy="steps",
eval_steps=self.config.eval_steps,
save_strategy="steps",
save_steps=self.config.save_steps,
save_total_limit=3,
load_best_model_at_end=True,
# W&B logging
report_to="wandb",
run_name=f"lora-{self.config.output_dir.split('/')[-1]}",
# Other
remove_unused_columns=False,
dataloader_pin_memory=True,
dataloader_num_workers=4,
)
def get_data_collator(self):
"""Create data collator for language modeling."""
return DataCollatorForLanguageModeling(
tokenizer=self.tokenizer,
mlm=False, # Causal LM, not masked LM
)
def train(self) -> Dict[str, Any]:
"""Run training."""
training_args = self.get_training_args()
data_collator = self.get_data_collator()
# Create trainer
trainer = Trainer(
model=self.model,
args=training_args,
train_dataset=self.train_dataset,
eval_dataset=self.eval_dataset,
data_collator=data_collator,
callbacks=[
LoRAProgressCallback(),
EarlyStoppingCallback(patience=3),
],
)
# Train
train_result = trainer.train()
# Save final model
trainer.save_model(f"{self.config.output_dir}/final")
return {
"train_loss": train_result.training_loss,
"train_samples": len(self.train_dataset),
"train_steps": train_result.global_step,
}
class LoRAProgressCallback(TrainerCallback):
"""Custom callback for LoRA training progress."""
def on_log(self, args, state, control, logs=None, **kwargs):
"""Log training progress."""
if logs:
# Log gradient norms for LoRA layers
if "grad_norm" in logs:
print(f"Step {state.global_step}: grad_norm = {logs['grad_norm']:.4f}")
class EarlyStoppingCallback(TrainerCallback):
"""Early stopping based on validation loss."""
def __init__(self, patience: int = 3, min_delta: float = 0.01):
self.patience = patience
self.min_delta = min_delta
self.best_loss = float("inf")
self.counter = 0
def on_evaluate(self, args, state, control, metrics=None, **kwargs):
"""Check for improvement after evaluation."""
if metrics:
eval_loss = metrics.get("eval_loss", float("inf"))
if eval_loss < self.best_loss - self.min_delta:
self.best_loss = eval_loss
self.counter = 0
else:
self.counter += 1
if self.counter >= self.patience:
print(f"Early stopping: no improvement for {self.patience} evaluations")
                    control.should_training_stop = True

SFT Trainer Alternative
For instruction fine-tuning, the TRL library provides a specialized trainer:
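The packing=True option used below concatenates tokenized examples and slices the stream into fixed-length sequences, so no training step wastes compute on padding. A minimal sketch of the idea (toy token lists, not TRL's actual implementation):

```python
def pack_sequences(tokenized_examples, seq_length):
    """Concatenate examples into one stream, then cut into full-length chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
    return [
        stream[i : i + seq_length]
        for i in range(0, len(stream) - seq_length + 1, seq_length)
    ]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(examples, seq_length=4)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8]]  (the trailing 9 is dropped)
```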
# training/sft_trainer.py
from trl import SFTTrainer, SFTConfig
from peft import PeftModel
from typing import Optional
class SFTLoRATrainer:
"""SFT Trainer for instruction tuning with LoRA."""
def __init__(
self,
model: PeftModel,
tokenizer,
train_dataset,
eval_dataset,
max_seq_length: int = 2048,
output_dir: str = "./outputs",
):
self.model = model
self.tokenizer = tokenizer
self.train_dataset = train_dataset
self.eval_dataset = eval_dataset
self.max_seq_length = max_seq_length
self.output_dir = output_dir
def train(self):
"""Run SFT training."""
sft_config = SFTConfig(
output_dir=self.output_dir,
max_seq_length=self.max_seq_length,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
save_strategy="steps",
save_steps=100,
eval_strategy="steps",
eval_steps=100,
bf16=True,
gradient_checkpointing=True,
optim="paged_adamw_32bit",
packing=True, # Pack multiple samples into one sequence
)
trainer = SFTTrainer(
model=self.model,
args=sft_config,
train_dataset=self.train_dataset,
eval_dataset=self.eval_dataset,
tokenizer=self.tokenizer,
)
trainer.train()
        return trainer

Evaluation
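Perplexity is the exponential of the average per-token cross-entropy loss. The token-weighted averaging done in evaluate_perplexity below reduces to this arithmetic (standalone sketch with made-up batch losses):

```python
import math

def perplexity(batch_losses, batch_token_counts):
    """Token-weighted average loss across batches, then exponentiate."""
    total_loss = sum(l * n for l, n in zip(batch_losses, batch_token_counts))
    total_tokens = sum(batch_token_counts)
    return math.exp(total_loss / total_tokens)

# Two batches of different sizes: the larger batch dominates the average.
print(round(perplexity([2.0, 3.0], [300, 100]), 2))  # prints 9.49, i.e. exp(2.25)
```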
# evaluation/evaluate.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from typing import List, Dict, Any
import evaluate
from tqdm import tqdm
class LoRAEvaluator:
"""Evaluate LoRA fine-tuned models."""
def __init__(
self,
model: PeftModel,
tokenizer: AutoTokenizer,
device: str = "cuda",
):
self.model = model
self.tokenizer = tokenizer
self.device = device
# Load metrics
self.rouge = evaluate.load("rouge")
self.bleu = evaluate.load("bleu")
@torch.no_grad()
def generate(
self,
prompt: str,
max_new_tokens: int = 256,
temperature: float = 0.7,
top_p: float = 0.9,
) -> str:
"""Generate response for a prompt."""
inputs = self.tokenizer(
prompt,
return_tensors="pt",
truncation=True,
max_length=2048,
).to(self.device)
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=True,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
)
response = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
)
return response.strip()
def evaluate_generation(
self,
test_samples: List[Dict[str, str]],
prompt_template: callable,
) -> Dict[str, float]:
"""Evaluate generation quality."""
predictions = []
references = []
for sample in tqdm(test_samples, desc="Evaluating"):
prompt = prompt_template(sample["instruction"], sample.get("input", ""))
prediction = self.generate(prompt)
predictions.append(prediction)
references.append(sample["output"])
# Calculate ROUGE
rouge_scores = self.rouge.compute(
predictions=predictions,
references=references,
)
# Calculate BLEU
bleu_scores = self.bleu.compute(
predictions=[p.split() for p in predictions],
references=[[r.split()] for r in references],
)
return {
"rouge1": rouge_scores["rouge1"],
"rouge2": rouge_scores["rouge2"],
"rougeL": rouge_scores["rougeL"],
"bleu": bleu_scores["bleu"],
}
def evaluate_perplexity(
self,
eval_dataset,
batch_size: int = 4,
) -> float:
"""Calculate perplexity on evaluation set."""
        self.model.eval()
        total_loss = 0.0
        total_tokens = 0
        # Ensure the HF dataset yields tensors rather than Python lists
        eval_dataset.set_format(
            "torch", columns=["input_ids", "attention_mask", "labels"]
        )
        dataloader = torch.utils.data.DataLoader(
            eval_dataset,
            batch_size=batch_size,
            shuffle=False,
        )
for batch in tqdm(dataloader, desc="Calculating perplexity"):
input_ids = batch["input_ids"].to(self.device)
attention_mask = batch["attention_mask"].to(self.device)
labels = batch["labels"].to(self.device)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels,
)
# Count non-padding tokens
num_tokens = (labels != -100).sum().item()
total_loss += outputs.loss.item() * num_tokens
total_tokens += num_tokens
avg_loss = total_loss / total_tokens
perplexity = torch.exp(torch.tensor(avg_loss)).item()
return perplexity
def compare_models(
base_model_name: str,
lora_adapter_path: str,
test_prompts: List[str],
) -> Dict[str, List[str]]:
"""Compare base model vs LoRA fine-tuned model."""
# Load base model
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Load LoRA model
lora_model = PeftModel.from_pretrained(base_model, lora_adapter_path)
    results = {"base": [], "lora": []}
    for prompt in test_prompts:
        inputs = base_tokenizer(prompt, return_tensors="pt").to("cuda")
        # PeftModel.from_pretrained injects adapters into base_model in place,
        # so disable them explicitly to get a true baseline generation
        with lora_model.disable_adapter():
            base_output = lora_model.generate(**inputs, max_new_tokens=256)
        results["base"].append(
            base_tokenizer.decode(base_output[0], skip_special_tokens=True)
        )
        # LoRA model generation
        lora_output = lora_model.generate(**inputs, max_new_tokens=256)
        results["lora"].append(
            base_tokenizer.decode(lora_output[0], skip_special_tokens=True)
        )
    return results

Adapter Merging and Deployment
Merging Adapters into Base Model
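Merging is possible because the adapter contributes a purely additive term: folding (alpha / r) · B · A into the base weight gives a numerically identical model with no extra layers at inference time. A NumPy sanity check of that identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 8, 2, 4
x = rng.standard_normal((3, k))
W = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))

adapter_out = x @ W.T + (alpha / r) * (x @ A.T @ B.T)   # base + adapter path
W_merged = W + (alpha / r) * (B @ A)                    # folded weights
merged_out = x @ W_merged.T

np.testing.assert_allclose(adapter_out, merged_out, atol=1e-8)
```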
# scripts/merge.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import argparse
from pathlib import Path
def merge_lora_weights(
base_model_name: str,
adapter_path: str,
output_path: str,
push_to_hub: bool = False,
hub_model_id: str = None,
):
"""Merge LoRA adapters into base model."""
print(f"Loading base model: {base_model_name}")
# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
print(f"Loading LoRA adapters from: {adapter_path}")
model = PeftModel.from_pretrained(base_model, adapter_path)
print("Merging weights...")
merged_model = model.merge_and_unload()
print(f"Saving merged model to: {output_path}")
merged_model.save_pretrained(output_path)
tokenizer.save_pretrained(output_path)
if push_to_hub and hub_model_id:
print(f"Pushing to Hub: {hub_model_id}")
merged_model.push_to_hub(hub_model_id)
tokenizer.push_to_hub(hub_model_id)
print("Done!")
return merged_model
def merge_multiple_adapters(
base_model_name: str,
adapter_paths: list,
weights: list,
output_path: str,
):
"""Merge multiple LoRA adapters with different weights."""
print(f"Loading base model: {base_model_name}")
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
# Load first adapter
model = PeftModel.from_pretrained(
base_model,
adapter_paths[0],
adapter_name="adapter_0",
)
# Load additional adapters
for i, path in enumerate(adapter_paths[1:], 1):
model.load_adapter(path, adapter_name=f"adapter_{i}")
# Create weighted combination
adapter_names = [f"adapter_{i}" for i in range(len(adapter_paths))]
model.add_weighted_adapter(
adapters=adapter_names,
weights=weights,
adapter_name="merged",
combination_type="linear",
)
# Set merged as active and merge
model.set_adapter("merged")
merged_model = model.merge_and_unload()
merged_model.save_pretrained(output_path)
return merged_model
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model", required=True)
parser.add_argument("--adapter-path", required=True)
parser.add_argument("--output-path", required=True)
parser.add_argument("--push-to-hub", action="store_true")
parser.add_argument("--hub-model-id", default=None)
args = parser.parse_args()
merge_lora_weights(
args.base_model,
args.adapter_path,
args.output_path,
args.push_to_hub,
args.hub_model_id,
    )

Inference with Adapters
# inference/generate.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from typing import Optional, Generator
import asyncio
class LoRAInference:
"""Inference with LoRA adapters."""
def __init__(
self,
base_model_name: str,
adapter_path: Optional[str] = None,
use_4bit: bool = True,
):
self.base_model_name = base_model_name
self.adapter_path = adapter_path
self.use_4bit = use_4bit
self.model = None
self.tokenizer = None
self.device = "cuda" if torch.cuda.is_available() else "cpu"
def load(self):
"""Load model and adapter."""
# Quantization config
quant_config = None
if self.use_4bit:
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load base model
self.model = AutoModelForCausalLM.from_pretrained(
self.base_model_name,
quantization_config=quant_config,
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Load adapter if provided
if self.adapter_path:
self.model = PeftModel.from_pretrained(
self.model,
self.adapter_path,
)
self.tokenizer = AutoTokenizer.from_pretrained(self.base_model_name)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
self.model.eval()
@torch.no_grad()
def generate(
self,
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7,
top_p: float = 0.9,
top_k: int = 50,
repetition_penalty: float = 1.1,
) -> str:
"""Generate response."""
inputs = self.tokenizer(
prompt,
return_tensors="pt",
truncation=True,
max_length=2048,
).to(self.device)
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
top_k=top_k,
repetition_penalty=repetition_penalty,
do_sample=True,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
)
response = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
)
return response.strip()
@torch.no_grad()
def generate_stream(
self,
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7,
) -> Generator[str, None, None]:
"""Generate response with streaming."""
from transformers import TextIteratorStreamer
from threading import Thread
inputs = self.tokenizer(
prompt,
return_tensors="pt",
truncation=True,
).to(self.device)
streamer = TextIteratorStreamer(
self.tokenizer,
skip_prompt=True,
skip_special_tokens=True,
)
generation_kwargs = {
**inputs,
"streamer": streamer,
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"do_sample": True,
"pad_token_id": self.tokenizer.pad_token_id,
}
thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
yield text
thread.join()
def swap_adapter(self, new_adapter_path: str):
"""Hot-swap to a different adapter."""
if hasattr(self.model, "load_adapter"):
self.model.load_adapter(new_adapter_path, adapter_name="new")
self.model.set_adapter("new")
else:
# Reload with new adapter
base_model = self.model.get_base_model()
            self.model = PeftModel.from_pretrained(base_model, new_adapter_path)

Training Script
# scripts/train.py
import argparse
import wandb
from config.lora_config import ModelConfig, LoRASettings, TrainingConfig, DataConfig
from training.model_loader import ModelLoader
from data.dataset import DatasetPreparation, PromptTemplate
from training.trainer import LoRATrainer
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--model", default="meta-llama/Llama-2-7b-hf")
parser.add_argument("--dataset", default="databricks/databricks-dolly-15k")
parser.add_argument("--output-dir", default="./outputs/lora-llama2")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch-size", type=int, default=4)
parser.add_argument("--lora-r", type=int, default=16)
parser.add_argument("--learning-rate", type=float, default=2e-4)
    parser.add_argument(
        "--use-4bit", action=argparse.BooleanOptionalAction, default=True
    )  # pass --no-use-4bit to disable quantization
parser.add_argument("--wandb-project", default="lora-finetuning")
args = parser.parse_args()
# Initialize wandb
wandb.init(project=args.wandb_project, config=vars(args))
# Create configurations
model_config = ModelConfig(
model_name=args.model,
use_4bit=args.use_4bit,
)
lora_settings = LoRASettings(
r=args.lora_r,
lora_alpha=args.lora_r * 2,
)
training_config = TrainingConfig(
output_dir=args.output_dir,
num_train_epochs=args.epochs,
per_device_train_batch_size=args.batch_size,
learning_rate=args.learning_rate,
)
data_config = DataConfig(
dataset_name=args.dataset,
)
# Load model
print("Loading model...")
loader = ModelLoader(model_config, lora_settings)
model, tokenizer = loader.load()
# Prepare dataset
print("Preparing dataset...")
template = PromptTemplate()
dataset_prep = DatasetPreparation(
tokenizer=tokenizer,
max_length=data_config.max_seq_length,
template=template,
)
datasets = dataset_prep.prepare(
data_config.dataset_name,
validation_split=data_config.validation_split,
)
# Train
print("Starting training...")
trainer = LoRATrainer(
model=model,
tokenizer=tokenizer,
train_dataset=datasets["train"],
eval_dataset=datasets["validation"],
training_config=training_config,
)
results = trainer.train()
# Log results
wandb.log(results)
print(f"Training complete! Results: {results}")
# Save final adapter
model.save_pretrained(f"{args.output_dir}/final_adapter")
tokenizer.save_pretrained(f"{args.output_dir}/final_adapter")
if __name__ == "__main__":
    main()

Advanced Techniques
QLoRA with Double Quantization
# config/qlora_config.py
from transformers import BitsAndBytesConfig
import torch


def get_qlora_config():
    """QLoRA configuration with double quantization."""
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # NormalFloat 4-bit
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,  # Quantize the quantization constants
    )

DoRA: Weight-Decomposed Low-Rank Adaptation
# config/dora_config.py
from peft import LoraConfig


def get_dora_config():
    """DoRA configuration - weight-decomposed LoRA."""
    return LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        use_dora=True,  # Enable DoRA
        task_type="CAUSAL_LM",
    )

Layer-wise Learning Rates
# training/layerwise_lr.py
from torch.optim import AdamW


def get_layerwise_optimizer(model, base_lr=2e-4, lr_decay=0.9, num_layers=32):
    """Create an optimizer with layer-wise learning rate decay.

    Higher-numbered (later) layers keep learning rates close to base_lr;
    earlier layers are decayed more strongly.
    """
    params = []
    # Group trainable (LoRA) parameters with a per-layer learning rate
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Extract the layer number from the parameter name, e.g.
        # "base_model.model.layers.17.self_attn.q_proj.lora_A.weight" -> 17
        layer_num = None
        for part in name.split("."):
            if part.isdigit():
                layer_num = int(part)
                break
        # Calculate the layer-specific learning rate
        if layer_num is not None:
            lr = base_lr * (lr_decay ** (num_layers - 1 - layer_num))
        else:
            lr = base_lr
        params.append({"params": [param], "lr": lr})
    return AdamW(params, weight_decay=0.01)

Memory Optimization Tips
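Before reaching for any of these techniques, measure your baseline. A small helper for tracking peak VRAM between training phases (pure PyTorch; `report_peak_vram` is an illustrative helper, and it prints a notice instead of a measurement on CPU-only machines):

```python
import torch


def report_peak_vram(tag: str) -> float:
    """Print and return peak allocated VRAM in GB since the last reset."""
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device available")
        return 0.0
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] peak VRAM: {peak_gb:.2f} GB")
    torch.cuda.reset_peak_memory_stats()  # Start fresh for the next phase
    return peak_gb


# Example: bracket the expensive phases of a run
report_peak_vram("after model load")
# ... trainer.train() ...
report_peak_vram("after training")
```

Comparing the "after model load" and "after training" peaks tells you whether weights or optimizer/activation memory is your bottleneck, and therefore which technique below will help most.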
┌─────────────────────────────────────────────────────────────────────────────┐
│ Memory Optimization Techniques │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ 4-bit Quantization │ │ Gradient Checkpointing│ │
│ │ ~4GB VRAM │ │ Trade compute for mem │ │
│ └──────────┬───────────┘ └──────────┬───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Fit 7B on 8GB GPU │ │Reduce activation mem │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Gradient Accumulation│ │ Paged Optimizer │ │
│ │ Simulate larger batch│ │ CPU offload │ │
│ └──────────┬───────────┘ └──────────┬───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │Effective batch 16+ │ │Manage optimizer state│ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Memory Calculation Guide
| Model Size | Full FT (fp16) | LoRA (fp16) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~56GB | ~16GB | ~6GB |
| 13B | ~104GB | ~30GB | ~10GB |
| 70B | ~560GB | ~160GB | ~48GB |
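These figures come from bytes-per-parameter arithmetic. A rough sketch of the calculation (the `estimate_vram_gb` helper and its per-mode constants are illustrative approximations; real usage also depends on batch size, sequence length, and activation memory):

```python
def estimate_vram_gb(n_params_b: float, mode: str) -> float:
    """Rough VRAM estimate in GB for a model with n_params_b billion params.

    full:  fp16 weights (2 B) + fp16 grads (2 B) + Adam states (4 B) ~ 8 B/param
    lora:  frozen fp16 weights (2 B) + small adapter/grad/optimizer overhead
    qlora: 4-bit weights (~0.5 B) + adapter and dequantization overhead
    """
    bytes_per_param = {"full": 8.0, "lora": 2.3, "qlora": 0.85}[mode]
    return n_params_b * bytes_per_param


print(f"7B full FT: ~{estimate_vram_gb(7, 'full'):.0f} GB")   # ~56 GB
print(f"7B LoRA:    ~{estimate_vram_gb(7, 'lora'):.0f} GB")   # ~16 GB
print(f"7B QLoRA:   ~{estimate_vram_gb(7, 'qlora'):.0f} GB")  # ~6 GB
```

The dominant term in full fine-tuning is optimizer state, not the weights themselves, which is exactly what LoRA's frozen base model and QLoRA's 4-bit quantization eliminate.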
Common Issues and Solutions
Issue: Gradient Overflow
# Use gradient clipping and prefer bf16 over fp16
from transformers import TrainingArguments

training_args = TrainingArguments(
    max_grad_norm=0.3,  # Clip gradients
    fp16=False,  # bf16 has a wider exponent range, so it overflows less
    bf16=True,
)

Issue: Loss Not Decreasing
# Check learning rate and warmup
from transformers import TrainingArguments

training_args = TrainingArguments(
    learning_rate=2e-4,  # Typical range: 1e-4 to 5e-4
    warmup_ratio=0.03,  # Increase toward 0.1 if early training is unstable
    lr_scheduler_type="cosine",
)

Issue: Out of Memory
# Reduce memory usage
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # Effective batch size of 16
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
)

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| LoRA (Low-Rank Adaptation) | Freezes base model, trains small adapter matrices | Reduces trainable params from 7B to ~4M (0.06%) |
| Rank (r) | Dimension of the low-rank matrices (typically 8-64) | Controls capacity vs efficiency trade-off |
| Alpha (lora_alpha) | Scaling factor for LoRA updates (typically 2×r) | Adjusts adaptation strength without retraining |
| Target Modules | Which layers get LoRA adapters (q/k/v/o_proj, MLP) | More modules = more capacity but more params |
| QLoRA | LoRA + 4-bit quantization of base model | Fits 7B models on 8GB GPUs |
| NF4 (NormalFloat 4-bit) | Optimal 4-bit data type for normally-distributed weights | Better accuracy than standard INT4 |
| Double Quantization | Quantizes the quantization constants themselves | Extra ~0.5GB memory savings |
| Gradient Checkpointing | Recomputes activations during backward pass | Trades compute for memory |
| Paged Optimizer | Offloads optimizer states to CPU when needed | Handles memory spikes gracefully |
| Adapter Merging | Combines LoRA weights into base model permanently | No inference overhead, easy deployment |
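The "Adapter Merging" row is the only recap entry without code in this section, so here is a minimal numerical sketch of the merge. `W`, `A`, and `B` are toy stand-ins for a frozen base weight and a trained adapter pair; PEFT's `merge_and_unload()` applies the same update, ΔW = (alpha/r)·B·A, to each target module:

```python
import torch

torch.manual_seed(0)
d, r, alpha = 16, 4, 8
W = torch.randn(d, d)          # frozen base weight
A = torch.randn(r, d) * 0.01   # LoRA down-projection (trained)
B = torch.randn(d, r) * 0.01   # LoRA up-projection (trained)

x = torch.randn(2, d)
# Inference with an unmerged adapter: base path plus scaled low-rank path
y_adapter = x @ W.T + (alpha / r) * (x @ A.T @ B.T)
# After merging the update into the base weight
W_merged = W + (alpha / r) * (B @ A)
y_merged = x @ W_merged.T

print(torch.allclose(y_adapter, y_merged, atol=1e-5))  # True
```

Because the merged weight reproduces the adapter path exactly, a merged model runs with zero inference overhead and deploys like any ordinary checkpoint.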
Next Steps
After completing this project, consider:
- Custom Reranker - Train cross-encoders for RAG
- Knowledge Distillation - Compress fine-tuned models
- DPO Alignment - Align models with human preferences