LoRA Fine-tuning
Efficient LLM fine-tuning with Low-Rank Adaptation
TL;DR
Fine-tune billion-parameter models by training small adapter matrices (0.06% of params) instead of full weights. QLoRA adds 4-bit quantization to fit 7B models on 8GB GPUs. Key parameters: rank (r=8-64), alpha (2x rank), target modules (attention projections).
Fine-tune large language models efficiently using Low-Rank Adaptation (LoRA) and the PEFT library.
Overview
| Difficulty | Intermediate |
| Time | ~6 hours |
| Prerequisites | PyTorch basics, Transformers library |
| Learning Outcomes | LoRA theory, PEFT integration, QLoRA, adapter merging |
Introduction
Fine-tuning large language models traditionally requires updating billions of parameters, demanding expensive GPU hardware and risking catastrophic forgetting. LoRA solves this by training only small adapter matrices while keeping base model weights frozen.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Traditional vs LoRA Fine-tuning │
├────────────────────────────────┬────────────────────────────────────────────┤
│ Traditional Fine-tuning │ LoRA Fine-tuning │
├────────────────────────────────┼────────────────────────────────────────────┤
│ │ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ Base Model │ │ │ Base Model │ │
│ │ 7B params │ │ │ 7B frozen │ │
│ └──────┬───────┘ │ └──────┬───────┘ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ Update All │ │ │Train Adapters│ │
│ │ 7B params │ │ │ ~4M params │ │
│ └──────┬───────┘ │ └──────┬───────┘ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ High Memory │ │ │ Low Memory │ │
│ │ 28GB+ VRAM │ │ │ 8GB VRAM │ │
│ └──────────────┘ │ └──────────────┘ │
│ │ │
└────────────────────────────────┴────────────────────────────────────────────┘

Understanding LoRA
The Low-Rank Hypothesis
Neural network weight updates during fine-tuning have low intrinsic rank. Instead of updating a full weight matrix W, we can express the update ΔW as the product of two much smaller matrices.
┌─────────────────────────────────────────────────────────────────────────────┐
│ LoRA Low-Rank Decomposition │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Original Update LoRA Decomposition │
│ ─────────────── ────────────────── │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │Weight Matrix│ │ Matrix A │──┐ │
│ │ d × k │ │ r × k │ │ │
│ └──────┬──────┘ └─────────────┘ │ │
│ │ │ │
│ ▼ ┌─────────────┐ │ │
│ ┌─────────────┐ │ Matrix B │──┼──►┌───────────┐ │
│ │ ΔW │ ─ ─ ─ ─ ─ ≈ ─ ─ ─► │ d × r │ │ │ B × A │ │
│ │ d × k │ └─────────────┘──┘ │ d × k │ │
│ └─────────────┘ └───────────┘ │
│ │
│ Full update Low-rank approximation │
│ (d × k params) (d×r + r×k params) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

For a weight matrix of dimensions (d × k), the update ΔW = B × A is decomposed as:
- Matrix A: (r × k) - Down-projection
- Matrix B: (d × r) - Up-projection
- r (rank): Typically 4-64, much smaller than d and k
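A few lines of NumPy make the bookkeeping concrete (a standalone sketch, using Llama-2's hidden size of 4096 for illustration):

```python
import numpy as np

d, k, r = 4096, 4096, 8  # one attention projection, rank 8

# LoRA trains two small matrices instead of the full d x k update.
B = np.zeros((d, r))           # up-projection, initialized to zero
A = np.random.randn(r, k)      # down-projection, random init

delta_W = B @ A                # reconstructed update, shape (d, k)
assert delta_W.shape == (d, k)

full_params = d * k            # 16,777,216
lora_params = d * r + r * k    # 65,536
print(f"reduction: {full_params // lora_params}x")  # prints "reduction: 256x"
```

Initializing B to zero means ΔW starts at zero, so training begins from the unmodified base model.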
Parameter Reduction
For a 7B parameter model targeting attention layers:
- Full fine-tuning: ~7 billion parameters
- LoRA (r=8): ~4 million parameters (0.06% of original)
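The ~4 million figure can be reproduced with back-of-the-envelope arithmetic, assuming Llama-2-7B's 32 layers and hidden size 4096 with LoRA applied to the query and value projections only (as in the original LoRA paper):

```python
layers, hidden, r = 32, 4096, 8   # Llama-2-7B dimensions
modules_per_layer = 2             # q_proj and v_proj

params_per_module = hidden * r + r * hidden        # one A plus one B matrix
lora_params = layers * modules_per_layer * params_per_module
full_params = 7_000_000_000

print(f"{lora_params:,} trainable")                  # 4,194,304 trainable
print(f"{100 * lora_params / full_params:.2f}% of the base model")  # 0.06%
```

Targeting all seven projection modules (as in the LoRASettings below) roughly triples this count, but it remains well under 1% of the base model.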
Project Setup
# Create project directory
mkdir lora-finetuning && cd lora-finetuning
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install torch transformers datasets peft accelerate bitsandbytes
pip install trl wandb evaluate scikit-learn

Project Structure
lora-finetuning/
├── config/
│ └── lora_config.py # LoRA configuration
├── data/
│ └── dataset.py # Dataset preparation
├── training/
│ ├── trainer.py # Training loop
│ └── callbacks.py # Custom callbacks
├── evaluation/
│ └── evaluate.py # Model evaluation
├── inference/
│ └── generate.py # Inference utilities
├── scripts/
│ ├── train.py # Training script
│ └── merge.py # Adapter merging
└── requirements.txt

Configuration
LoRA Configuration
# config/lora_config.py
from dataclasses import dataclass, field
from typing import Optional, List
from peft import LoraConfig, TaskType
@dataclass
class ModelConfig:
"""Base model configuration."""
model_name: str = "meta-llama/Llama-2-7b-hf"
use_4bit: bool = True
use_8bit: bool = False
trust_remote_code: bool = False
use_flash_attention: bool = True
@dataclass
class LoRASettings:
"""LoRA hyperparameters."""
# Core LoRA parameters
r: int = 16 # Rank of adaptation
lora_alpha: int = 32 # Scaling factor
lora_dropout: float = 0.05 # Dropout probability
# Target modules for different architectures
target_modules: List[str] = field(default_factory=lambda: [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj" # MLP
])
# Advanced settings
bias: str = "none" # none, all, lora_only
task_type: TaskType = TaskType.CAUSAL_LM
modules_to_save: Optional[List[str]] = None # Full fine-tuning layers
def to_peft_config(self) -> LoraConfig:
"""Convert to PEFT LoraConfig."""
return LoraConfig(
r=self.r,
lora_alpha=self.lora_alpha,
lora_dropout=self.lora_dropout,
target_modules=self.target_modules,
bias=self.bias,
task_type=self.task_type,
modules_to_save=self.modules_to_save,
)
@dataclass
class TrainingConfig:
"""Training hyperparameters."""
output_dir: str = "./outputs"
num_train_epochs: int = 3
per_device_train_batch_size: int = 4
per_device_eval_batch_size: int = 4
gradient_accumulation_steps: int = 4
# Optimizer settings
learning_rate: float = 2e-4
weight_decay: float = 0.01
warmup_ratio: float = 0.03
lr_scheduler_type: str = "cosine"
# Memory optimization
gradient_checkpointing: bool = True
max_grad_norm: float = 0.3
# Logging
logging_steps: int = 10
eval_steps: int = 100
save_steps: int = 100
# Mixed precision
fp16: bool = False
bf16: bool = True # Use bf16 on Ampere+ GPUs
@dataclass
class DataConfig:
"""Dataset configuration."""
dataset_name: str = "databricks/databricks-dolly-15k"
max_seq_length: int = 2048
validation_split: float = 0.1
# Prompt template
instruction_template: str = "### Instruction:\n{instruction}\n\n"
input_template: str = "### Input:\n{input}\n\n"
    response_template: str = "### Response:\n{output}"

Understanding LoRA Parameters
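The relationship between r and lora_alpha is easiest to see in the forward pass: the adapter output is scaled by alpha / r, so keeping alpha at 2x the rank holds the update strength roughly constant as r changes. A NumPy sketch of the forward pass (toy dimensions, not the PEFT implementation):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d, k, r, alpha = 16, 16, 4, 8           # toy sizes; alpha = 2 * r
x = rng.standard_normal((1, k))
W = rng.standard_normal((d, k))         # frozen base weight
A = rng.standard_normal((r, k))         # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, starts at zero ...

# ... so at initialization the LoRA output equals the frozen base output.
np.testing.assert_allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)
```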
┌─────────────────────────────────────────────────────────────────────────────┐
│ LoRA Key Parameters │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Key Parameters │ │
│ ├───────────────────┬───────────────────┬─────────────────────────────┤ │
│ │ r: Rank │ alpha: Scaling │ dropout: Regularization │ │
│ │ Higher = More │ Controls adapt- │ Prevents overfitting │ │
│ │ capacity │ ation strength │ │ │
│ └─────────┬─────────┴─────────┬─────────┴─────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌───────────────────┐ │
│ │ Rank Guidelines: │ │ Alpha/r ratio: │ │
│ │ • r=8: Simple │ │ • Typically 2x │ │
│ │ • r=16: General │ │ • alpha=32 for │ │
│ │ • r=64: Complex │ │ r=16 │ │
│ └─────────────────────┘ └───────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Model Loading with Quantization
QLoRA: 4-bit Quantization
QLoRA enables fine-tuning on consumer GPUs by quantizing the base model to 4-bit precision while keeping LoRA adapters in higher precision.
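The intuition behind quantization can be shown with a toy absmax scheme; note this is a uniform-grid sketch only, while NF4 uses 16 quantization levels spaced for normally distributed weights:

```python
def quantize_4bit(weights):
    """Uniform absmax 4-bit quantization: map floats to 16 integer levels."""
    scale = max(abs(w) for w in weights) / 7   # use the symmetric range -7..7
    q = [round(w / scale) for w in weights]    # each value now fits in 4 bits
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [v * scale for v in q]

w = [0.12, -0.50, 0.33, 0.07]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert max_err <= s / 2   # error is bounded by half a quantization step
```

Double quantization (bnb_4bit_use_double_quant below) additionally quantizes the per-block scale factors themselves, saving a further fraction of a bit per parameter.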
# training/model_loader.py
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
from peft import (
get_peft_model,
prepare_model_for_kbit_training,
LoraConfig,
)
from typing import Tuple, Optional
from config.lora_config import ModelConfig, LoRASettings
class ModelLoader:
"""Load and prepare models for LoRA training."""
def __init__(
self,
model_config: ModelConfig,
lora_settings: LoRASettings,
):
self.model_config = model_config
self.lora_settings = lora_settings
self.device = "cuda" if torch.cuda.is_available() else "cpu"
def get_quantization_config(self) -> Optional[BitsAndBytesConfig]:
"""Create quantization configuration."""
if self.model_config.use_4bit:
return BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # Normal Float 4-bit
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # Nested quantization
)
elif self.model_config.use_8bit:
return BitsAndBytesConfig(
load_in_8bit=True,
)
return None
def load_tokenizer(self) -> AutoTokenizer:
"""Load and configure tokenizer."""
tokenizer = AutoTokenizer.from_pretrained(
self.model_config.model_name,
trust_remote_code=self.model_config.trust_remote_code,
        padding_side="right",  # Right padding is standard for causal LM training
)
# Set padding token if not present
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
return tokenizer
def load_base_model(self) -> AutoModelForCausalLM:
"""Load base model with optional quantization."""
quant_config = self.get_quantization_config()
model_kwargs = {
"pretrained_model_name_or_path": self.model_config.model_name,
"trust_remote_code": self.model_config.trust_remote_code,
"torch_dtype": torch.bfloat16,
"device_map": "auto",
}
if quant_config:
model_kwargs["quantization_config"] = quant_config
if self.model_config.use_flash_attention:
model_kwargs["attn_implementation"] = "flash_attention_2"
model = AutoModelForCausalLM.from_pretrained(**model_kwargs)
return model
def prepare_for_training(
self,
model: AutoModelForCausalLM,
) -> AutoModelForCausalLM:
"""Prepare model for k-bit training."""
if self.model_config.use_4bit or self.model_config.use_8bit:
model = prepare_model_for_kbit_training(
model,
use_gradient_checkpointing=True,
)
return model
def apply_lora(
self,
model: AutoModelForCausalLM,
) -> AutoModelForCausalLM:
"""Apply LoRA adapters to model."""
lora_config = self.lora_settings.to_peft_config()
model = get_peft_model(model, lora_config)
return model
def load(self) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
"""Load complete model with LoRA adapters."""
print(f"Loading model: {self.model_config.model_name}")
# Load tokenizer
tokenizer = self.load_tokenizer()
# Load and prepare model
model = self.load_base_model()
model = self.prepare_for_training(model)
model = self.apply_lora(model)
# Print trainable parameters
self._print_trainable_params(model)
return model, tokenizer
def _print_trainable_params(self, model: AutoModelForCausalLM) -> None:
"""Print number of trainable parameters."""
trainable_params = sum(
p.numel() for p in model.parameters() if p.requires_grad
)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,}")
print(f"Total parameters: {total_params:,}")
        print(f"Trainable %: {100 * trainable_params / total_params:.4f}%")

Dataset Preparation
Instruction Dataset Format
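The template below turns each record into a single Alpaca-style training string. For reference, here is a standalone mirror of that formatting logic with a hypothetical example (the helper name is illustrative, not part of the project code):

```python
def format_example(instruction, context, response,
                   system="Below is an instruction that describes a task. "
                          "Write a response that appropriately completes the request."):
    """Build an Alpaca-style training string (mirrors PromptTemplate.format)."""
    parts = [f"### System:\n{system}\n\n", f"### Instruction:\n{instruction}\n\n"]
    if context:  # the Input block is omitted when there is no context
        parts.append(f"### Input:\n{context}\n\n")
    parts.append(f"### Response:\n{response}")
    return "".join(parts)

text = format_example("Translate to French.", "", "Bonjour")
assert text.endswith("### Response:\nBonjour")
assert "### Input:" not in text
```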
# data/dataset.py
from datasets import load_dataset, Dataset
from transformers import PreTrainedTokenizer
from typing import Dict, Any, Optional
from dataclasses import dataclass
@dataclass
class PromptTemplate:
"""Template for formatting instructions."""
instruction_key: str = "instruction"
input_key: str = "context"
output_key: str = "response"
system_prompt: str = (
"Below is an instruction that describes a task. "
"Write a response that appropriately completes the request."
)
def format(self, example: Dict[str, Any]) -> str:
"""Format a single example."""
parts = [f"### System:\n{self.system_prompt}\n\n"]
# Add instruction
instruction = example.get(self.instruction_key, "")
parts.append(f"### Instruction:\n{instruction}\n\n")
# Add optional input/context
input_text = example.get(self.input_key, "")
if input_text:
parts.append(f"### Input:\n{input_text}\n\n")
# Add response
output = example.get(self.output_key, "")
parts.append(f"### Response:\n{output}")
return "".join(parts)
def format_for_inference(self, instruction: str, input_text: str = "") -> str:
"""Format prompt for inference (no response)."""
parts = [f"### System:\n{self.system_prompt}\n\n"]
parts.append(f"### Instruction:\n{instruction}\n\n")
if input_text:
parts.append(f"### Input:\n{input_text}\n\n")
parts.append("### Response:\n")
return "".join(parts)
class DatasetPreparation:
"""Prepare datasets for LoRA fine-tuning."""
def __init__(
self,
tokenizer: PreTrainedTokenizer,
max_length: int = 2048,
template: Optional[PromptTemplate] = None,
):
self.tokenizer = tokenizer
self.max_length = max_length
self.template = template or PromptTemplate()
def load_dataset(
self,
dataset_name: str,
split: str = "train",
) -> Dataset:
"""Load dataset from HuggingFace Hub."""
dataset = load_dataset(dataset_name, split=split)
return dataset
def tokenize_function(self, examples: Dict[str, Any]) -> Dict[str, Any]:
"""Tokenize a batch of examples."""
texts = []
for i in range(len(examples[self.template.instruction_key])):
example = {
self.template.instruction_key: examples[self.template.instruction_key][i],
self.template.input_key: examples.get(self.template.input_key, [""])[i] if self.template.input_key in examples else "",
self.template.output_key: examples[self.template.output_key][i],
}
texts.append(self.template.format(example))
# Tokenize
tokenized = self.tokenizer(
texts,
truncation=True,
max_length=self.max_length,
padding="max_length",
return_tensors=None,
)
        # Labels mirror input_ids for causal LM; mask padded positions with
        # -100 so they are excluded from the loss
        tokenized["labels"] = [
            [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
            for ids, attn in zip(tokenized["input_ids"], tokenized["attention_mask"])
        ]
return tokenized
def prepare(
self,
dataset_name: str,
validation_split: float = 0.1,
) -> Dict[str, Dataset]:
"""Prepare train and validation datasets."""
# Load raw dataset
dataset = self.load_dataset(dataset_name)
# Split into train/validation
split_dataset = dataset.train_test_split(test_size=validation_split)
# Tokenize
tokenized_train = split_dataset["train"].map(
self.tokenize_function,
batched=True,
remove_columns=split_dataset["train"].column_names,
desc="Tokenizing training set",
)
tokenized_val = split_dataset["test"].map(
self.tokenize_function,
batched=True,
remove_columns=split_dataset["test"].column_names,
desc="Tokenizing validation set",
)
return {
"train": tokenized_train,
"validation": tokenized_val,
}
class ChatDatasetPreparation(DatasetPreparation):
"""Prepare chat/conversation datasets."""
def __init__(
self,
tokenizer: PreTrainedTokenizer,
max_length: int = 2048,
):
super().__init__(tokenizer, max_length)
def format_chat(self, messages: list) -> str:
"""Format chat messages using tokenizer's chat template."""
return self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
def tokenize_chat(self, examples: Dict[str, Any]) -> Dict[str, Any]:
"""Tokenize chat conversations."""
texts = []
for messages in examples["messages"]:
text = self.format_chat(messages)
texts.append(text)
tokenized = self.tokenizer(
texts,
truncation=True,
max_length=self.max_length,
padding="max_length",
return_tensors=None,
)
tokenized["labels"] = tokenized["input_ids"].copy()
        return tokenized

Training Implementation
Custom Trainer with LoRA
# training/trainer.py
import torch
from transformers import (
Trainer,
TrainingArguments,
DataCollatorForLanguageModeling,
)
from transformers.trainer_callback import TrainerCallback
from peft import PeftModel
from typing import Optional, Dict, Any
import wandb
from config.lora_config import TrainingConfig
class LoRATrainer:
"""Trainer for LoRA fine-tuning."""
def __init__(
self,
model: PeftModel,
tokenizer,
train_dataset,
eval_dataset,
training_config: TrainingConfig,
):
self.model = model
self.tokenizer = tokenizer
self.train_dataset = train_dataset
self.eval_dataset = eval_dataset
self.config = training_config
def get_training_args(self) -> TrainingArguments:
"""Create training arguments."""
return TrainingArguments(
output_dir=self.config.output_dir,
num_train_epochs=self.config.num_train_epochs,
per_device_train_batch_size=self.config.per_device_train_batch_size,
per_device_eval_batch_size=self.config.per_device_eval_batch_size,
gradient_accumulation_steps=self.config.gradient_accumulation_steps,
# Optimizer
learning_rate=self.config.learning_rate,
weight_decay=self.config.weight_decay,
warmup_ratio=self.config.warmup_ratio,
lr_scheduler_type=self.config.lr_scheduler_type,
optim="paged_adamw_32bit", # Memory-efficient optimizer
# Memory optimization
gradient_checkpointing=self.config.gradient_checkpointing,
max_grad_norm=self.config.max_grad_norm,
# Precision
fp16=self.config.fp16,
bf16=self.config.bf16,
# Logging
logging_steps=self.config.logging_steps,
eval_strategy="steps",
eval_steps=self.config.eval_steps,
save_strategy="steps",
save_steps=self.config.save_steps,
save_total_limit=3,
load_best_model_at_end=True,
# W&B logging
report_to="wandb",
run_name=f"lora-{self.config.output_dir.split('/')[-1]}",
# Other
remove_unused_columns=False,
dataloader_pin_memory=True,
dataloader_num_workers=4,
)
def get_data_collator(self):
"""Create data collator for language modeling."""
return DataCollatorForLanguageModeling(
tokenizer=self.tokenizer,
mlm=False, # Causal LM, not masked LM
)
def train(self) -> Dict[str, Any]:
"""Run training."""
training_args = self.get_training_args()
data_collator = self.get_data_collator()
# Create trainer
trainer = Trainer(
model=self.model,
args=training_args,
train_dataset=self.train_dataset,
eval_dataset=self.eval_dataset,
data_collator=data_collator,
callbacks=[
LoRAProgressCallback(),
EarlyStoppingCallback(patience=3),
],
)
# Train
train_result = trainer.train()
# Save final model
trainer.save_model(f"{self.config.output_dir}/final")
return {
"train_loss": train_result.training_loss,
"train_samples": len(self.train_dataset),
"train_steps": train_result.global_step,
}
class LoRAProgressCallback(TrainerCallback):
"""Custom callback for LoRA training progress."""
def on_log(self, args, state, control, logs=None, **kwargs):
"""Log training progress."""
if logs:
# Log gradient norms for LoRA layers
if "grad_norm" in logs:
print(f"Step {state.global_step}: grad_norm = {logs['grad_norm']:.4f}")
class EarlyStoppingCallback(TrainerCallback):
"""Early stopping based on validation loss."""
def __init__(self, patience: int = 3, min_delta: float = 0.01):
self.patience = patience
self.min_delta = min_delta
self.best_loss = float("inf")
self.counter = 0
def on_evaluate(self, args, state, control, metrics=None, **kwargs):
"""Check for improvement after evaluation."""
if metrics:
eval_loss = metrics.get("eval_loss", float("inf"))
if eval_loss < self.best_loss - self.min_delta:
self.best_loss = eval_loss
self.counter = 0
else:
self.counter += 1
if self.counter >= self.patience:
print(f"Early stopping: no improvement for {self.patience} evaluations")
                    control.should_training_stop = True

SFT Trainer Alternative
For instruction fine-tuning, the TRL library provides a specialized trainer:
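The packing=True option used below concatenates tokenized examples and slices the stream into fixed-length sequences, so no training step wastes compute on padding. A minimal sketch of the idea (toy token lists, not TRL's actual implementation):

```python
def pack_sequences(tokenized_examples, seq_length):
    """Concatenate examples into one stream, then cut into full-length chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
    return [
        stream[i : i + seq_length]
        for i in range(0, len(stream) - seq_length + 1, seq_length)
    ]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(examples, seq_length=4)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8]]  (the trailing 9 is dropped)
```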
# training/sft_trainer.py
from trl import SFTTrainer, SFTConfig
from peft import PeftModel
from typing import Optional
class SFTLoRATrainer:
"""SFT Trainer for instruction tuning with LoRA."""
def __init__(
self,
model: PeftModel,
tokenizer,
train_dataset,
eval_dataset,
max_seq_length: int = 2048,
output_dir: str = "./outputs",
):
self.model = model
self.tokenizer = tokenizer
self.train_dataset = train_dataset
self.eval_dataset = eval_dataset
self.max_seq_length = max_seq_length
self.output_dir = output_dir
def train(self):
"""Run SFT training."""
sft_config = SFTConfig(
output_dir=self.output_dir,
max_seq_length=self.max_seq_length,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
save_strategy="steps",
save_steps=100,
eval_strategy="steps",
eval_steps=100,
bf16=True,
gradient_checkpointing=True,
optim="paged_adamw_32bit",
packing=True, # Pack multiple samples into one sequence
)
trainer = SFTTrainer(
model=self.model,
args=sft_config,
train_dataset=self.train_dataset,
eval_dataset=self.eval_dataset,
tokenizer=self.tokenizer,
)
trainer.train()
        return trainer

Evaluation
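Perplexity is the exponential of the average per-token cross-entropy loss. The token-weighted averaging done in evaluate_perplexity below reduces to this arithmetic (standalone sketch with made-up batch losses):

```python
import math

def perplexity(batch_losses, batch_token_counts):
    """Token-weighted average loss across batches, then exponentiate."""
    total_loss = sum(l * n for l, n in zip(batch_losses, batch_token_counts))
    total_tokens = sum(batch_token_counts)
    return math.exp(total_loss / total_tokens)

# Two batches of different sizes: the larger batch dominates the average.
print(round(perplexity([2.0, 3.0], [300, 100]), 2))  # prints 9.49, i.e. exp(2.25)
```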
# evaluation/evaluate.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from typing import List, Dict, Any
import evaluate
from tqdm import tqdm
class LoRAEvaluator:
"""Evaluate LoRA fine-tuned models."""
def __init__(
self,
model: PeftModel,
tokenizer: AutoTokenizer,
device: str = "cuda",
):
self.model = model
self.tokenizer = tokenizer
self.device = device
# Load metrics
self.rouge = evaluate.load("rouge")
self.bleu = evaluate.load("bleu")
@torch.no_grad()
def generate(
self,
prompt: str,
max_new_tokens: int = 256,
temperature: float = 0.7,
top_p: float = 0.9,
) -> str:
"""Generate response for a prompt."""
inputs = self.tokenizer(
prompt,
return_tensors="pt",
truncation=True,
max_length=2048,
).to(self.device)
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=True,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
)
response = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
)
return response.strip()
def evaluate_generation(
self,
test_samples: List[Dict[str, str]],
prompt_template: callable,
) -> Dict[str, float]:
"""Evaluate generation quality."""
predictions = []
references = []
for sample in tqdm(test_samples, desc="Evaluating"):
prompt = prompt_template(sample["instruction"], sample.get("input", ""))
prediction = self.generate(prompt)
predictions.append(prediction)
references.append(sample["output"])
# Calculate ROUGE
rouge_scores = self.rouge.compute(
predictions=predictions,
references=references,
)
# Calculate BLEU
bleu_scores = self.bleu.compute(
predictions=[p.split() for p in predictions],
references=[[r.split()] for r in references],
)
return {
"rouge1": rouge_scores["rouge1"],
"rouge2": rouge_scores["rouge2"],
"rougeL": rouge_scores["rougeL"],
"bleu": bleu_scores["bleu"],
}
def evaluate_perplexity(
self,
eval_dataset,
batch_size: int = 4,
) -> float:
"""Calculate perplexity on evaluation set."""
        self.model.eval()
        total_loss = 0.0
        total_tokens = 0
        # Ensure the HF dataset yields tensors rather than Python lists
        eval_dataset.set_format(
            "torch", columns=["input_ids", "attention_mask", "labels"]
        )
        dataloader = torch.utils.data.DataLoader(
            eval_dataset,
            batch_size=batch_size,
            shuffle=False,
        )
for batch in tqdm(dataloader, desc="Calculating perplexity"):
input_ids = batch["input_ids"].to(self.device)
attention_mask = batch["attention_mask"].to(self.device)
labels = batch["labels"].to(self.device)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels,
)
# Count non-padding tokens
num_tokens = (labels != -100).sum().item()
total_loss += outputs.loss.item() * num_tokens
total_tokens += num_tokens
avg_loss = total_loss / total_tokens
perplexity = torch.exp(torch.tensor(avg_loss)).item()
return perplexity
def compare_models(
base_model_name: str,
lora_adapter_path: str,
test_prompts: List[str],
) -> Dict[str, List[str]]:
"""Compare base model vs LoRA fine-tuned model."""
# Load base model
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Load LoRA model
lora_model = PeftModel.from_pretrained(base_model, lora_adapter_path)
    results = {"base": [], "lora": []}
    for prompt in test_prompts:
        inputs = base_tokenizer(prompt, return_tensors="pt").to("cuda")
        # PeftModel.from_pretrained injects adapters into base_model in place,
        # so disable them explicitly to get a true baseline generation
        with lora_model.disable_adapter():
            base_output = lora_model.generate(**inputs, max_new_tokens=256)
        results["base"].append(
            base_tokenizer.decode(base_output[0], skip_special_tokens=True)
        )
        # LoRA model generation
        lora_output = lora_model.generate(**inputs, max_new_tokens=256)
        results["lora"].append(
            base_tokenizer.decode(lora_output[0], skip_special_tokens=True)
        )
    return results

Adapter Merging and Deployment
Merging Adapters into Base Model
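Merging is possible because the adapter contributes a purely additive term: folding (alpha / r) · B · A into the base weight gives a numerically identical model with no extra layers at inference time. A NumPy sanity check of that identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 8, 2, 4
x = rng.standard_normal((3, k))
W = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))

adapter_out = x @ W.T + (alpha / r) * (x @ A.T @ B.T)   # base + adapter path
W_merged = W + (alpha / r) * (B @ A)                    # folded weights
merged_out = x @ W_merged.T

np.testing.assert_allclose(adapter_out, merged_out, atol=1e-8)
```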
# scripts/merge.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import argparse
from pathlib import Path
def merge_lora_weights(
base_model_name: str,
adapter_path: str,
output_path: str,
push_to_hub: bool = False,
hub_model_id: str = None,
):
"""Merge LoRA adapters into base model."""
print(f"Loading base model: {base_model_name}")
# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
print(f"Loading LoRA adapters from: {adapter_path}")
model = PeftModel.from_pretrained(base_model, adapter_path)
print("Merging weights...")
merged_model = model.merge_and_unload()
print(f"Saving merged model to: {output_path}")
merged_model.save_pretrained(output_path)
tokenizer.save_pretrained(output_path)
if push_to_hub and hub_model_id:
print(f"Pushing to Hub: {hub_model_id}")
merged_model.push_to_hub(hub_model_id)
tokenizer.push_to_hub(hub_model_id)
print("Done!")
return merged_model
def merge_multiple_adapters(
base_model_name: str,
adapter_paths: list,
weights: list,
output_path: str,
):
"""Merge multiple LoRA adapters with different weights."""
print(f"Loading base model: {base_model_name}")
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
# Load first adapter
model = PeftModel.from_pretrained(
base_model,
adapter_paths[0],
adapter_name="adapter_0",
)
# Load additional adapters
for i, path in enumerate(adapter_paths[1:], 1):
model.load_adapter(path, adapter_name=f"adapter_{i}")
# Create weighted combination
adapter_names = [f"adapter_{i}" for i in range(len(adapter_paths))]
model.add_weighted_adapter(
adapters=adapter_names,
weights=weights,
adapter_name="merged",
combination_type="linear",
)
# Set merged as active and merge
model.set_adapter("merged")
merged_model = model.merge_and_unload()
merged_model.save_pretrained(output_path)
return merged_model
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model", required=True)
parser.add_argument("--adapter-path", required=True)
parser.add_argument("--output-path", required=True)
parser.add_argument("--push-to-hub", action="store_true")
parser.add_argument("--hub-model-id", default=None)
args = parser.parse_args()
merge_lora_weights(
args.base_model,
args.adapter_path,
args.output_path,
args.push_to_hub,
args.hub_model_id,
    )

Inference with Adapters
# inference/generate.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from typing import Optional, Generator
import asyncio
class LoRAInference:
"""Inference with LoRA adapters."""
def __init__(
self,
base_model_name: str,
adapter_path: Optional[str] = None,
use_4bit: bool = True,
):
self.base_model_name = base_model_name
self.adapter_path = adapter_path
self.use_4bit = use_4bit
self.model = None
self.tokenizer = None
self.device = "cuda" if torch.cuda.is_available() else "cpu"
def load(self):
"""Load model and adapter."""
# Quantization config
quant_config = None
if self.use_4bit:
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load base model
self.model = AutoModelForCausalLM.from_pretrained(
self.base_model_name,
quantization_config=quant_config,
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Load adapter if provided
if self.adapter_path:
self.model = PeftModel.from_pretrained(
self.model,
self.adapter_path,
)
self.tokenizer = AutoTokenizer.from_pretrained(self.base_model_name)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
self.model.eval()
@torch.no_grad()
def generate(
self,
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7,
top_p: float = 0.9,
top_k: int = 50,
repetition_penalty: float = 1.1,
) -> str:
"""Generate response."""
inputs = self.tokenizer(
prompt,
return_tensors="pt",
truncation=True,
max_length=2048,
).to(self.device)
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
top_k=top_k,
repetition_penalty=repetition_penalty,
do_sample=True,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
)
response = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
)
return response.strip()
@torch.no_grad()
def generate_stream(
self,
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7,
) -> Generator[str, None, None]:
"""Generate response with streaming."""
from transformers import TextIteratorStreamer
from threading import Thread
inputs = self.tokenizer(
prompt,
return_tensors="pt",
truncation=True,
).to(self.device)
streamer = TextIteratorStreamer(
self.tokenizer,
skip_prompt=True,
skip_special_tokens=True,
)
generation_kwargs = {
**inputs,
"streamer": streamer,
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"do_sample": True,
"pad_token_id": self.tokenizer.pad_token_id,
}
thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
yield text
thread.join()
def swap_adapter(self, new_adapter_path: str):
"""Hot-swap to a different adapter."""
if hasattr(self.model, "load_adapter"):
self.model.load_adapter(new_adapter_path, adapter_name="new")
self.model.set_adapter("new")
else:
# Reload with new adapter
base_model = self.model.get_base_model()
            self.model = PeftModel.from_pretrained(base_model, new_adapter_path)

Training Script
# scripts/train.py
import argparse
import wandb
from config.lora_config import ModelConfig, LoRASettings, TrainingConfig, DataConfig
from training.model_loader import ModelLoader
from data.dataset import DatasetPreparation, PromptTemplate
from training.trainer import LoRATrainer
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--model", default="meta-llama/Llama-2-7b-hf")
parser.add_argument("--dataset", default="databricks/databricks-dolly-15k")
parser.add_argument("--output-dir", default="./outputs/lora-llama2")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch-size", type=int, default=4)
parser.add_argument("--lora-r", type=int, default=16)
parser.add_argument("--learning-rate", type=float, default=2e-4)
    parser.add_argument(
        "--use-4bit", action=argparse.BooleanOptionalAction, default=True
    )  # pass --no-use-4bit to disable quantization
parser.add_argument("--wandb-project", default="lora-finetuning")
args = parser.parse_args()
# Initialize wandb
wandb.init(project=args.wandb_project, config=vars(args))
# Create configurations
model_config = ModelConfig(
model_name=args.model,
use_4bit=args.use_4bit,
)
lora_settings = LoRASettings(
r=args.lora_r,
lora_alpha=args.lora_r * 2,
)
training_config = TrainingConfig(
output_dir=args.output_dir,
num_train_epochs=args.epochs,
per_device_train_batch_size=args.batch_size,
learning_rate=args.learning_rate,
)
data_config = DataConfig(
dataset_name=args.dataset,
)
# Load model
print("Loading model...")
loader = ModelLoader(model_config, lora_settings)
model, tokenizer = loader.load()
# Prepare dataset
print("Preparing dataset...")
template = PromptTemplate()
dataset_prep = DatasetPreparation(
tokenizer=tokenizer,
max_length=data_config.max_seq_length,
template=template,
)
datasets = dataset_prep.prepare(
data_config.dataset_name,
validation_split=data_config.validation_split,
)
# Train
print("Starting training...")
trainer = LoRATrainer(
model=model,
tokenizer=tokenizer,
train_dataset=datasets["train"],
eval_dataset=datasets["validation"],
training_config=training_config,
)
results = trainer.train()
# Log results
wandb.log(results)
print(f"Training complete! Results: {results}")
# Save final adapter
model.save_pretrained(f"{args.output_dir}/final_adapter")
tokenizer.save_pretrained(f"{args.output_dir}/final_adapter")
if __name__ == "__main__":
    main()

Advanced Techniques
QLoRA with Double Quantization
# config/qlora_config.py
from transformers import BitsAndBytesConfig
import torch


def get_qlora_config():
    """QLoRA configuration with double quantization."""
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # NormalFloat 4-bit
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,  # Quantize the quantization constants
    )

DoRA: Weight-Decomposed Low-Rank Adaptation
# config/dora_config.py
from peft import LoraConfig


def get_dora_config():
    """DoRA configuration - weight-decomposed LoRA."""
    return LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        use_dora=True,  # Enable DoRA
        task_type="CAUSAL_LM",
    )

Layer-wise Learning Rates
# training/layerwise_lr.py
from torch.optim import AdamW


def get_layerwise_optimizer(model, base_lr=2e-4, lr_decay=0.9, num_layers=32):
    """Create an optimizer with layer-wise learning rate decay.

    Higher-numbered (later) layers keep learning rates close to base_lr;
    earlier layers are decayed more strongly.
    """
    params = []
    # Group trainable (LoRA) parameters with a per-layer learning rate
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Extract the layer number from the parameter name, e.g.
        # "base_model.model.layers.17.self_attn.q_proj.lora_A.weight" -> 17
        layer_num = None
        for part in name.split("."):
            if part.isdigit():
                layer_num = int(part)
                break
        # Calculate the layer-specific learning rate
        if layer_num is not None:
            lr = base_lr * (lr_decay ** (num_layers - 1 - layer_num))
        else:
            lr = base_lr
        params.append({"params": [param], "lr": lr})
    return AdamW(params, weight_decay=0.01)

Memory Optimization Tips
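Before reaching for any of these techniques, measure your baseline. A small helper for tracking peak VRAM between training phases (pure PyTorch; `report_peak_vram` is an illustrative helper, and it prints a notice instead of a measurement on CPU-only machines):

```python
import torch


def report_peak_vram(tag: str) -> float:
    """Print and return peak allocated VRAM in GB since the last reset."""
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device available")
        return 0.0
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] peak VRAM: {peak_gb:.2f} GB")
    torch.cuda.reset_peak_memory_stats()  # Start fresh for the next phase
    return peak_gb


# Example: bracket the expensive phases of a run
report_peak_vram("after model load")
# ... trainer.train() ...
report_peak_vram("after training")
```

Comparing the "after model load" and "after training" peaks tells you whether weights or optimizer/activation memory is your bottleneck, and therefore which technique below will help most.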
┌─────────────────────────────────────────────────────────────────────────────┐
│ Memory Optimization Techniques │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ 4-bit Quantization │ │ Gradient Checkpointing│ │
│ │ ~4GB VRAM │ │ Trade compute for mem │ │
│ └──────────┬───────────┘ └──────────┬───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Fit 7B on 8GB GPU │ │Reduce activation mem │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Gradient Accumulation│ │ Paged Optimizer │ │
│ │ Simulate larger batch│ │ CPU offload │ │
│ └──────────┬───────────┘ └──────────┬───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │Effective batch 16+ │ │Manage optimizer state│ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Memory Calculation Guide
| Model Size | Full FT (fp16) | LoRA (fp16) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~56GB | ~16GB | ~6GB |
| 13B | ~104GB | ~30GB | ~10GB |
| 70B | ~560GB | ~160GB | ~48GB |
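These figures come from bytes-per-parameter arithmetic. A rough sketch of the calculation (the `estimate_vram_gb` helper and its per-mode constants are illustrative approximations; real usage also depends on batch size, sequence length, and activation memory):

```python
def estimate_vram_gb(n_params_b: float, mode: str) -> float:
    """Rough VRAM estimate in GB for a model with n_params_b billion params.

    full:  fp16 weights (2 B) + fp16 grads (2 B) + Adam states (4 B) ~ 8 B/param
    lora:  frozen fp16 weights (2 B) + small adapter/grad/optimizer overhead
    qlora: 4-bit weights (~0.5 B) + adapter and dequantization overhead
    """
    bytes_per_param = {"full": 8.0, "lora": 2.3, "qlora": 0.85}[mode]
    return n_params_b * bytes_per_param


print(f"7B full FT: ~{estimate_vram_gb(7, 'full'):.0f} GB")   # ~56 GB
print(f"7B LoRA:    ~{estimate_vram_gb(7, 'lora'):.0f} GB")   # ~16 GB
print(f"7B QLoRA:   ~{estimate_vram_gb(7, 'qlora'):.0f} GB")  # ~6 GB
```

The dominant term in full fine-tuning is optimizer state, not the weights themselves, which is exactly what LoRA's frozen base model and QLoRA's 4-bit quantization eliminate.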
Common Issues and Solutions
Issue: Gradient Overflow
# Use gradient clipping and prefer bf16 over fp16
from transformers import TrainingArguments

training_args = TrainingArguments(
    max_grad_norm=0.3,  # Clip gradients
    fp16=False,  # bf16 has a wider exponent range, so it overflows less
    bf16=True,
)

Issue: Loss Not Decreasing
# Check learning rate and warmup
from transformers import TrainingArguments

training_args = TrainingArguments(
    learning_rate=2e-4,  # Typical range: 1e-4 to 5e-4
    warmup_ratio=0.03,  # Increase toward 0.1 if early training is unstable
    lr_scheduler_type="cosine",
)

Issue: Out of Memory
# Reduce memory usage
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # Effective batch size of 16
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
)

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| LoRA (Low-Rank Adaptation) | Freezes base model, trains small adapter matrices | Reduces trainable params from 7B to ~4M (0.06%) |
| Rank (r) | Dimension of the low-rank matrices (typically 8-64) | Controls capacity vs efficiency trade-off |
| Alpha (lora_alpha) | Scaling factor for LoRA updates (typically 2×r) | Adjusts adaptation strength without retraining |
| Target Modules | Which layers get LoRA adapters (q/k/v/o_proj, MLP) | More modules = more capacity but more params |
| QLoRA | LoRA + 4-bit quantization of base model | Fits 7B models on 8GB GPUs |
| NF4 (NormalFloat 4-bit) | Optimal 4-bit data type for normally-distributed weights | Better accuracy than standard INT4 |
| Double Quantization | Quantizes the quantization constants themselves | Extra ~0.5GB memory savings |
| Gradient Checkpointing | Recomputes activations during backward pass | Trades compute for memory |
| Paged Optimizer | Offloads optimizer states to CPU when needed | Handles memory spikes gracefully |
| Adapter Merging | Combines LoRA weights into base model permanently | No inference overhead, easy deployment |
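The "Adapter Merging" row is the only recap entry without code in this section, so here is a minimal numerical sketch of the merge. `W`, `A`, and `B` are toy stand-ins for a frozen base weight and a trained adapter pair; PEFT's `merge_and_unload()` applies the same update, ΔW = (alpha/r)·B·A, to each target module:

```python
import torch

torch.manual_seed(0)
d, r, alpha = 16, 4, 8
W = torch.randn(d, d)          # frozen base weight
A = torch.randn(r, d) * 0.01   # LoRA down-projection (trained)
B = torch.randn(d, r) * 0.01   # LoRA up-projection (trained)

x = torch.randn(2, d)
# Inference with an unmerged adapter: base path plus scaled low-rank path
y_adapter = x @ W.T + (alpha / r) * (x @ A.T @ B.T)
# After merging the update into the base weight
W_merged = W + (alpha / r) * (B @ A)
y_merged = x @ W_merged.T

print(torch.allclose(y_adapter, y_merged, atol=1e-5))  # True
```

Because the merged weight reproduces the adapter path exactly, a merged model runs with zero inference overhead and deploys like any ordinary checkpoint.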
Next Steps
After completing this project, consider:
- Custom Reranker - Train cross-encoders for RAG
- Knowledge Distillation - Compress fine-tuned models
- DPO Alignment - Align models with human preferences