LoRA, QLoRA, and adapter methods for efficient fine-tuning with the PEFT library

Fine-Tuning with PEFT

TL;DR

PEFT (Parameter-Efficient Fine-Tuning) lets you fine-tune large models by updating only 0.1-1% of parameters. Learn LoRA, QLoRA (4-bit), and Prefix Tuning using the peft library, with bitsandbytes for quantization and safetensors for safe weight storage.

What You'll Learn

LoRA: Low-Rank Adaptation for efficient fine-tuning
QLoRA: 4-bit quantization + LoRA for minimal VRAM
Prefix Tuning and Prompt Tuning methods
BitsAndBytes 4-bit and 8-bit quantization
Adapter merging and saving with safetensors
Publishing adapters to HuggingFace Hub

Why Parameter-Efficient Fine-Tuning?

Full fine-tuning of a 7B parameter model requires ~64 GB of GPU memory (model weights + optimizer states + gradients in fp16). That means a single A100 80GB is the minimum -- costing $2-4/hour in the cloud. PEFT methods like LoRA reduce this to ~6 GB by freezing the base model and training only small adapter matrices (0.1% of parameters). This means you can fine-tune Llama 3.1 8B on a free Colab T4 GPU. The adapters are tiny (5-50 MB vs 14 GB for full weights), making it practical to maintain dozens of task-specific adapters that share a single base model.

Property	Value
Difficulty	Intermediate
Time	~6 hours
Lines of Code	~400
Prerequisites	Pipelines & Hub, Tokenizers, basic PyTorch

Tech Stack

Component	Technology	Why
PEFT Methods	`peft`	Unified API for LoRA, Prefix Tuning, and more
Base Models	`transformers`	Load and configure any pretrained model
Quantization	`bitsandbytes`	4-bit NF4 quantization for extreme memory savings
Storage	`safetensors`	Secure, fast model serialization
Hub	`huggingface_hub`	Share adapters with the community
Python	3.10+	Type hint support

Architecture

PEFT Fine-Tuning Approaches

Full Fine-Tuning

Update ALL 7B parameters. VRAM: ~28 GB (fp16). Storage: 14 GB per checkpoint.

LoRA (Low-Rank Adaptation)

Freeze base model, train only LoRA adapters (0.1-1% of params). VRAM: ~10 GB (fp16). Storage: 5-50 MB per adapter.

QLoRA (Quantized LoRA)

Recommended

4-bit quantized base + LoRA adapters in fp16. VRAM: ~6 GB (4-bit). Storage: 5-50 MB per adapter.

LoRA Math: W' = W + BA where B is [d, r] and A is [r, d]. Instead of updating a d x d matrix, learn two small r x d matrices. With r=16 and d=4096: 131K params instead of 16M (125x reduction).

Project Structure

fine-tuning-peft/
├── src/
│   ├── __init__.py
│   ├── lora.py               # LoRA fine-tuning
│   ├── qlora.py              # QLoRA (4-bit) fine-tuning
│   ├── prefix_tuning.py      # Prefix and prompt tuning
│   ├── adapter_ops.py        # Merge, save, push adapters
│   └── data_prep.py          # Dataset preparation for fine-tuning
├── configs/
│   └── training_config.yaml
├── examples/
│   └── finetune_llama.py
├── requirements.txt
└── README.md

Implementation

Step 1: Dependencies

requirements.txt

peft>=0.11.0
transformers>=4.40.0
bitsandbytes>=0.43.0
safetensors>=0.4.0
datasets>=2.19.0
huggingface_hub>=0.23.0
torch>=2.0.0
trl>=0.9.0
accelerate>=0.30.0

Step 2: LoRA Fine-Tuning

src/lora.py

"""LoRA fine-tuning with the PEFT library."""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    PeftModel,
)
from datasets import Dataset


def setup_lora(
    model_name: str = "meta-llama/Llama-3.1-8B",
    lora_r: int = 16,
    lora_alpha: int = 32,
    lora_dropout: float = 0.05,
    target_modules: list[str] | None = None,
):
    """
    Set up a model with LoRA adapters.

    Args:
        model_name: Base model from HuggingFace Hub
        lora_r: LoRA rank — lower = fewer params, less capacity
        lora_alpha: LoRA scaling factor — controls adapter strength
        lora_dropout: Dropout on LoRA layers for regularization
        target_modules: Which layers to add LoRA to
    """
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Default target modules for transformer models
    if target_modules is None:
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ]

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=target_modules,
        bias="none",
    )

    # Apply LoRA to model
    model = get_peft_model(model, lora_config)

    # Print trainable parameters
    model.print_trainable_parameters()
    # Example output: trainable params: 6,553,600 || all params: 8,030,261,248
    #                 || trainable%: 0.0816

    return model, tokenizer


def train_lora(
    model,
    tokenizer,
    train_dataset: Dataset,
    eval_dataset: Dataset | None = None,
    output_dir: str = "models/lora-adapter",
    epochs: int = 3,
    batch_size: int = 4,
    learning_rate: float = 2e-4,
    max_length: int = 512,
):
    """Train the LoRA adapter."""

    # Tokenize dataset
    def tokenize(example):
        result = tokenizer(
            example["text"],
            truncation=True,
            max_length=max_length,
            padding="max_length",
        )
        result["labels"] = result["input_ids"].copy()
        return result

    train_tokenized = train_dataset.map(tokenize, remove_columns=["text"])
    eval_tokenized = eval_dataset.map(tokenize, remove_columns=["text"]) if eval_dataset else None

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=4,
        learning_rate=learning_rate,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        fp16=True,
        logging_steps=10,
        eval_strategy="steps" if eval_tokenized else "no",
        eval_steps=100,
        save_strategy="steps",
        save_steps=200,
        save_total_limit=3,
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_tokenized,
        eval_dataset=eval_tokenized,
    )

    trainer.train()

    # Save the adapter (NOT the full model — just the LoRA weights)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    return trainer

LoRA Hyperparameter Guide:

LoRA Configuration — Rank (r)

r = 4 (0.02% params)

Minimal adaptation — least capacity, lowest VRAM

r = 16 (0.08% params)

Recommended

Good balance — recommended default for most tasks

r = 64 (0.3% params)

High capacity — near full fine-tuning quality

r = 256 (1.2% params)

Diminishing returns — more VRAM for marginal gains

Alpha: Effective scaling = alpha / r. Rule of thumb: alpha = 2 x r (so scaling = 2.0).

LoRA Configuration — Target Modules

Attention only

q_proj, v_proj — minimum viable adaptation

All attention

Recommended

q_proj, k_proj, v_proj, o_proj — recommended for most tasks

+ MLP layers

+ gate_proj, up_proj, down_proj — maximum adaptation

Parameter	Recommended	Effect of Increasing
`r`	16	More adapter capacity, more VRAM
`alpha`	32 (2×r)	Stronger adapter influence
`dropout`	0.05	More regularization
`target_modules`	All attention + MLP	More layers adapted, more VRAM

Step 3: QLoRA (4-bit)

src/qlora.py

"""QLoRA: LoRA with 4-bit quantization for minimal VRAM usage."""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType,
)


def setup_qlora(
    model_name: str = "meta-llama/Llama-3.1-8B",
    lora_r: int = 16,
    lora_alpha: int = 32,
):
    """
    Set up a model with QLoRA (4-bit quantized base + LoRA adapters).

    QLoRA reduces VRAM from ~28GB (full) to ~6GB (4-bit + LoRA).
    The base model is frozen in 4-bit, while LoRA adapters train in fp16.
    """
    # BitsAndBytes 4-bit quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",          # NormalFloat4 — better than int4
        bnb_4bit_compute_dtype=torch.float16, # Compute in fp16 for LoRA
        bnb_4bit_use_double_quant=True,      # Quantize the quantization constants
    )

    # Load model in 4-bit
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Prepare model for k-bit training
    # (handles gradient checkpointing and layer norm casting)
    model = prepare_model_for_kbit_training(model)

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=0.05,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        bias="none",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    return model, tokenizer

Understanding the QLoRA Setup:

The BitsAndBytesConfig is the key to QLoRA's memory savings. load_in_4bit=True quantizes every weight to 4 bits on load. bnb_4bit_quant_type="nf4" selects NormalFloat4, a data type designed for normally-distributed neural network weights (better than uniform int4). bnb_4bit_compute_dtype=torch.float16 means the 4-bit weights are dequantized to fp16 on-the-fly during computation -- the LoRA adapters then compute gradients in fp16 while the frozen base stays in 4-bit. bnb_4bit_use_double_quant=True quantizes the quantization constants themselves, saving an additional ~0.4 bits per parameter. The prepare_model_for_kbit_training() call handles two critical details: it enables gradient checkpointing (recompute activations during backward to save memory) and casts LayerNorm layers to fp32 for training stability.

QLoRA Memory Savings:

QLoRA Memory Comparison (Llama 3.1 8B)

Full Fine-tuning (fp16)

Model weights: 16 GB + Optimizer states: 32 GB (Adam: 2x model + momentum) + Gradients: 16 GB = ~64 GB total. Needs A100 80GB.

LoRA (fp16 base)

Model weights: 16 GB (frozen) + LoRA adapters: 0.05 GB + Optimizer: 0.1 GB = ~18 GB total. Needs A100 40GB.

QLoRA (4-bit base)

Recommended

Model weights: 4 GB (4-bit quantized) + LoRA adapters: 0.05 GB (fp16) + Optimizer: 0.1 GB = ~6 GB total. Runs on RTX 3090 / Colab T4.

nf4 (NormalFloat4): Quantization format optimized for normally-distributed weights. Better than uniform int4. Double quantization: Quantizes the quantization constants themselves, saving ~0.4 bits per parameter.

Step 4: Prefix Tuning

src/prefix_tuning.py

"""Prefix Tuning and Prompt Tuning with PEFT."""

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType


def setup_prefix_tuning(
    model_name: str = "gpt2",
    num_virtual_tokens: int = 20,
):
    """
    Set up Prefix Tuning.

    Prefix Tuning prepends learnable "virtual tokens" to the
    key and value states of every attention layer.

    Unlike LoRA (which modifies weight matrices), Prefix Tuning
    adds new information to the attention context.
    """
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    config = PrefixTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=num_virtual_tokens,
        # prefix_projection=True uses an MLP to generate the prefix
        # (more parameters but often better quality)
        prefix_projection=True,
        encoder_hidden_size=512,
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()

    return model, tokenizer

PEFT Method Comparison:

Method	What It Modifies	Trainable Params	Best For
LoRA	Weight matrices (W + BA)	0.1-1%	General fine-tuning
QLoRA	Same as LoRA + 4-bit base	0.1-1%	Low-VRAM fine-tuning
Prefix Tuning	Attention key/value context	~0.1%	Task-specific prompts
Prompt Tuning	Input embeddings only	~0.01%	Simple task adaptation
IA3	Scale activations	~0.01%	Few-shot adaptation

Step 5: Adapter Operations

src/adapter_ops.py

"""Adapter management: merge, save, load, and push to Hub."""

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
from safetensors.torch import save_file, load_file


def load_adapter(
    base_model_name: str,
    adapter_path: str,
) -> tuple:
    """Load a base model and apply a saved adapter."""
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    # Load and apply adapter
    model = PeftModel.from_pretrained(model, adapter_path)

    return model, tokenizer


def merge_adapter(
    base_model_name: str,
    adapter_path: str,
    output_path: str,
):
    """
    Merge LoRA adapter into the base model.

    After merging, the model runs at full speed (no adapter overhead)
    but you lose the ability to swap adapters.
    """
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    model = PeftModel.from_pretrained(model, adapter_path)

    # Merge adapter weights into base model
    model = model.merge_and_unload()

    # Save merged model
    model.save_pretrained(output_path, safe_serialization=True)
    tokenizer.save_pretrained(output_path)

    print(f"Merged model saved to {output_path}")


def push_adapter_to_hub(
    adapter_path: str,
    repo_id: str,
    private: bool = False,
):
    """Push a trained adapter to HuggingFace Hub."""
    from peft import PeftModel, PeftConfig
    from huggingface_hub import HfApi

    config = PeftConfig.from_pretrained(adapter_path)

    api = HfApi()
    api.upload_folder(
        folder_path=adapter_path,
        repo_id=repo_id,
        repo_type="model",
    )

    print(f"Adapter pushed to https://huggingface.co/{repo_id}")
    print(f"Base model: {config.base_model_name_or_path}")


def explain_safetensors():
    """
    Why safetensors over pickle (.bin)?

    1. Security: No arbitrary code execution (pickle can run malicious code)
    2. Speed: Memory-mapped loading (2-5x faster for large models)
    3. Size: Same as pickle (no compression difference)
    4. Lazy loading: Load specific tensors without loading the full file

    transformers uses safetensors by default since v4.35.
    """
    # Save tensors
    tensors = {
        "weight": torch.randn(768, 768),
        "bias": torch.randn(768),
    }
    save_file(tensors, "model.safetensors")

    # Load tensors (memory-mapped)
    loaded = load_file("model.safetensors")
    print(f"Loaded keys: {list(loaded.keys())}")

Step 6: Complete Fine-tuning Example

examples/finetune_llama.py

"""Complete example: Fine-tune Llama 3.1 with QLoRA."""

from datasets import load_dataset
from src.qlora import setup_qlora
from trl import SFTTrainer, SFTConfig


def main():
    # 1. Load model with QLoRA
    model, tokenizer = setup_qlora(
        model_name="meta-llama/Llama-3.1-8B",
        lora_r=16,
        lora_alpha=32,
    )

    # 2. Load and format dataset
    dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")

    def format_prompt(example):
        if example["input"]:
            text = (
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}"
            )
        else:
            text = (
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
            )
        return {"text": text}

    dataset = dataset.map(format_prompt)

    # 3. Train with SFTTrainer (from TRL — simplifies the training loop)
    training_config = SFTConfig(
        output_dir="models/llama-qlora",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        fp16=True,
        logging_steps=10,
        save_strategy="steps",
        save_steps=200,
        max_seq_length=512,
        dataset_text_field="text",
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_config,
        train_dataset=dataset,
    )

    trainer.train()

    # 4. Save adapter
    model.save_pretrained("models/llama-qlora/final")
    print("Training complete! Adapter saved.")


if __name__ == "__main__":
    main()

Running the Project

# Install dependencies
pip install -r requirements.txt

# Fine-tune with QLoRA
python examples/finetune_llama.py

# Merge adapter into base model
python -c "
from src.adapter_ops import merge_adapter
merge_adapter(
    base_model_name='meta-llama/Llama-3.1-8B',
    adapter_path='models/llama-qlora/final',
    output_path='models/llama-merged',
)
"

# Push adapter to Hub
python -c "
from src.adapter_ops import push_adapter_to_hub
push_adapter_to_hub(
    adapter_path='models/llama-qlora/final',
    repo_id='your-username/llama-qlora-adapter',
)
"

Key Concepts Recap

Concept	What It Is	Why It Matters
LoRA	Low-rank decomposition of weight updates	Train 0.1% of parameters instead of 100%
QLoRA	4-bit quantized base + LoRA	Fine-tune 8B models on consumer GPUs
PEFT	Library for parameter-efficient methods	Unified API for LoRA, Prefix, Prompt tuning
NF4	NormalFloat 4-bit quantization	Optimal for normally-distributed weights
safetensors	Secure tensor serialization format	Safe, fast, memory-mapped model loading
Adapter Merging	Fold LoRA into base weights	Full speed inference after training
SFTTrainer	Supervised fine-tuning trainer from TRL	Simplifies instruction fine-tuning

Next Steps

Model Evaluation & Benchmarks — Evaluate your fine-tuned model
Preference Alignment with TRL — Align models with human preferences

Fine-Tuning with PEFT

TL;DR

What You'll Learn

LoRA: Low-Rank Adaptation for efficient fine-tuning
QLoRA: 4-bit quantization + LoRA for minimal VRAM
Prefix Tuning and Prompt Tuning methods
BitsAndBytes 4-bit and 8-bit quantization
Adapter merging and saving with safetensors
Publishing adapters to HuggingFace Hub

Why Parameter-Efficient Fine-Tuning?

Property	Value
Difficulty	Intermediate
Time	~6 hours
Lines of Code	~400
Prerequisites	Pipelines & Hub, Tokenizers, basic PyTorch

Tech Stack

Component	Technology	Why
PEFT Methods	`peft`	Unified API for LoRA, Prefix Tuning, and more
Base Models	`transformers`	Load and configure any pretrained model
Quantization	`bitsandbytes`	4-bit NF4 quantization for extreme memory savings
Storage	`safetensors`	Secure, fast model serialization
Hub	`huggingface_hub`	Share adapters with the community
Python	3.10+	Type hint support

Architecture

PEFT Fine-Tuning Approaches

Full Fine-Tuning

Update ALL 7B parameters. VRAM: ~28 GB (fp16). Storage: 14 GB per checkpoint.

LoRA (Low-Rank Adaptation)

Freeze base model, train only LoRA adapters (0.1-1% of params). VRAM: ~10 GB (fp16). Storage: 5-50 MB per adapter.

QLoRA (Quantized LoRA)

Recommended

4-bit quantized base + LoRA adapters in fp16. VRAM: ~6 GB (4-bit). Storage: 5-50 MB per adapter.

LoRA Math: W' = W + BA where B is [d, r] and A is [r, d]. Instead of updating a d x d matrix, learn two small r x d matrices. With r=16 and d=4096: 131K params instead of 16M (125x reduction).

Project Structure

fine-tuning-peft/
├── src/
│   ├── __init__.py
│   ├── lora.py               # LoRA fine-tuning
│   ├── qlora.py              # QLoRA (4-bit) fine-tuning
│   ├── prefix_tuning.py      # Prefix and prompt tuning
│   ├── adapter_ops.py        # Merge, save, push adapters
│   └── data_prep.py          # Dataset preparation for fine-tuning
├── configs/
│   └── training_config.yaml
├── examples/
│   └── finetune_llama.py
├── requirements.txt
└── README.md

Implementation

Step 1: Dependencies

requirements.txt

peft>=0.11.0
transformers>=4.40.0
bitsandbytes>=0.43.0
safetensors>=0.4.0
datasets>=2.19.0
huggingface_hub>=0.23.0
torch>=2.0.0
trl>=0.9.0
accelerate>=0.30.0

Step 2: LoRA Fine-Tuning

src/lora.py

"""LoRA fine-tuning with the PEFT library."""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    PeftModel,
)
from datasets import Dataset


def setup_lora(
    model_name: str = "meta-llama/Llama-3.1-8B",
    lora_r: int = 16,
    lora_alpha: int = 32,
    lora_dropout: float = 0.05,
    target_modules: list[str] | None = None,
):
    """
    Set up a model with LoRA adapters.

    Args:
        model_name: Base model from HuggingFace Hub
        lora_r: LoRA rank — lower = fewer params, less capacity
        lora_alpha: LoRA scaling factor — controls adapter strength
        lora_dropout: Dropout on LoRA layers for regularization
        target_modules: Which layers to add LoRA to
    """
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Default target modules for transformer models
    if target_modules is None:
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ]

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=target_modules,
        bias="none",
    )

    # Apply LoRA to model
    model = get_peft_model(model, lora_config)

    # Print trainable parameters
    model.print_trainable_parameters()
    # Example output: trainable params: 6,553,600 || all params: 8,030,261,248
    #                 || trainable%: 0.0816

    return model, tokenizer


def train_lora(
    model,
    tokenizer,
    train_dataset: Dataset,
    eval_dataset: Dataset | None = None,
    output_dir: str = "models/lora-adapter",
    epochs: int = 3,
    batch_size: int = 4,
    learning_rate: float = 2e-4,
    max_length: int = 512,
):
    """Train the LoRA adapter."""

    # Tokenize dataset
    def tokenize(example):
        result = tokenizer(
            example["text"],
            truncation=True,
            max_length=max_length,
            padding="max_length",
        )
        result["labels"] = result["input_ids"].copy()
        return result

    train_tokenized = train_dataset.map(tokenize, remove_columns=["text"])
    eval_tokenized = eval_dataset.map(tokenize, remove_columns=["text"]) if eval_dataset else None

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=4,
        learning_rate=learning_rate,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        fp16=True,
        logging_steps=10,
        eval_strategy="steps" if eval_tokenized else "no",
        eval_steps=100,
        save_strategy="steps",
        save_steps=200,
        save_total_limit=3,
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_tokenized,
        eval_dataset=eval_tokenized,
    )

    trainer.train()

    # Save the adapter (NOT the full model — just the LoRA weights)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    return trainer

LoRA Hyperparameter Guide:

LoRA Configuration — Rank (r)

r = 4 (0.02% params)

Minimal adaptation — least capacity, lowest VRAM

r = 16 (0.08% params)

Recommended

Good balance — recommended default for most tasks

r = 64 (0.3% params)

High capacity — near full fine-tuning quality

r = 256 (1.2% params)

Diminishing returns — more VRAM for marginal gains

Alpha: Effective scaling = alpha / r. Rule of thumb: alpha = 2 x r (so scaling = 2.0).

LoRA Configuration — Target Modules

Attention only

q_proj, v_proj — minimum viable adaptation

All attention

Recommended

q_proj, k_proj, v_proj, o_proj — recommended for most tasks

+ MLP layers

+ gate_proj, up_proj, down_proj — maximum adaptation

Parameter	Recommended	Effect of Increasing
`r`	16	More adapter capacity, more VRAM
`alpha`	32 (2×r)	Stronger adapter influence
`dropout`	0.05	More regularization
`target_modules`	All attention + MLP	More layers adapted, more VRAM

Step 3: QLoRA (4-bit)

src/qlora.py

"""QLoRA: LoRA with 4-bit quantization for minimal VRAM usage."""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType,
)


def setup_qlora(
    model_name: str = "meta-llama/Llama-3.1-8B",
    lora_r: int = 16,
    lora_alpha: int = 32,
):
    """
    Set up a model with QLoRA (4-bit quantized base + LoRA adapters).

    QLoRA reduces VRAM from ~28GB (full) to ~6GB (4-bit + LoRA).
    The base model is frozen in 4-bit, while LoRA adapters train in fp16.
    """
    # BitsAndBytes 4-bit quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",          # NormalFloat4 — better than int4
        bnb_4bit_compute_dtype=torch.float16, # Compute in fp16 for LoRA
        bnb_4bit_use_double_quant=True,      # Quantize the quantization constants
    )

    # Load model in 4-bit
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Prepare model for k-bit training
    # (handles gradient checkpointing and layer norm casting)
    model = prepare_model_for_kbit_training(model)

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=0.05,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        bias="none",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    return model, tokenizer

Understanding the QLoRA Setup:

QLoRA Memory Savings:

QLoRA Memory Comparison (Llama 3.1 8B)

Full Fine-tuning (fp16)

Model weights: 16 GB + Optimizer states: 32 GB (Adam: 2x model + momentum) + Gradients: 16 GB = ~64 GB total. Needs A100 80GB.

LoRA (fp16 base)

Model weights: 16 GB (frozen) + LoRA adapters: 0.05 GB + Optimizer: 0.1 GB = ~18 GB total. Needs A100 40GB.

QLoRA (4-bit base)

Recommended

Model weights: 4 GB (4-bit quantized) + LoRA adapters: 0.05 GB (fp16) + Optimizer: 0.1 GB = ~6 GB total. Runs on RTX 3090 / Colab T4.

Step 4: Prefix Tuning

src/prefix_tuning.py

"""Prefix Tuning and Prompt Tuning with PEFT."""

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType


def setup_prefix_tuning(
    model_name: str = "gpt2",
    num_virtual_tokens: int = 20,
):
    """
    Set up Prefix Tuning.

    Prefix Tuning prepends learnable "virtual tokens" to the
    key and value states of every attention layer.

    Unlike LoRA (which modifies weight matrices), Prefix Tuning
    adds new information to the attention context.
    """
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    config = PrefixTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=num_virtual_tokens,
        # prefix_projection=True uses an MLP to generate the prefix
        # (more parameters but often better quality)
        prefix_projection=True,
        encoder_hidden_size=512,
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()

    return model, tokenizer

PEFT Method Comparison:

Method	What It Modifies	Trainable Params	Best For
LoRA	Weight matrices (W + BA)	0.1-1%	General fine-tuning
QLoRA	Same as LoRA + 4-bit base	0.1-1%	Low-VRAM fine-tuning
Prefix Tuning	Attention key/value context	~0.1%	Task-specific prompts
Prompt Tuning	Input embeddings only	~0.01%	Simple task adaptation
IA3	Scale activations	~0.01%	Few-shot adaptation

Step 5: Adapter Operations

src/adapter_ops.py

"""Adapter management: merge, save, load, and push to Hub."""

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
from safetensors.torch import save_file, load_file


def load_adapter(
    base_model_name: str,
    adapter_path: str,
) -> tuple:
    """Load a base model and apply a saved adapter."""
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    # Load and apply adapter
    model = PeftModel.from_pretrained(model, adapter_path)

    return model, tokenizer


def merge_adapter(
    base_model_name: str,
    adapter_path: str,
    output_path: str,
):
    """
    Merge LoRA adapter into the base model.

    After merging, the model runs at full speed (no adapter overhead)
    but you lose the ability to swap adapters.
    """
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    model = PeftModel.from_pretrained(model, adapter_path)

    # Merge adapter weights into base model
    model = model.merge_and_unload()

    # Save merged model
    model.save_pretrained(output_path, safe_serialization=True)
    tokenizer.save_pretrained(output_path)

    print(f"Merged model saved to {output_path}")


def push_adapter_to_hub(
    adapter_path: str,
    repo_id: str,
    private: bool = False,
):
    """Push a trained adapter to HuggingFace Hub."""
    from peft import PeftModel, PeftConfig
    from huggingface_hub import HfApi

    config = PeftConfig.from_pretrained(adapter_path)

    api = HfApi()
    api.upload_folder(
        folder_path=adapter_path,
        repo_id=repo_id,
        repo_type="model",
    )

    print(f"Adapter pushed to https://huggingface.co/{repo_id}")
    print(f"Base model: {config.base_model_name_or_path}")


def explain_safetensors():
    """
    Why safetensors over pickle (.bin)?

    1. Security: No arbitrary code execution (pickle can run malicious code)
    2. Speed: Memory-mapped loading (2-5x faster for large models)
    3. Size: Same as pickle (no compression difference)
    4. Lazy loading: Load specific tensors without loading the full file

    transformers uses safetensors by default since v4.35.
    """
    # Save tensors
    tensors = {
        "weight": torch.randn(768, 768),
        "bias": torch.randn(768),
    }
    save_file(tensors, "model.safetensors")

    # Load tensors (memory-mapped)
    loaded = load_file("model.safetensors")
    print(f"Loaded keys: {list(loaded.keys())}")

Step 6: Complete Fine-tuning Example

examples/finetune_llama.py

"""Complete example: Fine-tune Llama 3.1 with QLoRA."""

from datasets import load_dataset
from src.qlora import setup_qlora
from trl import SFTTrainer, SFTConfig


def main():
    # 1. Load model with QLoRA
    model, tokenizer = setup_qlora(
        model_name="meta-llama/Llama-3.1-8B",
        lora_r=16,
        lora_alpha=32,
    )

    # 2. Load and format dataset
    dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")

    def format_prompt(example):
        if example["input"]:
            text = (
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}"
            )
        else:
            text = (
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
            )
        return {"text": text}

    dataset = dataset.map(format_prompt)

    # 3. Train with SFTTrainer (from TRL — simplifies the training loop)
    training_config = SFTConfig(
        output_dir="models/llama-qlora",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        fp16=True,
        logging_steps=10,
        save_strategy="steps",
        save_steps=200,
        max_seq_length=512,
        dataset_text_field="text",
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_config,
        train_dataset=dataset,
    )

    trainer.train()

    # 4. Save adapter
    model.save_pretrained("models/llama-qlora/final")
    print("Training complete! Adapter saved.")


if __name__ == "__main__":
    main()

Running the Project

# Install dependencies
pip install -r requirements.txt

# Fine-tune with QLoRA
python examples/finetune_llama.py

# Merge adapter into base model
python -c "
from src.adapter_ops import merge_adapter
merge_adapter(
    base_model_name='meta-llama/Llama-3.1-8B',
    adapter_path='models/llama-qlora/final',
    output_path='models/llama-merged',
)
"

# Push adapter to Hub
python -c "
from src.adapter_ops import push_adapter_to_hub
push_adapter_to_hub(
    adapter_path='models/llama-qlora/final',
    repo_id='your-username/llama-qlora-adapter',
)
"

Key Concepts Recap

Concept	What It Is	Why It Matters
LoRA	Low-rank decomposition of weight updates	Train 0.1% of parameters instead of 100%
QLoRA	4-bit quantized base + LoRA	Fine-tune 8B models on consumer GPUs
PEFT	Library for parameter-efficient methods	Unified API for LoRA, Prefix, Prompt tuning
NF4	NormalFloat 4-bit quantization	Optimal for normally-distributed weights
safetensors	Secure tensor serialization format	Safe, fast, memory-mapped model loading
Adapter Merging	Fold LoRA into base weights	Full speed inference after training
SFTTrainer	Supervised fine-tuning trainer from TRL	Simplifies instruction fine-tuning

Next Steps

Model Evaluation & Benchmarks — Evaluate your fine-tuned model
Preference Alignment with TRL — Align models with human preferences

Fine-Tuning with PEFT

Fine-Tuning with PEFT

What You'll Learn

Why Parameter-Efficient Fine-Tuning?

Tech Stack

Architecture

Project Structure

Implementation

Step 1: Dependencies

Step 2: LoRA Fine-Tuning

Step 3: QLoRA (4-bit)

Step 4: Prefix Tuning

Step 5: Adapter Operations

Step 6: Complete Fine-tuning Example

Running the Project

Key Concepts Recap

Next Steps

On this page

Fine-Tuning with PEFT

Fine-Tuning with PEFT

What You'll Learn

Why Parameter-Efficient Fine-Tuning?

Tech Stack

Architecture

Project Structure

Implementation

Step 1: Dependencies

Step 2: LoRA Fine-Tuning

Step 3: QLoRA (4-bit)

Step 4: Prefix Tuning

Step 5: Adapter Operations

Step 6: Complete Fine-tuning Example

Running the Project

Key Concepts Recap

Next Steps

On this page