SLM Fine-tuning
Adapt small language models for your specific domain
TL;DR
Fine-tune SLMs on custom data using QLoRA (4-bit quantization + LoRA adapters) to train on consumer GPUs with 8GB+ VRAM. Use Unsloth for 2-5x faster training. Export to GGUF format for Ollama deployment. Key formula: effective_batch = batch_size × gradient_accumulation.
Fine-tune small language models on your custom datasets to dramatically improve performance on domain-specific tasks. Learn efficient techniques like QLoRA and Unsloth that make fine-tuning possible on consumer hardware.
Project Overview
| Aspect | Details |
|---|---|
| Difficulty | Intermediate |
| Time | 6-8 hours |
| Prerequisites | Local SLM Setup, SLM Benchmarking |
| What You'll Build | Fine-tuned SLM for custom task with evaluation pipeline |
What You'll Learn
- QLoRA fine-tuning with PEFT library
- Unsloth for 2-5x faster training
- Dataset preparation and formatting
- Hyperparameter selection for SLMs
- Evaluation and iteration workflow
- Exporting models for Ollama deployment
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ SLM Fine-tuning Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────┐ │
│ │ Data Preparation │ │ Training Pipeline │ │ Evaluation │ │
│ ├─────────────────────┤ ├─────────────────────┤ ├─────────────────┤ │
│ │ • Raw Data │ │ • Base Model │ │ • Metrics │ │
│ │ • Formatter │───►│ • LoRA Adapters │───►│ • Comparison │ │
│ │ • Train/Val Split │ │ • SFTTrainer │ │ • Iteration │ │
│ └─────────────────────┘ └─────────────────────┘ └────────┬────────┘ │
│ ▲ │ │
│ │ │ │
│ └────── (retry loop) ─────┘ │
│ │
│ ┌─────────────────────┐ │
│ │ Export & Deploy │ │
│ ├─────────────────────┤ │
│ │ Merge ──► GGUF ──► Ollama │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Data Flow:
Raw Data ──► Format ──► Split ──► Train ──► Evaluate ──► Merge ──► GGUF ──► Ollama
Hardware Requirements
| Configuration | VRAM | Models | Training Time |
|---|---|---|---|
| Minimum | 8GB | 1-2B models | Slow |
| Recommended | 16GB | 3-4B models | Good |
| Optimal | 24GB+ | 7B+ models | Fast |
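These tiers can be sanity-checked with a back-of-the-envelope estimate. The constants below are rough assumptions for planning (4-bit weights cost about 0.5 bytes per parameter, plus a flat allowance for activations, adapters, and optimizer state), not measured values:

```python
def estimate_qlora_vram_gb(params_billion: float, overhead_gb: float = 2.5) -> float:
    """Very rough QLoRA VRAM estimate. The 0.5 bytes/param figure comes
    from 4-bit NF4 weights; overhead_gb is an assumed flat allowance for
    activations, LoRA adapters, and optimizer state."""
    weights_gb = params_billion * 0.5  # 4-bit weights ~= 0.5 bytes per parameter
    return weights_gb + overhead_gb

# A 7B model lands near the ~6 GB figure often quoted for QLoRA
print(f"7B: ~{estimate_qlora_vram_gb(7):.1f} GB")
print(f"3B: ~{estimate_qlora_vram_gb(3):.1f} GB")
```

Actual usage varies with sequence length, batch size, and LoRA rank, so leave yourself headroom beyond the estimate.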
Project Setup
Install Dependencies
# Create project directory
mkdir slm-finetuning && cd slm-finetuning
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install PyTorch with CUDA (adjust for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install training libraries
pip install transformers datasets peft accelerate bitsandbytes
pip install trl wandb evaluate scikit-learn
# Install Unsloth for faster training (optional but recommended)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Part 1: Dataset Preparation
Dataset Formats
SLMs typically use instruction or chat formats for fine-tuning.
# data_prep.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from datasets import Dataset, DatasetDict
import json
@dataclass
class InstructionSample:
"""Single instruction-response pair."""
instruction: str
input: str = ""
output: str = ""
system: str = ""
class DatasetFormatter:
"""Format datasets for SLM fine-tuning."""
# Common chat templates
TEMPLATES = {
"alpaca": """Below is an instruction that describes a task{input_section}. Write a response that appropriately completes the request.
### Instruction:
{instruction}
{input_block}
### Response:
{output}""",
"chatml": """<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}{input_block}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>""",
"llama3": """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system}<|eot_id|><|start_header_id|>user<|end_header_id|>
{instruction}{input_block}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{output}<|eot_id|>""",
"phi3": """<|system|>
{system}<|end|>
<|user|>
{instruction}{input_block}<|end|>
<|assistant|>
{output}<|end|>""",
}
def __init__(self, template: str = "chatml", system_prompt: str = ""):
self.template_name = template
self.template = self.TEMPLATES.get(template, self.TEMPLATES["chatml"])
self.default_system = system_prompt or "You are a helpful assistant."
def format_sample(self, sample: InstructionSample) -> str:
"""Format a single sample."""
input_section = ", along with an input that provides further context" if sample.input else ""
input_block = f"\n\n### Input:\n{sample.input}" if sample.input else ""
system = sample.system or self.default_system
return self.template.format(
instruction=sample.instruction,
input=sample.input,
input_section=input_section,
input_block=input_block,
output=sample.output,
system=system
)
def format_dataset(
self,
samples: List[InstructionSample],
text_column: str = "text"
) -> Dataset:
"""Format list of samples into HuggingFace Dataset."""
formatted = [self.format_sample(s) for s in samples]
return Dataset.from_dict({text_column: formatted})
def load_jsonl(filepath: str) -> List[InstructionSample]:
"""Load samples from JSONL file."""
samples = []
with open(filepath, 'r') as f:
for line in f:
data = json.loads(line)
samples.append(InstructionSample(
instruction=data.get("instruction", ""),
input=data.get("input", ""),
output=data.get("output", ""),
system=data.get("system", "")
))
return samples
def create_train_val_split(
samples: List[InstructionSample],
val_ratio: float = 0.1,
seed: int = 42
) -> tuple:
"""Split samples into train and validation sets."""
import random
random.seed(seed)
random.shuffle(samples)
split_idx = int(len(samples) * (1 - val_ratio))
return samples[:split_idx], samples[split_idx:]
# Example usage
if __name__ == "__main__":
# Create sample dataset
samples = [
InstructionSample(
instruction="Summarize the following text.",
input="Machine learning is a subset of artificial intelligence...",
output="Machine learning enables computers to learn from data."
),
InstructionSample(
instruction="What is the capital of France?",
output="The capital of France is Paris."
),
]
formatter = DatasetFormatter(template="chatml")
dataset = formatter.format_dataset(samples)
print(dataset[0]["text"])
Understanding Chat Templates and Dataset Formatting:
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY CHAT TEMPLATES MATTER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Raw Data: Formatted (ChatML): │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ instruction: "Summarize" │ │ <|im_start|>system │ │
│ │ input: "Long text..." │ ───► │ You are helpful<|im_end|> │ │
│ │ output: "Summary..." │ │ <|im_start|>user │ │
│ └─────────────────────────────┘ │ Summarize: Long text<|im_end|> │ │
│ │ <|im_start|>assistant │ │
│ │ Summary...<|im_end|> │ │
│ └─────────────────────────────────┘ │
│ │
│ The model learns to: │
│ 1. Recognize the start/end of each role │
│ 2. Generate after <|im_start|>assistant │
│ 3. Stop at <|im_end|> │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Common Chat Template Formats:
| Template | Models Using It | Special Tokens |
|---|---|---|
| ChatML | Qwen, Yi, many fine-tunes | <\|im_start\|>, <\|im_end\|> |
| Llama 3 | Llama 3, Llama 3.1/3.2 | <\|begin_of_text\|>, <\|eot_id\|> |
| Phi-3 | Microsoft Phi-3 family | <\|system\|>, <\|end\|> |
| Alpaca | Many older fine-tunes | ### Instruction:, ### Response: |
Critical: Always match the template to your base model. Using the wrong template causes the model to generate garbage or never stop.
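One way to avoid the mismatch is a small guard that infers the template from the model name. The substring-to-template mapping below is a hypothetical illustration, not an exhaustive list:

```python
# Hypothetical helper: infer the expected chat template from the base
# model's name. Extend the mapping for the models you actually use.
TEMPLATE_HINTS = {
    "qwen": "chatml",
    "yi": "chatml",
    "llama-3": "llama3",
    "phi-3": "phi3",
}

def guess_template(model_name: str, default: str = "chatml") -> str:
    """Pick a chat template based on substrings of the model name."""
    name = model_name.lower()
    for hint, template in TEMPLATE_HINTS.items():
        if hint in name:
            return template
    return default

print(guess_template("unsloth/Phi-3-mini-4k-instruct"))  # phi3
print(guess_template("Qwen/Qwen2.5-3B-Instruct"))        # chatml
```

In practice, the most reliable option is `tokenizer.apply_chat_template(...)` from transformers, which uses the template shipped with the model and removes the guesswork entirely.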
Create Domain-Specific Dataset
# create_dataset.py
import json
from pathlib import Path
from typing import List, Dict
from data_prep import InstructionSample, DatasetFormatter, create_train_val_split
def create_customer_support_dataset() -> List[InstructionSample]:
"""Create sample customer support dataset."""
samples = [
InstructionSample(
instruction="A customer asks about their order status. Respond helpfully.",
input="Hi, I placed an order 3 days ago but haven't received any shipping notification. Order #12345.",
output="I apologize for the delay in updating you about your order #12345. Let me check the status for you. I can see your order was processed and is currently being prepared for shipment. You should receive a shipping confirmation with tracking information within the next 24 hours. Is there anything else I can help you with?",
system="You are a helpful customer support agent. Be polite, professional, and solution-oriented."
),
InstructionSample(
instruction="Handle a refund request professionally.",
input="I want a refund. The product doesn't work as advertised.",
output="I'm sorry to hear the product didn't meet your expectations. I'd be happy to help process a refund for you. Could you please provide your order number and briefly describe the issue you experienced? This will help us improve our products and ensure a smooth refund process. If you prefer, we can also offer an exchange or store credit instead.",
system="You are a helpful customer support agent. Be polite, professional, and solution-oriented."
),
InstructionSample(
instruction="Answer a product question accurately.",
input="Does this laptop support USB-C charging?",
output="Yes, this laptop supports USB-C charging through its Thunderbolt 4 ports. You can use any USB-C PD (Power Delivery) charger with at least 65W output for optimal charging speed. The laptop also includes the standard barrel charger in the box. Would you like me to recommend any compatible USB-C chargers?",
system="You are a helpful customer support agent. Be polite, professional, and solution-oriented."
),
# Add more samples...
]
return samples
def create_code_assistant_dataset() -> List[InstructionSample]:
"""Create sample coding assistant dataset."""
samples = [
InstructionSample(
instruction="Write a Python function that checks if a number is prime.",
output='''def is_prime(n: int) -> bool:
"""Check if a number is prime."""
if n < 2:
return False
if n == 2:
return True
if n % 2 == 0:
return False
for i in range(3, int(n**0.5) + 1, 2):
if n % i == 0:
return False
return True''',
system="You are an expert Python programmer. Write clean, efficient, and well-documented code."
),
InstructionSample(
instruction="Explain this code and suggest improvements.",
input='''def sort(arr):
for i in range(len(arr)):
for j in range(len(arr)-1):
if arr[j] > arr[j+1]:
arr[j], arr[j+1] = arr[j+1], arr[j]
return arr''',
output='''This is a Bubble Sort implementation with O(n^2) time complexity.
Issues:
1. Not optimized - continues even if array is sorted
2. Inner loop does unnecessary comparisons
Improved version:
```python
def bubble_sort(arr: list) -> list:
"""Optimized bubble sort with early termination."""
n = len(arr)
for i in range(n):
swapped = False
for j in range(n - i - 1):
if arr[j] > arr[j + 1]:
arr[j], arr[j + 1] = arr[j + 1], arr[j]
swapped = True
if not swapped:
break
return arr
```
For better performance, consider using Python's built-in `sorted()` which uses Timsort with O(n log n) complexity.''',
system="You are an expert Python programmer. Write clean, efficient, and well-documented code."
),
]
return samples
def save_dataset(samples: List[InstructionSample], filepath: str):
"""Save dataset to JSONL file."""
with open(filepath, 'w') as f:
for sample in samples:
data = {
"instruction": sample.instruction,
"input": sample.input,
"output": sample.output,
"system": sample.system
}
f.write(json.dumps(data) + "\n")
print(f"Saved {len(samples)} samples to {filepath}")
if __name__ == "__main__":
# Create datasets
support_samples = create_customer_support_dataset()
code_samples = create_code_assistant_dataset()
# Combine or save separately
all_samples = support_samples + code_samples
# Split and save
train_samples, val_samples = create_train_val_split(all_samples)
Path("data").mkdir(exist_ok=True)
save_dataset(train_samples, "data/train.jsonl")
save_dataset(val_samples, "data/val.jsonl")
Part 2: QLoRA Fine-tuning with PEFT
Basic QLoRA Training
# train_qlora.py
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training,
TaskType,
)
from trl import SFTTrainer
from datasets import load_dataset
import wandb
def setup_qlora_model(
model_name: str,
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.05,
target_modules: list = None
):
"""Setup model with QLoRA configuration."""
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
torch_dtype=torch.bfloat16,
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
# Default target modules for common architectures
if target_modules is None:
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj", # MLP
]
# LoRA configuration
lora_config = LoraConfig(
r=lora_r,
lora_alpha=lora_alpha,
target_modules=target_modules,
lora_dropout=lora_dropout,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
return model
def train_model(
model_name: str = "microsoft/phi-2",
train_file: str = "data/train.jsonl",
val_file: str = "data/val.jsonl",
output_dir: str = "outputs/phi2-finetuned",
num_epochs: int = 3,
batch_size: int = 4,
learning_rate: float = 2e-4,
max_seq_length: int = 2048,
gradient_accumulation_steps: int = 4,
use_wandb: bool = False,
):
"""Fine-tune model with QLoRA."""
# Initialize wandb if requested
if use_wandb:
wandb.init(project="slm-finetuning", name=f"qlora-{model_name.split('/')[-1]}")
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Setup model
model = setup_qlora_model(model_name)
# Load datasets
train_dataset = load_dataset("json", data_files=train_file, split="train")
val_dataset = load_dataset("json", data_files=val_file, split="train")
# Format function
def format_instruction(sample):
system = sample.get("system", "You are a helpful assistant.")
instruction = sample["instruction"]
input_text = sample.get("input", "")
output = sample["output"]
if input_text:
text = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}
{input_text}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>"""
else:
text = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>"""
return {"text": text}
# Apply formatting
train_dataset = train_dataset.map(format_instruction)
val_dataset = val_dataset.map(format_instruction)
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
learning_rate=learning_rate,
weight_decay=0.01,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
bf16=True,
optim="paged_adamw_8bit",
gradient_checkpointing=True,
max_grad_norm=0.3,
report_to="wandb" if use_wandb else "none",
)
# Create trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
tokenizer=tokenizer,
dataset_text_field="text",
max_seq_length=max_seq_length,
packing=False,
)
# Train
print("Starting training...")
trainer.train()
# Save
trainer.save_model()
tokenizer.save_pretrained(output_dir)
print(f"Model saved to {output_dir}")
if use_wandb:
wandb.finish()
return trainer
if __name__ == "__main__":
trainer = train_model(
model_name="microsoft/phi-2",
num_epochs=3,
batch_size=2,
learning_rate=2e-4,
)
Understanding QLoRA: How It Enables Fine-tuning on Consumer GPUs:
┌─────────────────────────────────────────────────────────────────────────────┐
│ QLoRA = 4-bit Quantization + LoRA Adapters │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Fine-tuning: QLoRA Fine-tuning: │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ All weights trainable │ │ Base weights frozen (4-bit) │ │
│ │ 7B model = 28GB VRAM │ │ + Small LoRA adapters (FP16) │ │
│ │ Need A100 GPU │ │ 7B model = ~6GB VRAM │ │
│ └─────────────────────────────┘ │ Works on RTX 3080! │ │
│ └─────────────────────────────────┘ │
│ │
│ How LoRA Works: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Original Weight Matrix W (frozen, 4-bit): │ │
│ │ ┌─────────────────────┐ │ │
│ │ │ │ d_model × d_model │ │
│ │ │ 4096 × 4096 │ = 16M parameters │ │
│ │ │ │ │ │
│ │ └─────────────────────┘ │ │
│ │ + │ │
│ │ LoRA Adapters (trainable, FP16): │ │
│ │ ┌───────┐ ┌───────┐ │ │
│ │ │ A │ × │ B │ (4096 × 16) × (16 × 4096) │ │
│ │ │4096×16│ │16×4096│ = 131K parameters (0.8% of original!) │ │
│ │ └───────┘ └───────┘ │ │
│ │ │ │
│ │ Output = W × x + (A × B) × x × (alpha/r) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
QLoRA Configuration Explained:
| Parameter | Value | Why This Matters |
|---|---|---|
| `load_in_4bit=True` | 4-bit NF4 quantization | Reduces base model memory 4x |
| `bnb_4bit_compute_dtype=bfloat16` | Compute in BF16 | Better numerical stability than FP16 |
| `bnb_4bit_use_double_quant=True` | Quantize the quantization constants | Extra 0.4 bits/param savings |
| `r=16` | LoRA rank | Higher = more capacity, more memory |
| `lora_alpha=32` | Scaling factor (usually 2×r) | Controls magnitude of LoRA updates |
| `target_modules` | Attention + MLP layers | Where LoRA adapters are inserted |
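The 131K figure in the diagram follows directly from the two low-rank factors. A quick check of the arithmetic:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    A is (d_in x r) and B is (r x d_out)."""
    return d_in * r + r * d_out

full = 4096 * 4096                       # frozen base weights: ~16.8M
lora = lora_param_count(4096, 4096, 16)  # 131,072 trainable
print(f"LoRA params: {lora:,} ({lora / full:.1%} of the original)")
```

Doubling `r` doubles the adapter's parameter count, which is why rank is the main capacity/memory knob.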
Effective Batch Size Calculation:
effective_batch = batch_size × gradient_accumulation_steps
= 4 × 4 = 16
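The same arithmetic determines how many weight updates one epoch performs, which is handy when choosing warmup and logging steps. A minimal helper:

```python
def optimizer_steps_per_epoch(num_samples: int, batch_size: int,
                              grad_accum: int) -> int:
    """Weight updates per epoch given micro-batch size and accumulation."""
    effective_batch = batch_size * grad_accum
    # Ceiling division: a final partial batch still triggers an update
    return -(-num_samples // effective_batch)

print(optimizer_steps_per_epoch(1000, batch_size=4, grad_accum=4))  # 63
```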
This means: Update weights every 16 samples, but only hold 4 in memory at once.
Part 3: Fast Fine-tuning with Unsloth
Unsloth provides 2-5x faster training with 70% less memory.
# train_unsloth.py
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch
def train_with_unsloth(
model_name: str = "unsloth/Phi-3-mini-4k-instruct",
train_file: str = "data/train.jsonl",
val_file: str = "data/val.jsonl",
output_dir: str = "outputs/phi3-unsloth",
num_epochs: int = 3,
batch_size: int = 2,
learning_rate: float = 2e-4,
max_seq_length: int = 2048,
):
"""Fine-tune with Unsloth for faster training."""
# Load model with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=model_name,
max_seq_length=max_seq_length,
dtype=None, # Auto-detect
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=16,
lora_dropout=0, # Unsloth optimizes for 0 dropout
bias="none",
use_gradient_checkpointing="unsloth", # Unsloth optimization
random_state=42,
use_rslora=False,
loftq_config=None,
)
# Load datasets
train_dataset = load_dataset("json", data_files=train_file, split="train")
val_dataset = load_dataset("json", data_files=val_file, split="train")
# Phi-3 chat template
def format_phi3(sample):
system = sample.get("system", "You are a helpful assistant.")
instruction = sample["instruction"]
input_text = sample.get("input", "")
output = sample["output"]
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
text = f"""<|system|>
{system}<|end|>
<|user|>
{user_content}<|end|>
<|assistant|>
{output}<|end|>"""
return {"text": text}
train_dataset = train_dataset.map(format_phi3)
val_dataset = val_dataset.map(format_phi3)
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=learning_rate,
weight_decay=0.01,
warmup_steps=5,
lr_scheduler_type="linear",
logging_steps=1,
save_strategy="epoch",
evaluation_strategy="epoch",
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
optim="adamw_8bit",
seed=42,
)
# Create trainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
)
# Show GPU stats before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# Train
trainer_stats = trainer.train()
# Show final stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
# Save LoRA adapters
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
return model, tokenizer
def merge_and_save(
model,
tokenizer,
output_dir: str = "outputs/phi3-merged",
save_gguf: bool = True,
quantization: str = "q4_k_m",
):
"""Merge LoRA adapters and save."""
# Save merged 16-bit model
model.save_pretrained_merged(
output_dir,
tokenizer,
save_method="merged_16bit",
)
print(f"Merged model saved to {output_dir}")
# Save GGUF for Ollama
if save_gguf:
model.save_pretrained_gguf(
f"{output_dir}-gguf",
tokenizer,
quantization_method=quantization,
)
print(f"GGUF model saved to {output_dir}-gguf")
if __name__ == "__main__":
model, tokenizer = train_with_unsloth(
model_name="unsloth/Phi-3-mini-4k-instruct",
num_epochs=3,
)
# Merge and export
merge_and_save(model, tokenizer)
Why Unsloth is 2-5x Faster:
┌─────────────────────────────────────────────────────────────────────────────┐
│ UNSLOTH OPTIMIZATIONS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Standard PEFT/TRL: Unsloth: │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Generic PyTorch operations │ │ Custom fused CUDA kernels │ │
│ │ Multiple kernel launches │ │ Single kernel for common ops │ │
│ │ Standard attention │ │ Flash Attention 2 built-in │ │
│ │ Regular gradient checkpt │ │ "Unsloth" gradient checkpt │ │
│ └─────────────────────────────┘ └─────────────────────────────────┘ │
│ │
│ Memory Comparison (Phi-3 Mini, same batch size): │
│ ┌────────────────────┬────────────────┬────────────────────────────────┐ │
│ │ Framework │ Peak VRAM │ Tokens/Second │ │
│ ├────────────────────┼────────────────┼────────────────────────────────┤ │
│ │ Standard PEFT │ 14.2 GB │ ~1,200 │ │
│ │ Unsloth │ 5.8 GB │ ~4,800 │ │
│ │ Improvement │ 60% less │ 4x faster │ │
│ └────────────────────┴────────────────┴────────────────────────────────┘ │
│ │
│ Key Unsloth Settings: │
│ • lora_dropout=0 → Enables kernel fusion (required!) │
│ • use_gradient_checkpointing="unsloth" → Custom implementation │
│ • Automatic dtype detection → Picks best for your GPU │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Unsloth Export Options:
| Method | Output | Use Case |
|---|---|---|
| `save_pretrained` | LoRA adapters only | Continue training, share small files |
| `save_pretrained_merged` | Full merged model | HuggingFace deployment |
| `save_pretrained_gguf` | GGUF quantized | Ollama, llama.cpp deployment |
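To estimate how large each export will be on disk, multiply parameter count by an approximate bits-per-weight figure for the quantization method. The figures below are rough averages (k-quant schemes also store block scales), so treat the results as ballpark sizes, not exact file sizes:

```python
# Approximate bits per weight for common export formats (assumed averages)
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q5_k_m": 5.7, "q4_k_m": 4.8}

def gguf_size_gb(params_billion: float, method: str = "q4_k_m") -> float:
    """Rough on-disk size of a GGUF export, in GB."""
    bits = BITS_PER_WEIGHT[method]
    return params_billion * bits / 8  # bits -> bytes

print(f"3.8B q4_k_m: ~{gguf_size_gb(3.8):.1f} GB")  # Phi-3 Mini scale
print(f"7B q4_k_m:   ~{gguf_size_gb(7):.1f} GB")
```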
Part 4: Evaluation Pipeline
Evaluate Fine-tuned Model
# evaluate.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset
from typing import List, Dict, Any
import json
from tqdm import tqdm
import numpy as np
class ModelEvaluator:
"""Evaluate fine-tuned models."""
def __init__(
self,
base_model_name: str,
adapter_path: str = None,
device: str = "auto",
):
self.device = device
# Load tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(
adapter_path or base_model_name,
trust_remote_code=True
)
self.tokenizer.pad_token = self.tokenizer.eos_token
# Load model
self.model = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map=device,
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
# Load adapters if provided
if adapter_path:
self.model = PeftModel.from_pretrained(
self.model,
adapter_path,
)
print(f"Loaded adapters from {adapter_path}")
self.model.eval()
@torch.no_grad()
def generate(
self,
prompt: str,
max_new_tokens: int = 256,
temperature: float = 0.1,
top_p: float = 0.9,
) -> str:
"""Generate response for a prompt."""
inputs = self.tokenizer(
prompt,
return_tensors="pt",
truncation=True,
max_length=2048,
).to(self.model.device)
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=temperature > 0,
pad_token_id=self.tokenizer.eos_token_id,
)
response = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True
)
return response.strip()
def evaluate_dataset(
self,
test_file: str,
template: str = "chatml",
max_samples: int = None,
) -> Dict[str, Any]:
"""Evaluate on a test dataset."""
dataset = load_dataset("json", data_files=test_file, split="train")
if max_samples:
dataset = dataset.select(range(min(max_samples, len(dataset))))
results = []
for sample in tqdm(dataset, desc="Evaluating"):
# Format prompt (without output)
if template == "chatml":
system = sample.get("system", "You are a helpful assistant.")
instruction = sample["instruction"]
input_text = sample.get("input", "")
if input_text:
prompt = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}
{input_text}<|im_end|>
<|im_start|>assistant
"""
else:
prompt = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""
# Generate
prediction = self.generate(prompt)
results.append({
"instruction": sample["instruction"],
"input": sample.get("input", ""),
"expected": sample["output"],
"predicted": prediction,
})
return self._calculate_metrics(results)
def _calculate_metrics(self, results: List[Dict]) -> Dict[str, Any]:
"""Calculate evaluation metrics."""
from evaluate import load
# Exact match (for classification tasks)
exact_matches = sum(
1 for r in results
if r["expected"].strip().lower() == r["predicted"].strip().lower()
)
exact_match_acc = exact_matches / len(results)
# Contains match (looser criterion)
contains_matches = sum(
1 for r in results
if r["expected"].strip().lower() in r["predicted"].strip().lower()
)
contains_acc = contains_matches / len(results)
# BLEU score for generation tasks
try:
bleu = load("bleu")
predictions = [r["predicted"] for r in results]
references = [[r["expected"]] for r in results]
bleu_score = bleu.compute(predictions=predictions, references=references)["bleu"]
except Exception:
bleu_score = None
# ROUGE scores
try:
rouge = load("rouge")
predictions = [r["predicted"] for r in results]
references = [r["expected"] for r in results]
rouge_scores = rouge.compute(predictions=predictions, references=references)
except Exception:
rouge_scores = None
return {
"num_samples": len(results),
"exact_match_accuracy": exact_match_acc,
"contains_accuracy": contains_acc,
"bleu": bleu_score,
"rouge": rouge_scores,
"results": results,
}
def compare_models(
base_model_name: str,
adapter_path: str,
test_file: str,
max_samples: int = 50,
) -> Dict[str, Any]:
"""Compare base model vs fine-tuned model."""
print("Evaluating base model...")
base_evaluator = ModelEvaluator(base_model_name)
base_metrics = base_evaluator.evaluate_dataset(test_file, max_samples=max_samples)
print("\nEvaluating fine-tuned model...")
ft_evaluator = ModelEvaluator(base_model_name, adapter_path)
ft_metrics = ft_evaluator.evaluate_dataset(test_file, max_samples=max_samples)
# Compare
comparison = {
"base_model": {
"exact_match": base_metrics["exact_match_accuracy"],
"contains": base_metrics["contains_accuracy"],
"bleu": base_metrics["bleu"],
},
"finetuned_model": {
"exact_match": ft_metrics["exact_match_accuracy"],
"contains": ft_metrics["contains_accuracy"],
"bleu": ft_metrics["bleu"],
},
"improvement": {
"exact_match": ft_metrics["exact_match_accuracy"] - base_metrics["exact_match_accuracy"],
"contains": ft_metrics["contains_accuracy"] - base_metrics["contains_accuracy"],
}
}
print("\n" + "=" * 50)
print("COMPARISON RESULTS")
print("=" * 50)
print(f"{'Metric':<20} {'Base':<15} {'Fine-tuned':<15} {'Change':<15}")
print("-" * 65)
print(f"{'Exact Match':<20} {comparison['base_model']['exact_match']:.1%} {comparison['finetuned_model']['exact_match']:.1%} {comparison['improvement']['exact_match']:+.1%}")
print(f"{'Contains Match':<20} {comparison['base_model']['contains']:.1%} {comparison['finetuned_model']['contains']:.1%} {comparison['improvement']['contains']:+.1%}")
return comparison
if __name__ == "__main__":
# Evaluate fine-tuned model
evaluator = ModelEvaluator(
base_model_name="microsoft/phi-2",
adapter_path="outputs/phi2-finetuned",
)
metrics = evaluator.evaluate_dataset(
test_file="data/val.jsonl",
max_samples=20,
)
print(f"\nExact Match Accuracy: {metrics['exact_match_accuracy']:.1%}")
print(f"Contains Accuracy: {metrics['contains_accuracy']:.1%}")
if metrics['bleu']:
print(f"BLEU Score: {metrics['bleu']:.3f}")
Understanding Fine-tuning Evaluation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION METRICS FOR FINE-TUNED MODELS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ When to Use Each Metric: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Classification Tasks (Q&A, Yes/No): │ │
│ │ ┌───────────────────┐ │ │
│ │ │ Exact Match │ ──► "Paris" == "Paris" ✓ │ │
│ │ │ Accuracy │ "Paris" == "paris" ✗ (case-sensitive) │ │
│ │ └───────────────────┘ │ │
│ │ │ │
│ │ Generation Tasks (Summarization, Writing): │ │
│ │ ┌───────────────────┐ │ │
│ │ │ BLEU │ ──► n-gram overlap with reference │ │
│ │ │ ROUGE │ ──► recall-oriented overlap │ │
│ │ │ Contains Match │ ──► key info present (looser) │ │
│ │ └───────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Comparing Base vs Fine-tuned: │
│ │
│ Base Model: ████████████░░░░░░░░ 60% │
│ Fine-tuned: ████████████████████ 95% │
│ ▲ │
│ │ │
│ +35% improvement on domain tasks │
│ │
│ Warning Signs: │
│ • Fine-tuned worse than base → Wrong template or data format │
│ • Perfect training, poor validation → Overfitting │
│ • Good validation, poor real-world → Data distribution mismatch │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Evaluation Strategy:
| Approach | When to Use | Example |
|---|---|---|
| Exact Match | Classification, factual Q&A | "What is the capital?" → "Paris" |
| Contains Match | Key information extraction | Response includes the key fact |
| BLEU/ROUGE | Open-ended generation | Summaries, creative writing |
| Human Evaluation | Final quality check | A/B preference testing |
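The exact-match and contains-match criteria used by the evaluator above are cheap to reproduce by hand, which makes the difference between them easy to see on a toy example:

```python
def exact_match(expected: str, predicted: str) -> bool:
    """Case-insensitive equality after stripping whitespace."""
    return expected.strip().lower() == predicted.strip().lower()

def contains_match(expected: str, predicted: str) -> bool:
    """Looser criterion: the expected answer appears inside the prediction."""
    return expected.strip().lower() in predicted.strip().lower()

pairs = [
    ("Paris", "Paris"),                  # exact and contains
    ("Paris", "The capital is Paris."),  # contains only
    ("Paris", "I am not sure."),         # neither
]
exact = sum(exact_match(e, p) for e, p in pairs) / len(pairs)
contains = sum(contains_match(e, p) for e, p in pairs) / len(pairs)
print(f"exact={exact:.0%} contains={contains:.0%}")  # exact=33% contains=67%
```

Contains-match is always at least as high as exact-match; a large gap between the two usually means the model has the right facts but a different phrasing than the reference.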
Part 5: Export to Ollama
Merge and Convert to GGUF
# export_ollama.py
import subprocess
import os
from pathlib import Path
import shutil
def merge_lora_weights(
base_model: str,
adapter_path: str,
output_path: str,
):
"""Merge LoRA adapters into base model."""
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
print(f"Loading base model: {base_model}")
model = AutoModelForCausalLM.from_pretrained(
base_model,
torch_dtype=torch.float16,
device_map="cpu",
trust_remote_code=True,
)
print(f"Loading adapters: {adapter_path}")
model = PeftModel.from_pretrained(model, adapter_path)
print("Merging weights...")
model = model.merge_and_unload()
print(f"Saving merged model to {output_path}")
model.save_pretrained(output_path, safe_serialization=True)
# Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(adapter_path, trust_remote_code=True)
tokenizer.save_pretrained(output_path)
print("Merge complete!")
return output_path
def convert_to_gguf(
model_path: str,
output_path: str,
quantization: str = "q4_k_m",
    llama_cpp_path: str | None = None,
):
"""Convert HuggingFace model to GGUF format."""
if llama_cpp_path is None:
# Try to find llama.cpp in common locations
possible_paths = [
Path.home() / "llama.cpp",
Path("./llama.cpp"),
Path("/opt/llama.cpp"),
]
for p in possible_paths:
if p.exists():
llama_cpp_path = str(p)
break
if not llama_cpp_path:
print("llama.cpp not found. Please provide the path or install it.")
return None
convert_script = Path(llama_cpp_path) / "convert_hf_to_gguf.py"
quantize_binary = Path(llama_cpp_path) / "llama-quantize"
# Create output directory
output_dir = Path(output_path)
output_dir.mkdir(parents=True, exist_ok=True)
# Step 1: Convert to GGUF F16
f16_path = output_dir / "model-f16.gguf"
print(f"Converting to GGUF F16...")
cmd = [
"python", str(convert_script),
model_path,
"--outfile", str(f16_path),
"--outtype", "f16",
]
subprocess.run(cmd, check=True)
# Step 2: Quantize
quantized_path = output_dir / f"model-{quantization}.gguf"
print(f"Quantizing to {quantization}...")
cmd = [
str(quantize_binary),
str(f16_path),
str(quantized_path),
quantization.upper(),
]
subprocess.run(cmd, check=True)
# Clean up F16 if quantization successful
if quantized_path.exists():
f16_path.unlink()
print(f"GGUF model saved to {quantized_path}")
return str(quantized_path)
def create_ollama_modelfile(
gguf_path: str,
model_name: str,
system_prompt: str = "You are a helpful assistant.",
template: str = "chatml",
    output_path: str | None = None,
):
"""Create Ollama Modelfile for the custom model."""
templates = {
"chatml": '''TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>"""''',
"phi3": '''TEMPLATE """<|system|>
{{ .System }}<|end|>
<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
{{ .Response }}<|end|>"""''',
"llama3": '''TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""''',
}
modelfile_content = f'''FROM {gguf_path}
{templates.get(template, templates["chatml"])}
SYSTEM """{system_prompt}"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|end|>"
'''
output_path = output_path or "Modelfile"
with open(output_path, 'w') as f:
f.write(modelfile_content)
print(f"Modelfile saved to {output_path}")
return output_path
def register_with_ollama(modelfile_path: str, model_name: str):
"""Register the model with Ollama."""
print(f"Creating Ollama model: {model_name}")
cmd = ["ollama", "create", model_name, "-f", modelfile_path]
subprocess.run(cmd, check=True)
print(f"Model {model_name} is now available in Ollama!")
print(f"Test with: ollama run {model_name}")
def full_export_pipeline(
base_model: str,
adapter_path: str,
model_name: str,
system_prompt: str = "You are a helpful assistant.",
quantization: str = "q4_k_m",
    llama_cpp_path: str | None = None,
):
"""Complete pipeline: merge -> convert -> register."""
work_dir = Path("exports") / model_name
work_dir.mkdir(parents=True, exist_ok=True)
# Step 1: Merge
merged_path = str(work_dir / "merged")
merge_lora_weights(base_model, adapter_path, merged_path)
# Step 2: Convert to GGUF
gguf_dir = str(work_dir / "gguf")
gguf_path = convert_to_gguf(
merged_path,
gguf_dir,
quantization,
llama_cpp_path,
)
if gguf_path:
# Step 3: Create Modelfile
modelfile_path = str(work_dir / "Modelfile")
create_ollama_modelfile(
gguf_path,
model_name,
system_prompt,
output_path=modelfile_path,
)
# Step 4: Register with Ollama
register_with_ollama(modelfile_path, model_name)
print("\n" + "=" * 50)
print("EXPORT COMPLETE!")
print("=" * 50)
print(f"Model name: {model_name}")
print(f"Quantization: {quantization}")
print(f"\nRun with: ollama run {model_name}")
if __name__ == "__main__":
full_export_pipeline(
base_model="microsoft/phi-2",
adapter_path="outputs/phi2-finetuned",
model_name="phi2-custom",
system_prompt="You are a helpful customer support agent.",
quantization="q4_k_m",
    )

Understanding the Export Pipeline:
┌─────────────────────────────────────────────────────────────────────────────┐
│ FROM TRAINING TO OLLAMA DEPLOYMENT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Merge LoRA → Base Model │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Base Model (4-bit) LoRA Adapters Merged Model (FP16) │ │
│ │ ┌─────────────────┐ ┌───────────┐ ┌─────────────────┐ │ │
│ │ │ W (frozen) │ + │ A × B │ = │ W + A×B×(α/r) │ │ │
│ │ │ 4-bit weights │ │ trainable │ │ full precision │ │ │
│ │ └─────────────────┘ └───────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 2: Convert to GGUF (llama.cpp format) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ HuggingFace Format GGUF Format │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ model.safetensors│ ──► │ model-q4_k_m.gguf│ │ │
│ │ │ config.json │ convert │ (single file, │ │ │
│ │ │ tokenizer.json │ + quant │ 4-bit weights) │ │ │
│ │ └─────────────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 3: Create Modelfile + Register with Ollama │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Modelfile: │ │
│ │ FROM ./model-q4_k_m.gguf │ │
│ │ TEMPLATE "..." ◄── Must match training template! │ │
│ │ SYSTEM "..." │ │
│ │ PARAMETER temperature 0.7 │ │
│ │ │ │
│ │ $ ollama create my-model -f Modelfile │ │
│ │ $ ollama run my-model │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

GGUF Quantization Options:
| Method | Bits | Size (2B model) | Quality | Recommended For |
|---|---|---|---|---|
| Q2_K | 2.5 | ~600MB | Poor | Extreme constraints only |
| Q4_K_M | 4.5 | ~1.2GB | Near-FP16 | Best balance |
| Q5_K_M | 5.5 | ~1.5GB | Excellent | Quality-critical apps |
| Q8_0 | 8.0 | ~2.2GB | Indistinguishable | When size doesn't matter |
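The sizes in the table follow directly from parameters × bits ÷ 8. A back-of-the-envelope helper (illustrative, not part of the export script):

```python
def estimate_gguf_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in GB: parameters x bits-per-weight / 8 bits-per-byte.
    Ignores metadata and mixed-precision layers, so treat it as a lower bound."""
    return num_params * bits_per_weight / 8 / 1e9

# 2B parameters at Q4_K_M's ~4.5 effective bits per weight:
print(estimate_gguf_size_gb(2e9, 4.5))  # 1.125 -> close to the ~1.2GB above
```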
Common Pitfall: Using a different chat template in Modelfile than what the model was trained with causes poor outputs. Always match the template!
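One way to catch a template mismatch early is to render a prompt exactly as the ChatML Modelfile above would, then diff it against the string your training formatter produced. A sketch (`render_chatml` is a hypothetical helper, not part of the export script):

```python
def render_chatml(system: str, prompt: str) -> str:
    """Render a generation prompt the way the ChatML TEMPLATE above does.
    Compare this against your training formatter's output: any difference
    in special tokens or newlines will degrade the fine-tuned model."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(render_chatml("You are a helpful assistant.", "What is QLoRA?"))
```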
Part 6: Complete Training Script
All-in-One Training Pipeline
# train.py
import argparse
from pathlib import Path
import json
def main():
parser = argparse.ArgumentParser(description="SLM Fine-tuning Pipeline")
# Model arguments
parser.add_argument("--base-model", type=str, default="microsoft/phi-2",
help="Base model to fine-tune")
parser.add_argument("--output-dir", type=str, default="outputs/finetuned",
help="Output directory for checkpoints")
# Data arguments
parser.add_argument("--train-file", type=str, required=True,
help="Training data file (JSONL)")
parser.add_argument("--val-file", type=str,
help="Validation data file (JSONL)")
# Training arguments
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch-size", type=int, default=2)
parser.add_argument("--lr", type=float, default=2e-4)
parser.add_argument("--max-seq-length", type=int, default=2048)
parser.add_argument("--gradient-accumulation", type=int, default=4)
# LoRA arguments
parser.add_argument("--lora-r", type=int, default=16)
parser.add_argument("--lora-alpha", type=int, default=32)
parser.add_argument("--lora-dropout", type=float, default=0.05)
# Other arguments
parser.add_argument("--use-unsloth", action="store_true",
help="Use Unsloth for faster training")
parser.add_argument("--use-wandb", action="store_true",
help="Log to Weights & Biases")
parser.add_argument("--export-ollama", action="store_true",
help="Export to Ollama after training")
parser.add_argument("--model-name", type=str, default="custom-model",
help="Name for Ollama model")
args = parser.parse_args()
# Create output directory
Path(args.output_dir).mkdir(parents=True, exist_ok=True)
# Save config
config = vars(args)
with open(Path(args.output_dir) / "config.json", 'w') as f:
json.dump(config, f, indent=2)
# Train
if args.use_unsloth:
from train_unsloth import train_with_unsloth, merge_and_save
model, tokenizer = train_with_unsloth(
model_name=args.base_model,
train_file=args.train_file,
val_file=args.val_file,
output_dir=args.output_dir,
num_epochs=args.epochs,
batch_size=args.batch_size,
learning_rate=args.lr,
max_seq_length=args.max_seq_length,
)
if args.export_ollama:
merge_and_save(
model, tokenizer,
output_dir=f"{args.output_dir}-merged",
save_gguf=True,
)
else:
from train_qlora import train_model
trainer = train_model(
model_name=args.base_model,
train_file=args.train_file,
val_file=args.val_file,
output_dir=args.output_dir,
num_epochs=args.epochs,
batch_size=args.batch_size,
learning_rate=args.lr,
max_seq_length=args.max_seq_length,
gradient_accumulation_steps=args.gradient_accumulation,
use_wandb=args.use_wandb,
)
print("\nTraining complete!")
print(f"Model saved to: {args.output_dir}")
if __name__ == "__main__":
    main()

Usage Examples
# Basic training
python train.py \
--base-model microsoft/phi-2 \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--epochs 3
# Fast training with Unsloth
python train.py \
--base-model unsloth/Phi-3-mini-4k-instruct \
--train-file data/train.jsonl \
--use-unsloth \
--export-ollama \
--model-name my-assistant
# With Weights & Biases logging
python train.py \
--base-model microsoft/phi-2 \
--train-file data/train.jsonl \
--use-wandb \
--epochs 5 \
  --lr 1e-4

Hyperparameter Guide
| Parameter | Recommended Range | Notes |
|---|---|---|
| Learning Rate | 1e-5 to 5e-4 | Start with 2e-4 |
| Batch Size | 1-8 | Limited by VRAM |
| Gradient Accumulation | 4-16 | Effective batch = batch * accum |
| Epochs | 1-5 | More epochs risk overfitting |
| LoRA Rank (r) | 8-64 | Higher = more capacity |
| LoRA Alpha | 16-64 | Usually 2x rank |
| Max Seq Length | 512-4096 | Longer uses more memory |
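Two quantities worth computing before launching a run are the effective batch size and how many optimizer steps an epoch will take. A sketch using train.py's defaults (helper names are illustrative):

```python
def effective_batch(batch_size: int, grad_accum: int) -> int:
    """The key formula: effective_batch = batch_size x gradient_accumulation."""
    return batch_size * grad_accum

def optimizer_steps_per_epoch(num_examples: int, batch_size: int, grad_accum: int) -> int:
    """Each optimizer step consumes one effective batch of examples."""
    return num_examples // effective_batch(batch_size, grad_accum)

# With the defaults (--batch-size 2, --gradient-accumulation 4)
# and a hypothetical 1,000-example dataset:
print(effective_batch(2, 4))                  # 8
print(optimizer_steps_per_epoch(1000, 2, 4))  # 125
```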
Common Issues & Solutions
Out of Memory
# Reduce batch size
batch_size = 1
gradient_accumulation_steps = 8
# Use gradient checkpointing
training_args = TrainingArguments(
gradient_checkpointing=True,
# ...
)
# Reduce max sequence length
max_seq_length = 1024

Poor Performance After Fine-tuning
- Check data quality - Remove duplicates and errors
- Increase training data - More examples usually help
- Adjust learning rate - Try 5x lower or higher
- Add more epochs - Ensure model converges
- Verify template matching - Use same template as base model
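The overfitting warning sign from earlier ("perfect training, poor validation") can be checked mechanically from the trainer's loss history. A sketch, assuming you have collected per-step train/validation loss records (the helper and data are hypothetical):

```python
def best_checkpoint(history: list[dict]) -> tuple[int, bool]:
    """Return the step with the lowest validation loss, plus a flag that is
    True when the train/val gap widened over training (an overfitting signal)."""
    best = min(history, key=lambda h: h["val_loss"])
    first_gap = history[0]["val_loss"] - history[0]["train_loss"]
    last_gap = history[-1]["val_loss"] - history[-1]["train_loss"]
    return best["step"], last_gap > first_gap

history = [
    {"step": 100, "train_loss": 1.20, "val_loss": 1.25},
    {"step": 200, "train_loss": 0.80, "val_loss": 0.95},
    {"step": 300, "train_loss": 0.40, "val_loss": 1.10},  # val loss rising
]
print(best_checkpoint(history))  # (200, True): resume from the step-200 checkpoint
```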
Exercises
- Domain Adaptation: Fine-tune a model on legal or medical text
- Multi-task Learning: Train on multiple task types simultaneously
- Hyperparameter Search: Implement automated hyperparameter tuning
- Evaluation Suite: Create comprehensive task-specific evaluation
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| QLoRA | 4-bit quantization + LoRA adapters | Train 7B models on 8GB VRAM |
| LoRA Rank (r) | Size of low-rank matrices (8-64) | Higher = more capacity, more memory |
| LoRA Alpha | Scaling factor (usually 2×rank) | Controls update magnitude |
| Effective Batch | batch_size × gradient_accumulation | Stabilizes training without more memory |
| Unsloth | Optimized training library | 2-5x faster, 70% less memory |
| Chat Template | Format for instruction/response pairs | Must match base model's expected format |
| SFTTrainer | Supervised fine-tuning trainer | Handles chat format, packing, loss masking |
| Gradient Checkpointing | Trade compute for memory | Enables larger models on limited VRAM |
| GGUF Export | Quantized format for llama.cpp/Ollama | Deploy fine-tuned models locally |
| Validation Loss | Loss on held-out data | Detect overfitting, choose best checkpoint |
Next Steps
- SLM-Powered RAG - Combine fine-tuned models with retrieval
- Edge Deployment - Deploy on mobile and edge devices
- SLM Agents - Build agentic systems with SLMs