SLM Fine-tuning
Adapt small language models for your specific domain
TL;DR
Fine-tune SLMs on custom data using QLoRA (4-bit quantization + LoRA adapters) to train on consumer GPUs with 8GB+ VRAM. Use Unsloth for 2-5x faster training. Export to GGUF format for Ollama deployment. Key formula: effective_batch = batch_size × gradient_accumulation.
Fine-tune small language models on your custom datasets to dramatically improve performance on domain-specific tasks. Learn efficient techniques like QLoRA and Unsloth that make fine-tuning possible on consumer hardware.
Project Overview
| Aspect | Details |
|---|---|
| Difficulty | Intermediate |
| Time | 6-8 hours |
| Prerequisites | Local SLM Setup, SLM Benchmarking |
| What You'll Build | Fine-tuned SLM for custom task with evaluation pipeline |
What You'll Learn
- QLoRA fine-tuning with PEFT library
- Unsloth for 2-5x faster training
- Dataset preparation and formatting
- Hyperparameter selection for SLMs
- Evaluation and iteration workflow
- Exporting models for Ollama deployment
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ SLM Fine-tuning Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────┐ │
│ │ Data Preparation │ │ Training Pipeline │ │ Evaluation │ │
│ ├─────────────────────┤ ├─────────────────────┤ ├─────────────────┤ │
│ │ • Raw Data │ │ • Base Model │ │ • Metrics │ │
│ │ • Formatter │───►│ • LoRA Adapters │───►│ • Comparison │ │
│ │ • Train/Val Split │ │ • SFTTrainer │ │ • Iteration │ │
│ └─────────────────────┘ └─────────────────────┘ └────────┬────────┘ │
│ ▲ │ │
│ │ │ │
│ └────── (retry loop) ─────┘ │
│ │
│ ┌─────────────────────┐ │
│ │ Export & Deploy │ │
│ ├─────────────────────┤ │
│ │ Merge ──► GGUF ──► Ollama │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Data Flow:
Raw Data ──► Format ──► Split ──► Train ──► Evaluate ──► Merge ──► GGUF ──► Ollama
Hardware Requirements
| Configuration | VRAM | Models | Training Time |
|---|---|---|---|
| Minimum | 8GB | 1-2B models | Slow |
| Recommended | 16GB | 3-4B models | Good |
| Optimal | 24GB+ | 7B+ models | Fast |
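These tiers can be sanity-checked with a back-of-the-envelope estimate. The constants below are rough assumptions for planning (4-bit weights cost about 0.5 bytes per parameter, plus a flat allowance for activations, adapters, and optimizer state), not measured values:

```python
def estimate_qlora_vram_gb(params_billion: float, overhead_gb: float = 2.5) -> float:
    """Very rough QLoRA VRAM estimate. The 0.5 bytes/param figure comes
    from 4-bit NF4 weights; overhead_gb is an assumed flat allowance for
    activations, LoRA adapters, and optimizer state."""
    weights_gb = params_billion * 0.5  # 4-bit weights ~= 0.5 bytes per parameter
    return weights_gb + overhead_gb

# A 7B model lands near the ~6 GB figure often quoted for QLoRA
print(f"7B: ~{estimate_qlora_vram_gb(7):.1f} GB")
print(f"3B: ~{estimate_qlora_vram_gb(3):.1f} GB")
```

Actual usage varies with sequence length, batch size, and LoRA rank, so leave yourself headroom beyond the estimate.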
Project Setup
Install Dependencies
# Create project directory
mkdir slm-finetuning && cd slm-finetuning
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install PyTorch with CUDA (adjust for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install training libraries
pip install transformers datasets peft accelerate bitsandbytes
pip install trl wandb evaluate scikit-learn
# Install Unsloth for faster training (optional but recommended)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Part 1: Dataset Preparation
Dataset Formats
SLMs typically use instruction or chat formats for fine-tuning.
# data_prep.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from datasets import Dataset, DatasetDict
import json
@dataclass
class InstructionSample:
"""Single instruction-response pair."""
instruction: str
input: str = ""
output: str = ""
system: str = ""
class DatasetFormatter:
"""Format datasets for SLM fine-tuning."""
# Common chat templates
TEMPLATES = {
"alpaca": """Below is an instruction that describes a task{input_section}. Write a response that appropriately completes the request.
### Instruction:
{instruction}
{input_block}
### Response:
{output}""",
"chatml": """<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}{input_block}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>""",
"llama3": """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system}<|eot_id|><|start_header_id|>user<|end_header_id|>
{instruction}{input_block}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{output}<|eot_id|>""",
"phi3": """<|system|>
{system}<|end|>
<|user|>
{instruction}{input_block}<|end|>
<|assistant|>
{output}<|end|>""",
}
def __init__(self, template: str = "chatml", system_prompt: str = ""):
self.template_name = template
self.template = self.TEMPLATES.get(template, self.TEMPLATES["chatml"])
self.default_system = system_prompt or "You are a helpful assistant."
def format_sample(self, sample: InstructionSample) -> str:
"""Format a single sample."""
input_section = ", along with an input that provides further context" if sample.input else ""
input_block = f"\n\n### Input:\n{sample.input}" if sample.input else ""
system = sample.system or self.default_system
return self.template.format(
instruction=sample.instruction,
input=sample.input,
input_section=input_section,
input_block=input_block,
output=sample.output,
system=system
)
def format_dataset(
self,
samples: List[InstructionSample],
text_column: str = "text"
) -> Dataset:
"""Format list of samples into HuggingFace Dataset."""
formatted = [self.format_sample(s) for s in samples]
return Dataset.from_dict({text_column: formatted})
def load_jsonl(filepath: str) -> List[InstructionSample]:
"""Load samples from JSONL file."""
samples = []
with open(filepath, 'r') as f:
for line in f:
data = json.loads(line)
samples.append(InstructionSample(
instruction=data.get("instruction", ""),
input=data.get("input", ""),
output=data.get("output", ""),
system=data.get("system", "")
))
return samples
def create_train_val_split(
samples: List[InstructionSample],
val_ratio: float = 0.1,
seed: int = 42
) -> tuple:
"""Split samples into train and validation sets."""
import random
random.seed(seed)
random.shuffle(samples)
split_idx = int(len(samples) * (1 - val_ratio))
return samples[:split_idx], samples[split_idx:]
# Example usage
if __name__ == "__main__":
# Create sample dataset
samples = [
InstructionSample(
instruction="Summarize the following text.",
input="Machine learning is a subset of artificial intelligence...",
output="Machine learning enables computers to learn from data."
),
InstructionSample(
instruction="What is the capital of France?",
output="The capital of France is Paris."
),
]
formatter = DatasetFormatter(template="chatml")
dataset = formatter.format_dataset(samples)
print(dataset[0]["text"])
Understanding Chat Templates and Dataset Formatting:
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY CHAT TEMPLATES MATTER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Raw Data: Formatted (ChatML): │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ instruction: "Summarize" │ │ <|im_start|>system │ │
│ │ input: "Long text..." │ ───► │ You are helpful<|im_end|> │ │
│ │ output: "Summary..." │ │ <|im_start|>user │ │
│ └─────────────────────────────┘ │ Summarize: Long text<|im_end|> │ │
│ │ <|im_start|>assistant │ │
│ │ Summary...<|im_end|> │ │
│ └─────────────────────────────────┘ │
│ │
│ The model learns to: │
│ 1. Recognize the start/end of each role │
│ 2. Generate after <|im_start|>assistant │
│ 3. Stop at <|im_end|> │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Common Chat Template Formats:
| Template | Models Using It | Special Tokens |
|---|---|---|
| ChatML | Qwen, Yi, many fine-tunes | <\|im_start\|>, <\|im_end\|> |
| Llama 3 | Llama 3, Llama 3.1/3.2 | <\|begin_of_text\|>, <\|eot_id\|> |
| Phi-3 | Microsoft Phi-3 family | <\|system\|>, <\|end\|> |
| Alpaca | Many older fine-tunes | ### Instruction:, ### Response: |
Critical: Always match the template to your base model. Using the wrong template causes the model to generate garbage or never stop.
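One way to avoid the mismatch is a small guard that infers the template from the model name. The substring-to-template mapping below is a hypothetical illustration, not an exhaustive list:

```python
# Hypothetical helper: infer the expected chat template from the base
# model's name. Extend the mapping for the models you actually use.
TEMPLATE_HINTS = {
    "qwen": "chatml",
    "yi": "chatml",
    "llama-3": "llama3",
    "phi-3": "phi3",
}

def guess_template(model_name: str, default: str = "chatml") -> str:
    """Pick a chat template based on substrings of the model name."""
    name = model_name.lower()
    for hint, template in TEMPLATE_HINTS.items():
        if hint in name:
            return template
    return default

print(guess_template("unsloth/Phi-3-mini-4k-instruct"))  # phi3
print(guess_template("Qwen/Qwen2.5-3B-Instruct"))        # chatml
```

In practice, the most reliable option is `tokenizer.apply_chat_template(...)` from transformers, which uses the template shipped with the model and removes the guesswork entirely.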
Create Domain-Specific Dataset
# create_dataset.py
import json
from pathlib import Path
from typing import List, Dict
from data_prep import InstructionSample, DatasetFormatter, create_train_val_split
def create_customer_support_dataset() -> List[InstructionSample]:
"""Create sample customer support dataset."""
samples = [
InstructionSample(
instruction="A customer asks about their order status. Respond helpfully.",
input="Hi, I placed an order 3 days ago but haven't received any shipping notification. Order #12345.",
output="I apologize for the delay in updating you about your order #12345. Let me check the status for you. I can see your order was processed and is currently being prepared for shipment. You should receive a shipping confirmation with tracking information within the next 24 hours. Is there anything else I can help you with?",
system="You are a helpful customer support agent. Be polite, professional, and solution-oriented."
),
InstructionSample(
instruction="Handle a refund request professionally.",
input="I want a refund. The product doesn't work as advertised.",
output="I'm sorry to hear the product didn't meet your expectations. I'd be happy to help process a refund for you. Could you please provide your order number and briefly describe the issue you experienced? This will help us improve our products and ensure a smooth refund process. If you prefer, we can also offer an exchange or store credit instead.",
system="You are a helpful customer support agent. Be polite, professional, and solution-oriented."
),
InstructionSample(
instruction="Answer a product question accurately.",
input="Does this laptop support USB-C charging?",
output="Yes, this laptop supports USB-C charging through its Thunderbolt 4 ports. You can use any USB-C PD (Power Delivery) charger with at least 65W output for optimal charging speed. The laptop also includes the standard barrel charger in the box. Would you like me to recommend any compatible USB-C chargers?",
system="You are a helpful customer support agent. Be polite, professional, and solution-oriented."
),
# Add more samples...
]
return samples
def create_code_assistant_dataset() -> List[InstructionSample]:
"""Create sample coding assistant dataset."""
samples = [
InstructionSample(
instruction="Write a Python function that checks if a number is prime.",
output='''def is_prime(n: int) -> bool:
"""Check if a number is prime."""
if n < 2:
return False
if n == 2:
return True
if n % 2 == 0:
return False
for i in range(3, int(n**0.5) + 1, 2):
if n % i == 0:
return False
return True''',
system="You are an expert Python programmer. Write clean, efficient, and well-documented code."
),
InstructionSample(
instruction="Explain this code and suggest improvements.",
input='''def sort(arr):
for i in range(len(arr)):
for j in range(len(arr)-1):
if arr[j] > arr[j+1]:
arr[j], arr[j+1] = arr[j+1], arr[j]
return arr''',
output='''This is a Bubble Sort implementation with O(n^2) time complexity.
Issues:
1. Not optimized - continues even if array is sorted
2. Inner loop does unnecessary comparisons
Improved version:
```python
def bubble_sort(arr: list) -> list:
"""Optimized bubble sort with early termination."""
n = len(arr)
for i in range(n):
swapped = False
for j in range(n - i - 1):
if arr[j] > arr[j + 1]:
arr[j], arr[j + 1] = arr[j + 1], arr[j]
swapped = True
if not swapped:
break
return arr
```
For better performance, consider using Python's built-in `sorted()` which uses Timsort with O(n log n) complexity.''',
system="You are an expert Python programmer. Write clean, efficient, and well-documented code."
),
]
return samples
def save_dataset(samples: List[InstructionSample], filepath: str):
"""Save dataset to JSONL file."""
with open(filepath, 'w') as f:
for sample in samples:
data = {
"instruction": sample.instruction,
"input": sample.input,
"output": sample.output,
"system": sample.system
}
f.write(json.dumps(data) + "\n")
print(f"Saved {len(samples)} samples to {filepath}")
if __name__ == "__main__":
# Create datasets
support_samples = create_customer_support_dataset()
code_samples = create_code_assistant_dataset()
# Combine or save separately
all_samples = support_samples + code_samples
# Split and save
train_samples, val_samples = create_train_val_split(all_samples)
Path("data").mkdir(exist_ok=True)
save_dataset(train_samples, "data/train.jsonl")
save_dataset(val_samples, "data/val.jsonl")
Part 2: QLoRA Fine-tuning with PEFT
Basic QLoRA Training
# train_qlora.py
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training,
TaskType,
)
from trl import SFTTrainer
from datasets import load_dataset
import wandb
def setup_qlora_model(
model_name: str,
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.05,
target_modules: list = None
):
"""Setup model with QLoRA configuration."""
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
torch_dtype=torch.bfloat16,
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
# Default target modules for common architectures
if target_modules is None:
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj", # MLP
]
# LoRA configuration
lora_config = LoraConfig(
r=lora_r,
lora_alpha=lora_alpha,
target_modules=target_modules,
lora_dropout=lora_dropout,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
return model
def train_model(
model_name: str = "microsoft/phi-2",
train_file: str = "data/train.jsonl",
val_file: str = "data/val.jsonl",
output_dir: str = "outputs/phi2-finetuned",
num_epochs: int = 3,
batch_size: int = 4,
learning_rate: float = 2e-4,
max_seq_length: int = 2048,
gradient_accumulation_steps: int = 4,
use_wandb: bool = False,
):
"""Fine-tune model with QLoRA."""
# Initialize wandb if requested
if use_wandb:
wandb.init(project="slm-finetuning", name=f"qlora-{model_name.split('/')[-1]}")
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Setup model
model = setup_qlora_model(model_name)
# Load datasets
train_dataset = load_dataset("json", data_files=train_file, split="train")
val_dataset = load_dataset("json", data_files=val_file, split="train")
# Format function
def format_instruction(sample):
system = sample.get("system", "You are a helpful assistant.")
instruction = sample["instruction"]
input_text = sample.get("input", "")
output = sample["output"]
if input_text:
text = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}
{input_text}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>"""
else:
text = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>"""
return {"text": text}
# Apply formatting
train_dataset = train_dataset.map(format_instruction)
val_dataset = val_dataset.map(format_instruction)
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
learning_rate=learning_rate,
weight_decay=0.01,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
bf16=True,
optim="paged_adamw_8bit",
gradient_checkpointing=True,
max_grad_norm=0.3,
report_to="wandb" if use_wandb else "none",
)
# Create trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
tokenizer=tokenizer,
dataset_text_field="text",
max_seq_length=max_seq_length,
packing=False,
)
# Train
print("Starting training...")
trainer.train()
# Save
trainer.save_model()
tokenizer.save_pretrained(output_dir)
print(f"Model saved to {output_dir}")
if use_wandb:
wandb.finish()
return trainer
if __name__ == "__main__":
trainer = train_model(
model_name="microsoft/phi-2",
num_epochs=3,
batch_size=2,
learning_rate=2e-4,
)
Understanding QLoRA: How It Enables Fine-tuning on Consumer GPUs:
┌─────────────────────────────────────────────────────────────────────────────┐
│ QLoRA = 4-bit Quantization + LoRA Adapters │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Fine-tuning: QLoRA Fine-tuning: │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ All weights trainable │ │ Base weights frozen (4-bit) │ │
│ │ 7B model = 28GB VRAM │ │ + Small LoRA adapters (FP16) │ │
│ │ Need A100 GPU │ │ 7B model = ~6GB VRAM │ │
│ └─────────────────────────────┘ │ Works on RTX 3080! │ │
│ └─────────────────────────────────┘ │
│ │
│ How LoRA Works: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Original Weight Matrix W (frozen, 4-bit): │ │
│ │ ┌─────────────────────┐ │ │
│ │ │ │ d_model × d_model │ │
│ │ │ 4096 × 4096 │ = 16M parameters │ │
│ │ │ │ │ │
│ │ └─────────────────────┘ │ │
│ │ + │ │
│ │ LoRA Adapters (trainable, FP16): │ │
│ │ ┌───────┐ ┌───────┐ │ │
│ │ │ A │ × │ B │ (4096 × 16) × (16 × 4096) │ │
│ │ │4096×16│ │16×4096│ = 131K parameters (0.8% of original!) │ │
│ │ └───────┘ └───────┘ │ │
│ │ │ │
│ │ Output = W × x + (A × B) × x × (alpha/r) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
QLoRA Configuration Explained:
| Parameter | Value | Why This Matters |
|---|---|---|
| `load_in_4bit=True` | 4-bit NF4 quantization | Reduces base model memory 4x |
| `bnb_4bit_compute_dtype=bfloat16` | Compute in BF16 | Better numerical stability than FP16 |
| `bnb_4bit_use_double_quant=True` | Quantize the quantization constants | Extra 0.4 bits/param savings |
| `r=16` | LoRA rank | Higher = more capacity, more memory |
| `lora_alpha=32` | Scaling factor (usually 2×r) | Controls magnitude of LoRA updates |
| `target_modules` | Attention + MLP layers | Where LoRA adapters are inserted |
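The 131K figure in the diagram follows directly from the two low-rank factors. A quick check of the arithmetic:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    A is (d_in x r) and B is (r x d_out)."""
    return d_in * r + r * d_out

full = 4096 * 4096                       # frozen base weights: ~16.8M
lora = lora_param_count(4096, 4096, 16)  # 131,072 trainable
print(f"LoRA params: {lora:,} ({lora / full:.1%} of the original)")
```

Doubling `r` doubles the adapter's parameter count, which is why rank is the main capacity/memory knob.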
Effective Batch Size Calculation:
effective_batch = batch_size × gradient_accumulation_steps
= 4 × 4 = 16
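The same arithmetic determines how many weight updates one epoch performs, which is handy when choosing warmup and logging steps. A minimal helper:

```python
def optimizer_steps_per_epoch(num_samples: int, batch_size: int,
                              grad_accum: int) -> int:
    """Weight updates per epoch given micro-batch size and accumulation."""
    effective_batch = batch_size * grad_accum
    # Ceiling division: a final partial batch still triggers an update
    return -(-num_samples // effective_batch)

print(optimizer_steps_per_epoch(1000, batch_size=4, grad_accum=4))  # 63
```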
This means: Update weights every 16 samples, but only hold 4 in memory at once.
Part 3: Fast Fine-tuning with Unsloth
Unsloth provides 2-5x faster training with 70% less memory.
# train_unsloth.py
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch
def train_with_unsloth(
model_name: str = "unsloth/Phi-3-mini-4k-instruct",
train_file: str = "data/train.jsonl",
val_file: str = "data/val.jsonl",
output_dir: str = "outputs/phi3-unsloth",
num_epochs: int = 3,
batch_size: int = 2,
learning_rate: float = 2e-4,
max_seq_length: int = 2048,
):
"""Fine-tune with Unsloth for faster training."""
# Load model with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=model_name,
max_seq_length=max_seq_length,
dtype=None, # Auto-detect
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=16,
lora_dropout=0, # Unsloth optimizes for 0 dropout
bias="none",
use_gradient_checkpointing="unsloth", # Unsloth optimization
random_state=42,
use_rslora=False,
loftq_config=None,
)
# Load datasets
train_dataset = load_dataset("json", data_files=train_file, split="train")
val_dataset = load_dataset("json", data_files=val_file, split="train")
# Phi-3 chat template
def format_phi3(sample):
system = sample.get("system", "You are a helpful assistant.")
instruction = sample["instruction"]
input_text = sample.get("input", "")
output = sample["output"]
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
text = f"""<|system|>
{system}<|end|>
<|user|>
{user_content}<|end|>
<|assistant|>
{output}<|end|>"""
return {"text": text}
train_dataset = train_dataset.map(format_phi3)
val_dataset = val_dataset.map(format_phi3)
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=learning_rate,
weight_decay=0.01,
warmup_steps=5,
lr_scheduler_type="linear",
logging_steps=1,
save_strategy="epoch",
evaluation_strategy="epoch",
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
optim="adamw_8bit",
seed=42,
)
# Create trainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
)
# Show GPU stats before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# Train
trainer_stats = trainer.train()
# Show final stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
# Save LoRA adapters
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
return model, tokenizer
def merge_and_save(
model,
tokenizer,
output_dir: str = "outputs/phi3-merged",
save_gguf: bool = True,
quantization: str = "q4_k_m",
):
"""Merge LoRA adapters and save."""
# Save merged 16-bit model
model.save_pretrained_merged(
output_dir,
tokenizer,
save_method="merged_16bit",
)
print(f"Merged model saved to {output_dir}")
# Save GGUF for Ollama
if save_gguf:
model.save_pretrained_gguf(
f"{output_dir}-gguf",
tokenizer,
quantization_method=quantization,
)
print(f"GGUF model saved to {output_dir}-gguf")
if __name__ == "__main__":
model, tokenizer = train_with_unsloth(
model_name="unsloth/Phi-3-mini-4k-instruct",
num_epochs=3,
)
# Merge and export
merge_and_save(model, tokenizer)
Why Unsloth is 2-5x Faster:
┌─────────────────────────────────────────────────────────────────────────────┐
│ UNSLOTH OPTIMIZATIONS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Standard PEFT/TRL: Unsloth: │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Generic PyTorch operations │ │ Custom fused CUDA kernels │ │
│ │ Multiple kernel launches │ │ Single kernel for common ops │ │
│ │ Standard attention │ │ Flash Attention 2 built-in │ │
│ │ Regular gradient checkpt │ │ "Unsloth" gradient checkpt │ │
│ └─────────────────────────────┘ └─────────────────────────────────┘ │
│ │
│ Memory Comparison (Phi-3 Mini, same batch size): │
│ ┌────────────────────┬────────────────┬────────────────────────────────┐ │
│ │ Framework │ Peak VRAM │ Tokens/Second │ │
│ ├────────────────────┼────────────────┼────────────────────────────────┤ │
│ │ Standard PEFT │ 14.2 GB │ ~1,200 │ │
│ │ Unsloth │ 5.8 GB │ ~4,800 │ │
│ │ Improvement │ 60% less │ 4x faster │ │
│ └────────────────────┴────────────────┴────────────────────────────────┘ │
│ │
│ Key Unsloth Settings: │
│ • lora_dropout=0 → Enables kernel fusion (required!) │
│ • use_gradient_checkpointing="unsloth" → Custom implementation │
│ • Automatic dtype detection → Picks best for your GPU │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Unsloth Export Options:
| Method | Output | Use Case |
|---|---|---|
| `save_pretrained` | LoRA adapters only | Continue training, share small files |
| `save_pretrained_merged` | Full merged model | HuggingFace deployment |
| `save_pretrained_gguf` | GGUF quantized | Ollama, llama.cpp deployment |
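To estimate how large each export will be on disk, multiply parameter count by an approximate bits-per-weight figure for the quantization method. The figures below are rough averages (k-quant schemes also store block scales), so treat the results as ballpark sizes, not exact file sizes:

```python
# Approximate bits per weight for common export formats (assumed averages)
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q5_k_m": 5.7, "q4_k_m": 4.8}

def gguf_size_gb(params_billion: float, method: str = "q4_k_m") -> float:
    """Rough on-disk size of a GGUF export, in GB."""
    bits = BITS_PER_WEIGHT[method]
    return params_billion * bits / 8  # bits -> bytes

print(f"3.8B q4_k_m: ~{gguf_size_gb(3.8):.1f} GB")  # Phi-3 Mini scale
print(f"7B q4_k_m:   ~{gguf_size_gb(7):.1f} GB")
```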
Part 4: Evaluation Pipeline
Evaluate Fine-tuned Model
# evaluate.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset
from typing import List, Dict, Any
import json
from tqdm import tqdm
import numpy as np
class ModelEvaluator:
"""Evaluate fine-tuned models."""
def __init__(
self,
base_model_name: str,
adapter_path: str = None,
device: str = "auto",
):
self.device = device
# Load tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(
adapter_path or base_model_name,
trust_remote_code=True
)
self.tokenizer.pad_token = self.tokenizer.eos_token
# Load model
self.model = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map=device,
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
# Load adapters if provided
if adapter_path:
self.model = PeftModel.from_pretrained(
self.model,
adapter_path,
)
print(f"Loaded adapters from {adapter_path}")
self.model.eval()
@torch.no_grad()
def generate(
self,
prompt: str,
max_new_tokens: int = 256,
temperature: float = 0.1,
top_p: float = 0.9,
) -> str:
"""Generate response for a prompt."""
inputs = self.tokenizer(
prompt,
return_tensors="pt",
truncation=True,
max_length=2048,
).to(self.model.device)
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=temperature > 0,
pad_token_id=self.tokenizer.eos_token_id,
)
response = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True
)
return response.strip()
def evaluate_dataset(
self,
test_file: str,
template: str = "chatml",
max_samples: int = None,
) -> Dict[str, Any]:
"""Evaluate on a test dataset."""
dataset = load_dataset("json", data_files=test_file, split="train")
if max_samples:
dataset = dataset.select(range(min(max_samples, len(dataset))))
results = []
for sample in tqdm(dataset, desc="Evaluating"):
# Format prompt (without output)
if template == "chatml":
system = sample.get("system", "You are a helpful assistant.")
instruction = sample["instruction"]
input_text = sample.get("input", "")
if input_text:
prompt = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}
{input_text}<|im_end|>
<|im_start|>assistant
"""
else:
prompt = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""
# Generate
prediction = self.generate(prompt)
results.append({
"instruction": sample["instruction"],
"input": sample.get("input", ""),
"expected": sample["output"],
"predicted": prediction,
})
return self._calculate_metrics(results)
def _calculate_metrics(self, results: List[Dict]) -> Dict[str, Any]:
"""Calculate evaluation metrics."""
from evaluate import load
# Exact match (for classification tasks)
exact_matches = sum(
1 for r in results
if r["expected"].strip().lower() == r["predicted"].strip().lower()
)
exact_match_acc = exact_matches / len(results)
# Contains match (looser criterion)
contains_matches = sum(
1 for r in results
if r["expected"].strip().lower() in r["predicted"].strip().lower()
)
contains_acc = contains_matches / len(results)
# BLEU score for generation tasks
try:
bleu = load("bleu")
predictions = [r["predicted"] for r in results]
references = [[r["expected"]] for r in results]
bleu_score = bleu.compute(predictions=predictions, references=references)["bleu"]
except Exception:
bleu_score = None
# ROUGE scores
try:
rouge = load("rouge")
predictions = [r["predicted"] for r in results]
references = [r["expected"] for r in results]
rouge_scores = rouge.compute(predictions=predictions, references=references)
except Exception:
rouge_scores = None
return {
"num_samples": len(results),
"exact_match_accuracy": exact_match_acc,
"contains_accuracy": contains_acc,
"bleu": bleu_score,
"rouge": rouge_scores,
"results": results,
}
def compare_models(
base_model_name: str,
adapter_path: str,
test_file: str,
max_samples: int = 50,
) -> Dict[str, Any]:
"""Compare base model vs fine-tuned model."""
print("Evaluating base model...")
base_evaluator = ModelEvaluator(base_model_name)
base_metrics = base_evaluator.evaluate_dataset(test_file, max_samples=max_samples)
print("\nEvaluating fine-tuned model...")
ft_evaluator = ModelEvaluator(base_model_name, adapter_path)
ft_metrics = ft_evaluator.evaluate_dataset(test_file, max_samples=max_samples)
# Compare
comparison = {
"base_model": {
"exact_match": base_metrics["exact_match_accuracy"],
"contains": base_metrics["contains_accuracy"],
"bleu": base_metrics["bleu"],
},
"finetuned_model": {
"exact_match": ft_metrics["exact_match_accuracy"],
"contains": ft_metrics["contains_accuracy"],
"bleu": ft_metrics["bleu"],
},
"improvement": {
"exact_match": ft_metrics["exact_match_accuracy"] - base_metrics["exact_match_accuracy"],
"contains": ft_metrics["contains_accuracy"] - base_metrics["contains_accuracy"],
}
}
print("\n" + "=" * 50)
print("COMPARISON RESULTS")
print("=" * 50)
print(f"{'Metric':<20} {'Base':<15} {'Fine-tuned':<15} {'Change':<15}")
print("-" * 65)
print(f"{'Exact Match':<20} {comparison['base_model']['exact_match']:.1%} {comparison['finetuned_model']['exact_match']:.1%} {comparison['improvement']['exact_match']:+.1%}")
print(f"{'Contains Match':<20} {comparison['base_model']['contains']:.1%} {comparison['finetuned_model']['contains']:.1%} {comparison['improvement']['contains']:+.1%}")
return comparison
if __name__ == "__main__":
# Evaluate fine-tuned model
evaluator = ModelEvaluator(
base_model_name="microsoft/phi-2",
adapter_path="outputs/phi2-finetuned",
)
metrics = evaluator.evaluate_dataset(
test_file="data/val.jsonl",
max_samples=20,
)
print(f"\nExact Match Accuracy: {metrics['exact_match_accuracy']:.1%}")
print(f"Contains Accuracy: {metrics['contains_accuracy']:.1%}")
if metrics['bleu']:
print(f"BLEU Score: {metrics['bleu']:.3f}")
Understanding Fine-tuning Evaluation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION METRICS FOR FINE-TUNED MODELS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ When to Use Each Metric: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Classification Tasks (Q&A, Yes/No): │ │
│ │ ┌───────────────────┐ │ │
│ │ │ Exact Match │ ──► "Paris" == "Paris" ✓ │ │
│ │ │ Accuracy │ "Paris" == "paris" ✗ (case-sensitive) │ │
│ │ └───────────────────┘ │ │
│ │ │ │
│ │ Generation Tasks (Summarization, Writing): │ │
│ │ ┌───────────────────┐ │ │
│ │ │ BLEU │ ──► n-gram overlap with reference │ │
│ │ │ ROUGE │ ──► recall-oriented overlap │ │
│ │ │ Contains Match │ ──► key info present (looser) │ │
│ │ └───────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Comparing Base vs Fine-tuned: │
│ │
│ Base Model: ████████████░░░░░░░░ 60% │
│ Fine-tuned: ████████████████████ 95% │
│ ▲ │
│ │ │
│ +35% improvement on domain tasks │
│ │
│ Warning Signs: │
│ • Fine-tuned worse than base → Wrong template or data format │
│ • Perfect training, poor validation → Overfitting │
│ • Good validation, poor real-world → Data distribution mismatch │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Evaluation Strategy:
| Approach | When to Use | Example |
|---|---|---|
| Exact Match | Classification, factual Q&A | "What is the capital?" → "Paris" |
| Contains Match | Key information extraction | Response includes the key fact |
| BLEU/ROUGE | Open-ended generation | Summaries, creative writing |
| Human Evaluation | Final quality check | A/B preference testing |
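The exact-match and contains-match criteria used by the evaluator above are cheap to reproduce by hand, which makes the difference between them easy to see on a toy example:

```python
def exact_match(expected: str, predicted: str) -> bool:
    """Case-insensitive equality after stripping whitespace."""
    return expected.strip().lower() == predicted.strip().lower()

def contains_match(expected: str, predicted: str) -> bool:
    """Looser criterion: the expected answer appears inside the prediction."""
    return expected.strip().lower() in predicted.strip().lower()

pairs = [
    ("Paris", "Paris"),                  # exact and contains
    ("Paris", "The capital is Paris."),  # contains only
    ("Paris", "I am not sure."),         # neither
]
exact = sum(exact_match(e, p) for e, p in pairs) / len(pairs)
contains = sum(contains_match(e, p) for e, p in pairs) / len(pairs)
print(f"exact={exact:.0%} contains={contains:.0%}")  # exact=33% contains=67%
```

Contains-match is always at least as high as exact-match; a large gap between the two usually means the model has the right facts but a different phrasing than the reference.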
Part 5: Export to Ollama
Merge and Convert to GGUF
# export_ollama.py
import subprocess
import os
from pathlib import Path
import shutil
def merge_lora_weights(
base_model: str,
adapter_path: str,
output_path: str,
):
"""Merge LoRA adapters into base model."""
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
print(f"Loading base model: {base_model}")
model = AutoModelForCausalLM.from_pretrained(
base_model,
torch_dtype=torch.float16,
device_map="cpu",
trust_remote_code=True,
)
print(f"Loading adapters: {adapter_path}")
model = PeftModel.from_pretrained(model, adapter_path)
print("Merging weights...")
model = model.merge_and_unload()
print(f"Saving merged model to {output_path}")
model.save_pretrained(output_path, safe_serialization=True)
# Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(adapter_path, trust_remote_code=True)
tokenizer.save_pretrained(output_path)
print("Merge complete!")
return output_path
def convert_to_gguf(
model_path: str,
output_path: str,
quantization: str = "q4_k_m",
    llama_cpp_path: str | None = None,
):
"""Convert HuggingFace model to GGUF format."""
if llama_cpp_path is None:
# Try to find llama.cpp in common locations
possible_paths = [
Path.home() / "llama.cpp",
Path("./llama.cpp"),
Path("/opt/llama.cpp"),
]
for p in possible_paths:
if p.exists():
llama_cpp_path = str(p)
break
if not llama_cpp_path:
print("llama.cpp not found. Please provide the path or install it.")
return None
convert_script = Path(llama_cpp_path) / "convert_hf_to_gguf.py"
quantize_binary = Path(llama_cpp_path) / "llama-quantize"
# Create output directory
output_dir = Path(output_path)
output_dir.mkdir(parents=True, exist_ok=True)
# Step 1: Convert to GGUF F16
f16_path = output_dir / "model-f16.gguf"
print(f"Converting to GGUF F16...")
cmd = [
"python", str(convert_script),
model_path,
"--outfile", str(f16_path),
"--outtype", "f16",
]
subprocess.run(cmd, check=True)
# Step 2: Quantize
quantized_path = output_dir / f"model-{quantization}.gguf"
print(f"Quantizing to {quantization}...")
cmd = [
str(quantize_binary),
str(f16_path),
str(quantized_path),
quantization.upper(),
]
subprocess.run(cmd, check=True)
# Clean up F16 if quantization successful
if quantized_path.exists():
f16_path.unlink()
print(f"GGUF model saved to {quantized_path}")
return str(quantized_path)
def create_ollama_modelfile(
gguf_path: str,
model_name: str,
system_prompt: str = "You are a helpful assistant.",
template: str = "chatml",
    output_path: str | None = None,
):
"""Create Ollama Modelfile for the custom model."""
templates = {
"chatml": '''TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>"""''',
"phi3": '''TEMPLATE """<|system|>
{{ .System }}<|end|>
<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
{{ .Response }}<|end|>"""''',
"llama3": '''TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""''',
}
modelfile_content = f'''FROM {gguf_path}
{templates.get(template, templates["chatml"])}
SYSTEM """{system_prompt}"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|end|>"
'''
output_path = output_path or "Modelfile"
with open(output_path, 'w') as f:
f.write(modelfile_content)
print(f"Modelfile saved to {output_path}")
return output_path
def register_with_ollama(modelfile_path: str, model_name: str):
"""Register the model with Ollama."""
print(f"Creating Ollama model: {model_name}")
cmd = ["ollama", "create", model_name, "-f", modelfile_path]
subprocess.run(cmd, check=True)
print(f"Model {model_name} is now available in Ollama!")
print(f"Test with: ollama run {model_name}")
def full_export_pipeline(
base_model: str,
adapter_path: str,
model_name: str,
system_prompt: str = "You are a helpful assistant.",
quantization: str = "q4_k_m",
    llama_cpp_path: str | None = None,
):
"""Complete pipeline: merge -> convert -> register."""
work_dir = Path("exports") / model_name
work_dir.mkdir(parents=True, exist_ok=True)
# Step 1: Merge
merged_path = str(work_dir / "merged")
merge_lora_weights(base_model, adapter_path, merged_path)
# Step 2: Convert to GGUF
gguf_dir = str(work_dir / "gguf")
gguf_path = convert_to_gguf(
merged_path,
gguf_dir,
quantization,
llama_cpp_path,
)
if gguf_path:
# Step 3: Create Modelfile
modelfile_path = str(work_dir / "Modelfile")
create_ollama_modelfile(
gguf_path,
model_name,
system_prompt,
output_path=modelfile_path,
)
# Step 4: Register with Ollama
register_with_ollama(modelfile_path, model_name)
print("\n" + "=" * 50)
print("EXPORT COMPLETE!")
print("=" * 50)
print(f"Model name: {model_name}")
print(f"Quantization: {quantization}")
print(f"\nRun with: ollama run {model_name}")
if __name__ == "__main__":
full_export_pipeline(
base_model="microsoft/phi-2",
adapter_path="outputs/phi2-finetuned",
model_name="phi2-custom",
system_prompt="You are a helpful customer support agent.",
quantization="q4_k_m",
    )

Understanding the Export Pipeline:
┌─────────────────────────────────────────────────────────────────────────────┐
│ FROM TRAINING TO OLLAMA DEPLOYMENT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Merge LoRA → Base Model │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Base Model (4-bit) LoRA Adapters Merged Model (FP16) │ │
│ │ ┌─────────────────┐ ┌───────────┐ ┌─────────────────┐ │ │
│ │ │ W (frozen) │ + │ A × B │ = │ W + A×B×(α/r) │ │ │
│ │ │ 4-bit weights │ │ trainable │ │ full precision │ │ │
│ │ └─────────────────┘ └───────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 2: Convert to GGUF (llama.cpp format) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ HuggingFace Format GGUF Format │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ model.safetensors│ ──► │ model-q4_k_m.gguf│ │ │
│ │ │ config.json │ convert │ (single file, │ │ │
│ │ │ tokenizer.json │ + quant │ 4-bit weights) │ │ │
│ │ └─────────────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 3: Create Modelfile + Register with Ollama │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Modelfile: │ │
│ │ FROM ./model-q4_k_m.gguf │ │
│ │ TEMPLATE "..." ◄── Must match training template! │ │
│ │ SYSTEM "..." │ │
│ │ PARAMETER temperature 0.7 │ │
│ │ │ │
│ │ $ ollama create my-model -f Modelfile │ │
│ │ $ ollama run my-model │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

GGUF Quantization Options:
| Method | Bits | Size (2B model) | Quality | Recommended For |
|---|---|---|---|---|
| Q2_K | 2.5 | ~600MB | Poor | Extreme constraints only |
| Q4_K_M | 4.5 | ~1.2GB | Near-FP16 | Best balance |
| Q5_K_M | 5.5 | ~1.5GB | Excellent | Quality-critical apps |
| Q8_0 | 8.0 | ~2.2GB | Indistinguishable | When size doesn't matter |
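The sizes in the table follow directly from parameters × bits ÷ 8. A back-of-the-envelope helper (illustrative, not part of the export script):

```python
def estimate_gguf_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in GB: parameters x bits-per-weight / 8 bits-per-byte.
    Ignores metadata and mixed-precision layers, so treat it as a lower bound."""
    return num_params * bits_per_weight / 8 / 1e9

# 2B parameters at Q4_K_M's ~4.5 effective bits per weight:
print(estimate_gguf_size_gb(2e9, 4.5))  # 1.125 -> close to the ~1.2GB above
```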
Common Pitfall: Using a different chat template in Modelfile than what the model was trained with causes poor outputs. Always match the template!
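One way to catch a template mismatch early is to render a prompt exactly as the ChatML Modelfile above would, then diff it against the string your training formatter produced. A sketch (`render_chatml` is a hypothetical helper, not part of the export script):

```python
def render_chatml(system: str, prompt: str) -> str:
    """Render a generation prompt the way the ChatML TEMPLATE above does.
    Compare this against your training formatter's output: any difference
    in special tokens or newlines will degrade the fine-tuned model."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(render_chatml("You are a helpful assistant.", "What is QLoRA?"))
```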
Part 6: Complete Training Script
All-in-One Training Pipeline
# train.py
import argparse
from pathlib import Path
import json
def main():
parser = argparse.ArgumentParser(description="SLM Fine-tuning Pipeline")
# Model arguments
parser.add_argument("--base-model", type=str, default="microsoft/phi-2",
help="Base model to fine-tune")
parser.add_argument("--output-dir", type=str, default="outputs/finetuned",
help="Output directory for checkpoints")
# Data arguments
parser.add_argument("--train-file", type=str, required=True,
help="Training data file (JSONL)")
parser.add_argument("--val-file", type=str,
help="Validation data file (JSONL)")
# Training arguments
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch-size", type=int, default=2)
parser.add_argument("--lr", type=float, default=2e-4)
parser.add_argument("--max-seq-length", type=int, default=2048)
parser.add_argument("--gradient-accumulation", type=int, default=4)
# LoRA arguments
parser.add_argument("--lora-r", type=int, default=16)
parser.add_argument("--lora-alpha", type=int, default=32)
parser.add_argument("--lora-dropout", type=float, default=0.05)
# Other arguments
parser.add_argument("--use-unsloth", action="store_true",
help="Use Unsloth for faster training")
parser.add_argument("--use-wandb", action="store_true",
help="Log to Weights & Biases")
parser.add_argument("--export-ollama", action="store_true",
help="Export to Ollama after training")
parser.add_argument("--model-name", type=str, default="custom-model",
help="Name for Ollama model")
args = parser.parse_args()
# Create output directory
Path(args.output_dir).mkdir(parents=True, exist_ok=True)
# Save config
config = vars(args)
with open(Path(args.output_dir) / "config.json", 'w') as f:
json.dump(config, f, indent=2)
# Train
if args.use_unsloth:
from train_unsloth import train_with_unsloth, merge_and_save
model, tokenizer = train_with_unsloth(
model_name=args.base_model,
train_file=args.train_file,
val_file=args.val_file,
output_dir=args.output_dir,
num_epochs=args.epochs,
batch_size=args.batch_size,
learning_rate=args.lr,
max_seq_length=args.max_seq_length,
)
if args.export_ollama:
merge_and_save(
model, tokenizer,
output_dir=f"{args.output_dir}-merged",
save_gguf=True,
)
else:
from train_qlora import train_model
trainer = train_model(
model_name=args.base_model,
train_file=args.train_file,
val_file=args.val_file,
output_dir=args.output_dir,
num_epochs=args.epochs,
batch_size=args.batch_size,
learning_rate=args.lr,
max_seq_length=args.max_seq_length,
gradient_accumulation_steps=args.gradient_accumulation,
use_wandb=args.use_wandb,
)
print("\nTraining complete!")
print(f"Model saved to: {args.output_dir}")
if __name__ == "__main__":
    main()

Usage Examples
# Basic training
python train.py \
--base-model microsoft/phi-2 \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--epochs 3
# Fast training with Unsloth
python train.py \
--base-model unsloth/Phi-3-mini-4k-instruct \
--train-file data/train.jsonl \
--use-unsloth \
--export-ollama \
--model-name my-assistant
# With Weights & Biases logging
python train.py \
--base-model microsoft/phi-2 \
--train-file data/train.jsonl \
--use-wandb \
--epochs 5 \
  --lr 1e-4

Hyperparameter Guide
| Parameter | Recommended Range | Notes |
|---|---|---|
| Learning Rate | 1e-5 to 5e-4 | Start with 2e-4 |
| Batch Size | 1-8 | Limited by VRAM |
| Gradient Accumulation | 4-16 | Effective batch = batch * accum |
| Epochs | 1-5 | More epochs risk overfitting |
| LoRA Rank (r) | 8-64 | Higher = more capacity |
| LoRA Alpha | 16-64 | Usually 2x rank |
| Max Seq Length | 512-4096 | Longer uses more memory |
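Two quantities worth computing before launching a run are the effective batch size and how many optimizer steps an epoch will take. A sketch using train.py's defaults (helper names are illustrative):

```python
def effective_batch(batch_size: int, grad_accum: int) -> int:
    """The key formula: effective_batch = batch_size x gradient_accumulation."""
    return batch_size * grad_accum

def optimizer_steps_per_epoch(num_examples: int, batch_size: int, grad_accum: int) -> int:
    """Each optimizer step consumes one effective batch of examples."""
    return num_examples // effective_batch(batch_size, grad_accum)

# With the defaults (--batch-size 2, --gradient-accumulation 4)
# and a hypothetical 1,000-example dataset:
print(effective_batch(2, 4))                  # 8
print(optimizer_steps_per_epoch(1000, 2, 4))  # 125
```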
Common Issues & Solutions
Out of Memory
# Reduce batch size
batch_size = 1
gradient_accumulation_steps = 8
# Use gradient checkpointing
training_args = TrainingArguments(
gradient_checkpointing=True,
# ...
)
# Reduce max sequence length
max_seq_length = 1024

Poor Performance After Fine-tuning
- Check data quality - Remove duplicates and errors
- Increase training data - More examples usually help
- Adjust learning rate - Try 5x lower or higher
- Add more epochs - Ensure model converges
- Verify template matching - Use same template as base model
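The overfitting warning sign from earlier ("perfect training, poor validation") can be checked mechanically from the trainer's loss history. A sketch, assuming you have collected per-step train/validation loss records (the helper and data are hypothetical):

```python
def best_checkpoint(history: list[dict]) -> tuple[int, bool]:
    """Return the step with the lowest validation loss, plus a flag that is
    True when the train/val gap widened over training (an overfitting signal)."""
    best = min(history, key=lambda h: h["val_loss"])
    first_gap = history[0]["val_loss"] - history[0]["train_loss"]
    last_gap = history[-1]["val_loss"] - history[-1]["train_loss"]
    return best["step"], last_gap > first_gap

history = [
    {"step": 100, "train_loss": 1.20, "val_loss": 1.25},
    {"step": 200, "train_loss": 0.80, "val_loss": 0.95},
    {"step": 300, "train_loss": 0.40, "val_loss": 1.10},  # val loss rising
]
print(best_checkpoint(history))  # (200, True): resume from the step-200 checkpoint
```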
Exercises
- Domain Adaptation: Fine-tune a model on legal or medical text
- Multi-task Learning: Train on multiple task types simultaneously
- Hyperparameter Search: Implement automated hyperparameter tuning
- Evaluation Suite: Create comprehensive task-specific evaluation
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| QLoRA | 4-bit quantization + LoRA adapters | Train 7B models on 8GB VRAM |
| LoRA Rank (r) | Size of low-rank matrices (8-64) | Higher = more capacity, more memory |
| LoRA Alpha | Scaling factor (usually 2×rank) | Controls update magnitude |
| Effective Batch | batch_size × gradient_accumulation | Stabilizes training without more memory |
| Unsloth | Optimized training library | 2-5x faster, 70% less memory |
| Chat Template | Format for instruction/response pairs | Must match base model's expected format |
| SFTTrainer | Supervised fine-tuning trainer | Handles chat format, packing, loss masking |
| Gradient Checkpointing | Trade compute for memory | Enables larger models on limited VRAM |
| GGUF Export | Quantized format for llama.cpp/Ollama | Deploy fine-tuned models locally |
| Validation Loss | Loss on held-out data | Detect overfitting, choose best checkpoint |
Next Steps
- SLM-Powered RAG - Combine fine-tuned models with retrieval
- Edge Deployment - Deploy on mobile and edge devices
- SLM Agents - Build agentic systems with SLMs