Local SLM Setup
Run Phi-3, Gemma, and Qwen locally with Ollama and llama.cpp
TL;DR
Run small language models locally with zero cloud dependencies using Ollama (easiest) or llama.cpp (fastest). Models ship in GGUF format with quantization levels (Q2-Q8) that trade size for quality. A 3B model at Q4 needs ~2 GB RAM and runs at 20-50 tokens/sec on modern hardware. Key formula: RAM_GB ≈ params_B × bytes_per_param × 1.2.
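That rule of thumb is easy to sanity-check in code. A minimal sketch (the 0.5 bytes/param figure for Q4 and the 1.2 overhead factor are approximations, not exact values):

```python
def estimate_ram_gb(params_billion: float, bytes_per_param: float) -> float:
    """RAM_GB ≈ params_B × bytes_per_param × 1.2 (the 1.2 covers KV cache and buffers)."""
    return params_billion * bytes_per_param * 1.2

# Q4 quantization stores roughly 0.5 bytes (4 bits) per parameter
print(f"3.8B at Q4:  ~{estimate_ram_gb(3.8, 0.5):.1f} GB")
print(f"3.8B at f16: ~{estimate_ram_gb(3.8, 2.0):.1f} GB")
```

The same arithmetic explains why a 3B-class model fits comfortably in 8 GB of RAM at Q4 but not at full f16 precision.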
Overview
| Aspect | Details |
|---|---|
| Difficulty | Beginner |
| Time | ~2 hours |
| Code | ~300 lines |
| Prerequisites | Python 3.10+, 8GB RAM |
What You'll Build
A complete local SLM environment with:
- Ollama for easy model management
- llama.cpp for high-performance inference
- Python integration with multiple libraries
- API server for application integration
Local SLM Stack (diagram): Small Models → Model Formats → Inference Tools → Your Application
Understanding Small Language Models
Why Run SLMs Locally?
- Privacy: data stays local; HIPAA/GDPR compliant; enterprise secure
- Cost: no API fees; unlimited usage; predictable costs
- Performance: low latency; no network dependency; offline capable
- Control: custom fine-tuning; version control; no rate limits
Model Comparison (2025)
| Model | Parameters | RAM Required | Speed | Best For |
|---|---|---|---|---|
| SmolLM 2 | 135M-1.7B | 1-4 GB | Very Fast | Embedded, IoT |
| Phi-4-mini | 3.8B | 4-8 GB | Fast | General tasks, coding |
| Gemma 3 | 1B-4B | 2-8 GB | Fast | Multi-modal, instruction following |
| Qwen 3 | 0.6B-8B | 2-16 GB | Fast | Multilingual, tool use |
| Llama 3.2 | 1B-3B | 2-8 GB | Fast | On-device AI |
Part 1: Ollama Setup
Installation
Ollama is the easiest way to run LLMs locally.
```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from https://ollama.com/download/windows
```

Running Models
```bash
# Start the Ollama service (runs in the background)
ollama serve

# Pull and run Phi-3
ollama pull phi3
ollama run phi3

# Pull other popular small models
ollama pull gemma2:2b
ollama pull qwen2.5:3b
ollama pull llama3.2:3b
ollama pull smollm:1.7b
```

Ollama Commands
```bash
# List installed models
ollama list

# Show model information
ollama show phi3

# Remove a model
ollama rm phi3

# Copy/rename a model
ollama cp phi3 my-phi3

# Create a custom model from a Modelfile
ollama create my-assistant -f Modelfile
```

Custom Modelfile
```
# Modelfile
FROM phi3

# Set system prompt
SYSTEM """
You are a helpful coding assistant. You provide concise,
accurate answers with code examples when appropriate.
"""

# Set sampling parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Set stop tokens
PARAMETER stop "<|end|>"
PARAMETER stop "<|user|>"
```

```bash
# Create and run the custom model
ollama create code-assistant -f Modelfile
ollama run code-assistant
```

Part 2: Python Integration with Ollama
Project Setup
```bash
# Create project
mkdir local-slm && cd local-slm

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install ollama langchain-ollama openai requests
```

Basic Ollama Python Usage
```python
# ollama_basic.py
"""
Basic usage of the Ollama Python library.
"""
import ollama


def simple_completion():
    """Generate a simple completion."""
    response = ollama.generate(
        model='phi3',
        prompt='Explain what a transformer is in 2 sentences.'
    )
    print(response['response'])


def chat_completion():
    """Have a conversation with the model."""
    response = ollama.chat(
        model='phi3',
        messages=[
            {
                'role': 'system',
                'content': 'You are a helpful assistant.'
            },
            {
                'role': 'user',
                'content': 'What is the capital of France?'
            }
        ]
    )
    print(response['message']['content'])


def streaming_response():
    """Stream responses for better UX."""
    stream = ollama.chat(
        model='phi3',
        messages=[{'role': 'user', 'content': 'Write a haiku about coding.'}],
        stream=True
    )
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
    print()  # New line at end


def list_models():
    """List available models."""
    models = ollama.list()
    print("Available models:")
    for model in models['models']:
        size_gb = model['size'] / (1024**3)
        print(f"  - {model['name']}: {size_gb:.2f} GB")


if __name__ == "__main__":
    print("=== Simple Completion ===")
    simple_completion()
    print("\n=== Chat Completion ===")
    chat_completion()
    print("\n=== Streaming Response ===")
    streaming_response()
    print("\n=== Available Models ===")
    list_models()
```

Ollama with OpenAI-Compatible API
```python
# ollama_openai.py
"""
Use Ollama through its OpenAI-compatible API.
Allows easy migration between local and cloud models.
"""
from openai import OpenAI

# Point to the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but not used
)


def chat_with_openai_api():
    """Use the OpenAI API format with a local model."""
    response = client.chat.completions.create(
        model='phi3',
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': 'Explain recursion simply.'}
        ],
        temperature=0.7,
        max_tokens=500
    )
    print(response.choices[0].message.content)


def streaming_with_openai_api():
    """Stream responses using the OpenAI API format."""
    stream = client.chat.completions.create(
        model='phi3',
        messages=[
            {'role': 'user', 'content': 'Write a Python function to reverse a string.'}
        ],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end='', flush=True)
    print()


def embeddings():
    """Generate embeddings using Ollama."""
    response = client.embeddings.create(
        model='nomic-embed-text',  # Pull first: ollama pull nomic-embed-text
        input='Hello, world!'
    )
    embedding = response.data[0].embedding
    print(f"Embedding dimension: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")


if __name__ == "__main__":
    print("=== Chat with OpenAI API ===")
    chat_with_openai_api()
    print("\n=== Streaming ===")
    streaming_with_openai_api()
```

LangChain Integration
```python
# ollama_langchain.py
"""
Use Ollama with LangChain for building applications.
"""
from langchain_ollama import OllamaLLM, ChatOllama, OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


def basic_chain():
    """Create a simple LangChain chain."""
    # Initialize model
    llm = ChatOllama(model='phi3', temperature=0.7)

    # Create prompt template
    prompt = ChatPromptTemplate.from_messages([
        ('system', 'You are a {role}. Be concise and helpful.'),
        ('user', '{question}')
    ])

    # Create chain
    chain = prompt | llm | StrOutputParser()

    # Run chain
    response = chain.invoke({
        'role': 'Python expert',
        'question': 'How do I read a JSON file?'
    })
    print(response)


def structured_output():
    """Generate structured output."""
    # langchain_core.pydantic_v1 is deprecated; import from pydantic directly
    from pydantic import BaseModel, Field
    from typing import List

    class Recipe(BaseModel):
        name: str = Field(description="Name of the recipe")
        ingredients: List[str] = Field(description="List of ingredients")
        steps: List[str] = Field(description="Cooking steps")
        prep_time: int = Field(description="Preparation time in minutes")

    llm = ChatOllama(model='phi3', temperature=0)

    # Note: structured output support varies by model
    prompt = ChatPromptTemplate.from_messages([
        ('system', 'Extract recipe information as JSON.'),
        ('user', '''Extract the recipe from this text:
Classic Pancakes: Mix 1 cup flour, 1 egg, 1 cup milk, and 2 tbsp sugar.
Heat pan, pour batter, flip when bubbles form. Takes about 20 minutes.
''')
    ])

    chain = prompt | llm | StrOutputParser()
    result = chain.invoke({})
    print(result)


def embeddings_example():
    """Use Ollama embeddings."""
    embeddings = OllamaEmbeddings(model='nomic-embed-text')

    # Single text
    vector = embeddings.embed_query("Hello, world!")
    print(f"Embedding dimension: {len(vector)}")

    # Multiple texts
    texts = ["First document", "Second document", "Third document"]
    vectors = embeddings.embed_documents(texts)
    print(f"Number of embeddings: {len(vectors)}")


if __name__ == "__main__":
    print("=== Basic Chain ===")
    basic_chain()
    print("\n=== Structured Output ===")
    structured_output()
```

Part 3: llama.cpp Setup
Why llama.cpp?
- Written in C/C++ for maximum performance
- Supports CPU and GPU inference
- Quantization support (2-8 bit)
- Active community with latest model support
Installation
```bash
# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build for CPU (llama.cpp now builds with CMake; the old Makefile targets are deprecated)
cmake -B build
cmake --build build --config Release

# Optional backends, passed at configure time:
#   Metal (enabled by default on macOS): -DGGML_METAL=ON
#   CUDA (NVIDIA):                       -DGGML_CUDA=ON
#   OpenBLAS (faster CPU):               -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
# For example:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

Download GGUF Models
```bash
# Using huggingface-cli
pip install huggingface-hub

# Download Phi-3 GGUF
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf \
  Phi-3-mini-4k-instruct-q4.gguf \
  --local-dir ./models

# Download Qwen 2.5 GGUF
huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF \
  qwen2.5-3b-instruct-q4_k_m.gguf \
  --local-dir ./models
```

Running with llama.cpp
```bash
# (With a CMake build, the binaries live under ./build/bin/)

# Basic inference
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 256

# Interactive mode
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
  --interactive \
  --color

# With specific parameters
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
  -p "Write a Python function:" \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1 \
  -ngl 35  # GPU layers (if a GPU is available)
```

llama.cpp Server
```bash
# Start the server
./llama-server -m models/Phi-3-mini-4k-instruct-q4.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35

# Server is now accessible at http://localhost:8080
# OpenAI-compatible API at http://localhost:8080/v1
```

Part 4: Python with llama-cpp-python
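With llama-server running, any OpenAI-style client can talk to it. A minimal stdlib-only sketch (assumes the server is listening on localhost:8080; with a single loaded model the `model` field can be omitted):

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080"):
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_chat_request("Say hello in one word.")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, the same request works against Ollama's server on port 11434 by changing only `base_url`.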
Installation
```bash
# CPU only
pip install llama-cpp-python

# With Metal support (macOS)
# (recent llama-cpp-python builds use GGML_ flags; older releases used LLAMA_)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# With CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# With OpenBLAS
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

Basic Usage
```python
# llama_cpp_basic.py
"""
Using llama-cpp-python for local inference.
"""
from llama_cpp import Llama


def basic_inference():
    """Basic text generation."""
    # Load model
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,        # Context window
        n_threads=8,       # CPU threads
        n_gpu_layers=35,   # GPU layers (0 for CPU only)
        verbose=False
    )

    # Generate text
    output = llm(
        "Explain what an API is in one paragraph:",
        max_tokens=200,
        temperature=0.7,
        top_p=0.9,
        echo=False  # Don't include the prompt in the output
    )
    print(output['choices'][0]['text'])


def chat_completion():
    """Chat-style completion."""
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,
        n_gpu_layers=35,
        chat_format="chatml"  # Use the chat format appropriate for the model
    )

    messages = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I handle errors in Python?"}
    ]

    response = llm.create_chat_completion(
        messages=messages,
        temperature=0.7,
        max_tokens=500
    )
    print(response['choices'][0]['message']['content'])


def streaming_generation():
    """Stream tokens as they're generated."""
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,
        n_gpu_layers=35
    )

    stream = llm(
        "Write a short poem about Python programming:",
        max_tokens=200,
        stream=True
    )
    for token in stream:
        print(token['choices'][0]['text'], end='', flush=True)
    print()


def embeddings():
    """Generate embeddings with llama.cpp."""
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        embedding=True,  # Enable embedding mode
        n_ctx=512
    )

    text = "This is a sample text for embedding."
    embedding = llm.embed(text)
    print(f"Embedding dimension: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")


if __name__ == "__main__":
    print("=== Basic Inference ===")
    basic_inference()
    print("\n=== Chat Completion ===")
    chat_completion()
    print("\n=== Streaming ===")
    streaming_generation()
```

High-Performance Server
```python
# llama_cpp_server.py
"""
FastAPI server for llama-cpp-python.
"""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
from llama_cpp import Llama
import uvicorn

app = FastAPI(title="Local SLM API")

# Loaded at startup
llm = None


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: List[Message]
    temperature: float = 0.7
    max_tokens: int = 500
    stream: bool = False


class CompletionRequest(BaseModel):
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 500


@app.on_event("startup")  # Newer FastAPI prefers lifespan handlers, but this still works
async def load_model():
    global llm
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,
        n_gpu_layers=35,
        chat_format="chatml"
    )
    print("Model loaded successfully!")


@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    if llm is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    messages = [{"role": m.role, "content": m.content} for m in request.messages]
    response = llm.create_chat_completion(
        messages=messages,
        temperature=request.temperature,
        max_tokens=request.max_tokens
    )
    return response


@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    if llm is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    response = llm(
        request.prompt,
        temperature=request.temperature,
        max_tokens=request.max_tokens
    )
    return response


@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": llm is not None}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Part 5: Model Format Conversion
Converting to GGUF
```python
# convert_to_gguf.py
"""
Convert HuggingFace models to GGUF format.
Requires the llama.cpp convert scripts.
"""
import subprocess
import os


def convert_hf_to_gguf(
    model_name: str,
    output_dir: str = "./models",
    quantization: str = "q4_k_m"
):
    """
    Convert a HuggingFace model to GGUF format.

    Steps:
    1. Download the model from HuggingFace
    2. Convert to GGUF format
    3. Quantize to reduce size
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Download model
    print(f"Downloading {model_name}...")
    subprocess.run([
        "huggingface-cli", "download", model_name,
        "--local-dir", f"{output_dir}/{model_name.split('/')[-1]}"
    ], check=True)

    # Convert to GGUF (using the llama.cpp convert script)
    model_dir = f"{output_dir}/{model_name.split('/')[-1]}"
    output_file = f"{output_dir}/{model_name.split('/')[-1]}.gguf"

    print("Converting to GGUF...")
    subprocess.run([
        "python", "llama.cpp/convert_hf_to_gguf.py",
        model_dir,
        "--outfile", output_file,
        "--outtype", "f16"  # First convert to f16
    ], check=True)

    # Quantize
    quantized_file = f"{output_dir}/{model_name.split('/')[-1]}-{quantization}.gguf"
    print(f"Quantizing to {quantization}...")
    subprocess.run([
        "./llama.cpp/llama-quantize",
        output_file,
        quantized_file,
        quantization
    ], check=True)

    print(f"Done! Model saved to {quantized_file}")
    return quantized_file


# Quantization options:
#   q2_k   - Smallest, lowest quality
#   q3_k_m - Small, low quality
#   q4_0   - Medium, good balance
#   q4_k_m - Medium, better quality (recommended)
#   q5_k_m - Larger, high quality
#   q6_k   - Large, very high quality
#   q8_0   - Largest quantized, near-original
#   f16    - Half precision (no quantization)
```

Part 6: Performance Optimization
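The quantization level largely determines file size on disk. A rough sketch (the bits-per-weight values are approximations I've assumed here; k-quants mix block sizes, so real GGUF files vary slightly):

```python
# Approximate effective bits per weight for common GGUF quantizations (assumed values)
BITS_PER_WEIGHT = {
    "q2_k": 2.6,
    "q4_k_m": 4.8,
    "q5_k_m": 5.7,
    "q8_0": 8.5,
    "f16": 16.0,
}


def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB: params × bits/weight ÷ 8 bits per byte."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in ("q2_k", "q4_k_m", "q8_0", "f16"):
    print(f"3B model at {quant:7s}: ~{gguf_size_gb(3.0, quant):.1f} GB")
```

This is why q4_k_m is the usual default: it is roughly a third the size of f16 while preserving most of the quality.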
Benchmarking Script
```python
# benchmark.py
"""
Benchmark local SLM performance.
"""
import time
from typing import Dict, List
import statistics


def benchmark_model(
    llm,
    prompts: List[str],
    max_tokens: int = 100,
    num_runs: int = 3
) -> Dict:
    """Benchmark model performance."""
    results = {
        "tokens_per_second": [],
        "time_to_first_token": [],
        "total_time": []
    }

    for prompt in prompts:
        for _ in range(num_runs):
            start_time = time.time()
            first_token_time = None
            token_count = 0

            # Stream to measure time to first token
            stream = llm(prompt, max_tokens=max_tokens, stream=True)
            for token in stream:
                if first_token_time is None:
                    first_token_time = time.time() - start_time
                token_count += 1

            total_time = time.time() - start_time
            results["tokens_per_second"].append(token_count / total_time)
            results["time_to_first_token"].append(first_token_time)
            results["total_time"].append(total_time)

    return {
        "avg_tokens_per_second": statistics.mean(results["tokens_per_second"]),
        "avg_time_to_first_token": statistics.mean(results["time_to_first_token"]),
        "avg_total_time": statistics.mean(results["total_time"]),
        "std_tokens_per_second": statistics.stdev(results["tokens_per_second"])
        if len(results["tokens_per_second"]) > 1 else 0
    }


def compare_models():
    """Compare different models and configurations."""
    from llama_cpp import Llama

    models = [
        ("Phi-3 Q4", "./models/phi-3-q4_k_m.gguf"),
        ("Qwen 2.5 Q4", "./models/qwen2.5-3b-q4_k_m.gguf"),
    ]
    prompts = [
        "Explain machine learning in simple terms:",
        "Write a Python function to sort a list:",
        "What is the capital of France?",
    ]

    print("Model Benchmark Results")
    print("=" * 60)

    for model_name, model_path in models:
        try:
            llm = Llama(
                model_path=model_path,
                n_ctx=2048,
                n_gpu_layers=35,
                verbose=False
            )
            results = benchmark_model(llm, prompts)
            print(f"\n{model_name}:")
            print(f"  Tokens/sec: {results['avg_tokens_per_second']:.2f} "
                  f"(± {results['std_tokens_per_second']:.2f})")
            print(f"  Time to first token: {results['avg_time_to_first_token']*1000:.0f}ms")
            print(f"  Avg generation time: {results['avg_total_time']:.2f}s")
        except Exception as e:
            print(f"\n{model_name}: Error - {e}")


if __name__ == "__main__":
    compare_models()
```

Optimization Tips
```python
# optimization_tips.py
"""
Tips for optimizing local SLM performance.
"""
from llama_cpp import Llama


def optimized_setup():
    """Optimized model configuration."""
    llm = Llama(
        model_path="./models/phi-3-q4_k_m.gguf",

        # Context and batch size
        n_ctx=2048,          # Smaller context = faster
        n_batch=512,         # Batch size for prompt processing

        # Threading
        n_threads=8,         # Match your CPU cores
        n_threads_batch=8,   # Threads for batch processing

        # GPU acceleration
        n_gpu_layers=35,     # -1 for all layers on GPU

        # Memory optimization
        use_mmap=True,       # Memory-map the model file
        use_mlock=False,     # Don't lock in RAM (saves memory)

        # Disable unused features
        embedding=False,     # Disable if not using embeddings
        verbose=False        # Disable logging
    )
    return llm


# Key optimization strategies:
#
# 1. Use appropriate quantization
#    - q4_k_m for the best balance
#    - q8_0 for quality-critical applications
#    - q2_k for memory-constrained environments
#
# 2. Adjust the context window
#    - Smaller n_ctx = faster inference
#    - Only use what you need
#
# 3. GPU acceleration
#    - Use n_gpu_layers for GPU offloading
#    - More layers on GPU = faster (if VRAM allows)
#
# 4. Batch processing
#    - Process multiple prompts together
#    - Use a larger n_batch for throughput
#
# 5. Model selection
#    - Smaller models are faster
#    - Phi-3 and Qwen 2.5 are well optimized
```

Testing Your Setup
```python
# test_setup.py
"""
Test your local SLM setup.
"""
import subprocess


def test_ollama():
    """Test the Ollama installation."""
    try:
        result = subprocess.run(
            ["ollama", "list"],
            capture_output=True,
            text=True
        )
        if result.returncode == 0:
            print("✅ Ollama is installed and running")
            print(f"   Models: {result.stdout.strip() or 'None installed'}")
        else:
            print("❌ Ollama service not running")
            print("   Run: ollama serve")
    except FileNotFoundError:
        print("❌ Ollama not installed")
        print("   Install: brew install ollama (macOS)")


def test_python_packages():
    """Test Python packages."""
    packages = ['ollama', 'llama_cpp', 'langchain_ollama', 'openai']
    for package in packages:
        try:
            __import__(package)
            print(f"✅ {package} installed")
        except ImportError:
            print(f"❌ {package} not installed")


def test_inference():
    """Test basic inference."""
    try:
        import ollama
        response = ollama.generate(
            model='phi3',
            prompt='Say hello in one word.'
        )
        print(f"✅ Inference working: {response['response'].strip()}")
    except Exception as e:
        print(f"❌ Inference failed: {e}")


if __name__ == "__main__":
    print("=== Local SLM Setup Test ===\n")
    print("1. Checking Ollama...")
    test_ollama()
    print("\n2. Checking Python packages...")
    test_python_packages()
    print("\n3. Testing inference...")
    test_inference()
    print("\n=== Test Complete ===")
```

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Ollama | All-in-one tool for downloading and running models | Simplest setup - one command to run any model |
| llama.cpp | C++ inference engine with GGUF support | Maximum performance, GPU/CPU optimization |
| GGUF | File format for quantized models | Standard format that works with all tools |
| Quantization Level | Q2_K to Q8_0 - bits per weight | Lower = smaller/faster, higher = more accurate |
| n_ctx | Context window size (tokens model can see) | Larger = more context but more RAM |
| n_gpu_layers | Layers offloaded to GPU | More layers = faster (if VRAM allows) |
| Modelfile | Ollama configuration for custom models | Set system prompts, parameters, stop tokens |
| OpenAI Compatibility | Ollama serves OpenAI-compatible API | Drop-in replacement for cloud APIs |
| Temperature | Randomness in output (0.0-2.0) | Lower = deterministic, higher = creative |
| Streaming | Token-by-token output | Better UX - see responses as they generate |
Next Steps
After setting up your local SLM environment:
- SLM for Text Tasks - Build practical NLP applications
- SLM Benchmarking - Evaluate and compare models
- SLM Fine-tuning - Customize models for your domain