Local SLM Setup
Run Phi-3, Gemma, and Qwen locally with Ollama and llama.cpp
TL;DR
Run small language models locally with zero cloud dependencies using Ollama (easiest) or llama.cpp (fastest). Models ship in GGUF format with quantization levels (Q2-Q8) that trade size for quality. A 3B model at Q4 needs ~2 GB RAM and runs at 20-50 tokens/sec on modern hardware. Key formula: RAM_GB ≈ params_B × bytes_per_param × 1.2.
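That rule of thumb is easy to sanity-check in code. A minimal sketch (the 0.5 bytes/param figure for Q4 and the 1.2 overhead factor are approximations, not exact values):

```python
def estimate_ram_gb(params_billion: float, bytes_per_param: float) -> float:
    """RAM_GB ≈ params_B × bytes_per_param × 1.2 (the 1.2 covers KV cache and buffers)."""
    return params_billion * bytes_per_param * 1.2

# Q4 quantization stores roughly 0.5 bytes (4 bits) per parameter
print(f"3.8B at Q4:  ~{estimate_ram_gb(3.8, 0.5):.1f} GB")
print(f"3.8B at f16: ~{estimate_ram_gb(3.8, 2.0):.1f} GB")
```

The same arithmetic explains why a 3B-class model fits comfortably in 8 GB of RAM at Q4 but not at full f16 precision.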
Overview
| Aspect | Details |
|---|---|
| Difficulty | Beginner |
| Time | ~2 hours |
| Code | ~300 lines |
| Prerequisites | Python 3.10+, 8GB RAM |
What You'll Build
A complete local SLM environment with:
- Ollama for easy model management
- llama.cpp for high-performance inference
- Python integration with multiple libraries
- API server for application integration
Local SLM Stack (diagram): Small Models → Model Formats → Inference Tools → Your Application
Understanding Small Language Models
Why Run SLMs Locally?
- Privacy: data stays local; HIPAA/GDPR compliant; enterprise secure
- Cost: no API fees; unlimited usage; predictable costs
- Performance: low latency; no network dependency; offline capable
- Control: custom fine-tuning; version control; no rate limits
Model Comparison (2025)
| Model | Parameters | RAM Required | Speed | Best For |
|---|---|---|---|---|
| SmolLM 2 | 135M-1.7B | 1-4 GB | Very Fast | Embedded, IoT |
| Phi-4-mini | 3.8B | 4-8 GB | Fast | General tasks, coding |
| Gemma 3 | 1B-4B | 2-8 GB | Fast | Multi-modal, instruction following |
| Qwen 3 | 0.6B-8B | 2-16 GB | Fast | Multilingual, tool use |
| Llama 3.2 | 1B-3B | 2-8 GB | Fast | On-device AI |
Part 1: Ollama Setup
Installation
Ollama is the easiest way to run LLMs locally.
```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from https://ollama.com/download/windows
```

Running Models
```bash
# Start the Ollama service (runs in the background)
ollama serve

# Pull and run Phi-3
ollama pull phi3
ollama run phi3

# Pull other popular small models
ollama pull gemma2:2b
ollama pull qwen2.5:3b
ollama pull llama3.2:3b
ollama pull smollm:1.7b
```

Ollama Commands
```bash
# List installed models
ollama list

# Show model information
ollama show phi3

# Remove a model
ollama rm phi3

# Copy/rename a model
ollama cp phi3 my-phi3

# Create a custom model from a Modelfile
ollama create my-assistant -f Modelfile
```

Custom Modelfile
```
# Modelfile
FROM phi3

# Set system prompt
SYSTEM """
You are a helpful coding assistant. You provide concise,
accurate answers with code examples when appropriate.
"""

# Set sampling parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Set stop tokens
PARAMETER stop "<|end|>"
PARAMETER stop "<|user|>"
```

```bash
# Create and run the custom model
ollama create code-assistant -f Modelfile
ollama run code-assistant
```

Part 2: Python Integration with Ollama
Project Setup
```bash
# Create project
mkdir local-slm && cd local-slm

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install ollama langchain-ollama openai requests
```

Basic Ollama Python Usage
```python
# ollama_basic.py
"""
Basic usage of the Ollama Python library.
"""
import ollama


def simple_completion():
    """Generate a simple completion."""
    response = ollama.generate(
        model='phi3',
        prompt='Explain what a transformer is in 2 sentences.'
    )
    print(response['response'])


def chat_completion():
    """Have a conversation with the model."""
    response = ollama.chat(
        model='phi3',
        messages=[
            {
                'role': 'system',
                'content': 'You are a helpful assistant.'
            },
            {
                'role': 'user',
                'content': 'What is the capital of France?'
            }
        ]
    )
    print(response['message']['content'])


def streaming_response():
    """Stream responses for better UX."""
    stream = ollama.chat(
        model='phi3',
        messages=[{'role': 'user', 'content': 'Write a haiku about coding.'}],
        stream=True
    )
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
    print()  # New line at end


def list_models():
    """List available models."""
    models = ollama.list()
    print("Available models:")
    for model in models['models']:
        size_gb = model['size'] / (1024**3)
        print(f"  - {model['name']}: {size_gb:.2f} GB")


if __name__ == "__main__":
    print("=== Simple Completion ===")
    simple_completion()
    print("\n=== Chat Completion ===")
    chat_completion()
    print("\n=== Streaming Response ===")
    streaming_response()
    print("\n=== Available Models ===")
    list_models()
```

Ollama with OpenAI-Compatible API
```python
# ollama_openai.py
"""
Use Ollama through its OpenAI-compatible API.
Allows easy migration between local and cloud models.
"""
from openai import OpenAI

# Point to the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but not used
)


def chat_with_openai_api():
    """Use the OpenAI API format with a local model."""
    response = client.chat.completions.create(
        model='phi3',
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': 'Explain recursion simply.'}
        ],
        temperature=0.7,
        max_tokens=500
    )
    print(response.choices[0].message.content)


def streaming_with_openai_api():
    """Stream responses using the OpenAI API format."""
    stream = client.chat.completions.create(
        model='phi3',
        messages=[
            {'role': 'user', 'content': 'Write a Python function to reverse a string.'}
        ],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end='', flush=True)
    print()


def embeddings():
    """Generate embeddings using Ollama."""
    response = client.embeddings.create(
        model='nomic-embed-text',  # Pull first: ollama pull nomic-embed-text
        input='Hello, world!'
    )
    embedding = response.data[0].embedding
    print(f"Embedding dimension: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")


if __name__ == "__main__":
    print("=== Chat with OpenAI API ===")
    chat_with_openai_api()
    print("\n=== Streaming ===")
    streaming_with_openai_api()
```

LangChain Integration
```python
# ollama_langchain.py
"""
Use Ollama with LangChain for building applications.
"""
from langchain_ollama import OllamaLLM, ChatOllama, OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


def basic_chain():
    """Create a simple LangChain chain."""
    # Initialize model
    llm = ChatOllama(model='phi3', temperature=0.7)

    # Create prompt template
    prompt = ChatPromptTemplate.from_messages([
        ('system', 'You are a {role}. Be concise and helpful.'),
        ('user', '{question}')
    ])

    # Create chain
    chain = prompt | llm | StrOutputParser()

    # Run chain
    response = chain.invoke({
        'role': 'Python expert',
        'question': 'How do I read a JSON file?'
    })
    print(response)


def structured_output():
    """Generate structured output."""
    # langchain_core.pydantic_v1 is deprecated; import from pydantic directly
    from pydantic import BaseModel, Field
    from typing import List

    class Recipe(BaseModel):
        name: str = Field(description="Name of the recipe")
        ingredients: List[str] = Field(description="List of ingredients")
        steps: List[str] = Field(description="Cooking steps")
        prep_time: int = Field(description="Preparation time in minutes")

    llm = ChatOllama(model='phi3', temperature=0)

    # Note: structured output support varies by model
    prompt = ChatPromptTemplate.from_messages([
        ('system', 'Extract recipe information as JSON.'),
        ('user', '''Extract the recipe from this text:
Classic Pancakes: Mix 1 cup flour, 1 egg, 1 cup milk, and 2 tbsp sugar.
Heat pan, pour batter, flip when bubbles form. Takes about 20 minutes.
''')
    ])

    chain = prompt | llm | StrOutputParser()
    result = chain.invoke({})
    print(result)


def embeddings_example():
    """Use Ollama embeddings."""
    embeddings = OllamaEmbeddings(model='nomic-embed-text')

    # Single text
    vector = embeddings.embed_query("Hello, world!")
    print(f"Embedding dimension: {len(vector)}")

    # Multiple texts
    texts = ["First document", "Second document", "Third document"]
    vectors = embeddings.embed_documents(texts)
    print(f"Number of embeddings: {len(vectors)}")


if __name__ == "__main__":
    print("=== Basic Chain ===")
    basic_chain()
    print("\n=== Structured Output ===")
    structured_output()
```

Part 3: llama.cpp Setup
Why llama.cpp?
- Written in C/C++ for maximum performance
- Supports CPU and GPU inference
- Quantization support (2-8 bit)
- Active community with latest model support
Installation
```bash
# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build for CPU (llama.cpp now builds with CMake; the old Makefile targets are deprecated)
cmake -B build
cmake --build build --config Release

# Optional backends, passed at configure time:
#   Metal (enabled by default on macOS): -DGGML_METAL=ON
#   CUDA (NVIDIA):                       -DGGML_CUDA=ON
#   OpenBLAS (faster CPU):               -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
# For example:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

Download GGUF Models
```bash
# Using huggingface-cli
pip install huggingface-hub

# Download Phi-3 GGUF
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf \
  Phi-3-mini-4k-instruct-q4.gguf \
  --local-dir ./models

# Download Qwen 2.5 GGUF
huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF \
  qwen2.5-3b-instruct-q4_k_m.gguf \
  --local-dir ./models
```

Running with llama.cpp
```bash
# (With a CMake build, the binaries live under ./build/bin/)

# Basic inference
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 256

# Interactive mode
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
  --interactive \
  --color

# With specific parameters
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
  -p "Write a Python function:" \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1 \
  -ngl 35  # GPU layers (if a GPU is available)
```

llama.cpp Server
```bash
# Start the server
./llama-server -m models/Phi-3-mini-4k-instruct-q4.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35

# Server is now accessible at http://localhost:8080
# OpenAI-compatible API at http://localhost:8080/v1
```

Part 4: Python with llama-cpp-python
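With llama-server running, any OpenAI-style client can talk to it. A minimal stdlib-only sketch (assumes the server is listening on localhost:8080; with a single loaded model the `model` field can be omitted):

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080"):
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_chat_request("Say hello in one word.")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, the same request works against Ollama's server on port 11434 by changing only `base_url`.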
Installation
```bash
# CPU only
pip install llama-cpp-python

# With Metal support (macOS)
# (recent llama-cpp-python builds use GGML_ flags; older releases used LLAMA_)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# With CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# With OpenBLAS
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

Basic Usage
```python
# llama_cpp_basic.py
"""
Using llama-cpp-python for local inference.
"""
from llama_cpp import Llama


def basic_inference():
    """Basic text generation."""
    # Load model
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,        # Context window
        n_threads=8,       # CPU threads
        n_gpu_layers=35,   # GPU layers (0 for CPU only)
        verbose=False
    )

    # Generate text
    output = llm(
        "Explain what an API is in one paragraph:",
        max_tokens=200,
        temperature=0.7,
        top_p=0.9,
        echo=False  # Don't include the prompt in the output
    )
    print(output['choices'][0]['text'])


def chat_completion():
    """Chat-style completion."""
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,
        n_gpu_layers=35,
        chat_format="chatml"  # Use the chat format appropriate for the model
    )

    messages = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I handle errors in Python?"}
    ]

    response = llm.create_chat_completion(
        messages=messages,
        temperature=0.7,
        max_tokens=500
    )
    print(response['choices'][0]['message']['content'])


def streaming_generation():
    """Stream tokens as they're generated."""
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,
        n_gpu_layers=35
    )

    stream = llm(
        "Write a short poem about Python programming:",
        max_tokens=200,
        stream=True
    )
    for token in stream:
        print(token['choices'][0]['text'], end='', flush=True)
    print()


def embeddings():
    """Generate embeddings with llama.cpp."""
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        embedding=True,  # Enable embedding mode
        n_ctx=512
    )

    text = "This is a sample text for embedding."
    embedding = llm.embed(text)
    print(f"Embedding dimension: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")


if __name__ == "__main__":
    print("=== Basic Inference ===")
    basic_inference()
    print("\n=== Chat Completion ===")
    chat_completion()
    print("\n=== Streaming ===")
    streaming_generation()
```

High-Performance Server
```python
# llama_cpp_server.py
"""
FastAPI server for llama-cpp-python.
"""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
from llama_cpp import Llama
import uvicorn

app = FastAPI(title="Local SLM API")

# Loaded at startup
llm = None


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: List[Message]
    temperature: float = 0.7
    max_tokens: int = 500
    stream: bool = False


class CompletionRequest(BaseModel):
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 500


@app.on_event("startup")  # Newer FastAPI prefers lifespan handlers, but this still works
async def load_model():
    global llm
    llm = Llama(
        model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,
        n_gpu_layers=35,
        chat_format="chatml"
    )
    print("Model loaded successfully!")


@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    if llm is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    messages = [{"role": m.role, "content": m.content} for m in request.messages]
    response = llm.create_chat_completion(
        messages=messages,
        temperature=request.temperature,
        max_tokens=request.max_tokens
    )
    return response


@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    if llm is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    response = llm(
        request.prompt,
        temperature=request.temperature,
        max_tokens=request.max_tokens
    )
    return response


@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": llm is not None}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Part 5: Model Format Conversion
Converting to GGUF
```python
# convert_to_gguf.py
"""
Convert HuggingFace models to GGUF format.
Requires the llama.cpp convert scripts.
"""
import subprocess
import os


def convert_hf_to_gguf(
    model_name: str,
    output_dir: str = "./models",
    quantization: str = "q4_k_m"
):
    """
    Convert a HuggingFace model to GGUF format.

    Steps:
    1. Download the model from HuggingFace
    2. Convert to GGUF format
    3. Quantize to reduce size
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Download model
    print(f"Downloading {model_name}...")
    subprocess.run([
        "huggingface-cli", "download", model_name,
        "--local-dir", f"{output_dir}/{model_name.split('/')[-1]}"
    ], check=True)

    # Convert to GGUF (using the llama.cpp convert script)
    model_dir = f"{output_dir}/{model_name.split('/')[-1]}"
    output_file = f"{output_dir}/{model_name.split('/')[-1]}.gguf"

    print("Converting to GGUF...")
    subprocess.run([
        "python", "llama.cpp/convert_hf_to_gguf.py",
        model_dir,
        "--outfile", output_file,
        "--outtype", "f16"  # First convert to f16
    ], check=True)

    # Quantize
    quantized_file = f"{output_dir}/{model_name.split('/')[-1]}-{quantization}.gguf"
    print(f"Quantizing to {quantization}...")
    subprocess.run([
        "./llama.cpp/llama-quantize",
        output_file,
        quantized_file,
        quantization
    ], check=True)

    print(f"Done! Model saved to {quantized_file}")
    return quantized_file


# Quantization options:
#   q2_k   - Smallest, lowest quality
#   q3_k_m - Small, low quality
#   q4_0   - Medium, good balance
#   q4_k_m - Medium, better quality (recommended)
#   q5_k_m - Larger, high quality
#   q6_k   - Large, very high quality
#   q8_0   - Largest quantized, near-original
#   f16    - Half precision (no quantization)
```

Part 6: Performance Optimization
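The quantization level largely determines file size on disk. A rough sketch (the bits-per-weight values are approximations I've assumed here; k-quants mix block sizes, so real GGUF files vary slightly):

```python
# Approximate effective bits per weight for common GGUF quantizations (assumed values)
BITS_PER_WEIGHT = {
    "q2_k": 2.6,
    "q4_k_m": 4.8,
    "q5_k_m": 5.7,
    "q8_0": 8.5,
    "f16": 16.0,
}


def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB: params × bits/weight ÷ 8 bits per byte."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in ("q2_k", "q4_k_m", "q8_0", "f16"):
    print(f"3B model at {quant:7s}: ~{gguf_size_gb(3.0, quant):.1f} GB")
```

This is why q4_k_m is the usual default: it is roughly a third the size of f16 while preserving most of the quality.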
Benchmarking Script
```python
# benchmark.py
"""
Benchmark local SLM performance.
"""
import time
from typing import Dict, List
import statistics


def benchmark_model(
    llm,
    prompts: List[str],
    max_tokens: int = 100,
    num_runs: int = 3
) -> Dict:
    """Benchmark model performance."""
    results = {
        "tokens_per_second": [],
        "time_to_first_token": [],
        "total_time": []
    }

    for prompt in prompts:
        for _ in range(num_runs):
            start_time = time.time()
            first_token_time = None
            token_count = 0

            # Stream to measure time to first token
            stream = llm(prompt, max_tokens=max_tokens, stream=True)
            for token in stream:
                if first_token_time is None:
                    first_token_time = time.time() - start_time
                token_count += 1

            total_time = time.time() - start_time
            results["tokens_per_second"].append(token_count / total_time)
            results["time_to_first_token"].append(first_token_time)
            results["total_time"].append(total_time)

    return {
        "avg_tokens_per_second": statistics.mean(results["tokens_per_second"]),
        "avg_time_to_first_token": statistics.mean(results["time_to_first_token"]),
        "avg_total_time": statistics.mean(results["total_time"]),
        "std_tokens_per_second": statistics.stdev(results["tokens_per_second"])
        if len(results["tokens_per_second"]) > 1 else 0
    }


def compare_models():
    """Compare different models and configurations."""
    from llama_cpp import Llama

    models = [
        ("Phi-3 Q4", "./models/phi-3-q4_k_m.gguf"),
        ("Qwen 2.5 Q4", "./models/qwen2.5-3b-q4_k_m.gguf"),
    ]
    prompts = [
        "Explain machine learning in simple terms:",
        "Write a Python function to sort a list:",
        "What is the capital of France?",
    ]

    print("Model Benchmark Results")
    print("=" * 60)

    for model_name, model_path in models:
        try:
            llm = Llama(
                model_path=model_path,
                n_ctx=2048,
                n_gpu_layers=35,
                verbose=False
            )
            results = benchmark_model(llm, prompts)
            print(f"\n{model_name}:")
            print(f"  Tokens/sec: {results['avg_tokens_per_second']:.2f} "
                  f"(± {results['std_tokens_per_second']:.2f})")
            print(f"  Time to first token: {results['avg_time_to_first_token']*1000:.0f}ms")
            print(f"  Avg generation time: {results['avg_total_time']:.2f}s")
        except Exception as e:
            print(f"\n{model_name}: Error - {e}")


if __name__ == "__main__":
    compare_models()
```

Optimization Tips
```python
# optimization_tips.py
"""
Tips for optimizing local SLM performance.
"""
from llama_cpp import Llama


def optimized_setup():
    """Optimized model configuration."""
    llm = Llama(
        model_path="./models/phi-3-q4_k_m.gguf",

        # Context and batch size
        n_ctx=2048,          # Smaller context = faster
        n_batch=512,         # Batch size for prompt processing

        # Threading
        n_threads=8,         # Match your CPU cores
        n_threads_batch=8,   # Threads for batch processing

        # GPU acceleration
        n_gpu_layers=35,     # -1 for all layers on GPU

        # Memory optimization
        use_mmap=True,       # Memory-map the model file
        use_mlock=False,     # Don't lock in RAM (saves memory)

        # Disable unused features
        embedding=False,     # Disable if not using embeddings
        verbose=False        # Disable logging
    )
    return llm


# Key optimization strategies:
#
# 1. Use appropriate quantization
#    - q4_k_m for the best balance
#    - q8_0 for quality-critical applications
#    - q2_k for memory-constrained environments
#
# 2. Adjust the context window
#    - Smaller n_ctx = faster inference
#    - Only use what you need
#
# 3. GPU acceleration
#    - Use n_gpu_layers for GPU offloading
#    - More layers on GPU = faster (if VRAM allows)
#
# 4. Batch processing
#    - Process multiple prompts together
#    - Use a larger n_batch for throughput
#
# 5. Model selection
#    - Smaller models are faster
#    - Phi-3 and Qwen 2.5 are well optimized
```

Testing Your Setup
```python
# test_setup.py
"""
Test your local SLM setup.
"""
import subprocess


def test_ollama():
    """Test the Ollama installation."""
    try:
        result = subprocess.run(
            ["ollama", "list"],
            capture_output=True,
            text=True
        )
        if result.returncode == 0:
            print("✅ Ollama is installed and running")
            print(f"   Models: {result.stdout.strip() or 'None installed'}")
        else:
            print("❌ Ollama service not running")
            print("   Run: ollama serve")
    except FileNotFoundError:
        print("❌ Ollama not installed")
        print("   Install: brew install ollama (macOS)")


def test_python_packages():
    """Test Python packages."""
    packages = ['ollama', 'llama_cpp', 'langchain_ollama', 'openai']
    for package in packages:
        try:
            __import__(package)
            print(f"✅ {package} installed")
        except ImportError:
            print(f"❌ {package} not installed")


def test_inference():
    """Test basic inference."""
    try:
        import ollama
        response = ollama.generate(
            model='phi3',
            prompt='Say hello in one word.'
        )
        print(f"✅ Inference working: {response['response'].strip()}")
    except Exception as e:
        print(f"❌ Inference failed: {e}")


if __name__ == "__main__":
    print("=== Local SLM Setup Test ===\n")
    print("1. Checking Ollama...")
    test_ollama()
    print("\n2. Checking Python packages...")
    test_python_packages()
    print("\n3. Testing inference...")
    test_inference()
    print("\n=== Test Complete ===")
```

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Ollama | All-in-one tool for downloading and running models | Simplest setup - one command to run any model |
| llama.cpp | C++ inference engine with GGUF support | Maximum performance, GPU/CPU optimization |
| GGUF | File format for quantized models | Standard format that works with all tools |
| Quantization Level | Q2_K to Q8_0 - bits per weight | Lower = smaller/faster, higher = more accurate |
| n_ctx | Context window size (tokens model can see) | Larger = more context but more RAM |
| n_gpu_layers | Layers offloaded to GPU | More layers = faster (if VRAM allows) |
| Modelfile | Ollama configuration for custom models | Set system prompts, parameters, stop tokens |
| OpenAI Compatibility | Ollama serves OpenAI-compatible API | Drop-in replacement for cloud APIs |
| Temperature | Randomness in output (0.0-2.0) | Lower = deterministic, higher = creative |
| Streaming | Token-by-token output | Better UX - see responses as they generate |
Next Steps
After setting up your local SLM environment:
- SLM for Text Tasks - Build practical NLP applications
- SLM Benchmarking - Evaluate and compare models
- SLM Fine-tuning - Customize models for your domain