Local SLM Setup
Run Phi-3, Gemma, and Qwen locally with Ollama and llama.cpp
Set up your local environment to run small language models without any cloud dependencies.
TL;DR
Run small language models locally with zero cloud dependencies using Ollama (easiest) or llama.cpp (fastest). Models ship in GGUF format with quantization levels (Q2 to Q8) that trade size for quality. A 3B model at Q4 needs ~2 GB of RAM and generates roughly 20-50 tokens/sec on modern hardware. Key formula: RAM_GB ≈ params_B × bytes_per_param × 1.2.
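The formula above can be turned into a quick calculator. A minimal sketch (the bytes-per-parameter values are rough effective figures per quantization level, and the 1.2 factor covers runtime overhead such as the KV cache):

```python
# ram_estimate.py
# Rough RAM needed to run a model, per the formula:
#   RAM_GB ≈ params_B × bytes_per_param × 1.2
# Bytes-per-parameter are approximate effective values per quant level.

BYTES_PER_PARAM = {
    "q2_k": 0.35,    # ~2.6 bits/weight
    "q4_k_m": 0.56,  # ~4.5 bits/weight
    "q8_0": 1.06,    # ~8.5 bits/weight
    "f16": 2.0,      # half precision, no quantization
}

def estimate_ram_gb(params_b: float, quant: str = "q4_k_m") -> float:
    """Estimate RAM in GB for a model with params_b billion parameters."""
    return round(params_b * BYTES_PER_PARAM[quant] * 1.2, 2)

if __name__ == "__main__":
    for model, size in [("Phi-3 Mini", 3.8), ("Gemma 2", 2.0), ("Qwen 2.5", 3.0)]:
        print(f"{model}: ~{estimate_ram_gb(size)} GB at Q4_K_M")
```

A 3B model at Q4 comes out to about 2 GB, matching the rule of thumb above.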
Overview
| Aspect | Details |
|---|---|
| Difficulty | Beginner |
| Time | ~2 hours |
| Code | ~300 lines |
| Prerequisites | Python 3.10+, 8GB RAM |
What You'll Build
A complete local SLM environment with:
- Ollama for easy model management
- llama.cpp for high-performance inference
- Python integration with multiple libraries
- API server for application integration
┌─────────────────────────────────────────────────────────────────────────────┐
│ Local SLM Stack │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ Small Models │ │ Model Formats │ │
│ │ ┌─────────┐ ┌────────┐ │ │ ┌─────────┐ ┌────────┐ │ │
│ │ │ Phi-3 │ │Gemma 2 │ │ │ │ GGUF │ │Safetens│ │ │
│ │ └─────────┘ └────────┘ │ │ │ (quant) │ │ ors │ │ │
│ │ ┌─────────┐ ┌────────┐ │ │ └─────────┘ └────────┘ │ │
│ │ │Qwen 2.5 │ │Llama3.2│ │ │ │ │
│ │ └─────────┘ └────────┘ │ │ │ │
│ └───────────┬─────────────┘ └────────────┬────────────┘ │
│ │ │ │
│ └────────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Inference Tools │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Ollama │ │ llama.cpp │ │ Python APIs │ │ │
│ │ │ (easiest) │ │ (fastest) │ │ (flexible) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Your Application│ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Understanding Small Language Models
Why Run Locally?
┌─────────────────────────────────────────────────────────────────────────────┐
│ Why Run SLMs Locally? │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ │
│ │ Local SLMs │ │
│ └─────┬──────┘ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Privacy │ │ Cost │ │ Perform. │ │ Control │ │
│ ├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤ │
│ │• Data │ │• No API │ │• Low │ │• Custom │ │
│ │ stays │ │ fees │ │ latency │ │ fine- │ │
│ │ local │ │• Unlimit.│ │• No │ │ tuning │ │
│ │• HIPAA/ │ │ usage │ │ network │ │• Version │ │
│ │ GDPR OK │ │• Predict.│ │ depend. │ │ control │ │
│ │• Enterpr.│ │ costs │ │• Offline │ │• No rate │ │
│ │ secure │ │ │ │ capable │ │ limits │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Model Comparison
| Model | Parameters | RAM Required | Speed | Best For |
|---|---|---|---|---|
| SmolLM | 135M-1.7B | 1-4 GB | Very Fast | Embedded, IoT |
| Phi-3 Mini | 3.8B | 4-8 GB | Fast | General tasks |
| Gemma 2 | 2B | 4-6 GB | Fast | Instruction following |
| Qwen 2.5 | 3B | 4-8 GB | Fast | Multilingual |
| Llama 3.2 | 3B | 4-8 GB | Fast | On-device AI |
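As a rule-of-thumb helper (hypothetical, using the worst-case RAM figures from the table above), you can filter this list by the memory you actually have:

```python
# pick_model.py — hypothetical helper; figures copied from the comparison table above.
MODELS = [
    # (name, params, min_ram_gb, max_ram_gb)
    ("SmolLM",     "135M-1.7B", 1, 4),
    ("Gemma 2",    "2B",        4, 6),
    ("Qwen 2.5",   "3B",        4, 8),
    ("Llama 3.2",  "3B",        4, 8),
    ("Phi-3 Mini", "3.8B",      4, 8),
]

def models_that_fit(available_ram_gb: float) -> list[str]:
    """Return model names whose worst-case RAM need fits in available RAM."""
    return [name for name, _, _, max_ram in MODELS if max_ram <= available_ram_gb]

print(models_that_fit(8))  # every model in the table fits on an 8 GB machine
print(models_that_fit(6))  # only the smaller models
```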
Part 1: Ollama Setup
Installation
Ollama is the easiest way to run LLMs locally.
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download/windows
Running Models
# Start Ollama service (runs in background)
ollama serve
# Pull and run Phi-3
ollama pull phi3
ollama run phi3
# Pull other popular small models
ollama pull gemma2:2b
ollama pull qwen2.5:3b
ollama pull llama3.2:3b
ollama pull smollm:1.7b
Ollama Commands
# List installed models
ollama list
# Show model information
ollama show phi3
# Remove a model
ollama rm phi3
# Copy/rename a model
ollama cp phi3 my-phi3
# Create custom model with Modelfile
ollama create my-assistant -f Modelfile
Custom Modelfile
# Modelfile
FROM phi3
# Set system prompt
SYSTEM """
You are a helpful coding assistant. You provide concise,
accurate answers with code examples when appropriate.
"""
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Set stop tokens
PARAMETER stop "<|end|>"
PARAMETER stop "<|user|>"
# Create and run custom model
ollama create code-assistant -f Modelfile
ollama run code-assistant
Part 2: Python Integration with Ollama
Project Setup
# Create project
mkdir local-slm && cd local-slm
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install ollama langchain-ollama openai requests
Basic Ollama Python Usage
# ollama_basic.py
"""
Basic usage of Ollama Python library.
"""
import ollama
def simple_completion():
"""Generate a simple completion."""
response = ollama.generate(
model='phi3',
prompt='Explain what a transformer is in 2 sentences.'
)
print(response['response'])
def chat_completion():
"""Have a conversation with the model."""
response = ollama.chat(
model='phi3',
messages=[
{
'role': 'system',
'content': 'You are a helpful assistant.'
},
{
'role': 'user',
'content': 'What is the capital of France?'
}
]
)
print(response['message']['content'])
def streaming_response():
"""Stream responses for better UX."""
stream = ollama.chat(
model='phi3',
messages=[{'role': 'user', 'content': 'Write a haiku about coding.'}],
stream=True
)
for chunk in stream:
print(chunk['message']['content'], end='', flush=True)
print() # New line at end
def list_models():
"""List available models."""
models = ollama.list()
print("Available models:")
for model in models['models']:
size_gb = model['size'] / (1024**3)
print(f" - {model['name']}: {size_gb:.2f} GB")
if __name__ == "__main__":
print("=== Simple Completion ===")
simple_completion()
print("\n=== Chat Completion ===")
chat_completion()
print("\n=== Streaming Response ===")
streaming_response()
print("\n=== Available Models ===")
list_models()
Ollama with OpenAI-Compatible API
# ollama_openai.py
"""
Use Ollama with OpenAI-compatible API.
Allows easy migration between local and cloud models.
"""
from openai import OpenAI
# Point to local Ollama server
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Required but not used
)
def chat_with_openai_api():
"""Use OpenAI API format with local model."""
response = client.chat.completions.create(
model='phi3',
messages=[
{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Explain recursion simply.'}
],
temperature=0.7,
max_tokens=500
)
print(response.choices[0].message.content)
def streaming_with_openai_api():
"""Stream responses using OpenAI API format."""
stream = client.chat.completions.create(
model='phi3',
messages=[
{'role': 'user', 'content': 'Write a Python function to reverse a string.'}
],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end='', flush=True)
print()
def embeddings():
"""Generate embeddings using Ollama."""
response = client.embeddings.create(
model='nomic-embed-text', # Pull first: ollama pull nomic-embed-text
input='Hello, world!'
)
embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
if __name__ == "__main__":
print("=== Chat with OpenAI API ===")
chat_with_openai_api()
print("\n=== Streaming ===")
streaming_with_openai_api()
LangChain Integration
# ollama_langchain.py
"""
Use Ollama with LangChain for building applications.
"""
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
def basic_chain():
"""Create a simple LangChain chain."""
# Initialize model
llm = ChatOllama(model='phi3', temperature=0.7)
# Create prompt template
prompt = ChatPromptTemplate.from_messages([
('system', 'You are a {role}. Be concise and helpful.'),
('user', '{question}')
])
# Create chain
chain = prompt | llm | StrOutputParser()
# Run chain
response = chain.invoke({
'role': 'Python expert',
'question': 'How do I read a JSON file?'
})
print(response)
def structured_output():
"""Generate structured output."""
from pydantic import BaseModel, Field
from typing import List
class Recipe(BaseModel):
name: str = Field(description="Name of the recipe")
ingredients: List[str] = Field(description="List of ingredients")
steps: List[str] = Field(description="Cooking steps")
prep_time: int = Field(description="Preparation time in minutes")
llm = ChatOllama(model='phi3', temperature=0)
# Note: structured output support varies by model; where supported,
# llm.with_structured_output(Recipe) enforces the Recipe schema directly
prompt = ChatPromptTemplate.from_messages([
('system', 'Extract recipe information as JSON.'),
('user', '''Extract the recipe from this text:
Classic Pancakes: Mix 1 cup flour, 1 egg, 1 cup milk, and 2 tbsp sugar.
Heat pan, pour batter, flip when bubbles form. Takes about 20 minutes.
''')
])
chain = prompt | llm | StrOutputParser()
result = chain.invoke({})
print(result)
def embeddings_example():
"""Use Ollama embeddings."""
embeddings = OllamaEmbeddings(model='nomic-embed-text')
# Single text
vector = embeddings.embed_query("Hello, world!")
print(f"Embedding dimension: {len(vector)}")
# Multiple texts
texts = ["First document", "Second document", "Third document"]
vectors = embeddings.embed_documents(texts)
print(f"Number of embeddings: {len(vectors)}")
if __name__ == "__main__":
print("=== Basic Chain ===")
basic_chain()
print("\n=== Structured Output ===")
structured_output()
Part 3: llama.cpp Setup
Why llama.cpp?
- Written in C/C++ for maximum performance
- Supports CPU and GPU inference
- Quantization support (2-8 bit)
- Active community with latest model support
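A rough sketch of how the 2-8 bit quantization translates into file size (bits-per-weight figures are approximate effective values for the K-quant schemes; real GGUF files add a little metadata overhead):

```python
# gguf_size.py — approximate GGUF file size per quantization level.
# Bits-per-weight are rough effective values, not exact format specs.
BITS_PER_WEIGHT = {"q2_k": 2.6, "q4_k_m": 4.5, "q5_k_m": 5.5, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(params_b: float, quant: str) -> float:
    """Approximate file size in GB for params_b billion parameters."""
    bytes_total = params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return round(bytes_total / 1e9, 2)

for quant in BITS_PER_WEIGHT:
    print(f"3.8B model at {quant}: ~{gguf_size_gb(3.8, quant)} GB")
```

A 3.8B model at Q4_K_M lands a little over 2 GB, which is why Phi-3 Mini quantized files are so small.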
Installation
# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build for CPU
# (recent llama.cpp releases build with CMake instead: cmake -B build && cmake --build build)
make
# Build with Metal (macOS)
make LLAMA_METAL=1
# Build with CUDA (NVIDIA)
make LLAMA_CUDA=1
# Build with OpenBLAS (faster CPU)
make LLAMA_OPENBLAS=1
Download GGUF Models
# Using huggingface-cli
pip install huggingface-hub
# Download Phi-3 GGUF
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf \
Phi-3-mini-4k-instruct-q4.gguf \
--local-dir ./models
# Download Qwen 2.5 GGUF
huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF \
qwen2.5-3b-instruct-q4_k_m.gguf \
--local-dir ./models
Running with llama.cpp
# Basic inference
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
-p "Explain quantum computing in simple terms:" \
-n 256
# Interactive mode
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
--interactive \
--color
# With specific parameters
./llama-cli -m models/Phi-3-mini-4k-instruct-q4.gguf \
-p "Write a Python function:" \
-n 512 \
--temp 0.7 \
--top-p 0.9 \
--repeat-penalty 1.1 \
-ngl 35 # GPU layers (if GPU available)
llama.cpp Server
# Start server
./llama-server -m models/Phi-3-mini-4k-instruct-q4.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35
# Server is now accessible at http://localhost:8080
# OpenAI-compatible API at http://localhost:8080/v1
Part 4: Python with llama-cpp-python
Installation
# CPU only
pip install llama-cpp-python
# With Metal support (macOS)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# With CUDA support
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# With OpenBLAS
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
Basic Usage
# llama_cpp_basic.py
"""
Using llama-cpp-python for local inference.
"""
from llama_cpp import Llama
def basic_inference():
"""Basic text generation."""
# Load model
llm = Llama(
model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
n_ctx=4096, # Context window
n_threads=8, # CPU threads
n_gpu_layers=35, # GPU layers (0 for CPU only)
verbose=False
)
# Generate text
output = llm(
"Explain what an API is in one paragraph:",
max_tokens=200,
temperature=0.7,
top_p=0.9,
echo=False # Don't include prompt in output
)
print(output['choices'][0]['text'])
def chat_completion():
"""Chat-style completion."""
llm = Llama(
model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="chatml" # Use appropriate chat format
)
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "How do I handle errors in Python?"}
]
response = llm.create_chat_completion(
messages=messages,
temperature=0.7,
max_tokens=500
)
print(response['choices'][0]['message']['content'])
def streaming_generation():
"""Stream tokens as they're generated."""
llm = Llama(
model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
n_ctx=4096,
n_gpu_layers=35
)
stream = llm(
"Write a short poem about Python programming:",
max_tokens=200,
stream=True
)
for token in stream:
print(token['choices'][0]['text'], end='', flush=True)
print()
def embeddings():
"""Generate embeddings with llama.cpp."""
llm = Llama(
model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
embedding=True, # Enable embedding mode
n_ctx=512
)
text = "This is a sample text for embedding."
embedding = llm.embed(text)
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
if __name__ == "__main__":
print("=== Basic Inference ===")
basic_inference()
print("\n=== Chat Completion ===")
chat_completion()
print("\n=== Streaming ===")
streaming_generation()
High-Performance Server
# llama_cpp_server.py
"""
FastAPI server for llama-cpp-python.
"""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
from llama_cpp import Llama
import uvicorn
app = FastAPI(title="Local SLM API")
# Load model at startup
llm = None
class Message(BaseModel):
role: str
content: str
class ChatRequest(BaseModel):
messages: List[Message]
temperature: float = 0.7
max_tokens: int = 500
stream: bool = False
class CompletionRequest(BaseModel):
prompt: str
temperature: float = 0.7
max_tokens: int = 500
@app.on_event("startup")  # deprecated in recent FastAPI; lifespan handlers are preferred
async def load_model():
global llm
llm = Llama(
model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="chatml"
)
print("Model loaded successfully!")
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
if llm is None:
raise HTTPException(status_code=503, detail="Model not loaded")
messages = [{"role": m.role, "content": m.content} for m in request.messages]
response = llm.create_chat_completion(
messages=messages,
temperature=request.temperature,
max_tokens=request.max_tokens
)
return response
@app.post("/v1/completions")
async def completions(request: CompletionRequest):
if llm is None:
raise HTTPException(status_code=503, detail="Model not loaded")
response = llm(
request.prompt,
temperature=request.temperature,
max_tokens=request.max_tokens
)
return response
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": llm is not None}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
Part 5: Model Format Conversion
Converting to GGUF
# convert_to_gguf.py
"""
Convert HuggingFace models to GGUF format.
Requires llama.cpp convert scripts.
"""
import subprocess
import os
def convert_hf_to_gguf(
model_name: str,
output_dir: str = "./models",
quantization: str = "q4_k_m"
):
"""
Convert a HuggingFace model to GGUF format.
Steps:
1. Download model from HuggingFace
2. Convert to GGUF format
3. Quantize to reduce size
"""
# Create output directory
os.makedirs(output_dir, exist_ok=True)
# Download model
print(f"Downloading {model_name}...")
subprocess.run([
"huggingface-cli", "download", model_name,
"--local-dir", f"{output_dir}/{model_name.split('/')[-1]}"
], check=True)
# Convert to GGUF (using llama.cpp convert script)
model_dir = f"{output_dir}/{model_name.split('/')[-1]}"
output_file = f"{output_dir}/{model_name.split('/')[-1]}.gguf"
print("Converting to GGUF...")
subprocess.run([
"python", "llama.cpp/convert_hf_to_gguf.py",
model_dir,
"--outfile", output_file,
"--outtype", "f16" # First convert to f16
], check=True)
# Quantize
quantized_file = f"{output_dir}/{model_name.split('/')[-1]}-{quantization}.gguf"
print(f"Quantizing to {quantization}...")
subprocess.run([
"./llama.cpp/llama-quantize",
output_file,
quantized_file,
quantization
], check=True)
print(f"Done! Model saved to {quantized_file}")
return quantized_file
# Quantization options:
# q2_k - Smallest, lowest quality
# q3_k_m - Small, low quality
# q4_0 - Medium, good balance
# q4_k_m - Medium, better quality (recommended)
# q5_k_m - Larger, high quality
# q6_k - Large, very high quality
# q8_0 - Largest quantized, near-original
# f16 - Half precision (no quantization)
Part 6: Performance Optimization
Benchmarking Script
# benchmark.py
"""
Benchmark local SLM performance.
"""
import time
from typing import Dict, List
import statistics
def benchmark_model(
llm,
prompts: List[str],
max_tokens: int = 100,
num_runs: int = 3
) -> Dict:
"""Benchmark model performance."""
results = {
"tokens_per_second": [],
"time_to_first_token": [],
"total_time": []
}
for prompt in prompts:
for _ in range(num_runs):
start_time = time.time()
first_token_time = None
token_count = 0
# Stream to measure time to first token
stream = llm(prompt, max_tokens=max_tokens, stream=True)
for token in stream:
if first_token_time is None:
first_token_time = time.time() - start_time
token_count += 1
total_time = time.time() - start_time
results["tokens_per_second"].append(token_count / total_time)
results["time_to_first_token"].append(first_token_time)
results["total_time"].append(total_time)
return {
"avg_tokens_per_second": statistics.mean(results["tokens_per_second"]),
"avg_time_to_first_token": statistics.mean(results["time_to_first_token"]),
"avg_total_time": statistics.mean(results["total_time"]),
"std_tokens_per_second": statistics.stdev(results["tokens_per_second"]) if len(results["tokens_per_second"]) > 1 else 0
}
def compare_models():
"""Compare different models and configurations."""
from llama_cpp import Llama
models = [
("Phi-3 Q4", "./models/phi-3-q4_k_m.gguf"),
("Qwen 2.5 Q4", "./models/qwen2.5-3b-q4_k_m.gguf"),
]
prompts = [
"Explain machine learning in simple terms:",
"Write a Python function to sort a list:",
"What is the capital of France?",
]
print("Model Benchmark Results")
print("=" * 60)
for model_name, model_path in models:
try:
llm = Llama(
model_path=model_path,
n_ctx=2048,
n_gpu_layers=35,
verbose=False
)
results = benchmark_model(llm, prompts)
print(f"\n{model_name}:")
print(f" Tokens/sec: {results['avg_tokens_per_second']:.2f} "
f"(± {results['std_tokens_per_second']:.2f})")
print(f" Time to first token: {results['avg_time_to_first_token']*1000:.0f}ms")
print(f" Avg generation time: {results['avg_total_time']:.2f}s")
except Exception as e:
print(f"\n{model_name}: Error - {e}")
if __name__ == "__main__":
compare_models()
Optimization Tips
# optimization_tips.py
"""
Tips for optimizing local SLM performance.
"""
from llama_cpp import Llama
def optimized_setup():
"""Optimized model configuration."""
llm = Llama(
model_path="./models/phi-3-q4_k_m.gguf",
# Context and batch size
n_ctx=2048, # Smaller context = faster
n_batch=512, # Batch size for prompt processing
# Threading
n_threads=8, # Match your CPU cores
n_threads_batch=8, # Threads for batch processing
# GPU acceleration
n_gpu_layers=35, # -1 for all layers on GPU
# Memory optimization
use_mmap=True, # Memory map the model
use_mlock=False, # Don't lock in RAM (saves memory)
# Disable unused features
embedding=False, # Disable if not using embeddings
verbose=False # Disable logging
)
return llm
# Key optimization strategies:
#
# 1. Use appropriate quantization
# - q4_k_m for best balance
# - q8_0 for quality-critical applications
# - q2_k for memory-constrained environments
#
# 2. Adjust context window
# - Smaller n_ctx = faster inference
# - Only use what you need
#
# 3. GPU acceleration
# - Use n_gpu_layers for GPU offloading
# - More layers on GPU = faster (if VRAM allows)
#
# 4. Batch processing
# - Process multiple prompts together
# - Use larger n_batch for throughput
#
# 5. Model selection
# - Smaller models are faster
# - Phi-3 and Qwen 2.5 are well-optimized
Testing Your Setup
# test_setup.py
"""
Test your local SLM setup.
"""
import subprocess
import sys
def test_ollama():
"""Test Ollama installation."""
try:
result = subprocess.run(
["ollama", "list"],
capture_output=True,
text=True
)
if result.returncode == 0:
print("✅ Ollama is installed and running")
print(f" Models: {result.stdout.strip() or 'None installed'}")
else:
print("❌ Ollama service not running")
print(" Run: ollama serve")
except FileNotFoundError:
print("❌ Ollama not installed")
print(" Install: brew install ollama (macOS)")
def test_python_packages():
"""Test Python packages."""
packages = ['ollama', 'llama_cpp', 'langchain_ollama', 'openai']
for package in packages:
try:
__import__(package)
print(f"✅ {package} installed")
except ImportError:
print(f"❌ {package} not installed")
def test_inference():
"""Test basic inference."""
try:
import ollama
response = ollama.generate(
model='phi3',
prompt='Say hello in one word.'
)
print(f"✅ Inference working: {response['response'].strip()}")
except Exception as e:
print(f"❌ Inference failed: {e}")
if __name__ == "__main__":
print("=== Local SLM Setup Test ===\n")
print("1. Checking Ollama...")
test_ollama()
print("\n2. Checking Python packages...")
test_python_packages()
print("\n3. Testing inference...")
test_inference()
print("\n=== Test Complete ===")
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Ollama | All-in-one tool for downloading and running models | Simplest setup - one command to run any model |
| llama.cpp | C++ inference engine with GGUF support | Maximum performance, GPU/CPU optimization |
| GGUF | File format for quantized models | Standard format that works with all tools |
| Quantization Level | Q2_K to Q8_0 - bits per weight | Lower = smaller/faster, higher = more accurate |
| n_ctx | Context window size (tokens model can see) | Larger = more context but more RAM |
| n_gpu_layers | Layers offloaded to GPU | More layers = faster (if VRAM allows) |
| Modelfile | Ollama configuration for custom models | Set system prompts, parameters, stop tokens |
| OpenAI Compatibility | Ollama serves OpenAI-compatible API | Drop-in replacement for cloud APIs |
| Temperature | Randomness in output (0.0-2.0) | Lower = deterministic, higher = creative |
| Streaming | Token-by-token output | Better UX - see responses as they generate |
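To make the n_ctx row concrete: most of the extra RAM goes to the KV cache. A rough sketch of the arithmetic, using the standard formula 2 (K and V) × layers × context × KV heads × head dim × bytes per value; the architecture numbers below are illustrative, not taken from any specific model card:

```python
# kv_cache.py — rough KV cache size estimate for one sequence.
def kv_cache_mb(n_layers: int, n_ctx: int, n_kv_heads: int, head_dim: int,
                bytes_per_val: int = 2) -> float:
    """Approximate KV cache size in MB (FP16 cache by default)."""
    total = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_val
    return round(total / 1e6, 1)

# Illustrative numbers (32 layers, 32 KV heads, head_dim 96, roughly a 3-4B dense model):
print(kv_cache_mb(32, 2048, 32, 96))
print(kv_cache_mb(32, 4096, 32, 96))  # doubling n_ctx doubles the cache
```

This is why shrinking n_ctx in the optimization tips above saves both memory and time: the cache grows linearly with context length.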
Next Steps
After setting up your local SLM environment:
- SLM for Text Tasks - Build practical NLP applications
- SLM Benchmarking - Evaluate and compare models
- SLM Fine-tuning - Customize models for your domain