Edge Deployment
Deploy small language models to mobile and edge devices
TL;DR
Deploy SLMs via ONNX (cross-platform), llama.cpp (efficient CPU inference with GGUF), WebLLM (browser via WebGPU), or CoreML (Apple Neural Engine). Key settings: use_mmap=True for memory efficiency, n_threads=physical_cores-1, and Q4_K_M quantization for best size/quality balance.
Deploy SLMs on resource-constrained edge devices including mobile phones, browsers, and IoT devices.
Project Overview
| Aspect | Details |
|---|---|
| Difficulty | Intermediate |
| Time | 6-8 hours |
| Prerequisites | Python, SLM basics, model formats |
| Learning Outcomes | ONNX export, mobile deployment, browser inference, optimization |
What You'll Learn
- Export models to ONNX and optimize for inference
- Deploy to iOS/macOS with CoreML
- Run models in browsers with WebLLM
- Use llama.cpp for efficient edge inference
- Optimize for memory-constrained devices
- Benchmark across different deployment targets
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ Edge Deployment Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SOURCE MODELS EXPORT FORMATS DEPLOYMENT TARGETS │
│ ┌─────────────────┐ │
│ │ HuggingFace │────┬───► ONNX Runtime ───┬───► Server (CPU/GPU) │
│ │ Model │ │ └───► Android │
│ └─────────────────┘ │ │
│ │ ├───► CoreML ─────────────► macOS/iOS │
│ │ │ │
│ │ └───► MLC-LLM/WebLLM ─────► Web Browser │
│ │ ▲ │
│ ▼ │ │
│ ┌─────────────────┐ │ │
│ │ GGUF Model │───────────────┤ │
│ │ │ │ │
│ └─────────────────┘ │ │
│ │ │ │
│ └───► llama.cpp ─────────┼───► Server (CPU) │
│ │ │
│ └───► Raspberry Pi │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ RECOMMENDED PATH: HuggingFace ──► GGUF ──► llama.cpp (most compatible) │
└─────────────────────────────────────────────────────────────────────────────┘
Project Setup
Dependencies
# Create project directory
mkdir slm-edge-deployment && cd slm-edge-deployment
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Core dependencies
pip install torch transformers accelerate
# ONNX Runtime
pip install onnx onnxruntime optimum[exporters]
# llama.cpp Python bindings
pip install llama-cpp-python
# For benchmarking
pip install psutil numpy pandas plotly
# Optional: CoreML tools (macOS only)
# pip install coremltools
# Optional: TensorFlow Lite
# pip install tensorflow tflite-model-maker
Model Selection for Edge
# models/edge_models.py
"""
Model recommendations for different edge targets.
"""
EDGE_MODEL_RECOMMENDATIONS = {
"raspberry_pi_4": {
"ram_gb": 4,
"recommended_models": [
{"name": "qwen2.5:0.5b-instruct-q4_K_M", "vram_mb": 400},
{"name": "smollm:135m-instruct-q8_0", "vram_mb": 150},
{"name": "tinyllama:1.1b-chat-v1.0-q2_K", "vram_mb": 500},
],
"notes": "Use smallest quantizations, expect 5-15 tok/s"
},
"raspberry_pi_5": {
"ram_gb": 8,
"recommended_models": [
{"name": "qwen2.5:1.5b-instruct-q4_K_M", "vram_mb": 1200},
{"name": "phi3:mini-4k-instruct-q4_K_M", "vram_mb": 2400},
{"name": "gemma2:2b-instruct-q4_K_M", "vram_mb": 1800},
],
"notes": "Better performance, 10-25 tok/s with Q4"
},
"iphone_15_pro": {
"ram_gb": 8,
"neural_engine": True,
"recommended_models": [
{"name": "phi3-mini-4k-CoreML", "vram_mb": 2000},
{"name": "gemma2-2b-CoreML", "vram_mb": 1500},
],
"notes": "Use CoreML for Neural Engine acceleration"
},
"android_flagship": {
"ram_gb": 12,
"recommended_models": [
{"name": "qwen2.5:3b-instruct-q4_K_M", "vram_mb": 2200},
{"name": "phi3:mini-4k-instruct-q4_K_M", "vram_mb": 2400},
],
"notes": "Use GGUF with Termux or dedicated apps"
},
"web_browser": {
"ram_gb": 4, # Available to browser
"webgpu": True,
"recommended_models": [
{"name": "Qwen2.5-0.5B-Instruct-q4f16_1-MLC", "vram_mb": 400},
{"name": "SmolLM-135M-Instruct-q4f16_1-MLC", "vram_mb": 150},
{"name": "Phi-3-mini-4k-instruct-q4f16_1-MLC", "vram_mb": 2000},
],
"notes": "Requires WebGPU-capable browser"
}
}
def get_model_recommendations(target: str, max_memory_mb: int = None) -> list:
"""Get model recommendations for a deployment target."""
if target not in EDGE_MODEL_RECOMMENDATIONS:
raise ValueError(f"Unknown target: {target}")
config = EDGE_MODEL_RECOMMENDATIONS[target]
models = config["recommended_models"]
if max_memory_mb:
models = [m for m in models if m["vram_mb"] <= max_memory_mb]
return models
def print_recommendations():
"""Print all recommendations in a readable format."""
for target, config in EDGE_MODEL_RECOMMENDATIONS.items():
print(f"\n{'='*50}")
print(f"Target: {target}")
print(f"RAM: {config['ram_gb']}GB")
print(f"Notes: {config['notes']}")
print("-" * 30)
for model in config["recommended_models"]:
print(f" - {model['name']} ({model['vram_mb']}MB)")
if __name__ == "__main__":
    print_recommendations()
Understanding Edge Model Selection:
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY MODEL SELECTION MATTERS FOR EDGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Device RAM ──► Available for Model ──► Model Size Limit ──► Quality │
│ │
│ Example: iPhone 15 Pro (8GB RAM) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Total RAM: 8GB │ │
│ │ ├── iOS System: ~2GB │ │
│ │ ├── App Overhead: ~500MB │ │
│ │ ├── Neural Engine Buffer: ~500MB │ │
│ │ └── Available for Model: ~5GB ──► Can run Phi-3 Mini (2.4GB Q4) │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Quantization Impact on Quality vs Size: │
│ ┌──────────┬───────────┬─────────────┬──────────────────────────────────┐ │
│ │ Format │ Size (2B) │ Quality │ Use Case │ │
│ ├──────────┼───────────┼─────────────┼──────────────────────────────────┤ │
│ │ Q2_K │ ~0.8GB │ Degraded │ Extreme memory constraints │ │
│ │ Q4_K_M │ ~1.2GB │ Near-FP16 │ Best balance (recommended) │ │
│ │ Q8_0 │ ~2.0GB │ Excellent │ When quality is critical │ │
│ └──────────┴───────────┴─────────────┴──────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Design Decisions:
| Decision | Reasoning |
|---|---|
| Organize by device | Different devices have different constraints (RAM, accelerators) |
| Include VRAM estimates | Lets you calculate if a model fits before downloading |
| Recommend Q4_K_M | Best size/quality trade-off - K-quant methods preserve quality better |
| Device-specific notes | Each platform has unique optimization paths (CoreML, WebGPU, etc.) |
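The "does it fit?" check implied by the VRAM column can be done before downloading anything. A minimal sketch, with hypothetical helper names (`estimate_model_mb`, `fits`) and approximate bits-per-weight figures — GGUF K-quants store scales alongside weights, so effective bits are higher than the nominal quantization width:

```python
# Approximate effective bits per weight for common GGUF formats.
# These are rough figures for estimation, not exact spec values.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def estimate_model_mb(params_b: float, quant: str, overhead: float = 1.1) -> float:
    """Estimate model size in MB: parameter count (billions) x bits / 8,
    plus ~10% overhead for embeddings, metadata, and KV-cache headroom."""
    bits = BITS_PER_WEIGHT[quant]
    return params_b * 1e9 * bits / 8 / 1e6 * overhead

def fits(params_b: float, quant: str, available_mb: int) -> bool:
    """Check whether a quantized model fits in the available memory budget."""
    return estimate_model_mb(params_b, quant) <= available_mb

# Example: a 2B model at Q4_K_M lands near the ~1.2GB figure in the table above.
print(f"2B @ Q4_K_M ≈ {estimate_model_mb(2.0, 'Q4_K_M'):.0f} MB")
```

Run this against your device's available RAM (not total RAM) to pre-filter the recommendation lists above.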
Part 1: ONNX Export and Optimization
ONNX (Open Neural Network Exchange) provides cross-platform model deployment.
# export/onnx_export.py
"""
Export HuggingFace models to ONNX format with optimizations.
"""
import os
import time
from pathlib import Path
from typing import Optional
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.onnxruntime import ORTModelForCausalLM
import onnxruntime as ort
class ONNXExporter:
"""Export and optimize models for ONNX Runtime."""
def __init__(self, model_name: str, output_dir: str = "./onnx_models"):
self.model_name = model_name
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
def export_model(
self,
optimize: bool = True,
quantize: bool = False,
device: str = "cpu"
) -> Path:
"""
Export model to ONNX format.
Args:
optimize: Apply ONNX graph optimizations
quantize: Apply INT8 quantization
device: Target device (cpu/cuda)
Returns:
Path to exported model
"""
export_path = self.output_dir / self.model_name.replace("/", "_")
print(f"Exporting {self.model_name} to ONNX...")
# Use optimum for export
model = ORTModelForCausalLM.from_pretrained(
self.model_name,
export=True,
provider="CPUExecutionProvider" if device == "cpu" else "CUDAExecutionProvider"
)
# Save the exported model
model.save_pretrained(export_path)
# Also save tokenizer
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
tokenizer.save_pretrained(export_path)
print(f"Model exported to: {export_path}")
# Apply optimizations
if optimize:
self._optimize_model(export_path)
# Apply quantization
if quantize:
self._quantize_model(export_path)
return export_path
def _optimize_model(self, model_path: Path):
"""Apply ONNX graph optimizations."""
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig
optimizer = ORTOptimizer.from_pretrained(model_path)
optimization_config = OptimizationConfig(
optimization_level=99, # Maximum optimization
enable_transformers_specific_optimizations=True,
fp16=False, # Keep FP32 for CPU compatibility
)
optimized_path = model_path / "optimized"
optimizer.optimize(
save_dir=optimized_path,
optimization_config=optimization_config
)
print(f"Optimized model saved to: {optimized_path}")
def _quantize_model(self, model_path: Path):
"""Apply dynamic INT8 quantization."""
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
quantizer = ORTQuantizer.from_pretrained(model_path)
quantization_config = AutoQuantizationConfig.avx512_vnni(
is_static=False,
per_channel=True
)
quantized_path = model_path / "quantized"
quantizer.quantize(
save_dir=quantized_path,
quantization_config=quantization_config
)
print(f"Quantized model saved to: {quantized_path}")
class ONNXInference:
"""Run inference with ONNX models."""
def __init__(self, model_path: str, use_gpu: bool = False):
self.model_path = Path(model_path)
# Set up execution providers
providers = ["CPUExecutionProvider"]
if use_gpu:
providers.insert(0, "CUDAExecutionProvider")
# Load model
self.model = ORTModelForCausalLM.from_pretrained(
model_path,
provider=providers[0]
)
self.tokenizer = AutoTokenizer.from_pretrained(model_path)
# Set pad token if needed
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
def generate(
self,
prompt: str,
max_new_tokens: int = 100,
temperature: float = 0.7,
top_p: float = 0.9
) -> tuple[str, dict]:
"""
Generate text with ONNX model.
Returns:
Tuple of (generated_text, metrics)
"""
# Tokenize
inputs = self.tokenizer(prompt, return_tensors="pt")
input_length = inputs["input_ids"].shape[1]
# Generate
start_time = time.time()
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=temperature > 0,
pad_token_id=self.tokenizer.pad_token_id
)
generation_time = time.time() - start_time
# Decode
generated_tokens = outputs[0][input_length:]
generated_text = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
# Calculate metrics
tokens_generated = len(generated_tokens)
metrics = {
"tokens_generated": tokens_generated,
"generation_time_s": generation_time,
"tokens_per_second": tokens_generated / generation_time if generation_time > 0 else 0
}
return generated_text, metrics
def benchmark(self, prompts: list[str], num_runs: int = 3) -> dict:
"""Run benchmark across multiple prompts."""
all_metrics = []
for prompt in prompts:
prompt_metrics = []
for _ in range(num_runs):
_, metrics = self.generate(prompt, max_new_tokens=50)
prompt_metrics.append(metrics)
# Average metrics for this prompt
avg_metrics = {
"tokens_per_second": np.mean([m["tokens_per_second"] for m in prompt_metrics]),
"generation_time_s": np.mean([m["generation_time_s"] for m in prompt_metrics])
}
all_metrics.append(avg_metrics)
return {
"avg_tokens_per_second": np.mean([m["tokens_per_second"] for m in all_metrics]),
"avg_generation_time_s": np.mean([m["generation_time_s"] for m in all_metrics]),
"num_prompts": len(prompts),
"num_runs": num_runs
}
# Example usage
if __name__ == "__main__":
# Export a small model
exporter = ONNXExporter("HuggingFaceTB/SmolLM-135M-Instruct")
model_path = exporter.export_model(optimize=True, quantize=False)
# Run inference
inference = ONNXInference(str(model_path))
prompt = "Explain edge computing in one sentence:"
response, metrics = inference.generate(prompt)
print(f"\nPrompt: {prompt}")
print(f"Response: {response}")
    print(f"Speed: {metrics['tokens_per_second']:.1f} tokens/sec")
★ Insight ─────────────────────────────────────
ONNX Export Strategy: Optimum's ORTModelForCausalLM handles the complex ONNX export process including attention mask handling, KV-cache management, and decoder architecture. The optimization levels (0-99) apply increasingly aggressive graph transformations. Level 99 enables all optimizations including operator fusion.
─────────────────────────────────────────────────
Understanding ONNX Export and Inference:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ONNX EXPORT PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ HuggingFace Model ──► ONNX Graph ──► Optimizations ──► Quantization │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ PyTorch │ │ Operators │ │ Fused Ops │ │ INT8 │ │
│ │ Weights │ │ as Nodes │ │ Optimized │ │ Weights │ │
│ │ + Config │ │ + Edges │ │ Graph │ │ Smaller │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Why Each Step Matters: │
│ • Export: Converts PyTorch dynamic graph to static ONNX format │
│ • Optimize: Fuses operations (LayerNorm + Add → single kernel) │
│ • Quantize: Reduces memory 4x with minimal quality loss │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
ONNX Runtime Execution Providers:
| Provider | Best For | Performance | Setup Complexity |
|---|---|---|---|
| CPUExecutionProvider | Any CPU | Baseline | None |
| CUDAExecutionProvider | NVIDIA GPU | 10-50x faster | CUDA + cuDNN |
| CoreMLExecutionProvider | Apple Neural Engine | 5-20x faster | macOS only |
| QNNExecutionProvider | Qualcomm NPU | 10-30x faster | Android + Snapdragon |
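ONNX Runtime accepts a prioritized provider list and falls back down it at runtime, so a portable app can request accelerators and still run anywhere. A sketch of that selection logic — the function name `choose_providers` is our own; with onnxruntime installed you would feed it `ort.get_available_providers()`:

```python
def choose_providers(available: list[str]) -> list[str]:
    """Order execution providers by preference, always keeping a CPU fallback.

    Real usage (assumes onnxruntime is installed):
        import onnxruntime as ort
        providers = choose_providers(ort.get_available_providers())
        session = ort.InferenceSession("model.onnx", providers=providers)
    """
    preference = [
        "CUDAExecutionProvider",    # NVIDIA GPU
        "CoreMLExecutionProvider",  # Apple Neural Engine
        "QNNExecutionProvider",     # Qualcomm NPU
        "CPUExecutionProvider",     # universal fallback
    ]
    chosen = [p for p in preference if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen
```

Listing CPU last means a missing CUDA or CoreML runtime degrades gracefully instead of failing at session creation.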
Part 2: llama.cpp for Edge Devices
llama.cpp provides highly optimized CPU inference, perfect for edge devices.
# inference/llamacpp_edge.py
"""
Edge deployment using llama.cpp for efficient CPU inference.
"""
import os
import time
import psutil
from pathlib import Path
from typing import Optional, Generator
from dataclasses import dataclass
from llama_cpp import Llama
@dataclass
class EdgeConfig:
"""Configuration for edge deployment."""
n_ctx: int = 2048 # Context window
n_threads: int = 4 # CPU threads
n_batch: int = 512 # Batch size for prompt processing
n_gpu_layers: int = 0 # GPU layers (0 for pure CPU)
use_mlock: bool = False # Lock model in memory (requires permissions)
use_mmap: bool = True # Memory-mapped loading
verbose: bool = False
class EdgeLLM:
"""
Optimized LLM inference for edge devices.
Designed for:
- Raspberry Pi 4/5
- Low-power laptops
- Android devices (via Termux)
- IoT gateways
"""
def __init__(
self,
model_path: str,
config: EdgeConfig = None
):
self.model_path = model_path
self.config = config or EdgeConfig()
# Detect available resources
self._detect_resources()
# Adjust config based on resources
self._optimize_for_device()
# Load model
self.model = self._load_model()
def _detect_resources(self):
"""Detect available system resources."""
self.total_ram_gb = psutil.virtual_memory().total / (1024**3)
self.available_ram_gb = psutil.virtual_memory().available / (1024**3)
self.cpu_count = psutil.cpu_count(logical=True)
self.physical_cores = psutil.cpu_count(logical=False)
print(f"System Resources:")
print(f" Total RAM: {self.total_ram_gb:.1f} GB")
print(f" Available RAM: {self.available_ram_gb:.1f} GB")
print(f" CPU Cores: {self.physical_cores} physical, {self.cpu_count} logical")
def _optimize_for_device(self):
"""Adjust configuration based on detected resources."""
# Use physical cores for better performance
if self.config.n_threads > self.physical_cores:
self.config.n_threads = max(1, self.physical_cores - 1)
# Reduce context for low memory devices
if self.available_ram_gb < 2:
self.config.n_ctx = min(self.config.n_ctx, 1024)
self.config.n_batch = min(self.config.n_batch, 256)
# Enable mlock only if we have enough RAM
if self.available_ram_gb > 4:
self.config.use_mlock = True
def _load_model(self) -> Llama:
"""Load model with optimized settings."""
print(f"\nLoading model: {self.model_path}")
print(f"Config: {self.config}")
start_time = time.time()
model = Llama(
model_path=self.model_path,
n_ctx=self.config.n_ctx,
n_threads=self.config.n_threads,
n_batch=self.config.n_batch,
n_gpu_layers=self.config.n_gpu_layers,
use_mlock=self.config.use_mlock,
use_mmap=self.config.use_mmap,
verbose=self.config.verbose
)
load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f}s")
return model
def generate(
self,
prompt: str,
max_tokens: int = 100,
temperature: float = 0.7,
top_p: float = 0.9,
stop: list[str] = None
) -> tuple[str, dict]:
"""
Generate text response.
Returns:
Tuple of (response_text, metrics)
"""
# Monitor memory before generation
mem_before = psutil.Process().memory_info().rss / (1024**2)
start_time = time.time()
output = self.model(
prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stop=stop or [],
echo=False
)
generation_time = time.time() - start_time
# Extract response
response = output["choices"][0]["text"]
# Calculate metrics
prompt_tokens = output["usage"]["prompt_tokens"]
completion_tokens = output["usage"]["completion_tokens"]
mem_after = psutil.Process().memory_info().rss / (1024**2)
metrics = {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens,
"generation_time_s": generation_time,
"tokens_per_second": completion_tokens / generation_time if generation_time > 0 else 0,
            "avg_time_per_token_s": generation_time / max(completion_tokens, 1),  # average per-token latency; true TTFT requires streaming
"memory_used_mb": mem_after - mem_before,
"peak_memory_mb": mem_after
}
return response, metrics
def generate_stream(
self,
prompt: str,
max_tokens: int = 100,
temperature: float = 0.7,
top_p: float = 0.9
) -> Generator[str, None, None]:
"""Stream tokens as they're generated."""
for output in self.model(
prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stream=True
):
token = output["choices"][0]["text"]
yield token
def chat(
self,
messages: list[dict],
max_tokens: int = 100,
temperature: float = 0.7
) -> tuple[str, dict]:
"""Chat completion with message history."""
# Format messages into prompt
prompt = self._format_chat(messages)
return self.generate(
prompt,
max_tokens=max_tokens,
temperature=temperature,
stop=["<|endoftext|>", "<|im_end|>", "</s>"]
)
def _format_chat(self, messages: list[dict]) -> str:
"""Format chat messages for the model."""
# ChatML format (works with most models)
formatted = ""
for msg in messages:
role = msg["role"]
content = msg["content"]
formatted += f"<|im_start|>{role}\n{content}<|im_end|>\n"
formatted += "<|im_start|>assistant\n"
return formatted
class EdgeBenchmark:
"""Benchmark edge model performance."""
def __init__(self, model: EdgeLLM):
self.model = model
def run_benchmark(
self,
prompts: list[str] = None,
max_tokens: int = 50,
num_runs: int = 3
) -> dict:
"""Run comprehensive benchmark."""
if prompts is None:
prompts = [
"What is machine learning?",
"Write a Python function to sort a list.",
"Explain quantum computing simply.",
"What are the benefits of edge computing?"
]
results = {
"prompt_results": [],
"system_info": self._get_system_info()
}
print(f"\nRunning benchmark with {len(prompts)} prompts, {num_runs} runs each...")
for i, prompt in enumerate(prompts):
prompt_metrics = []
for run in range(num_runs):
_, metrics = self.model.generate(
prompt,
max_tokens=max_tokens,
temperature=0.1 # Low temp for consistency
)
prompt_metrics.append(metrics)
# Calculate averages
avg_metrics = {
"prompt": prompt[:50] + "...",
"avg_tokens_per_second": sum(m["tokens_per_second"] for m in prompt_metrics) / num_runs,
"avg_generation_time_s": sum(m["generation_time_s"] for m in prompt_metrics) / num_runs,
"avg_memory_mb": sum(m["peak_memory_mb"] for m in prompt_metrics) / num_runs
}
results["prompt_results"].append(avg_metrics)
print(f" [{i+1}/{len(prompts)}] {avg_metrics['avg_tokens_per_second']:.1f} tok/s")
# Calculate overall statistics
all_tps = [r["avg_tokens_per_second"] for r in results["prompt_results"]]
results["summary"] = {
"avg_tokens_per_second": sum(all_tps) / len(all_tps),
"min_tokens_per_second": min(all_tps),
"max_tokens_per_second": max(all_tps),
"total_prompts": len(prompts),
"runs_per_prompt": num_runs
}
return results
def _get_system_info(self) -> dict:
"""Get system information."""
import platform
return {
"platform": platform.system(),
"processor": platform.processor(),
"python_version": platform.python_version(),
"total_ram_gb": psutil.virtual_memory().total / (1024**3),
"cpu_count": psutil.cpu_count()
}
# Example usage
if __name__ == "__main__":
# Download a small GGUF model first:
# ollama pull qwen2.5:0.5b-instruct-q4_K_M
# Then find the model path in ~/.ollama/models/
# Or download directly:
# wget https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf
model_path = "qwen2.5-0.5b-instruct-q4_k_m.gguf"
# Configure for edge device
config = EdgeConfig(
n_ctx=2048,
n_threads=4, # Adjust based on your device
n_batch=256, # Smaller batch for memory efficiency
n_gpu_layers=0, # Pure CPU
use_mlock=False, # Disable for low-memory devices
)
# Initialize model
llm = EdgeLLM(model_path, config)
# Simple generation
prompt = "Explain edge AI in one sentence:"
response, metrics = llm.generate(prompt)
print(f"\nPrompt: {prompt}")
print(f"Response: {response}")
print(f"\nMetrics:")
print(f" Tokens/sec: {metrics['tokens_per_second']:.1f}")
print(f" Memory: {metrics['peak_memory_mb']:.1f} MB")
# Run benchmark
benchmark = EdgeBenchmark(llm)
results = benchmark.run_benchmark()
print(f"\n{'='*50}")
print("Benchmark Summary:")
print(f" Average: {results['summary']['avg_tokens_per_second']:.1f} tok/s")
    print(f" Range: {results['summary']['min_tokens_per_second']:.1f} - {results['summary']['max_tokens_per_second']:.1f} tok/s")
Understanding llama.cpp Configuration:
┌─────────────────────────────────────────────────────────────────────────────┐
│ llama.cpp MEMORY AND THREADING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Memory Options (EdgeConfig): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ use_mmap=True (Memory-Mapped) use_mmap=False (Full Load) │ │
│ │ ┌─────────────────────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Disk ◄──────► RAM (partial) │ │ Disk ──► RAM (full) │ │ │
│ │ │ • Fast startup │ │ • Slower startup │ │ │
│ │ │ • Uses disk I/O during run │ │ • Faster inference │ │ │
│ │ │ • Low memory devices ✓ │ │ • High memory devices ✓ │ │ │
│ │ └─────────────────────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Threading (n_threads): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Device │ Physical Cores │ Recommended n_threads │ │
│ │ ────────────────┼────────────────┼──────────────────────────────── │ │
│ │ Raspberry Pi 4 │ 4 │ 3 (leave 1 for OS) │ │
│ │ M1 Mac │ 8 (4P + 4E) │ 4 (use P-cores only) │ │
│ │ Intel i7 │ 8 │ 7 (leave 1 for OS) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ IMPORTANT: Using logical cores (hyperthreads) hurts performance! │
│ Always use: n_threads = physical_cores - 1 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
EdgeConfig Parameters Explained:
| Parameter | Default | Purpose | Tuning Guidance |
|---|---|---|---|
| n_ctx | 2048 | Context window size | Reduce for low memory (1024 minimum) |
| n_threads | 4 | CPU threads for compute | Set to physical_cores - 1 |
| n_batch | 512 | Tokens processed per batch | Reduce to 256 for low memory |
| n_gpu_layers | 0 | Layers offloaded to GPU | Keep 0 for pure CPU edge devices |
| use_mlock | False | Lock model in RAM | Enable only with 4GB+ available |
| use_mmap | True | Memory-mapped file access | Keep True for edge devices |
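The tuning guidance in this table is the same logic `_optimize_for_device` applies above. Pulled out as a pure function (the name `tune_config` is ours, for illustration), it is easy to unit-test against the recommendations:

```python
def tune_config(physical_cores: int, available_ram_gb: float) -> dict:
    """Derive llama.cpp settings from detected resources, mirroring the
    table above: n_threads = physical_cores - 1, smaller context/batch
    under 2GB RAM, mlock only with comfortable headroom."""
    cfg = {
        "n_ctx": 2048,
        "n_batch": 512,
        "n_threads": max(1, physical_cores - 1),  # leave one core for the OS
        "n_gpu_layers": 0,                        # pure CPU on edge devices
        "use_mmap": True,                         # page weights in on demand
        "use_mlock": available_ram_gb > 4,        # pin only with 4GB+ free
    }
    if available_ram_gb < 2:  # e.g. Raspberry Pi 4 under load
        cfg["n_ctx"] = 1024
        cfg["n_batch"] = 256
    return cfg
```

For example, a Raspberry Pi 4 (4 physical cores, ~1.5GB free) gets 3 threads, a 1024-token context, and no mlock.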
★ Insight ─────────────────────────────────────
Memory-Mapped Loading: The use_mmap=True option is crucial for edge devices. Instead of loading the entire model into RAM, it maps the file directly from disk, allowing the OS to page in model weights as needed. This dramatically reduces startup memory requirements but may slightly increase inference latency on first access.
─────────────────────────────────────────────────
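The paging behavior described in the insight is standard OS memory mapping, not llama.cpp magic, and can be illustrated with Python's stdlib `mmap` on a dummy file: the mapping is created instantly, and only the pages actually touched are brought into RAM.

```python
import mmap
import os
import tempfile

# Create a dummy 1 MiB "weights" file to stand in for a GGUF model.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (1 << 20))
    path = f.name

with open(path, "rb") as f:
    # Map the file instead of reading it: near-instant, regardless of size.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_byte = mm[0]           # touching a byte pages in only that region
    last_byte = mm[len(mm) - 1]  # untouched middle pages never hit RAM
    mm.close()

os.remove(path)
```

This is why `use_mmap=True` makes llama.cpp startup cheap on low-memory devices, at the cost of disk I/O the first time each region of the model is accessed.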
Part 3: WebLLM for Browser Deployment
Run SLMs directly in web browsers using WebGPU.
// web/webllm-demo.ts
/**
* WebLLM Browser Deployment
*
* This demonstrates running SLMs in the browser using WebGPU.
* The model runs entirely client-side - no server needed!
*/
import * as webllm from "@mlc-ai/web-llm";
interface GenerationConfig {
maxTokens: number;
temperature: number;
topP: number;
}
interface ChatMessage {
role: "system" | "user" | "assistant";
content: string;
}
class BrowserLLM {
private engine: webllm.MLCEngine | null = null;
private modelId: string;
private initProgress: ((progress: string) => void) | null = null;
// Recommended models for browser deployment
static RECOMMENDED_MODELS = {
tiny: "SmolLM-135M-Instruct-q4f16_1-MLC", // ~100MB, fastest
small: "Qwen2.5-0.5B-Instruct-q4f16_1-MLC", // ~350MB, good balance
medium: "Phi-3-mini-4k-instruct-q4f16_1-MLC", // ~2GB, best quality
};
constructor(
modelId: string = BrowserLLM.RECOMMENDED_MODELS.small,
onProgress?: (progress: string) => void
) {
this.modelId = modelId;
this.initProgress = onProgress || null;
}
async initialize(): Promise<void> {
console.log(`Initializing WebLLM with model: ${this.modelId}`);
// Check WebGPU support
if (!navigator.gpu) {
throw new Error(
"WebGPU not supported. Please use Chrome 113+, Edge 113+, or Firefox Nightly."
);
}
// Initialize engine with progress callback
this.engine = await webllm.CreateMLCEngine(this.modelId, {
initProgressCallback: (progress) => {
const message = `Loading: ${progress.text}`;
console.log(message);
if (this.initProgress) {
this.initProgress(message);
}
},
});
console.log("Model loaded successfully!");
}
async generate(
prompt: string,
config: Partial<GenerationConfig> = {}
): Promise<string> {
if (!this.engine) {
throw new Error("Engine not initialized. Call initialize() first.");
}
const fullConfig: GenerationConfig = {
maxTokens: config.maxTokens || 100,
temperature: config.temperature || 0.7,
topP: config.topP || 0.9,
};
const response = await this.engine.chat.completions.create({
messages: [{ role: "user", content: prompt }],
max_tokens: fullConfig.maxTokens,
temperature: fullConfig.temperature,
top_p: fullConfig.topP,
});
return response.choices[0].message.content || "";
}
async chat(
messages: ChatMessage[],
config: Partial<GenerationConfig> = {}
): Promise<string> {
if (!this.engine) {
throw new Error("Engine not initialized. Call initialize() first.");
}
const response = await this.engine.chat.completions.create({
messages: messages,
max_tokens: config.maxTokens || 100,
temperature: config.temperature || 0.7,
top_p: config.topP || 0.9,
});
return response.choices[0].message.content || "";
}
async *generateStream(
prompt: string,
config: Partial<GenerationConfig> = {}
): AsyncGenerator<string> {
if (!this.engine) {
throw new Error("Engine not initialized. Call initialize() first.");
}
const stream = await this.engine.chat.completions.create({
messages: [{ role: "user", content: prompt }],
max_tokens: config.maxTokens || 100,
temperature: config.temperature || 0.7,
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || "";
if (content) {
yield content;
}
}
}
async getStats(): Promise<object> {
if (!this.engine) {
return { status: "not initialized" };
}
return await this.engine.runtimeStatsText();
}
async unload(): Promise<void> {
if (this.engine) {
await this.engine.unload();
this.engine = null;
}
}
}
// HTML/React component example
const WebLLMDemo = `
<!DOCTYPE html>
<html>
<head>
<title>WebLLM Edge Demo</title>
<script type="module">
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
let engine = null;
async function initModel() {
const status = document.getElementById("status");
const modelSelect = document.getElementById("model");
const modelId = modelSelect.value;
status.textContent = "Checking WebGPU support...";
if (!navigator.gpu) {
status.textContent = "WebGPU not supported! Use Chrome 113+ or Edge 113+";
return;
}
status.textContent = "Loading model (this may take a few minutes)...";
try {
engine = await webllm.CreateMLCEngine(modelId, {
initProgressCallback: (progress) => {
status.textContent = progress.text;
}
});
status.textContent = "Model ready!";
document.getElementById("generate-btn").disabled = false;
} catch (error) {
status.textContent = "Error: " + error.message;
}
}
async function generate() {
if (!engine) return;
const prompt = document.getElementById("prompt").value;
const output = document.getElementById("output");
const stats = document.getElementById("stats");
output.textContent = "";
const startTime = performance.now();
let tokenCount = 0;
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: prompt }],
max_tokens: 100,
temperature: 0.7,
stream: true
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || "";
output.textContent += content;
tokenCount++;
}
const elapsed = (performance.now() - startTime) / 1000;
stats.textContent = \`Generated \${tokenCount} tokens in \${elapsed.toFixed(2)}s (\${(tokenCount/elapsed).toFixed(1)} tok/s)\`;
}
window.initModel = initModel;
window.generate = generate;
</script>
<style>
body { font-family: system-ui; max-width: 800px; margin: 40px auto; padding: 20px; }
select, textarea, button { width: 100%; padding: 10px; margin: 10px 0; }
#output { background: #f5f5f5; padding: 15px; min-height: 100px; white-space: pre-wrap; }
#status { color: #666; }
#stats { color: #0066cc; font-size: 14px; }
</style>
</head>
<body>
<h1>WebLLM Edge Demo</h1>
<p>Run language models directly in your browser using WebGPU!</p>
<label>Model:</label>
<select id="model">
<option value="SmolLM-135M-Instruct-q4f16_1-MLC">SmolLM 135M (Tiny, ~100MB)</option>
<option value="Qwen2.5-0.5B-Instruct-q4f16_1-MLC" selected>Qwen2.5 0.5B (Small, ~350MB)</option>
<option value="Phi-3-mini-4k-instruct-q4f16_1-MLC">Phi-3 Mini (Medium, ~2GB)</option>
</select>
<button onclick="initModel()">Load Model</button>
<p id="status">Click "Load Model" to start</p>
<label>Prompt:</label>
<textarea id="prompt" rows="3">Explain edge computing in simple terms:</textarea>
<button id="generate-btn" onclick="generate()" disabled>Generate</button>
<h3>Output:</h3>
<div id="output"></div>
<p id="stats"></p>
</body>
</html>
`;
export { BrowserLLM, WebLLMDemo };
Understanding WebLLM Browser Deployment:
┌─────────────────────────────────────────────────────────────────────────────┐
│ HOW WEBLLM RUNS MODELS IN THE BROWSER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Server-Based: WebLLM Client-Side: │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Browser ──► Server ──► GPU │ │ Browser (WebGPU) │ │
│ │ ↑ │ │ │ ┌─────────────────────────────┐ │ │
│ │ └───────────┘ │ │ │ Model runs entirely here │ │ │
│ │ (Network latency, privacy?) │ │ │ • Zero server calls │ │ │
│ └─────────────────────────────┘ │ │ • Data never leaves device │ │ │
│ │ └─────────────────────────────┘ │ │
│ └─────────────────────────────────┘ │
│ │
│ WebGPU Pipeline: │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Download │───►│ Compile │───►│ Load to │───►│ Run │ │
│ │ WASM + │ │ Shaders │ │ GPU VRAM │ │ Inference │ │
│ │ Weights │ │ (cached) │ │ │ │ │ │
│ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ First Load: Subsequent: │
│ ~2-5 min ~10-30 sec │
│ (one-time) (from cache) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
WebLLM Model Size vs Browser Constraints:
| Model | Download Size | VRAM Needed | Browser Tab Limit | Suitable? |
|---|---|---|---|---|
| SmolLM-135M | ~100MB | ~200MB | 4GB+ | Excellent |
| Qwen2.5-0.5B | ~350MB | ~500MB | 4GB+ | Good |
| Phi-3-mini | ~2GB | ~3GB | 8GB+ | Marginal |
| Llama-3.2-3B | ~2.5GB | ~4GB | 8GB+ | Risky |
Browser Compatibility:
- Chrome 113+, Edge 113+ → Full WebGPU support
- Firefox Nightly → Experimental WebGPU
- Safari 18+ → WebGPU (limited)
- Mobile browsers → Generally not supported yet
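As a rough planning aid, the table above can be encoded as a quick suitability check. This is a sketch: the sizes are copied from the table, and the 1.5× headroom factor is an assumption to cover KV cache and shader buffers, not a measured value.

```python
# Rough client-side suitability check based on the table above.
# VRAM figures are the approximate values listed there; treat them
# as assumptions -- actual usage varies by browser and driver.
MODELS = {
    "SmolLM-135M": {"download_mb": 100, "vram_mb": 200},
    "Qwen2.5-0.5B": {"download_mb": 350, "vram_mb": 500},
    "Phi-3-mini": {"download_mb": 2048, "vram_mb": 3072},
    "Llama-3.2-3B": {"download_mb": 2560, "vram_mb": 4096},
}

def suitable_models(available_vram_mb: int, headroom: float = 1.5) -> list[str]:
    """Return models whose estimated VRAM (with headroom for KV cache
    and shader buffers) fits within the given budget."""
    return [
        name for name, spec in MODELS.items()
        if spec["vram_mb"] * headroom <= available_vram_mb
    ]

print(suitable_models(4096))  # → ['SmolLM-135M', 'Qwen2.5-0.5B']
```

In a real page you would feed this from `navigator.gpu` adapter limits; here the budget is passed in directly.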
# web/serve_webllm.py
"""
Simple server to host the WebLLM demo.
"""
from http.server import HTTPServer, SimpleHTTPRequestHandler
class CORSHandler(SimpleHTTPRequestHandler):
"""Handler with CORS support for WebLLM."""
def end_headers(self):
# Required headers for SharedArrayBuffer (needed by WebLLM)
self.send_header('Cross-Origin-Opener-Policy', 'same-origin')
self.send_header('Cross-Origin-Embedder-Policy', 'require-corp')
self.send_header('Access-Control-Allow-Origin', '*')
super().end_headers()
def serve(port: int = 8080):
"""Start the server."""
server = HTTPServer(('localhost', port), CORSHandler)
print(f"Serving WebLLM demo at http://localhost:{port}")
print("Press Ctrl+C to stop")
server.serve_forever()
if __name__ == "__main__":
serve()
★ Insight ─────────────────────────────────────
WebGPU Requirements: WebLLM requires specific CORS headers (Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp) for SharedArrayBuffer support. Without these, the browser can't allocate the shared memory needed for GPU operations. Always test locally with a proper server, not by opening HTML files directly.
─────────────────────────────────────────────────
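The header requirement above can be verified locally with a short stdlib-only check. This sketch mirrors the `CORSHandler` from `serve_webllm.py`; binding to port 0 (so the OS picks a free port) is an assumption for the demo, not part of the original server.

```python
# Verify that a local server emits the COOP/COEP headers WebLLM needs
# for SharedArrayBuffer. Stdlib only; port 0 lets the OS pick a port.
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler

class CORSHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header('Cross-Origin-Opener-Policy', 'same-origin')
        self.send_header('Cross-Origin-Embedder-Policy', 'require-corp')
        super().end_headers()

server = HTTPServer(('localhost', 0), CORSHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

resp = urllib.request.urlopen(f"http://localhost:{server.server_port}/")
assert resp.headers['Cross-Origin-Opener-Policy'] == 'same-origin'
assert resp.headers['Cross-Origin-Embedder-Policy'] == 'require-corp'
server.shutdown()
print("cross-origin isolation headers present")
```

If either assertion fails in your deployment, the browser will report `crossOriginIsolated === false` and WebLLM initialization will fail.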
Part 4: CoreML for Apple Devices
Export models to CoreML for optimized inference on Apple Neural Engine.
# export/coreml_export.py
"""
Export models to CoreML for iOS/macOS deployment.
Note: Requires macOS with coremltools installed.
"""
from pathlib import Path
import torch
import numpy as np
def check_coreml_available() -> bool:
"""Check if CoreML tools are available."""
try:
import coremltools
return True
except ImportError:
return False
class CoreMLExporter:
"""
Export HuggingFace models to CoreML format.
CoreML models can leverage Apple's Neural Engine for
efficient on-device inference on iOS/macOS.
"""
def __init__(self, model_name: str, output_dir: str = "./coreml_models"):
if not check_coreml_available():
raise ImportError(
"coremltools not installed. Install with: pip install coremltools"
)
self.model_name = model_name
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
def export_model(
self,
sequence_length: int = 512,
compute_units: str = "ALL" # ALL, CPU_ONLY, CPU_AND_GPU, CPU_AND_NE
) -> Path:
"""
Export model to CoreML format.
Args:
sequence_length: Fixed sequence length for the model
compute_units: Target compute units
Returns:
Path to exported .mlpackage
"""
import coremltools as ct
from transformers import AutoTokenizer, AutoModelForCausalLM
print(f"Loading model: {self.model_name}")
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
model = AutoModelForCausalLM.from_pretrained(
self.model_name,
torch_dtype=torch.float32,
trust_remote_code=True
)
model.eval()
# Create sample input
sample_text = "Hello, how are you?"
inputs = tokenizer(
sample_text,
return_tensors="pt",
max_length=sequence_length,
padding="max_length",
truncation=True
)
# Trace the model
print("Tracing model...")
class ModelWrapper(torch.nn.Module):
def __init__(self, model):
super().__init__()
self.model = model
def forward(self, input_ids, attention_mask):
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
return_dict=True
)
return outputs.logits
wrapped_model = ModelWrapper(model)
traced_model = torch.jit.trace(
wrapped_model,
(inputs["input_ids"], inputs["attention_mask"])
)
# Convert to CoreML
print("Converting to CoreML...")
mlmodel = ct.convert(
traced_model,
inputs=[
ct.TensorType(
name="input_ids",
shape=(1, sequence_length),
dtype=np.int32
),
ct.TensorType(
name="attention_mask",
shape=(1, sequence_length),
dtype=np.int32
)
],
outputs=[
ct.TensorType(name="logits")
],
compute_units=getattr(ct.ComputeUnit, compute_units),
minimum_deployment_target=ct.target.iOS16
)
# Save model
output_path = self.output_dir / f"{self.model_name.replace('/', '_')}.mlpackage"
mlmodel.save(str(output_path))
print(f"Model saved to: {output_path}")
# Save tokenizer for inference
tokenizer.save_pretrained(self.output_dir / "tokenizer")
return output_path
class CoreMLInference:
"""Run inference with CoreML models on macOS."""
def __init__(self, model_path: str, tokenizer_path: str):
if not check_coreml_available():
raise ImportError("coremltools not installed")
import coremltools as ct
from transformers import AutoTokenizer
self.model = ct.models.MLModel(model_path)
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
# Get sequence length from model spec
self.sequence_length = 512 # Default, should match export
def generate(
self,
prompt: str,
max_new_tokens: int = 50,
temperature: float = 0.7
) -> str:
"""Generate text using CoreML model."""
import coremltools as ct
# Tokenize input
inputs = self.tokenizer(
prompt,
return_tensors="np",
max_length=self.sequence_length,
padding="max_length",
truncation=True
)
generated_tokens = []
input_ids = inputs["input_ids"].astype(np.int32)
attention_mask = inputs["attention_mask"].astype(np.int32)
# Get position of last real token
current_pos = int(attention_mask.sum()) - 1
for _ in range(max_new_tokens):
# Run inference
output = self.model.predict({
"input_ids": input_ids,
"attention_mask": attention_mask
})
logits = output["logits"][0, current_pos, :]
# Sample next token
if temperature > 0:
probs = self._softmax(logits / temperature)
next_token = np.random.choice(len(probs), p=probs)
else:
next_token = np.argmax(logits)
generated_tokens.append(next_token)
# Check for end of sequence
if next_token == self.tokenizer.eos_token_id:
break
# Update inputs for next iteration
current_pos += 1
if current_pos < self.sequence_length:
input_ids[0, current_pos] = next_token
attention_mask[0, current_pos] = 1
else:
# Shift window
input_ids[0, :-1] = input_ids[0, 1:]
input_ids[0, -1] = next_token
current_pos = self.sequence_length - 1
return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
@staticmethod
def _softmax(x):
exp_x = np.exp(x - np.max(x))
return exp_x / exp_x.sum()
# Swift code for iOS integration
SWIFT_INTEGRATION = '''
// EdgeLLM.swift
// iOS/macOS integration for CoreML language models
import CoreML
import NaturalLanguage
class EdgeLLM {
private var model: MLModel?
private var tokenizer: NLTokenizer
init(modelPath: String) throws {
let modelURL = URL(fileURLWithPath: modelPath)
self.model = try MLModel(contentsOf: modelURL)
self.tokenizer = NLTokenizer(unit: .word)
}
func generate(prompt: String, maxTokens: Int = 50) async throws -> String {
guard let model = model else {
throw EdgeLLMError.modelNotLoaded
}
// Tokenize input
tokenizer.string = prompt
var tokens: [Int] = []
tokenizer.enumerateTokens(in: prompt.startIndex..<prompt.endIndex) { range, _ in
// Convert token to ID (simplified)
tokens.append(prompt[range].hashValue % 50000)
return true
}
// Pad to sequence length
let sequenceLength = 512
while tokens.count < sequenceLength {
tokens.append(0)
}
// Create input
let inputArray = try MLMultiArray(shape: [1, NSNumber(value: sequenceLength)], dataType: .int32)
for (i, token) in tokens.enumerated() {
inputArray[i] = NSNumber(value: token)
}
// Run inference (simplified: a real implementation would pass a
// separate attention mask of 1s for real tokens and 0s for padding,
// rather than reusing the input_ids array)
let input = try MLDictionaryFeatureProvider(dictionary: [
"input_ids": MLFeatureValue(multiArray: inputArray),
"attention_mask": MLFeatureValue(multiArray: inputArray)
])
let output = try model.prediction(from: input)
// Decode output (simplified)
return "Generated text from CoreML model"
}
}
enum EdgeLLMError: Error {
case modelNotLoaded
case tokenizationFailed
case generationFailed
}
'''
if __name__ == "__main__":
if check_coreml_available():
# Export a small model
exporter = CoreMLExporter("HuggingFaceTB/SmolLM-135M-Instruct")
model_path = exporter.export_model(
sequence_length=512,
compute_units="ALL"
)
print(f"Exported to: {model_path}")
else:
print("CoreML tools not available (requires macOS)")
print("\nSwift integration code:")
print(SWIFT_INTEGRATION)
Part 5: Cross-Platform Benchmark Suite
Compare performance across deployment targets.
# benchmark/cross_platform.py
"""
Cross-platform benchmark suite for edge deployments.
"""
import json
import time
import platform
from dataclasses import dataclass, asdict
from typing import Optional
from pathlib import Path
import psutil
@dataclass
class BenchmarkResult:
"""Results from a single benchmark run."""
platform: str
runtime: str
model_name: str
model_size_mb: float
quantization: str
# Performance metrics
tokens_per_second: float
time_to_first_token_ms: float
total_generation_time_s: float
# Resource usage
peak_memory_mb: float
avg_cpu_percent: float
# Quality (optional)
output_quality_score: Optional[float] = None
def to_dict(self) -> dict:
return asdict(self)
class CrossPlatformBenchmark:
"""
Run benchmarks across different deployment configurations.
"""
STANDARD_PROMPTS = [
"Explain machine learning in one sentence.",
"Write a Python function to check if a number is prime.",
"What are the benefits of edge computing?",
"Summarize the key features of transformers.",
]
def __init__(self, output_dir: str = "./benchmark_results"):
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
self.results: list[BenchmarkResult] = []
def benchmark_llamacpp(
self,
model_path: str,
model_name: str,
quantization: str,
n_threads: int = 4
) -> BenchmarkResult:
"""Benchmark llama.cpp model."""
from llama_cpp import Llama
print(f"\nBenchmarking llama.cpp: {model_name}")
# Get model size
model_size_mb = Path(model_path).stat().st_size / (1024 * 1024)
# Load model
model = Llama(
model_path=model_path,
n_ctx=2048,
n_threads=n_threads,
verbose=False
)
# Run benchmark
all_tps = []
all_ttft = []
all_times = []
all_memory = []
for prompt in self.STANDARD_PROMPTS:
mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
start_time = time.time()
output = model(prompt, max_tokens=50, temperature=0.1)
total_time = time.time() - start_time
mem_after = psutil.Process().memory_info().rss / (1024 * 1024)
tokens = output["usage"]["completion_tokens"]
tps = tokens / total_time if total_time > 0 else 0
# Approximation: average per-token latency. Measuring true TTFT
# requires streaming and timing the first token separately.
ttft = (total_time / tokens * 1000) if tokens > 0 else 0
all_tps.append(tps)
all_ttft.append(ttft)
all_times.append(total_time)
all_memory.append(mem_after - mem_before)
result = BenchmarkResult(
platform=platform.system(),
runtime="llama.cpp",
model_name=model_name,
model_size_mb=model_size_mb,
quantization=quantization,
tokens_per_second=sum(all_tps) / len(all_tps),
time_to_first_token_ms=sum(all_ttft) / len(all_ttft),
total_generation_time_s=sum(all_times),
peak_memory_mb=max(all_memory),
avg_cpu_percent=psutil.cpu_percent(interval=0.1)
)
self.results.append(result)
return result
def benchmark_onnx(
self,
model_path: str,
model_name: str
) -> BenchmarkResult:
"""Benchmark ONNX Runtime model."""
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
print(f"\nBenchmarking ONNX: {model_name}")
# Load model
model = ORTModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Get model size
model_files = list(Path(model_path).glob("*.onnx"))
model_size_mb = sum(f.stat().st_size for f in model_files) / (1024 * 1024)
# Run benchmark
all_tps = []
all_ttft = []
all_times = []
all_memory = []
for prompt in self.STANDARD_PROMPTS:
inputs = tokenizer(prompt, return_tensors="pt")
input_length = inputs["input_ids"].shape[1]
mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
start_time = time.time()
outputs = model.generate(
**inputs,
max_new_tokens=50,
temperature=0.1,
do_sample=True,
pad_token_id=tokenizer.pad_token_id
)
total_time = time.time() - start_time
mem_after = psutil.Process().memory_info().rss / (1024 * 1024)
tokens = len(outputs[0]) - input_length
tps = tokens / total_time if total_time > 0 else 0
# Approximation: average per-token latency, as in benchmark_llamacpp.
ttft = (total_time / tokens * 1000) if tokens > 0 else 0
all_tps.append(tps)
all_ttft.append(ttft)
all_times.append(total_time)
all_memory.append(mem_after - mem_before)
result = BenchmarkResult(
platform=platform.system(),
runtime="ONNX Runtime",
model_name=model_name,
model_size_mb=model_size_mb,
quantization="FP32",
tokens_per_second=sum(all_tps) / len(all_tps),
time_to_first_token_ms=sum(all_ttft) / len(all_ttft),
total_generation_time_s=sum(all_times),
peak_memory_mb=max(all_memory),
avg_cpu_percent=psutil.cpu_percent(interval=0.1)
)
self.results.append(result)
return result
def generate_report(self) -> dict:
"""Generate comparison report."""
if not self.results:
return {"error": "No benchmark results"}
report = {
"summary": {
"total_benchmarks": len(self.results),
"platforms_tested": list(set(r.platform for r in self.results)),
"runtimes_tested": list(set(r.runtime for r in self.results)),
},
"results": [r.to_dict() for r in self.results],
"rankings": {
"by_speed": sorted(
[r.to_dict() for r in self.results],
key=lambda x: x["tokens_per_second"],
reverse=True
),
"by_memory": sorted(
[r.to_dict() for r in self.results],
key=lambda x: x["peak_memory_mb"]
),
"by_latency": sorted(
[r.to_dict() for r in self.results],
key=lambda x: x["time_to_first_token_ms"]
)
}
}
# Save report
report_path = self.output_dir / f"benchmark_report_{int(time.time())}.json"
with open(report_path, "w") as f:
json.dump(report, f, indent=2)
print(f"\nReport saved to: {report_path}")
return report
def print_comparison_table(self):
"""Print results as a comparison table."""
if not self.results:
print("No results to display")
return
print("\n" + "=" * 80)
print("CROSS-PLATFORM BENCHMARK RESULTS")
print("=" * 80)
# Header
print(f"{'Runtime':<15} {'Model':<25} {'Quant':<8} {'Tok/s':<8} {'TTFT(ms)':<10} {'Mem(MB)':<10}")
print("-" * 80)
# Results
for r in sorted(self.results, key=lambda x: x.tokens_per_second, reverse=True):
print(
f"{r.runtime:<15} "
f"{r.model_name[:23]:<25} "
f"{r.quantization:<8} "
f"{r.tokens_per_second:<8.1f} "
f"{r.time_to_first_token_ms:<10.1f} "
f"{r.peak_memory_mb:<10.1f}"
)
print("=" * 80)
# Visualization
def visualize_benchmarks(results: list[BenchmarkResult]):
"""Create visualization of benchmark results."""
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(
rows=2, cols=2,
subplot_titles=(
"Tokens per Second",
"Time to First Token (ms)",
"Peak Memory (MB)",
"Speed vs Memory Tradeoff"
)
)
labels = [f"{r.runtime}\n{r.model_name[:15]}" for r in results]
# Tokens per second
fig.add_trace(
go.Bar(x=labels, y=[r.tokens_per_second for r in results], name="Tok/s"),
row=1, col=1
)
# Time to first token
fig.add_trace(
go.Bar(x=labels, y=[r.time_to_first_token_ms for r in results], name="TTFT"),
row=1, col=2
)
# Memory usage
fig.add_trace(
go.Bar(x=labels, y=[r.peak_memory_mb for r in results], name="Memory"),
row=2, col=1
)
# Speed vs Memory scatter
fig.add_trace(
go.Scatter(
x=[r.peak_memory_mb for r in results],
y=[r.tokens_per_second for r in results],
mode="markers+text",
text=[r.model_name[:10] for r in results],
textposition="top center",
name="Speed vs Memory"
),
row=2, col=2
)
fig.update_layout(height=800, title="Edge Deployment Benchmark Comparison")
fig.write_html("benchmark_comparison.html")
print("Visualization saved to benchmark_comparison.html")
if __name__ == "__main__":
benchmark = CrossPlatformBenchmark()
# Run benchmarks (paths need to be configured)
# benchmark.benchmark_llamacpp(
# "qwen2.5-0.5b-instruct-q4_k_m.gguf",
# "Qwen2.5-0.5B",
# "Q4_K_M"
# )
# benchmark.benchmark_onnx(
# "./onnx_models/SmolLM-135M-Instruct",
# "SmolLM-135M"
# )
# Generate report
# report = benchmark.generate_report()
# benchmark.print_comparison_table()
print("Configure model paths and run benchmarks")
Part 6: Edge Deployment FastAPI Server
Build a unified API that supports multiple backends.
# server/edge_server.py
"""
FastAPI server for edge deployment with multiple backend support.
"""
import os
import time
from typing import Optional, Literal
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
import psutil
# Request/Response models
class GenerateRequest(BaseModel):
prompt: str
max_tokens: int = Field(default=100, le=500)
temperature: float = Field(default=0.7, ge=0, le=2)
top_p: float = Field(default=0.9, ge=0, le=1)
stream: bool = False
class GenerateResponse(BaseModel):
text: str
tokens_generated: int
generation_time_s: float
tokens_per_second: float
backend: str
class ChatMessage(BaseModel):
role: Literal["system", "user", "assistant"]
content: str
class ChatRequest(BaseModel):
messages: list[ChatMessage]
max_tokens: int = Field(default=100, le=500)
temperature: float = Field(default=0.7, ge=0, le=2)
stream: bool = False
class SystemInfo(BaseModel):
platform: str
cpu_count: int
total_memory_gb: float
available_memory_gb: float
backend: str
model_name: str
# Backend abstraction
class EdgeBackend:
"""Abstract base for edge backends."""
def __init__(self, model_path: str):
self.model_path = model_path
self.model_name = os.path.basename(model_path)
def generate(self, prompt: str, max_tokens: int, temperature: float, top_p: float) -> tuple[str, dict]:
raise NotImplementedError
def generate_stream(self, prompt: str, max_tokens: int, temperature: float, top_p: float):
raise NotImplementedError
@property
def backend_name(self) -> str:
raise NotImplementedError
class LlamaCppBackend(EdgeBackend):
"""llama.cpp backend for edge deployment."""
def __init__(self, model_path: str, n_threads: int = 4, n_ctx: int = 2048):
super().__init__(model_path)
from llama_cpp import Llama
self.model = Llama(
model_path=model_path,
n_ctx=n_ctx,
n_threads=n_threads,
verbose=False
)
@property
def backend_name(self) -> str:
return "llama.cpp"
def generate(self, prompt: str, max_tokens: int, temperature: float, top_p: float) -> tuple[str, dict]:
start_time = time.time()
output = self.model(
prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
echo=False
)
generation_time = time.time() - start_time
text = output["choices"][0]["text"]
tokens = output["usage"]["completion_tokens"]
metrics = {
"tokens_generated": tokens,
"generation_time_s": generation_time,
"tokens_per_second": tokens / generation_time if generation_time > 0 else 0
}
return text, metrics
def generate_stream(self, prompt: str, max_tokens: int, temperature: float, top_p: float):
for output in self.model(
prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stream=True
):
yield output["choices"][0]["text"]
class ONNXBackend(EdgeBackend):
"""ONNX Runtime backend."""
def __init__(self, model_path: str):
super().__init__(model_path)
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
self.model = ORTModelForCausalLM.from_pretrained(model_path)
self.tokenizer = AutoTokenizer.from_pretrained(model_path)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
@property
def backend_name(self) -> str:
return "ONNX Runtime"
def generate(self, prompt: str, max_tokens: int, temperature: float, top_p: float) -> tuple[str, dict]:
inputs = self.tokenizer(prompt, return_tensors="pt")
input_length = inputs["input_ids"].shape[1]
start_time = time.time()
outputs = self.model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=temperature if temperature > 0 else 1.0,
top_p=top_p,
do_sample=temperature > 0,
pad_token_id=self.tokenizer.pad_token_id
)
generation_time = time.time() - start_time
generated_tokens = outputs[0][input_length:]
text = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
tokens = len(generated_tokens)
metrics = {
"tokens_generated": tokens,
"generation_time_s": generation_time,
"tokens_per_second": tokens / generation_time if generation_time > 0 else 0
}
return text, metrics
def generate_stream(self, prompt: str, max_tokens: int, temperature: float, top_p: float):
# This backend does not implement token-level streaming;
# generate the full text, then yield it character by character.
text, _ = self.generate(prompt, max_tokens, temperature, top_p)
for char in text:
yield char
# Global backend instance
backend: Optional[EdgeBackend] = None
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Initialize backend on startup."""
global backend
# Configuration from environment
backend_type = os.getenv("EDGE_BACKEND", "llamacpp")
model_path = os.getenv("EDGE_MODEL_PATH", "qwen2.5-0.5b-instruct-q4_k_m.gguf")
n_threads = int(os.getenv("EDGE_THREADS", "4"))
print(f"Initializing {backend_type} backend with {model_path}")
if backend_type == "llamacpp":
backend = LlamaCppBackend(model_path, n_threads=n_threads)
elif backend_type == "onnx":
backend = ONNXBackend(model_path)
else:
raise ValueError(f"Unknown backend: {backend_type}")
print(f"Backend ready: {backend.backend_name}")
yield
# Cleanup
backend = None
app = FastAPI(
title="Edge LLM Server",
description="Lightweight LLM server for edge deployment",
version="1.0.0",
lifespan=lifespan
)
@app.get("/health")
async def health_check():
"""Health check endpoint."""
return {
"status": "healthy",
"backend": backend.backend_name if backend else "not initialized"
}
@app.get("/info", response_model=SystemInfo)
async def system_info():
"""Get system information."""
import platform
mem = psutil.virtual_memory()
return SystemInfo(
platform=platform.system(),
cpu_count=psutil.cpu_count(),
total_memory_gb=mem.total / (1024**3),
available_memory_gb=mem.available / (1024**3),
backend=backend.backend_name if backend else "not initialized",
model_name=backend.model_name if backend else "none"
)
@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
"""Generate text from prompt."""
if not backend:
raise HTTPException(status_code=503, detail="Backend not initialized")
if request.stream:
async def stream_generator():
for token in backend.generate_stream(
request.prompt,
request.max_tokens,
request.temperature,
request.top_p
):
yield f"data: {token}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(
stream_generator(),
media_type="text/event-stream"
)
text, metrics = backend.generate(
request.prompt,
request.max_tokens,
request.temperature,
request.top_p
)
return GenerateResponse(
text=text,
tokens_generated=metrics["tokens_generated"],
generation_time_s=metrics["generation_time_s"],
tokens_per_second=metrics["tokens_per_second"],
backend=backend.backend_name
)
@app.post("/chat")
async def chat(request: ChatRequest):
"""Chat completion endpoint."""
if not backend:
raise HTTPException(status_code=503, detail="Backend not initialized")
# Format messages as prompt
prompt = ""
for msg in request.messages:
prompt += f"<|im_start|>{msg.role}\n{msg.content}<|im_end|>\n"
prompt += "<|im_start|>assistant\n"
text, metrics = backend.generate(
prompt,
request.max_tokens,
request.temperature,
0.9 # Fixed top_p for chat
)
# Clean up response
text = text.split("<|im_end|>")[0].strip()
return {
"message": {"role": "assistant", "content": text},
"usage": metrics,
"backend": backend.backend_name
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(
app,
host="0.0.0.0",
port=8000,
workers=1 # Single worker for edge devices
)
Docker Configuration
# Dockerfile.edge
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Environment variables
ENV EDGE_BACKEND=llamacpp
ENV EDGE_MODEL_PATH=/models/model.gguf
ENV EDGE_THREADS=4
# Expose port
EXPOSE 8000
# Run server
CMD ["python", "server/edge_server.py"]
# docker-compose.yml
version: '3.8'
services:
edge-llm:
build:
context: .
dockerfile: Dockerfile.edge
ports:
- "8000:8000"
volumes:
- ./models:/models:ro
environment:
- EDGE_BACKEND=llamacpp
- EDGE_MODEL_PATH=/models/qwen2.5-0.5b-instruct-q4_k_m.gguf
- EDGE_THREADS=4
deploy:
resources:
limits:
memory: 2G
cpus: '2'
restart: unless-stopped
Exercises
Exercise 1: Model Size Optimization
Export the same model at different quantization levels (Q2_K, Q4_K_M, Q8_0) and measure:
- Model file size
- Memory usage during inference
- Token generation speed
- Output quality (subjective evaluation)
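A starting point for this exercise is to predict file sizes before exporting. The bits-per-weight figures below are rough assumptions (actual values vary with llama.cpp version and per-tensor quantization mixes), so verify them against the files you actually produce.

```python
# Estimate GGUF file sizes from approximate bits-per-weight (bpw).
# The bpw values are rough assumptions for illustration only.
APPROX_BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def estimated_size_mb(n_params: float, quant: str) -> float:
    """Estimated file size in MB for a model with n_params weights."""
    return n_params * APPROX_BPW[quant] / 8 / (1024 ** 2)

for quant in ("Q2_K", "Q4_K_M", "Q8_0"):
    size = estimated_size_mb(0.5e9, quant)  # a 0.5B-parameter model
    print(f"{quant:8s} ~{size:,.0f} MB")
```

Comparing these predictions against measured sizes (and against the quality of each model's outputs) is the core of the exercise.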
Exercise 2: Browser Deployment
Deploy a model using WebLLM and measure:
- Initial load time
- First inference latency
- Sustained generation speed
- Memory usage in browser
Exercise 3: Raspberry Pi Deployment
Deploy an SLM on a Raspberry Pi and:
- Measure real-world performance
- Optimize thread count for the hardware
- Compare different quantization levels
- Build a simple voice assistant
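For the thread-count step, a hedged helper like this is a reasonable starting point: prefer physical cores (via `psutil` when available) minus one for OS headroom, falling back to half the logical count, which on most SMT systems approximates the physical count. Treat the result as a starting value to sweep around, not an optimum.

```python
# Suggest an n_threads value: physical cores minus one, with a
# fallback when psutil is unavailable (os.cpu_count is logical).
import os

def recommended_threads() -> int:
    try:
        import psutil  # optional dependency
        physical = psutil.cpu_count(logical=False)
    except ImportError:
        physical = None
    if physical is None:
        # Assume 2 logical threads per physical core as a fallback.
        physical = max(1, (os.cpu_count() or 2) // 2)
    return max(1, physical - 1)

print(f"suggested n_threads: {recommended_threads()}")
```

On a Raspberry Pi 4/5 (no SMT, 4 physical cores) this suggests 3 threads; benchmark 2-4 on your own board to confirm.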
Exercise 4: Mobile Integration
Create a mobile app concept that:
- Uses CoreML on iOS or ONNX on Android
- Handles model updates
- Gracefully degrades on low memory
- Provides offline functionality
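The "graceful degradation" requirement can be sketched as a memory-aware model picker: choose the largest candidate whose estimated footprint fits a fraction of free RAM. The model names and footprints below are illustrative assumptions, and on device you would query free memory from the platform API rather than hard-coding it.

```python
# Pick the largest model fitting a memory budget; fall back to
# smaller models (or None) as free RAM shrinks. Sizes are illustrative.
CANDIDATES = [  # (name, approx resident memory in MB), largest first
    ("Phi-3-mini-Q4", 2600),
    ("Qwen2.5-0.5B-Q4", 500),
    ("SmolLM-135M-Q4", 200),
]

def pick_model(free_ram_mb: float, budget_fraction: float = 0.5):
    """Return the largest candidate within the budget, or None if
    even the smallest model does not fit."""
    budget = free_ram_mb * budget_fraction
    for name, footprint in CANDIDATES:
        if footprint <= budget:
            return name
    return None

print(pick_model(2048))  # → 'Qwen2.5-0.5B-Q4'
print(pick_model(150))   # → None (app should disable the feature)
```

Returning `None` rather than crashing is the degradation path: the app can hide the feature or queue requests for later.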
Summary
You've learned to deploy SLMs across diverse edge platforms:
- ONNX Export: Cross-platform deployment with optimization
- llama.cpp: Efficient CPU inference for resource-constrained devices
- WebLLM: Browser-based deployment using WebGPU
- CoreML: Apple ecosystem optimization with Neural Engine
- Unified Server: Multi-backend API for flexible deployment
Key insights:
- Quantization is essential for edge deployment (Q4_K_M offers best balance)
- Memory-mapped loading reduces startup requirements
- Thread count should match physical cores, not logical
- WebGPU enables near-native browser performance
- Always benchmark on actual target hardware
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| ONNX Runtime | Cross-platform inference engine | Deploy same model on CPU, GPU, mobile |
| llama.cpp | Optimized C++ inference for GGUF | Best performance on CPU-only devices |
| GGUF Format | Quantized model format (Q2-Q8) | Smaller files, faster inference |
| Q4_K_M | 4-bit quantization variant | Best balance of size and quality |
| WebLLM/MLC-LLM | Run models in browser via WebGPU | Zero server, complete privacy |
| CoreML | Apple's ML framework | Uses Neural Engine on iOS/macOS |
| use_mmap | Memory-map model from disk | Reduces RAM at cost of disk I/O |
| n_threads | CPU threads for inference | Set to physical_cores - 1 |
| n_ctx | Context window size | Larger = more memory, longer inputs |
| Time to First Token | Latency before generation starts | Critical for perceived responsiveness |
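The `n_ctx` memory cost in the table above comes mostly from the KV cache, which grows linearly with context length: bytes = 2 (K and V) × n_layers × n_ctx × n_kv_heads × head_dim × bytes per element. The Qwen2.5-0.5B-style dimensions below are assumptions for illustration; substitute your model's actual config.

```python
# Back-of-envelope KV-cache size: doubling n_ctx doubles the cache.
# Model dimensions here are illustrative (Qwen2.5-0.5B-style GQA).
def kv_cache_mb(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size in MB for fp16 (bytes_per_elem=2) by default."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / (1024 ** 2)

small = kv_cache_mb(n_layers=24, n_ctx=2048, n_kv_heads=2, head_dim=64)
large = kv_cache_mb(n_layers=24, n_ctx=4096, n_kv_heads=2, head_dim=64)
print(f"n_ctx=2048: {small:.0f} MB, n_ctx=4096: {large:.0f} MB")
```

Models using grouped-query attention (few KV heads) keep this cost small; multi-head models at the same size can need several times more, which is why `n_ctx` deserves attention on memory-constrained devices.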
Next Steps
- SLM Agents - Build agentic systems with edge models
- Production SLM System - Scale edge deployments