Edge Deployment
Deploy small language models to mobile and edge devices
TL;DR
Deploy SLMs via ONNX (cross-platform), llama.cpp (efficient CPU inference with GGUF), WebLLM (browser via WebGPU), or CoreML (Apple Neural Engine). Key settings: use_mmap=True for memory efficiency, n_threads=physical_cores-1, and Q4_K_M quantization for best size/quality balance.
Deploy SLMs on resource-constrained edge devices including mobile phones, browsers, and IoT devices.
Project Overview
| Aspect | Details |
|---|---|
| Difficulty | Intermediate |
| Time | 6-8 hours |
| Prerequisites | Python, SLM basics, model formats |
| Learning Outcomes | ONNX export, mobile deployment, browser inference, optimization |
What You'll Learn
- Export models to ONNX and optimize for inference
- Deploy to iOS/macOS with CoreML
- Run models in browsers with WebLLM
- Use llama.cpp for efficient edge inference
- Optimize for memory-constrained devices
- Benchmark across different deployment targets
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ Edge Deployment Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SOURCE MODELS EXPORT FORMATS DEPLOYMENT TARGETS │
│ ┌─────────────────┐ │
│ │ HuggingFace │────┬───► ONNX Runtime ───┬───► Server (CPU/GPU) │
│ │ Model │ │ └───► Android │
│ └─────────────────┘ │ │
│ │ ├───► CoreML ─────────────► macOS/iOS │
│ │ │ │
│ │ └───► MLC-LLM/WebLLM ─────► Web Browser │
│ │ ▲ │
│ ▼ │ │
│ ┌─────────────────┐ │ │
│ │ GGUF Model │───────────────┤ │
│ │ │ │ │
│ └─────────────────┘ │ │
│ │ │ │
│ └───► llama.cpp ─────────┼───► Server (CPU) │
│ │ │
│ └───► Raspberry Pi │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ RECOMMENDED PATH: HuggingFace ──► GGUF ──► llama.cpp (most compatible) │
└─────────────────────────────────────────────────────────────────────────────┘
Project Setup
Dependencies
# Create project directory
mkdir slm-edge-deployment && cd slm-edge-deployment
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Core dependencies
pip install torch transformers accelerate
# ONNX Runtime
pip install onnx onnxruntime optimum[exporters]
# llama.cpp Python bindings
pip install llama-cpp-python
# For benchmarking
pip install psutil numpy pandas plotly
# Optional: CoreML tools (macOS only)
# pip install coremltools
# Optional: TensorFlow Lite
# pip install tensorflow tflite-model-maker
Model Selection for Edge
# models/edge_models.py
"""
Model recommendations for different edge targets.
"""
EDGE_MODEL_RECOMMENDATIONS = {
"raspberry_pi_4": {
"ram_gb": 4,
"recommended_models": [
{"name": "qwen2.5:0.5b-instruct-q4_K_M", "vram_mb": 400},
{"name": "smollm:135m-instruct-q8_0", "vram_mb": 150},
{"name": "tinyllama:1.1b-chat-v1.0-q2_K", "vram_mb": 500},
],
"notes": "Use smallest quantizations, expect 5-15 tok/s"
},
"raspberry_pi_5": {
"ram_gb": 8,
"recommended_models": [
{"name": "qwen2.5:1.5b-instruct-q4_K_M", "vram_mb": 1200},
{"name": "phi3:mini-4k-instruct-q4_K_M", "vram_mb": 2400},
{"name": "gemma2:2b-instruct-q4_K_M", "vram_mb": 1800},
],
"notes": "Better performance, 10-25 tok/s with Q4"
},
"iphone_15_pro": {
"ram_gb": 8,
"neural_engine": True,
"recommended_models": [
{"name": "phi3-mini-4k-CoreML", "vram_mb": 2000},
{"name": "gemma2-2b-CoreML", "vram_mb": 1500},
],
"notes": "Use CoreML for Neural Engine acceleration"
},
"android_flagship": {
"ram_gb": 12,
"recommended_models": [
{"name": "qwen2.5:3b-instruct-q4_K_M", "vram_mb": 2200},
{"name": "phi3:mini-4k-instruct-q4_K_M", "vram_mb": 2400},
],
"notes": "Use GGUF with Termux or dedicated apps"
},
"web_browser": {
"ram_gb": 4, # Available to browser
"webgpu": True,
"recommended_models": [
{"name": "Qwen2.5-0.5B-Instruct-q4f16_1-MLC", "vram_mb": 400},
{"name": "SmolLM-135M-Instruct-q4f16_1-MLC", "vram_mb": 150},
{"name": "Phi-3-mini-4k-instruct-q4f16_1-MLC", "vram_mb": 2000},
],
"notes": "Requires WebGPU-capable browser"
}
}
def get_model_recommendations(target: str, max_memory_mb: int = None) -> list:
"""Get model recommendations for a deployment target."""
if target not in EDGE_MODEL_RECOMMENDATIONS:
raise ValueError(f"Unknown target: {target}")
config = EDGE_MODEL_RECOMMENDATIONS[target]
models = config["recommended_models"]
if max_memory_mb:
models = [m for m in models if m["vram_mb"] <= max_memory_mb]
return models
def print_recommendations():
"""Print all recommendations in a readable format."""
for target, config in EDGE_MODEL_RECOMMENDATIONS.items():
print(f"\n{'='*50}")
print(f"Target: {target}")
print(f"RAM: {config['ram_gb']}GB")
print(f"Notes: {config['notes']}")
print("-" * 30)
for model in config["recommended_models"]:
print(f" - {model['name']} ({model['vram_mb']}MB)")
if __name__ == "__main__":
    print_recommendations()
Understanding Edge Model Selection:
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY MODEL SELECTION MATTERS FOR EDGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Device RAM ──► Available for Model ──► Model Size Limit ──► Quality │
│ │
│ Example: iPhone 15 Pro (8GB RAM) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Total RAM: 8GB │ │
│ │ ├── iOS System: ~2GB │ │
│ │ ├── App Overhead: ~500MB │ │
│ │ ├── Neural Engine Buffer: ~500MB │ │
│ │ └── Available for Model: ~5GB ──► Can run Phi-3 Mini (2.4GB Q4) │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Quantization Impact on Quality vs Size: │
│ ┌──────────┬───────────┬─────────────┬──────────────────────────────────┐ │
│ │ Format │ Size (2B) │ Quality │ Use Case │ │
│ ├──────────┼───────────┼─────────────┼──────────────────────────────────┤ │
│ │ Q2_K │ ~0.8GB │ Degraded │ Extreme memory constraints │ │
│ │ Q4_K_M │ ~1.2GB │ Near-FP16 │ Best balance (recommended) │ │
│ │ Q8_0 │ ~2.0GB │ Excellent │ When quality is critical │ │
│ └──────────┴───────────┴─────────────┴──────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Design Decisions:
| Decision | Reasoning |
|---|---|
| Organize by device | Different devices have different constraints (RAM, accelerators) |
| Include VRAM estimates | Lets you calculate if a model fits before downloading |
| Recommend Q4_K_M | Best size/quality trade-off - K-quant methods preserve quality better |
| Device-specific notes | Each platform has unique optimization paths (CoreML, WebGPU, etc.) |
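The "does it fit?" check implied by the VRAM column can be done before downloading anything. A minimal sketch, with hypothetical helper names (`estimate_model_mb`, `fits`) and approximate bits-per-weight figures — GGUF K-quants store scales alongside weights, so effective bits are higher than the nominal quantization width:

```python
# Approximate effective bits per weight for common GGUF formats.
# These are rough figures for estimation, not exact spec values.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def estimate_model_mb(params_b: float, quant: str, overhead: float = 1.1) -> float:
    """Estimate model size in MB: parameter count (billions) x bits / 8,
    plus ~10% overhead for embeddings, metadata, and KV-cache headroom."""
    bits = BITS_PER_WEIGHT[quant]
    return params_b * 1e9 * bits / 8 / 1e6 * overhead

def fits(params_b: float, quant: str, available_mb: int) -> bool:
    """Check whether a quantized model fits in the available memory budget."""
    return estimate_model_mb(params_b, quant) <= available_mb

# Example: a 2B model at Q4_K_M lands near the ~1.2GB figure in the table above.
print(f"2B @ Q4_K_M ≈ {estimate_model_mb(2.0, 'Q4_K_M'):.0f} MB")
```

Run this against your device's available RAM (not total RAM) to pre-filter the recommendation lists above.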
Part 1: ONNX Export and Optimization
ONNX (Open Neural Network Exchange) provides cross-platform model deployment.
# export/onnx_export.py
"""
Export HuggingFace models to ONNX format with optimizations.
"""
import os
import time
from pathlib import Path
from typing import Optional
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.onnxruntime import ORTModelForCausalLM
import onnxruntime as ort
class ONNXExporter:
"""Export and optimize models for ONNX Runtime."""
def __init__(self, model_name: str, output_dir: str = "./onnx_models"):
self.model_name = model_name
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
def export_model(
self,
optimize: bool = True,
quantize: bool = False,
device: str = "cpu"
) -> Path:
"""
Export model to ONNX format.
Args:
optimize: Apply ONNX graph optimizations
quantize: Apply INT8 quantization
device: Target device (cpu/cuda)
Returns:
Path to exported model
"""
export_path = self.output_dir / self.model_name.replace("/", "_")
print(f"Exporting {self.model_name} to ONNX...")
# Use optimum for export
model = ORTModelForCausalLM.from_pretrained(
self.model_name,
export=True,
provider="CPUExecutionProvider" if device == "cpu" else "CUDAExecutionProvider"
)
# Save the exported model
model.save_pretrained(export_path)
# Also save tokenizer
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
tokenizer.save_pretrained(export_path)
print(f"Model exported to: {export_path}")
# Apply optimizations
if optimize:
self._optimize_model(export_path)
# Apply quantization
if quantize:
self._quantize_model(export_path)
return export_path
def _optimize_model(self, model_path: Path):
"""Apply ONNX graph optimizations."""
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig
optimizer = ORTOptimizer.from_pretrained(model_path)
optimization_config = OptimizationConfig(
optimization_level=99, # Maximum optimization
enable_transformers_specific_optimizations=True,
fp16=False, # Keep FP32 for CPU compatibility
)
optimized_path = model_path / "optimized"
optimizer.optimize(
save_dir=optimized_path,
optimization_config=optimization_config
)
print(f"Optimized model saved to: {optimized_path}")
def _quantize_model(self, model_path: Path):
"""Apply dynamic INT8 quantization."""
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
quantizer = ORTQuantizer.from_pretrained(model_path)
quantization_config = AutoQuantizationConfig.avx512_vnni(
is_static=False,
per_channel=True
)
quantized_path = model_path / "quantized"
quantizer.quantize(
save_dir=quantized_path,
quantization_config=quantization_config
)
print(f"Quantized model saved to: {quantized_path}")
class ONNXInference:
"""Run inference with ONNX models."""
def __init__(self, model_path: str, use_gpu: bool = False):
self.model_path = Path(model_path)
# Set up execution providers
providers = ["CPUExecutionProvider"]
if use_gpu:
providers.insert(0, "CUDAExecutionProvider")
# Load model
self.model = ORTModelForCausalLM.from_pretrained(
model_path,
provider=providers[0]
)
self.tokenizer = AutoTokenizer.from_pretrained(model_path)
# Set pad token if needed
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
def generate(
self,
prompt: str,
max_new_tokens: int = 100,
temperature: float = 0.7,
top_p: float = 0.9
) -> tuple[str, dict]:
"""
Generate text with ONNX model.
Returns:
Tuple of (generated_text, metrics)
"""
# Tokenize
inputs = self.tokenizer(prompt, return_tensors="pt")
input_length = inputs["input_ids"].shape[1]
# Generate
start_time = time.time()
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=temperature > 0,
pad_token_id=self.tokenizer.pad_token_id
)
generation_time = time.time() - start_time
# Decode
generated_tokens = outputs[0][input_length:]
generated_text = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
# Calculate metrics
tokens_generated = len(generated_tokens)
metrics = {
"tokens_generated": tokens_generated,
"generation_time_s": generation_time,
"tokens_per_second": tokens_generated / generation_time if generation_time > 0 else 0
}
return generated_text, metrics
def benchmark(self, prompts: list[str], num_runs: int = 3) -> dict:
"""Run benchmark across multiple prompts."""
all_metrics = []
for prompt in prompts:
prompt_metrics = []
for _ in range(num_runs):
_, metrics = self.generate(prompt, max_new_tokens=50)
prompt_metrics.append(metrics)
# Average metrics for this prompt
avg_metrics = {
"tokens_per_second": np.mean([m["tokens_per_second"] for m in prompt_metrics]),
"generation_time_s": np.mean([m["generation_time_s"] for m in prompt_metrics])
}
all_metrics.append(avg_metrics)
return {
"avg_tokens_per_second": np.mean([m["tokens_per_second"] for m in all_metrics]),
"avg_generation_time_s": np.mean([m["generation_time_s"] for m in all_metrics]),
"num_prompts": len(prompts),
"num_runs": num_runs
}
# Example usage
if __name__ == "__main__":
# Export a small model
exporter = ONNXExporter("HuggingFaceTB/SmolLM-135M-Instruct")
model_path = exporter.export_model(optimize=True, quantize=False)
# Run inference
inference = ONNXInference(str(model_path))
prompt = "Explain edge computing in one sentence:"
response, metrics = inference.generate(prompt)
print(f"\nPrompt: {prompt}")
print(f"Response: {response}")
    print(f"Speed: {metrics['tokens_per_second']:.1f} tokens/sec")
★ Insight ─────────────────────────────────────
ONNX Export Strategy: Optimum's ORTModelForCausalLM handles the complex ONNX export process including attention mask handling, KV-cache management, and decoder architecture. The optimization levels (0-99) apply increasingly aggressive graph transformations. Level 99 enables all optimizations including operator fusion.
─────────────────────────────────────────────────
Understanding ONNX Export and Inference:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ONNX EXPORT PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ HuggingFace Model ──► ONNX Graph ──► Optimizations ──► Quantization │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ PyTorch │ │ Operators │ │ Fused Ops │ │ INT8 │ │
│ │ Weights │ │ as Nodes │ │ Optimized │ │ Weights │ │
│ │ + Config │ │ + Edges │ │ Graph │ │ Smaller │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Why Each Step Matters: │
│ • Export: Converts PyTorch dynamic graph to static ONNX format │
│ • Optimize: Fuses operations (LayerNorm + Add → single kernel) │
│ • Quantize: Reduces memory 4x with minimal quality loss │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
ONNX Runtime Execution Providers:
| Provider | Best For | Performance | Setup Complexity |
|---|---|---|---|
| CPUExecutionProvider | Any CPU | Baseline | None |
| CUDAExecutionProvider | NVIDIA GPU | 10-50x faster | CUDA + cuDNN |
| CoreMLExecutionProvider | Apple Neural Engine | 5-20x faster | macOS only |
| QNNExecutionProvider | Qualcomm NPU | 10-30x faster | Android + Snapdragon |
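ONNX Runtime accepts a prioritized provider list and falls back down it at runtime, so a portable app can request accelerators and still run anywhere. A sketch of that selection logic — the function name `choose_providers` is our own; with onnxruntime installed you would feed it `ort.get_available_providers()`:

```python
def choose_providers(available: list[str]) -> list[str]:
    """Order execution providers by preference, always keeping a CPU fallback.

    Real usage (assumes onnxruntime is installed):
        import onnxruntime as ort
        providers = choose_providers(ort.get_available_providers())
        session = ort.InferenceSession("model.onnx", providers=providers)
    """
    preference = [
        "CUDAExecutionProvider",    # NVIDIA GPU
        "CoreMLExecutionProvider",  # Apple Neural Engine
        "QNNExecutionProvider",     # Qualcomm NPU
        "CPUExecutionProvider",     # universal fallback
    ]
    chosen = [p for p in preference if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen
```

Listing CPU last means a missing CUDA or CoreML runtime degrades gracefully instead of failing at session creation.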
Part 2: llama.cpp for Edge Devices
llama.cpp provides highly optimized CPU inference, perfect for edge devices.
# inference/llamacpp_edge.py
"""
Edge deployment using llama.cpp for efficient CPU inference.
"""
import os
import time
import psutil
from pathlib import Path
from typing import Optional, Generator
from dataclasses import dataclass
from llama_cpp import Llama
@dataclass
class EdgeConfig:
"""Configuration for edge deployment."""
n_ctx: int = 2048 # Context window
n_threads: int = 4 # CPU threads
n_batch: int = 512 # Batch size for prompt processing
n_gpu_layers: int = 0 # GPU layers (0 for pure CPU)
use_mlock: bool = False # Lock model in memory (requires permissions)
use_mmap: bool = True # Memory-mapped loading
verbose: bool = False
class EdgeLLM:
"""
Optimized LLM inference for edge devices.
Designed for:
- Raspberry Pi 4/5
- Low-power laptops
- Android devices (via Termux)
- IoT gateways
"""
def __init__(
self,
model_path: str,
config: EdgeConfig = None
):
self.model_path = model_path
self.config = config or EdgeConfig()
# Detect available resources
self._detect_resources()
# Adjust config based on resources
self._optimize_for_device()
# Load model
self.model = self._load_model()
def _detect_resources(self):
"""Detect available system resources."""
self.total_ram_gb = psutil.virtual_memory().total / (1024**3)
self.available_ram_gb = psutil.virtual_memory().available / (1024**3)
self.cpu_count = psutil.cpu_count(logical=True)
self.physical_cores = psutil.cpu_count(logical=False)
print(f"System Resources:")
print(f" Total RAM: {self.total_ram_gb:.1f} GB")
print(f" Available RAM: {self.available_ram_gb:.1f} GB")
print(f" CPU Cores: {self.physical_cores} physical, {self.cpu_count} logical")
def _optimize_for_device(self):
"""Adjust configuration based on detected resources."""
# Use physical cores for better performance
if self.config.n_threads > self.physical_cores:
self.config.n_threads = max(1, self.physical_cores - 1)
# Reduce context for low memory devices
if self.available_ram_gb < 2:
self.config.n_ctx = min(self.config.n_ctx, 1024)
self.config.n_batch = min(self.config.n_batch, 256)
# Enable mlock only if we have enough RAM
if self.available_ram_gb > 4:
self.config.use_mlock = True
def _load_model(self) -> Llama:
"""Load model with optimized settings."""
print(f"\nLoading model: {self.model_path}")
print(f"Config: {self.config}")
start_time = time.time()
model = Llama(
model_path=self.model_path,
n_ctx=self.config.n_ctx,
n_threads=self.config.n_threads,
n_batch=self.config.n_batch,
n_gpu_layers=self.config.n_gpu_layers,
use_mlock=self.config.use_mlock,
use_mmap=self.config.use_mmap,
verbose=self.config.verbose
)
load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f}s")
return model
def generate(
self,
prompt: str,
max_tokens: int = 100,
temperature: float = 0.7,
top_p: float = 0.9,
stop: list[str] = None
) -> tuple[str, dict]:
"""
Generate text response.
Returns:
Tuple of (response_text, metrics)
"""
# Monitor memory before generation
mem_before = psutil.Process().memory_info().rss / (1024**2)
start_time = time.time()
output = self.model(
prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stop=stop or [],
echo=False
)
generation_time = time.time() - start_time
# Extract response
response = output["choices"][0]["text"]
# Calculate metrics
prompt_tokens = output["usage"]["prompt_tokens"]
completion_tokens = output["usage"]["completion_tokens"]
mem_after = psutil.Process().memory_info().rss / (1024**2)
metrics = {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens,
"generation_time_s": generation_time,
"tokens_per_second": completion_tokens / generation_time if generation_time > 0 else 0,
            "avg_time_per_token_s": generation_time / max(completion_tokens, 1),  # average per-token latency; true TTFT requires streaming
"memory_used_mb": mem_after - mem_before,
"peak_memory_mb": mem_after
}
return response, metrics
def generate_stream(
self,
prompt: str,
max_tokens: int = 100,
temperature: float = 0.7,
top_p: float = 0.9
) -> Generator[str, None, None]:
"""Stream tokens as they're generated."""
for output in self.model(
prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stream=True
):
token = output["choices"][0]["text"]
yield token
def chat(
self,
messages: list[dict],
max_tokens: int = 100,
temperature: float = 0.7
) -> tuple[str, dict]:
"""Chat completion with message history."""
# Format messages into prompt
prompt = self._format_chat(messages)
return self.generate(
prompt,
max_tokens=max_tokens,
temperature=temperature,
stop=["<|endoftext|>", "<|im_end|>", "</s>"]
)
def _format_chat(self, messages: list[dict]) -> str:
"""Format chat messages for the model."""
# ChatML format (works with most models)
formatted = ""
for msg in messages:
role = msg["role"]
content = msg["content"]
formatted += f"<|im_start|>{role}\n{content}<|im_end|>\n"
formatted += "<|im_start|>assistant\n"
return formatted
class EdgeBenchmark:
"""Benchmark edge model performance."""
def __init__(self, model: EdgeLLM):
self.model = model
def run_benchmark(
self,
prompts: list[str] = None,
max_tokens: int = 50,
num_runs: int = 3
) -> dict:
"""Run comprehensive benchmark."""
if prompts is None:
prompts = [
"What is machine learning?",
"Write a Python function to sort a list.",
"Explain quantum computing simply.",
"What are the benefits of edge computing?"
]
results = {
"prompt_results": [],
"system_info": self._get_system_info()
}
print(f"\nRunning benchmark with {len(prompts)} prompts, {num_runs} runs each...")
for i, prompt in enumerate(prompts):
prompt_metrics = []
for run in range(num_runs):
_, metrics = self.model.generate(
prompt,
max_tokens=max_tokens,
temperature=0.1 # Low temp for consistency
)
prompt_metrics.append(metrics)
# Calculate averages
avg_metrics = {
"prompt": prompt[:50] + "...",
"avg_tokens_per_second": sum(m["tokens_per_second"] for m in prompt_metrics) / num_runs,
"avg_generation_time_s": sum(m["generation_time_s"] for m in prompt_metrics) / num_runs,
"avg_memory_mb": sum(m["peak_memory_mb"] for m in prompt_metrics) / num_runs
}
results["prompt_results"].append(avg_metrics)
print(f" [{i+1}/{len(prompts)}] {avg_metrics['avg_tokens_per_second']:.1f} tok/s")
# Calculate overall statistics
all_tps = [r["avg_tokens_per_second"] for r in results["prompt_results"]]
results["summary"] = {
"avg_tokens_per_second": sum(all_tps) / len(all_tps),
"min_tokens_per_second": min(all_tps),
"max_tokens_per_second": max(all_tps),
"total_prompts": len(prompts),
"runs_per_prompt": num_runs
}
return results
def _get_system_info(self) -> dict:
"""Get system information."""
import platform
return {
"platform": platform.system(),
"processor": platform.processor(),
"python_version": platform.python_version(),
"total_ram_gb": psutil.virtual_memory().total / (1024**3),
"cpu_count": psutil.cpu_count()
}
# Example usage
if __name__ == "__main__":
# Download a small GGUF model first:
# ollama pull qwen2.5:0.5b-instruct-q4_K_M
# Then find the model path in ~/.ollama/models/
# Or download directly:
# wget https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf
model_path = "qwen2.5-0.5b-instruct-q4_k_m.gguf"
# Configure for edge device
config = EdgeConfig(
n_ctx=2048,
n_threads=4, # Adjust based on your device
n_batch=256, # Smaller batch for memory efficiency
n_gpu_layers=0, # Pure CPU
use_mlock=False, # Disable for low-memory devices
)
# Initialize model
llm = EdgeLLM(model_path, config)
# Simple generation
prompt = "Explain edge AI in one sentence:"
response, metrics = llm.generate(prompt)
print(f"\nPrompt: {prompt}")
print(f"Response: {response}")
print(f"\nMetrics:")
print(f" Tokens/sec: {metrics['tokens_per_second']:.1f}")
print(f" Memory: {metrics['peak_memory_mb']:.1f} MB")
# Run benchmark
benchmark = EdgeBenchmark(llm)
results = benchmark.run_benchmark()
print(f"\n{'='*50}")
print("Benchmark Summary:")
print(f" Average: {results['summary']['avg_tokens_per_second']:.1f} tok/s")
    print(f" Range: {results['summary']['min_tokens_per_second']:.1f} - {results['summary']['max_tokens_per_second']:.1f} tok/s")
Understanding llama.cpp Configuration:
┌─────────────────────────────────────────────────────────────────────────────┐
│ llama.cpp MEMORY AND THREADING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Memory Options (EdgeConfig): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ use_mmap=True (Memory-Mapped) use_mmap=False (Full Load) │ │
│ │ ┌─────────────────────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Disk ◄──────► RAM (partial) │ │ Disk ──► RAM (full) │ │ │
│ │ │ • Fast startup │ │ • Slower startup │ │ │
│ │ │ • Uses disk I/O during run │ │ • Faster inference │ │ │
│ │ │ • Low memory devices ✓ │ │ • High memory devices ✓ │ │ │
│ │ └─────────────────────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Threading (n_threads): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Device │ Physical Cores │ Recommended n_threads │ │
│ │ ────────────────┼────────────────┼──────────────────────────────── │ │
│ │ Raspberry Pi 4 │ 4 │ 3 (leave 1 for OS) │ │
│ │ M1 Mac │ 8 (4P + 4E) │ 4 (use P-cores only) │ │
│ │ Intel i7 │ 8 │ 7 (leave 1 for OS) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ IMPORTANT: Using logical cores (hyperthreads) hurts performance! │
│ Always use: n_threads = physical_cores - 1 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
EdgeConfig Parameters Explained:
| Parameter | Default | Purpose | Tuning Guidance |
|---|---|---|---|
| n_ctx | 2048 | Context window size | Reduce for low memory (1024 minimum) |
| n_threads | 4 | CPU threads for compute | Set to physical_cores - 1 |
| n_batch | 512 | Tokens processed per batch | Reduce to 256 for low memory |
| n_gpu_layers | 0 | Layers offloaded to GPU | Keep 0 for pure CPU edge devices |
| use_mlock | False | Lock model in RAM | Enable only with 4GB+ available |
| use_mmap | True | Memory-mapped file access | Keep True for edge devices |
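The tuning guidance in this table is the same logic `_optimize_for_device` applies above. Pulled out as a pure function (the name `tune_config` is ours, for illustration), it is easy to unit-test against the recommendations:

```python
def tune_config(physical_cores: int, available_ram_gb: float) -> dict:
    """Derive llama.cpp settings from detected resources, mirroring the
    table above: n_threads = physical_cores - 1, smaller context/batch
    under 2GB RAM, mlock only with comfortable headroom."""
    cfg = {
        "n_ctx": 2048,
        "n_batch": 512,
        "n_threads": max(1, physical_cores - 1),  # leave one core for the OS
        "n_gpu_layers": 0,                        # pure CPU on edge devices
        "use_mmap": True,                         # page weights in on demand
        "use_mlock": available_ram_gb > 4,        # pin only with 4GB+ free
    }
    if available_ram_gb < 2:  # e.g. Raspberry Pi 4 under load
        cfg["n_ctx"] = 1024
        cfg["n_batch"] = 256
    return cfg
```

For example, a Raspberry Pi 4 (4 physical cores, ~1.5GB free) gets 3 threads, a 1024-token context, and no mlock.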
★ Insight ─────────────────────────────────────
Memory-Mapped Loading: The use_mmap=True option is crucial for edge devices. Instead of loading the entire model into RAM, it maps the file directly from disk, allowing the OS to page in model weights as needed. This dramatically reduces startup memory requirements but may slightly increase inference latency on first access.
─────────────────────────────────────────────────
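The paging behavior described in the insight is standard OS memory mapping, not llama.cpp magic, and can be illustrated with Python's stdlib `mmap` on a dummy file: the mapping is created instantly, and only the pages actually touched are brought into RAM.

```python
import mmap
import os
import tempfile

# Create a dummy 1 MiB "weights" file to stand in for a GGUF model.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (1 << 20))
    path = f.name

with open(path, "rb") as f:
    # Map the file instead of reading it: near-instant, regardless of size.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_byte = mm[0]           # touching a byte pages in only that region
    last_byte = mm[len(mm) - 1]  # untouched middle pages never hit RAM
    mm.close()

os.remove(path)
```

This is why `use_mmap=True` makes llama.cpp startup cheap on low-memory devices, at the cost of disk I/O the first time each region of the model is accessed.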
Part 3: WebLLM for Browser Deployment
Run SLMs directly in web browsers using WebGPU.
// web/webllm-demo.ts
/**
* WebLLM Browser Deployment
*
* This demonstrates running SLMs in the browser using WebGPU.
* The model runs entirely client-side - no server needed!
*/
import * as webllm from "@mlc-ai/web-llm";
interface GenerationConfig {
maxTokens: number;
temperature: number;
topP: number;
}
interface ChatMessage {
role: "system" | "user" | "assistant";
content: string;
}
class BrowserLLM {
private engine: webllm.MLCEngine | null = null;
private modelId: string;
private initProgress: ((progress: string) => void) | null = null;
// Recommended models for browser deployment
static RECOMMENDED_MODELS = {
tiny: "SmolLM-135M-Instruct-q4f16_1-MLC", // ~100MB, fastest
small: "Qwen2.5-0.5B-Instruct-q4f16_1-MLC", // ~350MB, good balance
medium: "Phi-3-mini-4k-instruct-q4f16_1-MLC", // ~2GB, best quality
};
constructor(
modelId: string = BrowserLLM.RECOMMENDED_MODELS.small,
onProgress?: (progress: string) => void
) {
this.modelId = modelId;
this.initProgress = onProgress || null;
}
async initialize(): Promise<void> {
console.log(`Initializing WebLLM with model: ${this.modelId}`);
// Check WebGPU support
if (!navigator.gpu) {
throw new Error(
"WebGPU not supported. Please use Chrome 113+, Edge 113+, or Firefox Nightly."
);
}
// Initialize engine with progress callback
this.engine = await webllm.CreateMLCEngine(this.modelId, {
initProgressCallback: (progress) => {
const message = `Loading: ${progress.text}`;
console.log(message);
if (this.initProgress) {
this.initProgress(message);
}
},
});
console.log("Model loaded successfully!");
}
async generate(
prompt: string,
config: Partial<GenerationConfig> = {}
): Promise<string> {
if (!this.engine) {
throw new Error("Engine not initialized. Call initialize() first.");
}
const fullConfig: GenerationConfig = {
maxTokens: config.maxTokens || 100,
temperature: config.temperature || 0.7,
topP: config.topP || 0.9,
};
const response = await this.engine.chat.completions.create({
messages: [{ role: "user", content: prompt }],
max_tokens: fullConfig.maxTokens,
temperature: fullConfig.temperature,
top_p: fullConfig.topP,
});
return response.choices[0].message.content || "";
}
async chat(
messages: ChatMessage[],
config: Partial<GenerationConfig> = {}
): Promise<string> {
if (!this.engine) {
throw new Error("Engine not initialized. Call initialize() first.");
}
const response = await this.engine.chat.completions.create({
messages: messages,
max_tokens: config.maxTokens || 100,
temperature: config.temperature || 0.7,
top_p: config.topP || 0.9,
});
return response.choices[0].message.content || "";
}
async *generateStream(
prompt: string,
config: Partial<GenerationConfig> = {}
): AsyncGenerator<string> {
if (!this.engine) {
throw new Error("Engine not initialized. Call initialize() first.");
}
const stream = await this.engine.chat.completions.create({
messages: [{ role: "user", content: prompt }],
max_tokens: config.maxTokens || 100,
temperature: config.temperature || 0.7,
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || "";
if (content) {
yield content;
}
}
}
async getStats(): Promise<object> {
if (!this.engine) {
return { status: "not initialized" };
}
return await this.engine.runtimeStatsText();
}
async unload(): Promise<void> {
if (this.engine) {
await this.engine.unload();
this.engine = null;
}
}
}
// HTML/React component example
const WebLLMDemo = `
<!DOCTYPE html>
<html>
<head>
<title>WebLLM Edge Demo</title>
<script type="module">
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
let engine = null;
async function initModel() {
const status = document.getElementById("status");
const modelSelect = document.getElementById("model");
const modelId = modelSelect.value;
status.textContent = "Checking WebGPU support...";
if (!navigator.gpu) {
status.textContent = "WebGPU not supported! Use Chrome 113+ or Edge 113+";
return;
}
status.textContent = "Loading model (this may take a few minutes)...";
try {
engine = await webllm.CreateMLCEngine(modelId, {
initProgressCallback: (progress) => {
status.textContent = progress.text;
}
});
status.textContent = "Model ready!";
document.getElementById("generate-btn").disabled = false;
} catch (error) {
status.textContent = "Error: " + error.message;
}
}
async function generate() {
if (!engine) return;
const prompt = document.getElementById("prompt").value;
const output = document.getElementById("output");
const stats = document.getElementById("stats");
output.textContent = "";
const startTime = performance.now();
let tokenCount = 0;
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: prompt }],
max_tokens: 100,
temperature: 0.7,
stream: true
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || "";
output.textContent += content;
tokenCount++;
}
const elapsed = (performance.now() - startTime) / 1000;
stats.textContent = \`Generated \${tokenCount} tokens in \${elapsed.toFixed(2)}s (\${(tokenCount/elapsed).toFixed(1)} tok/s)\`;
}
window.initModel = initModel;
window.generate = generate;
</script>
<style>
body { font-family: system-ui; max-width: 800px; margin: 40px auto; padding: 20px; }
select, textarea, button { width: 100%; padding: 10px; margin: 10px 0; }
#output { background: #f5f5f5; padding: 15px; min-height: 100px; white-space: pre-wrap; }
#status { color: #666; }
#stats { color: #0066cc; font-size: 14px; }
</style>
</head>
<body>
<h1>WebLLM Edge Demo</h1>
<p>Run language models directly in your browser using WebGPU!</p>
<label>Model:</label>
<select id="model">
<option value="SmolLM-135M-Instruct-q4f16_1-MLC">SmolLM 135M (Tiny, ~100MB)</option>
<option value="Qwen2.5-0.5B-Instruct-q4f16_1-MLC" selected>Qwen2.5 0.5B (Small, ~350MB)</option>
<option value="Phi-3-mini-4k-instruct-q4f16_1-MLC">Phi-3 Mini (Medium, ~2GB)</option>
</select>
<button onclick="initModel()">Load Model</button>
<p id="status">Click "Load Model" to start</p>
<label>Prompt:</label>
<textarea id="prompt" rows="3">Explain edge computing in simple terms:</textarea>
<button id="generate-btn" onclick="generate()" disabled>Generate</button>
<h3>Output:</h3>
<div id="output"></div>
<p id="stats"></p>
</body>
</html>
`;
export { BrowserLLM, WebLLMDemo };
Understanding WebLLM Browser Deployment:
┌─────────────────────────────────────────────────────────────────────────────┐
│ HOW WEBLLM RUNS MODELS IN THE BROWSER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Server-Based: WebLLM Client-Side: │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Browser ──► Server ──► GPU │ │ Browser (WebGPU) │ │
│ │ ↑ │ │ │ ┌─────────────────────────────┐ │ │
│ │ └───────────┘ │ │ │ Model runs entirely here │ │ │
│ │ (Network latency, privacy?) │ │ │ • Zero server calls │ │ │
│ └─────────────────────────────┘ │ │ • Data never leaves device │ │ │
│ │ └─────────────────────────────┘ │ │
│ └─────────────────────────────────┘ │
│ │
│ WebGPU Pipeline: │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Download │───►│ Compile │───►│ Load to │───►│ Run │ │
│ │ WASM + │ │ Shaders │ │ GPU VRAM │ │ Inference │ │
│ │ Weights │ │ (cached) │ │ │ │ │ │
│ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ First Load: Subsequent: │
│ ~2-5 min ~10-30 sec │
│ (one-time) (from cache) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
WebLLM Model Size vs Browser Constraints:
| Model | Download Size | VRAM Needed | Browser Tab Limit | Suitable? |
|---|---|---|---|---|
| SmolLM-135M | ~100MB | ~200MB | 4GB+ | Excellent |
| Qwen2.5-0.5B | ~350MB | ~500MB | 4GB+ | Good |
| Phi-3-mini | ~2GB | ~3GB | 8GB+ | Marginal |
| Llama-3.2-3B | ~2.5GB | ~4GB | 8GB+ | Risky |
Browser Compatibility:
- Chrome 113+, Edge 113+ → Full WebGPU support
- Firefox Nightly → Experimental WebGPU
- Safari 18+ → WebGPU (limited)
- Mobile browsers → Generally not supported yet
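As a rough planning aid, the table above can be encoded as a quick suitability check. This is a sketch: the sizes are copied from the table, and the 1.5× headroom factor is an assumption to cover KV cache and shader buffers, not a measured value.

```python
# Rough client-side suitability check based on the table above.
# VRAM figures are the approximate values listed there; treat them
# as assumptions -- actual usage varies by browser and driver.
MODELS = {
    "SmolLM-135M": {"download_mb": 100, "vram_mb": 200},
    "Qwen2.5-0.5B": {"download_mb": 350, "vram_mb": 500},
    "Phi-3-mini": {"download_mb": 2048, "vram_mb": 3072},
    "Llama-3.2-3B": {"download_mb": 2560, "vram_mb": 4096},
}

def suitable_models(available_vram_mb: int, headroom: float = 1.5) -> list[str]:
    """Return models whose estimated VRAM (with headroom for KV cache
    and shader buffers) fits within the given budget."""
    return [
        name for name, spec in MODELS.items()
        if spec["vram_mb"] * headroom <= available_vram_mb
    ]

print(suitable_models(4096))  # → ['SmolLM-135M', 'Qwen2.5-0.5B']
```

In a real page you would feed this from `navigator.gpu` adapter limits; here the budget is passed in directly.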
# web/serve_webllm.py
"""
Simple server to host the WebLLM demo.
"""
from http.server import HTTPServer, SimpleHTTPRequestHandler
class CORSHandler(SimpleHTTPRequestHandler):
"""Handler with CORS support for WebLLM."""
def end_headers(self):
# Required headers for SharedArrayBuffer (needed by WebLLM)
self.send_header('Cross-Origin-Opener-Policy', 'same-origin')
self.send_header('Cross-Origin-Embedder-Policy', 'require-corp')
self.send_header('Access-Control-Allow-Origin', '*')
super().end_headers()
def serve(port: int = 8080):
"""Start the server."""
server = HTTPServer(('localhost', port), CORSHandler)
print(f"Serving WebLLM demo at http://localhost:{port}")
print("Press Ctrl+C to stop")
server.serve_forever()
if __name__ == "__main__":
serve()
★ Insight ─────────────────────────────────────
WebGPU Requirements: WebLLM requires specific CORS headers (Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp) for SharedArrayBuffer support. Without these, the browser can't allocate the shared memory needed for GPU operations. Always test locally with a proper server, not by opening HTML files directly.
─────────────────────────────────────────────────
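The header requirement above can be verified locally with a short stdlib-only check. This sketch mirrors the `CORSHandler` from `serve_webllm.py`; binding to port 0 (so the OS picks a free port) is an assumption for the demo, not part of the original server.

```python
# Verify that a local server emits the COOP/COEP headers WebLLM needs
# for SharedArrayBuffer. Stdlib only; port 0 lets the OS pick a port.
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler

class CORSHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header('Cross-Origin-Opener-Policy', 'same-origin')
        self.send_header('Cross-Origin-Embedder-Policy', 'require-corp')
        super().end_headers()

server = HTTPServer(('localhost', 0), CORSHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

resp = urllib.request.urlopen(f"http://localhost:{server.server_port}/")
assert resp.headers['Cross-Origin-Opener-Policy'] == 'same-origin'
assert resp.headers['Cross-Origin-Embedder-Policy'] == 'require-corp'
server.shutdown()
print("cross-origin isolation headers present")
```

If either assertion fails in your deployment, the browser will report `crossOriginIsolated === false` and WebLLM initialization will fail.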
Part 4: CoreML for Apple Devices
Export models to CoreML for optimized inference on Apple Neural Engine.
# export/coreml_export.py
"""
Export models to CoreML for iOS/macOS deployment.
Note: Requires macOS with coremltools installed.
"""
from pathlib import Path
import torch
import numpy as np
def check_coreml_available() -> bool:
"""Check if CoreML tools are available."""
try:
import coremltools
return True
except ImportError:
return False
class CoreMLExporter:
"""
Export HuggingFace models to CoreML format.
CoreML models can leverage Apple's Neural Engine for
efficient on-device inference on iOS/macOS.
"""
def __init__(self, model_name: str, output_dir: str = "./coreml_models"):
if not check_coreml_available():
raise ImportError(
"coremltools not installed. Install with: pip install coremltools"
)
self.model_name = model_name
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
def export_model(
self,
sequence_length: int = 512,
compute_units: str = "ALL" # ALL, CPU_ONLY, CPU_AND_GPU, CPU_AND_NE
) -> Path:
"""
Export model to CoreML format.
Args:
sequence_length: Fixed sequence length for the model
compute_units: Target compute units
Returns:
Path to exported .mlpackage
"""
import coremltools as ct
from transformers import AutoTokenizer, AutoModelForCausalLM
print(f"Loading model: {self.model_name}")
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
model = AutoModelForCausalLM.from_pretrained(
self.model_name,
torch_dtype=torch.float32,
trust_remote_code=True
)
model.eval()
# Create sample input
sample_text = "Hello, how are you?"
inputs = tokenizer(
sample_text,
return_tensors="pt",
max_length=sequence_length,
padding="max_length",
truncation=True
)
# Trace the model
print("Tracing model...")
class ModelWrapper(torch.nn.Module):
def __init__(self, model):
super().__init__()
self.model = model
def forward(self, input_ids, attention_mask):
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
return_dict=True
)
return outputs.logits
wrapped_model = ModelWrapper(model)
traced_model = torch.jit.trace(
wrapped_model,
(inputs["input_ids"], inputs["attention_mask"])
)
# Convert to CoreML
print("Converting to CoreML...")
mlmodel = ct.convert(
traced_model,
inputs=[
ct.TensorType(
name="input_ids",
shape=(1, sequence_length),
dtype=np.int32
),
ct.TensorType(
name="attention_mask",
shape=(1, sequence_length),
dtype=np.int32
)
],
outputs=[
ct.TensorType(name="logits")
],
compute_units=getattr(ct.ComputeUnit, compute_units),
minimum_deployment_target=ct.target.iOS16
)
# Save model
output_path = self.output_dir / f"{self.model_name.replace('/', '_')}.mlpackage"
mlmodel.save(str(output_path))
print(f"Model saved to: {output_path}")
# Save tokenizer for inference
tokenizer.save_pretrained(self.output_dir / "tokenizer")
return output_path
class CoreMLInference:
"""Run inference with CoreML models on macOS."""
def __init__(self, model_path: str, tokenizer_path: str):
if not check_coreml_available():
raise ImportError("coremltools not installed")
import coremltools as ct
from transformers import AutoTokenizer
self.model = ct.models.MLModel(model_path)
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
# Get sequence length from model spec
self.sequence_length = 512 # Default, should match export
def generate(
self,
prompt: str,
max_new_tokens: int = 50,
temperature: float = 0.7
) -> str:
"""Generate text using CoreML model."""
import coremltools as ct
# Tokenize input
inputs = self.tokenizer(
prompt,
return_tensors="np",
max_length=self.sequence_length,
padding="max_length",
truncation=True
)
generated_tokens = []
input_ids = inputs["input_ids"].astype(np.int32)
attention_mask = inputs["attention_mask"].astype(np.int32)
# Get position of last real token
current_pos = int(attention_mask.sum()) - 1
for _ in range(max_new_tokens):
# Run inference
output = self.model.predict({
"input_ids": input_ids,
"attention_mask": attention_mask
})
logits = output["logits"][0, current_pos, :]
# Sample next token
if temperature > 0:
probs = self._softmax(logits / temperature)
next_token = np.random.choice(len(probs), p=probs)
else:
next_token = np.argmax(logits)
generated_tokens.append(next_token)
# Check for end of sequence
if next_token == self.tokenizer.eos_token_id:
break
# Update inputs for next iteration
current_pos += 1
if current_pos < self.sequence_length:
input_ids[0, current_pos] = next_token
attention_mask[0, current_pos] = 1
else:
# Shift window
input_ids[0, :-1] = input_ids[0, 1:]
input_ids[0, -1] = next_token
current_pos = self.sequence_length - 1
return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
@staticmethod
def _softmax(x):
exp_x = np.exp(x - np.max(x))
return exp_x / exp_x.sum()
# Swift code for iOS integration
SWIFT_INTEGRATION = '''
// EdgeLLM.swift
// iOS/macOS integration for CoreML language models
import CoreML
import NaturalLanguage
class EdgeLLM {
private var model: MLModel?
private var tokenizer: NLTokenizer
init(modelPath: String) throws {
let modelURL = URL(fileURLWithPath: modelPath)
self.model = try MLModel(contentsOf: modelURL)
self.tokenizer = NLTokenizer(unit: .word)
}
func generate(prompt: String, maxTokens: Int = 50) async throws -> String {
guard let model = model else {
throw EdgeLLMError.modelNotLoaded
}
// Tokenize input
tokenizer.string = prompt
var tokens: [Int] = []
tokenizer.enumerateTokens(in: prompt.startIndex..<prompt.endIndex) { range, _ in
// Convert token to ID (simplified)
tokens.append(prompt[range].hashValue % 50000)
return true
}
// Pad to sequence length
let sequenceLength = 512
while tokens.count < sequenceLength {
tokens.append(0)
}
// Create input
let inputArray = try MLMultiArray(shape: [1, NSNumber(value: sequenceLength)], dataType: .int32)
for (i, token) in tokens.enumerated() {
inputArray[i] = NSNumber(value: token)
}
// Run inference (simplified: a real implementation would pass a
// separate attention mask of 1s for real tokens and 0s for padding,
// rather than reusing the input_ids array)
let input = try MLDictionaryFeatureProvider(dictionary: [
"input_ids": MLFeatureValue(multiArray: inputArray),
"attention_mask": MLFeatureValue(multiArray: inputArray)
])
let output = try model.prediction(from: input)
// Decode output (simplified)
return "Generated text from CoreML model"
}
}
enum EdgeLLMError: Error {
case modelNotLoaded
case tokenizationFailed
case generationFailed
}
'''
if __name__ == "__main__":
if check_coreml_available():
# Export a small model
exporter = CoreMLExporter("HuggingFaceTB/SmolLM-135M-Instruct")
model_path = exporter.export_model(
sequence_length=512,
compute_units="ALL"
)
print(f"Exported to: {model_path}")
else:
print("CoreML tools not available (requires macOS)")
print("\nSwift integration code:")
print(SWIFT_INTEGRATION)
Part 5: Cross-Platform Benchmark Suite
Compare performance across deployment targets.
# benchmark/cross_platform.py
"""
Cross-platform benchmark suite for edge deployments.
"""
import json
import time
import platform
from dataclasses import dataclass, asdict
from typing import Optional
from pathlib import Path
import psutil
@dataclass
class BenchmarkResult:
"""Results from a single benchmark run."""
platform: str
runtime: str
model_name: str
model_size_mb: float
quantization: str
# Performance metrics
tokens_per_second: float
time_to_first_token_ms: float
total_generation_time_s: float
# Resource usage
peak_memory_mb: float
avg_cpu_percent: float
# Quality (optional)
output_quality_score: Optional[float] = None
def to_dict(self) -> dict:
return asdict(self)
class CrossPlatformBenchmark:
"""
Run benchmarks across different deployment configurations.
"""
STANDARD_PROMPTS = [
"Explain machine learning in one sentence.",
"Write a Python function to check if a number is prime.",
"What are the benefits of edge computing?",
"Summarize the key features of transformers.",
]
def __init__(self, output_dir: str = "./benchmark_results"):
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
self.results: list[BenchmarkResult] = []
def benchmark_llamacpp(
self,
model_path: str,
model_name: str,
quantization: str,
n_threads: int = 4
) -> BenchmarkResult:
"""Benchmark llama.cpp model."""
from llama_cpp import Llama
print(f"\nBenchmarking llama.cpp: {model_name}")
# Get model size
model_size_mb = Path(model_path).stat().st_size / (1024 * 1024)
# Load model
model = Llama(
model_path=model_path,
n_ctx=2048,
n_threads=n_threads,
verbose=False
)
# Run benchmark
all_tps = []
all_ttft = []
all_times = []
all_memory = []
for prompt in self.STANDARD_PROMPTS:
mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
start_time = time.time()
output = model(prompt, max_tokens=50, temperature=0.1)
total_time = time.time() - start_time
mem_after = psutil.Process().memory_info().rss / (1024 * 1024)
tokens = output["usage"]["completion_tokens"]
tps = tokens / total_time if total_time > 0 else 0
# Approximation: average per-token latency. Measuring true TTFT
# requires streaming and timing the first token separately.
ttft = (total_time / tokens * 1000) if tokens > 0 else 0
all_tps.append(tps)
all_ttft.append(ttft)
all_times.append(total_time)
all_memory.append(mem_after - mem_before)
result = BenchmarkResult(
platform=platform.system(),
runtime="llama.cpp",
model_name=model_name,
model_size_mb=model_size_mb,
quantization=quantization,
tokens_per_second=sum(all_tps) / len(all_tps),
time_to_first_token_ms=sum(all_ttft) / len(all_ttft),
total_generation_time_s=sum(all_times),
peak_memory_mb=max(all_memory),
avg_cpu_percent=psutil.cpu_percent(interval=0.1)
)
self.results.append(result)
return result
def benchmark_onnx(
self,
model_path: str,
model_name: str
) -> BenchmarkResult:
"""Benchmark ONNX Runtime model."""
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
print(f"\nBenchmarking ONNX: {model_name}")
# Load model
model = ORTModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Get model size
model_files = list(Path(model_path).glob("*.onnx"))
model_size_mb = sum(f.stat().st_size for f in model_files) / (1024 * 1024)
# Run benchmark
all_tps = []
all_ttft = []
all_times = []
all_memory = []
for prompt in self.STANDARD_PROMPTS:
inputs = tokenizer(prompt, return_tensors="pt")
input_length = inputs["input_ids"].shape[1]
mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
start_time = time.time()
outputs = model.generate(
**inputs,
max_new_tokens=50,
temperature=0.1,
do_sample=True,
pad_token_id=tokenizer.pad_token_id
)
total_time = time.time() - start_time
mem_after = psutil.Process().memory_info().rss / (1024 * 1024)
tokens = len(outputs[0]) - input_length
tps = tokens / total_time if total_time > 0 else 0
# Approximation: average per-token latency, as in benchmark_llamacpp.
ttft = (total_time / tokens * 1000) if tokens > 0 else 0
all_tps.append(tps)
all_ttft.append(ttft)
all_times.append(total_time)
all_memory.append(mem_after - mem_before)
result = BenchmarkResult(
platform=platform.system(),
runtime="ONNX Runtime",
model_name=model_name,
model_size_mb=model_size_mb,
quantization="FP32",
tokens_per_second=sum(all_tps) / len(all_tps),
time_to_first_token_ms=sum(all_ttft) / len(all_ttft),
total_generation_time_s=sum(all_times),
peak_memory_mb=max(all_memory),
avg_cpu_percent=psutil.cpu_percent(interval=0.1)
)
self.results.append(result)
return result
def generate_report(self) -> dict:
"""Generate comparison report."""
if not self.results:
return {"error": "No benchmark results"}
report = {
"summary": {
"total_benchmarks": len(self.results),
"platforms_tested": list(set(r.platform for r in self.results)),
"runtimes_tested": list(set(r.runtime for r in self.results)),
},
"results": [r.to_dict() for r in self.results],
"rankings": {
"by_speed": sorted(
[r.to_dict() for r in self.results],
key=lambda x: x["tokens_per_second"],
reverse=True
),
"by_memory": sorted(
[r.to_dict() for r in self.results],
key=lambda x: x["peak_memory_mb"]
),
"by_latency": sorted(
[r.to_dict() for r in self.results],
key=lambda x: x["time_to_first_token_ms"]
)
}
}
# Save report
report_path = self.output_dir / f"benchmark_report_{int(time.time())}.json"
with open(report_path, "w") as f:
json.dump(report, f, indent=2)
print(f"\nReport saved to: {report_path}")
return report
def print_comparison_table(self):
"""Print results as a comparison table."""
if not self.results:
print("No results to display")
return
print("\n" + "=" * 80)
print("CROSS-PLATFORM BENCHMARK RESULTS")
print("=" * 80)
# Header
print(f"{'Runtime':<15} {'Model':<25} {'Quant':<8} {'Tok/s':<8} {'TTFT(ms)':<10} {'Mem(MB)':<10}")
print("-" * 80)
# Results
for r in sorted(self.results, key=lambda x: x.tokens_per_second, reverse=True):
print(
f"{r.runtime:<15} "
f"{r.model_name[:23]:<25} "
f"{r.quantization:<8} "
f"{r.tokens_per_second:<8.1f} "
f"{r.time_to_first_token_ms:<10.1f} "
f"{r.peak_memory_mb:<10.1f}"
)
print("=" * 80)
# Visualization
def visualize_benchmarks(results: list[BenchmarkResult]):
"""Create visualization of benchmark results."""
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(
rows=2, cols=2,
subplot_titles=(
"Tokens per Second",
"Time to First Token (ms)",
"Peak Memory (MB)",
"Speed vs Memory Tradeoff"
)
)
labels = [f"{r.runtime}\n{r.model_name[:15]}" for r in results]
# Tokens per second
fig.add_trace(
go.Bar(x=labels, y=[r.tokens_per_second for r in results], name="Tok/s"),
row=1, col=1
)
# Time to first token
fig.add_trace(
go.Bar(x=labels, y=[r.time_to_first_token_ms for r in results], name="TTFT"),
row=1, col=2
)
# Memory usage
fig.add_trace(
go.Bar(x=labels, y=[r.peak_memory_mb for r in results], name="Memory"),
row=2, col=1
)
# Speed vs Memory scatter
fig.add_trace(
go.Scatter(
x=[r.peak_memory_mb for r in results],
y=[r.tokens_per_second for r in results],
mode="markers+text",
text=[r.model_name[:10] for r in results],
textposition="top center",
name="Speed vs Memory"
),
row=2, col=2
)
fig.update_layout(height=800, title="Edge Deployment Benchmark Comparison")
fig.write_html("benchmark_comparison.html")
print("Visualization saved to benchmark_comparison.html")
if __name__ == "__main__":
benchmark = CrossPlatformBenchmark()
# Run benchmarks (paths need to be configured)
# benchmark.benchmark_llamacpp(
# "qwen2.5-0.5b-instruct-q4_k_m.gguf",
# "Qwen2.5-0.5B",
# "Q4_K_M"
# )
# benchmark.benchmark_onnx(
# "./onnx_models/SmolLM-135M-Instruct",
# "SmolLM-135M"
# )
# Generate report
# report = benchmark.generate_report()
# benchmark.print_comparison_table()
print("Configure model paths and run benchmarks")
Part 6: Edge Deployment FastAPI Server
Build a unified API that supports multiple backends.
# server/edge_server.py
"""
FastAPI server for edge deployment with multiple backend support.
"""
import os
import time
from typing import Optional, Literal
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
import psutil
# Request/Response models
class GenerateRequest(BaseModel):
prompt: str
max_tokens: int = Field(default=100, le=500)
temperature: float = Field(default=0.7, ge=0, le=2)
top_p: float = Field(default=0.9, ge=0, le=1)
stream: bool = False
class GenerateResponse(BaseModel):
text: str
tokens_generated: int
generation_time_s: float
tokens_per_second: float
backend: str
class ChatMessage(BaseModel):
role: Literal["system", "user", "assistant"]
content: str
class ChatRequest(BaseModel):
messages: list[ChatMessage]
max_tokens: int = Field(default=100, le=500)
temperature: float = Field(default=0.7, ge=0, le=2)
stream: bool = False
class SystemInfo(BaseModel):
platform: str
cpu_count: int
total_memory_gb: float
available_memory_gb: float
backend: str
model_name: str
# Backend abstraction
class EdgeBackend:
"""Abstract base for edge backends."""
def __init__(self, model_path: str):
self.model_path = model_path
self.model_name = os.path.basename(model_path)
def generate(self, prompt: str, max_tokens: int, temperature: float, top_p: float) -> tuple[str, dict]:
raise NotImplementedError
def generate_stream(self, prompt: str, max_tokens: int, temperature: float, top_p: float):
raise NotImplementedError
@property
def backend_name(self) -> str:
raise NotImplementedError
class LlamaCppBackend(EdgeBackend):
"""llama.cpp backend for edge deployment."""
def __init__(self, model_path: str, n_threads: int = 4, n_ctx: int = 2048):
super().__init__(model_path)
from llama_cpp import Llama
self.model = Llama(
model_path=model_path,
n_ctx=n_ctx,
n_threads=n_threads,
verbose=False
)
@property
def backend_name(self) -> str:
return "llama.cpp"
def generate(self, prompt: str, max_tokens: int, temperature: float, top_p: float) -> tuple[str, dict]:
start_time = time.time()
output = self.model(
prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
echo=False
)
generation_time = time.time() - start_time
text = output["choices"][0]["text"]
tokens = output["usage"]["completion_tokens"]
metrics = {
"tokens_generated": tokens,
"generation_time_s": generation_time,
"tokens_per_second": tokens / generation_time if generation_time > 0 else 0
}
return text, metrics
def generate_stream(self, prompt: str, max_tokens: int, temperature: float, top_p: float):
for output in self.model(
prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stream=True
):
yield output["choices"][0]["text"]
class ONNXBackend(EdgeBackend):
"""ONNX Runtime backend."""
def __init__(self, model_path: str):
super().__init__(model_path)
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
self.model = ORTModelForCausalLM.from_pretrained(model_path)
self.tokenizer = AutoTokenizer.from_pretrained(model_path)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
@property
def backend_name(self) -> str:
return "ONNX Runtime"
def generate(self, prompt: str, max_tokens: int, temperature: float, top_p: float) -> tuple[str, dict]:
inputs = self.tokenizer(prompt, return_tensors="pt")
input_length = inputs["input_ids"].shape[1]
start_time = time.time()
outputs = self.model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=temperature if temperature > 0 else 1.0,
top_p=top_p,
do_sample=temperature > 0,
pad_token_id=self.tokenizer.pad_token_id
)
generation_time = time.time() - start_time
generated_tokens = outputs[0][input_length:]
text = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
tokens = len(generated_tokens)
metrics = {
"tokens_generated": tokens,
"generation_time_s": generation_time,
"tokens_per_second": tokens / generation_time if generation_time > 0 else 0
}
return text, metrics
def generate_stream(self, prompt: str, max_tokens: int, temperature: float, top_p: float):
# This backend does not implement token-level streaming;
# generate the full text, then yield it character by character.
text, _ = self.generate(prompt, max_tokens, temperature, top_p)
for char in text:
yield char
# Global backend instance
backend: Optional[EdgeBackend] = None
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Initialize backend on startup."""
global backend
# Configuration from environment
backend_type = os.getenv("EDGE_BACKEND", "llamacpp")
model_path = os.getenv("EDGE_MODEL_PATH", "qwen2.5-0.5b-instruct-q4_k_m.gguf")
n_threads = int(os.getenv("EDGE_THREADS", "4"))
print(f"Initializing {backend_type} backend with {model_path}")
if backend_type == "llamacpp":
backend = LlamaCppBackend(model_path, n_threads=n_threads)
elif backend_type == "onnx":
backend = ONNXBackend(model_path)
else:
raise ValueError(f"Unknown backend: {backend_type}")
print(f"Backend ready: {backend.backend_name}")
yield
# Cleanup
backend = None
app = FastAPI(
title="Edge LLM Server",
description="Lightweight LLM server for edge deployment",
version="1.0.0",
lifespan=lifespan
)
@app.get("/health")
async def health_check():
"""Health check endpoint."""
return {
"status": "healthy",
"backend": backend.backend_name if backend else "not initialized"
}
@app.get("/info", response_model=SystemInfo)
async def system_info():
"""Get system information."""
import platform
mem = psutil.virtual_memory()
return SystemInfo(
platform=platform.system(),
cpu_count=psutil.cpu_count(),
total_memory_gb=mem.total / (1024**3),
available_memory_gb=mem.available / (1024**3),
backend=backend.backend_name if backend else "not initialized",
model_name=backend.model_name if backend else "none"
)
@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
"""Generate text from prompt."""
if not backend:
raise HTTPException(status_code=503, detail="Backend not initialized")
if request.stream:
async def stream_generator():
for token in backend.generate_stream(
request.prompt,
request.max_tokens,
request.temperature,
request.top_p
):
yield f"data: {token}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(
stream_generator(),
media_type="text/event-stream"
)
text, metrics = backend.generate(
request.prompt,
request.max_tokens,
request.temperature,
request.top_p
)
return GenerateResponse(
text=text,
tokens_generated=metrics["tokens_generated"],
generation_time_s=metrics["generation_time_s"],
tokens_per_second=metrics["tokens_per_second"],
backend=backend.backend_name
)
@app.post("/chat")
async def chat(request: ChatRequest):
"""Chat completion endpoint."""
if not backend:
raise HTTPException(status_code=503, detail="Backend not initialized")
# Format messages as prompt
prompt = ""
for msg in request.messages:
prompt += f"<|im_start|>{msg.role}\n{msg.content}<|im_end|>\n"
prompt += "<|im_start|>assistant\n"
text, metrics = backend.generate(
prompt,
request.max_tokens,
request.temperature,
0.9 # Fixed top_p for chat
)
# Clean up response
text = text.split("<|im_end|>")[0].strip()
return {
"message": {"role": "assistant", "content": text},
"usage": metrics,
"backend": backend.backend_name
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(
app,
host="0.0.0.0",
port=8000,
workers=1 # Single worker for edge devices
)
Docker Configuration
# Dockerfile.edge
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Environment variables
ENV EDGE_BACKEND=llamacpp
ENV EDGE_MODEL_PATH=/models/model.gguf
ENV EDGE_THREADS=4
# Expose port
EXPOSE 8000
# Run server
CMD ["python", "server/edge_server.py"]
# docker-compose.yml
version: '3.8'
services:
edge-llm:
build:
context: .
dockerfile: Dockerfile.edge
ports:
- "8000:8000"
volumes:
- ./models:/models:ro
environment:
- EDGE_BACKEND=llamacpp
- EDGE_MODEL_PATH=/models/qwen2.5-0.5b-instruct-q4_k_m.gguf
- EDGE_THREADS=4
deploy:
resources:
limits:
memory: 2G
cpus: '2'
restart: unless-stopped
Exercises
Exercise 1: Model Size Optimization
Export the same model at different quantization levels (Q2_K, Q4_K_M, Q8_0) and measure:
- Model file size
- Memory usage during inference
- Token generation speed
- Output quality (subjective evaluation)
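A starting point for this exercise is to predict file sizes before exporting. The bits-per-weight figures below are rough assumptions (actual values vary with llama.cpp version and per-tensor quantization mixes), so verify them against the files you actually produce.

```python
# Estimate GGUF file sizes from approximate bits-per-weight (bpw).
# The bpw values are rough assumptions for illustration only.
APPROX_BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def estimated_size_mb(n_params: float, quant: str) -> float:
    """Estimated file size in MB for a model with n_params weights."""
    return n_params * APPROX_BPW[quant] / 8 / (1024 ** 2)

for quant in ("Q2_K", "Q4_K_M", "Q8_0"):
    size = estimated_size_mb(0.5e9, quant)  # a 0.5B-parameter model
    print(f"{quant:8s} ~{size:,.0f} MB")
```

Comparing these predictions against measured sizes (and against the quality of each model's outputs) is the core of the exercise.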
Exercise 2: Browser Deployment
Deploy a model using WebLLM and measure:
- Initial load time
- First inference latency
- Sustained generation speed
- Memory usage in browser
Exercise 3: Raspberry Pi Deployment
Deploy an SLM on a Raspberry Pi and:
- Measure real-world performance
- Optimize thread count for the hardware
- Compare different quantization levels
- Build a simple voice assistant
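For the thread-count step, a hedged helper like this is a reasonable starting point: prefer physical cores (via `psutil` when available) minus one for OS headroom, falling back to half the logical count, which on most SMT systems approximates the physical count. Treat the result as a starting value to sweep around, not an optimum.

```python
# Suggest an n_threads value: physical cores minus one, with a
# fallback when psutil is unavailable (os.cpu_count is logical).
import os

def recommended_threads() -> int:
    try:
        import psutil  # optional dependency
        physical = psutil.cpu_count(logical=False)
    except ImportError:
        physical = None
    if physical is None:
        # Assume 2 logical threads per physical core as a fallback.
        physical = max(1, (os.cpu_count() or 2) // 2)
    return max(1, physical - 1)

print(f"suggested n_threads: {recommended_threads()}")
```

On a Raspberry Pi 4/5 (no SMT, 4 physical cores) this suggests 3 threads; benchmark 2-4 on your own board to confirm.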
Exercise 4: Mobile Integration
Create a mobile app concept that:
- Uses CoreML on iOS or ONNX on Android
- Handles model updates
- Gracefully degrades on low memory
- Provides offline functionality
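The "graceful degradation" requirement can be sketched as a memory-aware model picker: choose the largest candidate whose estimated footprint fits a fraction of free RAM. The model names and footprints below are illustrative assumptions, and on device you would query free memory from the platform API rather than hard-coding it.

```python
# Pick the largest model fitting a memory budget; fall back to
# smaller models (or None) as free RAM shrinks. Sizes are illustrative.
CANDIDATES = [  # (name, approx resident memory in MB), largest first
    ("Phi-3-mini-Q4", 2600),
    ("Qwen2.5-0.5B-Q4", 500),
    ("SmolLM-135M-Q4", 200),
]

def pick_model(free_ram_mb: float, budget_fraction: float = 0.5):
    """Return the largest candidate within the budget, or None if
    even the smallest model does not fit."""
    budget = free_ram_mb * budget_fraction
    for name, footprint in CANDIDATES:
        if footprint <= budget:
            return name
    return None

print(pick_model(2048))  # → 'Qwen2.5-0.5B-Q4'
print(pick_model(150))   # → None (app should disable the feature)
```

Returning `None` rather than crashing is the degradation path: the app can hide the feature or queue requests for later.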
Summary
You've learned to deploy SLMs across diverse edge platforms:
- ONNX Export: Cross-platform deployment with optimization
- llama.cpp: Efficient CPU inference for resource-constrained devices
- WebLLM: Browser-based deployment using WebGPU
- CoreML: Apple ecosystem optimization with Neural Engine
- Unified Server: Multi-backend API for flexible deployment
Key insights:
- Quantization is essential for edge deployment (Q4_K_M offers best balance)
- Memory-mapped loading reduces startup requirements
- Thread count should match physical cores, not logical
- WebGPU enables near-native browser performance
- Always benchmark on actual target hardware
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| ONNX Runtime | Cross-platform inference engine | Deploy same model on CPU, GPU, mobile |
| llama.cpp | Optimized C++ inference for GGUF | Best performance on CPU-only devices |
| GGUF Format | Quantized model format (Q2-Q8) | Smaller files, faster inference |
| Q4_K_M | 4-bit quantization variant | Best balance of size and quality |
| WebLLM/MLC-LLM | Run models in browser via WebGPU | Zero server, complete privacy |
| CoreML | Apple's ML framework | Uses Neural Engine on iOS/macOS |
| use_mmap | Memory-map model from disk | Reduces RAM at cost of disk I/O |
| n_threads | CPU threads for inference | Set to physical_cores - 1 |
| n_ctx | Context window size | Larger = more memory, longer inputs |
| Time to First Token | Latency before generation starts | Critical for perceived responsiveness |
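The `n_ctx` memory cost in the table above comes mostly from the KV cache, which grows linearly with context length: bytes = 2 (K and V) × n_layers × n_ctx × n_kv_heads × head_dim × bytes per element. The Qwen2.5-0.5B-style dimensions below are assumptions for illustration; substitute your model's actual config.

```python
# Back-of-envelope KV-cache size: doubling n_ctx doubles the cache.
# Model dimensions here are illustrative (Qwen2.5-0.5B-style GQA).
def kv_cache_mb(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size in MB for fp16 (bytes_per_elem=2) by default."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / (1024 ** 2)

small = kv_cache_mb(n_layers=24, n_ctx=2048, n_kv_heads=2, head_dim=64)
large = kv_cache_mb(n_layers=24, n_ctx=4096, n_kv_heads=2, head_dim=64)
print(f"n_ctx=2048: {small:.0f} MB, n_ctx=4096: {large:.0f} MB")
```

Models using grouped-query attention (few KV heads) keep this cost small; multi-head models at the same size can need several times more, which is why `n_ctx` deserves attention on memory-constrained devices.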
Next Steps
- SLM Agents - Build agentic systems with edge models
- Production SLM System - Scale edge deployments