MLOps · Intermediate
A/B Testing Framework
Build a robust A/B testing system for ML model comparison in production
Create a statistically sound A/B testing framework to compare model versions in production.
TL;DR
Route users to model variants with consistent hashing (the same user always sees the same variant). Log conversions per variant, then run a two-proportion z-test to check significance (p < 0.05). The sample size needed for 80% statistical power depends on the baseline rate and the lift you want to detect, and is often several thousand per variant. Multi-armed bandits (epsilon-greedy, UCB1, Thompson sampling) can adaptively shift traffic toward the better variant.
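To make the TL;DR concrete, here is the z-test arithmetic on the example counts shown in the architecture diagram further down (150/1000 conversions for control vs 180/1000 for treatment); a minimal sketch of what the analyzer in Step 4 computes:

```python
import numpy as np
from scipy import stats

# Example counts (same numbers as the Redis keys in the architecture diagram)
control_conv, control_n = 150, 1000      # 15.0% conversion
treatment_conv, treatment_n = 180, 1000  # 18.0% conversion

p1 = control_conv / control_n
p2 = treatment_conv / treatment_n

# Pooled proportion and standard error for the two-proportion z-test
pooled = (control_conv + treatment_conv) / (control_n + treatment_n)
se = np.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treatment_n))

z = (p2 - p1) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.3f}")  # z ≈ 1.81, p ≈ 0.071
```

A 3-point lift on a 15% baseline is therefore not yet significant at 1000 samples per variant, which is why sample-size planning matters.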
What You'll Learn
- Traffic splitting strategies
- Statistical significance testing
- Multi-armed bandit algorithms
- Experiment tracking
- Results analysis
Tech Stack
| Component | Technology |
|---|---|
| API | FastAPI |
| Statistics | SciPy |
| Storage | Redis |
| Visualization | Plotly |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ A/B TESTING ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ Request │ │
│ │ (user_id) │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TRAFFIC ROUTER │ │
│ │ (consistent hashing on user_id) │ │
│ └───────────────────────────┬─────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────────────┐ ┌───────────────────────┐ │
│ │ Model A (Control) │ │ Model B (Treatment) │ │
│ │ weight: 50% │ │ weight: 50% │ │
│ └───────────┬───────────┘ └───────────┬───────────┘ │
│ │ │ │
│ └───────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Experiment Logger │ │
│ │ (user, variant, │ │
│ │ converted?) │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Redis │ │
│ │ counts:exp1:control {samples: 1000, conversions: 150} │ │
│ │ counts:exp1:treatment {samples: 1000, conversions: 180} │ │
│ └───────────────────────────┬───────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Statistical Analyzer (z-test, p-value, confidence intervals) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Project Structure
ab-testing/
├── src/
│ ├── __init__.py
│ ├── router.py # Traffic routing
│ ├── experiments.py # Experiment management
│ ├── statistics.py # Statistical analysis
│ ├── bandit.py # Multi-armed bandit
│ └── api.py # FastAPI application
├── tests/
├── docker-compose.yml
└── requirements.txt
Implementation
Step 1: Dependencies
fastapi>=0.100.0
uvicorn>=0.23.0
redis>=5.0.0
scipy>=1.11.0
numpy>=1.24.0
pydantic>=2.0.0
Step 2: Experiment Management
"""Experiment configuration and management."""
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
import json
import redis
class ExperimentStatus(str, Enum):
DRAFT = "draft"
RUNNING = "running"
PAUSED = "paused"
COMPLETED = "completed"
STOPPED = "stopped"
@dataclass
class Variant:
"""Experiment variant (model version)."""
name: str
model_id: str
weight: float = 0.5
description: str = ""
@dataclass
class Experiment:
"""A/B test experiment."""
id: str
name: str
description: str
variants: list[Variant]
status: ExperimentStatus = ExperimentStatus.DRAFT
created_at: datetime = field(default_factory=datetime.now)
started_at: Optional[datetime] = None
ended_at: Optional[datetime] = None
# Configuration
traffic_percentage: float = 100.0 # % of traffic in experiment
min_sample_size: int = 1000
confidence_level: float = 0.95
def to_dict(self) -> dict:
"""Convert to dictionary."""
return {
"id": self.id,
"name": self.name,
"description": self.description,
"variants": [
{"name": v.name, "model_id": v.model_id, "weight": v.weight}
for v in self.variants
],
"status": self.status.value,
"created_at": self.created_at.isoformat(),
"started_at": self.started_at.isoformat() if self.started_at else None,
"ended_at": self.ended_at.isoformat() if self.ended_at else None,
"traffic_percentage": self.traffic_percentage,
"min_sample_size": self.min_sample_size,
"confidence_level": self.confidence_level
}
@classmethod
def from_dict(cls, data: dict) -> "Experiment":
"""Create from dictionary."""
return cls(
id=data["id"],
name=data["name"],
description=data["description"],
variants=[
Variant(
name=v["name"],
model_id=v["model_id"],
weight=v["weight"]
)
for v in data["variants"]
],
status=ExperimentStatus(data["status"]),
created_at=datetime.fromisoformat(data["created_at"]),
started_at=datetime.fromisoformat(data["started_at"]) if data.get("started_at") else None,
ended_at=datetime.fromisoformat(data["ended_at"]) if data.get("ended_at") else None,
traffic_percentage=data.get("traffic_percentage", 100.0),
min_sample_size=data.get("min_sample_size", 1000),
confidence_level=data.get("confidence_level", 0.95)
)
class ExperimentStore:
"""Persistent experiment storage."""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self.prefix = "experiment:"
def save(self, experiment: Experiment) -> None:
"""Save experiment."""
key = f"{self.prefix}{experiment.id}"
self.redis.set(key, json.dumps(experiment.to_dict()))
def get(self, experiment_id: str) -> Optional[Experiment]:
"""Get experiment by ID."""
key = f"{self.prefix}{experiment_id}"
data = self.redis.get(key)
if data:
return Experiment.from_dict(json.loads(data))
return None
def get_active(self) -> list[Experiment]:
"""Get all running experiments."""
experiments = []
for key in self.redis.scan_iter(f"{self.prefix}*"):
data = self.redis.get(key)
if data:
exp = Experiment.from_dict(json.loads(data))
if exp.status == ExperimentStatus.RUNNING:
experiments.append(exp)
return experiments
def delete(self, experiment_id: str) -> bool:
"""Delete experiment."""
key = f"{self.prefix}{experiment_id}"
return self.redis.delete(key) > 0
Step 3: Traffic Router
"""Traffic routing for A/B tests."""
import hashlib
import random
from typing import Optional
from .experiments import Experiment, Variant, ExperimentStatus
class TrafficRouter:
"""
Route traffic to experiment variants.
Uses consistent hashing for deterministic assignment
based on user ID.
"""
def __init__(self, experiment_store):
self.store = experiment_store
def route(
self,
user_id: str,
experiment_id: Optional[str] = None
) -> tuple[Optional[str], Optional[Variant]]:
"""
Route a request to a variant.
Args:
user_id: Unique user identifier
experiment_id: Specific experiment (or uses active)
Returns:
Tuple of (experiment_id, variant) or (None, None)
"""
if experiment_id:
experiment = self.store.get(experiment_id)
if not experiment or experiment.status != ExperimentStatus.RUNNING:
return None, None
else:
# Get first active experiment
active = self.store.get_active()
if not active:
return None, None
experiment = active[0]
# Check if user is in experiment traffic
if not self._is_in_experiment(user_id, experiment):
return None, None
# Assign variant
variant = self._assign_variant(user_id, experiment)
return experiment.id, variant
def _is_in_experiment(self, user_id: str, experiment: Experiment) -> bool:
"""Check if user should be in experiment."""
hash_value = self._hash_user(user_id, "traffic")
threshold = experiment.traffic_percentage / 100.0
return hash_value < threshold
def _assign_variant(self, user_id: str, experiment: Experiment) -> Variant:
"""Assign user to a variant deterministically."""
hash_value = self._hash_user(user_id, experiment.id)
cumulative = 0.0
for variant in experiment.variants:
cumulative += variant.weight
if hash_value < cumulative:
return variant
# Fallback to last variant
return experiment.variants[-1]
def _hash_user(self, user_id: str, salt: str) -> float:
"""Hash user ID to value between 0 and 1."""
combined = f"{user_id}:{salt}"
hash_bytes = hashlib.sha256(combined.encode()).digest()
hash_int = int.from_bytes(hash_bytes[:8], byteorder="big")
return hash_int / (2**64)
Understanding Consistent Hashing:
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY CONSISTENT HASHING? │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ RANDOM ASSIGNMENT (Bad): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ User "abc123" visits: │ │
│ │ Visit 1 → random() = 0.3 → Model A │ │
│ │ Visit 2 → random() = 0.7 → Model B ← Different! │ │
│ │ Visit 3 → random() = 0.2 → Model A ← Inconsistent experience │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ CONSISTENT HASHING (Good): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ User "abc123" visits: │ │
│ │ hash("abc123:exp1") = 0.42 → Model A │ │
│ │ hash("abc123:exp1") = 0.42 → Model A ← Same! │ │
│ │ hash("abc123:exp1") = 0.42 → Model A ← Always consistent │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ The salt (experiment ID) ensures: │
│ • Same user can be in different variants across experiments │
│ • Prevents bias if experiments overlap │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Traffic Splitting Strategy:
| Strategy | Deterministic? | Use Case |
|---|---|---|
| Consistent Hash | Yes | A/B tests (user sees same variant) |
| Weighted Random | No | Load balancing, canary deploys |
| Sticky Sessions | Yes (with storage) | When hash not possible |
| Percentage Ramp | Yes | Gradual rollouts (1% → 10% → 50%) |
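The first row of the table can be sanity-checked in isolation. A minimal standalone sketch of the deterministic assignment, mirroring `_hash_user` and `_assign_variant` above (the experiment id and weights here are made up for illustration):

```python
import hashlib

def hash_user(user_id: str, salt: str) -> float:
    """Hash user_id deterministically to a value in [0, 1)."""
    digest = hashlib.sha256(f"{user_id}:{salt}".encode()).digest()
    return int.from_bytes(digest[:8], byteorder="big") / 2**64

def assign(user_id: str, experiment_id: str, weights: dict[str, float]) -> str:
    """Walk cumulative weights and pick the matching variant."""
    value = hash_user(user_id, experiment_id)
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if value < cumulative:
            return name
    return name  # fallback: last variant absorbs floating-point rounding

weights = {"control": 0.5, "treatment": 0.5}
v1 = assign("abc123", "exp1", weights)
v2 = assign("abc123", "exp1", weights)
assert v1 == v2  # repeat visits always land on the same variant
```

Hashing many user ids also shows the 50/50 split holds in aggregate, while a different salt (another experiment id) reshuffles assignments independently. The `WeightedRouter` below trades this determinism for simplicity.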
class WeightedRouter:
"""Simple weighted random routing (non-deterministic)."""
def route(self, experiment: Experiment) -> Variant:
"""Select variant based on weights."""
r = random.random()
cumulative = 0.0
for variant in experiment.variants:
cumulative += variant.weight
if r < cumulative:
return variant
return experiment.variants[-1]
Step 4: Statistical Analysis
"""Statistical analysis for A/B tests."""
from dataclasses import dataclass
from typing import Optional
import numpy as np
from scipy import stats
@dataclass
class VariantMetrics:
"""Metrics for a single variant."""
name: str
samples: int
conversions: int
conversion_rate: float
mean_value: float
std_value: float
@dataclass
class ExperimentResults:
"""Results of statistical analysis."""
control: VariantMetrics
treatment: VariantMetrics
# Statistical results
relative_lift: float
p_value: float
confidence_interval: tuple[float, float]
is_significant: bool
required_samples: int
power: float
class StatisticalAnalyzer:
"""
Statistical analysis for A/B test results.
Supports both conversion rate and continuous metric analysis.
"""
def __init__(self, confidence_level: float = 0.95):
self.confidence_level = confidence_level
self.alpha = 1 - confidence_level
def analyze_conversion(
self,
control_conversions: int,
control_samples: int,
treatment_conversions: int,
treatment_samples: int
) -> ExperimentResults:
"""
Analyze conversion rate experiment.
Uses two-proportion z-test for significance.
"""
# Calculate rates
control_rate = control_conversions / control_samples
treatment_rate = treatment_conversions / treatment_samples
# Pooled proportion
pooled = (control_conversions + treatment_conversions) / (
control_samples + treatment_samples
)
# Standard error
se = np.sqrt(
pooled * (1 - pooled) * (1/control_samples + 1/treatment_samples)
)
# Z-statistic
z_stat = (treatment_rate - control_rate) / se if se > 0 else 0
# P-value (two-tailed)
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
# Confidence interval for difference
se_diff = np.sqrt(
(control_rate * (1 - control_rate) / control_samples) +
(treatment_rate * (1 - treatment_rate) / treatment_samples)
)
z_crit = stats.norm.ppf(1 - self.alpha / 2)
ci_low = (treatment_rate - control_rate) - z_crit * se_diff
ci_high = (treatment_rate - control_rate) + z_crit * se_diff
# Relative lift
lift = (treatment_rate - control_rate) / control_rate if control_rate > 0 else 0
# Required sample size for 80% power
required = self._required_sample_size(control_rate, treatment_rate)
# Statistical power
power = self._calculate_power(
control_rate, treatment_rate,
control_samples, treatment_samples
)
return ExperimentResults(
control=VariantMetrics(
name="control",
samples=control_samples,
conversions=control_conversions,
conversion_rate=control_rate,
mean_value=control_rate,
std_value=np.sqrt(control_rate * (1 - control_rate))
),
treatment=VariantMetrics(
name="treatment",
samples=treatment_samples,
conversions=treatment_conversions,
conversion_rate=treatment_rate,
mean_value=treatment_rate,
std_value=np.sqrt(treatment_rate * (1 - treatment_rate))
),
relative_lift=lift,
p_value=p_value,
confidence_interval=(ci_low, ci_high),
is_significant=p_value < self.alpha,
required_samples=required,
power=power
)
def analyze_continuous(
self,
control_values: np.ndarray,
treatment_values: np.ndarray
) -> ExperimentResults:
"""
Analyze continuous metric experiment.
Uses Welch's t-test for significance.
"""
# Calculate statistics
control_mean = np.mean(control_values)
control_std = np.std(control_values, ddof=1)
treatment_mean = np.mean(treatment_values)
treatment_std = np.std(treatment_values, ddof=1)
# Welch's t-test
t_stat, p_value = stats.ttest_ind(
treatment_values, control_values, equal_var=False
)
# Confidence interval
se = np.sqrt(
(control_std**2 / len(control_values)) +
(treatment_std**2 / len(treatment_values))
)
df = self._welch_df(control_values, treatment_values)
t_crit = stats.t.ppf(1 - self.alpha / 2, df)
diff = treatment_mean - control_mean
ci_low = diff - t_crit * se
ci_high = diff + t_crit * se
# Relative lift
lift = diff / control_mean if control_mean != 0 else 0
return ExperimentResults(
control=VariantMetrics(
name="control",
samples=len(control_values),
conversions=0,
conversion_rate=0,
mean_value=control_mean,
std_value=control_std
),
treatment=VariantMetrics(
name="treatment",
samples=len(treatment_values),
conversions=0,
conversion_rate=0,
mean_value=treatment_mean,
std_value=treatment_std
),
relative_lift=lift,
p_value=p_value,
confidence_interval=(ci_low, ci_high),
is_significant=p_value < self.alpha,
required_samples=0,
power=0
)
def _required_sample_size(
self,
baseline_rate: float,
expected_rate: float,
power: float = 0.8
) -> int:
"""Calculate required sample size per variant."""
if baseline_rate == 0 or expected_rate == baseline_rate:
return 0
z_alpha = stats.norm.ppf(1 - self.alpha / 2)
z_beta = stats.norm.ppf(power)
p1, p2 = baseline_rate, expected_rate
p_avg = (p1 + p2) / 2
n = (
(z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) +
z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2
) / (p2 - p1)**2
return int(np.ceil(n))
def _calculate_power(
self,
p1: float,
p2: float,
n1: int,
n2: int
) -> float:
"""Calculate statistical power."""
if p1 == p2 or n1 == 0 or n2 == 0:
return 0
z_alpha = stats.norm.ppf(1 - self.alpha / 2)
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
effect = abs(p2 - p1)
z_power = (effect / se) - z_alpha
return float(stats.norm.cdf(z_power))
def _welch_df(self, a: np.ndarray, b: np.ndarray) -> float:
"""Calculate Welch-Satterthwaite degrees of freedom."""
var_a = np.var(a, ddof=1)
var_b = np.var(b, ddof=1)
n_a, n_b = len(a), len(b)
num = (var_a/n_a + var_b/n_b)**2
denom = (var_a/n_a)**2/(n_a-1) + (var_b/n_b)**2/(n_b-1)
return num / denom if denom > 0 else 1
Step 5: Multi-Armed Bandit
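For intuition before the implementations that follow, a self-contained epsilon-greedy run on two simulated arms (the reward probabilities 0.10 and 0.15 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.10, 0.15]   # hypothetical conversion rate per arm
epsilon = 0.1
counts = np.zeros(2)
values = np.zeros(2)

for _ in range(20_000):
    # Explore with probability epsilon, otherwise exploit the current best estimate
    arm = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(values))
    reward = float(rng.random() < true_rates[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

print(f"pulls per arm: {counts}, estimated rates: {values.round(3)}")
```

The better arm ends up receiving most of the traffic while the experiment runs, which is the appeal of bandits over a fixed 50/50 split; the cost is that the adaptively collected data is harder to analyze with the z-test above.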
"""Multi-armed bandit algorithms for adaptive experiments."""
from abc import ABC, abstractmethod
import numpy as np
from typing import Optional
class BanditAlgorithm(ABC):
"""Base class for bandit algorithms."""
@abstractmethod
def select_arm(self) -> int:
"""Select which arm to pull."""
pass
@abstractmethod
def update(self, arm: int, reward: float) -> None:
"""Update arm statistics with reward."""
pass
class EpsilonGreedy(BanditAlgorithm):
"""
Epsilon-greedy bandit algorithm.
Explores randomly with probability epsilon,
exploits best arm otherwise.
"""
def __init__(self, n_arms: int, epsilon: float = 0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
def select_arm(self) -> int:
"""Select arm using epsilon-greedy."""
if np.random.random() < self.epsilon:
return np.random.randint(self.n_arms)
return int(np.argmax(self.values))
def update(self, arm: int, reward: float) -> None:
"""Update arm value with incremental mean."""
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
class UCB1(BanditAlgorithm):
"""
Upper Confidence Bound (UCB1) algorithm.
Balances exploration and exploitation using
confidence bounds.
"""
def __init__(self, n_arms: int):
self.n_arms = n_arms
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
self.total_counts = 0
def select_arm(self) -> int:
"""Select arm with highest UCB."""
# Try each arm at least once
for arm in range(self.n_arms):
if self.counts[arm] == 0:
return arm
# Calculate UCB values
ucb_values = self.values + np.sqrt(
2 * np.log(self.total_counts) / self.counts
)
return int(np.argmax(ucb_values))
def update(self, arm: int, reward: float) -> None:
"""Update arm statistics."""
self.total_counts += 1
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
class ThompsonSampling(BanditAlgorithm):
"""
Thompson Sampling for Bernoulli bandits.
Uses Beta distribution for posterior sampling.
"""
def __init__(self, n_arms: int):
self.n_arms = n_arms
# Beta distribution parameters (successes, failures)
self.alpha = np.ones(n_arms) # Successes + 1
self.beta = np.ones(n_arms) # Failures + 1
def select_arm(self) -> int:
"""Select arm by sampling from posteriors."""
samples = np.array([
np.random.beta(self.alpha[i], self.beta[i])
for i in range(self.n_arms)
])
return int(np.argmax(samples))
def update(self, arm: int, reward: float) -> None:
"""Update Beta distribution parameters."""
if reward > 0:
self.alpha[arm] += 1
else:
self.beta[arm] += 1
def get_probabilities(self) -> np.ndarray:
"""Get current probability estimates."""
return self.alpha / (self.alpha + self.beta)
Step 6: FastAPI Application
"""FastAPI application for A/B testing."""
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
import os
import redis
import uuid
from .experiments import Experiment, Variant, ExperimentStatus, ExperimentStore
from .router import TrafficRouter
from .statistics import StatisticalAnalyzer
# Global instances (initialized during application startup)
store: Optional[ExperimentStore] = None
router: Optional[TrafficRouter] = None
analyzer: Optional[StatisticalAnalyzer] = None
redis_client: Optional[redis.Redis] = None
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Application lifespan."""
global store, router, analyzer, redis_client
# Read REDIS_HOST so the API container can reach the redis service in docker-compose
redis_client = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, db=0)
store = ExperimentStore(redis_client)
router = TrafficRouter(store)
analyzer = StatisticalAnalyzer(confidence_level=0.95)
yield
redis_client.close()
app = FastAPI(
title="A/B Testing API",
description="Statistical A/B testing for ML models",
lifespan=lifespan
)
# Request/Response Models
class CreateExperimentRequest(BaseModel):
name: str
description: str
control_model_id: str
treatment_model_id: str
traffic_percentage: float = 100.0
control_weight: float = 0.5
class RouteRequest(BaseModel):
user_id: str
experiment_id: Optional[str] = None
class RouteResponse(BaseModel):
experiment_id: Optional[str]
variant_name: Optional[str]
model_id: Optional[str]
class LogEventRequest(BaseModel):
experiment_id: str
user_id: str
variant_name: str
converted: bool
value: Optional[float] = None
class ExperimentResultsResponse(BaseModel):
experiment_id: str
control_samples: int
treatment_samples: int
control_conversion_rate: float
treatment_conversion_rate: float
relative_lift: float
p_value: float
is_significant: bool
confidence_interval: tuple[float, float]
@app.post("/experiments", response_model=dict)
async def create_experiment(request: CreateExperimentRequest):
"""Create a new A/B test experiment."""
experiment = Experiment(
id=str(uuid.uuid4())[:8],
name=request.name,
description=request.description,
variants=[
Variant(
name="control",
model_id=request.control_model_id,
weight=request.control_weight
),
Variant(
name="treatment",
model_id=request.treatment_model_id,
weight=1 - request.control_weight
)
],
traffic_percentage=request.traffic_percentage
)
store.save(experiment)
return experiment.to_dict()
@app.post("/experiments/{experiment_id}/start")
async def start_experiment(experiment_id: str):
"""Start an experiment."""
experiment = store.get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
experiment.status = ExperimentStatus.RUNNING
store.save(experiment)
return {"status": "started"}
@app.post("/experiments/{experiment_id}/stop")
async def stop_experiment(experiment_id: str):
"""Stop an experiment."""
experiment = store.get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
experiment.status = ExperimentStatus.STOPPED
store.save(experiment)
return {"status": "stopped"}
@app.post("/route", response_model=RouteResponse)
async def route_request(request: RouteRequest):
"""Route a user to an experiment variant."""
exp_id, variant = router.route(request.user_id, request.experiment_id)
if not variant:
return RouteResponse(
experiment_id=None,
variant_name=None,
model_id=None
)
return RouteResponse(
experiment_id=exp_id,
variant_name=variant.name,
model_id=variant.model_id
)
@app.post("/log")
async def log_event(request: LogEventRequest):
"""Log a conversion event."""
key = f"events:{request.experiment_id}:{request.variant_name}"
import json  # needed for json.dumps below
# Store the raw event as JSON so it can be parsed later (str(dict) is not valid JSON)
event = {
"user_id": request.user_id,
"converted": request.converted,
"value": request.value
}
redis_client.rpush(key, json.dumps(event))
# Update counts
count_key = f"counts:{request.experiment_id}:{request.variant_name}"
redis_client.hincrby(count_key, "samples", 1)
if request.converted:
redis_client.hincrby(count_key, "conversions", 1)
return {"status": "logged"}
@app.get("/experiments/{experiment_id}/results", response_model=ExperimentResultsResponse)
async def get_results(experiment_id: str):
"""Get experiment results with statistical analysis."""
experiment = store.get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
# Get counts for each variant
control_key = f"counts:{experiment_id}:control"
treatment_key = f"counts:{experiment_id}:treatment"
control_data = redis_client.hgetall(control_key)
treatment_data = redis_client.hgetall(treatment_key)
control_samples = int(control_data.get(b"samples", 0))
control_conversions = int(control_data.get(b"conversions", 0))
treatment_samples = int(treatment_data.get(b"samples", 0))
treatment_conversions = int(treatment_data.get(b"conversions", 0))
if control_samples == 0 or treatment_samples == 0:
raise HTTPException(
status_code=400,
detail="Insufficient data for analysis"
)
# Analyze results
results = analyzer.analyze_conversion(
control_conversions, control_samples,
treatment_conversions, treatment_samples
)
return ExperimentResultsResponse(
experiment_id=experiment_id,
control_samples=control_samples,
treatment_samples=treatment_samples,
control_conversion_rate=results.control.conversion_rate,
treatment_conversion_rate=results.treatment.conversion_rate,
relative_lift=results.relative_lift,
p_value=results.p_value,
is_significant=results.is_significant,
confidence_interval=results.confidence_interval
)
@app.get("/experiments/{experiment_id}")
async def get_experiment(experiment_id: str):
"""Get experiment details."""
experiment = store.get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
return experiment.to_dict()
Step 7: Docker Compose
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
environment:
- REDIS_HOST=redis
depends_on:
- redis
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis-data:/data
volumes:
redis-data:
Usage Example
# Create experiment
curl -X POST http://localhost:8000/experiments \
-H "Content-Type: application/json" \
-d '{
"name": "Model v2 Test",
"description": "Testing new model version",
"control_model_id": "model-v1",
"treatment_model_id": "model-v2",
"traffic_percentage": 50
}'
# Start experiment
curl -X POST http://localhost:8000/experiments/abc123/start
# Route user to variant
curl -X POST http://localhost:8000/route \
-H "Content-Type: application/json" \
-d '{"user_id": "user-456"}'
# Log conversion
curl -X POST http://localhost:8000/log \
-H "Content-Type: application/json" \
-d '{
"experiment_id": "abc123",
"user_id": "user-456",
"variant_name": "treatment",
"converted": true
}'
# Get results
curl http://localhost:8000/experiments/abc123/results
Statistical Considerations
| Factor | Recommendation |
|---|---|
| Sample size | Compute from baseline rate and minimum detectable effect (often thousands per variant) |
| Duration | 1-2 weeks minimum, covering full weekly cycles |
| Significance | p < 0.05 (95% confidence) |
| Power | 80% minimum |
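To make the sample-size row concrete, here is the same formula as `_required_sample_size` in Step 4, applied to a hypothetical 15% → 18% lift:

```python
import numpy as np
from scipy import stats

def required_sample_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-variant n for a two-proportion z-test (same formula as _required_sample_size)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for two-sided test
    z_beta = stats.norm.ppf(power)            # critical value for desired power
    p_avg = (p1 + p2) / 2
    n = (z_alpha * np.sqrt(2 * p_avg * (1 - p_avg))
         + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return int(np.ceil(n))

n = required_sample_size(0.15, 0.18)
print(n)  # ≈ 2400 per variant
```

Detecting smaller lifts grows n quadratically: halving the detectable lift roughly quadruples the required samples, so a flat "1000 per variant" rule can leave an experiment underpowered.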
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Consistent Hashing | Same user always gets same variant | Stable experience, valid comparison |
| Two-Proportion Z-Test | Compare conversion rates statistically | Know if difference is real or noise |
| P-Value | Probability of a difference at least this large if there were no real effect | p < 0.05 is the conventional significance threshold |
| Statistical Power | Probability of detecting real effect | 80% power = won't miss true improvements |
| Confidence Interval | Range of plausible effect sizes | Know the magnitude, not just direction |
| Epsilon-Greedy | Explore ε%, exploit (1-ε)% best arm | Simple bandit, tunable exploration |
| UCB1 | Balance mean reward + uncertainty bonus | Optimistic exploration, good regret |
| Thompson Sampling | Sample from posterior, pick max | Bayesian approach, often best empirically |
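The bandit rows above can be seen in action with a short simulation of Thompson sampling on two arms with assumed conversion rates of 0.10 and 0.15:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.10, 0.15]   # hypothetical per-arm conversion rates
alpha = np.ones(2)          # Beta posterior: successes + 1
beta = np.ones(2)           # Beta posterior: failures + 1
pulls = np.zeros(2)

for _ in range(10_000):
    # Sample a plausible rate for each arm from its posterior, play the max
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    pulls[arm] += 1
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(f"pulls: {pulls}, posterior means: {(alpha / (alpha + beta)).round(3)}")
```

Most pulls concentrate on the better arm while its posterior mean stays close to the true rate; UCB1 or epsilon-greedy from Step 5 could be dropped into the same loop for comparison.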
Next Steps
- Complete Pipeline - Integrate with CI/CD
- Model Registry - Version your models