MLOps · Intermediate
A/B Testing Framework
Build a robust A/B testing system for ML model comparison in production
Create a statistically sound A/B testing framework to compare model versions in production.
TL;DR
Route users to model variants with consistent hashing (the same user always sees the same variant). Log conversions per variant, then run a two-proportion z-test to check significance (p < 0.05). The sample size needed for 80% statistical power depends on the baseline rate and the lift you want to detect, and is often several thousand per variant. Multi-armed bandits (epsilon-greedy, UCB1, Thompson sampling) can adaptively shift traffic toward the better variant.
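To make the TL;DR concrete, here is the z-test arithmetic on the example counts shown in the architecture diagram further down (150/1000 conversions for control vs 180/1000 for treatment); a minimal sketch of what the analyzer in Step 4 computes:

```python
import numpy as np
from scipy import stats

# Example counts (same numbers as the Redis keys in the architecture diagram)
control_conv, control_n = 150, 1000      # 15.0% conversion
treatment_conv, treatment_n = 180, 1000  # 18.0% conversion

p1 = control_conv / control_n
p2 = treatment_conv / treatment_n

# Pooled proportion and standard error for the two-proportion z-test
pooled = (control_conv + treatment_conv) / (control_n + treatment_n)
se = np.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treatment_n))

z = (p2 - p1) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.3f}")  # z ≈ 1.81, p ≈ 0.071
```

A 3-point lift on a 15% baseline is therefore not yet significant at 1000 samples per variant, which is why sample-size planning matters.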
What You'll Learn
- Traffic splitting strategies
- Statistical significance testing
- Multi-armed bandit algorithms
- Experiment tracking
- Results analysis
Tech Stack
| Component | Technology |
|---|---|
| API | FastAPI |
| Statistics | SciPy |
| Storage | Redis |
| Visualization | Plotly |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ A/B TESTING ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ Request │ │
│ │ (user_id) │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TRAFFIC ROUTER │ │
│ │ (consistent hashing on user_id) │ │
│ └───────────────────────────┬─────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────────────┐ ┌───────────────────────┐ │
│ │ Model A (Control) │ │ Model B (Treatment) │ │
│ │ weight: 50% │ │ weight: 50% │ │
│ └───────────┬───────────┘ └───────────┬───────────┘ │
│ │ │ │
│ └───────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Experiment Logger │ │
│ │ (user, variant, │ │
│ │ converted?) │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Redis │ │
│ │ counts:exp1:control {samples: 1000, conversions: 150} │ │
│ │ counts:exp1:treatment {samples: 1000, conversions: 180} │ │
│ └───────────────────────────┬───────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Statistical Analyzer (z-test, p-value, confidence intervals) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Project Structure
ab-testing/
├── src/
│ ├── __init__.py
│ ├── router.py # Traffic routing
│ ├── experiments.py # Experiment management
│ ├── statistics.py # Statistical analysis
│ ├── bandit.py # Multi-armed bandit
│ └── api.py # FastAPI application
├── tests/
├── docker-compose.yml
└── requirements.txt
Implementation
Step 1: Dependencies
fastapi>=0.100.0
uvicorn>=0.23.0
redis>=5.0.0
scipy>=1.11.0
numpy>=1.24.0
pydantic>=2.0.0
Step 2: Experiment Management
"""Experiment configuration and management."""
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
import json
import redis
class ExperimentStatus(str, Enum):
DRAFT = "draft"
RUNNING = "running"
PAUSED = "paused"
COMPLETED = "completed"
STOPPED = "stopped"
@dataclass
class Variant:
"""Experiment variant (model version)."""
name: str
model_id: str
weight: float = 0.5
description: str = ""
@dataclass
class Experiment:
"""A/B test experiment."""
id: str
name: str
description: str
variants: list[Variant]
status: ExperimentStatus = ExperimentStatus.DRAFT
created_at: datetime = field(default_factory=datetime.now)
started_at: Optional[datetime] = None
ended_at: Optional[datetime] = None
# Configuration
traffic_percentage: float = 100.0 # % of traffic in experiment
min_sample_size: int = 1000
confidence_level: float = 0.95
def to_dict(self) -> dict:
"""Convert to dictionary."""
return {
"id": self.id,
"name": self.name,
"description": self.description,
"variants": [
{"name": v.name, "model_id": v.model_id, "weight": v.weight}
for v in self.variants
],
"status": self.status.value,
"created_at": self.created_at.isoformat(),
"started_at": self.started_at.isoformat() if self.started_at else None,
"ended_at": self.ended_at.isoformat() if self.ended_at else None,
"traffic_percentage": self.traffic_percentage,
"min_sample_size": self.min_sample_size,
"confidence_level": self.confidence_level
}
@classmethod
def from_dict(cls, data: dict) -> "Experiment":
"""Create from dictionary."""
return cls(
id=data["id"],
name=data["name"],
description=data["description"],
variants=[
Variant(
name=v["name"],
model_id=v["model_id"],
weight=v["weight"]
)
for v in data["variants"]
],
status=ExperimentStatus(data["status"]),
created_at=datetime.fromisoformat(data["created_at"]),
started_at=datetime.fromisoformat(data["started_at"]) if data.get("started_at") else None,
ended_at=datetime.fromisoformat(data["ended_at"]) if data.get("ended_at") else None,
traffic_percentage=data.get("traffic_percentage", 100.0),
min_sample_size=data.get("min_sample_size", 1000),
confidence_level=data.get("confidence_level", 0.95)
)
class ExperimentStore:
"""Persistent experiment storage."""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self.prefix = "experiment:"
def save(self, experiment: Experiment) -> None:
"""Save experiment."""
key = f"{self.prefix}{experiment.id}"
self.redis.set(key, json.dumps(experiment.to_dict()))
def get(self, experiment_id: str) -> Optional[Experiment]:
"""Get experiment by ID."""
key = f"{self.prefix}{experiment_id}"
data = self.redis.get(key)
if data:
return Experiment.from_dict(json.loads(data))
return None
def get_active(self) -> list[Experiment]:
"""Get all running experiments."""
experiments = []
for key in self.redis.scan_iter(f"{self.prefix}*"):
data = self.redis.get(key)
if data:
exp = Experiment.from_dict(json.loads(data))
if exp.status == ExperimentStatus.RUNNING:
experiments.append(exp)
return experiments
def delete(self, experiment_id: str) -> bool:
"""Delete experiment."""
key = f"{self.prefix}{experiment_id}"
return self.redis.delete(key) > 0
Step 3: Traffic Router
"""Traffic routing for A/B tests."""
import hashlib
import random
from typing import Optional
from .experiments import Experiment, Variant, ExperimentStatus
class TrafficRouter:
"""
Route traffic to experiment variants.
Uses consistent hashing for deterministic assignment
based on user ID.
"""
def __init__(self, experiment_store):
self.store = experiment_store
def route(
self,
user_id: str,
experiment_id: Optional[str] = None
) -> tuple[Optional[str], Optional[Variant]]:
"""
Route a request to a variant.
Args:
user_id: Unique user identifier
experiment_id: Specific experiment (or uses active)
Returns:
Tuple of (experiment_id, variant) or (None, None)
"""
if experiment_id:
experiment = self.store.get(experiment_id)
if not experiment or experiment.status != ExperimentStatus.RUNNING:
return None, None
else:
# Get first active experiment
active = self.store.get_active()
if not active:
return None, None
experiment = active[0]
# Check if user is in experiment traffic
if not self._is_in_experiment(user_id, experiment):
return None, None
# Assign variant
variant = self._assign_variant(user_id, experiment)
return experiment.id, variant
def _is_in_experiment(self, user_id: str, experiment: Experiment) -> bool:
"""Check if user should be in experiment."""
hash_value = self._hash_user(user_id, "traffic")
threshold = experiment.traffic_percentage / 100.0
return hash_value < threshold
def _assign_variant(self, user_id: str, experiment: Experiment) -> Variant:
"""Assign user to a variant deterministically."""
hash_value = self._hash_user(user_id, experiment.id)
cumulative = 0.0
for variant in experiment.variants:
cumulative += variant.weight
if hash_value < cumulative:
return variant
# Fallback to last variant
return experiment.variants[-1]
def _hash_user(self, user_id: str, salt: str) -> float:
"""Hash user ID to value between 0 and 1."""
combined = f"{user_id}:{salt}"
hash_bytes = hashlib.sha256(combined.encode()).digest()
hash_int = int.from_bytes(hash_bytes[:8], byteorder="big")
return hash_int / (2**64)
Understanding Consistent Hashing:
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY CONSISTENT HASHING? │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ RANDOM ASSIGNMENT (Bad): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ User "abc123" visits: │ │
│ │ Visit 1 → random() = 0.3 → Model A │ │
│ │ Visit 2 → random() = 0.7 → Model B ← Different! │ │
│ │ Visit 3 → random() = 0.2 → Model A ← Inconsistent experience │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ CONSISTENT HASHING (Good): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ User "abc123" visits: │ │
│ │ hash("abc123:exp1") = 0.42 → Model A │ │
│ │ hash("abc123:exp1") = 0.42 → Model A ← Same! │ │
│ │ hash("abc123:exp1") = 0.42 → Model A ← Always consistent │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ The salt (experiment ID) ensures: │
│ • Same user can be in different variants across experiments │
│ • Prevents bias if experiments overlap │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Traffic Splitting Strategy:
| Strategy | Deterministic? | Use Case |
|---|---|---|
| Consistent Hash | Yes | A/B tests (user sees same variant) |
| Weighted Random | No | Load balancing, canary deploys |
| Sticky Sessions | Yes (with storage) | When hash not possible |
| Percentage Ramp | Yes | Gradual rollouts (1% → 10% → 50%) |
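The first row of the table can be sanity-checked in isolation. A minimal standalone sketch of the deterministic assignment, mirroring `_hash_user` and `_assign_variant` above (the experiment id and weights here are made up for illustration):

```python
import hashlib

def hash_user(user_id: str, salt: str) -> float:
    """Hash user_id deterministically to a value in [0, 1)."""
    digest = hashlib.sha256(f"{user_id}:{salt}".encode()).digest()
    return int.from_bytes(digest[:8], byteorder="big") / 2**64

def assign(user_id: str, experiment_id: str, weights: dict[str, float]) -> str:
    """Walk cumulative weights and pick the matching variant."""
    value = hash_user(user_id, experiment_id)
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if value < cumulative:
            return name
    return name  # fallback: last variant absorbs floating-point rounding

weights = {"control": 0.5, "treatment": 0.5}
v1 = assign("abc123", "exp1", weights)
v2 = assign("abc123", "exp1", weights)
assert v1 == v2  # repeat visits always land on the same variant
```

Hashing many user ids also shows the 50/50 split holds in aggregate, while a different salt (another experiment id) reshuffles assignments independently. The `WeightedRouter` below trades this determinism for simplicity.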
class WeightedRouter:
"""Simple weighted random routing (non-deterministic)."""
def route(self, experiment: Experiment) -> Variant:
"""Select variant based on weights."""
r = random.random()
cumulative = 0.0
for variant in experiment.variants:
cumulative += variant.weight
if r < cumulative:
return variant
return experiment.variants[-1]
Step 4: Statistical Analysis
"""Statistical analysis for A/B tests."""
from dataclasses import dataclass
from typing import Optional
import numpy as np
from scipy import stats
@dataclass
class VariantMetrics:
"""Metrics for a single variant."""
name: str
samples: int
conversions: int
conversion_rate: float
mean_value: float
std_value: float
@dataclass
class ExperimentResults:
"""Results of statistical analysis."""
control: VariantMetrics
treatment: VariantMetrics
# Statistical results
relative_lift: float
p_value: float
confidence_interval: tuple[float, float]
is_significant: bool
required_samples: int
power: float
class StatisticalAnalyzer:
"""
Statistical analysis for A/B test results.
Supports both conversion rate and continuous metric analysis.
"""
def __init__(self, confidence_level: float = 0.95):
self.confidence_level = confidence_level
self.alpha = 1 - confidence_level
def analyze_conversion(
self,
control_conversions: int,
control_samples: int,
treatment_conversions: int,
treatment_samples: int
) -> ExperimentResults:
"""
Analyze conversion rate experiment.
Uses two-proportion z-test for significance.
"""
# Calculate rates
control_rate = control_conversions / control_samples
treatment_rate = treatment_conversions / treatment_samples
# Pooled proportion
pooled = (control_conversions + treatment_conversions) / (
control_samples + treatment_samples
)
# Standard error
se = np.sqrt(
pooled * (1 - pooled) * (1/control_samples + 1/treatment_samples)
)
# Z-statistic
z_stat = (treatment_rate - control_rate) / se if se > 0 else 0
# P-value (two-tailed)
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
# Confidence interval for difference
se_diff = np.sqrt(
(control_rate * (1 - control_rate) / control_samples) +
(treatment_rate * (1 - treatment_rate) / treatment_samples)
)
z_crit = stats.norm.ppf(1 - self.alpha / 2)
ci_low = (treatment_rate - control_rate) - z_crit * se_diff
ci_high = (treatment_rate - control_rate) + z_crit * se_diff
# Relative lift
lift = (treatment_rate - control_rate) / control_rate if control_rate > 0 else 0
# Required sample size for 80% power
required = self._required_sample_size(control_rate, treatment_rate)
# Statistical power
power = self._calculate_power(
control_rate, treatment_rate,
control_samples, treatment_samples
)
return ExperimentResults(
control=VariantMetrics(
name="control",
samples=control_samples,
conversions=control_conversions,
conversion_rate=control_rate,
mean_value=control_rate,
std_value=np.sqrt(control_rate * (1 - control_rate))
),
treatment=VariantMetrics(
name="treatment",
samples=treatment_samples,
conversions=treatment_conversions,
conversion_rate=treatment_rate,
mean_value=treatment_rate,
std_value=np.sqrt(treatment_rate * (1 - treatment_rate))
),
relative_lift=lift,
p_value=p_value,
confidence_interval=(ci_low, ci_high),
is_significant=p_value < self.alpha,
required_samples=required,
power=power
)
def analyze_continuous(
self,
control_values: np.ndarray,
treatment_values: np.ndarray
) -> ExperimentResults:
"""
Analyze continuous metric experiment.
Uses Welch's t-test for significance.
"""
# Calculate statistics
control_mean = np.mean(control_values)
control_std = np.std(control_values, ddof=1)
treatment_mean = np.mean(treatment_values)
treatment_std = np.std(treatment_values, ddof=1)
# Welch's t-test
t_stat, p_value = stats.ttest_ind(
treatment_values, control_values, equal_var=False
)
# Confidence interval
se = np.sqrt(
(control_std**2 / len(control_values)) +
(treatment_std**2 / len(treatment_values))
)
df = self._welch_df(control_values, treatment_values)
t_crit = stats.t.ppf(1 - self.alpha / 2, df)
diff = treatment_mean - control_mean
ci_low = diff - t_crit * se
ci_high = diff + t_crit * se
# Relative lift
lift = diff / control_mean if control_mean != 0 else 0
return ExperimentResults(
control=VariantMetrics(
name="control",
samples=len(control_values),
conversions=0,
conversion_rate=0,
mean_value=control_mean,
std_value=control_std
),
treatment=VariantMetrics(
name="treatment",
samples=len(treatment_values),
conversions=0,
conversion_rate=0,
mean_value=treatment_mean,
std_value=treatment_std
),
relative_lift=lift,
p_value=p_value,
confidence_interval=(ci_low, ci_high),
is_significant=p_value < self.alpha,
required_samples=0,
power=0
)
def _required_sample_size(
self,
baseline_rate: float,
expected_rate: float,
power: float = 0.8
) -> int:
"""Calculate required sample size per variant."""
if baseline_rate == 0 or expected_rate == baseline_rate:
return 0
z_alpha = stats.norm.ppf(1 - self.alpha / 2)
z_beta = stats.norm.ppf(power)
p1, p2 = baseline_rate, expected_rate
p_avg = (p1 + p2) / 2
n = (
(z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) +
z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2
) / (p2 - p1)**2
return int(np.ceil(n))
def _calculate_power(
self,
p1: float,
p2: float,
n1: int,
n2: int
) -> float:
"""Calculate statistical power."""
if p1 == p2 or n1 == 0 or n2 == 0:
return 0
z_alpha = stats.norm.ppf(1 - self.alpha / 2)
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
effect = abs(p2 - p1)
z_power = (effect / se) - z_alpha
return float(stats.norm.cdf(z_power))
def _welch_df(self, a: np.ndarray, b: np.ndarray) -> float:
"""Calculate Welch-Satterthwaite degrees of freedom."""
var_a = np.var(a, ddof=1)
var_b = np.var(b, ddof=1)
n_a, n_b = len(a), len(b)
num = (var_a/n_a + var_b/n_b)**2
denom = (var_a/n_a)**2/(n_a-1) + (var_b/n_b)**2/(n_b-1)
return num / denom if denom > 0 else 1
Step 5: Multi-Armed Bandit
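For intuition before the implementations that follow, a self-contained epsilon-greedy run on two simulated arms (the reward probabilities 0.10 and 0.15 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.10, 0.15]   # hypothetical conversion rate per arm
epsilon = 0.1
counts = np.zeros(2)
values = np.zeros(2)

for _ in range(20_000):
    # Explore with probability epsilon, otherwise exploit the current best estimate
    arm = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(values))
    reward = float(rng.random() < true_rates[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

print(f"pulls per arm: {counts}, estimated rates: {values.round(3)}")
```

The better arm ends up receiving most of the traffic while the experiment runs, which is the appeal of bandits over a fixed 50/50 split; the cost is that the adaptively collected data is harder to analyze with the z-test above.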
"""Multi-armed bandit algorithms for adaptive experiments."""
from abc import ABC, abstractmethod
import numpy as np
from typing import Optional
class BanditAlgorithm(ABC):
"""Base class for bandit algorithms."""
@abstractmethod
def select_arm(self) -> int:
"""Select which arm to pull."""
pass
@abstractmethod
def update(self, arm: int, reward: float) -> None:
"""Update arm statistics with reward."""
pass
class EpsilonGreedy(BanditAlgorithm):
"""
Epsilon-greedy bandit algorithm.
Explores randomly with probability epsilon,
exploits best arm otherwise.
"""
def __init__(self, n_arms: int, epsilon: float = 0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
def select_arm(self) -> int:
"""Select arm using epsilon-greedy."""
if np.random.random() < self.epsilon:
return np.random.randint(self.n_arms)
return int(np.argmax(self.values))
def update(self, arm: int, reward: float) -> None:
"""Update arm value with incremental mean."""
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
class UCB1(BanditAlgorithm):
"""
Upper Confidence Bound (UCB1) algorithm.
Balances exploration and exploitation using
confidence bounds.
"""
def __init__(self, n_arms: int):
self.n_arms = n_arms
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
self.total_counts = 0
def select_arm(self) -> int:
"""Select arm with highest UCB."""
# Try each arm at least once
for arm in range(self.n_arms):
if self.counts[arm] == 0:
return arm
# Calculate UCB values
ucb_values = self.values + np.sqrt(
2 * np.log(self.total_counts) / self.counts
)
return int(np.argmax(ucb_values))
def update(self, arm: int, reward: float) -> None:
"""Update arm statistics."""
self.total_counts += 1
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
class ThompsonSampling(BanditAlgorithm):
"""
Thompson Sampling for Bernoulli bandits.
Uses Beta distribution for posterior sampling.
"""
def __init__(self, n_arms: int):
self.n_arms = n_arms
# Beta distribution parameters (successes, failures)
self.alpha = np.ones(n_arms) # Successes + 1
self.beta = np.ones(n_arms) # Failures + 1
def select_arm(self) -> int:
"""Select arm by sampling from posteriors."""
samples = np.array([
np.random.beta(self.alpha[i], self.beta[i])
for i in range(self.n_arms)
])
return int(np.argmax(samples))
def update(self, arm: int, reward: float) -> None:
"""Update Beta distribution parameters."""
if reward > 0:
self.alpha[arm] += 1
else:
self.beta[arm] += 1
def get_probabilities(self) -> np.ndarray:
"""Get current probability estimates."""
return self.alpha / (self.alpha + self.beta)
Step 6: FastAPI Application
"""FastAPI application for A/B testing."""
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
import os
import redis
import uuid
from .experiments import Experiment, Variant, ExperimentStatus, ExperimentStore
from .router import TrafficRouter
from .statistics import StatisticalAnalyzer
# Global instances (initialized during application startup)
store: Optional[ExperimentStore] = None
router: Optional[TrafficRouter] = None
analyzer: Optional[StatisticalAnalyzer] = None
redis_client: Optional[redis.Redis] = None
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Application lifespan."""
global store, router, analyzer, redis_client
# Read REDIS_HOST so the API container can reach the redis service in docker-compose
redis_client = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, db=0)
store = ExperimentStore(redis_client)
router = TrafficRouter(store)
analyzer = StatisticalAnalyzer(confidence_level=0.95)
yield
redis_client.close()
app = FastAPI(
title="A/B Testing API",
description="Statistical A/B testing for ML models",
lifespan=lifespan
)
# Request/Response Models
class CreateExperimentRequest(BaseModel):
name: str
description: str
control_model_id: str
treatment_model_id: str
traffic_percentage: float = 100.0
control_weight: float = 0.5
class RouteRequest(BaseModel):
user_id: str
experiment_id: Optional[str] = None
class RouteResponse(BaseModel):
experiment_id: Optional[str]
variant_name: Optional[str]
model_id: Optional[str]
class LogEventRequest(BaseModel):
experiment_id: str
user_id: str
variant_name: str
converted: bool
value: Optional[float] = None
class ExperimentResultsResponse(BaseModel):
experiment_id: str
control_samples: int
treatment_samples: int
control_conversion_rate: float
treatment_conversion_rate: float
relative_lift: float
p_value: float
is_significant: bool
confidence_interval: tuple[float, float]
@app.post("/experiments", response_model=dict)
async def create_experiment(request: CreateExperimentRequest):
"""Create a new A/B test experiment."""
experiment = Experiment(
id=str(uuid.uuid4())[:8],
name=request.name,
description=request.description,
variants=[
Variant(
name="control",
model_id=request.control_model_id,
weight=request.control_weight
),
Variant(
name="treatment",
model_id=request.treatment_model_id,
weight=1 - request.control_weight
)
],
traffic_percentage=request.traffic_percentage
)
store.save(experiment)
return experiment.to_dict()
@app.post("/experiments/{experiment_id}/start")
async def start_experiment(experiment_id: str):
"""Start an experiment."""
experiment = store.get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
experiment.status = ExperimentStatus.RUNNING
store.save(experiment)
return {"status": "started"}
@app.post("/experiments/{experiment_id}/stop")
async def stop_experiment(experiment_id: str):
"""Stop an experiment."""
experiment = store.get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
experiment.status = ExperimentStatus.STOPPED
store.save(experiment)
return {"status": "stopped"}
@app.post("/route", response_model=RouteResponse)
async def route_request(request: RouteRequest):
"""Route a user to an experiment variant."""
exp_id, variant = router.route(request.user_id, request.experiment_id)
if not variant:
return RouteResponse(
experiment_id=None,
variant_name=None,
model_id=None
)
return RouteResponse(
experiment_id=exp_id,
variant_name=variant.name,
model_id=variant.model_id
)
@app.post("/log")
async def log_event(request: LogEventRequest):
"""Log a conversion event."""
key = f"events:{request.experiment_id}:{request.variant_name}"
import json  # needed for json.dumps below
# Store the raw event as JSON so it can be parsed later (str(dict) is not valid JSON)
event = {
"user_id": request.user_id,
"converted": request.converted,
"value": request.value
}
redis_client.rpush(key, json.dumps(event))
# Update counts
count_key = f"counts:{request.experiment_id}:{request.variant_name}"
redis_client.hincrby(count_key, "samples", 1)
if request.converted:
redis_client.hincrby(count_key, "conversions", 1)
return {"status": "logged"}
@app.get("/experiments/{experiment_id}/results", response_model=ExperimentResultsResponse)
async def get_results(experiment_id: str):
"""Get experiment results with statistical analysis."""
experiment = store.get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
# Get counts for each variant
control_key = f"counts:{experiment_id}:control"
treatment_key = f"counts:{experiment_id}:treatment"
control_data = redis_client.hgetall(control_key)
treatment_data = redis_client.hgetall(treatment_key)
control_samples = int(control_data.get(b"samples", 0))
control_conversions = int(control_data.get(b"conversions", 0))
treatment_samples = int(treatment_data.get(b"samples", 0))
treatment_conversions = int(treatment_data.get(b"conversions", 0))
if control_samples == 0 or treatment_samples == 0:
raise HTTPException(
status_code=400,
detail="Insufficient data for analysis"
)
# Analyze results
results = analyzer.analyze_conversion(
control_conversions, control_samples,
treatment_conversions, treatment_samples
)
return ExperimentResultsResponse(
experiment_id=experiment_id,
control_samples=control_samples,
treatment_samples=treatment_samples,
control_conversion_rate=results.control.conversion_rate,
treatment_conversion_rate=results.treatment.conversion_rate,
relative_lift=results.relative_lift,
p_value=results.p_value,
is_significant=results.is_significant,
confidence_interval=results.confidence_interval
)
@app.get("/experiments/{experiment_id}")
async def get_experiment(experiment_id: str):
"""Get experiment details."""
experiment = store.get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
return experiment.to_dict()
Step 7: Docker Compose
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
environment:
- REDIS_HOST=redis
depends_on:
- redis
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis-data:/data
volumes:
redis-data:
Usage Example
# Create experiment
curl -X POST http://localhost:8000/experiments \
-H "Content-Type: application/json" \
-d '{
"name": "Model v2 Test",
"description": "Testing new model version",
"control_model_id": "model-v1",
"treatment_model_id": "model-v2",
"traffic_percentage": 50
}'
# Start experiment
curl -X POST http://localhost:8000/experiments/abc123/start
# Route user to variant
curl -X POST http://localhost:8000/route \
-H "Content-Type: application/json" \
-d '{"user_id": "user-456"}'
# Log conversion
curl -X POST http://localhost:8000/log \
-H "Content-Type: application/json" \
-d '{
"experiment_id": "abc123",
"user_id": "user-456",
"variant_name": "treatment",
"converted": true
}'
# Get results
curl http://localhost:8000/experiments/abc123/results
Statistical Considerations
| Factor | Recommendation |
|---|---|
| Sample size | Compute from baseline rate and minimum detectable effect (often thousands per variant) |
| Duration | 1-2 weeks minimum, covering full weekly cycles |
| Significance | p < 0.05 (95% confidence) |
| Power | 80% minimum |
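To make the sample-size row concrete, here is the same formula as `_required_sample_size` in Step 4, applied to a hypothetical 15% → 18% lift:

```python
import numpy as np
from scipy import stats

def required_sample_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-variant n for a two-proportion z-test (same formula as _required_sample_size)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for two-sided test
    z_beta = stats.norm.ppf(power)            # critical value for desired power
    p_avg = (p1 + p2) / 2
    n = (z_alpha * np.sqrt(2 * p_avg * (1 - p_avg))
         + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return int(np.ceil(n))

n = required_sample_size(0.15, 0.18)
print(n)  # ≈ 2400 per variant
```

Detecting smaller lifts grows n quadratically: halving the detectable lift roughly quadruples the required samples, so a flat "1000 per variant" rule can leave an experiment underpowered.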
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Consistent Hashing | Same user always gets same variant | Stable experience, valid comparison |
| Two-Proportion Z-Test | Compare conversion rates statistically | Know if difference is real or noise |
| P-Value | Probability of a difference at least this large if there were no real effect | p < 0.05 is the conventional significance threshold |
| Statistical Power | Probability of detecting real effect | 80% power = won't miss true improvements |
| Confidence Interval | Range of plausible effect sizes | Know the magnitude, not just direction |
| Epsilon-Greedy | Explore ε%, exploit (1-ε)% best arm | Simple bandit, tunable exploration |
| UCB1 | Balance mean reward + uncertainty bonus | Optimistic exploration, good regret |
| Thompson Sampling | Sample from posterior, pick max | Bayesian approach, often best empirically |
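The bandit rows above can be seen in action with a short simulation of Thompson sampling on two arms with assumed conversion rates of 0.10 and 0.15:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.10, 0.15]   # hypothetical per-arm conversion rates
alpha = np.ones(2)          # Beta posterior: successes + 1
beta = np.ones(2)           # Beta posterior: failures + 1
pulls = np.zeros(2)

for _ in range(10_000):
    # Sample a plausible rate for each arm from its posterior, play the max
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    pulls[arm] += 1
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(f"pulls: {pulls}, posterior means: {(alpha / (alpha + beta)).round(3)}")
```

Most pulls concentrate on the better arm while its posterior mean stays close to the true rate; UCB1 or epsilon-greedy from Step 5 could be dropped into the same loop for comparison.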
Next Steps
- Complete Pipeline - Integrate with CI/CD
- Model Registry - Version your models