Prompt Engineering
Master systematic prompt engineering with templates, versioning, testing, and advanced techniques like Chain-of-Thought
| Property | Value |
|---|---|
| Difficulty | Intermediate |
| Time | ~5 hours |
| Code Size | ~450 LOC |
| Prerequisites | Chatbot |
TL;DR
Build reusable prompt templates with Jinja2, add few-shot examples for consistent outputs, and use Chain-of-Thought (CoT) for complex reasoning. Version your prompts in SQLite, A/B test variations, and measure consistency—treat prompts as code.
Tech Stack
| Technology | Purpose |
|---|---|
| LangChain | Prompt templates |
| Jinja2 | Advanced templating |
| Pydantic | Prompt validation |
| SQLite | Version storage |
| OpenAI | LLM provider |
Prerequisites
- Python 3.10+
- OpenAI API key
pip install langchain langchain-openai jinja2 pydantic pydantic-settings numpy openai

What You'll Learn
- Design effective prompt templates
- Implement few-shot learning with dynamic examples
- Build Chain-of-Thought prompting systems
- Version control and A/B test prompts
- Measure and optimize prompt performance
The Problem: Ad-Hoc Prompting Doesn't Scale
| Issue | Impact |
|---|---|
| Hardcoded prompts | Can't iterate or experiment |
| No versioning | Lost track of what worked |
| No testing | Unknown reliability |
| No examples | Inconsistent output format |
| No structure | Repeated prompt rewriting |
┌─────────────────────────────────────────────────────────────────────────────┐
│ AD-HOC vs SYSTEMATIC PROMPTING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Ad-Hoc Prompting Systematic Prompting │
│ ───────────────── ──────────────────── │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Hardcoded String │ │ Template │ │
│ │ "Summarize this" │ │ "{{ style }} │ │
│ └────────┬─────────┘ │ summary of │ │
│ │ │ {{ text }}" │ │
│ ▼ └────────┬─────────┘ │
│ ┌──────────────────┐ │ │
│ │ LLM │ ▼ │
│ └────────┬─────────┘ ┌──────────────────┐ │
│ │ │ Variables │ │
│ ▼ │ style="bullet" │ │
│ ┌──────────────────┐ │ text="..." │ │
│ │ Unpredictable │ └────────┬─────────┘ │
│ │ Output │ │ │
│ │ (varies each │ ▼ │
│ │ time) │ ┌──────────────────┐ │
│ └──────────────────┘ │ Examples │ │
│ │ (few-shot) │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ LLM │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Consistent │ │
│ │ Output │ │
│ │ (predictable │ │
│ │ format) │ │
│ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Project Structure
prompt-engineering/
├── config.py # Configuration
├── templates.py # Prompt template system
├── few_shot.py # Few-shot example management
├── chain_of_thought.py # CoT prompting
├── versioning.py # Prompt version control
├── testing.py # Prompt testing framework
├── registry.py # Prompt registry
└── requirements.txt

Step 1: Configuration
# config.py
from pydantic_settings import BaseSettings
from functools import lru_cache
class Settings(BaseSettings):
openai_api_key: str
default_model: str = "gpt-4o"
temperature: float = 0.7
max_tokens: int = 2000
# Versioning
db_path: str = "prompts.db"
enable_versioning: bool = True
# Testing
test_iterations: int = 5
consistency_threshold: float = 0.8
class Config:
env_file = ".env"
@lru_cache
def get_settings() -> Settings:
    return Settings()

Step 2: Prompt Template System
Build a flexible template system that separates structure from content:
# templates.py
from typing import Dict, Any, List, Optional
from dataclasses import dataclass, field
from enum import Enum
from jinja2 import Environment, BaseLoader
from pydantic import BaseModel
class PromptRole(str, Enum):
SYSTEM = "system"
USER = "user"
ASSISTANT = "assistant"
class PromptVariable(BaseModel):
"""Define a template variable with validation."""
name: str
description: str
required: bool = True
default: Optional[str] = None
examples: List[str] = []
def validate_value(self, value: Any) -> bool:
if self.required and value is None:
return False
return True
@dataclass
class PromptTemplate:
"""
A structured prompt template with variables and metadata.
Supports Jinja2 templating for complex logic.
"""
name: str
template: str
role: PromptRole = PromptRole.USER
variables: List[PromptVariable] = field(default_factory=list)
description: str = ""
version: str = "1.0.0"
tags: List[str] = field(default_factory=list)
def __post_init__(self):
self._jinja_env = Environment(loader=BaseLoader())
self._compiled = self._jinja_env.from_string(self.template)
def render(self, **kwargs) -> str:
"""
Render template with provided variables.
Args:
**kwargs: Variable values
Returns:
Rendered prompt string
Raises:
ValueError: If required variables are missing
"""
# Validate required variables
for var in self.variables:
if var.required and var.name not in kwargs:
if var.default is not None:
kwargs[var.name] = var.default
else:
raise ValueError(f"Required variable '{var.name}' not provided")
return self._compiled.render(**kwargs)
def get_variable_names(self) -> List[str]:
"""Get list of variable names."""
return [v.name for v in self.variables]
def to_dict(self) -> Dict[str, Any]:
"""Serialize template to dict."""
return {
"name": self.name,
"template": self.template,
"role": self.role.value,
            "variables": [v.model_dump() for v in self.variables],
"description": self.description,
"version": self.version,
"tags": self.tags
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PromptTemplate":
"""Deserialize from dict."""
data["role"] = PromptRole(data["role"])
data["variables"] = [PromptVariable(**v) for v in data.get("variables", [])]
return cls(**data)
class PromptBuilder:
"""
Fluent builder for constructing prompts.
Provides a clean API for building complex prompts.
"""
def __init__(self, name: str):
self.name = name
self._template_parts: List[str] = []
self._variables: List[PromptVariable] = []
self._role = PromptRole.USER
self._description = ""
self._tags: List[str] = []
def with_role(self, role: PromptRole) -> "PromptBuilder":
self._role = role
return self
def with_description(self, desc: str) -> "PromptBuilder":
self._description = desc
return self
def add_section(self, content: str) -> "PromptBuilder":
"""Add a static section to the prompt."""
self._template_parts.append(content)
return self
def add_variable(
self,
name: str,
description: str = "",
required: bool = True,
default: Optional[str] = None
) -> "PromptBuilder":
"""Add a variable placeholder."""
self._template_parts.append("{{ " + name + " }}")
self._variables.append(PromptVariable(
name=name,
description=description,
required=required,
default=default
))
return self
def add_conditional(
self,
condition: str,
content: str,
else_content: str = ""
) -> "PromptBuilder":
"""Add conditional content."""
template = f"{{% if {condition} %}}{content}"
if else_content:
template += f"{{% else %}}{else_content}"
template += "{% endif %}"
self._template_parts.append(template)
return self
def add_loop(
self,
items_var: str,
item_template: str,
separator: str = "\n"
) -> "PromptBuilder":
"""Add a loop over items."""
template = f"{{% for item in {items_var} %}}{item_template}"
template += f"{{% if not loop.last %}}{separator}{{% endif %}}"
template += "{% endfor %}"
self._template_parts.append(template)
return self
def with_tags(self, *tags: str) -> "PromptBuilder":
self._tags.extend(tags)
return self
def build(self) -> PromptTemplate:
"""Build the final PromptTemplate."""
return PromptTemplate(
name=self.name,
template="\n".join(self._template_parts),
role=self._role,
variables=self._variables,
description=self._description,
tags=self._tags
        )

Understanding the Template Builder Pattern:
┌─────────────────────────────────────────────────────────────────────────────┐
│ FLUENT BUILDER IN ACTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Construction (hard to read): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ template = PromptTemplate( │ │
│ │ name="classify", │ │
│ │ template="Classify: {{ text }}\n{% if examples %}...", │ │
│ │ role=PromptRole.USER, │ │
│ │ variables=[PromptVariable(name="text", ...), ...], │ │
│ │ tags=["classification"] │ │
│ │ ) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Fluent Builder (readable, step-by-step): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ template = ( │ │
│ │ PromptBuilder("classify") │ │
│ │ .with_role(PromptRole.USER) │ │
│ │ .add_section("Classify the following text:") │ │
│ │ .add_variable("text", description="Text to classify") │ │
│ │ .add_conditional("examples", "Examples: {{ examples }}") │ │
│ │ .with_tags("classification") │ │
│ │ .build() │ │
│ │ ) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Each method returns `self`, enabling method chaining. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Jinja2 Template Features:
| Feature | Syntax | Use Case |
|---|---|---|
| Variable | {{ var }} | Insert dynamic content |
| Conditional | {% if x %}...{% endif %} | Optional sections |
| Loop | {% for item in items %} | List few-shot examples |
| Filter | {{ text|upper }} | Transform content |
| Default | {{ var|default('N/A') }} | Fallback values |
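The features in the table can be exercised directly with plain Jinja2. A minimal sketch (the template string and variable names are illustrative, not part of the project code):

```python
# Demonstrates the four Jinja2 features used throughout this project:
# variables, conditionals, loops, and filters (including default fallbacks).
from jinja2 import Environment, BaseLoader

env = Environment(loader=BaseLoader())

template = env.from_string(
    "{{ style|default('concise')|upper }} summary requested.\n"
    "{% if examples %}Examples provided: {{ examples|length }}\n{% endif %}"
    "{% for item in items %}- {{ item }}\n{% endfor %}"
)

# "style" is omitted, so the default filter supplies "concise" before upper-casing.
rendered = template.render(items=["alpha", "beta"], examples=["one"])
print(rendered)
# CONCISE summary requested.
# Examples provided: 1
# - alpha
# - beta
```

Because the `default` filter runs before `upper`, a missing variable still produces well-formed output instead of an empty string.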
# Pre-built common templates
COMMON_TEMPLATES = {
"summarize": PromptTemplate(
name="summarize",
template="""Summarize the following text in {{ style }} style.
Text:
{{ text }}
Summary:""",
variables=[
PromptVariable(name="text", description="Text to summarize", required=True),
PromptVariable(name="style", description="Summary style", default="concise")
],
description="General-purpose summarization template"
),
"analyze": PromptTemplate(
name="analyze",
template="""Analyze the following {{ content_type }} and provide insights.
Content:
{{ content }}
Focus areas:
{% for area in focus_areas %}- {{ area }}
{% endfor %}
Analysis:""",
variables=[
PromptVariable(name="content", description="Content to analyze", required=True),
PromptVariable(name="content_type", description="Type of content", default="text"),
PromptVariable(name="focus_areas", description="Areas to focus on", required=False)
],
description="Content analysis template"
),
"extract": PromptTemplate(
name="extract",
template="""Extract the following information from the text:
{% for field in fields %}- {{ field.name }}: {{ field.description }}
{% endfor %}
Text:
{{ text }}
Extracted Information (JSON):""",
variables=[
PromptVariable(name="text", description="Source text", required=True),
PromptVariable(name="fields", description="Fields to extract", required=True)
],
description="Structured information extraction"
)
}

★ Insight ─────────────────────────────────────
The PromptBuilder pattern separates prompt construction from execution. This enables: 1) Reusable prompt components, 2) Type-safe variable handling, 3) Easy A/B testing by swapping templates. Jinja2 templating adds conditionals and loops without string concatenation.
─────────────────────────────────────────────────
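To see how rendering behaves end to end, here is a standalone sketch that mirrors the validation logic of `PromptTemplate.render` (re-implemented inline with plain dicts so it runs without the `templates` module; the template follows the shape of the summarize entry above):

```python
# Sketch of render-time validation: required variables without a default
# raise ValueError; required variables with a default fall back silently.
from jinja2 import Environment, BaseLoader

variables = [
    {"name": "text", "required": True, "default": None},
    {"name": "style", "required": True, "default": "concise"},
]
source = (
    "Summarize the following text in {{ style }} style.\n\n"
    "Text:\n{{ text }}\n\nSummary:"
)
compiled = Environment(loader=BaseLoader()).from_string(source)

def render(**kwargs):
    # Same check as PromptTemplate.render: fill defaults, then fail loudly.
    for var in variables:
        if var["required"] and var["name"] not in kwargs:
            if var["default"] is not None:
                kwargs[var["name"]] = var["default"]
            else:
                raise ValueError(f"Required variable '{var['name']}' not provided")
    return compiled.render(**kwargs)

print(render(text="Quarterly revenue grew 12%..."))  # "style" falls back to "concise"
```

Calling `render()` with no `text` raises immediately, which surfaces template misuse at build time rather than as a malformed prompt sent to the LLM.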
Step 3: Few-Shot Learning
Dynamic example selection for consistent outputs:
# few_shot.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum
import random
from langchain_openai import OpenAIEmbeddings
import numpy as np
@dataclass
class Example:
"""A single few-shot example."""
input: str
output: str
    metadata: Optional[Dict[str, Any]] = None
def format(self, input_prefix: str = "Input", output_prefix: str = "Output") -> str:
return f"{input_prefix}: {self.input}\n{output_prefix}: {self.output}"
class SelectionStrategy(str, Enum):
RANDOM = "random"
SEMANTIC = "semantic"
DIVERSE = "diverse"
RECENT = "recent"
class FewShotSelector:
"""
Select optimal examples for few-shot prompting.
Strategies:
- Random: Simple random selection
- Semantic: Most similar to query (embedding-based)
- Diverse: Maximize coverage of different patterns
- Recent: Most recently added examples
"""
def __init__(
self,
examples: List[Example],
strategy: SelectionStrategy = SelectionStrategy.SEMANTIC
):
self.examples = examples
self.strategy = strategy
self._embeddings = None
self._example_vectors = None
def _init_embeddings(self):
"""Initialize embedding model for semantic selection."""
if self._embeddings is None:
from config import get_settings
settings = get_settings()
self._embeddings = OpenAIEmbeddings(
openai_api_key=settings.openai_api_key
)
# Pre-compute example embeddings
texts = [ex.input for ex in self.examples]
self._example_vectors = self._embeddings.embed_documents(texts)
def select(
self,
query: str,
k: int = 3,
strategy: Optional[SelectionStrategy] = None
) -> List[Example]:
"""
Select k examples based on strategy.
Args:
query: The input query to match against
k: Number of examples to select
strategy: Override default strategy
Returns:
List of selected examples
"""
strategy = strategy or self.strategy
k = min(k, len(self.examples))
if strategy == SelectionStrategy.RANDOM:
return self._select_random(k)
elif strategy == SelectionStrategy.SEMANTIC:
return self._select_semantic(query, k)
elif strategy == SelectionStrategy.DIVERSE:
return self._select_diverse(query, k)
elif strategy == SelectionStrategy.RECENT:
return self._select_recent(k)
return self._select_random(k)
def _select_random(self, k: int) -> List[Example]:
"""Random selection."""
return random.sample(self.examples, k)
def _select_semantic(self, query: str, k: int) -> List[Example]:
"""Select most semantically similar examples."""
self._init_embeddings()
query_vector = self._embeddings.embed_query(query)
# Compute cosine similarities
similarities = []
for i, ex_vector in enumerate(self._example_vectors):
sim = np.dot(query_vector, ex_vector) / (
np.linalg.norm(query_vector) * np.linalg.norm(ex_vector)
)
similarities.append((i, sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
# Return top k
return [self.examples[i] for i, _ in similarities[:k]]
def _select_diverse(self, query: str, k: int) -> List[Example]:
"""Select diverse examples using MMR-like algorithm."""
self._init_embeddings()
query_vector = self._embeddings.embed_query(query)
# Start with most similar
similarities = []
for i, ex_vector in enumerate(self._example_vectors):
sim = np.dot(query_vector, ex_vector) / (
np.linalg.norm(query_vector) * np.linalg.norm(ex_vector)
)
similarities.append((i, sim))
similarities.sort(key=lambda x: x[1], reverse=True)
selected_indices = [similarities[0][0]]
selected = [self.examples[similarities[0][0]]]
# Iteratively select diverse examples
lambda_param = 0.5 # Balance relevance and diversity
while len(selected) < k:
best_score = -1
best_idx = -1
for i, _ in enumerate(self.examples):
if i in selected_indices:
continue
# Relevance to query
relevance = next(s for idx, s in similarities if idx == i)
# Max similarity to already selected (for diversity)
max_sim_to_selected = max(
np.dot(self._example_vectors[i], self._example_vectors[j]) / (
np.linalg.norm(self._example_vectors[i]) *
np.linalg.norm(self._example_vectors[j])
)
for j in selected_indices
)
# MMR score
score = lambda_param * relevance - (1 - lambda_param) * max_sim_to_selected
if score > best_score:
best_score = score
best_idx = i
if best_idx >= 0:
selected_indices.append(best_idx)
selected.append(self.examples[best_idx])
return selected
def _select_recent(self, k: int) -> List[Example]:
"""Select most recent examples."""
return self.examples[-k:]
def add_example(self, example: Example) -> None:
"""Add a new example."""
self.examples.append(example)
# Invalidate cached vectors
self._example_vectors = None
class FewShotPrompt:
"""Build prompts with dynamic few-shot examples."""
def __init__(
self,
task_description: str,
examples: List[Example],
input_prefix: str = "Input",
output_prefix: str = "Output",
selection_strategy: SelectionStrategy = SelectionStrategy.SEMANTIC
):
self.task_description = task_description
self.selector = FewShotSelector(examples, selection_strategy)
self.input_prefix = input_prefix
self.output_prefix = output_prefix
def format(
self,
query: str,
num_examples: int = 3
) -> str:
"""
Format prompt with selected examples.
Args:
query: User query
num_examples: Number of examples to include
Returns:
Formatted prompt string
"""
# Select examples
selected = self.selector.select(query, k=num_examples)
# Build prompt
parts = [self.task_description, "\nExamples:\n"]
for ex in selected:
parts.append(ex.format(self.input_prefix, self.output_prefix))
parts.append("")
parts.append(f"\n{self.input_prefix}: {query}")
parts.append(f"{self.output_prefix}:")
        return "\n".join(parts)

Step 4: Chain-of-Thought Prompting
Guide the model through step-by-step reasoning:
# chain_of_thought.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from config import get_settings
class ReasoningStyle(str, Enum):
STEP_BY_STEP = "step_by_step"
THINK_ALOUD = "think_aloud"
STRUCTURED = "structured"
SOCRATIC = "socratic"
@dataclass
class ReasoningStep:
"""A single reasoning step."""
step_number: int
thought: str
action: Optional[str] = None
observation: Optional[str] = None
@dataclass
class ChainOfThoughtResult:
"""Complete CoT result."""
question: str
reasoning_steps: List[ReasoningStep]
final_answer: str
confidence: float
class ChainOfThoughtBuilder:
"""
Build Chain-of-Thought prompts for complex reasoning.
Supports multiple reasoning styles:
- Step-by-step: Numbered steps
- Think aloud: Stream of consciousness
- Structured: Specific framework (e.g., pros/cons)
- Socratic: Question-based exploration
"""
STYLE_PROMPTS = {
ReasoningStyle.STEP_BY_STEP: """
Let's solve this step by step:
Step 1: [First consideration]
Step 2: [Next logical step]
...
Final Answer: [Conclusion]
""",
ReasoningStyle.THINK_ALOUD: """
Let me think through this...
First, I notice that...
This makes me think...
Considering all of this...
Therefore: [Conclusion]
""",
ReasoningStyle.STRUCTURED: """
Let me analyze this systematically:
1. Understanding the problem:
- Key elements: ...
- Constraints: ...
2. Possible approaches:
- Option A: ...
- Option B: ...
3. Evaluation:
- Pros/Cons of each...
4. Conclusion:
Based on this analysis: [Answer]
""",
ReasoningStyle.SOCRATIC: """
Let me explore this through questions:
Q1: What is the core issue here?
A1: ...
Q2: What assumptions am I making?
A2: ...
Q3: What would change if...?
A3: ...
Given these insights: [Conclusion]
"""
}
def __init__(self, style: ReasoningStyle = ReasoningStyle.STEP_BY_STEP):
self.style = style
self.settings = get_settings()
self.llm = ChatOpenAI(
model=self.settings.default_model,
api_key=self.settings.openai_api_key,
temperature=0.3 # Lower for reasoning
)
def build_prompt(
self,
question: str,
context: Optional[str] = None,
style: Optional[ReasoningStyle] = None
) -> str:
"""Build a CoT prompt for the question."""
style = style or self.style
style_guide = self.STYLE_PROMPTS[style]
prompt_parts = []
if context:
prompt_parts.append(f"Context:\n{context}\n")
prompt_parts.append(f"Question: {question}")
prompt_parts.append(f"\n{style_guide}")
return "\n".join(prompt_parts)
def reason(
self,
question: str,
context: Optional[str] = None,
style: Optional[ReasoningStyle] = None
) -> ChainOfThoughtResult:
"""
Execute Chain-of-Thought reasoning.
Args:
question: The question to reason about
context: Optional context information
style: Reasoning style to use
Returns:
ChainOfThoughtResult with steps and answer
"""
style = style or self.style
system_prompt = """You are a careful, methodical thinker.
When answering questions, show your reasoning process explicitly.
Break down complex problems into manageable steps.
Be honest about uncertainty."""
user_prompt = self.build_prompt(question, context, style)
messages = [
SystemMessage(content=system_prompt),
HumanMessage(content=user_prompt)
]
response = self.llm.invoke(messages)
content = response.content
# Parse response into steps
steps = self._parse_reasoning(content, style)
final_answer = self._extract_answer(content)
confidence = self._estimate_confidence(content)
return ChainOfThoughtResult(
question=question,
reasoning_steps=steps,
final_answer=final_answer,
confidence=confidence
)
def _parse_reasoning(
self,
content: str,
style: ReasoningStyle
) -> List[ReasoningStep]:
"""Parse reasoning steps from response."""
steps = []
lines = content.split('\n')
step_num = 0
current_thought = []
for line in lines:
line = line.strip()
# Detect step markers
is_step_marker = (
line.startswith('Step ') or
line.startswith('1.') or
line.startswith('Q1:') or
(line.startswith('-') and step_num == 0)
)
if is_step_marker and current_thought:
steps.append(ReasoningStep(
step_number=step_num,
thought='\n'.join(current_thought)
))
step_num += 1
current_thought = [line]
elif line:
current_thought.append(line)
if current_thought:
steps.append(ReasoningStep(
step_number=step_num,
thought='\n'.join(current_thought)
))
return steps
def _extract_answer(self, content: str) -> str:
"""Extract final answer from response."""
markers = [
'Final Answer:', 'Therefore:', 'Conclusion:',
'Given these insights:', 'Based on this analysis:',
'The answer is:', 'In conclusion:'
]
for marker in markers:
if marker in content:
answer = content.split(marker)[-1].strip()
# Take first paragraph
return answer.split('\n\n')[0].strip()
# Fallback: last paragraph
paragraphs = content.strip().split('\n\n')
return paragraphs[-1] if paragraphs else content
def _estimate_confidence(self, content: str) -> float:
"""Estimate confidence based on language used."""
low_confidence_markers = [
'uncertain', 'not sure', 'might be', 'possibly',
'I think', 'it seems', 'probably', 'unclear'
]
high_confidence_markers = [
'definitely', 'certainly', 'clearly', 'must be',
'obviously', 'without doubt', 'always', 'never'
]
content_lower = content.lower()
low_count = sum(1 for m in low_confidence_markers if m in content_lower)
high_count = sum(1 for m in high_confidence_markers if m in content_lower)
# Base confidence
base = 0.7
# Adjust based on markers
confidence = base - (low_count * 0.05) + (high_count * 0.05)
return max(0.1, min(1.0, confidence))
class SelfConsistencyReasoner:
"""
Implement self-consistency for improved accuracy.
Generates multiple reasoning paths and selects the most common answer.
"""
def __init__(self, num_samples: int = 5):
self.num_samples = num_samples
self.cot = ChainOfThoughtBuilder()
self.settings = get_settings()
def reason(
self,
question: str,
context: Optional[str] = None
) -> Dict[str, Any]:
"""
Generate multiple reasoning paths and find consensus.
Args:
question: Question to answer
context: Optional context
Returns:
Dict with answer, confidence, and all paths
"""
# Generate multiple reasoning paths
results = []
        # Use different temperatures for diverse reasoning paths
        # (cycle through the presets so num_samples > 5 still works)
        temperatures = [(0.3, 0.5, 0.7, 0.5, 0.3)[i % 5] for i in range(self.num_samples)]
for temp in temperatures:
self.cot.llm.temperature = temp
result = self.cot.reason(question, context)
results.append(result)
# Extract answers and count
answer_counts: Dict[str, int] = {}
for result in results:
answer = result.final_answer.strip().lower()
# Normalize answer
answer_key = self._normalize_answer(answer)
answer_counts[answer_key] = answer_counts.get(answer_key, 0) + 1
# Find most common
best_answer = max(answer_counts.keys(), key=lambda k: answer_counts[k])
confidence = answer_counts[best_answer] / self.num_samples
# Get best result with this answer
best_result = next(
r for r in results
if self._normalize_answer(r.final_answer) == best_answer
)
return {
"answer": best_result.final_answer,
"confidence": confidence,
"consensus_count": answer_counts[best_answer],
"total_samples": self.num_samples,
"reasoning": best_result.reasoning_steps,
"all_answers": answer_counts
}
def _normalize_answer(self, answer: str) -> str:
"""Normalize answer for comparison."""
# Simple normalization - could be enhanced
normalized = answer.lower().strip()
# Remove common prefixes
prefixes = ['the answer is', 'therefore', 'so', 'thus']
for prefix in prefixes:
if normalized.startswith(prefix):
normalized = normalized[len(prefix):].strip()
        return normalized[:100]  # First 100 chars for comparison

★ Insight ─────────────────────────────────────
Self-consistency dramatically improves CoT accuracy. Instead of relying on a single reasoning path, we sample multiple paths with varying temperatures and select the most common answer. This catches reasoning errors that might occur in any single attempt.
─────────────────────────────────────────────────
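The voting step can be isolated from the LLM calls. A minimal sketch of the normalize-and-count logic used by `SelfConsistencyReasoner` (the sampled answers here are hypothetical stand-ins for real reasoning paths):

```python
# Majority voting over normalized answers: the core of self-consistency.
from collections import Counter

def normalize(answer: str) -> str:
    # Lowercase, strip, and drop common conclusion prefixes so that
    # "The answer is 42." and "Therefore 42." count as the same answer.
    normalized = answer.lower().strip()
    for prefix in ("the answer is", "therefore", "so", "thus"):
        if normalized.startswith(prefix):
            normalized = normalized[len(prefix):].strip()
    return normalized[:100]

def majority_vote(answers: list[str]) -> dict:
    counts = Counter(normalize(a) for a in answers)
    best, count = counts.most_common(1)[0]
    return {"answer": best, "confidence": count / len(answers)}

# Five hypothetical reasoning paths; three agree after normalization.
samples = ["The answer is 42.", "Therefore 42.", "42.", "41.", "43."]
print(majority_vote(samples))  # {'answer': '42.', 'confidence': 0.6}
```

The consensus fraction doubles as a confidence score: unanimous paths give 1.0, while a split vote flags the question as one where a single reasoning pass is unreliable.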
Step 5: Prompt Versioning
Track and manage prompt versions:
# versioning.py
import sqlite3
import json
from datetime import datetime
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
import hashlib
from templates import PromptTemplate
@dataclass
class PromptVersion:
"""A versioned prompt with metadata."""
id: int
name: str
version: str
template: PromptTemplate
created_at: datetime
metrics: Dict[str, float]
is_active: bool
class PromptVersionStore:
"""
SQLite-based prompt version storage.
Features:
- Version history for all prompts
- Performance metrics tracking
- Rollback capability
- A/B test assignment
"""
def __init__(self, db_path: str = "prompts.db"):
self.db_path = db_path
self._init_db()
def _init_db(self):
"""Initialize database schema."""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS prompts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
version TEXT NOT NULL,
template_json TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
is_active BOOLEAN DEFAULT FALSE,
metrics_json TEXT DEFAULT '{}',
hash TEXT NOT NULL,
UNIQUE(name, version)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS prompt_metrics (
id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id INTEGER NOT NULL,
metric_name TEXT NOT NULL,
metric_value REAL NOT NULL,
recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (prompt_id) REFERENCES prompts(id)
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_prompt_name
ON prompts(name)
""")
def save(
self,
template: PromptTemplate,
version: Optional[str] = None
) -> PromptVersion:
"""
Save a prompt template version.
Args:
template: PromptTemplate to save
version: Optional version override
Returns:
Created PromptVersion
"""
version = version or template.version
template_json = json.dumps(template.to_dict())
template_hash = hashlib.md5(template_json.encode()).hexdigest()
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
INSERT INTO prompts (name, version, template_json, hash)
VALUES (?, ?, ?, ?)
ON CONFLICT(name, version) DO UPDATE SET
template_json = excluded.template_json,
hash = excluded.hash
RETURNING id, created_at
""", (template.name, version, template_json, template_hash))
row = cursor.fetchone()
return PromptVersion(
id=row[0],
name=template.name,
version=version,
template=template,
created_at=datetime.fromisoformat(row[1]) if isinstance(row[1], str) else row[1],
metrics={},
is_active=False
)
def get(
self,
name: str,
version: Optional[str] = None
) -> Optional[PromptTemplate]:
"""
Get a prompt template by name and version.
Args:
name: Prompt name
version: Version (latest if None)
Returns:
PromptTemplate or None
"""
with sqlite3.connect(self.db_path) as conn:
if version:
cursor = conn.execute("""
SELECT template_json FROM prompts
WHERE name = ? AND version = ?
""", (name, version))
else:
# Get latest version
cursor = conn.execute("""
SELECT template_json FROM prompts
WHERE name = ?
ORDER BY created_at DESC
LIMIT 1
""", (name,))
row = cursor.fetchone()
if row:
data = json.loads(row[0])
return PromptTemplate.from_dict(data)
return None
def get_active(self, name: str) -> Optional[PromptTemplate]:
"""Get the active version of a prompt."""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
SELECT template_json FROM prompts
WHERE name = ? AND is_active = TRUE
""", (name,))
row = cursor.fetchone()
if row:
data = json.loads(row[0])
return PromptTemplate.from_dict(data)
return None
def set_active(self, name: str, version: str) -> bool:
"""Set a version as active (for production use)."""
with sqlite3.connect(self.db_path) as conn:
# Deactivate all versions
conn.execute("""
UPDATE prompts SET is_active = FALSE
WHERE name = ?
""", (name,))
# Activate specific version
cursor = conn.execute("""
UPDATE prompts SET is_active = TRUE
WHERE name = ? AND version = ?
""", (name, version))
return cursor.rowcount > 0
def list_versions(self, name: str) -> List[Dict[str, Any]]:
"""List all versions of a prompt."""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
SELECT version, created_at, is_active, metrics_json
FROM prompts
WHERE name = ?
ORDER BY created_at DESC
""", (name,))
return [
{
"version": row[0],
"created_at": row[1],
"is_active": bool(row[2]),
"metrics": json.loads(row[3])
}
for row in cursor.fetchall()
]
def record_metric(
self,
name: str,
version: str,
metric_name: str,
value: float
) -> None:
"""Record a performance metric for a prompt version."""
with sqlite3.connect(self.db_path) as conn:
# Get prompt id
cursor = conn.execute("""
SELECT id FROM prompts
WHERE name = ? AND version = ?
""", (name, version))
row = cursor.fetchone()
if not row:
return
prompt_id = row[0]
# Record metric
conn.execute("""
INSERT INTO prompt_metrics (prompt_id, metric_name, metric_value)
VALUES (?, ?, ?)
""", (prompt_id, metric_name, value))
# Update aggregated metrics
cursor = conn.execute("""
SELECT metric_name, AVG(metric_value) as avg_value
FROM prompt_metrics
WHERE prompt_id = ?
GROUP BY metric_name
""", (prompt_id,))
metrics = {row[0]: row[1] for row in cursor.fetchall()}
conn.execute("""
UPDATE prompts SET metrics_json = ?
WHERE id = ?
""", (json.dumps(metrics), prompt_id))
def compare_versions(
self,
name: str,
version_a: str,
version_b: str
) -> Dict[str, Any]:
"""Compare metrics between two versions."""
with sqlite3.connect(self.db_path) as conn:
results = {}
for version in [version_a, version_b]:
cursor = conn.execute("""
SELECT metrics_json FROM prompts
WHERE name = ? AND version = ?
""", (name, version))
row = cursor.fetchone()
if row:
results[version] = json.loads(row[0])
            return results

Step 6: Prompt Testing
Test prompts for reliability and consistency:
# testing.py
from typing import List, Dict, Any, Callable, Optional
from dataclasses import dataclass
from enum import Enum
import statistics
import time
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from templates import PromptTemplate
from config import get_settings
class TestStatus(str, Enum):
PASSED = "passed"
FAILED = "failed"
WARNING = "warning"
@dataclass
class TestCase:
"""A single test case for a prompt."""
name: str
input_vars: Dict[str, Any]
expected_contains: Optional[List[str]] = None
expected_not_contains: Optional[List[str]] = None
expected_format: Optional[str] = None # "json", "markdown", etc.
max_tokens: Optional[int] = None
custom_validator: Optional[Callable[[str], bool]] = None
@dataclass
class TestResult:
"""Result of a single test."""
test_name: str
status: TestStatus
output: str
latency_ms: float
token_count: int
errors: List[str]
@dataclass
class TestSuiteResult:
"""Result of running all tests."""
prompt_name: str
total_tests: int
passed: int
failed: int
warnings: int
avg_latency_ms: float
consistency_score: float
results: List[TestResult]
class PromptTester:
"""
Test prompts for reliability and consistency.
Features:
- Content validation (contains/not contains)
- Format validation (JSON, markdown, etc.)
- Latency measurement
- Consistency testing (same input, multiple runs)
- Custom validators
"""
def __init__(self):
self.settings = get_settings()
self.llm = ChatOpenAI(
model=self.settings.default_model,
api_key=self.settings.openai_api_key,
temperature=0.7
)
def run_test(
self,
template: PromptTemplate,
test_case: TestCase
) -> TestResult:
"""Run a single test case."""
errors = []
start_time = time.time()
# Render and execute prompt
try:
rendered = template.render(**test_case.input_vars)
messages = [HumanMessage(content=rendered)]
response = self.llm.invoke(messages)
output = response.content
except Exception as e:
return TestResult(
test_name=test_case.name,
status=TestStatus.FAILED,
output="",
latency_ms=0,
token_count=0,
errors=[f"Execution error: {str(e)}"]
)
latency_ms = (time.time() - start_time) * 1000
        token_count = len(output.split())  # Rough approximation: whitespace tokens, not model tokens
# Validate output
if test_case.expected_contains:
for expected in test_case.expected_contains:
if expected.lower() not in output.lower():
errors.append(f"Missing expected content: '{expected}'")
if test_case.expected_not_contains:
for unexpected in test_case.expected_not_contains:
if unexpected.lower() in output.lower():
errors.append(f"Contains unexpected content: '{unexpected}'")
if test_case.expected_format:
if not self._validate_format(output, test_case.expected_format):
errors.append(f"Invalid format: expected {test_case.expected_format}")
if test_case.max_tokens and token_count > test_case.max_tokens:
errors.append(f"Exceeded max tokens: {token_count} > {test_case.max_tokens}")
if test_case.custom_validator:
try:
if not test_case.custom_validator(output):
errors.append("Custom validation failed")
except Exception as e:
errors.append(f"Custom validator error: {str(e)}")
# Determine status
if errors:
status = TestStatus.FAILED
elif latency_ms > 5000: # Warning if slow
status = TestStatus.WARNING
errors.append(f"Slow response: {latency_ms:.0f}ms")
else:
status = TestStatus.PASSED
return TestResult(
test_name=test_case.name,
status=status,
output=output,
latency_ms=latency_ms,
token_count=token_count,
errors=errors
)
def run_suite(
self,
template: PromptTemplate,
test_cases: List[TestCase]
) -> TestSuiteResult:
"""Run all test cases for a prompt."""
results = []
for test_case in test_cases:
result = self.run_test(template, test_case)
results.append(result)
# Calculate metrics
passed = sum(1 for r in results if r.status == TestStatus.PASSED)
failed = sum(1 for r in results if r.status == TestStatus.FAILED)
warnings = sum(1 for r in results if r.status == TestStatus.WARNING)
latencies = [r.latency_ms for r in results]
avg_latency = statistics.mean(latencies) if latencies else 0
return TestSuiteResult(
prompt_name=template.name,
total_tests=len(results),
passed=passed,
failed=failed,
warnings=warnings,
avg_latency_ms=avg_latency,
            consistency_score=passed / len(results) if results else 0,  # pass rate, not run-to-run similarity
results=results
)
def test_consistency(
self,
template: PromptTemplate,
input_vars: Dict[str, Any],
runs: int = 5
) -> Dict[str, Any]:
"""
Test output consistency across multiple runs.
Returns similarity metrics between outputs.
"""
outputs = []
latencies = []
for _ in range(runs):
start = time.time()
rendered = template.render(**input_vars)
messages = [HumanMessage(content=rendered)]
response = self.llm.invoke(messages)
outputs.append(response.content)
latencies.append((time.time() - start) * 1000)
# Calculate consistency
from difflib import SequenceMatcher
similarities = []
for i in range(len(outputs)):
for j in range(i + 1, len(outputs)):
sim = SequenceMatcher(None, outputs[i], outputs[j]).ratio()
similarities.append(sim)
return {
"runs": runs,
"avg_similarity": statistics.mean(similarities) if similarities else 1.0,
"min_similarity": min(similarities) if similarities else 1.0,
"avg_latency_ms": statistics.mean(latencies),
"latency_std": statistics.stdev(latencies) if len(latencies) > 1 else 0,
"outputs": outputs
}
def _validate_format(self, output: str, expected_format: str) -> bool:
"""Validate output format."""
        if expected_format == "json":
            import json
            import re
            try:
                json.loads(output)
                return True
            except json.JSONDecodeError:
                # Fall back: the model may wrap JSON in prose, so try to
                # extract the first {...} block and parse that instead
                json_match = re.search(r'\{[\s\S]*\}', output)
                if json_match:
                    try:
                        json.loads(json_match.group())
                        return True
                    except json.JSONDecodeError:
                        pass
                return False
elif expected_format == "markdown":
# Check for markdown elements
return any([
output.startswith('#'),
'**' in output,
'- ' in output,
'```' in output
])
elif expected_format == "list":
return any([
'\n- ' in output,
'\n* ' in output,
'\n1.' in output
])
        return True  # Unknown format, assume valid

Step 7: Prompt Registry
Central registry for managing prompts:
# registry.py
from typing import Dict, List, Optional, Any
from pathlib import Path
import yaml
import json
from templates import PromptTemplate, COMMON_TEMPLATES
from versioning import PromptVersionStore
from testing import PromptTester, TestCase, TestSuiteResult
class PromptRegistry:
"""
Central registry for prompt management.
Features:
- Load prompts from files or code
- Version management
- Testing integration
- A/B test support
"""
def __init__(self, db_path: str = "prompts.db"):
self.store = PromptVersionStore(db_path)
self.tester = PromptTester()
self._cache: Dict[str, PromptTemplate] = {}
self._ab_tests: Dict[str, Dict[str, float]] = {}
# Load common templates
for name, template in COMMON_TEMPLATES.items():
self.register(template)
def register(
self,
template: PromptTemplate,
version: Optional[str] = None
) -> None:
"""Register a prompt template."""
self.store.save(template, version)
self._cache[template.name] = template
def get(
self,
name: str,
version: Optional[str] = None
) -> Optional[PromptTemplate]:
"""Get a prompt by name."""
# Check cache first
if name in self._cache and version is None:
return self._cache[name]
# Load from store
template = self.store.get(name, version)
if template:
self._cache[name] = template
return template
def get_for_request(self, name: str, request_id: str) -> PromptTemplate:
"""
Get prompt for a request, handling A/B tests.
Args:
name: Prompt name
request_id: Unique request identifier for consistent assignment
Returns:
Selected PromptTemplate
"""
if name in self._ab_tests:
# A/B test active - select version based on request_id
import hashlib
hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
normalized = (hash_val % 1000) / 1000.0
cumulative = 0.0
for version, weight in self._ab_tests[name].items():
cumulative += weight
if normalized < cumulative:
return self.get(name, version)
# Return active or latest version
return self.store.get_active(name) or self.get(name)
def start_ab_test(
self,
name: str,
versions: Dict[str, float]
) -> None:
"""
Start an A/B test for a prompt.
Args:
name: Prompt name
versions: Dict of version -> weight (should sum to 1.0)
"""
total = sum(versions.values())
if abs(total - 1.0) > 0.01:
raise ValueError(f"Weights must sum to 1.0, got {total}")
self._ab_tests[name] = versions
def stop_ab_test(self, name: str) -> None:
"""Stop an A/B test."""
if name in self._ab_tests:
del self._ab_tests[name]
def test(
self,
name: str,
test_cases: List[TestCase],
version: Optional[str] = None
) -> TestSuiteResult:
"""Test a prompt with given test cases."""
template = self.get(name, version)
if not template:
raise ValueError(f"Prompt '{name}' not found")
return self.tester.run_suite(template, test_cases)
def promote(self, name: str, version: str) -> bool:
"""Promote a version to active (production)."""
return self.store.set_active(name, version)
def load_from_yaml(self, path: str) -> List[str]:
"""Load prompts from YAML file."""
loaded = []
with open(path) as f:
data = yaml.safe_load(f)
for prompt_data in data.get("prompts", []):
template = PromptTemplate.from_dict(prompt_data)
self.register(template)
loaded.append(template.name)
return loaded
def export_to_yaml(self, path: str, names: Optional[List[str]] = None) -> None:
"""Export prompts to YAML file."""
prompts = []
for name in (names or self._cache.keys()):
template = self.get(name)
if template:
prompts.append(template.to_dict())
with open(path, 'w') as f:
yaml.dump({"prompts": prompts}, f, default_flow_style=False)
def list_prompts(self) -> List[Dict[str, Any]]:
"""List all registered prompts."""
prompts = []
for name in self._cache.keys():
versions = self.store.list_versions(name)
prompts.append({
"name": name,
"versions": versions,
"ab_test_active": name in self._ab_tests
})
        return prompts

Usage Examples
Basic Template Usage
from templates import PromptTemplate, PromptBuilder
# Using pre-built template
from templates import COMMON_TEMPLATES
summarize = COMMON_TEMPLATES["summarize"]
prompt = summarize.render(text="Long article here...", style="bullet points")
# Building custom template
template = (
PromptBuilder("code_review")
.with_description("Review code for issues")
.add_section("Review the following code and identify:")
.add_section("1. Bugs and errors")
.add_section("2. Security issues")
.add_section("3. Performance problems")
.add_section("\nCode:")
.add_variable("code", "The code to review")
.add_section("\nProvide your review:")
.build()
)
prompt = template.render(code="def foo(): pass")

Few-Shot Learning
from few_shot import Example, FewShotPrompt, SelectionStrategy
examples = [
Example(input="happy", output="positive"),
Example(input="sad", output="negative"),
Example(input="excited", output="positive"),
Example(input="angry", output="negative"),
]
few_shot = FewShotPrompt(
task_description="Classify the sentiment of the word.",
examples=examples,
selection_strategy=SelectionStrategy.SEMANTIC
)
prompt = few_shot.format("joyful", num_examples=2)

Chain-of-Thought
from chain_of_thought import ChainOfThoughtBuilder, ReasoningStyle
cot = ChainOfThoughtBuilder(style=ReasoningStyle.STRUCTURED)
result = cot.reason("Should we use microservices or monolith for a startup?")
print(f"Answer: {result.final_answer}")
print(f"Confidence: {result.confidence}")
for step in result.reasoning_steps:
    print(f"Step {step.step_number}: {step.thought[:100]}...")

Requirements
# requirements.txt
langchain>=0.3.0
langchain-openai>=0.2.0
openai>=1.50.0
jinja2>=3.1.0
pydantic>=2.9.0
pydantic-settings>=2.6.0
numpy>=1.26.0
pyyaml>=6.0.0

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Prompt Templates | Reusable prompts with Jinja2 variables | Separate structure from content, enable iteration |
| Few-Shot Learning | Include examples in the prompt | LLM learns output format from examples |
| Semantic Selection | Choose examples similar to query | Better examples = better outputs |
| Chain-of-Thought | "Let's think step by step" prompting | Improves reasoning on complex tasks |
| Self-Consistency | Sample multiple paths, pick majority | Catches reasoning errors, improves accuracy |
| Prompt Versioning | Store versions in SQLite with metrics | Track what worked, enable rollback |
| A/B Testing | Route traffic to different versions | Data-driven prompt optimization |
| Consistency Testing | Run same input multiple times | Ensure reliable, predictable outputs |
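The A/B routing in `get_for_request` deserves a closer look: it MD5-hashes the request id into one of 1,000 buckets, normalizes to [0.0, 1.0), and walks the cumulative version weights. The sketch below mirrors that logic as a hypothetical standalone `assign_version` helper, so the assignment behavior can be checked without a registry or an API key:

```python
import hashlib
from typing import Dict


def assign_version(request_id: str, weights: Dict[str, float]) -> str:
    """Deterministically pick a version for a request id.

    Mirrors the bucketing in PromptRegistry.get_for_request: hash the id,
    reduce it to one of 1,000 buckets in [0.0, 1.0), then walk the
    cumulative weights until the bucket falls inside a version's slice.
    """
    hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
    normalized = (hash_val % 1000) / 1000.0
    cumulative = 0.0
    version = ""
    for version, weight in weights.items():
        cumulative += weight
        if normalized < cumulative:
            return version
    return version  # guard against float rounding when weights sum to ~1.0


weights = {"1.0.0": 0.5, "1.1.0": 0.5}

# The same request id always maps to the same version (sticky assignment)
assert assign_version("user-42", weights) == assign_version("user-42", weights)

# Across many distinct ids, the split tracks the configured weights
counts = {"1.0.0": 0, "1.1.0": 0}
for i in range(1000):
    counts[assign_version(f"req-{i}", weights)] += 1
```

Because assignment depends only on the request id, a given user sees the same prompt version for the lifetime of the test, while aggregate traffic follows the configured split.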
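Similarly, the pairwise-similarity math inside `test_consistency` is pure string comparison and can be verified offline. This sketch uses the same `difflib.SequenceMatcher` approach, with `consistency_score` as a hypothetical helper applied to canned outputs instead of live model responses:

```python
from difflib import SequenceMatcher
from itertools import combinations
import statistics


def consistency_score(outputs: list) -> float:
    """Mean pairwise similarity ratio, as computed in test_consistency."""
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(outputs, 2)]
    return statistics.mean(sims) if sims else 1.0


# Identical outputs: fully consistent
assert consistency_score(["The answer is 42."] * 3) == 1.0

# Paraphrased outputs: same meaning, lower string-level similarity
varied = ["The answer is 42.", "It is 42.", "Forty-two is the answer."]
assert 0.0 < consistency_score(varied) < 1.0
```

Scores near 1.0 indicate near-deterministic outputs; low scores suggest lowering temperature or tightening the prompt's format instructions. Note this measures surface stability only, since paraphrases of the same answer score below 1.0.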