Prompt Engineering
Master systematic prompt engineering with templates, versioning, testing, and advanced techniques like Chain-of-Thought
| Property | Value |
|---|---|
| Difficulty | Intermediate |
| Time | ~5 hours |
| Code Size | ~450 LOC |
| Prerequisites | Chatbot |
TL;DR
Build reusable prompt templates with Jinja2, add few-shot examples for consistent outputs, and use Chain-of-Thought (CoT) for complex reasoning. Version your prompts in SQLite, A/B test variations, and measure consistency—treat prompts as code.
Tech Stack
| Technology | Purpose |
|---|---|
| LangChain | Prompt templates |
| Jinja2 | Advanced templating |
| Pydantic | Prompt validation |
| SQLite | Version storage |
| OpenAI | LLM provider |
Prerequisites
- Python 3.10+
- OpenAI API key
pip install langchain langchain-openai jinja2 pydantic pydantic-settings numpy openai

What You'll Learn
- Design effective prompt templates
- Implement few-shot learning with dynamic examples
- Build Chain-of-Thought prompting systems
- Version control and A/B test prompts
- Measure and optimize prompt performance
The Problem: Ad-Hoc Prompting Doesn't Scale
| Issue | Impact |
|---|---|
| Hardcoded prompts | Can't iterate or experiment |
| No versioning | Lost track of what worked |
| No testing | Unknown reliability |
| No examples | Inconsistent output format |
| No structure | Repeated prompt rewriting |
┌─────────────────────────────────────────────────────────────────────────────┐
│ AD-HOC vs SYSTEMATIC PROMPTING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Ad-Hoc Prompting Systematic Prompting │
│ ───────────────── ──────────────────── │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Hardcoded String │ │ Template │ │
│ │ "Summarize this" │ │ "{{ style }} │ │
│ └────────┬─────────┘ │ summary of │ │
│ │ │ {{ text }}" │ │
│ ▼ └────────┬─────────┘ │
│ ┌──────────────────┐ │ │
│ │ LLM │ ▼ │
│ └────────┬─────────┘ ┌──────────────────┐ │
│ │ │ Variables │ │
│ ▼ │ style="bullet" │ │
│ ┌──────────────────┐ │ text="..." │ │
│ │ Unpredictable │ └────────┬─────────┘ │
│ │ Output │ │ │
│ │ (varies each │ ▼ │
│ │ time) │ ┌──────────────────┐ │
│ └──────────────────┘ │ Examples │ │
│ │ (few-shot) │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ LLM │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Consistent │ │
│ │ Output │ │
│ │ (predictable │ │
│ │ format) │ │
│ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Project Structure
prompt-engineering/
├── config.py # Configuration
├── templates.py # Prompt template system
├── few_shot.py # Few-shot example management
├── chain_of_thought.py # CoT prompting
├── versioning.py # Prompt version control
├── testing.py # Prompt testing framework
├── registry.py # Prompt registry
└── requirements.txt

Step 1: Configuration
# config.py
from pydantic_settings import BaseSettings
from functools import lru_cache
class Settings(BaseSettings):
openai_api_key: str
default_model: str = "gpt-4o"
temperature: float = 0.7
max_tokens: int = 2000
# Versioning
db_path: str = "prompts.db"
enable_versioning: bool = True
# Testing
test_iterations: int = 5
consistency_threshold: float = 0.8
class Config:
env_file = ".env"
@lru_cache
def get_settings() -> Settings:
    return Settings()

Step 2: Prompt Template System
Build a flexible template system that separates structure from content:
# templates.py
from typing import Dict, Any, List, Optional
from dataclasses import dataclass, field
from enum import Enum
from jinja2 import Environment, BaseLoader
from pydantic import BaseModel
class PromptRole(str, Enum):
SYSTEM = "system"
USER = "user"
ASSISTANT = "assistant"
class PromptVariable(BaseModel):
"""Define a template variable with validation."""
name: str
description: str
required: bool = True
default: Optional[str] = None
examples: List[str] = []
def validate_value(self, value: Any) -> bool:
if self.required and value is None:
return False
return True
@dataclass
class PromptTemplate:
"""
A structured prompt template with variables and metadata.
Supports Jinja2 templating for complex logic.
"""
name: str
template: str
role: PromptRole = PromptRole.USER
variables: List[PromptVariable] = field(default_factory=list)
description: str = ""
version: str = "1.0.0"
tags: List[str] = field(default_factory=list)
def __post_init__(self):
self._jinja_env = Environment(loader=BaseLoader())
self._compiled = self._jinja_env.from_string(self.template)
def render(self, **kwargs) -> str:
"""
Render template with provided variables.
Args:
**kwargs: Variable values
Returns:
Rendered prompt string
Raises:
ValueError: If required variables are missing
"""
# Validate required variables
for var in self.variables:
if var.required and var.name not in kwargs:
if var.default is not None:
kwargs[var.name] = var.default
else:
raise ValueError(f"Required variable '{var.name}' not provided")
return self._compiled.render(**kwargs)
def get_variable_names(self) -> List[str]:
"""Get list of variable names."""
return [v.name for v in self.variables]
def to_dict(self) -> Dict[str, Any]:
"""Serialize template to dict."""
return {
"name": self.name,
"template": self.template,
"role": self.role.value,
            "variables": [v.model_dump() for v in self.variables],
"description": self.description,
"version": self.version,
"tags": self.tags
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PromptTemplate":
"""Deserialize from dict."""
data["role"] = PromptRole(data["role"])
data["variables"] = [PromptVariable(**v) for v in data.get("variables", [])]
return cls(**data)
class PromptBuilder:
"""
Fluent builder for constructing prompts.
Provides a clean API for building complex prompts.
"""
def __init__(self, name: str):
self.name = name
self._template_parts: List[str] = []
self._variables: List[PromptVariable] = []
self._role = PromptRole.USER
self._description = ""
self._tags: List[str] = []
def with_role(self, role: PromptRole) -> "PromptBuilder":
self._role = role
return self
def with_description(self, desc: str) -> "PromptBuilder":
self._description = desc
return self
def add_section(self, content: str) -> "PromptBuilder":
"""Add a static section to the prompt."""
self._template_parts.append(content)
return self
def add_variable(
self,
name: str,
description: str = "",
required: bool = True,
default: Optional[str] = None
) -> "PromptBuilder":
"""Add a variable placeholder."""
self._template_parts.append("{{ " + name + " }}")
self._variables.append(PromptVariable(
name=name,
description=description,
required=required,
default=default
))
return self
def add_conditional(
self,
condition: str,
content: str,
else_content: str = ""
) -> "PromptBuilder":
"""Add conditional content."""
template = f"{{% if {condition} %}}{content}"
if else_content:
template += f"{{% else %}}{else_content}"
template += "{% endif %}"
self._template_parts.append(template)
return self
def add_loop(
self,
items_var: str,
item_template: str,
separator: str = "\n"
) -> "PromptBuilder":
"""Add a loop over items."""
template = f"{{% for item in {items_var} %}}{item_template}"
template += f"{{% if not loop.last %}}{separator}{{% endif %}}"
template += "{% endfor %}"
self._template_parts.append(template)
return self
def with_tags(self, *tags: str) -> "PromptBuilder":
self._tags.extend(tags)
return self
def build(self) -> PromptTemplate:
"""Build the final PromptTemplate."""
return PromptTemplate(
name=self.name,
template="\n".join(self._template_parts),
role=self._role,
variables=self._variables,
description=self._description,
tags=self._tags
        )

Understanding the Template Builder Pattern:
┌─────────────────────────────────────────────────────────────────────────────┐
│ FLUENT BUILDER IN ACTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Construction (hard to read): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ template = PromptTemplate( │ │
│ │ name="classify", │ │
│ │ template="Classify: {{ text }}\n{% if examples %}...", │ │
│ │ role=PromptRole.USER, │ │
│ │ variables=[PromptVariable(name="text", ...), ...], │ │
│ │ tags=["classification"] │ │
│ │ ) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Fluent Builder (readable, step-by-step): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ template = ( │ │
│ │ PromptBuilder("classify") │ │
│ │ .with_role(PromptRole.USER) │ │
│ │ .add_section("Classify the following text:") │ │
│ │ .add_variable("text", description="Text to classify") │ │
│ │ .add_conditional("examples", "Examples: {{ examples }}") │ │
│ │ .with_tags("classification") │ │
│ │ .build() │ │
│ │ ) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Each method returns `self`, enabling method chaining. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Jinja2 Template Features:
| Feature | Syntax | Use Case |
|---|---|---|
| Variable | {{ var }} | Insert dynamic content |
| Conditional | {% if x %}...{% endif %} | Optional sections |
| Loop | {% for item in items %} | List few-shot examples |
| Filter | {{ text|upper }} | Transform content |
| Default | {{ var|default('N/A') }} | Fallback values |
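The features in the table can be exercised directly with plain Jinja2. A minimal sketch (the template string and variable names are illustrative, not part of the project code):

```python
# Demonstrates the four Jinja2 features used throughout this project:
# variables, conditionals, loops, and filters (including default fallbacks).
from jinja2 import Environment, BaseLoader

env = Environment(loader=BaseLoader())

template = env.from_string(
    "{{ style|default('concise')|upper }} summary requested.\n"
    "{% if examples %}Examples provided: {{ examples|length }}\n{% endif %}"
    "{% for item in items %}- {{ item }}\n{% endfor %}"
)

# "style" is omitted, so the default filter supplies "concise" before upper-casing.
rendered = template.render(items=["alpha", "beta"], examples=["one"])
print(rendered)
# CONCISE summary requested.
# Examples provided: 1
# - alpha
# - beta
```

Because the `default` filter runs before `upper`, a missing variable still produces well-formed output instead of an empty string.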
# Pre-built common templates
COMMON_TEMPLATES = {
"summarize": PromptTemplate(
name="summarize",
template="""Summarize the following text in {{ style }} style.
Text:
{{ text }}
Summary:""",
variables=[
PromptVariable(name="text", description="Text to summarize", required=True),
PromptVariable(name="style", description="Summary style", default="concise")
],
description="General-purpose summarization template"
),
"analyze": PromptTemplate(
name="analyze",
template="""Analyze the following {{ content_type }} and provide insights.
Content:
{{ content }}
Focus areas:
{% for area in focus_areas %}- {{ area }}
{% endfor %}
Analysis:""",
variables=[
PromptVariable(name="content", description="Content to analyze", required=True),
PromptVariable(name="content_type", description="Type of content", default="text"),
PromptVariable(name="focus_areas", description="Areas to focus on", required=False)
],
description="Content analysis template"
),
"extract": PromptTemplate(
name="extract",
template="""Extract the following information from the text:
{% for field in fields %}- {{ field.name }}: {{ field.description }}
{% endfor %}
Text:
{{ text }}
Extracted Information (JSON):""",
variables=[
PromptVariable(name="text", description="Source text", required=True),
PromptVariable(name="fields", description="Fields to extract", required=True)
],
description="Structured information extraction"
)
}

★ Insight ─────────────────────────────────────
The PromptBuilder pattern separates prompt construction from execution. This enables: 1) Reusable prompt components, 2) Type-safe variable handling, 3) Easy A/B testing by swapping templates. Jinja2 templating adds conditionals and loops without string concatenation.
─────────────────────────────────────────────────
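To see how rendering behaves end to end, here is a standalone sketch that mirrors the validation logic of `PromptTemplate.render` (re-implemented inline with plain dicts so it runs without the `templates` module; the template follows the shape of the summarize entry above):

```python
# Sketch of render-time validation: required variables without a default
# raise ValueError; required variables with a default fall back silently.
from jinja2 import Environment, BaseLoader

variables = [
    {"name": "text", "required": True, "default": None},
    {"name": "style", "required": True, "default": "concise"},
]
source = (
    "Summarize the following text in {{ style }} style.\n\n"
    "Text:\n{{ text }}\n\nSummary:"
)
compiled = Environment(loader=BaseLoader()).from_string(source)

def render(**kwargs):
    # Same check as PromptTemplate.render: fill defaults, then fail loudly.
    for var in variables:
        if var["required"] and var["name"] not in kwargs:
            if var["default"] is not None:
                kwargs[var["name"]] = var["default"]
            else:
                raise ValueError(f"Required variable '{var['name']}' not provided")
    return compiled.render(**kwargs)

print(render(text="Quarterly revenue grew 12%..."))  # "style" falls back to "concise"
```

Calling `render()` with no `text` raises immediately, which surfaces template misuse at build time rather than as a malformed prompt sent to the LLM.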
Step 3: Few-Shot Learning
Dynamic example selection for consistent outputs:
# few_shot.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum
import random
from langchain_openai import OpenAIEmbeddings
import numpy as np
@dataclass
class Example:
"""A single few-shot example."""
input: str
output: str
    metadata: Optional[Dict[str, Any]] = None
def format(self, input_prefix: str = "Input", output_prefix: str = "Output") -> str:
return f"{input_prefix}: {self.input}\n{output_prefix}: {self.output}"
class SelectionStrategy(str, Enum):
RANDOM = "random"
SEMANTIC = "semantic"
DIVERSE = "diverse"
RECENT = "recent"
class FewShotSelector:
"""
Select optimal examples for few-shot prompting.
Strategies:
- Random: Simple random selection
- Semantic: Most similar to query (embedding-based)
- Diverse: Maximize coverage of different patterns
- Recent: Most recently added examples
"""
def __init__(
self,
examples: List[Example],
strategy: SelectionStrategy = SelectionStrategy.SEMANTIC
):
self.examples = examples
self.strategy = strategy
self._embeddings = None
self._example_vectors = None
def _init_embeddings(self):
"""Initialize embedding model for semantic selection."""
if self._embeddings is None:
from config import get_settings
settings = get_settings()
self._embeddings = OpenAIEmbeddings(
openai_api_key=settings.openai_api_key
)
# Pre-compute example embeddings
texts = [ex.input for ex in self.examples]
self._example_vectors = self._embeddings.embed_documents(texts)
def select(
self,
query: str,
k: int = 3,
strategy: Optional[SelectionStrategy] = None
) -> List[Example]:
"""
Select k examples based on strategy.
Args:
query: The input query to match against
k: Number of examples to select
strategy: Override default strategy
Returns:
List of selected examples
"""
strategy = strategy or self.strategy
k = min(k, len(self.examples))
if strategy == SelectionStrategy.RANDOM:
return self._select_random(k)
elif strategy == SelectionStrategy.SEMANTIC:
return self._select_semantic(query, k)
elif strategy == SelectionStrategy.DIVERSE:
return self._select_diverse(query, k)
elif strategy == SelectionStrategy.RECENT:
return self._select_recent(k)
return self._select_random(k)
def _select_random(self, k: int) -> List[Example]:
"""Random selection."""
return random.sample(self.examples, k)
def _select_semantic(self, query: str, k: int) -> List[Example]:
"""Select most semantically similar examples."""
self._init_embeddings()
query_vector = self._embeddings.embed_query(query)
# Compute cosine similarities
similarities = []
for i, ex_vector in enumerate(self._example_vectors):
sim = np.dot(query_vector, ex_vector) / (
np.linalg.norm(query_vector) * np.linalg.norm(ex_vector)
)
similarities.append((i, sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
# Return top k
return [self.examples[i] for i, _ in similarities[:k]]
def _select_diverse(self, query: str, k: int) -> List[Example]:
"""Select diverse examples using MMR-like algorithm."""
self._init_embeddings()
query_vector = self._embeddings.embed_query(query)
# Start with most similar
similarities = []
for i, ex_vector in enumerate(self._example_vectors):
sim = np.dot(query_vector, ex_vector) / (
np.linalg.norm(query_vector) * np.linalg.norm(ex_vector)
)
similarities.append((i, sim))
similarities.sort(key=lambda x: x[1], reverse=True)
selected_indices = [similarities[0][0]]
selected = [self.examples[similarities[0][0]]]
# Iteratively select diverse examples
lambda_param = 0.5 # Balance relevance and diversity
while len(selected) < k:
best_score = -1
best_idx = -1
for i, _ in enumerate(self.examples):
if i in selected_indices:
continue
# Relevance to query
relevance = next(s for idx, s in similarities if idx == i)
# Max similarity to already selected (for diversity)
max_sim_to_selected = max(
np.dot(self._example_vectors[i], self._example_vectors[j]) / (
np.linalg.norm(self._example_vectors[i]) *
np.linalg.norm(self._example_vectors[j])
)
for j in selected_indices
)
# MMR score
score = lambda_param * relevance - (1 - lambda_param) * max_sim_to_selected
if score > best_score:
best_score = score
best_idx = i
if best_idx >= 0:
selected_indices.append(best_idx)
selected.append(self.examples[best_idx])
return selected
def _select_recent(self, k: int) -> List[Example]:
"""Select most recent examples."""
return self.examples[-k:]
def add_example(self, example: Example) -> None:
"""Add a new example."""
self.examples.append(example)
# Invalidate cached vectors
self._example_vectors = None
class FewShotPrompt:
"""Build prompts with dynamic few-shot examples."""
def __init__(
self,
task_description: str,
examples: List[Example],
input_prefix: str = "Input",
output_prefix: str = "Output",
selection_strategy: SelectionStrategy = SelectionStrategy.SEMANTIC
):
self.task_description = task_description
self.selector = FewShotSelector(examples, selection_strategy)
self.input_prefix = input_prefix
self.output_prefix = output_prefix
def format(
self,
query: str,
num_examples: int = 3
) -> str:
"""
Format prompt with selected examples.
Args:
query: User query
num_examples: Number of examples to include
Returns:
Formatted prompt string
"""
# Select examples
selected = self.selector.select(query, k=num_examples)
# Build prompt
parts = [self.task_description, "\nExamples:\n"]
for ex in selected:
parts.append(ex.format(self.input_prefix, self.output_prefix))
parts.append("")
parts.append(f"\n{self.input_prefix}: {query}")
parts.append(f"{self.output_prefix}:")
        return "\n".join(parts)

Step 4: Chain-of-Thought Prompting
Guide the model through step-by-step reasoning:
# chain_of_thought.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from config import get_settings
class ReasoningStyle(str, Enum):
STEP_BY_STEP = "step_by_step"
THINK_ALOUD = "think_aloud"
STRUCTURED = "structured"
SOCRATIC = "socratic"
@dataclass
class ReasoningStep:
"""A single reasoning step."""
step_number: int
thought: str
action: Optional[str] = None
observation: Optional[str] = None
@dataclass
class ChainOfThoughtResult:
"""Complete CoT result."""
question: str
reasoning_steps: List[ReasoningStep]
final_answer: str
confidence: float
class ChainOfThoughtBuilder:
"""
Build Chain-of-Thought prompts for complex reasoning.
Supports multiple reasoning styles:
- Step-by-step: Numbered steps
- Think aloud: Stream of consciousness
- Structured: Specific framework (e.g., pros/cons)
- Socratic: Question-based exploration
"""
STYLE_PROMPTS = {
ReasoningStyle.STEP_BY_STEP: """
Let's solve this step by step:
Step 1: [First consideration]
Step 2: [Next logical step]
...
Final Answer: [Conclusion]
""",
ReasoningStyle.THINK_ALOUD: """
Let me think through this...
First, I notice that...
This makes me think...
Considering all of this...
Therefore: [Conclusion]
""",
ReasoningStyle.STRUCTURED: """
Let me analyze this systematically:
1. Understanding the problem:
- Key elements: ...
- Constraints: ...
2. Possible approaches:
- Option A: ...
- Option B: ...
3. Evaluation:
- Pros/Cons of each...
4. Conclusion:
Based on this analysis: [Answer]
""",
ReasoningStyle.SOCRATIC: """
Let me explore this through questions:
Q1: What is the core issue here?
A1: ...
Q2: What assumptions am I making?
A2: ...
Q3: What would change if...?
A3: ...
Given these insights: [Conclusion]
"""
}
def __init__(self, style: ReasoningStyle = ReasoningStyle.STEP_BY_STEP):
self.style = style
self.settings = get_settings()
self.llm = ChatOpenAI(
model=self.settings.default_model,
api_key=self.settings.openai_api_key,
temperature=0.3 # Lower for reasoning
)
def build_prompt(
self,
question: str,
context: Optional[str] = None,
style: Optional[ReasoningStyle] = None
) -> str:
"""Build a CoT prompt for the question."""
style = style or self.style
style_guide = self.STYLE_PROMPTS[style]
prompt_parts = []
if context:
prompt_parts.append(f"Context:\n{context}\n")
prompt_parts.append(f"Question: {question}")
prompt_parts.append(f"\n{style_guide}")
return "\n".join(prompt_parts)
def reason(
self,
question: str,
context: Optional[str] = None,
style: Optional[ReasoningStyle] = None
) -> ChainOfThoughtResult:
"""
Execute Chain-of-Thought reasoning.
Args:
question: The question to reason about
context: Optional context information
style: Reasoning style to use
Returns:
ChainOfThoughtResult with steps and answer
"""
style = style or self.style
system_prompt = """You are a careful, methodical thinker.
When answering questions, show your reasoning process explicitly.
Break down complex problems into manageable steps.
Be honest about uncertainty."""
user_prompt = self.build_prompt(question, context, style)
messages = [
SystemMessage(content=system_prompt),
HumanMessage(content=user_prompt)
]
response = self.llm.invoke(messages)
content = response.content
# Parse response into steps
steps = self._parse_reasoning(content, style)
final_answer = self._extract_answer(content)
confidence = self._estimate_confidence(content)
return ChainOfThoughtResult(
question=question,
reasoning_steps=steps,
final_answer=final_answer,
confidence=confidence
)
def _parse_reasoning(
self,
content: str,
style: ReasoningStyle
) -> List[ReasoningStep]:
"""Parse reasoning steps from response."""
steps = []
lines = content.split('\n')
step_num = 0
current_thought = []
for line in lines:
line = line.strip()
# Detect step markers
is_step_marker = (
line.startswith('Step ') or
line.startswith('1.') or
line.startswith('Q1:') or
(line.startswith('-') and step_num == 0)
)
if is_step_marker and current_thought:
steps.append(ReasoningStep(
step_number=step_num,
thought='\n'.join(current_thought)
))
step_num += 1
current_thought = [line]
elif line:
current_thought.append(line)
if current_thought:
steps.append(ReasoningStep(
step_number=step_num,
thought='\n'.join(current_thought)
))
return steps
def _extract_answer(self, content: str) -> str:
"""Extract final answer from response."""
markers = [
'Final Answer:', 'Therefore:', 'Conclusion:',
'Given these insights:', 'Based on this analysis:',
'The answer is:', 'In conclusion:'
]
for marker in markers:
if marker in content:
answer = content.split(marker)[-1].strip()
# Take first paragraph
return answer.split('\n\n')[0].strip()
# Fallback: last paragraph
paragraphs = content.strip().split('\n\n')
return paragraphs[-1] if paragraphs else content
def _estimate_confidence(self, content: str) -> float:
"""Estimate confidence based on language used."""
low_confidence_markers = [
'uncertain', 'not sure', 'might be', 'possibly',
'I think', 'it seems', 'probably', 'unclear'
]
high_confidence_markers = [
'definitely', 'certainly', 'clearly', 'must be',
'obviously', 'without doubt', 'always', 'never'
]
content_lower = content.lower()
low_count = sum(1 for m in low_confidence_markers if m in content_lower)
high_count = sum(1 for m in high_confidence_markers if m in content_lower)
# Base confidence
base = 0.7
# Adjust based on markers
confidence = base - (low_count * 0.05) + (high_count * 0.05)
return max(0.1, min(1.0, confidence))
class SelfConsistencyReasoner:
"""
Implement self-consistency for improved accuracy.
Generates multiple reasoning paths and selects the most common answer.
"""
def __init__(self, num_samples: int = 5):
self.num_samples = num_samples
self.cot = ChainOfThoughtBuilder()
self.settings = get_settings()
def reason(
self,
question: str,
context: Optional[str] = None
) -> Dict[str, Any]:
"""
Generate multiple reasoning paths and find consensus.
Args:
question: Question to answer
context: Optional context
Returns:
Dict with answer, confidence, and all paths
"""
# Generate multiple reasoning paths
results = []
        # Use different temperatures for diverse reasoning paths
        # (cycle through the presets so num_samples > 5 still works)
        temperatures = [(0.3, 0.5, 0.7, 0.5, 0.3)[i % 5] for i in range(self.num_samples)]
for temp in temperatures:
self.cot.llm.temperature = temp
result = self.cot.reason(question, context)
results.append(result)
# Extract answers and count
answer_counts: Dict[str, int] = {}
for result in results:
answer = result.final_answer.strip().lower()
# Normalize answer
answer_key = self._normalize_answer(answer)
answer_counts[answer_key] = answer_counts.get(answer_key, 0) + 1
# Find most common
best_answer = max(answer_counts.keys(), key=lambda k: answer_counts[k])
confidence = answer_counts[best_answer] / self.num_samples
# Get best result with this answer
best_result = next(
r for r in results
if self._normalize_answer(r.final_answer) == best_answer
)
return {
"answer": best_result.final_answer,
"confidence": confidence,
"consensus_count": answer_counts[best_answer],
"total_samples": self.num_samples,
"reasoning": best_result.reasoning_steps,
"all_answers": answer_counts
}
def _normalize_answer(self, answer: str) -> str:
"""Normalize answer for comparison."""
# Simple normalization - could be enhanced
normalized = answer.lower().strip()
# Remove common prefixes
prefixes = ['the answer is', 'therefore', 'so', 'thus']
for prefix in prefixes:
if normalized.startswith(prefix):
normalized = normalized[len(prefix):].strip()
        return normalized[:100]  # First 100 chars for comparison

★ Insight ─────────────────────────────────────
Self-consistency dramatically improves CoT accuracy. Instead of relying on a single reasoning path, we sample multiple paths with varying temperatures and select the most common answer. This catches reasoning errors that might occur in any single attempt.
─────────────────────────────────────────────────
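The voting step can be isolated from the LLM calls. A minimal sketch of the normalize-and-count logic used by `SelfConsistencyReasoner` (the sampled answers here are hypothetical stand-ins for real reasoning paths):

```python
# Majority voting over normalized answers: the core of self-consistency.
from collections import Counter

def normalize(answer: str) -> str:
    # Lowercase, strip, and drop common conclusion prefixes so that
    # "The answer is 42." and "Therefore 42." count as the same answer.
    normalized = answer.lower().strip()
    for prefix in ("the answer is", "therefore", "so", "thus"):
        if normalized.startswith(prefix):
            normalized = normalized[len(prefix):].strip()
    return normalized[:100]

def majority_vote(answers: list[str]) -> dict:
    counts = Counter(normalize(a) for a in answers)
    best, count = counts.most_common(1)[0]
    return {"answer": best, "confidence": count / len(answers)}

# Five hypothetical reasoning paths; three agree after normalization.
samples = ["The answer is 42.", "Therefore 42.", "42.", "41.", "43."]
print(majority_vote(samples))  # {'answer': '42.', 'confidence': 0.6}
```

The consensus fraction doubles as a confidence score: unanimous paths give 1.0, while a split vote flags the question as one where a single reasoning pass is unreliable.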
Step 5: Prompt Versioning
Track and manage prompt versions:
# versioning.py
import sqlite3
import json
from datetime import datetime
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
import hashlib
from templates import PromptTemplate
@dataclass
class PromptVersion:
"""A versioned prompt with metadata."""
id: int
name: str
version: str
template: PromptTemplate
created_at: datetime
metrics: Dict[str, float]
is_active: bool
class PromptVersionStore:
"""
SQLite-based prompt version storage.
Features:
- Version history for all prompts
- Performance metrics tracking
- Rollback capability
- A/B test assignment
"""
def __init__(self, db_path: str = "prompts.db"):
self.db_path = db_path
self._init_db()
def _init_db(self):
"""Initialize database schema."""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS prompts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
version TEXT NOT NULL,
template_json TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
is_active BOOLEAN DEFAULT FALSE,
metrics_json TEXT DEFAULT '{}',
hash TEXT NOT NULL,
UNIQUE(name, version)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS prompt_metrics (
id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id INTEGER NOT NULL,
metric_name TEXT NOT NULL,
metric_value REAL NOT NULL,
recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (prompt_id) REFERENCES prompts(id)
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_prompt_name
ON prompts(name)
""")
def save(
self,
template: PromptTemplate,
version: Optional[str] = None
) -> PromptVersion:
"""
Save a prompt template version.
Args:
template: PromptTemplate to save
version: Optional version override
Returns:
Created PromptVersion
"""
version = version or template.version
template_json = json.dumps(template.to_dict())
template_hash = hashlib.md5(template_json.encode()).hexdigest()
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
INSERT INTO prompts (name, version, template_json, hash)
VALUES (?, ?, ?, ?)
ON CONFLICT(name, version) DO UPDATE SET
template_json = excluded.template_json,
hash = excluded.hash
RETURNING id, created_at
""", (template.name, version, template_json, template_hash))
row = cursor.fetchone()
return PromptVersion(
id=row[0],
name=template.name,
version=version,
template=template,
created_at=datetime.fromisoformat(row[1]) if isinstance(row[1], str) else row[1],
metrics={},
is_active=False
)
def get(
self,
name: str,
version: Optional[str] = None
) -> Optional[PromptTemplate]:
"""
Get a prompt template by name and version.
Args:
name: Prompt name
version: Version (latest if None)
Returns:
PromptTemplate or None
"""
with sqlite3.connect(self.db_path) as conn:
if version:
cursor = conn.execute("""
SELECT template_json FROM prompts
WHERE name = ? AND version = ?
""", (name, version))
else:
# Get latest version
cursor = conn.execute("""
SELECT template_json FROM prompts
WHERE name = ?
ORDER BY created_at DESC
LIMIT 1
""", (name,))
row = cursor.fetchone()
if row:
data = json.loads(row[0])
return PromptTemplate.from_dict(data)
return None
def get_active(self, name: str) -> Optional[PromptTemplate]:
"""Get the active version of a prompt."""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
SELECT template_json FROM prompts
WHERE name = ? AND is_active = TRUE
""", (name,))
row = cursor.fetchone()
if row:
data = json.loads(row[0])
return PromptTemplate.from_dict(data)
return None
def set_active(self, name: str, version: str) -> bool:
"""Set a version as active (for production use)."""
with sqlite3.connect(self.db_path) as conn:
# Deactivate all versions
conn.execute("""
UPDATE prompts SET is_active = FALSE
WHERE name = ?
""", (name,))
# Activate specific version
cursor = conn.execute("""
UPDATE prompts SET is_active = TRUE
WHERE name = ? AND version = ?
""", (name, version))
return cursor.rowcount > 0
def list_versions(self, name: str) -> List[Dict[str, Any]]:
"""List all versions of a prompt."""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
SELECT version, created_at, is_active, metrics_json
FROM prompts
WHERE name = ?
ORDER BY created_at DESC
""", (name,))
return [
{
"version": row[0],
"created_at": row[1],
"is_active": bool(row[2]),
"metrics": json.loads(row[3])
}
for row in cursor.fetchall()
]
def record_metric(
self,
name: str,
version: str,
metric_name: str,
value: float
) -> None:
"""Record a performance metric for a prompt version."""
with sqlite3.connect(self.db_path) as conn:
# Get prompt id
cursor = conn.execute("""
SELECT id FROM prompts
WHERE name = ? AND version = ?
""", (name, version))
row = cursor.fetchone()
if not row:
return
prompt_id = row[0]
# Record metric
conn.execute("""
INSERT INTO prompt_metrics (prompt_id, metric_name, metric_value)
VALUES (?, ?, ?)
""", (prompt_id, metric_name, value))
# Update aggregated metrics
cursor = conn.execute("""
SELECT metric_name, AVG(metric_value) as avg_value
FROM prompt_metrics
WHERE prompt_id = ?
GROUP BY metric_name
""", (prompt_id,))
metrics = {row[0]: row[1] for row in cursor.fetchall()}
conn.execute("""
UPDATE prompts SET metrics_json = ?
WHERE id = ?
""", (json.dumps(metrics), prompt_id))
def compare_versions(
self,
name: str,
version_a: str,
version_b: str
) -> Dict[str, Any]:
"""Compare metrics between two versions."""
with sqlite3.connect(self.db_path) as conn:
results = {}
for version in [version_a, version_b]:
cursor = conn.execute("""
SELECT metrics_json FROM prompts
WHERE name = ? AND version = ?
""", (name, version))
row = cursor.fetchone()
if row:
results[version] = json.loads(row[0])
            return results

Step 6: Prompt Testing
Test prompts for reliability and consistency:
# testing.py
from typing import List, Dict, Any, Callable, Optional
from dataclasses import dataclass
from enum import Enum
import statistics
import time
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from templates import PromptTemplate
from config import get_settings
class TestStatus(str, Enum):
PASSED = "passed"
FAILED = "failed"
WARNING = "warning"
@dataclass
class TestCase:
"""A single test case for a prompt."""
name: str
input_vars: Dict[str, Any]
expected_contains: Optional[List[str]] = None
expected_not_contains: Optional[List[str]] = None
expected_format: Optional[str] = None # "json", "markdown", etc.
max_tokens: Optional[int] = None
custom_validator: Optional[Callable[[str], bool]] = None
@dataclass
class TestResult:
"""Result of a single test."""
test_name: str
status: TestStatus
output: str
latency_ms: float
token_count: int
errors: List[str]
@dataclass
class TestSuiteResult:
"""Result of running all tests."""
prompt_name: str
total_tests: int
passed: int
failed: int
warnings: int
avg_latency_ms: float
consistency_score: float
results: List[TestResult]
class PromptTester:
"""
Test prompts for reliability and consistency.
Features:
- Content validation (contains/not contains)
- Format validation (JSON, markdown, etc.)
- Latency measurement
- Consistency testing (same input, multiple runs)
- Custom validators
"""
def __init__(self):
self.settings = get_settings()
self.llm = ChatOpenAI(
model=self.settings.default_model,
api_key=self.settings.openai_api_key,
temperature=0.7
)
def run_test(
self,
template: PromptTemplate,
test_case: TestCase
) -> TestResult:
"""Run a single test case."""
errors = []
start_time = time.time()
# Render and execute prompt
try:
rendered = template.render(**test_case.input_vars)
messages = [HumanMessage(content=rendered)]
response = self.llm.invoke(messages)
output = response.content
except Exception as e:
return TestResult(
test_name=test_case.name,
status=TestStatus.FAILED,
output="",
latency_ms=0,
token_count=0,
errors=[f"Execution error: {str(e)}"]
)
latency_ms = (time.time() - start_time) * 1000
        token_count = len(output.split())  # Rough approximation: whitespace tokens, not model tokens
# Validate output
if test_case.expected_contains:
for expected in test_case.expected_contains:
if expected.lower() not in output.lower():
errors.append(f"Missing expected content: '{expected}'")
if test_case.expected_not_contains:
for unexpected in test_case.expected_not_contains:
if unexpected.lower() in output.lower():
errors.append(f"Contains unexpected content: '{unexpected}'")
if test_case.expected_format:
if not self._validate_format(output, test_case.expected_format):
errors.append(f"Invalid format: expected {test_case.expected_format}")
if test_case.max_tokens and token_count > test_case.max_tokens:
errors.append(f"Exceeded max tokens: {token_count} > {test_case.max_tokens}")
if test_case.custom_validator:
try:
if not test_case.custom_validator(output):
errors.append("Custom validation failed")
except Exception as e:
errors.append(f"Custom validator error: {str(e)}")
# Determine status
if errors:
status = TestStatus.FAILED
elif latency_ms > 5000: # Warning if slow
status = TestStatus.WARNING
errors.append(f"Slow response: {latency_ms:.0f}ms")
else:
status = TestStatus.PASSED
return TestResult(
test_name=test_case.name,
status=status,
output=output,
latency_ms=latency_ms,
token_count=token_count,
errors=errors
)
def run_suite(
self,
template: PromptTemplate,
test_cases: List[TestCase]
) -> TestSuiteResult:
"""Run all test cases for a prompt."""
results = []
for test_case in test_cases:
result = self.run_test(template, test_case)
results.append(result)
# Calculate metrics
passed = sum(1 for r in results if r.status == TestStatus.PASSED)
failed = sum(1 for r in results if r.status == TestStatus.FAILED)
warnings = sum(1 for r in results if r.status == TestStatus.WARNING)
latencies = [r.latency_ms for r in results]
avg_latency = statistics.mean(latencies) if latencies else 0
return TestSuiteResult(
prompt_name=template.name,
total_tests=len(results),
passed=passed,
failed=failed,
warnings=warnings,
avg_latency_ms=avg_latency,
            consistency_score=passed / len(results) if results else 0,  # pass rate, not run-to-run similarity
results=results
)
def test_consistency(
self,
template: PromptTemplate,
input_vars: Dict[str, Any],
runs: int = 5
) -> Dict[str, Any]:
"""
Test output consistency across multiple runs.
Returns similarity metrics between outputs.
"""
outputs = []
latencies = []
for _ in range(runs):
start = time.time()
rendered = template.render(**input_vars)
messages = [HumanMessage(content=rendered)]
response = self.llm.invoke(messages)
outputs.append(response.content)
latencies.append((time.time() - start) * 1000)
# Calculate consistency
from difflib import SequenceMatcher
similarities = []
for i in range(len(outputs)):
for j in range(i + 1, len(outputs)):
sim = SequenceMatcher(None, outputs[i], outputs[j]).ratio()
similarities.append(sim)
return {
"runs": runs,
"avg_similarity": statistics.mean(similarities) if similarities else 1.0,
"min_similarity": min(similarities) if similarities else 1.0,
"avg_latency_ms": statistics.mean(latencies),
"latency_std": statistics.stdev(latencies) if len(latencies) > 1 else 0,
"outputs": outputs
}
def _validate_format(self, output: str, expected_format: str) -> bool:
"""Validate output format."""
        if expected_format == "json":
            import json
            import re
            try:
                json.loads(output)
                return True
            except json.JSONDecodeError:
                # Fall back: the model may wrap JSON in prose, so try to
                # extract the first {...} block and parse that instead
                json_match = re.search(r'\{[\s\S]*\}', output)
                if json_match:
                    try:
                        json.loads(json_match.group())
                        return True
                    except json.JSONDecodeError:
                        pass
                return False
elif expected_format == "markdown":
# Check for markdown elements
return any([
output.startswith('#'),
'**' in output,
'- ' in output,
'```' in output
])
elif expected_format == "list":
return any([
'\n- ' in output,
'\n* ' in output,
'\n1.' in output
])
        return True  # Unknown format, assume valid

Step 7: Prompt Registry
Central registry for managing prompts:
# registry.py
from typing import Dict, List, Optional, Any
from pathlib import Path
import yaml
import json
from templates import PromptTemplate, COMMON_TEMPLATES
from versioning import PromptVersionStore
from testing import PromptTester, TestCase, TestSuiteResult
class PromptRegistry:
"""
Central registry for prompt management.
Features:
- Load prompts from files or code
- Version management
- Testing integration
- A/B test support
"""
def __init__(self, db_path: str = "prompts.db"):
self.store = PromptVersionStore(db_path)
self.tester = PromptTester()
self._cache: Dict[str, PromptTemplate] = {}
self._ab_tests: Dict[str, Dict[str, float]] = {}
# Load common templates
for name, template in COMMON_TEMPLATES.items():
self.register(template)
def register(
self,
template: PromptTemplate,
version: Optional[str] = None
) -> None:
"""Register a prompt template."""
self.store.save(template, version)
self._cache[template.name] = template
def get(
self,
name: str,
version: Optional[str] = None
) -> Optional[PromptTemplate]:
"""Get a prompt by name."""
# Check cache first
if name in self._cache and version is None:
return self._cache[name]
# Load from store
template = self.store.get(name, version)
if template:
self._cache[name] = template
return template
def get_for_request(self, name: str, request_id: str) -> PromptTemplate:
"""
Get prompt for a request, handling A/B tests.
Args:
name: Prompt name
request_id: Unique request identifier for consistent assignment
Returns:
Selected PromptTemplate
"""
if name in self._ab_tests:
# A/B test active - select version based on request_id
import hashlib
hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
normalized = (hash_val % 1000) / 1000.0
cumulative = 0.0
for version, weight in self._ab_tests[name].items():
cumulative += weight
if normalized < cumulative:
return self.get(name, version)
# Return active or latest version
return self.store.get_active(name) or self.get(name)
def start_ab_test(
self,
name: str,
versions: Dict[str, float]
) -> None:
"""
Start an A/B test for a prompt.
Args:
name: Prompt name
versions: Dict of version -> weight (should sum to 1.0)
"""
total = sum(versions.values())
if abs(total - 1.0) > 0.01:
raise ValueError(f"Weights must sum to 1.0, got {total}")
self._ab_tests[name] = versions
def stop_ab_test(self, name: str) -> None:
"""Stop an A/B test."""
if name in self._ab_tests:
del self._ab_tests[name]
def test(
self,
name: str,
test_cases: List[TestCase],
version: Optional[str] = None
) -> TestSuiteResult:
"""Test a prompt with given test cases."""
template = self.get(name, version)
if not template:
raise ValueError(f"Prompt '{name}' not found")
return self.tester.run_suite(template, test_cases)
def promote(self, name: str, version: str) -> bool:
"""Promote a version to active (production)."""
return self.store.set_active(name, version)
def load_from_yaml(self, path: str) -> List[str]:
"""Load prompts from YAML file."""
loaded = []
with open(path) as f:
data = yaml.safe_load(f)
for prompt_data in data.get("prompts", []):
template = PromptTemplate.from_dict(prompt_data)
self.register(template)
loaded.append(template.name)
return loaded
def export_to_yaml(self, path: str, names: Optional[List[str]] = None) -> None:
"""Export prompts to YAML file."""
prompts = []
for name in (names or self._cache.keys()):
template = self.get(name)
if template:
prompts.append(template.to_dict())
with open(path, 'w') as f:
yaml.dump({"prompts": prompts}, f, default_flow_style=False)
def list_prompts(self) -> List[Dict[str, Any]]:
"""List all registered prompts."""
prompts = []
for name in self._cache.keys():
versions = self.store.list_versions(name)
prompts.append({
"name": name,
"versions": versions,
"ab_test_active": name in self._ab_tests
})
        return prompts

Usage Examples
Basic Template Usage
from templates import PromptTemplate, PromptBuilder
# Using pre-built template
from templates import COMMON_TEMPLATES
summarize = COMMON_TEMPLATES["summarize"]
prompt = summarize.render(text="Long article here...", style="bullet points")
# Building custom template
template = (
PromptBuilder("code_review")
.with_description("Review code for issues")
.add_section("Review the following code and identify:")
.add_section("1. Bugs and errors")
.add_section("2. Security issues")
.add_section("3. Performance problems")
.add_section("\nCode:")
.add_variable("code", "The code to review")
.add_section("\nProvide your review:")
.build()
)
prompt = template.render(code="def foo(): pass")

Few-Shot Learning
from few_shot import Example, FewShotPrompt, SelectionStrategy
examples = [
Example(input="happy", output="positive"),
Example(input="sad", output="negative"),
Example(input="excited", output="positive"),
Example(input="angry", output="negative"),
]
few_shot = FewShotPrompt(
task_description="Classify the sentiment of the word.",
examples=examples,
selection_strategy=SelectionStrategy.SEMANTIC
)
prompt = few_shot.format("joyful", num_examples=2)

Chain-of-Thought
from chain_of_thought import ChainOfThoughtBuilder, ReasoningStyle
cot = ChainOfThoughtBuilder(style=ReasoningStyle.STRUCTURED)
result = cot.reason("Should we use microservices or monolith for a startup?")
print(f"Answer: {result.final_answer}")
print(f"Confidence: {result.confidence}")
for step in result.reasoning_steps:
    print(f"Step {step.step_number}: {step.thought[:100]}...")

Requirements
# requirements.txt
langchain>=0.3.0
langchain-openai>=0.2.0
openai>=1.50.0
jinja2>=3.1.0
pydantic>=2.9.0
pydantic-settings>=2.6.0
numpy>=1.26.0
pyyaml>=6.0.0

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Prompt Templates | Reusable prompts with Jinja2 variables | Separate structure from content, enable iteration |
| Few-Shot Learning | Include examples in the prompt | LLM learns output format from examples |
| Semantic Selection | Choose examples similar to query | Better examples = better outputs |
| Chain-of-Thought | "Let's think step by step" prompting | Improves reasoning on complex tasks |
| Self-Consistency | Sample multiple paths, pick majority | Catches reasoning errors, improves accuracy |
| Prompt Versioning | Store versions in SQLite with metrics | Track what worked, enable rollback |
| A/B Testing | Route traffic to different versions | Data-driven prompt optimization |
| Consistency Testing | Run same input multiple times | Ensure reliable, predictable outputs |
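The A/B routing in `get_for_request` deserves a closer look: it MD5-hashes the request id into one of 1,000 buckets, normalizes to [0.0, 1.0), and walks the cumulative version weights. The sketch below mirrors that logic as a hypothetical standalone `assign_version` helper, so the assignment behavior can be checked without a registry or an API key:

```python
import hashlib
from typing import Dict


def assign_version(request_id: str, weights: Dict[str, float]) -> str:
    """Deterministically pick a version for a request id.

    Mirrors the bucketing in PromptRegistry.get_for_request: hash the id,
    reduce it to one of 1,000 buckets in [0.0, 1.0), then walk the
    cumulative weights until the bucket falls inside a version's slice.
    """
    hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
    normalized = (hash_val % 1000) / 1000.0
    cumulative = 0.0
    version = ""
    for version, weight in weights.items():
        cumulative += weight
        if normalized < cumulative:
            return version
    return version  # guard against float rounding when weights sum to ~1.0


weights = {"1.0.0": 0.5, "1.1.0": 0.5}

# The same request id always maps to the same version (sticky assignment)
assert assign_version("user-42", weights) == assign_version("user-42", weights)

# Across many distinct ids, the split tracks the configured weights
counts = {"1.0.0": 0, "1.1.0": 0}
for i in range(1000):
    counts[assign_version(f"req-{i}", weights)] += 1
```

Because assignment depends only on the request id, a given user sees the same prompt version for the lifetime of the test, while aggregate traffic follows the configured split.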
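Similarly, the pairwise-similarity math inside `test_consistency` is pure string comparison and can be verified offline. This sketch uses the same `difflib.SequenceMatcher` approach, with `consistency_score` as a hypothetical helper applied to canned outputs instead of live model responses:

```python
from difflib import SequenceMatcher
from itertools import combinations
import statistics


def consistency_score(outputs: list) -> float:
    """Mean pairwise similarity ratio, as computed in test_consistency."""
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(outputs, 2)]
    return statistics.mean(sims) if sims else 1.0


# Identical outputs: fully consistent
assert consistency_score(["The answer is 42."] * 3) == 1.0

# Paraphrased outputs: same meaning, lower string-level similarity
varied = ["The answer is 42.", "It is 42.", "Forty-two is the answer."]
assert 0.0 < consistency_score(varied) < 1.0
```

Scores near 1.0 indicate near-deterministic outputs; low scores suggest lowering temperature or tightening the prompt's format instructions. Note this measures surface stability only, since paraphrases of the same answer score below 1.0.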