Enterprise Customer Support System
Build a production RAG system that handles 100K+ support tickets with intelligent routing and response generation
TL;DR
Build a production-grade support system that classifies tickets, retrieves relevant knowledge, and routes intelligently based on confidence. High-confidence answers go out automatically (40% of tickets), medium-confidence get agent drafts, and low-confidence escalate to specialists. The secret sauce: content-type-aware chunking, hybrid search, and confidence-based routing.
Build a production-grade customer support system that uses RAG to automatically answer customer questions, route complex issues, and reduce support costs by 60%.
| Industry | SaaS / E-commerce |
| Difficulty | Advanced |
| Time | 1 week |
| Code | ~1500 lines |
What You'll Build
A complete customer support automation system that:
- Ingests knowledge sources - Help docs, FAQs, past tickets, product documentation
- Classifies incoming tickets - Urgency, category, sentiment analysis
- Generates responses - Draft answers using RAG with source citations
- Routes intelligently - Escalate to human agents when confidence is low
- Learns continuously - Improve from agent feedback and corrections
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ ENTERPRISE SUPPORT RAG ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ KNOWLEDGE INGESTION │ │
│ │ Help Center ─┬─► FAQ Database ─┬─► Historical Tickets ─┬─► Docs │ │
│ └────────────────┴─────────────────┴───────────────────────┴─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ DOCUMENT PROCESSING │ │
│ │ Smart Chunking ──────────► Embeddings ──────────► Metadata │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ INTELLIGENT RETRIEVAL │ │
│ │ Hybrid Search ───────────► Reranking ───────────► Filtering │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RESPONSE GENERATION │ │
│ │ Classification ──────────► RAG Pipeline ─────────► Confidence │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ OUTPUT HANDLING │ │
│ │ ┌──────────────┼──────────────┐ │ │
│ │ ▼ ▼ ▼ │ │
│ │ Auto-Response Agent Draft Escalation │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Project Structure
enterprise-support/
├── src/
│ ├── __init__.py
│ ├── config.py # Configuration management
│ ├── ingestion/
│ │ ├── __init__.py
│ │ ├── loader.py # Document loaders
│ │ ├── chunker.py # Smart chunking
│ │ └── processor.py # Metadata extraction
│ ├── retrieval/
│ │ ├── __init__.py
│ │ ├── embeddings.py # Embedding generation
│ │ ├── vector_store.py # Vector database
│ │ ├── hybrid_search.py # Hybrid retrieval
│ │ └── reranker.py # Cross-encoder reranking
│ ├── classification/
│ │ ├── __init__.py
│ │ ├── classifier.py # Ticket classification
│ │ └── sentiment.py # Sentiment analysis
│ ├── generation/
│ │ ├── __init__.py
│ │ ├── rag_pipeline.py # RAG response generation
│ │ ├── confidence.py # Confidence scoring
│ │ └── templates.py # Response templates
│ ├── routing/
│ │ ├── __init__.py
│ │ └── router.py # Ticket routing logic
│ └── api/
│ ├── __init__.py
│ ├── main.py # FastAPI application
│ └── models.py # Pydantic models
├── tests/
├── docker-compose.yml
├── Dockerfile
└── requirements.txt
Step 1: Configuration
# src/config.py
from pydantic_settings import BaseSettings
from typing import Optional
class Settings(BaseSettings):
# OpenAI
openai_api_key: str
embedding_model: str = "text-embedding-3-large"
llm_model: str = "gpt-4o"
# Vector Store (Qdrant)
qdrant_url: str = "http://localhost:6333"
qdrant_collection: str = "support_knowledge"
# Redis
redis_url: str = "redis://localhost:6379"
# Classification thresholds
auto_response_confidence: float = 0.85
escalation_threshold: float = 0.4
# Retrieval settings
retrieval_top_k: int = 10
rerank_top_k: int = 5
class Config:
env_file = ".env"
settings = Settings()
Understanding the Configuration Thresholds:
| Setting | Value | Purpose |
|---|---|---|
| auto_response_confidence | 0.85 | Only auto-respond when 85%+ confident - a high bar to avoid bad answers |
| escalation_threshold | 0.4 | Below 40% confidence → escalate to a human agent |
| retrieval_top_k | 10 | Retrieve 10 candidates for hybrid search fusion |
| rerank_top_k | 5 | Keep the top 5 after reranking as generation context |
The gap between 0.4 and 0.85 is the "agent draft" zone - the system provides a suggested response but requires human approval.
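The three zones can be sketched as a plain function. This is a minimal illustration, not project code; the threshold constants mirror the values in `src/config.py` above, and `route_by_confidence` is a hypothetical helper name:

```python
AUTO_RESPONSE_CONFIDENCE = 0.85  # mirrors src/config.py
ESCALATION_THRESHOLD = 0.4

def route_by_confidence(confidence: float) -> str:
    """Map a response confidence score to one of the three handling zones."""
    if confidence >= AUTO_RESPONSE_CONFIDENCE:
        return "auto_response"  # sent to the customer automatically
    if confidence < ESCALATION_THRESHOLD:
        return "escalate"       # handed straight to a human specialist
    return "agent_draft"        # suggested reply; an agent must approve it

for c in (0.92, 0.60, 0.25):
    print(f"{c:.2f} -> {route_by_confidence(c)}")
```

Note that a score of exactly 0.4 still produces a draft; only scores strictly below the threshold escalate.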
Step 2: Document Ingestion
Smart Chunking Strategy
# src/ingestion/chunker.py
from typing import List, Dict, Any
from dataclasses import dataclass
from langchain.text_splitter import RecursiveCharacterTextSplitter
import re
@dataclass
class Chunk:
content: str
metadata: Dict[str, Any]
chunk_id: str
class SmartChunker:
"""
Context-aware chunking that preserves semantic boundaries.
Different strategies for different content types.
"""
def __init__(
self,
chunk_size: int = 512,
chunk_overlap: int = 50
):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
# FAQ chunker - keeps Q&A pairs together
self.faq_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=0, # No overlap for FAQs
separators=["\n\n", "\n"]
)
# Documentation chunker - respects headers
self.doc_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
# Ticket chunker - keeps conversations intact
self.ticket_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size * 2, # Larger for context
chunk_overlap=chunk_overlap,
separators=["\n---\n", "\n\n", "\n"]
)
def chunk_faq(
self,
content: str,
source_id: str,
base_metadata: Dict[str, Any]
) -> List[Chunk]:
"""Chunk FAQ content, keeping Q&A pairs together."""
chunks = []
# Parse Q&A pairs
qa_pattern = r'Q:\s*(.*?)\nA:\s*(.*?)(?=\nQ:|$)'
matches = re.findall(qa_pattern, content, re.DOTALL)
for i, (question, answer) in enumerate(matches):
chunk_content = f"Question: {question.strip()}\n\nAnswer: {answer.strip()}"
chunks.append(Chunk(
content=chunk_content,
metadata={
**base_metadata,
"content_type": "faq",
"question": question.strip(),
"chunk_index": i
},
chunk_id=f"{source_id}_faq_{i}"
))
return chunks
def chunk_documentation(
self,
content: str,
source_id: str,
base_metadata: Dict[str, Any]
) -> List[Chunk]:
"""Chunk documentation while preserving header context."""
chunks = []
# Extract sections with headers
sections = self._extract_sections(content)
for section_idx, section in enumerate(sections):
header = section.get("header", "")
body = section.get("body", "")
# Split body if too large
if len(body) > self.chunk_size:
sub_chunks = self.doc_splitter.split_text(body)
for i, sub_chunk in enumerate(sub_chunks):
# Prepend header for context
chunk_content = f"{header}\n\n{sub_chunk}" if header else sub_chunk
chunks.append(Chunk(
content=chunk_content,
metadata={
**base_metadata,
"content_type": "documentation",
"section_header": header,
"chunk_index": f"{section_idx}_{i}"
},
chunk_id=f"{source_id}_doc_{section_idx}_{i}"
))
else:
chunk_content = f"{header}\n\n{body}" if header else body
chunks.append(Chunk(
content=chunk_content,
metadata={
**base_metadata,
"content_type": "documentation",
"section_header": header,
"chunk_index": section_idx
},
chunk_id=f"{source_id}_doc_{section_idx}"
))
return chunks
def chunk_ticket(
self,
content: str,
source_id: str,
base_metadata: Dict[str, Any],
resolution: str = None
) -> List[Chunk]:
"""Chunk support tickets, emphasizing resolution."""
chunks = []
# If there's a resolution, create a focused chunk
if resolution:
resolution_chunk = f"Issue: {content[:500]}...\n\nResolution: {resolution}"
chunks.append(Chunk(
content=resolution_chunk,
metadata={
**base_metadata,
"content_type": "ticket_resolution",
"has_resolution": True,
"chunk_index": 0
},
chunk_id=f"{source_id}_resolution"
))
# Chunk the full conversation
sub_chunks = self.ticket_splitter.split_text(content)
for i, sub_chunk in enumerate(sub_chunks):
chunks.append(Chunk(
content=sub_chunk,
metadata={
**base_metadata,
"content_type": "ticket_conversation",
"has_resolution": resolution is not None,
"chunk_index": i + 1
},
chunk_id=f"{source_id}_conv_{i}"
))
return chunks
def _extract_sections(self, content: str) -> List[Dict[str, str]]:
"""Extract sections with their headers."""
sections = []
# Split by headers (## or ###)
header_pattern = r'^(#{2,3}\s+.+)$'
parts = re.split(header_pattern, content, flags=re.MULTILINE)
current_header = ""
for part in parts:
if re.match(header_pattern, part):
current_header = part.strip()
elif part.strip():
sections.append({
"header": current_header,
"body": part.strip()
})
return sections
Why Content-Type-Aware Chunking?
Standard chunking treats all text the same, which destroys semantic boundaries:
┌─────────────────────────────────────────────────────────────┐
│ STANDARD CHUNKING: Breaks at arbitrary character count │
├─────────────────────────────────────────────────────────────┤
│ │
│ FAQ: "Q: How do I reset my password? │ ← BROKEN! │
│ A: Go to settings and..." │ │
│ │
│ Problem: Question separated from answer = useless chunk │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SMART CHUNKING: Preserves semantic units │
├─────────────────────────────────────────────────────────────┤
│ │
│ FAQ → Keep Q&A pairs together (no overlap) │
│ Docs → Keep headers attached to content │
│ Tickets → Prioritize resolution, larger context │
│ │
└─────────────────────────────────────────────────────────────┘
| Content Type | Strategy | Why |
|---|---|---|
| FAQ | Parse Q&A pairs with regex | Questions without answers are useless |
| Documentation | Split on headers, prepend header to chunks | "Step 3" means nothing without section context |
| Tickets | Larger chunks, resolution-focused | Conversation context and resolution are critical |
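The FAQ strategy is easy to verify in isolation. This standalone sketch applies the same regex used in `chunk_faq` to a made-up FAQ and shows that each Q&A pair survives as one unit:

```python
import re

# Made-up FAQ text in the "Q: ... A: ..." format the loader produces
faq_text = (
    "Q: How do I reset my password?\n"
    "A: Go to Settings > Security and click 'Reset password'.\n"
    "Q: How do I get a refund?\n"
    "A: Contact billing within 30 days of purchase."
)

# Same pattern as SmartChunker.chunk_faq: each question is captured together
# with its answer, up to the next "Q:" or the end of the text
qa_pattern = r'Q:\s*(.*?)\nA:\s*(.*?)(?=\nQ:|$)'
pairs = re.findall(qa_pattern, faq_text, re.DOTALL)

chunks = [f"Question: {q.strip()}\n\nAnswer: {a.strip()}" for q, a in pairs]
print(len(chunks))  # 2 chunks: one per Q&A pair, never split apart
```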
Document Loader
# src/ingestion/loader.py
from typing import List, Dict, Any, AsyncGenerator
from dataclasses import dataclass
import aiohttp
import asyncio
from pathlib import Path
import json
@dataclass
class Document:
content: str
metadata: Dict[str, Any]
source_id: str
class KnowledgeLoader:
"""Load knowledge from multiple sources."""
async def load_help_center(
self,
api_url: str,
api_key: str
) -> AsyncGenerator[Document, None]:
"""Load articles from help center API (e.g., Zendesk, Intercom)."""
async with aiohttp.ClientSession() as session:
headers = {"Authorization": f"Bearer {api_key}"}
async with session.get(
f"{api_url}/articles",
headers=headers
) as response:
data = await response.json()
for article in data.get("articles", []):
yield Document(
content=article["body"],
metadata={
"source": "help_center",
"title": article["title"],
"category": article.get("category", "general"),
"updated_at": article.get("updated_at"),
"url": article.get("url")
},
source_id=f"help_{article['id']}"
)
async def load_faqs(self, faq_file: Path) -> AsyncGenerator[Document, None]:
"""Load FAQs from JSON file."""
with open(faq_file) as f:
faqs = json.load(f)
for faq in faqs:
content = f"Q: {faq['question']}\nA: {faq['answer']}"
yield Document(
content=content,
metadata={
"source": "faq",
"category": faq.get("category", "general"),
"tags": faq.get("tags", [])
},
source_id=f"faq_{faq['id']}"
)
async def load_tickets(
self,
api_url: str,
api_key: str,
resolved_only: bool = True
) -> AsyncGenerator[Document, None]:
"""Load historical tickets with resolutions."""
async with aiohttp.ClientSession() as session:
headers = {"Authorization": f"Bearer {api_key}"}
params = {"status": "resolved"} if resolved_only else {}
async with session.get(
f"{api_url}/tickets",
headers=headers,
params=params
) as response:
data = await response.json()
for ticket in data.get("tickets", []):
# Build conversation thread
conversation = self._build_conversation(ticket)
yield Document(
content=conversation,
metadata={
"source": "ticket",
"category": ticket.get("category"),
"priority": ticket.get("priority"),
"resolution": ticket.get("resolution"),
"satisfaction_score": ticket.get("satisfaction_score"),
"resolved_at": ticket.get("resolved_at")
},
source_id=f"ticket_{ticket['id']}"
)
def _build_conversation(self, ticket: Dict[str, Any]) -> str:
"""Build readable conversation from ticket."""
parts = [f"Subject: {ticket.get('subject', 'No subject')}"]
for message in ticket.get("messages", []):
role = "Customer" if message["role"] == "customer" else "Agent"
parts.append(f"\n{role}: {message['content']}")
if ticket.get("resolution"):
parts.append(f"\n\nResolution: {ticket['resolution']}")
return "\n".join(parts)
Step 3: Vector Store & Retrieval
# src/retrieval/vector_store.py
from typing import List, Dict, Any, Optional
from qdrant_client import QdrantClient, models
from qdrant_client.http.models import Distance, VectorParams
import numpy as np
from openai import OpenAI
from ..config import settings
from ..ingestion.chunker import Chunk
class VectorStore:
"""Qdrant vector store with metadata filtering."""
def __init__(self):
self.client = QdrantClient(url=settings.qdrant_url)
self.openai = OpenAI(api_key=settings.openai_api_key)
self.collection_name = settings.qdrant_collection
self.embedding_dim = 3072 # text-embedding-3-large
self._ensure_collection()
def _ensure_collection(self):
"""Create collection if it doesn't exist."""
collections = self.client.get_collections().collections
exists = any(c.name == self.collection_name for c in collections)
if not exists:
self.client.create_collection(
collection_name=self.collection_name,
vectors_config=VectorParams(
size=self.embedding_dim,
distance=Distance.COSINE
)
)
# Create payload indexes for filtering
self.client.create_payload_index(
collection_name=self.collection_name,
field_name="content_type",
field_schema=models.PayloadSchemaType.KEYWORD
)
self.client.create_payload_index(
collection_name=self.collection_name,
field_name="category",
field_schema=models.PayloadSchemaType.KEYWORD
)
def embed_text(self, text: str) -> List[float]:
"""Generate embedding for text."""
response = self.openai.embeddings.create(
model=settings.embedding_model,
input=text
)
return response.data[0].embedding
def embed_batch(self, texts: List[str]) -> List[List[float]]:
"""Generate embeddings for batch of texts."""
response = self.openai.embeddings.create(
model=settings.embedding_model,
input=texts
)
return [d.embedding for d in response.data]
async def upsert_chunks(self, chunks: List[Chunk], batch_size: int = 100):
"""Upsert chunks to vector store."""
for i in range(0, len(chunks), batch_size):
batch = chunks[i:i + batch_size]
# Generate embeddings
texts = [c.content for c in batch]
embeddings = self.embed_batch(texts)
# Prepare points
points = [
models.PointStruct(
id=hash(chunk.chunk_id) % (2**63),  # Convert to int64. Note: Python's hash() is salted per process; use a stable hash (e.g. hashlib) in production so re-ingestion overwrites rather than duplicates
vector=embedding,
payload={
"chunk_id": chunk.chunk_id,
"content": chunk.content,
**chunk.metadata
}
)
for chunk, embedding in zip(batch, embeddings)
]
# Upsert
self.client.upsert(
collection_name=self.collection_name,
points=points
)
def search(
self,
query: str,
top_k: int = 10,
content_types: Optional[List[str]] = None,
category: Optional[str] = None
) -> List[Dict[str, Any]]:
"""Search with optional metadata filtering."""
query_embedding = self.embed_text(query)
# Build filter conditions
must_conditions = []
if content_types:
must_conditions.append(
models.FieldCondition(
key="content_type",
match=models.MatchAny(any=content_types)
)
)
if category:
must_conditions.append(
models.FieldCondition(
key="category",
match=models.MatchValue(value=category)
)
)
query_filter = models.Filter(must=must_conditions) if must_conditions else None
results = self.client.search(
collection_name=self.collection_name,
query_vector=query_embedding,
query_filter=query_filter,
limit=top_k
)
return [
{
"content": r.payload["content"],
"metadata": {k: v for k, v in r.payload.items() if k != "content"},
"score": r.score
}
for r in results
]
Why Use Qdrant with Payload Indexes?
Support systems need filtered retrieval - billing questions should search billing docs, not feature requests:
┌─────────────────────────────────────────────────────────────┐
│ WITHOUT FILTERS: Irrelevant results dilute context │
├─────────────────────────────────────────────────────────────┤
│ Query: "How do I get a refund?" │
│ Results: Feature docs, billing docs, random FAQ all mixed │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ WITH CATEGORY FILTER: Focused, relevant results │
├─────────────────────────────────────────────────────────────┤
│ Query: "How do I get a refund?" + category="billing" │
│ Results: Only billing-related docs → better answers │
└─────────────────────────────────────────────────────────────┘
The create_payload_index calls enable fast filtering on content_type and category without scanning all vectors.
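To see the effect without a running Qdrant instance, here is a toy in-memory version of the same idea; the documents and similarity scores are fabricated for illustration:

```python
# Toy knowledge base: (content, metadata, similarity score) -- all fabricated
results = [
    ("Refunds are processed within 5 business days.",      {"category": "billing"},  0.82),
    ("Enable dark mode under Appearance settings.",        {"category": "features"}, 0.79),
    ("Invoices can be downloaded from Billing > History.", {"category": "billing"},  0.74),
]

def search(category=None):
    """Return hits sorted by score, optionally restricted to one category."""
    hits = [r for r in results if category is None or r[1]["category"] == category]
    return sorted(hits, key=lambda r: r[2], reverse=True)

print(len(search()))                    # 3: everything comes back unfiltered
print(len(search(category="billing")))  # 2: the feature doc is excluded
```

Qdrant applies the same restriction inside the index, so the filter never has to scan every vector.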
Hybrid Search
# src/retrieval/hybrid_search.py
from typing import List, Dict, Any, Optional
from rank_bm25 import BM25Okapi
import numpy as np
from dataclasses import dataclass
from .vector_store import VectorStore
@dataclass
class SearchResult:
content: str
metadata: Dict[str, Any]
vector_score: float
bm25_score: float
hybrid_score: float
class HybridSearch:
"""
Combines semantic (vector) search with keyword (BM25) search.
Uses Reciprocal Rank Fusion (RRF) for score combination.
"""
def __init__(
self,
vector_store: VectorStore,
alpha: float = 0.7 # Weight for vector search
):
self.vector_store = vector_store
self.alpha = alpha
self.bm25_index = None
self.corpus = []
self.corpus_metadata = []
def build_bm25_index(self, documents: List[Dict[str, Any]]):
"""Build BM25 index from documents."""
self.corpus = [doc["content"] for doc in documents]
self.corpus_metadata = [doc.get("metadata", {}) for doc in documents]
# Tokenize for BM25
tokenized_corpus = [doc.lower().split() for doc in self.corpus]
self.bm25_index = BM25Okapi(tokenized_corpus)
def search(
self,
query: str,
top_k: int = 10,
content_types: Optional[List[str]] = None,
category: Optional[str] = None
) -> List[SearchResult]:
"""Perform hybrid search."""
# Vector search
vector_results = self.vector_store.search(
query=query,
top_k=top_k * 2, # Get more for fusion
content_types=content_types,
category=category
)
# BM25 search
bm25_scores = []
if self.bm25_index:
tokenized_query = query.lower().split()
bm25_scores = self.bm25_index.get_scores(tokenized_query)
# Combine using RRF
combined_results = self._reciprocal_rank_fusion(
vector_results=vector_results,
bm25_scores=bm25_scores,
top_k=top_k
)
return combined_results
def _reciprocal_rank_fusion(
self,
vector_results: List[Dict[str, Any]],
bm25_scores: np.ndarray,
top_k: int,
k: int = 60 # RRF constant
) -> List[SearchResult]:
"""Combine rankings using Reciprocal Rank Fusion."""
scores = {}
content_map = {}
metadata_map = {}
vector_score_map = {}
bm25_score_map = {}
# Process vector results
for rank, result in enumerate(vector_results):
content = result["content"]
scores[content] = scores.get(content, 0) + self.alpha / (k + rank + 1)
content_map[content] = content
metadata_map[content] = result["metadata"]
vector_score_map[content] = result["score"]
# Process BM25 results
if len(bm25_scores) > 0:
bm25_ranking = np.argsort(bm25_scores)[::-1][:top_k * 2]
for rank, idx in enumerate(bm25_ranking):
content = self.corpus[idx]
scores[content] = scores.get(content, 0) + (1 - self.alpha) / (k + rank + 1)
content_map[content] = content
if content not in metadata_map:
metadata_map[content] = self.corpus_metadata[idx]
bm25_score_map[content] = float(bm25_scores[idx])
# Sort by combined score
sorted_results = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
return [
SearchResult(
content=content,
metadata=metadata_map.get(content, {}),
vector_score=vector_score_map.get(content, 0.0),
bm25_score=bm25_score_map.get(content, 0.0),
hybrid_score=score
)
for content, score in sorted_results
]
Why Hybrid Search for Support?
Support queries are often a mix of exact terms ("error code E1234") and semantic meaning ("my payment failed"). Neither pure vector search nor pure keyword search handles both well:
| Query Type | Vector Search | BM25 | Hybrid |
|---|---|---|---|
| "error code E1234" | ❌ Finds similar errors | ✅ Exact match | ✅ |
| "can't log in" | ✅ Semantic match | ❌ Misses "authentication failed" | ✅ |
| "E1234 login issue" | ⚠️ Partial | ⚠️ Partial | ✅ Best of both |
Reciprocal Rank Fusion (RRF) combines rankings without needing calibrated scores:
- Vector says Doc A is #1, Doc B is #2
- BM25 says Doc B is #1, Doc A is #3
- RRF: Both appear highly in both → both get high final scores
The alpha=0.7 weights vector search higher because most support queries are semantic, but BM25 catches exact matches.
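The fusion arithmetic is small enough to check by hand. A minimal sketch with two hypothetical rankings, using the same k=60 and alpha=0.7 as the code above:

```python
K = 60       # RRF constant: dampens the influence of absolute rank position
ALPHA = 0.7  # weight on the vector ranking; 1 - ALPHA goes to BM25

vector_ranking = ["doc_a", "doc_b", "doc_c"]  # hypothetical result orders
bm25_ranking   = ["doc_b", "doc_c", "doc_a"]

scores = {}
for rank, doc in enumerate(vector_ranking):
    scores[doc] = scores.get(doc, 0.0) + ALPHA / (K + rank + 1)
for rank, doc in enumerate(bm25_ranking):
    scores[doc] = scores.get(doc, 0.0) + (1 - ALPHA) / (K + rank + 1)

fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # doc_a narrowly beats doc_b; doc_c trails in both lists
```

Because RRF only uses rank positions, the raw vector and BM25 scores never need to live on the same scale.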
Reranker
# src/retrieval/reranker.py
from typing import List
from sentence_transformers import CrossEncoder
from .hybrid_search import SearchResult
class Reranker:
"""Cross-encoder reranker for improved relevance."""
def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-12-v2"):
self.model = CrossEncoder(model_name)
def rerank(
self,
query: str,
results: List[SearchResult],
top_k: int = 5
) -> List[SearchResult]:
"""Rerank results using cross-encoder."""
if not results:
return []
# Prepare pairs for cross-encoder
pairs = [(query, r.content) for r in results]
# Get rerank scores
rerank_scores = self.model.predict(pairs)
# Combine with original scores (weighted)
for result, rerank_score in zip(results, rerank_scores):
# Boost hybrid score with rerank score
result.hybrid_score = 0.3 * result.hybrid_score + 0.7 * float(rerank_score)
# Sort by new scores
results.sort(key=lambda x: x.hybrid_score, reverse=True)
return results[:top_k]
Why Add Reranking After Hybrid Search?
Hybrid search retrieves candidates, but order matters for generation. Cross-encoder reranking improves ranking precision:
┌─────────────────────────────────────────────────────────────┐
│ RETRIEVAL PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Query ─┬─► Vector (10) ─┬─► Hybrid Fusion (10) ─► Rerank │
│ │ │ │ │
│ └─► BM25 (10) ───┘ │ │
│ ▼ │
│ Top 5 docs │
│ │
│ Stage 1: Fast but imprecise (bi-encoders) │
│ Stage 2: Slow but precise (cross-encoder) │
└─────────────────────────────────────────────────────────────┘
| Stage | Model Type | Speed | Precision |
|---|---|---|---|
| Hybrid Search | Bi-encoder | ~10ms | Good |
| Reranking | Cross-encoder | ~100ms | Excellent |
The reranker's 0.3 * hybrid + 0.7 * rerank weighting prioritizes cross-encoder scores since they're more accurate at relevance judgment.
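The blend step is simple enough to verify numerically. In this sketch (scores invented), a document that hybrid search ranked second overtakes the first once the cross-encoder disagrees strongly:

```python
# (doc, hybrid_score, cross_encoder_score) -- fabricated values
candidates = [
    ("doc_x", 0.80, 0.20),  # hybrid search liked this one best
    ("doc_y", 0.70, 0.90),  # the cross-encoder strongly disagrees
]

# Same blend as Reranker.rerank: 0.3 * hybrid + 0.7 * cross-encoder
blended = [(doc, 0.3 * h + 0.7 * r) for doc, h, r in candidates]
blended.sort(key=lambda x: x[1], reverse=True)

print(blended[0][0])  # doc_y wins: 0.84 vs 0.38 (up to float rounding)
```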
Step 4: Ticket Classification
# src/classification/classifier.py
from typing import Dict, Any, List, Tuple
from dataclasses import dataclass
from enum import Enum
from openai import OpenAI
import json
from ..config import settings
class TicketCategory(Enum):
BILLING = "billing"
TECHNICAL = "technical"
ACCOUNT = "account"
FEATURE_REQUEST = "feature_request"
BUG_REPORT = "bug_report"
GENERAL = "general"
class TicketPriority(Enum):
URGENT = "urgent"
HIGH = "high"
MEDIUM = "medium"
LOW = "low"
@dataclass
class TicketClassification:
category: TicketCategory
priority: TicketPriority
sentiment: str # positive, neutral, negative, frustrated
confidence: float
suggested_tags: List[str]
requires_escalation: bool
escalation_reason: str = None
class TicketClassifier:
"""Classify incoming support tickets."""
def __init__(self):
self.client = OpenAI(api_key=settings.openai_api_key)
def classify(self, ticket_content: str, customer_context: Dict[str, Any] = None) -> TicketClassification:
"""Classify a support ticket."""
context_info = ""
if customer_context:
context_info = f"""
Customer Context:
- Account Type: {customer_context.get('account_type', 'unknown')}
- Customer Since: {customer_context.get('customer_since', 'unknown')}
- Previous Tickets (30 days): {customer_context.get('recent_tickets', 0)}
- Account Value: {customer_context.get('account_value', 'unknown')}
"""
prompt = f"""Analyze this support ticket and classify it.
Ticket Content:
{ticket_content}
{context_info}
Provide classification in JSON format:
{{
"category": "billing|technical|account|feature_request|bug_report|general",
"priority": "urgent|high|medium|low",
"sentiment": "positive|neutral|negative|frustrated",
"confidence": 0.0-1.0,
"suggested_tags": ["tag1", "tag2"],
"requires_escalation": true|false,
"escalation_reason": "reason if escalation needed"
}}
Classification Guidelines:
- URGENT: Service down, security issue, data loss
- HIGH: Blocking issue, frustrated VIP customer
- MEDIUM: Standard issues with workaround
- LOW: General questions, feature requests
Consider escalation if:
- Customer mentions legal action or cancellation
- Multiple failed attempts mentioned
- VIP/Enterprise customer with critical issue
- Security or privacy concern"""
response = self.client.chat.completions.create(
model=settings.llm_model,
messages=[
{"role": "system", "content": "You are a support ticket classifier. Respond only with valid JSON."},
{"role": "user", "content": prompt}
],
response_format={"type": "json_object"},
temperature=0.1
)
result = json.loads(response.choices[0].message.content)
return TicketClassification(
category=TicketCategory(result["category"]),
priority=TicketPriority(result["priority"]),
sentiment=result["sentiment"],
confidence=result["confidence"],
suggested_tags=result["suggested_tags"],
requires_escalation=result["requires_escalation"],
escalation_reason=result.get("escalation_reason")
)
Why Classify Before Retrieving?
Classification happens before RAG retrieval for two reasons:
- Enables filtered retrieval - If we know it's a billing question, search billing docs
- Catches escalations early - Don't waste retrieval on "I'm going to sue you" tickets
┌─────────────────────────────────────────────────────────────┐
│ CLASSIFICATION DECISION TREE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Ticket ───► Classify ───┬───► [URGENT] ───► Skip RAG, │
│ │ Escalate Now │
│ │ │
│ ├───► [Normal] ───► Filter by │
│ │ category │
│ │ │
│ └───► [Escalation │ │
│ keywords] ───► Flag for │
│ review │
└─────────────────────────────────────────────────────────────┘
Escalation Detection Keywords:
- Legal threats: "lawyer", "sue", "legal action"
- Churn risk: "cancel", "competitor", "leaving"
- Security: "hacked", "data breach", "unauthorized"
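A cheap keyword pre-filter can flag these tickets before spending an LLM call. A minimal sketch; the keyword lists mirror the categories above, and `detect_escalation` is an illustrative helper, not part of the project code:

```python
# Keyword lists mirror the escalation categories above
ESCALATION_KEYWORDS = {
    "legal":    ["lawyer", "sue", "legal action"],
    "churn":    ["cancel", "competitor", "leaving"],
    "security": ["hacked", "data breach", "unauthorized"],
}

def detect_escalation(ticket_text: str) -> list:
    """Return every escalation category whose keywords appear in the ticket."""
    text = ticket_text.lower()
    # Naive substring matching: e.g. "issue" would match "sue"; use word
    # boundaries (regex \b) in production
    return [
        category
        for category, keywords in ESCALATION_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    ]

print(detect_escalation("If this isn't fixed I'm calling my lawyer and cancelling."))
```

Running this as a pre-filter lets the router short-circuit straight to a human queue on obvious risk signals.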
Step 5: RAG Response Generation
# src/generation/rag_pipeline.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from openai import OpenAI
from ..config import settings
from ..retrieval.hybrid_search import SearchResult
@dataclass
class GeneratedResponse:
response: str
confidence: float
sources: List[Dict[str, Any]]
suggested_actions: List[str]
requires_human_review: bool
review_reason: str = None
class RAGPipeline:
"""Generate responses using RAG."""
def __init__(self):
self.client = OpenAI(api_key=settings.openai_api_key)
def generate(
self,
query: str,
context: List[SearchResult],
ticket_classification: Dict[str, Any] = None,
customer_context: Dict[str, Any] = None,
conversation_history: List[Dict[str, str]] = None
) -> GeneratedResponse:
"""Generate response with citations."""
# Format context
context_text = self._format_context(context)
# Format conversation history
history_text = ""
if conversation_history:
history_text = "\n\nPrevious conversation:\n"
for msg in conversation_history[-5:]: # Last 5 messages
history_text += f"{msg['role'].title()}: {msg['content']}\n"
# Build system prompt
system_prompt = self._build_system_prompt(
ticket_classification,
customer_context
)
# Generate response
prompt = f"""Customer Query: {query}
{history_text}
Relevant Knowledge Base Information:
{context_text}
Instructions:
1. Answer the customer's question using ONLY the provided knowledge base information
2. If the information is insufficient, acknowledge what you can help with and what needs escalation
3. Be empathetic and professional
4. Include specific steps or instructions when applicable
5. End with an offer to help further
Format your response as JSON:
{{
"response": "Your response to the customer",
"confidence": 0.0-1.0,
"sources_used": [0, 1, 2], // indices of sources used
"suggested_actions": ["action1", "action2"],
"requires_human_review": true|false,
"review_reason": "reason if human review needed"
}}"""
response = self.client.chat.completions.create(
model=settings.llm_model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
],
response_format={"type": "json_object"},
temperature=0.3
)
result = json.loads(response.choices[0].message.content)
# Map source indices to actual sources
used_sources = [
{
"content": context[i].content[:200] + "...",
"metadata": context[i].metadata
}
for i in result.get("sources_used", [])
if i < len(context)
]
return GeneratedResponse(
response=result["response"],
confidence=result["confidence"],
sources=used_sources,
suggested_actions=result.get("suggested_actions", []),
requires_human_review=result.get("requires_human_review", False),
review_reason=result.get("review_reason")
)
def _format_context(self, context: List[SearchResult]) -> str:
"""Format context for prompt."""
formatted = []
for i, result in enumerate(context):
source_type = result.metadata.get("content_type", "unknown")
formatted.append(f"[Source {i}] ({source_type})\n{result.content}\n")
return "\n".join(formatted)
def _build_system_prompt(
self,
classification: Dict[str, Any],
customer: Dict[str, Any]
) -> str:
"""Build context-aware system prompt."""
base_prompt = """You are a helpful customer support agent. Your role is to:
- Provide accurate, helpful responses based on the knowledge base
- Be empathetic and professional
- Acknowledge when you don't have enough information
- Suggest escalation when appropriate"""
# Adjust tone based on sentiment
if classification:
sentiment = classification.get("sentiment", "neutral")
if sentiment == "frustrated":
base_prompt += "\n\nThe customer appears frustrated. Be extra empathetic, acknowledge their frustration, and focus on resolution."
elif sentiment == "negative":
base_prompt += "\n\nThe customer seems unhappy. Show understanding and work towards a positive resolution."
# Adjust for customer tier
if customer:
account_type = customer.get("account_type", "standard")
if account_type in ["enterprise", "vip"]:
base_prompt += f"\n\nThis is a {account_type.upper()} customer. Prioritize their issue and offer premium support options if needed."
return base_prompt
import json  # used by generate() above; in a real module this belongs with the other imports at the top of the file
Understanding Context-Aware Generation:
The system adapts its tone and approach based on classification:
┌─────────────────────────────────────────────────────────────┐
│ TONE ADAPTATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ Sentiment: "frustrated" │
│ ─────────────────────── │
│ System prompt adds: │
│ "Be extra empathetic, acknowledge their frustration, │
│ and focus on resolution." │
│ │
│ Customer tier: "enterprise" │
│ ─────────────────────────── │
│ System prompt adds: │
│ "This is an ENTERPRISE customer. Prioritize their issue │
│ and offer premium support options if needed." │
└─────────────────────────────────────────────────────────────┘
Source Attribution Pattern:
[Source 0] (faq)
Q: How do I get a refund?
A: Contact billing@company.com...
[Source 1] (ticket_resolution)
Issue: Double charged...
Resolution: Processed refund via...
The LLM is asked to cite sources using [0], [1] indices, creating verifiable responses that agents can audit.
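A minimal sketch of how those citation markers could be checked against the retrieved sources after generation (the regex and field names here are illustrative, not part of the pipeline above):

```python
import re

def extract_citations(response: str, sources: list) -> list:
    """Map [N] markers in a generated response back to retrieved source records."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", response)}
    # Drop hallucinated indices that point outside the retrieved context
    return [sources[i] for i in sorted(cited) if i < len(sources)]

reply = "Refunds go through billing [0]; a past ticket was resolved the same way [1]."
sources = [{"type": "faq", "id": "faq-12"}, {"type": "ticket_resolution", "id": "tkt-9"}]
print(extract_citations(reply, sources))
```

A check like this also gives you a cheap audit signal: responses that cite no valid sources are good candidates for forced human review.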
Step 6: Intelligent Routing
# src/routing/router.py
from typing import Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum
from ..config import settings
from ..classification.classifier import TicketClassification, TicketPriority, TicketCategory
from ..generation.rag_pipeline import GeneratedResponse
class RoutingDecision(Enum):
AUTO_RESPOND = "auto_respond"
AGENT_DRAFT = "agent_draft"
ESCALATE_TIER1 = "escalate_tier1"
ESCALATE_TIER2 = "escalate_tier2"
ESCALATE_SPECIALIST = "escalate_specialist"
@dataclass
class RoutingResult:
decision: RoutingDecision
assigned_queue: str
response: Optional[str]
confidence: float
reasoning: str
metadata: Dict[str, Any]
class TicketRouter:
"""Route tickets based on classification and response confidence."""
def __init__(self):
self.auto_response_threshold = settings.auto_response_confidence
self.escalation_threshold = settings.escalation_threshold
# Queue mappings
self.specialist_queues = {
TicketCategory.BILLING: "billing_specialists",
TicketCategory.TECHNICAL: "technical_support",
TicketCategory.ACCOUNT: "account_managers",
TicketCategory.BUG_REPORT: "engineering_triage",
TicketCategory.FEATURE_REQUEST: "product_feedback"
}
def route(
self,
classification: TicketClassification,
generated_response: GeneratedResponse,
customer_context: Optional[Dict[str, Any]] = None
) -> RoutingResult:
"""Determine routing for a ticket."""
# Check for forced escalation
if classification.requires_escalation:
return self._create_escalation(
classification,
reason=classification.escalation_reason or "Classification flagged for escalation"
)
# Check response confidence
confidence = generated_response.confidence
# Adjust thresholds for VIP customers
auto_threshold = self.auto_response_threshold
if customer_context and customer_context.get("account_type") in ["enterprise", "vip"]:
auto_threshold += 0.1 # Higher bar for auto-response to VIPs
# Check if human review was flagged
if generated_response.requires_human_review:
return RoutingResult(
decision=RoutingDecision.AGENT_DRAFT,
assigned_queue=self._get_queue(classification.category),
response=generated_response.response,
confidence=confidence,
reasoning=f"Response generated but flagged for review: {generated_response.review_reason}",
metadata={
"draft_response": generated_response.response,
"sources": generated_response.sources
}
)
# High confidence - auto respond
if confidence >= auto_threshold:
return RoutingResult(
decision=RoutingDecision.AUTO_RESPOND,
assigned_queue="auto_resolved",
response=generated_response.response,
confidence=confidence,
reasoning=f"High confidence ({confidence:.2f}) response with verified sources",
metadata={
"sources": generated_response.sources,
"auto_responded": True
}
)
# Medium confidence - agent draft
if confidence >= self.escalation_threshold:
return RoutingResult(
decision=RoutingDecision.AGENT_DRAFT,
assigned_queue=self._get_queue(classification.category),
response=generated_response.response,
confidence=confidence,
reasoning=f"Medium confidence ({confidence:.2f}) - providing draft for agent review",
metadata={
"draft_response": generated_response.response,
"sources": generated_response.sources,
"suggested_actions": generated_response.suggested_actions
}
)
# Low confidence - escalate
return self._create_escalation(
classification,
reason=f"Low confidence ({confidence:.2f}) - insufficient knowledge base coverage"
)
def _create_escalation(
self,
classification: TicketClassification,
reason: str
) -> RoutingResult:
"""Create escalation routing."""
# Determine escalation level based on priority
if classification.priority == TicketPriority.URGENT:
decision = RoutingDecision.ESCALATE_TIER2
queue = "urgent_escalations"
elif classification.priority == TicketPriority.HIGH:
decision = RoutingDecision.ESCALATE_SPECIALIST
queue = self.specialist_queues.get(
classification.category,
"general_escalations"
)
else:
decision = RoutingDecision.ESCALATE_TIER1
queue = self._get_queue(classification.category)
return RoutingResult(
decision=decision,
assigned_queue=queue,
response=None,
confidence=0.0,
reasoning=reason,
metadata={
"classification": {
"category": classification.category.value,
"priority": classification.priority.value,
"sentiment": classification.sentiment
}
}
)
def _get_queue(self, category: TicketCategory) -> str:
"""Get queue for category."""
return self.specialist_queues.get(category, "general_support")
Understanding the Routing Decision Matrix:
┌─────────────────────────────────────────────────────────────┐
│ CONFIDENCE-BASED ROUTING │
├─────────────────────────────────────────────────────────────┤
│ │
│ Confidence Decision Action │
│ ────────── ──────── ────── │
│ ≥ 0.85 AUTO_RESPOND Send response directly │
│ to customer │
│ │
│ 0.40 - 0.84 AGENT_DRAFT Provide draft response │
│ for agent to approve │
│ │
│ < 0.40 ESCALATE Route to specialist │
│ queue without draft │
│ │
│ VIP Customer +0.10 threshold Higher bar for auto │
│ (require 0.95) │
└─────────────────────────────────────────────────────────────┘
Why This Works:
- Auto-respond zone (≥0.85): System is confident enough to trust the answer
- Agent draft zone (0.40-0.84): Good attempt but needs human verification
- Escalation zone (below 0.4): Not enough knowledge - don't guess, escalate
Escalation Priority Mapping:
| Priority | Escalation Level | Queue |
|---|---|---|
| URGENT | Tier 2 | urgent_escalations |
| HIGH | Specialist | Category-specific queue |
| MEDIUM/LOW | Tier 1 | General support |
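The confidence matrix above condenses to a pure function. This is a sketch, with the illustrative thresholds (0.85 / 0.40) hard-coded as defaults rather than read from settings:

```python
def route_decision(confidence: float, is_vip: bool = False,
                   auto_threshold: float = 0.85,
                   escalation_threshold: float = 0.40) -> str:
    """Condensed form of the confidence-based routing matrix (illustrative)."""
    if is_vip:
        auto_threshold += 0.10  # raise the bar before auto-responding to VIPs
    if confidence >= auto_threshold:
        return "auto_respond"
    if confidence >= escalation_threshold:
        return "agent_draft"
    return "escalate"

print(route_decision(0.90))               # confident enough to send directly
print(route_decision(0.90, is_vip=True))  # same score, but the VIP bar is 0.95
print(route_decision(0.30))
```

Keeping the decision logic this small makes it trivial to unit-test every boundary before wiring it into the full `TicketRouter`.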
Step 7: FastAPI Application
# src/api/main.py
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import List, Dict, Any, Optional
from ..config import settings
from ..retrieval.vector_store import VectorStore
from ..retrieval.hybrid_search import HybridSearch
from ..retrieval.reranker import Reranker
from ..classification.classifier import TicketClassifier
from ..generation.rag_pipeline import RAGPipeline
from ..routing.router import TicketRouter
app = FastAPI(
title="Enterprise Support System",
description="AI-powered customer support with RAG",
version="1.0.0"
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],  # wide open for demo purposes; restrict in production
allow_methods=["*"],
allow_headers=["*"]
)
# Initialize components
vector_store = VectorStore()
hybrid_search = HybridSearch(vector_store)
reranker = Reranker()
classifier = TicketClassifier()
rag_pipeline = RAGPipeline()
router = TicketRouter()
class TicketRequest(BaseModel):
content: str
customer_id: Optional[str] = None
conversation_history: Optional[List[Dict[str, str]]] = None
class TicketResponse(BaseModel):
ticket_id: str
classification: Dict[str, Any]
routing: Dict[str, Any]
response: Optional[str]
sources: List[Dict[str, Any]]
confidence: float
class SearchRequest(BaseModel):
query: str
top_k: int = 5
content_types: Optional[List[str]] = None
category: Optional[str] = None
@app.post("/api/tickets/process", response_model=TicketResponse)
async def process_ticket(request: TicketRequest):
"""Process a support ticket through the full pipeline."""
import uuid
ticket_id = str(uuid.uuid4())
# Get customer context (mock - replace with actual CRM lookup)
customer_context = await get_customer_context(request.customer_id)
# Step 1: Classify ticket
classification = classifier.classify(
ticket_content=request.content,
customer_context=customer_context
)
# Step 2: Retrieve relevant context
search_results = hybrid_search.search(
query=request.content,
top_k=settings.retrieval_top_k
)
# Step 3: Rerank results
reranked = reranker.rerank(
query=request.content,
results=search_results,
top_k=settings.rerank_top_k
)
# Step 4: Generate response
generated = rag_pipeline.generate(
query=request.content,
context=reranked,
ticket_classification={
"category": classification.category.value,
"priority": classification.priority.value,
"sentiment": classification.sentiment
},
customer_context=customer_context,
conversation_history=request.conversation_history
)
# Step 5: Route ticket
routing = router.route(
classification=classification,
generated_response=generated,
customer_context=customer_context
)
return TicketResponse(
ticket_id=ticket_id,
classification={
"category": classification.category.value,
"priority": classification.priority.value,
"sentiment": classification.sentiment,
"confidence": classification.confidence,
"tags": classification.suggested_tags
},
routing={
"decision": routing.decision.value,
"queue": routing.assigned_queue,
"reasoning": routing.reasoning
},
response=routing.response,
sources=generated.sources,
confidence=generated.confidence
)
@app.post("/api/search")
async def search_knowledge(request: SearchRequest):
"""Search the knowledge base."""
results = hybrid_search.search(
query=request.query,
top_k=request.top_k,
content_types=request.content_types,
category=request.category
)
reranked = reranker.rerank(
query=request.query,
results=results,
top_k=request.top_k
)
return {
"results": [
{
"content": r.content,
"metadata": r.metadata,
"score": r.hybrid_score
}
for r in reranked
]
}
@app.get("/api/health")
async def health_check():
"""Health check endpoint."""
return {"status": "healthy"}
async def get_customer_context(customer_id: Optional[str]) -> Dict[str, Any]:
"""Get customer context from CRM (mock implementation)."""
if not customer_id:
return {}
# Replace with actual CRM integration
return {
"customer_id": customer_id,
"account_type": "standard",
"customer_since": "2023-01-15",
"recent_tickets": 2,
"account_value": "medium"
}
Step 8: Docker Deployment
# docker-compose.yml
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- QDRANT_URL=http://qdrant:6333
- REDIS_URL=redis://redis:6379
depends_on:
- qdrant
- redis
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/health')"]
interval: 30s
timeout: 10s
retries: 3
qdrant:
image: qdrant/qdrant:latest
ports:
- "6333:6333"
volumes:
- qdrant_data:/qdrant/storage
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
volumes:
qdrant_data:
redis_data:
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy source code
COPY src/ ./src/
# Run the application
CMD ["uvicorn", "src.api.main:app", "--host", "0.0.0.0", "--port", "8000"]
# requirements.txt
fastapi==0.109.0
uvicorn==0.27.0
pydantic==2.5.3
pydantic-settings==2.1.0
openai==1.10.0
qdrant-client==1.7.0
sentence-transformers==2.2.2
rank-bm25==0.2.2
redis==5.0.1
aiohttp==3.9.1
numpy==1.26.3
Usage Example
import requests
# Process a ticket
response = requests.post(
"http://localhost:8000/api/tickets/process",
json={
"content": "I've been charged twice for my subscription this month. Can you help me get a refund?",
"customer_id": "cust_12345",
"conversation_history": []
}
)
result = response.json()
print(f"Ticket ID: {result['ticket_id']}")
print(f"Category: {result['classification']['category']}")
print(f"Priority: {result['classification']['priority']}")
print(f"Routing: {result['routing']['decision']}")
print(f"Response: {result['response']}")
print(f"Confidence: {result['confidence']}")
Business Impact
| Metric | Improvement |
|---|---|
| First Response Time | 90% faster |
| Ticket Resolution | 40% automated |
| Agent Productivity | 2x increase |
| Customer Satisfaction | +15 NPS points |
| Support Costs | 60% reduction |
Next Steps
- Add feedback loop - Collect agent corrections to improve responses
- Implement caching - Cache frequent queries with Redis
- Add analytics - Track resolution rates, response quality
- Multi-language - Add translation for international support
- Integration - Connect with Zendesk, Intercom, Freshdesk
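The caching item above could be sketched like this, keying on a hash of the normalized query text. The cache object only needs `get(key)` and `setex(key, ttl, value)`, which the redis-py client provides; the key scheme and one-hour TTL are assumptions, not part of the system above:

```python
import hashlib
import json

def cached_search(query, search_fn, cache, ttl_seconds=3600):
    """Serve repeated queries from cache, keyed on a hash of the normalized text.

    `cache` needs get(key) and setex(key, ttl, value) -- redis.Redis satisfies
    this, and any compatible object works in tests.
    """
    key = "search:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: skip retrieval entirely
    results = search_fn(query)
    cache.setex(key, ttl_seconds, json.dumps(results))
    return results
```

Injecting the cache client rather than constructing it inside the function keeps the search path testable without a running Redis instance.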
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Content-Aware Chunking | Different chunking strategies for FAQs, docs, tickets | Preserves semantic units (Q&A pairs, header context) |
| Hybrid Search | Vector + BM25 with RRF fusion | Catches both semantic matches and exact terms |
| Cross-Encoder Reranking | Second-pass precision ranking | ~10% relevance improvement over bi-encoder alone |
| Pre-RAG Classification | Classify tickets before retrieval | Enables filtered search, catches escalations early |
| Confidence Routing | Route based on RAG confidence score | Auto-respond when confident, escalate when unsure |
| Tone Adaptation | Adjust prompts based on sentiment/tier | Frustrated customers get empathy, VIPs get priority |
| Source Attribution | LLM cites sources with indices | Auditable responses, builds trust |
| VIP Threshold Boost | Higher confidence bar for premium customers | Protect high-value relationships from bad auto-responses |
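The RRF fusion named in the Hybrid Search row reduces to a few lines. A sketch using the conventional k=60 smoothing constant (document IDs are made up for illustration):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum of 1/(k + rank) over all rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["faq-12", "doc-3", "tkt-9"]   # semantic ranking
bm25_hits = ["tkt-9", "faq-12", "doc-7"]     # keyword ranking
print(rrf_fuse([vector_hits, bm25_hits]))
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between the vector and BM25 result lists, which is exactly why it is a common fusion choice for hybrid search.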
Prerequisites
Before starting this case study, complete: