Live Clinical Orchestration Simulator
Build a multi-agent clinical orchestrator — planner, parallel specialists, a reflection loop, and a human gate — and watch it run step by step
TL;DR
Routing picks one model. Orchestration coordinates many agents to solve one problem together. This project builds a clinical orchestrator that plans the work, runs specialists in parallel on a shared blackboard, uses a critic to catch what no single agent saw, and pauses for a human before acting — and you can watch a real run replay step by step in your browser.
| | |
|---|---|
| Difficulty | Advanced |
| Time | ~4–5 days |
| Code Size | ~700 LOC |
| Prerequisites | Multi-Agent System, LLM Router |
Why Routing Isn't Enough
In the LLM Router project, one model read a request and chose where to send it. That is perfect for triage, but it falls apart when a single question needs several kinds of expertise at once.
Think about reviewing an elderly patient's medications. You need a drug-interaction check, a kidney-dosing check, and a falls-risk check — and crucially, you need someone to look at all of those findings together, because the most dangerous problems hide between specialties, not inside one. No single model call does this well.
That is what orchestration is for: a coordinator runs many agents, shares their work, reviews it, and only then acts.
The Case
Our orchestrator reviews one realistic patient:
78-year-old woman. eGFR 38 (chronic kidney disease, stage 3). Recurrent falls. Taking 9 medications: ramipril, furosemide, ibuprofen (as needed), metformin, atorvastatin, amlodipine, omeprazole, zopiclone, amitriptyline.
The goal: produce a safe deprescribing plan — but never act on it without a clinician's sign-off.
Five Ideas That Make It Orchestration
These five concepts are the whole project. Each maps to a concrete LangGraph feature you will build below.
| Concept | What it does | How we build it |
|---|---|---|
| Dynamic planner | Reads the case and decides which specialists are needed | A planner node |
| Parallel fan-out | Runs specialists at the same time, not one by one | Send() to many nodes |
| Blackboard | One shared memory all agents read from and write to | The graph State with a reducer |
| Reflection (critic) | Reviews all findings together; sends work back if something's wrong | A conditional edge that loops |
| Human-in-the-loop | Pauses for a person before anything is acted on | interrupt() + checkpointer |
Clinical Orchestrator Architecture
[Architecture diagram. Stages: Coordinate, Parallel specialists (blackboard), Reflect, Gate, Output.]
Watch It Run
Press Play (or Step through it). Click any agent to read its full reasoning. Watch what happens at step 5 — that is the moment orchestration earns its keep.
[Interactive replay: Polypharmacy review, 7 steps. Step 1 (Intake): the Planner reads the case: 78 y/o, 9 medications, CKD stage 3, recurrent falls.]
This replay is deterministic — it steps through a real run that the LangGraph code below produced. Building the simulator into the page (instead of calling a live model) keeps it free, offline, and identical every time, which is what you want for teaching.
The "Triple Whammy" — Why the Critic Matters
At step 5 the Critic caught something none of the specialists flagged on their own: the patient is on an ACE inhibitor (ramipril) + a diuretic (furosemide) + an NSAID (ibuprofen) at the same time. This combination is so well known for causing acute kidney injury (AKI) that it has a name — the "triple whammy" — and it is especially dangerous in older patients with reduced kidney function, exactly like ours.
The reason a single agent missed it is instructive:
- Pharmacology looked at drug–drug interactions, but in isolation the NSAID looks like a minor issue.
- Renal confirmed the ACE inhibitor was fine on its own.
- Geriatrics was focused on falls.
The risk only appears when you look at all three findings together — which is precisely the Critic's job. This is the core lesson: a reflection step that reviews the combined output catches cross-cutting errors that no individual specialist can.
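A rule-based version of this cross-check can be sketched in a few lines. The drug-class table and the `find_triple_whammy` helper are illustrative assumptions, not part of the project's code; a real system would rely on a maintained drug database rather than a hand-written dictionary.

```python
# Hypothetical lookup covering only the three drug classes relevant here.
DRUG_CLASS = {
    "ramipril": "ace_inhibitor",
    "furosemide": "diuretic",
    "ibuprofen": "nsaid",
}

TRIPLE_WHAMMY = {"ace_inhibitor", "diuretic", "nsaid"}


def find_triple_whammy(medications: list[str]) -> list[str]:
    """Return the drugs forming the combination, or [] if it is incomplete."""
    involved = [m for m in medications if DRUG_CLASS.get(m) in TRIPLE_WHAMMY]
    classes_present = {DRUG_CLASS[m] for m in involved}
    return involved if classes_present == TRIPLE_WHAMMY else []


meds = ["ramipril", "furosemide", "ibuprofen", "metformin", "atorvastatin",
        "amlodipine", "omeprazole", "zopiclone", "amitriptyline"]

print(find_triple_whammy(meds))      # all three classes are present
print(find_triple_whammy(meds[1:]))  # no ACE inhibitor, so nothing is flagged
```

The point of the sketch is that the check needs the whole medication list at once: each drug on its own passes, and only the set of classes triggers the flag.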
Build It: Step by Step
We use LangGraph because it gives us the four features we need out of the box: shared state, parallel fan-out, conditional loops, and human-in-the-loop pauses.
1. The Blackboard (shared state)
Every agent reads from and writes to one shared state object. The findings list uses a reducer (operator.add) so parallel specialists can append to it without overwriting each other.
```python
import operator
from typing import Annotated, Literal
from typing_extensions import TypedDict


class CaseState(TypedDict):
    case: str                                      # the patient case
    specialists: list[str]                         # planner fills this in
    findings: Annotated[list[dict], operator.add]  # blackboard, append-only
    critic_feedback: str
    revision_round: int
    plan: str
    approved: bool
```
The Annotated[..., operator.add] part is the key detail. A plain key holds a single value per step, so without the reducer three specialists writing findings in the same step would conflict rather than combine. With it, their results are merged (appended) into one list.
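The effect of the reducer can be imitated in plain Python without running a graph. This is a sketch of the idea only: LangGraph's real update machinery is more involved, and the placeholder findings are made up for illustration.

```python
import operator
from functools import reduce

# Each specialist returns a one-item list for the "findings" key.
updates = [
    [{"role": "pharmacology", "finding": "placeholder"}],
    [{"role": "renal", "finding": "placeholder"}],
    [{"role": "geriatrics", "finding": "placeholder"}],
]

# Folding the updates with operator.add concatenates the lists,
# starting from an empty blackboard. The order is simply the list order above.
findings = reduce(operator.add, updates, [])

print([f["role"] for f in findings])
```

Because `operator.add` on two lists is concatenation, every specialist's entry ends up on the blackboard instead of one replacing another.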
2. The Planner
The planner reads the case and decides which specialists are needed. Here it is dynamic — the LLM could choose different specialists for a different patient.
```python
def planner(state: CaseState) -> dict:
    specialists = decide_specialists(state["case"])  # LLM → ["pharmacology", "renal", "geriatrics"]
    # Preserve the revision counter so the critic loop can't run forever.
    return {"specialists": specialists, "revision_round": state.get("revision_round", 0)}
```
3. Parallel Fan-Out
This is the heart of orchestration. Instead of a normal edge, we use a conditional edge that returns a list of Send() objects — one per specialist. LangGraph runs them in parallel.
```python
from langgraph.types import Send


def assign_specialists(state: CaseState):
    # One parallel task per specialist (a "map" step).
    return [
        Send("specialist", {
            "case": state["case"],
            "role": role,
            "feedback": state.get("critic_feedback", ""),
        })
        for role in state["specialists"]
    ]
```
A single, reusable specialist node handles whichever role it is given:
```python
def specialist(payload: dict) -> dict:
    role = payload["role"]
    finding = run_specialist(role, payload["case"], payload["feedback"])
    # Appends to the blackboard thanks to the operator.add reducer.
    return {"findings": [{"role": role, "finding": finding}]}
```
4. The Critic (reflection loop)
The critic reviews all findings together. If it spots a problem (like the triple whammy), it writes feedback and asks for a revision; otherwise it drafts the plan.
```python
def critic(state: CaseState) -> dict:
    issues = review_for_conflicts(state["findings"])  # looks across ALL specialists
    if issues and state["revision_round"] < 1:        # allow one revision round
        return {
            "critic_feedback": issues,
            "revision_round": state["revision_round"] + 1,
        }
    return {"critic_feedback": "", "plan": draft_plan(state["findings"])}
```
A routing function decides whether to loop back or move on:
```python
def after_critic(state: CaseState) -> Literal["planner", "human_gate"]:
    return "planner" if state["critic_feedback"] else "human_gate"
```
One detail to know: because findings is append-only (the operator.add reducer), a revision round adds new findings on top of the old ones rather than replacing them. That keeps the demo simple, but in production you would tag each finding with its round number, or clear findings before re-running, so the critic always compares like with like.
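One way to do that round tagging, sketched in plain Python. The `round` field and the `latest_round` helper are assumptions for illustration and do not appear in the project's code.

```python
def latest_round(findings: list[dict]) -> list[dict]:
    """Keep only the findings produced in the most recent revision round."""
    if not findings:
        return []
    newest = max(f["round"] for f in findings)
    return [f for f in findings if f["round"] == newest]


# An append-only blackboard after one revision: rounds 0 and 1 coexist.
blackboard = [
    {"role": "pharmacology", "round": 0, "finding": "initial"},
    {"role": "renal", "round": 0, "finding": "initial"},
    {"role": "pharmacology", "round": 1, "finding": "revised"},
    {"role": "renal", "round": 1, "finding": "revised"},
]

current = latest_round(blackboard)
print([(f["role"], f["finding"]) for f in current])
```

To use something like this, the specialist would need the round number in its payload so it can stamp each finding, and the critic would filter the blackboard before reviewing it. The older rounds stay on the blackboard, which also keeps them available for the audit trail.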
The revision_round counter is what stops this loop from running forever: the critic only asks for one revision (revision_round < 1), then drafts the plan.
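The cap is easy to check by hand. The sketch below restates the critic with stand-in helpers and calls it twice, the way the graph would across one revision. Here `review_for_conflicts` and `draft_plan` are hypothetical stubs (one always reports the same issue, the other returns fixed text), and `state.update(...)` only approximates how the graph merges a node's output into the state.

```python
def review_for_conflicts(findings):
    # Stub: always reports a problem, to exercise the worst case.
    return "ACE inhibitor + diuretic + NSAID taken together"


def draft_plan(findings):
    # Stub: fixed text instead of an LLM call.
    return "draft plan"


def critic(state: dict) -> dict:
    issues = review_for_conflicts(state["findings"])
    if issues and state["revision_round"] < 1:
        return {"critic_feedback": issues,
                "revision_round": state["revision_round"] + 1}
    return {"critic_feedback": "", "plan": draft_plan(state["findings"])}


state = {"findings": [], "revision_round": 0}

first = critic(state)    # first pass: feedback is set, counter goes to 1
state.update(first)
second = critic(state)   # second pass: cap reached, a plan is drafted anyway
state.update(second)

print(first["revision_round"], bool(first["critic_feedback"]))
print(second["critic_feedback"] == "", second["plan"])
```

Even with a critic that never stops complaining, the second pass clears the feedback, so the router sends the run on to the human gate.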
5. The Human Gate
Before anything is acted on, execution pauses for a clinician. interrupt() saves the state and hands control back to your application.
```python
from langgraph.types import interrupt, Command


def human_gate(state: CaseState) -> dict:
    decision = interrupt({
        "question": "Approve this deprescribing plan?",
        "plan": state["plan"],
    })
    return {"approved": bool(decision)}
```
6. Wire the Graph Together
```python
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import InMemorySaver

builder = StateGraph(CaseState)
builder.add_node("planner", planner)
builder.add_node("specialist", specialist)
builder.add_node("critic", critic)
builder.add_node("human_gate", human_gate)

builder.add_edge(START, "planner")
builder.add_conditional_edges("planner", assign_specialists, ["specialist"])
builder.add_edge("specialist", "critic")
builder.add_conditional_edges("critic", after_critic, ["planner", "human_gate"])
builder.add_edge("human_gate", END)

# A checkpointer is required for interrupt()/resume to work.
graph = builder.compile(checkpointer=InMemorySaver())
```
7. Run It (with the human pause)
```python
config = {"configurable": {"thread_id": "patient-001"}}

# CASE is the patient description string from "The Case" above.
# Runs planner → specialists (parallel) → critic (loops once) → pauses at human_gate.
result = graph.invoke({"case": CASE}, config)
print(result["__interrupt__"])  # the approval request + drafted plan

# Clinician approves → resume from exactly where it paused.
final = graph.invoke(Command(resume=True), config)
print(final["plan"], final["approved"])
```
Observability: The Replay Is Your Audit Log
The simulator above is really a trace of a run. In production you want the same thing: a record of every agent's input, output, and the order things happened. That trace is what lets you debug, explain a decision to a clinician, and improve the system over time. Tools like LangSmith capture it automatically; at minimum, log every node's (role, input, output) to your own store.
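A minimal version of the "log every node" advice is a wrapper around each node function. The `traced` decorator and the in-memory `TRACE` list below are illustrative assumptions; a real deployment would write to durable storage instead.

```python
import functools
import time

TRACE: list[dict] = []  # stand-in for a real audit store


def traced(role: str):
    """Record (role, input, output) plus a timestamp for every call to a node."""
    def decorator(node):
        @functools.wraps(node)
        def wrapper(state):
            snapshot = dict(state)  # shallow copy taken before the node runs
            output = node(state)
            TRACE.append({
                "role": role,
                "input": snapshot,
                "output": output,
                "at": time.time(),
            })
            return output
        return wrapper
    return decorator


@traced("planner")
def planner(state: dict) -> dict:
    # Stub planner: a fixed list instead of an LLM decision.
    return {"specialists": ["pharmacology", "renal", "geriatrics"]}


planner({"case": "78-year-old woman, eGFR 38, recurrent falls"})
print(TRACE[0]["role"], TRACE[0]["output"]["specialists"])
```

Because entries are appended in call order, reading `TRACE` from top to bottom gives the same kind of step-by-step record the replay shows.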
When to Use Orchestration — and When Not To
Orchestration is powerful, but it is not free. Every extra agent adds tokens, latency, and a new way for things to go wrong. The research is blunt: most multi-agent failures come from poor design, not weak models. So the rule is: start simple, and add structure only when a simpler design clearly can't cope.
Pick the simplest design that works
One LLM call
The task fits in a single prompt with no branching. Always try this first. Cheapest, fastest, easiest to debug.
A workflow (fixed path)
The steps are known in advance — e.g. classify, then route, then answer. Use a router or a fixed chain. Predictable and still easy to trace.
Multi-agent orchestration
The task genuinely needs several kinds of expertise at once, the sub-tasks aren't known up front, or one context window can't hold it all — like our polypharmacy review. Worth the overhead only here.
A quick gut-check before reaching for orchestration: would a human need more than one specialist for this? If not, a single agent is almost always the better engineering choice.
Failure Modes & How This Design Avoids Them
Multi-agent systems fail in predictable ways. A 2025 study — the Multi-Agent System Failure Taxonomy (MAST) — catalogued 14 failure modes across three buckets: system design, inter-agent misalignment, and task verification. The good news: the patterns in this project are direct countermeasures to the most common ones.
| Common failure (MAST category) | What goes wrong | How this design prevents it |
|---|---|---|
| No / weak verification (task verification) | Nobody checks the combined result, so cross-cutting errors slip through | The Critic reviews all findings together — that's how it caught the triple whammy |
| Step repetition / infinite loops (system design) | Agents redo work and never finish | The revision_round cap allows at most one revision, then forces a decision |
| Role / task ambiguity (inter-agent misalignment) | An agent drifts outside its job | Each specialist gets its own Send payload with one clear role |
| Information loss between agents (inter-agent misalignment) | Context gets dropped on handoff | The blackboard (shared state) keeps every finding in one place |
| Acting without confirmation (system design) | The system takes an unsafe action on its own | The human-in-the-loop gate pauses before anything is acted on |
Production hardening (not shown in the demo). A real run must survive a specialist that times out, errors, or returns garbage. Add a per-agent timeout, a small number of retries, and a partial-failure policy (proceed with the specialists that succeeded, and tell the Critic which ones are missing) so one slow agent can't stall the whole review.
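The hardening policy can be sketched with the standard library alone. Everything here is an assumption for illustration: `call_with_retries`, `run_specialists`, and `fake_specialist` are hypothetical names, the timings are arbitrary, and the sketch uses one shared deadline for the whole batch (all agents start together) rather than a separate timer per agent.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retries(fn, role, retries):
    """Call fn(role), retrying on any exception up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return fn(role)
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error


def run_specialists(fn, roles, timeout_s=0.3, retries=1):
    """Run all roles in parallel; return (findings, missing_roles)."""
    findings, missing = [], []
    pool = ThreadPoolExecutor(max_workers=len(roles))
    futures = {r: pool.submit(call_with_retries, fn, r, retries) for r in roles}
    deadline = time.monotonic() + timeout_s  # shared deadline for the batch
    for role, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            result = future.result(timeout=remaining)
            findings.append({"role": role, "finding": result})
        except Exception:  # timed out, or still failing after retries
            missing.append(role)
    # Do not block on stragglers. Python cannot kill a running thread,
    # so a slow call keeps running in the background until it returns.
    pool.shutdown(wait=False, cancel_futures=True)
    return findings, missing


attempts = {"renal": 0}


def fake_specialist(role):
    if role == "geriatrics":
        time.sleep(1.0)  # too slow: misses the deadline
    if role == "renal":
        attempts["renal"] += 1
        if attempts["renal"] == 1:
            raise RuntimeError("transient failure")  # succeeds on the retry
    return f"{role} finding"


findings, missing = run_specialists(
    fake_specialist, ["pharmacology", "renal", "geriatrics"])
print([f["role"] for f in findings], missing)
```

The `missing` list is what you would hand to the Critic, so it knows the review went ahead without a geriatrics finding rather than assuming there was nothing to report.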
A Note on Safety
This orchestrator proposes a plan — it never changes a medication on its own. The human gate is not optional decoration; it is the safety boundary. For real clinical use you also need rule-based red-flag checks, input sanitization, and audit logging. Build those properly with the techniques in Agent Security & Safe Deployment.
Key Concepts Recap
| Concept | What it is | Why it matters |
|---|---|---|
| Orchestration | Coordinating many agents to solve one problem | Handles tasks that need several kinds of expertise at once |
| Blackboard | One shared state with a reducer | Parallel agents merge their work instead of overwriting it |
| Parallel fan-out | Send() to many nodes | Specialists run at the same time — faster than one by one |
| Reflection / critic | Reviews combined findings, can loop back | Catches cross-cutting errors no single agent sees |
| Human-in-the-loop | interrupt() pause for approval | A person signs off before anything is acted on |
Next Steps
- Orchestration Patterns — the big-picture map of all orchestration patterns and when to use each.
- Tumor Board Simulator — multi-specialist debate (adversarial), a different orchestration shape.
- Clinical Decision Support — a full case study with safety guardrails.
- Adverse Event Surveillance — supervisor-worker orchestration over live patient data.
- Agent Security & Safe Deployment — make the human gate and red-flag checks production-grade.