Shipping a multi-agent architecture to production

Most multi-agent systems impress in a demo and disappoint in production. The model is rarely the problem. What breaks is everything around it.

The demo trap

A demo optimizes for the ideal case: a well-formed question, clean context, one user at a time. Production looks nothing like that.

One case we lived through: an internal support agent answering HR questions on top of a document search tool. Flawless in the demo. Three weeks after going live, the search tool started answering in eight seconds instead of 300 ms. The agent saw nothing wrong with that: it waited, retried, and tried again. The result was 45-second answers and an API bill that tripled. Nobody had tested a slow tool. We had tested a tool that works and a tool that fails, but never a tool that crawls.

An agent that can’t fail cleanly isn’t ready for production.

That is the trap: ambiguous input and degraded tools cause more incidents than the model ever does.

Three decisions that matter

Make every step observable

Trace everything: tool calls, routing decisions, tokens spent, latency per step. In practice, every run carries an identifier, every step emits a span, and you can replay the full thread of a conversation that went wrong. The day someone asks “why did the agent answer that on Tuesday at 2:07 pm”, you have the answer in two minutes instead of losing the afternoon to it.
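
As a rough illustration of "every run carries an identifier, every step emits a span", here is a minimal in-memory sketch. RunTrace, Span, and the field names are hypothetical and chosen for this example; this is not the OpenTelemetry API, and a real setup would export spans to a tracing backend rather than keep them in an array.

```typescript
// Illustrative in-memory tracer. RunTrace and Span are hypothetical names,
// not the OpenTelemetry API.
interface Span {
  runId: string;
  step: number;        // order in which the step started
  name: string;        // e.g. "route" or "tool:search"
  durationMs: number;
  tokens: number;
  error: string | null;
}

class RunTrace {
  readonly runId: string;
  private readonly now: () => number;
  private readonly spans: Span[] = [];
  private started = 0;

  // The clock is injectable so tests can control time.
  constructor(runId: string, now: () => number = () => Date.now()) {
    this.runId = runId;
    this.now = now;
  }

  // Start a span; call the returned function when the step finishes,
  // whether it succeeded or not.
  startSpan(name: string): (result?: { tokens?: number; error?: string }) => void {
    const startedAt = this.now();
    const step = ++this.started;
    return (result = {}) => {
      this.spans.push({
        runId: this.runId,
        step,
        name,
        durationMs: this.now() - startedAt,
        tokens: result.tokens ?? 0,
        error: result.error ?? null,
      });
    };
  }

  // Finished spans in start order: the thread of the run, ready to inspect.
  replay(): Span[] {
    return [...this.spans].sort((a, b) => a.step - b.step);
  }
}
```

A step wraps its work with `const end = trace.startSpan("tool:search")` and calls `end({ tokens })` when done. With something like this in place, a tool that drifts from 300 ms to eight seconds shows up as a number in the trace rather than as a surprise on the bill.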

Bound autonomy

An explicit budget and timeout beat a prompt that politely asks the agent not to overspend. On our projects that means: twelve steps maximum per run, a cost ceiling per conversation, a five-second timeout per tool call, and human approval for anything irreversible like sending an email or writing to a database. An agent that hits a limit stops and says so. It does not keep going and hope for the best.
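
The limits above can be sketched as plain code. This is a minimal sketch under assumptions: the names (RunBudget, withTimeout, needsApproval), the tool names, and the cost ceiling value are hypothetical; only the twelve-step and five-second figures come from the text.

```typescript
interface Limits {
  maxSteps: number;       // twelve in the text
  maxCostUsd: number;     // ceiling value below is a made-up example
  toolTimeoutMs: number;  // five seconds in the text
}

type Verdict = { ok: true } | { ok: false; reason: string };

class RunBudget {
  private steps = 0;
  private costUsd = 0;
  private readonly limits: Limits;

  constructor(limits: Limits) {
    this.limits = limits;
  }

  // Call before each step. On refusal nothing is charged, and the caller
  // is expected to stop and surface the reason to the user.
  charge(stepCostUsd: number): Verdict {
    if (this.steps + 1 > this.limits.maxSteps) {
      return { ok: false, reason: `step limit of ${this.limits.maxSteps} reached` };
    }
    if (this.costUsd + stepCostUsd > this.limits.maxCostUsd) {
      return { ok: false, reason: `cost ceiling of ${this.limits.maxCostUsd} USD reached` };
    }
    this.steps += 1;
    this.costUsd += stepCostUsd;
    return { ok: true };
  }
}

// Reject if the tool call takes longer than `ms`. This only stops waiting;
// it does not cancel the underlying call.
function withTimeout<T>(call: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`tool call exceeded ${ms} ms`)), ms);
    call.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Hypothetical tool names for irreversible actions.
const IRREVERSIBLE_TOOLS = new Set(["send_email", "db_write"]);

function needsApproval(tool: string): boolean {
  return IRREVERSIBLE_TOOLS.has(tool);
}

const defaultLimits: Limits = { maxSteps: 12, maxCostUsd: 0.5, toolTimeoutMs: 5000 };
```

The point of the design is that the limits live in code the agent cannot argue with: the loop calls `charge` before every step and checks `needsApproval` before every tool call, so the prompt does not have to carry that responsibility.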

Isolate state

Almost every non-reproducible bug I have chased came down to memory shared between agents. Two agents writing to the same conversation history while a third reads a half-updated state: good luck replaying that. The rule we hold ourselves to: each agent gets its context as input and returns a result as output. Anything that must be shared goes through a versioned store, never a mutated object.
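
One way to read "a versioned store, never a mutated object" is optimistic versioning: a write succeeds only if the writer based it on the latest version. The sketch below is an assumption about what such a store could look like, not a description of ours; VersionedStore and Snapshot are hypothetical names, and the freeze is shallow.

```typescript
interface Snapshot<T> {
  version: number;
  value: Readonly<T>;
}

type WriteResult =
  | { ok: true; version: number }
  | { ok: false; currentVersion: number };

class VersionedStore<T> {
  private current: Snapshot<T>;
  private readonly history: Snapshot<T>[] = [];

  constructor(initial: T) {
    // Object.freeze is shallow: nested objects are not protected.
    this.current = { version: 0, value: Object.freeze(initial) };
    this.history.push(this.current);
  }

  // Readers always get a complete snapshot, never a half-updated state.
  read(): Snapshot<T> {
    return this.current;
  }

  // Optimistic write: refused if someone else wrote since `basedOn` was read.
  write(basedOn: number, next: T): WriteResult {
    if (basedOn !== this.current.version) {
      return { ok: false, currentVersion: this.current.version };
    }
    this.current = { version: basedOn + 1, value: Object.freeze(next) };
    this.history.push(this.current);
    return { ok: true, version: this.current.version };
  }

  // Past versions stay available, which is what makes a run replayable.
  at(version: number): Snapshot<T> | undefined {
    return this.history[version];
  }
}

// An agent in this style takes its context in and hands a result back.
type Agent<In, Out> = (context: Readonly<In>) => Promise<Out>;
```

When two agents read version 0 and both try to write, the second write is refused instead of silently overwriting the first, and the agent has to re-read before retrying.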

Our current stack

It will change, but here is what we ship today:

Models: Claude through the Anthropic API for reasoning and orchestration, a smaller, cheaper model for classification and routing.
Orchestration: plain code. A hand-rolled state machine in TypeScript you can read line by line. An explicit switch beats a magic framework on the day it breaks at 3 am.
Tools: MCP to expose internal tools, with one server per business domain rather than one catch-all server.
RAG: Postgres with pgvector. In our experience, a dedicated vector database is not worth it below a few million documents.
Observability: OpenTelemetry for traces, Langfuse for LLM monitoring, Grafana for dashboards.
Evals: a set of scenarios replayed on every deploy, including the cases where a tool answers slowly or not at all. That test set is what would have caught the eight-second story.
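
To make the "explicit switch" and the replayed scenarios concrete, here is a minimal sketch of a state machine written as a pure transition function, with a scenario expressed as a list of events. The states, events, and retry cap are hypothetical and much smaller than a real orchestrator; the model and tool calls themselves are left out.

```typescript
type AgentState =
  | { kind: "routing" }
  | { kind: "calling_tool"; tool: string; attempt: number }
  | { kind: "answering" }
  | { kind: "done"; answer: string }
  | { kind: "failed"; reason: string };

type AgentEvent =
  | { kind: "routed"; tool: string | null }
  | { kind: "tool_ok" }
  | { kind: "tool_timeout" }
  | { kind: "tool_error"; message: string }
  | { kind: "answered"; answer: string };

const MAX_ATTEMPTS = 2; // hypothetical retry cap

// Pure function: same state and event in, same state out.
function transition(state: AgentState, event: AgentEvent): AgentState {
  switch (state.kind) {
    case "routing":
      if (event.kind === "routed") {
        return event.tool === null
          ? { kind: "answering" }
          : { kind: "calling_tool", tool: event.tool, attempt: 1 };
      }
      break;
    case "calling_tool":
      if (event.kind === "tool_ok") return { kind: "answering" };
      if (event.kind === "tool_timeout" || event.kind === "tool_error") {
        // A slow tool is treated like a failed one: retry up to the cap, then stop.
        return state.attempt < MAX_ATTEMPTS
          ? { kind: "calling_tool", tool: state.tool, attempt: state.attempt + 1 }
          : { kind: "failed", reason: `${state.tool} unavailable after ${state.attempt} attempts` };
      }
      break;
    case "answering":
      if (event.kind === "answered") return { kind: "done", answer: event.answer };
      break;
    case "done":
    case "failed":
      return state; // terminal states ignore further events
  }
  return { kind: "failed", reason: `unexpected ${event.kind} in ${state.kind}` };
}

// Replay a scenario from the initial state and return where it ends up.
function replayScenario(events: AgentEvent[]): AgentState {
  let state: AgentState = { kind: "routing" };
  for (const event of events) {
    state = transition(state, event);
  }
  return state;
}
```

A slow-tool scenario is then just `[routed, tool_timeout, tool_timeout]` with the expectation that the run ends in `failed` with a readable reason, and it can run on every deploy without calling a model or a real tool.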

The takeaway

A useful agent system is one you can measure and fix without breaking everything else. Less spectacular than a demo, but that is what separates a prototype from a product. If you keep only one thing: instrument first. Everything else follows from it.