AI Agents in Production: What Nobody Tells You
Agentic AI systems are powerful and notoriously hard to operate reliably. Here's what we've learned shipping agents to production.
AI agents — LLMs that can use tools, execute code, browse the web, and take multi-step actions — are the most exciting development in software since cloud computing. They’re also the most failure-prone systems I’ve had to operate.
Here’s what shipping agents to production actually looks like.
The Reliability Problem is Fundamental
LLMs are probabilistic. Agents chain multiple LLM calls together, with each call’s output feeding the next, so errors compound. If each decision is 95% reliable and failures are independent, an agent making 10 sequential decisions succeeds only about 60% of the time (0.95^10 ≈ 0.60).
This is not a bug you can fix. It’s the nature of the system. Your architecture has to account for it.
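The compounding arithmetic is easy to check. Here is a minimal sketch, assuming every step has the same reliability and steps fail independently (real agents rarely satisfy either assumption exactly, so treat the numbers as a rough guide). The function name is ours, chosen for illustration.

```python
def pipeline_success_rate(step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_reliability ** steps


for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {pipeline_success_rate(0.95, steps):.1%}")
#  1 steps: 95.0%
#  5 steps: 77.4%
# 10 steps: 59.9%
# 20 steps: 35.8%
```

Doubling the number of steps from 10 to 20 takes the success rate from roughly 60% to roughly 36%, which is why shortening the chain often helps more than tuning any single step.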
The practical implications:
- Design for failure, not success. Every tool call can fail. Every LLM response can be malformed.
- Implement retries with exponential backoff at every step.
- Build human-in-the-loop checkpoints for high-stakes decisions.
- Log everything. You need full traces of every agent decision to debug failures.
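The retry advice above can be sketched in a few lines. This is an illustrative helper, not a specific library's API: `TransientToolError`, `call_with_backoff`, and the default delays are all assumptions made for the example. It uses exponential backoff with full jitter and retries only errors explicitly marked as transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientToolError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientToolError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Delay ceiling doubles each attempt (base, 2x, 4x, ...) up to
            # max_delay. Full jitter keeps parallel agents from retrying
            # in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Retrying is only safe when the wrapped call is idempotent; anything else should fail fast and hand the decision to a human or to the agent itself.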
Tool Design is Product Design
The tools you give an agent determine what it can accomplish and how reliably it can accomplish it. Poorly designed tools are the most common source of agent failures we encounter.
Good tool design principles:
Make tools atomic and idempotent. A tool that creates a record and sends a notification is really two tools. An idempotent tool can be retried safely because repeating the call does not duplicate its side effects.
Return structured, machine-readable output. Don’t return prose that the LLM has to parse. Return JSON with clear field names. LLMs misread free-form prose more often than named fields, and the risk grows as the context fills up.
Include context in errors. When a tool fails, return enough detail for the agent to decide what to do next, not just “error 500”.
Limit tool scope. In our experience, an agent with 50 tools underperforms one with 10 well-chosen tools. More tools means more ambiguity in tool selection.
State Management
Agents need state: what they have done, what they know, and what decisions they have made. Managing this state well is the difference between a reliable agent and one that loops, forgets, or contradicts itself.
For short-lived agents (single task, minutes), in-memory state is fine. For long-running agents (hours, days), you need durable state storage with the ability to pause, resume, and inspect.
We use a simple pattern: an append-only event log stored in PostgreSQL. Every action the agent takes is an event. The agent’s current state is derived from replaying the event log. This gives you full auditability and the ability to resume from any point.
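A minimal sketch of the pattern follows. To keep it self-contained and runnable it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL. The table name, column names, event kinds, and state shape are illustrative assumptions, not our production schema.

```python
import json
import sqlite3
from typing import Any, Optional


def open_log() -> sqlite3.Connection:
    """Create the event table (SQLite here; PostgreSQL in the article)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE agent_events (
               seq     INTEGER PRIMARY KEY AUTOINCREMENT,
               run_id  TEXT NOT NULL,
               kind    TEXT NOT NULL,
               payload TEXT NOT NULL
           )"""
    )
    return conn


def append_event(conn: sqlite3.Connection, run_id: str, kind: str,
                 payload: dict[str, Any]) -> None:
    """Append only: events are inserted, never updated or deleted."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO agent_events (run_id, kind, payload) VALUES (?, ?, ?)",
            (run_id, kind, json.dumps(payload)),
        )


def replay(conn: sqlite3.Connection, run_id: str,
           up_to_seq: Optional[int] = None) -> dict[str, Any]:
    """Derive state by folding over the run's events in order."""
    state: dict[str, Any] = {"completed_steps": [], "facts": {}}
    query = "SELECT kind, payload FROM agent_events WHERE run_id = ?"
    params: list[Any] = [run_id]
    if up_to_seq is not None:  # resume or inspect at an earlier point
        query += " AND seq <= ?"
        params.append(up_to_seq)
    query += " ORDER BY seq"
    for kind, payload in conn.execute(query, params):
        event = json.loads(payload)
        if kind == "tool_called":
            state["completed_steps"].append(event["tool"])
        elif kind == "fact_learned":
            state["facts"][event["key"]] = event["value"]
    return state
```

Because state is never stored directly, pausing an agent is just stopping the process, and resuming is calling `replay` for its run and continuing from the derived state.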
The Observability Stack for Agents
Standard APM tools don’t capture what you need for agents. You need:
- Full trace of every LLM call — prompt, response, token count, latency, model
- Tool call traces — what was called, with what arguments, what was returned
- Decision traces — why did the agent choose this action over alternatives?
- Cost tracking per agent run — LLM API costs compound quickly in agentic loops
LangSmith, Langfuse, and Helicone all provide agent-specific observability. We typically deploy Langfuse self-hosted alongside our agents for full data control.
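Whichever tool you pick, the per-call fields listed above amount to a small data model. The sketch below is tool-agnostic and does not use any vendor's SDK. The class names, the `example-model` entry, its prices, and the per-run budget cap are all hypothetical, added only to show how cost can be tracked and bounded per run.

```python
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1M tokens; substitute your provider's rates.
PRICES = {"example-model": {"input": 1.00, "output": 4.00}}


class BudgetExceeded(RuntimeError):
    """Raised when a run's accumulated LLM cost passes its cap."""


@dataclass
class LLMCallTrace:
    model: str
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def cost_usd(self) -> float:
        price = PRICES[self.model]
        return (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        ) / 1_000_000


@dataclass
class RunTrace:
    run_id: str
    max_cost_usd: float = 1.00  # assumed per-run budget
    llm_calls: list[LLMCallTrace] = field(default_factory=list)

    @property
    def total_cost_usd(self) -> float:
        return sum(call.cost_usd for call in self.llm_calls)

    def record(self, call: LLMCallTrace) -> None:
        """Store the call, then stop the run if it is over budget."""
        self.llm_calls.append(call)
        if self.total_cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"run {self.run_id} spent ${self.total_cost_usd:.4f}, "
                f"cap is ${self.max_cost_usd:.4f}"
            )
```

A hard cap like this is a blunt instrument, but it turns a runaway loop from a surprise invoice into a logged exception.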
When Not to Use Agents
Agents are powerful, but they are genuinely overkill for many tasks. Use an agent when:
- The task requires dynamic decision-making based on intermediate results
- The set of steps cannot be determined in advance
- Multiple tools need to be orchestrated in flexible order
Use a simple pipeline when:
- The steps are fixed and known in advance
- Each step has deterministic output
- Reliability is critical and the task is well-defined
In our experience, a deterministic pipeline with a few LLM calls is roughly an order of magnitude easier to test, debug, and operate than an agent. Start simple.
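To make the contrast concrete, here is a sketch of a fixed pipeline with two LLM calls. Everything in it is a hypothetical example: the ticket-triage task, the `LLM` callable type (prompt in, completion text out), the labels, and the queue names. The point is that the control flow lives in ordinary code, so each step can be tested with a stub in place of the model.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

ALLOWED_LABELS = {"billing", "bug", "other"}
QUEUES = {"billing": "finance-queue", "bug": "eng-queue", "other": "support-queue"}


def classify(ticket: str, llm: LLM) -> str:
    """LLM step 1, with the output forced into a known set of labels."""
    raw = llm(f"Classify this ticket as billing, bug, or other:\n{ticket}")
    label = raw.strip().lower()
    return label if label in ALLOWED_LABELS else "other"


def summarize(ticket: str, llm: LLM) -> str:
    """LLM step 2."""
    return llm(f"Summarize this ticket in one sentence:\n{ticket}").strip()


def triage(ticket: str, llm: LLM) -> dict[str, str]:
    """Fixed order: classify, route by lookup, summarize. No tool selection."""
    label = classify(ticket, llm)
    return {
        "label": label,
        "queue": QUEUES[label],  # deterministic routing, no LLM involved
        "summary": summarize(ticket, llm),
    }
```

The model can still return a bad label or a weak summary, but it cannot choose the wrong step, skip one, or loop.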
Building agentic AI systems? Get in touch — we help teams design agent architectures that are reliable enough to ship.