AI Agents in Production: Evals, Guardrails, and Observability

Every agent demo looks the same: a model plans, calls a few tools, and produces something that would have taken a person an hour. The room nods. Then the demo meets real data, real users, and real edge cases — and the gap between "works in the demo" and "works at 2 a.m. on a Tuesday" turns out to be the entire project.

That gap isn't closed by a better model. It's closed by the unglamorous systems around the model. Here's what we build, in the order we build it.

Evals before features

An agent without evals is a system whose behavior you discover from user complaints. Before adding a second tool or a fancier planning loop, you need a way to answer: did this change make the agent better or worse?

A practical eval stack, smallest first:

A golden set. Thirty to a hundred real inputs with known-good outputs, pulled from actual usage — not invented by the team that built the agent. Run on every change.
Assertion checks. Cheap, deterministic rules: the response cited a source, the JSON parsed, the refund amount is within policy, no tool was called twice with identical arguments. These catch the embarrassing failures for free.
Model-graded rubrics. For qualities you can't assert — tone, completeness, reasoning — use a second model with a written rubric. Calibrate it against human judgments before trusting it, and re-calibrate when you change the rubric.
Regression tagging. Every production failure becomes a new eval case. This is the flywheel: the eval set grows into a precise description of what your users actually need.
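
The assertion-check layer is the easiest one to start with. A minimal sketch follows, assuming a hypothetical `AgentRun` record that holds the final output and the tool calls made along the way. The field names, the `issue_refund` tool name, and the 200 limit are illustrative, not taken from any particular framework.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """One agent run as the eval harness sees it (hypothetical shape)."""
    output: str                                      # final response, expected to be JSON
    tool_calls: list = field(default_factory=list)   # (tool_name, args_dict) pairs


def json_parses(run: AgentRun) -> bool:
    """The final output must be valid JSON."""
    try:
        json.loads(run.output)
        return True
    except json.JSONDecodeError:
        return False


def no_duplicate_tool_calls(run: AgentRun) -> bool:
    """No tool may be called twice with identical arguments."""
    seen = set()
    for name, args in run.tool_calls:
        key = (name, json.dumps(args, sort_keys=True))
        if key in seen:
            return False
        seen.add(key)
    return True


def refund_within_policy(run: AgentRun, limit: int = 200) -> bool:
    """Any refund call must stay at or under the policy limit (assumed value)."""
    return all(
        args.get("amount", 0) <= limit
        for name, args in run.tool_calls
        if name == "issue_refund"
    )


CHECKS = {
    "json_parses": json_parses,
    "no_duplicate_tool_calls": no_duplicate_tool_calls,
    "refund_within_policy": refund_within_policy,
}


def run_checks(run: AgentRun) -> list:
    """Return the names of every check the run failed."""
    return [name for name, check in CHECKS.items() if not check(run)]


example = AgentRun(
    output='{"answer": "Refund issued"}',
    tool_calls=[
        ("lookup_order", {"id": 7}),
        ("lookup_order", {"id": 7}),       # duplicate call
        ("issue_refund", {"amount": 250}), # over the assumed limit
    ],
)
failed = run_checks(example)
```

Running the same checks over every golden-set case in CI, and failing the build when `failed` is non-empty, is one way to turn this into the gate described here.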

The discipline that matters: no change merges without the evals running. Teams that treat evals as a quarterly report instead of a CI gate end up debugging vibes.

Guardrails: design for the wrong answer

The model will be wrong. The system's job is to make wrong cheap. In practice, this means deciding — explicitly, per action — what the agent may do alone, what it must ask about, and what it can never do.

A pattern that holds up:

Action class       | Examples                                      | Policy
Read               | Search, summarize, retrieve                   | Autonomous
Draft              | Email replies, task lists, reports            | Autonomous, human approves send
Reversible write   | Tag a ticket, update a CRM field              | Autonomous, logged, undoable
Irreversible write | Send money, delete records, contact customers | Human confirmation, always
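
One way to keep this pattern from living only in a document is to encode it as data the agent runtime consults before every tool call. The sketch below is an assumption about how that might look: the tool names in `TOOL_CLASS` are hypothetical, and defaulting unknown tools to the strictest class is a design choice of this example, not something the pattern itself requires.

```python
from enum import Enum


class ActionClass(Enum):
    READ = "read"
    DRAFT = "draft"
    REVERSIBLE_WRITE = "reversible_write"
    IRREVERSIBLE_WRITE = "irreversible_write"


class Policy(Enum):
    AUTONOMOUS = "autonomous"
    AUTONOMOUS_HUMAN_APPROVES_SEND = "autonomous, human approves send"
    AUTONOMOUS_LOGGED_UNDOABLE = "autonomous, logged, undoable"
    HUMAN_CONFIRMATION = "human confirmation, always"


# Mirrors the action-class pattern row by row.
POLICY = {
    ActionClass.READ: Policy.AUTONOMOUS,
    ActionClass.DRAFT: Policy.AUTONOMOUS_HUMAN_APPROVES_SEND,
    ActionClass.REVERSIBLE_WRITE: Policy.AUTONOMOUS_LOGGED_UNDOABLE,
    ActionClass.IRREVERSIBLE_WRITE: Policy.HUMAN_CONFIRMATION,
}

# Hypothetical tool registry: each tool is classified once, by a human.
TOOL_CLASS = {
    "search": ActionClass.READ,
    "draft_email": ActionClass.DRAFT,
    "tag_ticket": ActionClass.REVERSIBLE_WRITE,
    "send_money": ActionClass.IRREVERSIBLE_WRITE,
}


def policy_for(tool_name: str) -> Policy:
    """Look up the policy for a tool; unclassified tools get the strictest class."""
    action = TOOL_CLASS.get(tool_name, ActionClass.IRREVERSIBLE_WRITE)
    return POLICY[action]


def requires_human(tool_name: str) -> bool:
    """True when the call must wait for explicit human confirmation."""
    return policy_for(tool_name) is Policy.HUMAN_CONFIRMATION
```

With this in place, adding a tool without classifying it fails safe: `requires_human("delete_everything")` returns `True` until someone adds it to the registry.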

Three implementation notes that save pain later:

Constrain at the tool layer, not the prompt layer. "Please never issue refunds over $200" is a suggestion; a tool schema whose amount parameter is capped at 200 is a rule. Prompts express intent — interfaces enforce policy.
Validate inputs as untrusted. Anything the agent reads — web pages, emails, documents — can contain instructions. Treat retrieved content as data, never as commands, and strip or sandbox anything executable.
Budget the loop. Agents that can retry can also spiral. Cap steps, tokens, and wall-clock time per task, and make hitting a cap a graceful handoff to a human, not a crash.
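
Two of these notes translate directly into code. The sketch below shows a tool that enforces the refund cap itself and a loop that hands off instead of crashing when its budget runs out. Everything named here (`issue_refund`, `PolicyViolation`, `Budget`, `run_task`, and the shape of `step_fn`) is hypothetical, and a token cap is left out for brevity; it would follow the same pattern as the step cap.

```python
import time
from dataclasses import dataclass

REFUND_CAP = 200  # dollars; the cap from the refund example


class PolicyViolation(Exception):
    """Raised when a tool call falls outside policy, whatever the prompt said."""


def issue_refund(amount: float) -> str:
    """Hypothetical tool: the cap is enforced here, not in the prompt."""
    if not 0 < amount <= REFUND_CAP:
        raise PolicyViolation(f"refund {amount} is outside the range 0..{REFUND_CAP}")
    return f"refunded {amount}"  # a real payment call would go here


@dataclass
class Budget:
    max_steps: int
    max_seconds: float


def run_task(step_fn, budget: Budget) -> dict:
    """Run step_fn until it returns a result or the budget is exhausted.

    step_fn(step_index) returns None to keep going, or any other value
    when the task is finished (an assumed contract for this sketch).
    """
    started = time.monotonic()
    for step in range(budget.max_steps):
        if time.monotonic() - started > budget.max_seconds:
            return {"status": "handoff", "reason": "time budget exceeded"}
        result = step_fn(step)
        if result is not None:
            return {"status": "done", "result": result}
    return {"status": "handoff", "reason": "step budget exceeded"}


# An agent that never finishes is handed off after three steps.
stuck = run_task(lambda step: None, Budget(max_steps=3, max_seconds=60))

# An agent that finishes on its third step completes normally.
finished = run_task(lambda step: "ok" if step == 2 else None,
                    Budget(max_steps=5, max_seconds=60))
```

The handoff is an ordinary return value rather than an exception, so the caller has to handle it as a normal outcome and can route it to a human queue.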

Observability: you can't fix what you can't replay

When an agent misbehaves, the question is always "what exactly happened?" If the answer takes more than a few minutes to reconstruct, you don't have observability — you have logs.

The minimum we ship with every agent:

Full traces. Every step: the prompt, the model's plan, each tool call with arguments and results, and the final output, linked under one trace ID. When a user reports a bad outcome, you replay it rather than re-imagine it.
Cost and latency per task — not per request. A task that takes 40 model calls at four cents each is a $1.60 task; you want to know that before finance does.
Drift dashboards. Daily distributions of: tasks completed without handoff, guardrail triggers, eval-set pass rate, and tool error rates. Most production incidents announce themselves as a slow drift days before the failure.
A feedback affordance in the product. A thumbs-down that captures the trace is worth a thousand satisfaction surveys.
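
A trace does not need much structure to be useful. Here is a minimal in-memory sketch, assuming hypothetical `Step` and `Trace` records. Costs are kept in integer cents to avoid floating-point rounding, and a real system would persist the steps somewhere queryable by trace ID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str            # e.g. "model_call" or "tool_call"
    payload: dict        # prompt, arguments, results: whatever replay needs
    cost_cents: int = 0
    latency_ms: int = 0


@dataclass
class Trace:
    task: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list = field(default_factory=list)

    def record(self, kind: str, payload: dict,
               cost_cents: int = 0, latency_ms: int = 0) -> None:
        """Append one step; every step shares this trace's ID."""
        self.steps.append(Step(kind, payload, cost_cents, latency_ms))

    @property
    def total_cost_cents(self) -> int:
        """Cost attributed to the whole task, not to a single request."""
        return sum(step.cost_cents for step in self.steps)

    @property
    def total_latency_ms(self) -> int:
        return sum(step.latency_ms for step in self.steps)


# The example from the text: 40 model calls at four cents each.
trace = Trace(task="resolve support ticket")
for i in range(40):
    trace.record("model_call", {"prompt": f"step {i}"}, cost_cents=4, latency_ms=800)

print(trace.total_cost_cents / 100)  # 1.6, i.e. a $1.60 task
```

Attaching `trace.trace_id` to the thumbs-down event is what lets a user complaint lead straight to a replayable record.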

We built a version of this for Chronicle, where an AI interface works directly over user data — the trace-and-replay loop is what made improving it routine instead of archaeological.

Rollout: shadow, assist, automate

The deployment plan is the same one that works for any risky system, applied honestly:

Shadow. The agent runs on real traffic; nobody sees its output but you. Two weeks of shadow data beats any benchmark.
Assist. The agent drafts, humans approve. You're now measuring the acceptance rate — the single most honest metric an agent has.
Automate the proven slice. Whatever category sits above your acceptance-rate threshold goes autonomous; everything else stays assisted. Expand category by category, never all at once.
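
The category-by-category gate can be computed straight from assist-phase review data. The sketch below assumes reviews arrive as `(category, accepted)` pairs. The 0.95 threshold and the 50-sample minimum in the example are placeholder values chosen for illustration, not recommendations.

```python
from collections import defaultdict


def rollout_modes(reviews, threshold: float, min_samples: int) -> dict:
    """Decide per category whether the agent runs autonomously or stays assisted.

    reviews: iterable of (category, accepted) pairs from the assist phase.
    A category is automated only with enough samples AND a high enough
    acceptance rate; everything else stays in assist mode.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [accepted, total]
    for category, accepted in reviews:
        counts[category][0] += int(accepted)
        counts[category][1] += 1

    modes = {}
    for category, (accepted, total) in counts.items():
        rate = accepted / total
        proven = total >= min_samples and rate >= threshold
        modes[category] = "automate" if proven else "assist"
    return modes


# Illustrative data: tagging is accepted 98% of the time, refunds only 60%.
reviews = (
    [("tagging", True)] * 98 + [("tagging", False)] * 2
    + [("refunds", True)] * 60 + [("refunds", False)] * 40
)
modes = rollout_modes(reviews, threshold=0.95, min_samples=50)
```

The minimum-sample condition keeps a category with three lucky approvals from being promoted on a 100% rate.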

Acceptance rates also tell you when to stop. An agent stuck at 60% acceptance after a month of iteration is solving the wrong problem or missing the data it needs — both findings worth far more than a forced launch.

The short version

Evals first, in CI, growing from production failures.
Guardrails at the tool interface, sized to the cost of a wrong answer.
Traces you can replay, costs you can attribute, drift you can see.
Shadow → assist → automate, with acceptance rate as the gate.

None of this is exotic. It's the same engineering discipline software has always needed, pointed at a component that's probabilistic. The teams that internalize that ship agents that survive contact with February's data, March's users, and April's audit.

If you're somewhere between the demo and production right now, this is the exact terrain our DIGIUP AI engagements cover — from eval design through rollout. And if you want to see how we scope a first, smaller step, start with how to pick your first AI automation project.