Engineering · 7 min read

How agents stay reliable: evals, tests, and feedback loops

An agent in production needs to prove it works before deployment and stay accurate after. Here's how teams measure and maintain agent reliability.

Before you let an agent handle real work, you need to know it works. After it ships, you need to know it stays reliable. This requires three things: evaluations before deployment, tests during development, and feedback loops in production.

Reliability in agentic systems is different from reliability in traditional software. Traditional software is deterministic: given the same input, the code either works or it doesn't. Agents are probabilistic. They work most of the time, but not always.

An agent that's 90% accurate sounds good until you realize it gets 1 in 10 decisions wrong. At 1,000 decisions a day, that's 100 wrong decisions every day.

Evaluations: does the agent work well enough?

Before an agent touches production, you run evaluations. You give the agent a set of test cases and measure how well it handles them.

A simple evaluation: you have 100 sample customer requests. You run them through the agent. For each request, you check: Did the agent make the right decision? Did it escalate appropriately? Was the response clear?
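As a minimal sketch of that loop, assume the agent can be called as a function that takes a request and returns a decision string. The names here (`EvalCase`, `run_evaluation`, `toy_agent`) are illustrative, and the toy agent is a stand-in, not a real one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    request: str
    expected: str  # hand-labeled correct decision: "approve" or "escalate"


def run_evaluation(agent: Callable[[str], str], cases: list) -> dict:
    """Run every case through the agent and report accuracy plus the misses."""
    if not cases:
        raise ValueError("evaluation needs at least one case")
    failures = []
    for case in cases:
        actual = agent(case.request)
        if actual != case.expected:
            failures.append((case.request, case.expected, actual))
    return {"accuracy": 1 - len(failures) / len(cases), "failures": failures}


# Hypothetical stand-in for a real agent: escalates anything mentioning "dispute".
def toy_agent(request: str) -> str:
    return "escalate" if "dispute" in request.lower() else "approve"


cases = [
    EvalCase("Please reschedule my appointment", "approve"),
    EvalCase("I want to dispute this charge", "escalate"),
    EvalCase("My order arrived damaged and I'm furious", "escalate"),
    EvalCase("Update my shipping address", "approve"),
]

report = run_evaluation(toy_agent, cases)
print(f"accuracy: {report['accuracy']:.0%}")
for request, expected, actual in report["failures"]:
    print(f"MISS: {request!r} expected={expected} got={actual}")
```

The toy agent misses the angry-customer case because nothing in its rule covers it. Keeping the list of misses alongside the score tells you which requests to look at, which the score alone does not.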

Good evaluations are comprehensive. You don't just test the happy path. You test edge cases. You test requests the agent hasn't seen before. You test scenarios that are ambiguous or tricky.

The benchmark matters. Suppose 95% of requests are routine and 5% should be escalated. If your agent escalates 20% of the routine requests, that's a problem, and a good evaluation will show it before launch.
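One way to surface this is to measure escalation errors in both directions instead of a single accuracy number. This is a sketch under assumptions: `escalation_rates` is a hypothetical helper, and the data is made up to match the 95/5 split and the 20% figure above.

```python
def escalation_rates(results: list) -> dict:
    """results holds (expected, actual) pairs; each value is "routine" or "escalate"."""
    routine = [actual for expected, actual in results if expected == "routine"]
    needs_escalation = [actual for expected, actual in results if expected == "escalate"]
    return {
        # Routine requests the agent escalated anyway.
        "over_escalation": routine.count("escalate") / len(routine),
        # Requests that needed a human but were handled as routine.
        "missed_escalation": needs_escalation.count("routine") / len(needs_escalation),
    }


# Illustrative data: 95 routine requests, 19 of them escalated (20%),
# plus 5 requests that should be escalated and were.
results = (
    [("routine", "routine")] * 76
    + [("routine", "escalate")] * 19
    + [("escalate", "escalate")] * 5
)

rates = escalation_rates(results)
print(f"over-escalation:   {rates['over_escalation']:.0%}")
print(f"missed escalation: {rates['missed_escalation']:.0%}")
```

Overall accuracy here is 81%, which hides the fact that every error is an unnecessary escalation. Splitting the rates shows the failure mode directly.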

Different processes have different reliability thresholds. A scheduling agent should probably be 98% accurate before going live. A screening agent (initial evaluation) can be 85% accurate because escalations catch errors. A refund agent handling high-value refunds needs to be 99%+ accurate.

You decide the threshold based on how much error you can tolerate.
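A threshold is easier to enforce when it is written down as data. The sketch below uses the example numbers from this section. The process names and the `ready_to_ship` function are illustrative.

```python
# Thresholds from the examples above; the keys are illustrative names.
THRESHOLDS = {
    "scheduling": 0.98,
    "screening": 0.85,
    "high_value_refund": 0.99,
}


def ready_to_ship(process: str, correct: int, total: int) -> bool:
    """Return True when measured accuracy meets the threshold for this process."""
    if total == 0:
        raise ValueError("no evaluation results")
    return correct / total >= THRESHOLDS[process]


print(ready_to_ship("screening", correct=90, total=100))          # True
print(ready_to_ship("high_value_refund", correct=90, total=100))  # False
```

The same 90% result passes for one process and fails for another. Note that with only 100 cases, a single failure moves the measurement by a full percentage point, so a 99% threshold likely needs a larger evaluation set to be meaningful.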

Tests: catching breakage during development

Tests verify that specific features work. Does the agent correctly identify when to escalate? Does it handle payment failures appropriately? Does it remember past interactions?

A test is narrow. "When the agent sees a refund request for an amount over $1000, does it escalate?" You set up that scenario and check.
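That scenario might look like the following. `decide_refund` is a hypothetical stand-in for the agent's decision step, and treating exactly $1000 as approvable is an assumption made here for illustration.

```python
ESCALATION_LIMIT = 1000  # dollars, from the example above


def decide_refund(amount: float) -> str:
    """Hypothetical stand-in for the agent's refund decision."""
    return "escalate" if amount > ESCALATION_LIMIT else "approve"


def test_refund_over_limit_escalates():
    assert decide_refund(1000.01) == "escalate"
    assert decide_refund(5000) == "escalate"


def test_refund_at_or_under_limit_is_approved():
    # Assumption: the limit itself does not trigger escalation.
    assert decide_refund(1000) == "approve"
    assert decide_refund(25) == "approve"


# A test runner such as pytest would discover these; calling them directly also works.
test_refund_over_limit_escalates()
test_refund_at_or_under_limit_is_approved()
```

Testing just above and exactly at the limit pins down the boundary, which is where this kind of rule tends to break.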

Most teams build a test suite as they develop the agent. A test case for each major decision type. A test case for each kind of error the agent might encounter. A test case for each edge case you discover.

Tests fail when development breaks something. You refactor the agent's logic and suddenly it's escalating things it shouldn't. The test catches it immediately.

The challenge with agent tests is that they're not fully deterministic. An agent might handle the same request slightly differently each time, depending on context. This is why test suites include a "pass rate," not just "pass/fail": run the same request many times, and the agent should handle it correctly 95%+ of the time.
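A pass-rate test can be sketched by repeating the same request and counting correct answers. Everything here is an assumption for illustration: `pass_rate` is a hypothetical helper, and the flaky agent simulates nondeterminism by getting one call in every 50 wrong so the example is reproducible.

```python
from typing import Callable


def pass_rate(agent: Callable[[str], str], request: str, expected: str, runs: int = 200) -> float:
    """Send the same request `runs` times and return the fraction answered correctly."""
    correct = sum(agent(request) == expected for _ in range(runs))
    return correct / runs


def make_flaky_agent(fail_every: int = 50) -> Callable[[str], str]:
    """Stand-in agent that gives the wrong answer on every `fail_every`-th call."""
    calls = 0

    def agent(request: str) -> str:
        nonlocal calls
        calls += 1
        return "approve" if calls % fail_every == 0 else "escalate"

    return agent


rate = pass_rate(make_flaky_agent(), "Refund $1,500 please", "escalate")
print(f"pass rate: {rate:.0%}")  # 98%
assert rate >= 0.95, f"pass rate {rate:.0%} is below the 95% bar"
```

With a real agent the rate will vary between runs, so the number of repetitions matters: too few and a test sitting near the bar will pass and fail at random.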

Feedback loops: staying reliable in production

Evaluation and tests happen before the agent ships. Then the agent goes live and encounters the real world, which is messier than your test cases.

A feedback loop means observing what the agent does in production and using that to improve.

The simplest feedback loop: you spot-check decisions. Once a week you randomly sample 20 decisions the agent made and grade them. Did it do the right thing? If you find problems, you investigate why.
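A weekly spot check could be sketched like this, assuming decisions are stored as dictionaries with a `category` field. The function names and the data are illustrative, and in practice a human reviewer fills in the `correct` field.

```python
import random
from collections import Counter
from typing import Optional


def sample_for_review(decisions: list, k: int = 20, seed: Optional[int] = None) -> list:
    """Pick up to k decisions uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(decisions, min(k, len(decisions)))


def failure_patterns(graded: list) -> Counter:
    """Count incorrect decisions by category so recurring problems stand out."""
    return Counter(d["category"] for d in graded if not d["correct"])


# Illustrative log of one week's decisions.
week = [
    {"id": i, "category": "return" if i % 3 == 0 else "scheduling"}
    for i in range(500)
]
batch = sample_for_review(week, k=20, seed=42)

# A reviewer would grade each decision by hand. Here every sampled return is
# marked wrong purely to show how the grouping works.
graded = [{**d, "correct": d["category"] != "return"} for d in batch]
print(len(batch), "decisions sampled")
print(failure_patterns(graded))
```

Sampling at random matters. If reviewers pick the decisions themselves, they tend to pick the interesting ones, and the error rate they measure stops reflecting what the agent does day to day.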

A pattern emerges: "The agent is approving returns from this vendor when it should escalate." Now you understand the problem and can fix it.

More sophisticated teams build automated feedback loops. Automated systems flag decisions the agent made that later turned out to be wrong. "The agent approved this return, but the customer then disputed the charge." That's a signal the agent's approval threshold was wrong.
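One possible shape for such a flag is a join between the agent's decisions and later dispute records. This is a sketch, not a prescribed design: `Decision`, `Dispute`, and `flag_suspect_approvals` are hypothetical names, and matching on `order_id` is an assumption about how the two data sources relate.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Decision:
    order_id: str
    action: str  # e.g. "approve_return" or "escalate"
    decided_on: date


@dataclass
class Dispute:
    order_id: str
    filed_on: date


def flag_suspect_approvals(decisions: list, disputes: list) -> list:
    """Return approvals that were followed by a dispute on the same order."""
    latest_dispute = {}
    for dispute in disputes:
        previous = latest_dispute.get(dispute.order_id)
        if previous is None or dispute.filed_on > previous:
            latest_dispute[dispute.order_id] = dispute.filed_on
    return [
        d for d in decisions
        if d.action == "approve_return"
        and d.order_id in latest_dispute
        and latest_dispute[d.order_id] >= d.decided_on
    ]


decisions = [
    Decision("A-1", "approve_return", date(2024, 3, 1)),
    Decision("A-2", "approve_return", date(2024, 3, 2)),
    Decision("A-3", "escalate", date(2024, 3, 2)),
]
disputes = [Dispute("A-2", date(2024, 3, 20)), Dispute("A-3", date(2024, 3, 5))]

for flagged in flag_suspect_approvals(decisions, disputes):
    print("review:", flagged.order_id)
```

A flag is a prompt for review, not proof of a mistake. Some disputed approvals will turn out to have been the right call given what the agent knew at the time.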

What you learn goes back into the evaluation. You update the expected answers to reflect the corrected understanding, then re-run it. "We thought the agent should approve returns if the customer has no prior disputes, but we learned that's not good enough. Now it should also check if the return is within 30 days."

Measuring reliability in production

You can't measure 95% accuracy if you don't observe outcomes. This means logging.

The agent logs: what request it received, what decision it made, what action it took, what the outcome was. Later, you can trace a decision back and ask: was that the right call?

Some outcomes are immediate. The agent approved an appointment and the customer showed up or didn't. Some outcomes take time. The agent issued a refund and the customer disputes it three months later.

Build the pipeline to observe both. Immediate feedback tells you if the agent is good at the job it's doing now. Delayed feedback tells you if the agent's judgment is sound.
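One way to handle both kinds of outcome is an append-only log: write the decision when it happens, write each outcome whenever it becomes known, and join them by id later. The sketch below uses an in-memory list as a stand-in for real log storage. The field names and function names are assumptions, and it assumes each decision is logged before its outcomes.

```python
import json
import time
import uuid


def log_decision(log: list, request: str, decision: str, action: str) -> str:
    """Append a decision record and return its id so outcomes can reference it."""
    decision_id = str(uuid.uuid4())
    log.append(json.dumps({
        "type": "decision",
        "decision_id": decision_id,
        "timestamp": time.time(),
        "request": request,
        "decision": decision,
        "action": action,
    }))
    return decision_id


def log_outcome(log: list, decision_id: str, outcome: str) -> None:
    """Append an outcome whenever it becomes known, minutes or months later."""
    log.append(json.dumps({
        "type": "outcome",
        "decision_id": decision_id,
        "timestamp": time.time(),
        "outcome": outcome,
    }))


def join_outcomes(log: list) -> dict:
    """Map each decision id to its record plus any outcomes seen so far."""
    joined = {}
    for line in log:
        event = json.loads(line)
        if event["type"] == "decision":
            joined[event["decision_id"]] = {**event, "outcomes": []}
        else:
            joined[event["decision_id"]]["outcomes"].append(event["outcome"])
    return joined


log = []
appointment = log_decision(log, "Book a cleaning for Tuesday", "approve", "scheduled_appointment")
refund = log_decision(log, "Refund my last order", "approve", "issued_refund")
log_outcome(log, appointment, "customer_showed_up")  # immediate outcome
log_outcome(log, refund, "disputed")                 # could arrive months later

history = join_outcomes(log)
print(history[refund]["decision"], history[refund]["outcomes"])
```

Because outcomes are separate records, a decision with an empty `outcomes` list is simply one you have not heard back about yet, which is different from one that went well.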

Building a testing culture

Testing agents is newer than testing traditional code, so not every team has built the discipline. But the best teams treat agent reliability like they treat software reliability.

You have acceptance criteria before the agent ships. "95% of appointments will be scheduled correctly on the first attempt." You measure against these criteria constantly.

You have regression tests. If you confirm the agent is correctly escalating a certain type of request, that becomes a test case. If future changes break that behavior, the test fails.

You have a post-mortem process. When the agent makes a bad decision that costs you money, you investigate. What went wrong? How do we prevent this in the future? You add a test case.

Common pitfalls

Teams often skip evaluation and jump straight to production. "Let's just run it and see what happens." What happens is the agent makes bad decisions that damage your business.

Teams often build tests that don't reflect reality. The test checks whether the agent can handle a clean scenario, but in production the data is messy. Test with real data or at least realistic messy data.

Teams often stop monitoring after the agent ships. "We shipped the agent, we're done." The agent degrades over time as your business changes. Regular monitoring catches this.

Teams often don't correlate agent decisions with outcomes. They don't know whether the agent's decisions were actually right. You need someone assigning correctness labels to decisions, or you can't learn.
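The last two pitfalls are linked: once decisions carry correctness labels, degradation becomes measurable. A minimal sketch, assuming graded decisions arrive one at a time, is a rolling accuracy window that raises a flag when it drops below the threshold. The class name and default values are illustrative.

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over the most recent graded decisions."""

    def __init__(self, window: int = 100, threshold: float = 0.95):
        self.results = deque(maxlen=window)  # oldest grades fall off automatically
        self.threshold = threshold

    def record(self, correct: bool) -> None:
        self.results.append(correct)

    @property
    def accuracy(self) -> float:
        return sum(self.results) / len(self.results)

    def degraded(self) -> bool:
        # Wait for a full window so a few early misses don't trigger an alert.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy < self.threshold


monitor = RollingAccuracy(window=50, threshold=0.95)
for _ in range(50):
    monitor.record(True)
print(monitor.degraded())  # False

for _ in range(5):
    monitor.record(False)
print(monitor.accuracy, monitor.degraded())  # 0.9 True
```

A fixed window trades sensitivity for stability. A short window reacts quickly to a change in the business but also to noise, so the window size is worth tuning against how many decisions you can realistically grade.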

Practical first steps

Start by defining what success looks like for your agent. 95% accuracy on routine decisions? 99% on financial decisions? That's your threshold.

Build test cases for the main decision types your agent will handle. Run the agent through them and see how it performs.

If it meets your threshold, deploy with monitoring. If not, improve the agent and test again.

In production, spot-check decisions regularly. Sample 20 a week, grade them, look for patterns.

After a month, you'll see whether the agent is reliable and where it's weak. That tells you what to improve next.

Reliability is built, not assumed. Good agents are tested exhaustively before deployment and monitored continuously after. If you're building an agent and thinking through evaluation strategy and monitoring, we can help you build the right testing framework. Reach out to talk through your specific reliability requirements.