Essay · May 2026

Lessons from evaluating AI in production.

By the Kairos team. 7 minute read.

Over the past year, our team ran evaluation infrastructure across deployments in customer support, operations-heavy industries, and consumer-facing AI products. The agents spanned chat, voice, and multimodal interfaces, and ran in front of real users against real workflows.

When we started, the consensus was to specialize: pick one modality, go deep on its evals, and expand later. Our read was that evaluation was fundamentally the same problem in every vertical we entered. The form factor changes (what you measure for a leasing agent is not what you measure for a voice support agent), but the underlying mechanics don't.

The wide net let us build a generalizable evaluation platform that automated the full pipeline, and it gave us a ground-level view of how to ship agents that work. We compressed those lessons into seven principles for deploying AI reliably.

01

The model is rarely the bottleneck.

Most production failures aren't the model getting the answer wrong. They are the system handing the model the wrong question: stale retrieval, the wrong tool, missing context, malformed prompts, or a policy the operator never wrote down. Switching to a stronger model rarely closes those gaps. The work that actually moved reliability was almost always upstream of the model itself.
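One way to make this concrete is to attribute every failed trace to the earliest broken stage of the pipeline before blaming the model. A minimal sketch, assuming a trace carries per-stage health flags; the field and stage names are illustrative, not a fixed schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Trace:
    # Per-stage health flags for one agent run. Field names are
    # illustrative; real traces would derive these from logs.
    retrieval_fresh: bool     # retrieval returned current documents
    tool_correct: bool        # the intended tool was actually invoked
    context_complete: bool    # required account/policy context was present
    prompt_well_formed: bool  # template rendered with no missing fields
    answer_correct: bool      # final output judged correct

def attribute_failure(t: Trace) -> str:
    """Blame the earliest broken stage; only reach the model last."""
    if not t.retrieval_fresh:
        return "stale_retrieval"
    if not t.tool_correct:
        return "wrong_tool"
    if not t.context_complete:
        return "missing_context"
    if not t.prompt_well_formed:
        return "malformed_prompt"
    return "model_error"  # everything upstream held, so the model owns it

def failure_breakdown(traces: list[Trace]) -> Counter:
    return Counter(attribute_failure(t) for t in traces if not t.answer_correct)
```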

02

Evaluation criteria belong to operators, not engineers.

The best evals come from operators writing them themselves. The underlying testing platform gets you halfway there. The other half is the work of sitting with a team member, walking them through what a test case looks like, teaching them what is worth checking and what is noise, and then translating their judgment into something a judge can score consistently. Most of the deployment work was this conversion.
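As a sketch of what that conversion can produce: each criterion below starts life as a sentence an operator says while reviewing transcripts, restated precisely enough for a judge to apply it the same way every time. The rubric contents and the `judge` callable are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str         # the check, named in the operator's own terms
    instruction: str  # the check restated so a judge can score it consistently
    weight: float     # how much a miss matters, per the operator

# Hypothetical rubric for a refund workflow, translated from operator review notes.
REFUND_RUBRIC = [
    Criterion("policy_cited",
              "Cites the specific refund-window policy, not a generic apology.", 0.5),
    Criterion("escalated_high_value",
              "Any order over $500 is handed to a human, never auto-resolved.", 0.3),
    Criterion("no_overpromising",
              "Makes no promise the operator could not guarantee.", 0.2),
]

def score(response: str, rubric: list[Criterion],
          judge: Callable[[str, str], bool]) -> float:
    """Weighted score; `judge(response, instruction) -> bool` stands in
    for whatever LLM judge you actually run."""
    return sum(c.weight for c in rubric if judge(response, c.instruction))
```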

03

Pass rate is usually the wrong number.

Two agents at 95% pass rate aren't equivalent. One fails on rare edge cases the operator can catch. The other fails on common workflows the operator cannot. Creating detailed metrics specific to the workflow the agent is operating in matters more than a headline percentage. For example, a customer support agent can score 90% on answer similarity to ground truth and still miss the cases that decide the deployment: the deflections that should have been escalated or the hard resolutions where the operator would have caught a policy nuance.
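A sketch of what segmenting looks like over a result set; the case fields ("kind", "passed") are assumptions, and "should_escalate" mirrors the support example above:

```python
def workflow_metrics(cases: list[dict]) -> dict:
    """Segment results by what a failure costs, not by the headline rate.
    Each case is assumed to carry 'kind' and 'passed' fields."""
    def rate(kind: str):
        subset = [c for c in cases if c["kind"] == kind]
        return sum(c["passed"] for c in subset) / len(subset) if subset else None

    return {
        "overall_pass_rate": sum(c["passed"] for c in cases) / len(cases),
        # The number that decides the deployment:
        "escalation_recall": rate("should_escalate"),
        "routine_accuracy": rate("routine"),
    }
```

Two agents can post the same overall rate here while differing entirely on escalation recall, which is exactly the gap the headline number hides.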

04

The eval suite is the specification.

When operators, compliance, and engineering disagree, the disagreement surfaces in eval design before it surfaces anywhere else. Writing the eval is the act of forcing the spec into the open. Teams that skip this stage rebuild it later, in incident reviews, after a customer has already been on the other end of the failure.

05

Production traces find what you did not think to test.

Most of the eval cases that mattered came from real production runs, not from the cases anticipated up front. Teams try to anticipate every edge case, but the gap between what gets predicted and what shows up in production is large. The agent's own runtime is the only honest source of edge cases. Without a feedback loop from production into evals, the suite stays frozen at the assumptions made on day one, and reliability stops compounding the moment the deployment goes live.
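The loop itself can be small. A sketch of promoting a reviewed production trace into a frozen regression case; the trace schema and field names are assumptions, not a fixed format:

```python
import hashlib
import json
from datetime import datetime, timezone

def trace_to_eval_case(trace: dict, reviewer_note: str) -> dict:
    """Freeze a reviewed production trace into a regression eval case.
    Assumes the trace carries 'input', 'context', and 'output'."""
    case_id = hashlib.sha256(
        json.dumps(trace, sort_keys=True).encode()
    ).hexdigest()[:12]
    return {
        "id": f"prod-{case_id}",
        "input": trace["input"],
        "context": trace["context"],         # pin context exactly as it was at runtime
        "observed_output": trace["output"],  # what the agent actually did
        "expected_behavior": reviewer_note,  # what the operator says it should have done
        "source": "production",
        "added_at": datetime.now(timezone.utc).isoformat(),
    }
```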

06

Reliability is structural, not a final step.

Evals added at the end of a build only catch what made it to the end. By then the real failures are already baked in upstream, in prompts, retrieval, tool wiring, and policy decisions made weeks earlier, and pulling them out is expensive. Evaluation has to run continuously, with the team's standards encoded into the suite from the first commit. Running evals from day one means more feedback loops and faster iteration, and every cycle compounds what you have learned.
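In practice, "from the first commit" can be as plain as running the suite as an ordinary CI test gate, so an eval regression fails the build the way a broken unit test does. A sketch; the `evals` module, `run_suite`, and the 0.99 floor are all assumptions, not a prescribed setup:

```python
# test_agent_evals.py -- runs in CI on every commit, like any other test file.
import pytest

from evals import load_cases, run_suite  # hypothetical in-repo eval harness

CASES = load_cases("evals/cases/")       # includes cases promoted from production

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_case(case):
    result = run_suite([case])
    assert result.passed(case["id"]), result.failure_reason(case["id"])

def test_escalation_recall_floor():
    # Gate on the metric that decides the deployment, not the headline rate.
    result = run_suite(CASES)
    assert result.escalation_recall >= 0.99
```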

07

Stakeholders trust evidence, not argument.

What unblocks a pilot in a regulated industry isn't a better demo. It's a verifiable artifact that says: this workflow has been tested against these failure modes, here is the result, here is the audit trail. Without that, every meeting is the same conversation: claims about accuracy with nothing underneath them, decisions made only on intuition. With it, conversations actually move to scope.
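The artifact does not need to be elaborate to be auditable. A sketch of the shape, with every field name illustrative: what was tested, what happened, and how to reproduce it.

```python
import json
from datetime import datetime, timezone

def evidence_artifact(results: dict, suite_git_sha: str) -> str:
    """Serialize an eval run into the artifact a stakeholder can audit."""
    return json.dumps({
        "workflow": results["workflow"],
        "failure_modes_tested": results["failure_modes"],
        "per_case": results["per_case"],  # case id -> pass/fail plus judge rationale
        "suite_version": suite_git_sha,   # pins the exact suite that produced this run
        "run_at": datetime.now(timezone.utc).isoformat(),
    }, indent=2)
```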

Why we are building Kairos

A year of operating evaluation infrastructure made the bigger picture clear: evals alone were never going to be enough. Adoption in regulated industries is stuck, and not because the models aren't capable. It's stuck because most of the work that makes a model actually reliable in production sits outside the model itself (training on first-party data, evals translated from operator judgment, self-learning loops), and almost none of it has been solved for the teams who need it.

So we decided to take on the whole scope. We deliver a fully working system, not a piece of one. We build specialized agents that automate the most manual workflows in the enterprise.