evaluation · RAG · LLM-as-judge · observability · metrics · AI agents

How to evaluate AI agents in production: RAG eval, LLM‑as‑judge, and live KPIs

Ship fast, measure faster. A practical framework for offline tests, online metrics, and safety checks that actually detect problems.


Evaluating AI agents is not the same as evaluating a model. Models answer questions. Agents take actions. That means your evaluation must include the decision process, the tool calls, and the effect on the business outcome—not just how “smart” an answer looks.

Why eval is different for agents

An agent can be eloquent and still be wrong if it writes to the wrong record, calls the wrong tool, or exceeds a rate limit. Success is the task outcome under constraints, not a score on a static dataset.

Offline vs. online evaluation

Use offline tests to iterate quickly without risking production: unit tests for tools and prompts; scenario suites that replay real cases; synthetic edge cases on which you expect the agent to fail gracefully. Then confirm in production with online metrics: adoption, success, handoff, and latency.

Think of it as two loops. The inner loop (offline) is fast and cheap; the outer loop (online) is slower but closer to reality. You need both.
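To make the inner loop concrete, here is a minimal sketch of an offline scenario suite. It assumes a hypothetical run_agent entry point that returns the tools it called and its final answer; the scenario names, tools, and expected substrings are illustrative, not a prescribed harness.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    user_input: str
    expected_tool: str   # tool the agent must call ("none" = no tool expected)
    must_contain: str    # substring the answer must include

SCENARIOS = [
    Scenario("refund_simple", "I was charged twice for order 1042", "lookup_order", "refund"),
    Scenario("out_of_scope", "Write me a poem about invoices", "none", "can't help"),
]

def run_suite(run_agent) -> bool:
    """run_agent(user_input) is expected to return {"tools": [...], "answer": "..."}."""
    failures = []
    for s in SCENARIOS:
        result = run_agent(s.user_input)
        if s.expected_tool == "none":
            tool_ok = not result["tools"]                  # agent should not call any tool
        else:
            tool_ok = s.expected_tool in result["tools"]   # agent should call the right tool
        text_ok = s.must_contain.lower() in result["answer"].lower()
        if not (tool_ok and text_ok):
            failures.append(s.name)
    print(f"{len(SCENARIOS) - len(failures)}/{len(SCENARIOS)} scenarios passed")
    return not failures
```

Wire this into CI so every prompt or tool change reruns the suite before it reaches the outer loop.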

RAG evaluation in practice

If your agent uses retrieval, evaluate the retrieval separately from the generation. Measure coverage (did we fetch the right docs?) and precision (did we avoid irrelevant ones?). Keep a tiny labeled set of questions and gold documents; update it when the corpus changes.
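A minimal sketch of retrieval-only scoring over such a labeled set, assuming a hypothetical retrieve(question, k) function that returns ranked document IDs; the questions and IDs are placeholders:

```python
# Tiny gold set: each question maps to the document IDs it should retrieve.
GOLD = {
    "How do I reset my password?": {"kb-017", "kb-204"},
    "What is the refund window?": {"kb-088"},
}

def eval_retrieval(retrieve, k: int = 5) -> dict:
    coverage_hits, precision_sum = 0, 0.0
    for question, gold_ids in GOLD.items():
        retrieved = set(retrieve(question, k=k))
        coverage_hits += bool(retrieved & gold_ids)                      # fetched at least one right doc?
        precision_sum += len(retrieved & gold_ids) / max(len(retrieved), 1)  # avoided irrelevant ones?
    n = len(GOLD)
    return {"coverage@k": coverage_hits / n, "precision@k": precision_sum / n}
```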

At generation time, enforce citations. When the answer references documents, you can audit it later. If a claim can’t be tied to a source, it’s a guess.
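One cheap way to audit this is to have the agent emit citation markers and check them against what was actually retrieved. The sketch below assumes an illustrative [doc:&lt;id&gt;] convention; any stable marker format works.

```python
import re

CITATION = re.compile(r"\[doc:([\w-]+)\]")

def audit_citations(answer: str, retrieved_ids: set[str]) -> dict:
    cited = set(CITATION.findall(answer))
    return {
        "has_citations": bool(cited),
        # cited but never retrieved: likely a guess, flag for review
        "unknown_citations": sorted(cited - retrieved_ids),
    }

# audit_citations("Refunds take 5 days [doc:kb-088].", {"kb-088", "kb-017"})
# -> {"has_citations": True, "unknown_citations": []}
```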

LLM‑as‑judge: where it helps, where it harms

LLMs are useful judges for subjective qualities like clarity or tone. They are also helpful to triage large volumes of outputs to find likely failures. But never let an automated judge be the only arbiter of truth, especially for compliance‑sensitive tasks. Spot‑check with humans and compare against ground truth when available.

If you use LLM‑as‑judge, pin the judge prompt and version it; evaluate the evaluator on a small labeled set; and watch for drift.
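A minimal sketch of that loop, assuming a hypothetical call_llm wrapper around whatever model client you use; the prompt, version tag, and PASS/FAIL labels are illustrative:

```python
# Pin and version the judge prompt so agreement numbers are comparable over time.
JUDGE_PROMPT_VERSION = "clarity-judge-v3"
JUDGE_PROMPT = (
    "You are grading an assistant reply for clarity.\n"
    "Reply with exactly PASS or FAIL.\n\nReply to grade:\n{reply}"
)

def judge(call_llm, reply: str) -> str:
    verdict = call_llm(JUDGE_PROMPT.format(reply=reply)).strip().upper()
    return "PASS" if verdict.startswith("PASS") else "FAIL"

def judge_agreement(call_llm, labeled: list[tuple[str, str]]) -> float:
    """labeled: (reply, human_label) pairs with labels PASS/FAIL."""
    agree = sum(judge(call_llm, reply) == human for reply, human in labeled)
    return agree / len(labeled)   # track per JUDGE_PROMPT_VERSION to catch drift
```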

Safety & failure modes

List the failure modes you care about: wrong writes, PII leakage, policy violations, hallucinated tools, prompt injection. Add detectors for each: schema checks, allow‑lists, rate‑limit alarms, and red‑team prompts. Log enough detail to replay incidents.
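Two of those detectors fit in a few lines. This sketch shows an allow-list plus a required-fields check on writes; the tool names and schema are illustrative placeholders for your own.

```python
ALLOWED_TOOLS = {"lookup_order", "create_ticket", "send_reply"}
WRITE_SCHEMA = {"create_ticket": {"customer_id", "summary"}}   # required fields per write tool

def check_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of violations; block the call and log it when non-empty."""
    violations = []
    if name not in ALLOWED_TOOLS:
        violations.append(f"hallucinated or blocked tool: {name}")
    missing = WRITE_SCHEMA.get(name, set()) - args.keys()
    if missing:
        violations.append(f"{name} missing required fields: {sorted(missing)}")
    return violations
```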

Experiment design & versioning

Version everything: prompts, tools, policies, and retrieval indexes. Use small, reversible experiments: canary a new prompt to 10% of traffic; compare acceptance and error rates. Roll forward when the experiment wins; roll back within minutes when it doesn’t.
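A sketch of a stable 10% canary split plus a simple win check; the bucketing scheme and the 10% error-rate tolerance are assumptions to adjust for your traffic.

```python
import hashlib

def variant_for(user_id: str, canary_pct: int = 10) -> str:
    # Stable hash bucketing: the same user always lands in the same variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "control"

def canary_wins(control: dict, canary: dict) -> bool:
    """Each dict holds acceptance and error rates measured online."""
    return (canary["acceptance"] >= control["acceptance"]
            and canary["error_rate"] <= control["error_rate"] * 1.10)
```

Roll forward when canary_wins holds over enough traffic; flip the flag back within minutes when it does not.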

Dashboards and alerts that matter

Build one dashboard per agent with four sections: usage (WAU, sessions), effectiveness (task success, acceptance, handoff), speed (latency P50/P95), and safety (errors, blocked actions). Alert on deltas, not absolutes—“success rate dropped 15% since yesterday” is more useful than “success rate is 78%.”
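The delta alert itself is trivial once the metric exists. A minimal sketch, with the 15% drop from the example above as the default trigger:

```python
def delta_alert(today: float, yesterday: float, max_relative_drop: float = 0.15) -> bool:
    """Fire when a rate drops by more than max_relative_drop since yesterday."""
    if yesterday <= 0:
        return False                        # nothing meaningful to compare against
    drop = (yesterday - today) / yesterday
    return drop >= max_relative_drop

# delta_alert(today=0.66, yesterday=0.78) -> True (success rate dropped ~15%)
```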

A lightweight eval stack you can ship in a week

Day 1–2: add request IDs, event logging for tool calls/writes, and simple metrics. Day 3–4: create a small offline test set (10–30 cases) and wire a script to run it on every change. Day 5: build a one‑page dashboard and a canary flag. You now have enough to move fast without guessing.
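For the Day 1–2 step, structured events keyed by a request ID are enough to replay incidents later. A minimal sketch; the field names and stdout sink are illustrative, not a prescribed logging stack.

```python
import json
import time
import uuid

def new_request_id() -> str:
    return uuid.uuid4().hex

def log_event(request_id: str, kind: str, payload: dict) -> None:
    event = {
        "ts": time.time(),
        "request_id": request_id,
        "kind": kind,               # e.g. "tool_call", "write", "handoff"
        "payload": payload,
    }
    print(json.dumps(event))        # ship to stdout or your log pipeline

# rid = new_request_id()
# log_event(rid, "tool_call", {"tool": "lookup_order", "args": {"order_id": "1042"}})
```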

Great agents are measured agents. If you can see what they did, why they did it, and what happened as a result, improvement becomes routine—and safe.
