A Practical Guide for Agentic Programmers

You wouldn't ship a web application without tests. You wouldn't deploy a database migration without running it against staging first. But teams ship LLM-powered features every day with nothing more than "we tried it a few times and it looked good."

That's not testing. That's hoping.

Evaluations — evals — are the test suite for AI systems. They're how you know whether a change to your prompt, model, or retrieval pipeline actually improved things or just felt like it did. WHOOP learned this the hard way: a prompt change to their Memory agent that seemed better during manual testing turned out to be measurably worse when they ran their evaluation suite. Without evals, that regression would have shipped to production.

This guide covers what evals are, how to build them, and why most teams get them wrong.

Why LLM Evaluation Is Hard

Traditional software has deterministic outputs. Given the same input, you get the same output. You write assertions: assertEquals(expected, actual). Done.

LLMs are stochastic. Ask the same question twice, you might get different answers — both correct, phrased differently. There's no single "right answer" to compare against. The output is natural language, which means evaluating it requires judgment, not just string matching.

This leads to three problems that make teams give up on evals prematurely:

No stable ground truth. For many tasks, there are multiple valid responses. A customer support answer can be correct in ten different wordings. Traditional reference-based metrics (BLEU, ROUGE) penalize valid paraphrases.

Multi-dimensional quality. An answer can be factually correct but poorly written, or beautifully written but completely wrong. You need to measure accuracy, relevance, groundedness, tone, safety, and completeness — often simultaneously.

Evaluation at scale is expensive. Human evaluation is the gold standard but costs too much for iterative development. You can't have a human review every test case on every prompt change.

None of these problems are unsolvable. They just require a different testing mindset.

The Three Evaluation Methods

1. Code-Based Evaluation

The simplest approach: write programmatic checks on the output. Does the response contain the expected keyword? Is it under the token limit? Does it include a citation? Does it match a regex pattern?

Code-based evals are fast, deterministic, and cheap. They catch structural failures — the model forgot to include a required field, exceeded the length limit, or returned invalid JSON. They don't measure quality, but they catch a surprising number of production failures.

Use code-based evals as your first line of defense. They run in CI, cost nothing, and catch regressions immediately.
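
For example, a first line of structural checks might look like the sketch below. The length limit, citation pattern, and function names are illustrative assumptions, not a standard:

# Code-based checks: fast, deterministic, and no model calls required
import json
import re

def check_structure(response: str) -> list[str]:
    failures = []
    if len(response) > 2000:
        failures.append("response exceeds length limit")
    if not re.search(r"\[\d+\]", response):
        failures.append("missing citation marker like [1]")
    return failures

def check_valid_json(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False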

2. LLM-as-a-Judge

Use a strong model (typically a frontier model like GPT-4o or Claude) to evaluate the output of your system. You give the judge the input, the output, optionally a reference answer, and a scoring rubric. The judge returns a score and an explanation.

This is the workhorse of modern eval frameworks. It scales, it's relatively cheap, and when well-calibrated, it correlates surprisingly well with human judgment. The key is writing good rubrics. A rubric like "rate the quality from 1-5" is useless. A rubric like "does the response contain a specific recommendation based on the user's data, rather than generic advice? Yes = 1, No = 0" is actionable.

The G-Eval framework formalizes this: it uses chain-of-thought to generate evaluation steps from your criteria, then scores the output against those steps. DeepEval, Ragas, and most modern eval frameworks implement variations of this approach.
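
As a sketch of what that looks like in practice, here is the binary rubric from above expressed as a G-Eval metric in DeepEval. The criteria wording and example inputs are illustrative; check DeepEval's current documentation before relying on the exact signatures:

# A binary, actionable rubric expressed as a G-Eval metric in DeepEval
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

specificity = GEval(
    name="Specificity",
    criteria=(
        "Does the response contain a specific recommendation based on the "
        "user's data, rather than generic advice? Yes = 1, No = 0."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="My sleep score dropped this week. What should I change?",
    actual_output="Your Tuesday and Thursday workouts ended after 9pm; try finishing them by 7pm.",
)
specificity.measure(test_case)
print(specificity.score, specificity.reason)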

Pitfalls: The judge model has its own biases. It tends to prefer longer responses, formal language, and responses that look like its own outputs. Calibrate your judge against human labels on a small set before trusting it at scale.

3. Human Evaluation

Humans review outputs and score them against criteria. This is the most accurate method and the most expensive. Use it strategically: label 25–50 representative test cases to calibrate your automated metrics, then use human review only on flagged outputs in production.

The practical workflow: run LLM-as-a-judge on everything, send the failures and edge cases to human reviewers, and use their feedback to improve the judge's rubric. This creates a feedback loop where human evaluation gets cheaper over time, not more expensive.
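
That calibration can be tracked with something as simple as an agreement rate between the judge and your human labels. A minimal sketch, assuming you store both labels per case:

# Calibration check: how often does the judge agree with human labels?
def judge_agreement(cases: list[dict]) -> float:
    """Each case carries a human_label and a judge_label, both 'pass' or 'fail'."""
    matches = sum(1 for c in cases if c["human_label"] == c["judge_label"])
    return matches / len(cases)

calibration_set = [
    {"id": 1, "human_label": "pass", "judge_label": "pass"},
    {"id": 2, "human_label": "fail", "judge_label": "pass"},   # judge is too lenient here
]
print(f"Judge/human agreement: {judge_agreement(calibration_set):.0%}")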

Building Your Eval Suite

Step 1: Define What You're Measuring

Before writing a single test case, decide on your metrics. Common dimensions for agentic systems:

  • Correctness: Is the factual content accurate?
  • Groundedness / Faithfulness: Are the claims supported by the retrieved context (for RAG systems)?
  • Relevance: Does the response actually address what was asked?
  • Completeness: Did it cover all the important points?
  • Safety: Does it avoid harmful, biased, or policy-violating content?
  • Tool use accuracy: Did the agent call the right tools with the right arguments?
  • Task completion: Did the agent achieve the stated goal?

For RAG systems specifically, measure retrieval and generation separately. Retrieval metrics (precision, recall, relevance of retrieved chunks) tell you if your search is working. Generation metrics (faithfulness, answer quality) tell you if the model is using the context well. If you only measure the final answer, you can't diagnose whether a bad answer came from bad retrieval or bad generation.
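
As a sketch of what separate measurement looks like in DeepEval (the same framework used in Step 3 below), you can attach a retrieval metric and a generation metric to the same test case. The example values are illustrative, and the contextual metrics also require an expected_output:

# Diagnose retrieval and generation independently on the same test case
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualPrecisionMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the refund policy for enterprise customers?",
    actual_output="Enterprise customers get a full refund within 30 days.",
    expected_output="A full refund is available within 30 days of purchase.",
    retrieval_context=["Enterprise customers can request a full refund within 30 days..."],
)

retrieval_metric = ContextualPrecisionMetric(threshold=0.7)   # is search surfacing the right chunks?
generation_metric = FaithfulnessMetric(threshold=0.7)         # is the answer grounded in them?
evaluate([test_case], [retrieval_metric, generation_metric])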

Step 2: Build Your Dataset

A good eval dataset has three properties:

  1. Representative of production. Don't test on clean, well-formed questions if your users type messy queries with typos and ambiguous references.
  2. Covers edge cases. Include adversarial inputs, out-of-scope questions, questions with no good answer, and inputs designed to trigger known failure modes.
  3. Labeled with expected outcomes. Not necessarily a single "right answer," but at minimum a classification of what a good response looks like — required facts, forbidden claims, acceptable formats.

Start small. 25–50 high-quality test cases, labeled by a domain expert, are more valuable than 1,000 auto-generated ones. You can expand later with synthetic data generation, but the initial calibration set must be human-curated.
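
Concretely, a labeled test case can be a small record like the one below. The schema is hypothetical; adapt the fields to your domain:

# One hand-labeled test case: no single "right answer", but clear expectations
test_case = {
    "input": "can i get my money back?? bought enterprise last week",
    "category": "refund_policy",            # for slicing results by topic
    "required_facts": ["30-day window", "enterprise tier eligible"],
    "forbidden_claims": ["refunds are never available"],
    "acceptable_formats": ["prose", "bullet_list"],
}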

WHOOP built synthetic "Personas" — reproducible member profiles with specific data characteristics — to test their agents against realistic scenarios. Cursor uses actual user acceptance/rejection signals as their eval data. Both approaches work; the right one depends on whether you have production data yet.

Step 3: Automate and Integrate

Evals should run automatically on every change. Treat them like unit tests:

# Using DeepEval (pytest-style)
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_refund_policy_question():
    question = "What is the refund policy for enterprise customers?"
    test_case = LLMTestCase(
        input=question,
        # get_model_response is your own system under test
        actual_output=get_model_response(question),
        retrieval_context=["Enterprise customers can request a full refund within 30 days..."]
    )
    metric = AnswerRelevancyMetric(threshold=0.7)  # fails the test below this score
    assert_test(test_case, [metric])

Run evals in CI/CD. Gate deployments on eval results. Track metrics over time to catch drift. This is exactly what WHOOP does with AI Studio — every agent deployment goes through an evaluation gate before reaching users.

Step 4: Monitor in Production

Pre-deployment evals catch regressions. Production monitoring catches the failures you didn't anticipate. Sample 1–5% of production outputs and run your eval metrics. Flag outputs where users gave negative feedback. Send low-scoring outputs to human review.
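
A minimal version of that sampling loop might look like this sketch, where run_judge and send_to_review_queue are hypothetical stand-ins for your own judge and review tooling:

# Sample a slice of production traffic and flag low scores for human review
import random

SAMPLE_RATE = 0.05    # evaluate 5% of production outputs
SCORE_FLOOR = 0.7     # judge scores below this go to a human

def maybe_evaluate(interaction: dict) -> None:
    if random.random() > SAMPLE_RATE:
        return
    score = run_judge(interaction["input"], interaction["output"])   # hypothetical judge call
    if score < SCORE_FLOOR or interaction.get("user_feedback") == "negative":
        send_to_review_queue(interaction, score)                     # hypothetical review tooling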

Cursor's Tab RL takes this to the extreme: every accepted or rejected suggestion is a production eval signal that retrains the model multiple times per day. You don't need to go that far, but you do need some signal from production usage feeding back into your eval dataset.

Common Mistakes

Evaluating too late. Teams build the entire system, then add evals as an afterthought. By then, you've made architectural decisions you can't measure the impact of. Start with evals. They're your compass.

Measuring the wrong thing. Optimizing for a metric that doesn't correlate with user satisfaction is worse than not measuring at all. Align your metrics with business outcomes: ticket resolution rate, user retention, task completion rate — not abstract quality scores.

Trusting the judge blindly. LLM judges have failure modes. They miss factual errors in domains they don't understand. They can't count. They prefer verbose responses. Calibrate your judge against human labels, and re-calibrate periodically.

Evaluating only the final output. Agentic systems involve multi-step reasoning, tool calls, and retrieval; evaluate each step independently, as in the sketch below. A correct final answer that used the wrong tool or retrieved the wrong documents is a system that will fail on slightly different inputs.
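
A step-level check can be as simple as asserting on the agent's recorded trace instead of its final answer. In this sketch, run_agent and the trace format are hypothetical assumptions:

# Evaluate the trajectory, not just the answer: did the agent call the right tool?
def test_uses_refund_lookup_tool():
    trace = run_agent("What is the refund policy for enterprise customers?")  # hypothetical agent runner
    tool_calls = [step["tool"] for step in trace["steps"] if step["type"] == "tool_call"]
    assert "lookup_policy" in tool_calls, f"expected lookup_policy, got {tool_calls}"
    assert trace["final_answer"]  # the answer itself is scored by separate metrics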

The Takeaway

Evals are not optional. They're not a nice-to-have for mature teams. They're the mechanism by which you know whether your system works. Without them, every change is a coin flip.

The good news: you don't need a custom evaluation platform to start. A spreadsheet with 30 labeled test cases and a script that runs your model against them is a legitimate evaluation framework. WHOOP started with spreadsheets. You can too. Just start.
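
A minimal version really is that small. This sketch assumes a cases.csv with input and required_fact columns, and reuses the get_model_response entry point from the earlier test:

# The spreadsheet-and-script eval: 30 labeled rows in a CSV is a real eval suite
import csv

with open("cases.csv") as f:
    cases = list(csv.DictReader(f))   # columns: input, required_fact

passed = 0
for case in cases:
    response = get_model_response(case["input"])   # your system under test
    if case["required_fact"].lower() in response.lower():
        passed += 1

print(f"{passed}/{len(cases)} cases passed")

That is enough to catch your first regression. Everything past it is refinement.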
