Building an Eval Harness for Your AI Features

At some point every team shipping an AI feature discovers the same uncomfortable truth: the model that worked beautifully last Tuesday behaves differently after a prompt tweak, a model upgrade, or a shift in the distribution of real user inputs. Without a structured evaluation harness, you find out about regressions the way you find out about most bad news: from a user complaint or a spike in support tickets.

This guide walks through building a practical, lightweight eval harness that you can run locally, in CI, or on a schedule. It is not about chasing benchmark numbers. It is about catching the regressions that matter to your specific product before they reach production.

Why “It Looks Good” Fails at Scale

Manual inspection has real value during early development. Reading model outputs trains your intuition and helps you write better prompts. But it does not scale, and it does not persist. The moment you stop reading every output, you have no signal. Three things make manual review break down:

Prompt changes compound. A change that improves tone might silently degrade factual accuracy. You rarely see both failure modes at once when you are only skimming.
Model updates are invisible. When you switch from one model to another, or when Anthropic updates an existing model, behavior can shift in subtle ways that casual inspection misses.
Context windows and edge cases multiply. Features that work on short inputs often fail on long documents, multilingual inputs, or adversarial phrasings. You cannot manually cover that space.

An eval harness gives you a repeatable, auditable signal. It does not replace human judgment. It frees your human judgment for the cases that actually need it.

The Core Anatomy of an Eval Harness

A minimal harness has four parts:

A test suite. A collection of inputs paired with expected behaviors or scoring criteria.
A runner. Code that sends each input through your actual production prompt and model, then collects outputs.
Scorers. Functions that measure whether each output meets your criteria.
A results store. Somewhere you log scores over time so you can track trends, not just pass/fail snapshots.

The most common mistake is skipping the results store and treating evals as one-off checks. Trend data is where the real value lives.

Building Your Test Suite

Start with real examples, not synthetic ones. Pull a sample of actual user inputs from your logs (anonymized appropriately), find the cases where your feature either performed well or poorly, and lock those in. Supplement with hand-crafted edge cases that represent known failure modes.

For each example, define what “good” looks like. There are three levels of specificity you can use:

Exact match. The output must contain a specific string or match a regex. Use this for structured outputs like JSON fields, citations, or numeric answers.
Rubric scoring. A set of criteria, each scored 0 to N. “Does the response stay on topic? Does it answer the user’s question? Does it avoid making up facts?” A human reviewer applies the rubric to each output, which is slow but gives you a trustworthy reference point.
Model-as-judge. You send the input and output to a capable model and ask it to evaluate against your rubric. This scales well but introduces its own variance, so use it alongside, not instead of, deterministic checks.

A practical starting point is 30 to 50 examples. That is small enough to review manually when you set it up, and large enough to give you a meaningful signal.
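A test suite with deterministic checks can be as simple as a list of dictionaries plus one scoring function. The sketch below is one possible shape, not a standard: the "input", "expected", and "metadata" keys follow the runner in this guide, while the "match" key and the exact_match_score function are hypothetical names introduced for illustration.

```python
import re

# Illustrative test cases. The "match" key is an assumption of this sketch:
# "contains" means substring check, "regex" means regular-expression search.
TEST_CASES = [
    {
        "input": "What is 17 + 25?",
        "expected": "42",
        "match": "contains",
        "metadata": {"source": "hand-crafted"},
    },
    {
        "input": "Return the order ID as JSON.",
        "expected": r'"order_id":\s*"\w+"',
        "match": "regex",
        "metadata": {"source": "hand-crafted"},
    },
]


def exact_match_score(case: dict, output: str) -> float:
    """Return 1.0 if the output passes the case's deterministic check, else 0.0."""
    expected = case.get("expected")
    if expected is None:
        raise ValueError("case has no 'expected' value to check against")
    if case.get("match", "contains") == "regex":
        return 1.0 if re.search(expected, output) else 0.0
    return 1.0 if expected in output else 0.0
```

Returning a float rather than a boolean keeps this scorer interchangeable with rubric scores later, so the aggregate can be a plain average.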

Writing the Runner

Your runner should call the same prompt and model configuration your production feature uses. Hardcoding a different model in your eval defeats the purpose. Below is a minimal Python runner using the anthropic SDK:

import anthropic
import json

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment

def run_eval(test_cases: list[dict], system_prompt: str, model: str) -> list[dict]:
    results = []
    for case in test_cases:
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_prompt,
            messages=[
                {"role": "user", "content": case["input"]}
            ]
        )
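        # Join every text block; content[0] is not guaranteed to be a text
        # block (for example, when thinking is enabled).
        output = "".join(
            block.text for block in response.content if block.type == "text"
        )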
        results.append({
            "input": case["input"],
            "expected": case.get("expected"),
            "output": output,
            "metadata": case.get("metadata", {})
        })
    return results

If your feature uses extended reasoning, pass the appropriate thinking configuration. On Claude 4.6 and later models (including claude-sonnet-4-6, claude-opus-4-8, claude-haiku-4-5, and claude-fable-5), use adaptive thinking rather than a fixed token budget:

response = client.messages.create(
    model="claude-fable-5",
    max_tokens=8000,
    thinking={"type": "adaptive"},
    system=system_prompt,
    messages=[
        {"role": "user", "content": case["input"]}
    ]
)

The model will determine the appropriate reasoning depth on its own. Do not pass a budget_tokens field on these models. It is no longer supported.

Choosing Scorers That Match Your Feature

Generic scorers give you generic signal. Invest time in scorers that are specific to what your feature is supposed to do.

For a classification feature, score by label accuracy. For a summarization feature, score by whether key facts from the source appear in the output. For a code generation feature, run the generated code and check whether it executes without errors and passes unit tests. For a customer support feature, score by whether the response identifies the user’s problem, offers a resolution path, and avoids hallucinated policy claims.
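Two of the scorers described above are simple enough to sketch directly. This is a minimal version under stated assumptions: label_accuracy and key_fact_coverage are hypothetical names, labels are compared case-insensitively, and "key facts" are matched as plain substrings, which is crude and will miss paraphrases.

```python
def label_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of predictions that equal the gold label (case-insensitive)."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("no examples to score")
    correct = sum(
        p.strip().lower() == g.strip().lower() for p, g in zip(predicted, gold)
    )
    return correct / len(gold)


def key_fact_coverage(summary: str, key_facts: list[str]) -> float:
    """Fraction of key facts that appear verbatim (case-insensitive) in the summary."""
    if not key_facts:
        raise ValueError("no key facts to check")
    text = summary.lower()
    found = sum(fact.lower() in text for fact in key_facts)
    return found / len(key_facts)
```

Substring matching works best for facts that have one canonical surface form, such as numbers, dates, and names. For facts that can be reworded, a rubric or judge-based check is likely a better fit.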

Model-as-judge scoring works well for nuanced rubrics. A practical pattern is to use a capable model like claude-fable-5 or claude-opus-4-8 as the judge, with a structured rubric in the system prompt and a reply constrained to JSON so that your harness can parse and validate every verdict. Be aware that judge models have their own biases, so calibrate your judge against a set of human-labeled examples before trusting it at scale.
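Calibration can start as a simple agreement rate between judge scores and human scores on the same examples. The sketch below assumes the judge replies with a JSON object containing a numeric "score" field; that reply shape, and the function names parse_judge_verdict and judge_agreement, are assumptions for illustration rather than anything the API prescribes.

```python
import json


def parse_judge_verdict(raw: str) -> dict:
    """Parse a judge reply assumed to look like {"score": 3, "reason": "..."}."""
    verdict = json.loads(raw)
    if not isinstance(verdict, dict) or "score" not in verdict:
        raise ValueError("judge reply is missing a 'score' field")
    return verdict


def judge_agreement(
    judge_scores: list[int], human_scores: list[int], tolerance: int = 0
) -> float:
    """Fraction of examples where judge and human scores differ by at most `tolerance`."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not human_scores:
        raise ValueError("no labeled examples to compare")
    agree = sum(
        abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores)
    )
    return agree / len(human_scores)
```

A low agreement rate usually means the rubric is ambiguous rather than that the judge is broken, so reading the disagreeing examples is a useful next step before changing the judge model.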

Storing and Comparing Results Over Time

A single eval run tells you where you are today. A series of runs tells you whether you are improving or regressing. At minimum, log each run with:

Timestamp
Model ID used
System prompt hash or version identifier
Per-example scores
Aggregate score

Even a simple append-only JSONL file or a SQLite database is sufficient to start. What matters is that you can query “how did my score change between prompt version A and prompt version B?” without relying on memory.
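An append-only JSONL log covering the fields listed above might look like the following. This is a sketch under assumptions: the function names, the record keys, and the choice of a truncated SHA-256 digest as the prompt identifier are all illustrative, and the aggregate is taken to be the plain mean of the per-example scores.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def prompt_hash(system_prompt: str) -> str:
    """Short, stable identifier for a prompt version (first 12 hex chars of SHA-256)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


def log_run(path: str, model: str, system_prompt: str, scores: list[float]) -> dict:
    """Append one eval run to a JSONL file and return the record that was written."""
    if not scores:
        raise ValueError("cannot log a run with no scores")
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt_hash": prompt_hash(system_prompt),
        "scores": scores,
        "aggregate": sum(scores) / len(scores),  # assumption: mean of scores
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(path: str) -> list[dict]:
    """Read every logged run back, oldest first."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Comparing prompt version A with prompt version B then reduces to filtering load_runs output by prompt_hash and comparing the aggregate values.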

Set a threshold. If your aggregate score drops more than a defined amount from baseline, fail the CI job or send an alert. The threshold is a product decision, not a technical one. A 2% regression in a medical information tool is a crisis. A 2% regression in a creative brainstorming tool might be acceptable.
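A threshold gate for CI can be a few lines. The sketch below assumes scores are fractions between 0 and 1 and that the allowed drop is measured in absolute score points (so 0.02 means two points), which is one reading of "a defined amount"; the function names are hypothetical.

```python
import sys


def within_threshold(baseline: float, current: float, max_drop: float) -> bool:
    """True if `current` has not fallen more than `max_drop` below `baseline`."""
    eps = 1e-9  # tolerate floating-point rounding exactly at the boundary
    return (baseline - current) <= max_drop + eps


def ci_exit_code(baseline: float, current: float, max_drop: float) -> int:
    """Return 0 to pass the CI job, or 1 (after printing a message) to fail it."""
    if within_threshold(baseline, current, max_drop):
        return 0
    print(
        f"Eval regression: {current:.3f} is more than {max_drop:.3f} "
        f"below baseline {baseline:.3f}",
        file=sys.stderr,
    )
    return 1
```

A CI script would typically end with sys.exit(ci_exit_code(baseline, current, max_drop)), with baseline read from the results store and max_drop set by whoever owns the product decision.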

Fitting Evals Into Your Development Workflow

Run your eval harness in three places:

Locally, before any prompt change ships. Make it a one-command script. If it is slow to run, you will skip it.
In CI, on every pull request that touches your prompt or model configuration. Gate merges on passing a minimum score threshold.
On a schedule against production. This catches model drift and distribution shift that your development suite cannot anticipate. A weekly or daily run against a sample of recent real inputs gives you a canary signal.

Keep your eval suite fast. If it takes more than a few minutes, engineers will route around it. The runner must still call your production model, but the scoring step does not have to. Use a smaller, cheaper model such as claude-haiku-4-5 for high-volume scoring tasks where throughput matters more than maximum capability, and reserve claude-fable-5 or claude-opus-4-8 for the judge role in complex rubric evaluation.
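Because each test case is an independent network call, running cases concurrently is often the cheapest speedup. The following is a minimal sketch using the standard library's thread pool; run_cases_concurrently and the run_one callback are hypothetical names, and the default of 8 workers is an arbitrary assumption that you would tune against your own rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_cases_concurrently(
    cases: list[dict],
    run_one: Callable[[dict], dict],
    max_workers: int = 8,
) -> list[dict]:
    """Run `run_one` on every case in parallel threads.

    Results come back in the same order as `cases`, because Executor.map
    preserves input order. An exception in any case propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, cases))
```

Here run_one would wrap a single model call and return one result dictionary. Threads suit this workload because the time is spent waiting on the network, not computing.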

Common Pitfalls to Avoid

Teaching to the test. If you tune your prompt specifically to pass your eval cases, your eval has lost its value. Maintain a held-out set that you never use during prompt development.
Evaluating the wrong thing. Measuring fluency when your users care about accuracy, or measuring response length when they care about actionability, gives you a false sense of quality.
Ignoring failure modes from production. Your eval suite should grow every time a real user hits a bug. Add a regression test for every incident.
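Non-determinism without accounting for variance. LLM outputs vary. Run each test case multiple times, or set temperature to 0 where your feature permits it, keeping in mind that a temperature of 0 reduces variation but does not guarantee identical outputs. Understand whether a score change is signal or noise.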
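One rough way to separate signal from noise is to repeat each configuration several times and compare the gap between mean scores with the spread within each set of runs. The sketch below is a crude heuristic, not a statistical test: the function names and the default multiplier k = 2.0 are assumptions chosen for illustration.

```python
import statistics


def summarize_trials(scores: list[float]) -> dict:
    """Mean, sample standard deviation, and count for repeated aggregate scores."""
    if not scores:
        raise ValueError("need at least one score")
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": statistics.mean(scores), "stdev": stdev, "n": len(scores)}


def is_likely_signal(
    baseline: list[float], candidate: list[float], k: float = 2.0
) -> bool:
    """Heuristic: the gap between means exceeds k times the larger stdev.

    If both sets have zero spread, any nonzero gap counts as signal.
    """
    b = summarize_trials(baseline)
    c = summarize_trials(candidate)
    gap = abs(b["mean"] - c["mean"])
    noise = max(b["stdev"], c["stdev"])
    return gap > k * noise
```

With only a handful of repeats the standard deviation itself is noisy, so treat a borderline result as a prompt to run more trials rather than as a verdict.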

Takeaway

An eval harness is not a research project. It is a maintenance tool, the same category as integration tests and monitoring dashboards. You do not need a complex framework to start. You need a small set of real examples, a runner that calls your actual production code, scorers that measure what your users actually care about, and a log that lets you see change over time.

Build the simplest version this week. Add to it every time you change a prompt, upgrade a model, or fix a production bug. Six months from now, that log of scores will be one of the most useful artifacts your team has.