Prompt Evaluation: How to Know If Your LLM Is Actually Working

Most LLM-powered features ship without any systematic evaluation. Teams write a prompt, test it on five examples they already know the answers to, and declare it done. Then, three weeks later, a customer reports that the output has changed — or was always wrong for a class of inputs nobody tested. The team digs in and realizes they can’t tell whether the regression is the prompt, a model version update from the provider, a change in retrieval, or something upstream in the pipeline. There’s no baseline to compare against. This is not a tooling problem. It’s a discipline problem. Evaluation is infrastructure, and skipping it means you’re flying blind.

Why LLM Evaluation Is Harder Than Regular Software Testing

Traditional unit tests are deterministic. Given input X, you expect output Y, and the test either passes or fails. LLM outputs are probabilistic. The same prompt can produce different outputs on different runs. More importantly, there often isn’t a single correct answer — there are better answers and worse answers, and quality exists on a spectrum.

This creates a fundamental mismatch with the standard software testing mental model. You can’t just write assert output == expected. What you’re actually trying to measure is whether the output is helpful, accurate, appropriately scoped, and on-brand — qualities that are inherently subjective.

The other complication: the model can change underneath you. Providers push updates to hosted models without always notifying customers. GPT-4 as of January behaves differently from GPT-4 as of July. Claude 3.5 Sonnet got updated mid-deployment cycle for many teams. Pinning a dated model snapshot, where your provider offers one, reduces the exposure but does not remove it, because snapshots are eventually retired. If your evals only run when you change your code, you won’t catch model drift. You need evals that run on a schedule.

None of this means evaluation is impossible. It means you need a more sophisticated approach than assert.

The Evaluation Stack

Golden Dataset

This is the foundation of any eval framework, and building it is the hardest part. A golden dataset is a curated set of inputs paired with expected outputs or quality labels. Without it, you have nothing to measure against.

The fastest way to build one: collect real inputs from production or a pilot. These are the actual questions or tasks your users bring, not the tidy examples you invented in a notebook. Real inputs are messier, more diverse, and surface edge cases that you would never think to include.

Once you have candidates, have domain experts label them. For a customer support bot, that’s experienced support agents. For a code generation tool, that’s senior engineers. The goal is to capture what “good” actually looks like for your specific use case, not what the model is capable of in general.

A few practical notes:

50 to 200 examples is often enough to get started. More is better, but 50 representative examples beat 500 that all look the same.
Include hard cases deliberately. If 90% of your eval set is easy, you’ll have high scores that don’t tell you much.
Version your dataset. As your system evolves, you’ll want to know which evals were added when.

Store it as a simple JSON or CSV file. This doesn’t need to be complex infrastructure at the start.

[
  {
    "id": "cs-001",
    "input": "How do I cancel my subscription?",
    "expected_output": "To cancel, go to Settings > Billing > Cancel Plan. You'll keep access until the end of your billing period.",
    "quality_label": "high",
    "notes": "Canonical cancellation question, should be direct and accurate"
  },
  {
    "id": "cs-002",
    "input": "i think i was charged twice last month??",
    "expected_output": null,
    "quality_label": null,
    "notes": "Ambiguous complaint — model should acknowledge, ask for more info, not guess"
  }
]
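A small loader keeps a file like this honest as it grows. The sketch below is one possible shape, not a required one: the function names are illustrative, the required keys simply mirror the example above, and the short content hash is an assumed way to record which dataset version an eval run used.

```python
import hashlib
import json

# Keys taken from the example file above; adjust to your own schema.
REQUIRED_KEYS = {"id", "input", "expected_output", "quality_label", "notes"}

def validate_dataset(examples: list) -> list:
    """Raise ValueError on missing keys or duplicate ids; return the examples unchanged."""
    seen_ids = set()
    for example in examples:
        missing = REQUIRED_KEYS - set(example)
        if missing:
            raise ValueError(f"{example.get('id', '<no id>')}: missing keys {sorted(missing)}")
        if example["id"] in seen_ids:
            raise ValueError(f"duplicate id: {example['id']}")
        seen_ids.add(example["id"])
    return examples

def dataset_fingerprint(examples: list) -> str:
    """Short content hash, so a report can record which dataset version it ran on."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def split_by_reference(examples: list) -> tuple:
    """Separate examples with a reference answer from rubric-only ones."""
    with_ref = [e for e in examples if e["expected_output"] is not None]
    rubric_only = [e for e in examples if e["expected_output"] is None]
    return with_ref, rubric_only

SAMPLE = json.loads("""[
  {"id": "cs-001", "input": "How do I cancel my subscription?",
   "expected_output": "To cancel, go to Settings > Billing > Cancel Plan.",
   "quality_label": "high", "notes": "Canonical cancellation question"},
  {"id": "cs-002", "input": "i think i was charged twice last month??",
   "expected_output": null, "quality_label": null,
   "notes": "Ambiguous complaint"}
]""")

examples = validate_dataset(SAMPLE)
with_ref, rubric_only = split_by_reference(examples)
print(len(with_ref), len(rubric_only), dataset_fingerprint(examples))
```

Examples without a reference answer can still be scored against a rubric, which is why they are split out rather than rejected.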

Automated Metrics

Once you have a dataset, you need a way to score outputs automatically. The options range from simple to sophisticated, and each has real tradeoffs.

BLEU and ROUGE were designed for machine translation and document summarization. They measure n-gram overlap between your output and a reference. For open-ended generation — answering questions, drafting content, generating explanations — they’re a poor signal. A response can have low BLEU but be excellent, and vice versa. Use them only if you have a specific reason to.
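To make the weakness concrete, here is a deliberately simplified unigram-overlap score in the spirit of ROUGE-1. It is not the official ROUGE implementation (no stemming, no punctuation handling), and the example sentences are invented for illustration. A correct paraphrase scores far below an answer that copies the reference wording but sends the user to the wrong menu.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over whitespace-separated tokens shared by candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "To cancel, go to Settings > Billing > Cancel Plan."
# Correct, but worded differently from the reference.
paraphrase = "Open your account settings, choose billing, and select the option that ends your plan."
# Nearly identical wording, but points to the wrong menu.
wrong = "To cancel, go to Settings > Profile > Cancel Plan."

print(f"paraphrase: {unigram_f1(paraphrase, reference):.2f}")
print(f"wrong menu: {unigram_f1(wrong, reference):.2f}")
```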

Exact match is useful in narrow circumstances: when your output is structured (JSON extraction, classification labels, entity recognition), exact match is a clean pass/fail signal. Don’t try to apply it to prose.
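For structured outputs, a minimal sketch of exact match might compare parsed values rather than raw strings, so that whitespace and key order do not cause false failures. The function names and sample values here are illustrative assumptions.

```python
import json

def exact_match_label(output: str, expected: str) -> bool:
    """Compare classification labels, ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()

def exact_match_json(output: str, expected: dict) -> bool:
    """Compare parsed JSON; output that is not valid JSON counts as a failure."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return parsed == expected

print(exact_match_label(" Billing \n", "billing"))
print(exact_match_json('{"plan": "pro", "seats": 5}', {"seats": 5, "plan": "pro"}))
print(exact_match_json("Sure! Here is the JSON you asked for.", {"seats": 5}))
```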

Semantic similarity using embeddings is a step up for open-ended outputs. Embed both the expected answer and the model’s output, then compute cosine similarity. This catches synonymous phrasing and paraphrasing that exact match misses. It won’t catch factual errors or hallucinations, but it’s a reasonable proxy for “is this answer in the same ballpark?” OpenAI’s text-embedding-3-small or Cohere’s embedding models work well here.

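from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def cosine_similarity(a, b) -> float:
    # Convert to arrays so plain Python lists are handled explicitly.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_similarity(text_a: str, text_b: str) -> float:
    # Embed both texts in one request to halve the number of API calls.
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[text_a, text_b]
    )
    vec_a = response.data[0].embedding
    vec_b = response.data[1].embedding
    return cosine_similarity(vec_a, vec_b)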

LLM-as-judge is currently the most practical approach for measuring subjective quality at scale. The pattern: send the original question, a reference answer (if you have one), and the model’s output to a stronger model (GPT-4o, Claude Sonnet), along with a scoring rubric. Ask the judge to rate on a 1–5 scale and return a brief justification.

JUDGE_PROMPT = """You are evaluating the quality of an AI assistant response.

Question: {question}
Reference answer: {reference}
Model response: {response}

Rate the model response on the following criteria:
- Accuracy: Is the information correct? (1-5)
- Completeness: Does it address the question fully? (1-5)
- Conciseness: Is it appropriately brief without being unhelpful? (1-5)

Return JSON: {{"accuracy": N, "completeness": N, "conciseness": N, "reasoning": "..."}}
"""

The catch: LLM judges can be biased toward longer outputs, toward their own style, and toward confident-sounding text even when it’s wrong. Calibrate your judge by running it against examples where you already know the right score.
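Before a judge's scores can be compared against known ones, its replies have to be parsed and checked. The sketch below assumes a `call_judge` function (hypothetical, not part of any library) that sends a prompt to whichever judge model you use and returns its raw text. The prompt is a shortened stand-in for the one above, and `fake_judge` exists only so the example runs without an API key.

```python
import json
from typing import Callable, Optional

# Shortened stand-in for the judge prompt above; same placeholders and output keys.
JUDGE_PROMPT = """Question: {question}
Reference answer: {reference}
Model response: {response}

Rate accuracy, completeness, and conciseness from 1 to 5.
Return JSON: {{"accuracy": N, "completeness": N, "conciseness": N, "reasoning": "..."}}
"""

CRITERIA = ("accuracy", "completeness", "conciseness")

def parse_judge_reply(raw: str) -> dict:
    """Parse the judge's JSON and reject anything outside the 1-5 rubric."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    for name in CRITERIA:
        score = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid {name} score: {score!r}")
    return data

def judge_response(question: str, reference: Optional[str], response: str,
                   call_judge: Callable[[str], str]) -> dict:
    """Fill in the prompt, call the judge, and return validated scores."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference or "(none)", response=response
    )
    return parse_judge_reply(call_judge(prompt))

def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same scores."""
    return json.dumps({"accuracy": 5, "completeness": 4, "conciseness": 5, "reasoning": "stub"})

scores = judge_response(
    "How do I cancel my subscription?", None,
    "Go to Settings > Billing > Cancel Plan.", fake_judge,
)
print(scores)
```

Rejecting malformed replies outright, rather than guessing a score, keeps a flaky judge from silently skewing the pass rate.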

RAGAS for RAG Systems

If you’re building retrieval-augmented generation, you need evals that cover both the retrieval and the generation separately. RAGAS is a library designed for this.

The four core metrics:

Faithfulness measures whether the model’s answer is grounded in the retrieved context. A score of 1.0 means every claim in the answer can be traced back to the retrieved documents. Low faithfulness = the model is hallucinating or going beyond the provided context.

Answer relevance measures whether the answer actually addresses the question asked. You can retrieve perfect context and still get a tangential answer.

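Context precision measures whether the retrieved chunks are relevant to the question. In RAGAS it is also rank-aware: relevant chunks near the top of the retrieved list score better than relevant chunks buried under irrelevant ones. High precision means the retriever is returning useful material; low precision means you’re flooding the model with noise.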

Context recall (requires reference answers) measures whether the retrieved context contains the information needed to answer the question. Low recall means your retriever is missing relevant documents.

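# Written against the RAGAS 0.1-style API. Later releases changed the expected
# column names and dataset classes, so check the docs for your installed version.
# evaluate() calls an LLM and an embedding model (OpenAI by default), so the
# relevant API key must be set in the environment.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

data = {
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are processed within 5-7 business days."],
    # One list of retrieved chunks per question.
    "contexts": [["Our refund policy states that all eligible refunds are processed within 5-7 business days of approval."]],
    "ground_truth": ["Refunds are processed within 5-7 business days of approval."]
}

dataset = Dataset.from_dict(data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)  # one aggregate score per metric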

RAGAS gives you a concrete number for each dimension, which makes it much easier to diagnose where the RAG pipeline is failing. If faithfulness drops, look at your generation. If context precision drops, look at your retriever.

Human Evaluation

Automated metrics are proxies. Human evaluation is ground truth — expensive, slow, but irreplaceable in two specific situations.

First: establishing your golden dataset. The quality of your automated evals is bounded by the quality of your labels. If the labels are wrong, the evals are wrong. Human judgment is the foundation.

Second: calibrating your automated metrics. Before you trust your LLM judge, check whether its scores correlate with how actual humans rate the same outputs. If they don’t correlate, your automated evals are giving you noise.
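One way to check that correlation is Spearman's rank correlation, which suits ordinal 1-5 scores because it compares orderings rather than raw values. The choice of Spearman is an assumption, not something the workflow requires, and the scores below are invented for illustration. This version uses only the standard library and gives tied scores their average rank.

```python
def average_ranks(values: list) -> list:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tie_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tie_rank
        i = j + 1
    return ranks

def pearson(x: list, y: list) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one rater gives constant scores")
    return cov / (var_x * var_y) ** 0.5

def spearman(x: list, y: list) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length, at least 2 items each")
    return pearson(average_ranks(x), average_ranks(y))

# Invented scores for the same eight outputs.
human_scores = [5, 4, 2, 1, 3, 4, 2, 5]
judge_scores = [5, 5, 2, 2, 3, 4, 1, 4]
print(f"Spearman correlation: {spearman(human_scores, judge_scores):.2f}")
```

A value near 1 means the judge orders outputs much as the human raters do. What counts as acceptable agreement is a judgment call for your task.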

For ongoing monitoring at scale, A/B testing in production is more practical than offline human eval. Route a share of traffic to each version and compare how users respond, using explicit feedback (thumbs up/down) and implicit signals (edit rate, follow-up questions). This is what many mature teams move to once they have enough traffic.
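With feedback counts in hand, the remaining question is whether a difference between two versions is more than noise. The source does not prescribe a statistical test. A two-proportion z-test is one common choice and is sketched here under that assumption, with invented counts.

```python
import math

def two_proportion_z_test(success_a: int, total_a: int,
                          success_b: int, total_b: int) -> tuple:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("test is undefined when every outcome is identical")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: thumbs-up responses out of rated outputs for each prompt version.
z, p = two_proportion_z_test(success_a=120, total_a=1000, success_b=150, total_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The normal approximation behind this test is only reasonable with a decent number of ratings per version, so treat results from small samples with caution.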

Running Evals in CI/CD

Treat prompt changes the same way you treat code changes. Every prompt file should live in version control. Every merge should trigger an eval run. If quality drops below a threshold, block the merge.

A minimal CI setup:

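// eval-runner.ts
// Assumes ./eval-framework exports runEvals(), resolving to one { score } (1-5) per example,
// and that resolveJsonModule is enabled in tsconfig.json for the JSON import.
import { runEvals } from "./eval-framework";
import goldenDataset from "./golden-dataset.json";

const QUALITY_THRESHOLD = 0.75; // 75% of examples must pass
const PASSING_SCORE = 3; // minimum judge score for an example to count as a pass

async function main() {
  const results = await runEvals(goldenDataset);
  if (results.length === 0) {
    // Without this guard the pass rate is NaN, and NaN < threshold is false, so the job would pass.
    throw new Error("No eval results returned.");
  }
  const passRate = results.filter(r => r.score >= PASSING_SCORE).length / results.length;

  console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
  console.log(`Threshold: ${(QUALITY_THRESHOLD * 100).toFixed(1)}%`);

  if (passRate < QUALITY_THRESHOLD) {
    console.error("Eval suite failed. Blocking merge.");
    process.exit(1);
  }

  console.log("Eval suite passed.");
  process.exit(0);
}

// Fail the CI job if the run itself errors (API outage, malformed dataset).
main().catch((err) => {
  console.error(err);
  process.exit(1);
});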

Set this up as a GitHub Actions job or integrate it into your existing CI pipeline. The key is that it runs automatically — you shouldn’t need to remember to run evals manually.

One practical note: LLM-as-judge evals have API costs. Keep your CI eval set lean (50-100 examples) and run the full dataset less frequently, perhaps nightly or on release branches.
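One way to carve out a lean CI subset is a seeded, stratified sample, so the subset stays stable between runs and hard cases are not dropped by chance. The sketch assumes each example carries a `difficulty` field, which is hypothetical and not part of the dataset format shown earlier.

```python
import random
from collections import defaultdict

def ci_subset(examples: list, size: int, key: str = "difficulty", seed: int = 0) -> list:
    """Pick roughly `size` examples, keeping each group's share of the full dataset."""
    rng = random.Random(seed)  # fixed seed: the same subset on every CI run
    groups = defaultdict(list)
    for example in examples:
        groups[example.get(key, "unknown")].append(example)

    subset = []
    for name in sorted(groups):
        members = groups[name]
        # At least one example per group, otherwise proportional to group size.
        share = max(1, round(size * len(members) / len(examples)))
        subset.extend(rng.sample(members, min(share, len(members))))
    # Rounding can overshoot or undershoot slightly; truncate any overshoot.
    return subset[:size]

# Invented dataset: 10 hard examples and 90 easy ones.
full = [{"id": f"ex-{i:03d}", "difficulty": "hard" if i < 10 else "easy"} for i in range(100)]
lean = ci_subset(full, size=20)
print(len(lean), sum(e["difficulty"] == "hard" for e in lean))
```

Changing the seed changes the subset, which makes scores from before and after the change hard to compare, so treat the seed as part of the dataset version.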

Pairwise Evaluation

Absolute scoring is hard. Asking a human (or an LLM judge) to rate an output on a 1–5 scale produces inconsistent results across raters and across sessions.

Pairwise evaluation sidesteps this. Instead of scoring output A independently, you show the judge both output A and output B and ask which is better. Relative comparisons are more reliable than absolute scores. This is the approach used by LMSYS Chatbot Arena for evaluating models at scale, and it works equally well for evaluating prompt versions.

PAIRWISE_PROMPT = """Compare two AI assistant responses to the same question.

Question: {question}
Response A: {response_a}
Response B: {response_b}

Which response is better? Consider accuracy, helpfulness, and conciseness.
Return JSON: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}
"""

Use pairwise evaluation when you’re deciding between two prompt variants, comparing model versions, or trying to understand whether a change is actually an improvement. It’s particularly useful for prompt engineering: write two versions, run pairwise eval on your golden dataset, pick the winner.
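LLM judges can also favor a response because of where it appears in the prompt. A common mitigation, sketched here, is to ask twice with the positions swapped and only count a win when both orderings agree. `call_judge` is a hypothetical function that sends a prompt to your judge model and returns its raw text. The prompt is a shortened stand-in for the one above, and `prefers_longer` is a toy judge so the example runs offline.

```python
import json
from typing import Callable

# Shortened stand-in for the pairwise prompt above.
PAIRWISE_PROMPT = """Question: {question}
Response A: {response_a}
Response B: {response_b}

Which response is better?
Return JSON: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}
"""

def judge_pair(question: str, first: str, second: str,
               call_judge: Callable[[str], str]) -> str:
    """Ask the judge once, with `first` shown as A and `second` shown as B."""
    prompt = PAIRWISE_PROMPT.format(question=question, response_a=first, response_b=second)
    winner = json.loads(call_judge(prompt))["winner"]
    if winner not in {"A", "B", "tie"}:
        raise ValueError(f"unexpected winner: {winner!r}")
    return winner

def compare(question: str, old: str, new: str, call_judge: Callable[[str], str]) -> str:
    """Return 'old', 'new', or 'tie' after judging both orderings."""
    first_pass = judge_pair(question, old, new, call_judge)   # old shown as A
    second_pass = judge_pair(question, new, old, call_judge)  # old shown as B
    if first_pass == "A" and second_pass == "B":
        return "old"
    if first_pass == "B" and second_pass == "A":
        return "new"
    # Ties and verdicts that flip with position count as no clear preference.
    return "tie"

def prefers_longer(prompt: str) -> str:
    """Toy judge for the demo: picks whichever single-line response is longer."""
    a = prompt.split("Response A: ")[1].split("\nResponse B: ")[0]
    b = prompt.split("Response B: ")[1].split("\n")[0]
    return json.dumps({"winner": "A" if len(a) > len(b) else "B", "reasoning": "stub"})

print(compare("How do I reset my password?", "Check settings.",
              "Go to Settings > Security and choose Reset Password.", prefers_longer))
```

A judge that always answers "A" regardless of content produces a tie under this scheme, which is the intended behavior.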

Evaluation Tooling

You don’t have to build this infrastructure from scratch.

Promptfoo is open source and uses YAML-based eval configs. You define test cases, providers, and assertions in a config file, and it handles running the evals and reporting results. Good LLM-as-judge support. Works well if you want something self-hosted and want to keep evals close to your codebase.

# promptfooconfig.yaml
providers:
  - openai:gpt-4o-mini
prompts:
  - "Answer this customer question concisely: {{question}}"
tests:
  - vars:
      question: "How do I reset my password?"
    assert:
      - type: llm-rubric
        value: "Response should include a clear step-by-step process"
      - type: javascript
        value: "output.length < 500"

LangSmith integrates tracing and evaluation in one platform. If you’re already using LangChain, it’s the path of least resistance. You can log traces in production and run evals against them directly. Dataset management is built in.

Braintrust is a dedicated eval platform with strong dataset management and a clean UI for reviewing results. Good for teams that want to give non-engineers visibility into eval results. Has a hosted offering and supports LLM-as-judge natively.

Langfuse is an open source alternative to LangSmith. Self-hostable, supports tracing and evals, and has a growing integration ecosystem. Worth considering if you have data residency requirements or want to avoid vendor lock-in.

None of these tools will make evaluation easy — that’s the wrong expectation. They make it less annoying to run consistently.

Regression Testing When Models Update

This is the thing most teams don’t think about until it happens to them. Your prompt that worked perfectly in March may behave differently in August because the underlying model changed. Providers update hosted models on their own schedules, and the behavior changes can be subtle — slightly different formatting, different edge case handling, different propensity to refuse requests.

Your eval suite needs to run on a schedule, not just on code changes. A nightly or weekly automated run against your golden dataset catches model drift before it reaches a customer.

Set up alerting: if your pass rate drops by more than a few percentage points compared to the previous run, you want to know. Choose the margin so it sits above the normal run-to-run variation of your suite; otherwise the alert fires on noise. This is not a complex system. A scheduled CI job that compares the current score to a stored baseline and fires a Slack alert on regression is sufficient.
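The baseline comparison can be sketched in a few lines. The file layout, function names, and the default margin of 3 percentage points are assumptions for illustration, and the alert is a plain print that you would replace with your own notification call.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

def load_baseline(path: Path) -> Optional[float]:
    """Return the stored pass rate, or None on the first run."""
    if not path.exists():
        return None
    return float(json.loads(path.read_text())["pass_rate"])

def is_regression(current: float, baseline: Optional[float], max_drop: float = 0.03) -> bool:
    """True when the pass rate fell by more than max_drop (0.03 = 3 percentage points)."""
    return baseline is not None and baseline - current > max_drop

def check_and_record(current: float, path: Path, max_drop: float = 0.03) -> bool:
    """Compare against the stored baseline, alert on regression, otherwise update it."""
    baseline = load_baseline(path)
    regressed = is_regression(current, baseline, max_drop)
    if regressed:
        # Replace this print with your alerting call (for example a Slack webhook).
        print(f"ALERT: pass rate fell from {baseline:.1%} to {current:.1%}")
    else:
        # Only move the baseline when there is no regression, so a bad run
        # cannot quietly become the new normal.
        path.write_text(json.dumps({"pass_rate": current}))
    return regressed

baseline_file = Path(tempfile.mkdtemp()) / "baseline.json"
print(check_and_record(0.90, baseline_file))  # first run: stores the baseline
print(check_and_record(0.84, baseline_file))  # drop larger than the margin
```

Because the baseline moves with every passing run, a slow decline in steps smaller than the margin would go unnoticed. Comparing against a fixed reference run as well is one way to cover that case.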

What to Do Next

If you have an LLM feature in production and no eval framework, here is how to start:

1. Define what “good” means for your specific task. Not in abstract terms: write down three to five criteria with examples. For a summarization feature, that might be: accurate, covers the main points, under 200 words, no hallucinations. This is the hardest step because it forces clarity about what you actually want.
2. Collect 50 real examples. Pull them from logs, customer support tickets, or a pilot group. Label 20 of them yourself (or with a domain expert) to establish a baseline. The other 30 can stay unlabeled: useful for testing, but not for calibration.
3. Write an LLM-as-judge prompt. Base it on the criteria you defined in step one. Test it on the 20 labeled examples and check whether the judge’s scores match your labels. Adjust the rubric until there’s reasonable agreement.
4. Put it in a script you can re-run. A single Python or TypeScript file that reads your dataset, calls the judge, and prints a pass rate. Run it manually before any prompt change. This is your v1.
5. Schedule it. Once the script runs reliably, wire it into CI or a cron job. This is when it becomes infrastructure rather than a one-off tool.
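The re-runnable script from the steps above can be very small. In this sketch `generate` and `judge` are placeholders for your own model call and judge call. The stub versions exist only so the file runs as written, and the passing score of 3 is an assumed cutoff on a 1-5 scale.

```python
from typing import Callable, Optional

def run_evals(examples: list,
              generate: Callable[[str], str],
              judge: Callable[[str, Optional[str], str], int],
              pass_score: int = 3) -> float:
    """Return the fraction of examples whose judge score is at least pass_score."""
    if not examples:
        raise ValueError("dataset is empty")
    passed = 0
    for example in examples:
        output = generate(example["input"])
        score = judge(example["input"], example.get("expected_output"), output)
        if score >= pass_score:
            passed += 1
    return passed / len(examples)

def stub_generate(user_input: str) -> str:
    """Placeholder for the call to the model under test."""
    return f"(model output for: {user_input})"

def stub_judge(user_input: str, reference: Optional[str], output: str) -> int:
    """Placeholder for the LLM judge; returns a 1-5 score."""
    return 4 if reference is not None else 2

examples = [
    {"input": "How do I cancel my subscription?",
     "expected_output": "To cancel, go to Settings > Billing > Cancel Plan."},
    {"input": "i think i was charged twice last month??", "expected_output": None},
]

pass_rate = run_evals(examples, stub_generate, stub_judge)
print(f"Pass rate: {pass_rate:.1%}")
```

Keeping the model and judge as injected functions means the same loop works unchanged when you swap providers or move the script into CI.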

That’s a working eval framework. It’s not complete — you’ll want to add more examples, refine the judge prompt, and eventually add pairwise comparisons and RAGAS metrics if you’re using retrieval. But the foundation is there, and you can iterate from a position of measurement rather than guesswork.

The teams that ship reliable LLM features are not the ones with the best prompts. They’re the ones who know when their prompts stop working.