13 Evaluation — Measuring What Matters
Who this chapter is for: Mid Level → FDE
What you’ll be able to answer after reading this:
- Why automated benchmarks are insufficient for production LLM evaluation
- How LLM-as-judge evaluation works and its biases
- How to build a custom eval harness for a specific use case
- Key metrics for RAG, generation quality, and safety
13.1 Why Evaluation Is Hard
Evaluating LLMs is fundamentally different from evaluating traditional ML models, and the difference runs deeper than most practitioners initially appreciate. For a binary classifier, evaluation is unambiguous: you have a test set with ground truth labels, you measure accuracy, precision, and recall, and the numbers mean something concrete. For an LLM generating open-ended text, none of this infrastructure translates cleanly. For any given prompt, there are potentially thousands of valid, high-quality responses — different phrasings, different structures, different levels of detail, all of which a human would judge as equally good or better. The space of possible inputs is effectively infinite, and the space of valid outputs for each input is equally unbounded. There is no ground truth in the classifier sense, only human judgments that are noisy, context-dependent, and sometimes inconsistent.
Three specific failure modes make benchmark-based evaluation unreliable for predicting production performance. Benchmark contamination is the first: the pretraining corpus for modern LLMs includes large fractions of the public internet, which contains answer keys, study guides, and forum discussions for nearly every popular academic benchmark. MMLU questions, for instance, appear verbatim on educational websites, and models trained on web crawls have likely seen these questions and their answers before evaluation. This means a model’s MMLU score partly measures how much of the MMLU question set appeared in its pretraining corpus, not just its general reasoning ability. The proxy gap is the second failure mode: benchmarks measure performance on proxy tasks (multiple-choice questions, sentence completion) that are designed to be easily automated and reliably scored, but performance on these proxy tasks is often a poor predictor of real production performance. A model that scores 5% higher on MMLU may produce meaningfully worse outputs on your specific customer support use case. Goodhart’s Law — “when a measure becomes a target, it ceases to be a good measure” — is the third and most insidious problem: once a benchmark score becomes a target metric for model development, teams post-train models specifically to score well on that benchmark without improving the underlying capabilities it was meant to measure. This has occurred visibly with popular benchmarks, whose leaderboards have become increasingly unreliable predictors of real-world model quality.
The practical implication of these three problems is direct: no standard benchmark tells you whether a model is good for your use case. Leaderboard rankings are a useful starting point for identifying candidate models to evaluate, but they cannot replace evaluation on your distribution. Building a proprietary evaluation dataset drawn from your actual production traffic is not a nice-to-have — it is the minimum viable evaluation infrastructure for a production LLM system. Teams that deploy models based solely on public benchmark scores, without evaluating on their own data, are making a bet that their use case is well-represented by those benchmarks. For most specialized applications, that bet is wrong.
13.2 Standard Benchmarks and What They Actually Test
Understanding what standard benchmarks actually measure — rather than what their names imply — is essential for using them correctly. MMLU (Massive Multitask Language Understanding, Hendrycks et al., 2021) covers 57 subject areas with multiple-choice questions, testing knowledge breadth across professional domains from law and medicine to elementary mathematics and world history. What MMLU actually tests: knowledge recall in a format that makes it easy to score but that bears little resemblance to how knowledge is used in production tasks. The multiple-choice format means a model can score well by eliminating obviously wrong answers rather than by genuinely understanding the domain. MMLU is most useful as a rough measure of general knowledge breadth and a check for catastrophic forgetting after fine-tuning — if your fine-tuned model scores 10 percentage points lower on MMLU than the base model, that is a meaningful signal of capability degradation.
HellaSwag tests commonsense reasoning through sentence completion: given the beginning of an activity description, choose the most plausible continuation from four options. The key challenge is that wrong answers are designed to be semantically plausible but physically or socially incoherent (“the chef added the ingredients to the bowl and then flew out the window”). Modern large models score 95%+ on HellaSwag, making it close to a ceiling benchmark for models above 7B parameters — it is better suited for evaluating smaller models where there is still room to discriminate. HumanEval tests code generation: given a function signature and docstring, generate a working implementation that passes test cases. The pass@k metric (probability that at least one of k generated samples passes all tests) accounts for the stochastic nature of code generation. HumanEval is a genuinely useful benchmark for code-capable models because it tests functional correctness rather than surface quality.
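The pass@k metric has a standard unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass, and compute the probability that a random size-k subset contains at least one passing sample. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per problem
    c: samples that passed all tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 30, 1))   # ≈ 0.15
print(pass_at_k(200, 30, 10))  # ≈ 0.81
```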
MT-Bench and the LMSYS Chatbot Arena represent a different evaluation philosophy: testing multi-turn conversational quality using either GPT-4 as an automated judge or real human preference votes. MT-Bench consists of 80 carefully designed multi-turn questions spanning reasoning, coding, writing, and role-playing, with GPT-4 scoring responses on a 1-10 scale. Chatbot Arena is a live evaluation platform where users chat with two anonymous models simultaneously and vote for the better response, producing Elo-style rankings from millions of comparison votes. Chatbot Arena is arguably the most honest public benchmark for general chat quality precisely because real humans are expressing genuine preferences on real tasks they chose to work on — the evaluation distribution matches the actual deployment distribution for a general-purpose chat model. The limitation is that Chatbot Arena favors models that are good at the tasks self-selected users bring, which may not match your specific use case. The key principle underlying both: evaluation is only as good as its alignment with the real task distribution.
13.3 LLM-as-Judge
LLM-as-judge is the practice of using a strong language model — typically GPT-4, Claude, or Gemini — to evaluate the outputs of another model against a scoring rubric. The appeal is obvious: automated evaluation that scales infinitely and costs far less than human annotation. The typical format is a prompt to the judge model containing: the original user prompt, the model response to be evaluated, an explicit rubric specifying what to score and how (for instance, rate helpfulness on a 1-5 scale where 5 means the response fully addressed the user’s question with no unnecessary content), and optionally a reference response or reference documents for grounding. The judge model returns a score, and optionally an explanation of the score. At scale, this provides continuous quality monitoring, automatic regression detection on prompt changes, and comparative evaluation between models or prompt variants.
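A minimal sketch of that format, assuming a hypothetical `call_llm` helper standing in for whatever API client you use (the rubric wording here is illustrative):

```python
# Minimal LLM-as-judge sketch. `call_llm` is a hypothetical stand-in for your
# actual API client (OpenAI, Anthropic, etc.); the rubric wording is illustrative.
import json

JUDGE_PROMPT = """You are evaluating a model response against a rubric.

User prompt:
{prompt}

Model response:
{response}

Rubric: rate helpfulness on a 1-5 scale, where 5 means the response fully
addressed the user's question with no unnecessary content, and 1 means it
did not address the question at all.

Return JSON: {{"score": <1-5>, "explanation": "<one sentence>"}}"""

def judge(prompt: str, response: str, call_llm) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(prompt=prompt, response=response))
    return json.loads(raw)  # production code should validate and retry on parse failures
```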
LLM-as-judge evaluation has been extensively studied and several well-documented biases make it unreliable if not carefully controlled. Positional bias is the most studied: when comparing two responses, the judge model systematically favors the response presented first — in controlled experiments, response A wins approximately 55-60% of the time when presented first, even when human raters assess A and B as equally good. This is an artifact of the attention mechanism’s sensitivity to early-context tokens. Mitigation: always evaluate pairwise comparisons in both orders (A vs. B and B vs. A) and use only cases where both orderings agree, or average the scores. Verbosity bias is the second major bias: longer responses tend to receive higher scores from LLM judges even when the additional length adds no value. This reflects the pretraining distribution — detailed, comprehensive responses are positively associated with quality in human-written text, so models learn this correlation. Mitigation: explicitly include length appropriateness in the rubric, penalize unnecessary verbosity, and sanity-check by manually reviewing the highest-scoring responses.
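The both-orders mitigation is simple to implement. A sketch, assuming a hypothetical `judge_pair` helper that asks the judge which of two responses is better:

```python
# Both-orders mitigation for positional bias. `judge_pair(prompt, first, second)`
# is a hypothetical helper that returns "first" or "second".

def debiased_compare(prompt, resp_a, resp_b, judge_pair) -> str:
    a_wins_when_first = judge_pair(prompt, resp_a, resp_b) == "first"
    a_wins_when_second = judge_pair(prompt, resp_b, resp_a) == "second"

    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # orderings disagree: treat the preference as positional noise
```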
Self-preference bias occurs when a judge model rates responses from a model in its own family higher than responses from other families. GPT-4 rates GPT-4-generated responses higher on average; Claude rates Claude-generated responses higher. The magnitude varies by task and model, but it is measurable and consistent. The practical implication: use a judge model from a different family than the model you are evaluating when possible. Format bias is the fourth well-documented bias: responses with formatting elements (bullet points, numbered lists, bold headers) receive higher scores from LLM judges than unformatted prose responses of equivalent content quality. Users do not always prefer formatted responses, and for conversational or creative tasks, formatting can be actively worse. Mitigation: in your rubric, explicitly address format appropriateness rather than allowing the judge to penalize or reward format generically. Calibration against human annotations — periodically sampling a subset of judge-scored examples and having humans score them, then measuring agreement — is the best ongoing defense against judge bias accumulating over time.
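One lightweight way to run that calibration is rank correlation between judge scores and human scores on the sampled subset; the threshold below is illustrative, not a standard value:

```python
# Periodic judge calibration: rank agreement between judge and human scores
# on the same sampled examples. Five items shown; use 100-200 in practice.
from scipy.stats import spearmanr

judge_scores = [5, 4, 2, 4, 1]
human_scores = [4, 4, 2, 3, 1]

rho, _ = spearmanr(judge_scores, human_scores)
if rho < 0.6:  # illustrative threshold, not a standard value
    print(f"Weak judge-human agreement (rho={rho:.2f}); revisit rubric or judge model.")
```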
13.4 RAG-Specific Metrics (RAGAS Framework)
Standard text generation metrics fail to capture the specific quality dimensions that matter in retrieval-augmented generation systems. BLEU and ROUGE measure n-gram overlap against a reference response — they say nothing about whether the response is factually grounded in the retrieved documents. Perplexity measures how well the model predicted the reference text — irrelevant for evaluating whether retrieved context was used correctly. RAGAS (Retrieval Augmented Generation Assessment) provides a framework of four metrics specifically designed to diagnose quality at each stage of the RAG pipeline: retrieval and generation, each evaluated independently. This decomposition is the key diagnostic insight — it lets you identify whether a quality problem originates in the retriever or the generator.
Faithfulness measures whether every factual claim in the generated answer is supported by the retrieved context. The evaluation process decomposes the answer into individual factual claims and checks each against the retrieved documents. A faithfulness score of 0.7 means approximately 70% of claims in the answer are grounded in the context; 30% are either fabricated or drawn from the model’s parametric knowledge (which may or may not be correct). Faithfulness is the most important metric for applications where factual accuracy is critical — legal, medical, financial — because high faithfulness means the model is using the context rather than generating from potentially outdated or incorrect internal knowledge. A faithfulness failure mode: the model retrieves a highly relevant document and then ignores it, generating an answer based on its parametric knowledge. This produces high-confidence wrong answers that are difficult to detect without explicit faithfulness measurement.
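A sketch of that claim-decomposition process, with `extract_claims` and `is_supported` as hypothetical LLM-backed helpers (RAGAS implements both steps internally via LLM prompts):

```python
# Claim-decomposition faithfulness scoring. `extract_claims` and `is_supported`
# are hypothetical LLM-backed helpers: one splits an answer into atomic factual
# claims, the other verifies a single claim against the retrieved context.

def faithfulness_score(answer: str, context: str, extract_claims, is_supported) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 1.0  # nothing factual to verify
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)  # 0.7 => 30% of claims are ungrounded
```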
Answer Relevance measures whether the generated answer actually addresses the question that was asked. An answer can be perfectly faithful to the retrieved context but still fail to be relevant if the model answered a different question than the one posed — “I can tell you about our return policy, which covers items purchased within 30 days” is a faithful response but irrelevant if the user asked about exchange policy. Answer relevance is computed by checking semantic similarity between the generated answer and the original question, typically using embeddings. A failure mode that causes answer relevance to degrade: retrievers that surface documents topically related but not specifically relevant to the exact question, combined with generators that answer the tangential question implicitly present in those documents rather than the user’s actual question.
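A sketch of the embedding-similarity check using sentence-transformers (the model choice is illustrative; RAGAS's own implementation is a variant that generates candidate questions from the answer and compares those to the original question):

```python
# Embedding-similarity relevance check using sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def answer_relevance(question: str, answer: str) -> float:
    q_emb = model.encode(question, convert_to_tensor=True)
    a_emb = model.encode(answer, convert_to_tensor=True)
    return util.cos_sim(q_emb, a_emb).item()
```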
Context Precision measures what fraction of the retrieved context chunks are actually relevant to answering the question. If you retrieve five chunks and only one is relevant, context precision is 0.2. Low context precision means you are filling the model’s context window with noise, which can distract the model, increase latency, and increase cost. It typically indicates that the retriever is not sufficiently discriminative — it is returning documents from the right general topic area but not filtering down to the specific, relevant passages. Context Recall measures whether the retrieved context contains all the information necessary to answer the question completely. A context recall failure means the retriever missed a document that was essential for a complete answer — the model may produce a partial or incorrect answer simply because the relevant context was not provided. Context recall failures are hard to attribute to the generator, but they still produce bad user outcomes. Together, precision and recall provide the full picture of retriever quality: you want both high precision (little noise) and high recall (nothing essential missing).
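In practice you rarely hand-roll these: the ragas package computes all four metrics from a dataset of (question, answer, contexts, ground_truth) rows. A sketch against the 0.1-era interface (the exact API varies across versions):

```python
# Running all four metrics with the ragas package. API details vary across
# ragas versions; this follows the 0.1-era interface.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

data = {
    "question": ["What is the return window?"],
    "answer": ["Items can be returned within 30 days of purchase."],
    "contexts": [["Our return policy allows returns within 30 days."]],
    "ground_truth": ["Returns are accepted within 30 days of purchase."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores, e.g. {'faithfulness': 1.0, ...}
```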
13.5 Building a Production Eval Pipeline
The foundation of a production evaluation system is a golden test set: a curated collection of representative inputs with expected outputs or quality criteria. How you build this set determines how predictive it is of production quality. Start by sampling from your actual production traffic — if you have logs, draw examples randomly, stratified by query category to ensure representation across your full input distribution. If you are pre-launch, generate synthetic examples that faithfully represent what real users will ask. Include not just common cases but explicitly add hard cases (queries the model is likely to struggle with), edge cases (unusual inputs that expose boundary behaviors), adversarial cases (prompts designed to trigger failure modes), and regression cases (specific inputs where previous model versions failed and you want to ensure the fix holds). A golden test set of 200-500 examples, carefully curated to be diverse and representative, is typically more valuable than 5,000 uniformly sampled examples — the quality of the signal matters more than the quantity.
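A sketch of the stratified-sampling step, assuming log records carry a category field (the schema is an assumption; curation of hard, edge, adversarial, and regression cases still happens by hand afterward):

```python
# Stratified sampling from production logs to seed a golden test set.
# The log schema (dicts with a "category" field) is an assumption.
import random
from collections import defaultdict

def stratified_sample(logs: list[dict], per_category: int, seed: int = 0) -> list[dict]:
    random.seed(seed)
    by_category = defaultdict(list)
    for record in logs:
        by_category[record["category"]].append(record)
    sample = []
    for records in by_category.values():
        sample.extend(random.sample(records, min(per_category, len(records))))
    return sample  # then hand-add hard, edge, adversarial, and regression cases
```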
Define your metrics before you build the eval pipeline, not after. This discipline prevents post-hoc rationalization where you select metrics that make your model look good. For each aspect of quality your application cares about, specify a measurable operationalization. Format compliance (does the output match the required schema?) can be checked with a simple parser. Length compliance (is the response within acceptable bounds?) is a rule-based check. Keyword presence (for outputs that must always include or exclude specific terms) is a string match. Factual accuracy (for applications with verifiable ground truth) might use entity extraction and comparison against a knowledge base. Semantic quality (does the response actually address the question?) requires LLM-as-judge or human annotation. Document these metric definitions and their implementations before you start evaluating — the definitions should be stable enough that different engineers would implement them the same way.
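The rule-based checks are each a few lines. A sketch, with the schema, bounds, and keyword lists as illustrative placeholders:

```python
# Rule-based checks; the schema, bounds, and keyword lists are illustrative.
# Semantic quality still requires LLM-as-judge or human annotation.
import json

def format_compliance(output: str) -> bool:
    """Does the output parse as JSON with the required keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and {"answer", "sources"} <= parsed.keys()

def length_compliance(output: str, min_words: int = 10, max_words: int = 300) -> bool:
    return min_words <= len(output.split()) <= max_words

def keyword_presence(output: str, required: list[str], forbidden: list[str]) -> bool:
    text = output.lower()
    return (all(k.lower() in text for k in required)
            and not any(k.lower() in text for k in forbidden))
```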
Integrate your evaluation pipeline into your development workflow as a continuous integration check. Every prompt change, model version update, or retrieval configuration change should trigger a full eval suite run automatically. Treat a degraded eval score the same way you treat a failing unit test: the change does not ship until the regression is resolved or explicitly accepted with documented rationale. This discipline prevents slow, incremental degradation of model quality that happens when teams make individual changes that each seem minor but collectively accumulate into a significantly worse model over time. The eval pipeline should produce a dashboard with time-series plots of each metric — this makes it easy to see when a change introduced a regression even if the absolute numbers still look acceptable. Set threshold alerts: if any metric drops more than X% relative to the previous baseline, trigger an investigation before the change reaches production.
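A sketch of the gate itself: compare current scores to a stored baseline and fail the build on any relative drop beyond a threshold (the 5% and the metric values are illustrative):

```python
# CI regression gate: fail the build on any metric that drops more than
# `threshold` relative to the stored baseline. Values are illustrative.
import sys

def regression_gate(current: dict, baseline: dict, threshold: float = 0.05) -> list[str]:
    regressions = []
    for metric, base in baseline.items():
        curr = current.get(metric, 0.0)
        if base > 0 and (base - curr) / base > threshold:
            regressions.append(f"{metric}: {base:.3f} -> {curr:.3f}")
    return regressions

baseline_scores = {"faithfulness": 0.92, "answer_relevance": 0.88}  # last release
current_scores = {"faithfulness": 0.85, "answer_relevance": 0.89}   # this change

if failures := regression_gate(current_scores, baseline_scores):
    print("Eval regressions:", *failures, sep="\n  ")
    sys.exit(1)  # block the deploy, exactly like a failing unit test
```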
Shadow testing is the highest-confidence evaluation method for validating model changes before full production deployment. In shadow mode, incoming production requests are routed to both the current production model and the candidate new model simultaneously. Both models produce responses; only the production model’s response is served to the user. The candidate’s response is logged and evaluated offline. This gives you evaluation data on the exact production input distribution, with no risk to user experience. Shadow testing typically runs for 48-72 hours to cover full weekly traffic patterns before evaluating results. Automatic evaluation on shadow traffic uses your LLM-as-judge pipeline for quality scoring and rule-based checks for format compliance. Human spot-checking of 100-200 shadow comparisons, stratified by query category and covering the highest-discrepancy cases, provides ground truth calibration. A candidate model passes the shadow test when it matches or exceeds production quality across all metric categories and has no systematic failure modes visible in the human spot-check.
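A sketch of shadow-mode request handling, with `production_model`, `candidate_model`, and `shadow_log` as hypothetical async helpers:

```python
# Shadow-mode request handling: the user always receives the production
# response; the candidate runs in parallel and is only logged for offline
# evaluation. `production_model`, `candidate_model`, and `shadow_log` are
# hypothetical async helpers.
import asyncio

async def handle_request(prompt: str) -> str:
    prod_task = asyncio.create_task(production_model(prompt))
    cand_task = asyncio.create_task(candidate_model(prompt))

    prod_response = await prod_task  # only this is ever served to the user
    try:
        cand_response = await asyncio.wait_for(cand_task, timeout=30)
    except asyncio.TimeoutError:
        cand_response = "<candidate timeout>"
    await shadow_log(prompt, prod_response, cand_response)  # scored offline
    return prod_response
```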
13.6 Interview Questions
Q1. Why is perplexity not always a good evaluation metric for chat models?
Perplexity measures how well a language model predicts the next token in a reference text sequence — lower perplexity means the model assigned higher probability to the actual tokens in the reference. It is a natural metric for language modeling and is widely used during pretraining to track training progress. However, for chat models, perplexity has two fundamental problems. First, it requires a reference text: to compute perplexity on a response, you need a reference response to compute probability against, but chat applications often have many equally valid responses to any prompt — perplexity against one reference response unfairly penalizes all other valid responses that use different phrasing or structure. Second, perplexity measures statistical likelihood, not quality. A response that is highly probable given the training distribution but is factually wrong, unhelpful, or misaligned with user intent will have low perplexity and appear to score well, while a creative, insightful, or unusual but high-quality response may have high perplexity simply because it is statistically unexpected. Chat model quality dimensions — helpfulness, accuracy, tone, safety — are simply not captured by statistical likelihood of text sequences.
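For reference, the standard definition: for a reference token sequence x_1, ..., x_N, perplexity is the exponentiated average negative log-likelihood,

\mathrm{PPL}(x_{1:N}) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)

which makes both problems concrete: the score is defined only relative to one particular reference sequence, and it rewards statistical typicality rather than quality.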
Q2. What is BLEU score and why is it limited for evaluating open-ended text generation?
BLEU (Bilingual Evaluation Understudy) measures the n-gram overlap between a generated text and one or more reference texts, with a brevity penalty to prevent very short outputs from gaming the metric. It was developed for machine translation (2002), where the task has relatively constrained correct outputs — a good translation of a sentence should use many of the same words and phrases as a reference translation. In that context, BLEU correlates reasonably well with human judgments. For open-ended text generation — chat responses, summaries, creative writing — BLEU has severe limitations. The surface form of a good response is highly variable: “The product should be returned within 30 days” and “Returns are accepted for up to one month from purchase date” convey identical information with near-zero n-gram overlap, yielding a BLEU score near zero for an equivalent answer. BLEU also does not measure factual accuracy, hallucination, coherence, or tone — all of which matter for chat applications. A response that is factually wrong but uses similar phrasing to the reference will score high on BLEU; a response that is factually correct but phrased differently will score low. For generation evaluation in production, BLEU should be reserved for tasks where surface similarity is genuinely meaningful (closed-domain machine translation, exact template filling) and replaced with semantic similarity metrics, LLM-as-judge, or task-specific metrics for open-ended tasks.
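For reference, the standard definition with uniform 4-gram weights:

\text{BLEU} = \mathrm{BP}\cdot\exp\left(\sum_{n=1}^{4} w_n \log p_n\right),\qquad \mathrm{BP} = \begin{cases}1 & \text{if } c > r\\ e^{1 - r/c} & \text{if } c \le r\end{cases}

where p_n is the modified n-gram precision, w_n = 1/4, c is the candidate length, and r is the effective reference length. The formula makes the limitation visible: every term depends only on surface n-gram overlap with the reference.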
Q3. What is LLM-as-judge? What biases does it introduce and how do you mitigate them?
LLM-as-judge uses a strong language model (typically GPT-4, Claude, or Gemini) to evaluate another model’s outputs against a rubric, providing a scalar quality score and optionally an explanation. It is widely used because it scales to millions of evaluations with no human annotation cost and provides more nuanced quality assessment than rule-based metrics. The four principal biases are: positional bias (the judge favors whichever response is presented first in a pairwise comparison, by 5-10 percentage points); verbosity bias (longer responses receive higher scores even when extra length adds no value); self-preference bias (a judge model rates outputs from its own family higher than outputs from other model families); and format bias (responses with headers, bullet points, and visual structure receive higher scores than unformatted prose of equivalent content). Mitigations: for positional bias, evaluate every pair in both orderings and only use cases where both orderings agree, or average scores across both orderings. For verbosity bias, explicitly include length-appropriateness in the rubric and manually review top-scoring responses periodically. For self-preference bias, choose a judge model from a different family than the model being evaluated. For format bias, specify format requirements in the rubric rather than allowing the judge to apply implicit formatting preferences. All biases benefit from periodic calibration: sample 100-200 judge-scored examples, have humans score them independently, and measure agreement — this tells you how much to trust the judge on your specific task and rubric.
Q4. How would you build an eval suite for a customer support chatbot? What does the test set look like, what metrics do you compute, and how do you use it in deployment?
Test set construction starts with production data. Collect 300-500 real customer support queries, stratified by query category (billing questions, technical issues, account management, edge cases, escalation scenarios). Include examples where previous model versions failed. For each example, define the expected output properties — not necessarily a single reference response, but a set of criteria: the response should address the specific issue raised, should not provide incorrect policy information, should escalate appropriately when the issue requires human review, and should not commit to actions the system cannot take. For metrics: format compliance (structured response with issue acknowledgment, resolution steps, and next-action guidance, checked via regex or parsing), policy accuracy (does the response accurately reflect your product’s actual policies, checked against a policy knowledge base via LLM-as-judge with the policy documents provided), resolution rate (does the response provide actionable steps that address the user’s issue, scored 1-5 by LLM-as-judge with a detailed rubric), escalation accuracy (for queries that should trigger human escalation, does the model escalate?), and safety (does the response contain any commitment to actions, pricing, or timelines not in the approved policy). Use this suite in deployment as a CI/CD gate: every prompt template change and model version update triggers the full suite. Block deployment if policy accuracy drops more than 2% or escalation accuracy drops more than 5% relative to baseline. Run shadow testing for 48 hours on significant changes before full rollout.
Q5. Explain the four RAGAS metrics (faithfulness, answer relevance, context precision, context recall). For each, describe a failure mode that would cause it to degrade.
Faithfulness measures whether every factual claim in the generated answer is supported by the retrieved context. Failure mode: the model retrieves a relevant document but ignores it, generating an answer from its parametric memory instead — producing a confident response based on potentially outdated or incorrect internal knowledge rather than the grounded retrieved content. Answer Relevance measures whether the answer addresses the question that was actually asked. Failure mode: the retriever surfaces documents topically related to the query domain but addressing a slightly different question — for example, retrieving shipping policy when the user asked about return policy — and the generator answers the tangentially related question present in those documents rather than the user’s actual question. The answer is faithful (grounded in retrieved context) but irrelevant (wrong question answered). Context Precision measures what fraction of retrieved context chunks are genuinely relevant. Failure mode: the retriever uses broad semantic similarity matching that captures documents on the right general topic but not the specific subject — retrieving five chunks about product maintenance when only one is relevant to the specific maintenance procedure asked about. The generator now has its context diluted with noise, which can lead to hallucination or generic responses. Context Recall measures whether the retrieved context contains all information necessary for a complete answer. Failure mode: the relevant information is buried in a document that scores low on the retriever’s similarity function — perhaps because it uses different vocabulary than the query — and is therefore not retrieved. The model answers based on incomplete context and produces a partial or incorrect response, not because of any generation failure but because the retriever failed.
Q6. A customer is switching from GPT-4 to a cheaper model to reduce costs. Design the evaluation process you would run to validate the switch won’t degrade their product.
This evaluation process has four phases. Phase one: establish a production-representative test set. If the customer has logs from their GPT-4 deployment, sample 500-1,000 real production queries, stratified by query category and including hard cases and past failure modes. If no logs exist, work with the customer to identify their primary use case categories and generate representative queries for each. This set must be fixed before any evaluation begins — you cannot select the test set after seeing how models perform on different samples. Phase two: define task-specific success metrics. For each query category, specify what a passing response looks like — not just “is it good?” but operationalized criteria: format compliance, factual accuracy on verifiable claims, completeness (does it address all aspects of the question), tone compliance (does it match brand voice guidelines). Document these in a scoring rubric before scoring any outputs. Phase three: run parallel evaluation. Generate GPT-4 and candidate model responses for all test set queries. Score both models on all metrics. Compute absolute scores and the GPT-4 to candidate model delta for each metric and category. Identify specific categories where the delta is large — these are the highest-risk areas. Phase four: human review of high-delta cases. Have domain experts review the 50-100 pairs where the two models differ most in automated score. This calibrates the automated metrics (are they accurately capturing quality differences?) and provides ground truth for the most consequential comparisons. Acceptance criteria: the candidate model meets a predefined minimum score on each metric — not necessarily matching GPT-4 exactly, but within an acceptable delta agreed with the customer in advance. This prevents post-hoc justification of whatever quality level the cheaper model happens to achieve.
Q7. The customer’s model performs well in offline evaluation but poorly in production. What causes this eval-to-production gap and how would you close it?
The eval-to-production gap has five common causes. First, distribution shift: the offline test set does not represent the real production query distribution. Test sets built before launch are based on anticipated queries; real users ask different things, phrase requests differently, and surface use cases the product team didn’t anticipate. Solution: continuously update the test set with real production queries, sampling regularly from live traffic. Second, prompt version mismatch: the offline eval ran against an older version of the system prompt, or the production system has configuration differences (different model parameters, different retrieved context, different input preprocessing) that were not replicated in the eval environment. Solution: run offline eval in an environment that exactly mirrors production, using the same prompt templates and parameters fetched from the same configuration store. Third, evaluation metric gaps: the offline metrics measured things that were easy to measure (format compliance, keyword presence) but not the things users actually care about (did the response solve my problem?). Solution: invest in richer LLM-as-judge evaluation that captures user-relevant quality dimensions, and correlate offline metrics with user satisfaction signals periodically. Fourth, temporal distribution shift: the production query distribution drifts over time — users discover new use cases, the product adds features, seasonality affects query patterns. The offline test set becomes stale. Solution: implement a monitoring pipeline that continuously evaluates a sample of production traffic and flags when production scores diverge from offline scores. Fifth, user behavior effects: the offline evaluation does not capture how real users react to responses — users respond to low-quality answers by reformulating their queries or abandoning the session, dissatisfaction signals that only appear in production logs. Solution: define production metrics based on user behavior signals (session abandonment, query reformulation rate, explicit feedback) and monitor these alongside offline quality metrics.
Q8. A customer claims their model has a 95% accuracy rate. You suspect this is misleading. What questions do you ask to understand the real picture?
Start with the test set definition: what is the composition of the 5% of questions the model gets wrong, and how was the test set constructed? If the test set was drawn from the same distribution as the training set, accuracy reflects memorization rather than generalization. If it was hand-curated by the team, it likely underrepresents hard cases and edge cases that the team did not anticipate. Ask to see the test set distribution broken down by query category — a 95% average can hide a 60% accuracy on the hardest and most common category if easy categories dominate the count. Next, ask how accuracy is defined. For open-ended generation, what constitutes a correct answer? Was it human-graded, LLM-graded, or matched against a single reference response? Single-reference matching will undercount correct responses that use different phrasing. If LLM-graded, which model was used as judge and was it calibrated against human labels? Then ask about production versus offline evaluation: has this 95% been validated on production traffic, or only on an offline test set? When was the test set last updated relative to when the model was deployed? And ask about failure mode distribution: even with 95% accuracy, what do the 5% failures look like? Five percent of 100,000 daily queries is 5,000 failures per day — if those failures cluster in high-stakes query categories (medical advice, financial guidance, safety information), the product has a serious problem that the 95% headline obscures. The goal is to move the customer from a single headline number to a quality decomposition by query category, metric, and failure mode severity.