Measuring What RAG Actually Produces
The retrieval eval set says the right document was in the top-K. The reader produced the wrong answer anyway. Retrieval metrics cannot diagnose this because they measure the retriever in isolation from the reader. End-to-end evaluation needs the four metrics that anchor RAGAS: faithfulness, answer relevance, context precision, and context recall. Stratified reporting keeps per-category failures from hiding in the aggregate.
The Evaluation Gap
A retrieval-augmented generation system has two distinct stages, and each can fail independently. The retriever might surface irrelevant chunks, in which case metrics like Recall@K, MRR, and NDCG will catch the problem. There is a second, subtler class of failure that retrieval metrics are blind to.
Consider a scenario where your retriever works perfectly. It surfaces the three most relevant chunks from your knowledge base, each containing exactly the information needed to answer the user's question. Recall@10 is 1.0. MRR is 1.0. By every retrieval metric, the system is performing flawlessly.
Then the LLM ignores the second chunk, misinterprets the third, and fabricates a statistic that appears nowhere in any of them. The user receives a confident, fluent, completely wrong answer. Your retrieval dashboard shows green across the board.
This is the evaluation gap. Retrieval metrics measure whether the right information reached the model. They say nothing about what the model did with it. Did it stay faithful to the provided context? Did it address the actual question? Did it hallucinate despite having perfectly good source material? Answering these questions requires evaluating the generation stage on its own terms, with its own metrics. That is what RAGAS was built to do.
The RAGAS Framework
RAGAS (Retrieval Augmented Generation Assessment) is a framework for evaluating RAG pipelines that, for most of its metrics, does not require human-annotated ground truth answers [1]. The key innovation is using large language models themselves as evaluators, a technique that sounds circular until you understand how it works in practice.
The framework decomposes RAG evaluation into four core metrics, each targeting a different failure mode. Together, they provide a comprehensive view of system quality that no single metric could achieve alone.
| Metric | What It Measures | Failure It Catches |
|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | Hallucination, fabrication, extrapolation beyond context |
| Answer Relevance | Does the answer address the question asked? | Tangential responses, topic drift, over-general answers |
| Context Precision | Were the retrieved contexts actually useful for answering? | Noisy retrieval, irrelevant chunks diluting signal |
| Context Recall | Does the context contain all information needed for the answer? | Missing information, incomplete retrieval |
The first two metrics evaluate the generation stage: did the LLM produce a good answer given what it received? The last two evaluate the retrieval stage from the generation perspective: did the retriever give the LLM what it needed? This decomposition is what makes RAGAS powerful. When a metric drops, you know which stage to inspect first.
LLM-as-Judge: Why It Works
The idea of using an LLM to evaluate another LLM's output invites skepticism. If the evaluator model has the same biases and failure modes as the model being evaluated, what have you gained?
The answer lies in the asymmetry between generation and verification. Generating a correct answer from scratch is hard. Checking whether a given answer is supported by a given piece of text is considerably easier. This resembles the asymmetry familiar from complexity theory, where a proposed solution to an NP problem can be verified efficiently even when finding one is believed to be hard, and a similar gap holds for language models. An LLM asked "Is the claim 'revenue grew 15% in Q3' supported by this paragraph?" can usually answer correctly, even if it might have hallucinated that same claim during open-ended generation.
Empirical studies of the LLM-as-judge paradigm have shown that strong models achieve over 80% agreement with human evaluators on quality judgments, comparable to the agreement rate between different human annotators [2]. The evaluator is not perfect, but it is consistent, scalable, and cheap. For most teams, that tradeoff is far more practical than hiring annotators to grade every response.
Faithfulness: The Most Critical Metric
Of the four RAGAS metrics, faithfulness deserves the most attention. A RAG system exists to ground language model outputs in retrieved evidence. If the system produces answers that contradict or go beyond that evidence, it has failed at its primary purpose. Faithfulness is the metric that catches this failure.
How Faithfulness Scoring Works
The faithfulness computation proceeds in two steps. First, the evaluator LLM decomposes the generated answer into a set of atomic claims. Each claim is a single, verifiable statement. Second, the evaluator checks each claim against the retrieved context, classifying it as either supported or unsupported.
The faithfulness score is simply the ratio of supported claims to total claims:
Faithfulness = (Number of claims supported by context) / (Total number of claims in the answer)
Consider a concrete example. A user asks "What were Acme Corp's Q3 2024 results?" and the retriever surfaces a chunk containing: "Acme Corp reported revenue of $4.2 billion in Q3 2024, up 12% year-over-year. Operating margin improved to 18.5%." The LLM generates this answer:
"Acme Corp had a strong Q3 2024, with revenue of $4.2 billion (up 12% YoY) and operating margin of 18.5%. The company also announced a $500 million share buyback program."
The evaluator extracts four claims: (1) revenue was $4.2 billion, (2) revenue grew 12% year-over-year, (3) operating margin was 18.5%, and (4) the company announced a $500 million buyback. Claims 1 through 3 are supported by the context, but claim 4 is not, so the faithfulness score is 3 of 4, or 0.75.
That unsupported claim is precisely the kind of hallucination that makes RAG systems dangerous in production. The model "knew" about buyback programs from its training data and wove the information seamlessly into an otherwise accurate response. Without faithfulness evaluation, this error would be invisible.
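The arithmetic of the Acme example can be reproduced in a few lines. This is a minimal sketch, not the ragas implementation: the `faithfulness_score` helper and the hand-written verdicts are illustrative assumptions standing in for what the evaluator LLM would return.

```python
def faithfulness_score(verdicts):
    """Ratio of supported claims to total claims.

    `verdicts` maps each extracted claim to True (supported by the
    retrieved context) or False. Hypothetical helper for illustration.
    """
    if not verdicts:
        raise ValueError("no claims to score")
    # True counts as 1 and False as 0 when summed.
    return sum(verdicts.values()) / len(verdicts)


# Verdicts written by hand to mirror the Acme Corp example above;
# in practice the evaluator LLM would produce them.
acme_verdicts = {
    "Revenue was $4.2 billion": True,
    "Revenue grew 12% year-over-year": True,
    "Operating margin was 18.5%": True,
    "The company announced a $500 million buyback": False,
}

print(faithfulness_score(acme_verdicts))  # 0.75
```

Keeping the per-claim verdicts, rather than only the ratio, is what makes the score debuggable: the unsupported buyback claim is visible by name.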
Faithful vs. Unfaithful Answers
The distinction becomes clearer with a side-by-side comparison. Given the context "Python 3.12 introduced the new `type` statement for defining type aliases and added performance improvements averaging 5% over 3.11," compare two possible answers to the question "What's new in Python 3.12?"
Faithful answer (score: 1.0): "Python 3.12 introduced a new `type` statement for type alias definitions and delivered approximately 5% performance improvements compared to Python 3.11."
Unfaithful answer (score: 0.4): "Python 3.12 brought several improvements including the `type` statement for type aliases, a new garbage collector, per-interpreter GIL support, improved error messages with color highlighting, and roughly 5% better performance than 3.11."
The unfaithful answer is not wrong in an absolute sense. Some of those claims are factually true of Python 3.12. But the context supports only two of the five claims. The model drew from its parametric knowledge to pad the answer, which defeats the entire purpose of retrieval augmentation. In a domain where the model's training data might be outdated or incorrect, this behavior is actively harmful.
Faithfulness and Hallucination
High faithfulness with good retrieval is the gold standard for production RAG. It means the system is doing what it was designed to do: answering questions based on retrieved evidence rather than parametric memory.
Low faithfulness despite good retrieval is the most dangerous failure mode. The retriever did its job, but the model ignored the context and confabulated an answer anyway. This pattern often indicates that the prompt template is not sufficiently constraining, or that the model is defaulting to its training data when the retrieved context seems insufficient.
Low faithfulness with poor retrieval is less dangerous but still problematic. The model had no good context to work with and improvised. Fixing the retriever is the first priority here, but adding instructions like "only answer based on the provided context" can limit the blast radius.
Answer Relevance: Did You Actually Answer the Question?
A response can be perfectly faithful to the context and still miss the point entirely. If a user asks "How do I reset my password?" and the system responds with a faithful summary of the company's password policy requirements, the answer is grounded but useless. Answer relevance catches this class of failure.
RAGAS computes answer relevance through an inverse approach. The evaluator LLM generates several questions that the given answer would be a good response to. It then measures the semantic similarity between these generated questions and the original question. If the answer is relevant, the generated questions should closely match the original. If the answer is off-topic, the generated questions will diverge.
This indirect measurement is more robust than asking the LLM directly "Is this answer relevant?" because it forces the evaluator to engage with the content rather than producing a surface-level judgment. An answer that vaguely relates to the topic might get a generous "yes" from a direct assessment, but the questions it would actually answer will not match the original query.
Answer relevance also penalizes unnecessary verbosity. An answer that buries the relevant information inside three paragraphs of tangential context will generate questions that cover a wider topic space than the original, lowering the similarity score. This aligns with what users actually want: a focused response that addresses their specific question.
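The scoring half of this inverse approach can be sketched as a mean cosine similarity. The code below is a simplified illustration under stated assumptions: the function names are hypothetical, the three-dimensional vectors are made up, and a real pipeline would obtain the vectors from an embedding model after the evaluator LLM has generated the questions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec, generated_question_vecs):
    """Mean similarity between the original question's embedding and
    the embeddings of questions generated from the answer."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy "embeddings", invented for illustration only.
original = [1.0, 0.0, 0.0]
on_topic = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
off_topic = [[0.0, 1.0, 0.0], [0.0, 0.6, 0.8]]

print(answer_relevance(original, on_topic))   # approximately 0.9
print(answer_relevance(original, off_topic))  # 0.0
```

Averaging over several generated questions is what lets verbosity show up in the score: each tangential question pulls the mean down.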
Context Precision and Context Recall
The remaining two RAGAS metrics evaluate the retrieval stage, but from a perspective that traditional retrieval metrics miss. Recall@K tells you whether relevant documents were retrieved. Context precision and context recall tell you whether the retrieved documents were actually useful for generating a correct answer.
Context Precision
Context precision asks: of the chunks in the retrieved context, how many were actually needed to produce the answer? A retriever that returns ten chunks when only two contain relevant information scores poorly on context precision, even if Recall@K is perfect.
This matters because irrelevant context is not merely wasteful. It actively degrades generation quality. Research has shown that language models struggle to use relevant information when it is buried in a long context among irrelevant passages, a phenomenon known as the "lost in the middle" effect [3]. High context precision means the model receives a clean, focused context with minimal noise.
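The ten-chunk example can be made concrete with the plain ratio described above. This is a sketch of that unweighted definition only; the `context_precision` helper here is hypothetical, the usefulness flags are assumed to come from the evaluator LLM, and the ragas library's own metric may be computed differently (for example, by taking rank into account).

```python
def context_precision(chunk_is_useful):
    """Fraction of retrieved chunks that were needed for the answer.

    `chunk_is_useful` is a list of booleans, one per retrieved chunk,
    in retrieval order. Hypothetical helper for illustration.
    """
    if not chunk_is_useful:
        return 0.0
    return sum(chunk_is_useful) / len(chunk_is_useful)


# Ten chunks retrieved; only the first and the fourth were useful.
flags = [True, False, False, True] + [False] * 6

print(context_precision(flags))  # 0.2
```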
Context Recall
Context recall asks: does the retrieved context contain all the information needed to answer the question completely? This requires a reference answer (ground truth), which the evaluator decomposes into constituent claims and then checks whether each claim can be attributed to the retrieved context.
Context recall differs from traditional Recall@K in a subtle but important way. Recall@K measures whether specific documents were retrieved. Context recall measures whether the information content needed for a complete answer is present, regardless of which specific documents contain it. A different set of documents that covers the same information would score identically on context recall even if Recall@K differs.
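A minimal sketch of the context recall computation follows. It assumes the reference answer has already been decomposed into claims, and it uses a deliberately naive substring check (`naive_attribution`, a hypothetical stand-in) where RAGAS would ask the evaluator LLM whether a claim can be attributed to the context.

```python
def context_recall(reference_claims, contexts, is_attributable):
    """Fraction of reference-answer claims found in the retrieved contexts."""
    if not reference_claims:
        raise ValueError("reference answer produced no claims")
    joined = "\n\n".join(contexts)
    hits = sum(1 for claim in reference_claims if is_attributable(claim, joined))
    return hits / len(reference_claims)


def naive_attribution(claim, context):
    # Stand-in for the evaluator LLM: case-insensitive substring match.
    return claim.lower() in context.lower()


claims = [
    "dropout sets neuron outputs to zero",
    "dropout prevents co-adaptation",
]
contexts = [
    "During training, dropout sets neuron outputs to zero with probability p."
]

print(context_recall(claims, contexts, naive_attribution))  # 0.5
```

Because the check runs over the joined contexts, any set of documents carrying the same information scores the same, which is the difference from Recall@K noted above.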
When Metrics Disagree
The diagnostic power of RAGAS comes from examining metric combinations, not individual scores. Each combination points to a specific failure mode.
This kind of decomposition is why end-to-end metrics alone are insufficient. A single "answer quality" score of 0.5 could mean the system is mediocre at everything or excellent at one thing and terrible at another. The fix is completely different in each case.
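One way to operationalize this is a small triage function that turns a set of scores into a list of places to look. The sketch below is an assumption-laden illustration: the `diagnose` name, the 0.7 threshold, and the wording of each finding are invented here, and the mapping simply restates the failure modes described earlier in this section.

```python
def diagnose(scores, threshold=0.7):
    """Map low RAGAS scores to the pipeline stage worth inspecting.

    The threshold is an illustrative default, not a recommended value.
    """
    low = {name for name, value in scores.items() if value < threshold}
    findings = []
    if "context_recall" in low:
        findings.append("retrieval: needed information is missing from the context")
    if "context_precision" in low:
        findings.append("retrieval: irrelevant chunks are diluting the context")
    if "faithfulness" in low:
        if "context_recall" in low:
            findings.append("generation: model improvised without good context; fix retrieval first")
        else:
            findings.append("generation: model went beyond good context; tighten the prompt")
    if "answer_relevancy" in low:
        findings.append("generation: answer does not address the question asked")
    return findings


example = {
    "faithfulness": 0.45,
    "answer_relevancy": 0.90,
    "context_precision": 0.85,
    "context_recall": 0.95,
}
for finding in diagnose(example):
    print(finding)
```

Here retrieval looks healthy and only faithfulness is low, so the single finding points at the generation stage.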
Implementing RAGAS Evaluation
The ragas library provides a straightforward Python API for running all four metrics. The following example demonstrates a complete evaluation pipeline.
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )
    from datasets import Dataset

    # Prepare evaluation data.
    # Each example needs: question, answer, contexts, and
    # optionally a ground_truth for context_recall.
    # Note: these column names follow the schema this example was written
    # against; other ragas releases may expect different names.
    eval_data = {
        "question": [
            "What is the maximum context window for GPT-4?",
            "How does dropout regularization work?",
            "What causes vanishing gradients in deep networks?",
        ],
        "answer": [
            "GPT-4 supports a maximum context window of 128K tokens.",
            "Dropout randomly deactivates neurons during training, "
            "forcing the network to learn redundant representations.",
            "Vanishing gradients occur when activation functions like "
            "sigmoid compress gradients during backpropagation, making "
            "updates to early layers negligibly small.",
        ],
        "contexts": [
            ["GPT-4 offers context windows of 8K and 128K tokens. "
             "The 128K variant can process approximately 300 pages."],
            ["Dropout (Srivastava et al., 2014) is a regularization "
             "technique that randomly sets neuron outputs to zero "
             "during training with probability p, typically 0.5."],
            ["Deep networks suffer from vanishing gradients when "
             "using sigmoid or tanh activations. ReLU mitigates this."],
        ],
        "ground_truth": [
            "GPT-4's maximum context window is 128,000 tokens.",
            "Dropout randomly zeroes neuron activations during training "
            "to prevent co-adaptation and improve generalization.",
            "Vanishing gradients are caused by activation functions "
            "that squash their inputs, making gradients exponentially "
            "smaller through successive layers during backpropagation.",
        ],
    }

    # Create a HuggingFace Dataset
    dataset = Dataset.from_dict(eval_data)

    # Run evaluation with all four metrics
    results = evaluate(
        dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
    )

    # Print aggregate scores (the values below are illustrative;
    # actual scores depend on the judge model and will vary by run)
    print(results)
    # {'faithfulness': 0.92, 'answer_relevancy': 0.88,
    #  'context_precision': 0.85, 'context_recall': 0.90}

    # Convert to pandas for per-example analysis
    df = results.to_pandas()
    print(df[["question", "faithfulness", "answer_relevancy"]])
The library handles the LLM evaluation calls internally, using OpenAI's API by default. You can configure it to use other providers, including Anthropic's Claude or open-source models served via vLLM or Ollama, by passing a custom LLM wrapper.
A Manual Faithfulness Checker
To understand what RAGAS does under the hood, it helps to implement a simplified faithfulness checker from scratch. The following code shows the two-step process: claim extraction followed by claim verification.
    import json

    from openai import OpenAI

    client = OpenAI()


    def extract_claims(answer):
        """Decompose an answer into atomic, verifiable claims."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": (
                    "Extract every factual claim from the given text. "
                    "Each claim should be a single, atomic statement "
                    "that can be independently verified. Return a JSON "
                    'object of the form {"claims": ["...", "..."]}.'
                )
            }, {
                "role": "user",
                "content": answer
            }],
            # JSON mode returns an object, so the prompt asks for a
            # "claims" key rather than a bare array.
            response_format={"type": "json_object"},
            temperature=0.0,
        )
        result = json.loads(response.choices[0].message.content)
        return result.get("claims", [])


    def verify_claim(claim, context):
        """Check whether a single claim is supported by the context."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": (
                    "You are a fact-checker. Given a claim and a "
                    "context, determine if the claim is supported "
                    "by the context. Respond with a JSON object: "
                    '{"supported": true/false, "reasoning": "..."}'
                )
            }, {
                "role": "user",
                "content": (
                    f"Claim: {claim}\n\n"
                    f"Context: {context}"
                )
            }],
            response_format={"type": "json_object"},
            temperature=0.0,
        )
        return json.loads(response.choices[0].message.content)


    def compute_faithfulness(answer, contexts):
        """Compute faithfulness score for an answer given contexts."""
        claims = extract_claims(answer)
        if not claims:
            # No claims to verify; return the same shape as the normal case
            return {"score": 1.0, "total_claims": 0,
                    "supported_claims": 0, "details": []}

        # Concatenate all contexts
        full_context = "\n\n".join(contexts)

        # Verify each claim
        supported = 0
        results = []
        for claim in claims:
            verdict = verify_claim(claim, full_context)
            is_supported = bool(verdict.get("supported", False))
            results.append({
                "claim": claim,
                "supported": is_supported,
                "reasoning": verdict.get("reasoning", ""),
            })
            if is_supported:
                supported += 1

        score = supported / len(claims)
        return {
            "score": score,
            "total_claims": len(claims),
            "supported_claims": supported,
            "details": results,
        }


    # Example usage
    result = compute_faithfulness(
        answer="Python 3.12 introduced the type statement and "
               "a new garbage collector, with 5% better performance.",
        contexts=[
            "Python 3.12 introduced the new type statement for "
            "defining type aliases and added performance improvements "
            "averaging 5% over 3.11."
        ]
    )
    print(f"Faithfulness: {result['score']:.2f}")
    # Expected if the judge extracts three claims:
    # Faithfulness: 0.67 (2 of 3 claims supported;
    # "new garbage collector" is not in context)
This manual implementation makes the mechanism transparent. The RAGAS library optimizes this process with batched evaluations, better prompt engineering, and caching, but the conceptual flow is identical: decompose, verify, aggregate.
Beyond RAGAS: Other Evaluation Approaches
RAGAS is the most widely adopted framework for RAG evaluation, but it is not the only approach. Different contexts call for different evaluation strategies, and a mature evaluation practice draws on several of them.
LLM-as-Judge Patterns
The LLM-as-judge paradigm extends well beyond RAGAS's specific metrics. G-Eval applies chain-of-thought prompting to improve evaluation quality [4]. Instead of asking the judge model for a single score, G-Eval asks it to first reason about the evaluation criteria, then produce a score. This structured reasoning reduces the noise in evaluation judgments.
You can build custom judge prompts for any quality dimension relevant to your application. Tone consistency, citation accuracy, response length appropriateness, safety compliance: if you can describe the criterion clearly enough for a human to evaluate, you can usually get a reasonable LLM evaluation of it.
    import json

    from openai import OpenAI

    client = OpenAI()


    def llm_judge_evaluate(question, answer, context, criteria):
        """General-purpose LLM-as-judge evaluation."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": (
                    "You are an expert evaluator for AI-generated "
                    "responses. Evaluate the following response on "
                    "the given criteria.\n\n"
                    "Think step by step about each criterion before "
                    "assigning scores. Return a JSON object with "
                    "'reasoning' (string) and 'scores' (object mapping "
                    "each criterion name to a float between 0 and 1)."
                )
            }, {
                "role": "user",
                "content": (
                    f"Question: {question}\n\n"
                    f"Context provided to the system:\n{context}\n\n"
                    f"System response:\n{answer}\n\n"
                    f"Evaluation criteria:\n"
                    + "\n".join(
                        f"- {name}: {desc}"
                        for name, desc in criteria.items()
                    )
                )
            }],
            response_format={"type": "json_object"},
            temperature=0.0,
        )
        return json.loads(response.choices[0].message.content)


    # Example: custom criteria for a medical Q&A system
    medical_criteria = {
        "accuracy": "Are all medical claims factually correct and "
                    "supported by the provided context?",
        "safety": "Does the response avoid giving dangerous advice "
                  "and include appropriate disclaimers?",
        "completeness": "Does the response address all aspects of "
                        "the question without omitting important details?",
        "clarity": "Is the response written in language a patient "
                   "can understand, avoiding unnecessary jargon?",
    }

    result = llm_judge_evaluate(
        question="What are the side effects of metformin?",
        answer="Common side effects include nausea and diarrhea...",
        context="Metformin side effects: GI disturbances (nausea, "
                "diarrhea, abdominal pain) in 20-30% of patients...",
        criteria=medical_criteria,
    )
Human Evaluation
LLM judges are convenient. They are not a complete replacement for human judgment, especially in high-stakes domains where errors carry real consequences.
Human evaluation is essential in three scenarios. First, when you are calibrating your automated metrics. You need human judgments as a reference point to know whether your LLM judge is actually measuring what you think it is measuring. Second, when evaluating subjective qualities like tone, empathy, or cultural appropriateness that LLMs struggle to assess reliably. Third, when the cost of a wrong answer is high enough that statistical evaluation is insufficient and you need case-by-case review.
Structuring human evaluation well requires attention to inter-annotator agreement. If two annotators disagree on whether an answer is faithful, the criterion is ambiguous. As a common rule of thumb, Cohen's kappa, which measures agreement beyond chance, should be above 0.6 for your evaluation to be meaningful. Anything below that suggests your rubric needs refinement, not that your system is difficult to evaluate.
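Cohen's kappa is straightforward to compute by hand, which makes the gap between raw agreement and chance-corrected agreement easy to see. The sketch below uses the standard definition, kappa = (p_o - p_e) / (1 - p_e); the `cohens_kappa` helper and the annotator labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up faithfulness judgments: F = faithful, U = unfaithful.
annotator_a = ["F"] * 6 + ["U"] * 4
annotator_b = ["F"] * 5 + ["U"] * 4 + ["F"]

print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.583
```

The two annotators agree on 8 of 10 items, yet kappa comes out at about 0.58, just under the 0.6 bar, because roughly half of that agreement would be expected by chance.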
A practical approach is to use LLM judges for continuous, high-volume evaluation and reserve human evaluation for periodic audits, edge cases flagged by the automated system, and initial calibration of new metrics. This gives you the coverage of automation with the precision of human judgment where it matters most.
Automated Test Suites
Beyond scoring individual responses, production RAG systems need regression testing. The pattern is familiar from software engineering: maintain a set of golden question-answer pairs and run them against every system change.
A golden test suite should include several categories of test cases. Easy questions that the system should always get right form your baseline. Hard questions that probe known weaknesses test your improvements. Adversarial questions designed to trigger hallucination test your guardrails. And negative cases, questions the system should refuse to answer because the knowledge base does not contain the information, test whether the system knows what it does not know.
This last category is frequently overlooked. A RAG system that always produces an answer, even when the knowledge base contains nothing relevant, is a system that will eventually hallucinate something dangerous. Testing for appropriate abstention is just as important as testing for correct answers.
A/B Testing in Production
Offline evaluation tells you how the system performs on your test set. Production evaluation tells you how it performs on real users with real questions. The gap between these two can be significant.
User satisfaction signals include explicit feedback (thumbs up/down buttons, star ratings), implicit feedback (did the user reformulate their query immediately after receiving an answer, suggesting the answer was unhelpful?), and behavioral signals (did the user click through to a source document, suggesting they wanted to verify the answer?).
Query reformulation is a particularly strong negative signal. When a user asks "How do I configure SSO?" and immediately follows up with "SSO setup guide" or "single sign-on configuration steps," they are telling you the first answer was not useful. Tracking reformulation rates across system versions gives you a real-world quality metric that no offline evaluation can replicate.
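A reformulation rate can be approximated from query logs. The sketch below is one possible heuristic, not an established method: it treats a follow-up query as a reformulation when it arrives within a short time window and shares enough words with the previous query. The function names, the log shape, the 60-second window, and the 0.1 overlap threshold are all assumptions chosen for illustration.

```python
import re


def jaccard(a, b):
    """Word-level Jaccard overlap between two queries."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def reformulation_rate(sessions, max_gap_seconds=60, min_overlap=0.1):
    """Share of queries quickly followed by a lexically similar query.

    `sessions` is a list of sessions; each session is a time-ordered
    list of (timestamp_seconds, query) tuples.
    """
    total = 0
    reformulated = 0
    for session in sessions:
        for i, (ts, query) in enumerate(session):
            total += 1
            if i + 1 < len(session):
                next_ts, next_query = session[i + 1]
                if (next_ts - ts <= max_gap_seconds
                        and jaccard(query, next_query) >= min_overlap):
                    reformulated += 1
    return reformulated / total if total else 0.0


sessions = [
    [(0, "How do I configure SSO?"), (20, "SSO setup guide")],
    [(0, "reset password")],
]
rate = reformulation_rate(sessions)
print(f"{rate:.2f}")  # 0.33
```

Word overlap misses paraphrases with no shared vocabulary, such as "single sign-on configuration steps", so an embedding-based similarity would be a reasonable substitute for `jaccard` in practice.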
Building an Evaluation Pipeline
Individual metrics are useful. An integrated evaluation pipeline that runs automatically, tracks trends over time, and alerts on regressions is what separates a research prototype from a production system.
Collecting Evaluation Data
The first step is building an evaluation dataset that represents your actual workload. This dataset needs three types of examples.
Human-curated golden examples are the highest-quality evaluation data. A domain expert writes the question, identifies the relevant context, and provides a reference answer. These are expensive to create but invaluable for calibrating your metrics. Aim for 50 to 100 examples covering the breadth of your use cases.
Synthetic examples are generated by prompting an LLM to create question-answer pairs from your knowledge base chunks. These are cheap to produce at scale but tend to be easier than real queries. They are useful for broad coverage testing and catching gross regressions.
Production samples are real user queries with system-generated answers, captured from your application logs. These reflect the true distribution of user needs, including the ambiguous, poorly phrased, and adversarial queries that neither human curators nor synthetic generators think to include.
    import json
    from datetime import datetime
    from pathlib import Path

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )


    class EvaluationPipeline:
        """End-to-end RAG evaluation pipeline."""

        def __init__(self, rag_system, eval_dataset_path):
            # rag_system is assumed to expose retrieve() and generate()
            self.rag = rag_system
            self.dataset = self._load_dataset(eval_dataset_path)
            self.results_dir = Path("eval_results")
            self.results_dir.mkdir(exist_ok=True)

        def _load_dataset(self, path):
            """Load evaluation dataset from JSON."""
            with open(path) as f:
                return json.load(f)

        def run_rag_pipeline(self, question):
            """Execute the RAG pipeline and capture all outputs."""
            # Retrieve contexts
            contexts = self.rag.retrieve(question)
            # Generate answer
            answer = self.rag.generate(question, contexts)
            return {
                "question": question,
                "contexts": [c["text"] for c in contexts],
                "answer": answer,
            }

        def evaluate_all(self):
            """Run evaluation across the full dataset."""
            questions = []
            answers = []
            contexts_list = []
            ground_truths = []

            for example in self.dataset:
                result = self.run_rag_pipeline(example["question"])
                questions.append(result["question"])
                answers.append(result["answer"])
                contexts_list.append(result["contexts"])
                ground_truths.append(example.get("ground_truth", ""))

            # Build RAGAS dataset
            eval_data = Dataset.from_dict({
                "question": questions,
                "answer": answers,
                "contexts": contexts_list,
                "ground_truth": ground_truths,
            })

            # Run RAGAS evaluation
            scores = evaluate(
                eval_data,
                metrics=[
                    faithfulness,
                    answer_relevancy,
                    context_precision,
                    context_recall,
                ],
            )
            return scores

        def save_results(self, scores, run_metadata=None):
            """Save evaluation results with timestamp."""
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            output = {
                "timestamp": timestamp,
                "metadata": run_metadata or {},
                # Assumes the result object is dict-like
                "aggregate_scores": {
                    k: round(v, 4) for k, v in scores.items()
                    if isinstance(v, float)
                },
                "per_example": scores.to_pandas().to_dict(
                    orient="records"
                ),
            }
            output_path = self.results_dir / f"eval_{timestamp}.json"
            with open(output_path, "w") as f:
                # default=str guards against values json cannot serialize
                json.dump(output, f, indent=2, default=str)
            return output_path
Integrating with CI/CD
The evaluation pipeline becomes most valuable when it runs automatically. In a CI/CD context, this means executing the evaluation suite against every pull request that touches the RAG pipeline and failing the build if metrics drop below established thresholds.
    # ci_eval.py -- Run as part of CI/CD pipeline
    import sys

    # EvaluationPipeline is the class defined above. build_rag_system()
    # and get_branch() are project-specific helpers assumed to be
    # defined or imported elsewhere in your codebase.

    # Define minimum acceptable thresholds
    THRESHOLDS = {
        "faithfulness": 0.85,
        "answer_relevancy": 0.80,
        "context_precision": 0.75,
        "context_recall": 0.80,
    }


    def run_ci_evaluation():
        """Run evaluation and enforce quality gates."""
        pipeline = EvaluationPipeline(
            rag_system=build_rag_system(),
            eval_dataset_path="eval/golden_dataset.json",
        )
        scores = pipeline.evaluate_all()
        pipeline.save_results(
            scores,
            run_metadata={"trigger": "ci", "branch": get_branch()}
        )

        # Check thresholds
        failures = []
        for metric, threshold in THRESHOLDS.items():
            actual = scores[metric]
            if actual < threshold:
                failures.append(
                    f"{metric}: {actual:.3f} < {threshold:.3f}"
                )

        if failures:
            print("EVALUATION FAILED:")
            for failure in failures:
                print(f"  - {failure}")
            sys.exit(1)

        print("All evaluation metrics passed.")
        sys.exit(0)


    if __name__ == "__main__":
        run_ci_evaluation()
Dashboarding and Trend Analysis
Individual evaluation runs tell you whether the system is good right now. Trend analysis tells you whether it is getting better or worse over time. Even small, gradual regressions compound. A faithfulness score that drops 0.5% per week loses roughly 23% of its value over a year (0.995^52 ≈ 0.77).
    import json
    from pathlib import Path

    import pandas as pd


    def load_evaluation_history(results_dir):
        """Load all historical evaluation results for trending."""
        records = []
        for path in sorted(Path(results_dir).glob("eval_*.json")):
            with open(path) as f:
                data = json.load(f)
            record = {"timestamp": data["timestamp"]}
            record.update(data["aggregate_scores"])
            records.append(record)

        df = pd.DataFrame(records)
        df["timestamp"] = pd.to_datetime(
            df["timestamp"], format="%Y%m%d_%H%M%S"
        )
        return df.sort_values("timestamp")


    def detect_regressions(history, window=5, threshold=0.02):
        """Flag metrics that have declined over recent runs."""
        # Two full windows are needed: the most recent runs and the
        # runs just before them to compare against.
        if len(history) < 2 * window:
            return []

        recent = history.tail(window)
        older = history.iloc[-2 * window:-window]

        alerts = []
        for metric in ["faithfulness", "answer_relevancy",
                       "context_precision", "context_recall"]:
            if metric not in recent.columns:
                continue
            recent_mean = recent[metric].mean()
            older_mean = older[metric].mean()
            delta = recent_mean - older_mean
            if delta < -threshold:
                alerts.append({
                    "metric": metric,
                    "recent_mean": round(recent_mean, 4),
                    "older_mean": round(older_mean, 4),
                    "delta": round(delta, 4),
                })
        return alerts


    # Example usage
    history = load_evaluation_history("eval_results")
    alerts = detect_regressions(history)
    for alert in alerts:
        print(
            f"REGRESSION: {alert['metric']} dropped from "
            f"{alert['older_mean']} to {alert['recent_mean']} "
            f"(delta: {alert['delta']})"
        )
This kind of monitoring turns evaluation from a one-time activity into a continuous quality assurance practice. The alerts surface problems before users notice them, which is always cheaper than the alternative.
Common Pitfalls in RAG Evaluation
Even teams that invest in evaluation infrastructure make predictable mistakes. Recognizing these pitfalls early saves months of misdirected effort.
Evaluating Only End-to-End
The most common mistake is measuring only final answer quality without decomposing performance into retrieval and generation components. When end-to-end quality drops, you need to know why. Was it a retrieval regression? A prompt template change? A model API update that shifted behavior?
Always evaluate retrieval and generation independently. Run retrieval metrics (Recall@K, MRR, NDCG) on the retriever alone, then run RAGAS metrics on the full pipeline. When a problem appears, you can immediately narrow it to one stage or the other. Without this decomposition, debugging is guesswork.
Synthetic Evaluation Data That Is Too Easy
When you generate evaluation questions by prompting an LLM to read a chunk and write a question about it, the resulting questions have a systematic bias: they are answerable by a single, specific chunk. Real user queries are messier. They span multiple documents, use different vocabulary than the source material, contain ambiguity, and sometimes ask about things that are not in the knowledge base at all.
Synthetic evaluation sets that report 95% faithfulness and 92% answer relevance may be measuring how well your system handles easy questions, not how well it handles real ones. Supplement synthetic data with production queries as quickly as possible. Even a small set of real questions will expose weaknesses that synthetic benchmarks miss entirely.
Overfitting to the Evaluation Set
If you tune your retrieval parameters, prompt templates, and chunk sizes to maximize scores on a fixed evaluation set, you are doing the RAG equivalent of overfitting. The system will perform brilliantly on those specific questions and unpredictably on everything else.
Mitigation is the same as in machine learning: hold out a test set that you never use for tuning decisions. Evaluate against it periodically to confirm that gains on the development set generalize. Rotate your evaluation data. Add new questions from production logs regularly. An evaluation set that never changes will eventually stop telling you anything useful.
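One simple way to keep a held-out set stable while new production questions keep arriving is to assign each question to a split by hashing its text. This is a sketch of that idea under stated assumptions: `assign_split` is a hypothetical helper, and the 20% test fraction is an arbitrary illustrative choice.

```python
import hashlib


def assign_split(question, test_fraction=0.2):
    """Deterministically assign a question to 'dev' or 'test'.

    Hashing the text means a question always lands in the same split,
    no matter when it is added or in what order the set is processed.
    """
    digest = hashlib.sha256(question.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "dev"


for q in ["How do I configure SSO?", "How do I reset my password?"]:
    print(q, "->", assign_split(q))
```

Tuning decisions then use only the "dev" questions, and the "test" questions are scored periodically to check that gains generalize.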
Ignoring Negative Cases
Most evaluation sets consist entirely of questions the system should be able to answer. This misses a critical failure mode: what happens when the knowledge base does not contain the answer?
A robust RAG system should recognize when it lacks sufficient information and say so, rather than fabricating a plausible-sounding response.29 Testing this behavior requires negative examples: questions that are reasonable but unanswerable given the current knowledge base. A legal RAG system should not answer medical questions. A product documentation system should not speculate about features that do not exist.
Reserve roughly 10-15% of your evaluation set for negative cases. Score them on a binary metric: did the system correctly refuse to answer, or did it hallucinate? This single metric is often more revealing than any of the others, because it exposes whether the system knows the boundaries of its own knowledge.
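The binary score can be sketched as a refusal rate over the answers given to negative cases. The phrase list and function names below are assumptions for illustration only. Substring matching is a crude stand-in for refusal detection, and an LLM judge or a structured "no answer" flag in the system's output would likely be more reliable in practice.

```python
from typing import Iterable

# Hypothetical refusal phrases; adapt these to how your system declines.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "cannot answer",
    "no information",
    "not in the knowledge base",
)


def is_refusal(answer: str) -> bool:
    """Crude check: does the answer contain any known refusal phrase?"""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(answers_to_negative_cases: Iterable[str]) -> float:
    """Share of unanswerable questions that the system correctly declined."""
    answers = list(answers_to_negative_cases)
    if not answers:
        raise ValueError("no negative cases to score")
    return sum(is_refusal(a) for a in answers) / len(answers)
```

A rate close to 1.0 means the system declines most unanswerable questions. Every answer that is not counted as a refusal is a candidate hallucination worth reading by hand.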
Not Evaluating Across User Segments
Aggregate metrics hide distributional problems. A system with 90% average faithfulness might achieve 98% on common questions and 40% on rare but important ones. If those rare questions correspond to high-value users or high-stakes decisions, the aggregate score is dangerously misleading.
Slice your evaluation results by question category, difficulty level, and source domain. A breakdown that shows strong performance on product FAQ questions but weak performance on troubleshooting queries tells a very different story than the aggregate, and points directly at what to fix next.
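A minimal sketch of this slicing, assuming each evaluation record is a dictionary carrying a slice label and a metric score. The field names `category` and `faithfulness` are illustrative, not a fixed schema.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Mapping, Sequence


def scores_by_slice(records: Sequence[Mapping[str, object]],
                    slice_key: str, metric: str) -> Dict[str, float]:
    """Average one metric separately for each value of the slice key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        groups[str(record[slice_key])].append(float(record[metric]))
    return {name: mean(values) for name, values in groups.items()}


# Illustrative records only; the scores are made up for the example.
records = [
    {"category": "faq", "faithfulness": 1.0},
    {"category": "faq", "faithfulness": 0.75},
    {"category": "troubleshooting", "faithfulness": 0.5},
    {"category": "troubleshooting", "faithfulness": 0.25},
]
per_category = scores_by_slice(records, "category", "faithfulness")
overall = mean(r["faithfulness"] for r in records)
```

In this toy data the overall mean is 0.625, while the per-category means are 0.875 for `faq` and 0.375 for `troubleshooting`, which is the kind of gap the aggregate conceals.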
Putting It All Together
A complete RAG evaluation practice combines multiple approaches at different levels of the pipeline.
- Retrieval evaluation (Recall@K, MRR, NDCG) measures whether the right documents reach the context window.
- RAGAS metrics (faithfulness, answer relevance, context precision, context recall) measure whether the system as a whole produces good answers from retrieved context.
- Custom LLM-as-judge evaluations measure domain-specific quality dimensions like safety, tone, and completeness.
- Golden test suites with negative cases catch regressions and test system boundaries.
- Production monitoring (user feedback, query reformulation rates, A/B testing) validates that offline metrics correlate with real-world user satisfaction.
No single layer is sufficient. Retrieval metrics miss generation failures. RAGAS metrics miss the cases that are not in your evaluation set. Offline evaluation misses the distribution shift between your test data and real queries. Production monitoring surfaces the failures that the offline layers miss, but it cannot explain them without the diagnostic power of decomposed metrics.30
The investment in evaluation infrastructure is significant, but the alternative is flying blind. Teams that ship a RAG system to production without systematic evaluation tend to be surprised, sooner or later, by a failure that proper evaluation would have caught in development. The metrics exist. The tools exist. The remaining question is whether you use them before your users discover the problems for you.
References
Textbook grounding, chapter-level citations, and further reading for each numbered reference in this article live on the companion sources page.
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217.
- Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023.
- Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.
- Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." EMNLP 2023.
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." JMLR 15(56): 1929-1958.
- Saad-Falcon, J., Khattab, O., Potts, C., & Zaharia, M. (2023). "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems." arXiv:2311.09476.