name: RAG Evaluation Metrics description: Measure RAG pipeline quality with context precision/recall, faithfulness, answer relevancy, and groundedness using Ragas and DeepEval, with golden datasets and pass/fail thresholds. version: 1.0.0 author: thetestingacademy license: MIT tags: [rag, llm-evals, ragas, deepeval, faithfulness, context-precision, answer-relevancy, groundedness, retrieval] testingTypes: [llm-evals, integration, regression] frameworks: [ragas, deepeval, pytest] languages: [python] domains: [ai, llm, api] agents: [claude-code, cursor, github-copilot, windsurf, codex, aider, continue, cline, zed, bolt, gemini-cli, amp]
RAG Evaluation Metrics Skill
You are an expert in evaluating retrieval-augmented generation systems. When the user asks you to measure, test, or improve RAG quality, you compute the right metric for the right failure mode, score against a golden dataset, and enforce explicit thresholds. You never report a single "accuracy" number for a RAG system - retrieval and generation fail independently and must be measured independently.
Core Principles
- Retrieval and generation are separate subsystems. A correct answer from bad context is luck; a wrong answer from perfect context is a generation bug. Always measure both halves.
- Four metrics cover the RAG failure surface. Context Precision and Context Recall grade retrieval. Faithfulness and Answer Relevancy grade generation. Together they localize where a pipeline breaks.
- Faithfulness is not relevancy. A faithful answer makes no claims unsupported by the context. A relevant answer addresses the question. An answer can be faithful but off-topic, or on-topic but hallucinated.
- Groundedness == faithfulness for hallucination detection. When the goal is "no made-up facts," measure faithfulness/groundedness; it is the single most important production guardrail.
- Every metric needs a threshold and a golden set. A metric with no pass/fail line is a vanity number. Fix thresholds per metric and evaluate against curated question/ground-truth pairs.
- LLM-as-judge is the scoring engine - pin its model. Ragas and DeepEval use an LLM to score. Pin the judge model and temperature so scores are reproducible across runs.
- Context Recall requires ground-truth contexts; Context Precision does not. Choose metrics based on whether your golden set has reference answers, reference contexts, or both.
- Score distributions, not single questions. Report the mean and the count below threshold across the dataset. One bad question is noise; 20% below threshold is a regression.
The Four Core Metrics
| Metric | Grades | Question it answers | Needs ground truth? |
|---|---|---|---|
| Context Precision | Retrieval | Are the retrieved chunks that are relevant ranked at the top? | Reference answer or contexts |
| Context Recall | Retrieval | Did retrieval fetch all the chunks needed to answer? | Reference answer (ground truth) |
| Faithfulness / Groundedness | Generation | Is every claim in the answer supported by the retrieved context? | No (uses answer + context) |
| Answer Relevancy | Generation | Does the answer actually address the question? | No (uses question + answer) |
Golden Dataset Structure
A golden set is the contract. Store it as versioned JSON so diffs are reviewable.
# golden_dataset.py
from dataclasses import dataclass, field
@dataclass
class GoldenSample:
question: str
ground_truth: str # the ideal reference answer
reference_contexts: list[str] = field(default_factory=list)
GOLDEN_SET: list[GoldenSample] = [
GoldenSample(
question="What is the refund window for digital products?",
ground_truth="Digital products can be refunded within 14 days of purchase if unused.",
reference_contexts=[
"Refund policy: Digital goods are eligible for a refund within 14 days "
"of purchase, provided the license key has not been activated."
],
),
GoldenSample(
question="Does the Pro plan include priority support?",
ground_truth="Yes, the Pro plan includes 24/7 priority email and chat support.",
reference_contexts=[
"Pro plan benefits: unlimited projects, advanced analytics, and 24/7 "
"priority support over email and chat."
],
),
]
Evaluating with Ragas
Ragas computes all four metrics from a dataset of question, answer, contexts, and ground_truth. You produce answer and contexts by running your actual RAG pipeline.
# eval_ragas.py
import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
context_precision,
context_recall,
faithfulness,
answer_relevancy,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from golden_dataset import GOLDEN_SET
from my_rag_app import rag_pipeline # your system under test
def build_eval_dataset() -> Dataset:
rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for sample in GOLDEN_SET:
result = rag_pipeline(sample.question) # returns {"answer", "contexts"}
rows["question"].append(sample.question)
rows["answer"].append(result["answer"])
rows["contexts"].append(result["contexts"]) # list[str], retrieved chunks
rows["ground_truth"].append(sample.ground_truth)
return Dataset.from_dict(rows)
def run() -> None:
# Pin the judge model + temperature=0 for reproducible scores.
judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
dataset = build_eval_dataset()
result = evaluate(
dataset,
metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
llm=judge,
embeddings=embeddings,
)
df = result.to_pandas()
print(df[["question", "context_precision", "context_recall",
"faithfulness", "answer_relevancy"]])
print("\nMeans:\n", df[["context_precision", "context_recall",
"faithfulness", "answer_relevancy"]].mean())
if __name__ == "__main__":
assert os.environ.get("OPENAI_API_KEY"), "set OPENAI_API_KEY"
run()
Evaluating with DeepEval
DeepEval frames each metric as an assertable test case, which slots cleanly into pytest. It is the better choice when you want metric failures to fail a CI build.
# test_rag_deepeval.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
ContextualPrecisionMetric,
ContextualRecallMetric,
FaithfulnessMetric,
AnswerRelevancyMetric,
)
from golden_dataset import GOLDEN_SET
from my_rag_app import rag_pipeline
JUDGE = "gpt-4o-mini"
def _build_case(sample) -> LLMTestCase:
result = rag_pipeline(sample.question)
return LLMTestCase(
input=sample.question,
actual_output=result["answer"],
expected_output=sample.ground_truth,
retrieval_context=result["contexts"],
)
@pytest.mark.parametrize("sample", GOLDEN_SET, ids=lambda s: s.question[:40])
def test_rag_quality(sample):
case = _build_case(sample)
metrics = [
ContextualPrecisionMetric(threshold=0.8, model=JUDGE),
ContextualRecallMetric(threshold=0.8, model=JUDGE),
FaithfulnessMetric(threshold=0.9, model=JUDGE), # strictest: no hallucinations
AnswerRelevancyMetric(threshold=0.75, model=JUDGE),
]
# Fails the test (and the build) if any metric is below its threshold.
assert_test(case, metrics)
Run it like any pytest suite: deepeval test run test_rag_deepeval.py or plain pytest test_rag_deepeval.py.
Recommended Thresholds
Start here and tighten as the pipeline matures. Faithfulness is always the highest bar because hallucination is the most damaging failure.
THRESHOLDS = {
"faithfulness": 0.90, # strictest - production hallucination guard
"context_precision": 0.80, # good retrievers rank relevant chunks first
"context_recall": 0.80, # missing context is a retrieval/chunking bug
"answer_relevancy": 0.75, # answers should stay on-topic
}
def assert_thresholds(means: dict[str, float]) -> None:
failures = [
f"{m}: {means[m]:.3f} < {t:.2f}"
for m, t in THRESHOLDS.items()
if means.get(m, 0.0) < t
]
if failures:
raise AssertionError("RAG metrics below threshold:\n " + "\n ".join(failures))
Diagnosing With the Metric Matrix
Use the pair of scores to localize the defect instead of guessing:
def diagnose(scores: dict[str, float]) -> str:
retrieval_ok = (scores["context_precision"] >= 0.8
and scores["context_recall"] >= 0.8)
generation_ok = (scores["faithfulness"] >= 0.9
and scores["answer_relevancy"] >= 0.75)
if retrieval_ok and generation_ok:
return "Healthy."
if not retrieval_ok and generation_ok:
return ("Retrieval problem: fix chunking, embeddings, top_k, or reranking. "
"Generation is faithful to whatever it is given.")
if retrieval_ok and not generation_ok:
return ("Generation problem: context is good but the model hallucinates or "
"drifts. Tighten the prompt, lower temperature, add 'answer only "
"from context' instructions.")
return "Both layers failing - debug retrieval first; generation cannot recover from bad context."
Always debug retrieval before generation: a generator cannot produce a faithful answer from context that lacks the fact.
Best Practices
- Pin the judge model and set temperature to 0. LLM-as-judge scores drift run to run otherwise. Record the judge model in the eval report.
- Hold faithfulness to the highest threshold (>= 0.9). It is the direct measure of hallucination and the metric users feel most.
- Curate the golden set by hand, then grow it from production failures. Every real-world bad answer becomes a new golden sample (with the correct
ground_truth). - Report distribution, not just the mean. Track "count below threshold" - a 0.85 mean can hide ten 0.4 outliers.
- Separate the retrieval eval from the generation eval in your report. Two tables, not one blended score, so the diagnosis is immediate.
- Version the golden dataset alongside the prompt and retriever config. A score is only meaningful relative to a fixed dataset version.
- Use DeepEval when you want CI gating; use Ragas for exploratory metric sweeps. They share the same conceptual metrics.
- Sanity-check the judge on a few samples manually. If a human disagrees with the LLM judge on faithfulness, your threshold is meaningless.
Anti-Patterns to Avoid
- Reporting one "accuracy" number for the whole pipeline. It hides whether retrieval or generation failed and is impossible to act on.
- Evaluating generation without checking faithfulness. A fluent, relevant, completely fabricated answer scores well on relevancy alone.
- Using the same model as both generator and judge with temperature > 0. Scores become non-reproducible and self-flattering.
- Measuring Context Recall without ground-truth contexts or reference answers. The metric is undefined; you will get garbage scores.
- Tiny golden sets (under ~20 samples). Means are noisy and a single bad question swings the verdict.
- Treating a 0.8 mean as "passing" while ignoring the tail. The worst 10% of answers are what generate support tickets.
- Changing chunking, embeddings, and the prompt at once, then re-scoring. You cannot attribute the score delta to any single change.
When to Trigger This Skill
Trigger when the user asks to:
- Evaluate or "score" a RAG / retrieval-augmented pipeline
- Measure faithfulness, groundedness, hallucination rate, context precision/recall, or answer relevancy
- Set up Ragas or DeepEval for a RAG system
- Build a golden/eval dataset for retrieval QA
- Decide pass/fail thresholds for LLM answer quality
- Diagnose whether a RAG failure is in retrieval or generation
For regression gating in CI over time (detecting drift across builds), pair this with the RAG Regression Testing skill. For non-RAG agent evaluation, use the AI Agent Evaluation skill instead.