
RAG System Evaluation Metrics


URL: RAGAS docs, DeepEval, TruLens, arXiv · Type: RAG evaluation metrics / retrieval / generation / benchmarks · Date: 2025-2026 · Collected: Ralph Research PHASE 5


Prerequisites: RAG techniques and vector databases, Chunking strategies

Why This Matters

A RAG pipeline without metrics is guesswork. Precision@5 shows the share of relevant chunks in the top 5 (70% means 1.5 of the 5 chunks are junk). Faithfulness measures what fraction of the claims in the answer is supported by the retrieved context (below 0.7 signals hallucination risk). nDCG penalizes results returned in the wrong order. The RAGAS Score is the mean of four metrics and gives a single number for comparing pipelines. Without these numbers you cannot tell what exactly is failing: retrieval, generation, or both.

Key Concepts

RAG evaluation = Retrieval quality + Generation quality + End-to-end metrics.

graph LR
    subgraph Retrieval
        CR["Context Recall"]
        CP["Context Precision"]
        MRR["MRR"]
        NDCG["nDCG"]
    end
    subgraph Generation
        F["Faithfulness"]
        AR["Answer Relevance"]
        AC["Answer Correctness"]
        AS["Answer Similarity"]
    end
    subgraph System
        LAT["Latency"]
        COST["Cost"]
        TPT["Throughput"]
    end
    subgraph User
        SAT["Satisfaction"]
        HELP["Helpfulness"]
        TRUST["Trust"]
    end

    style Retrieval fill:#e8eaf6,stroke:#3f51b5
    style Generation fill:#e8f5e9,stroke:#4caf50
    style System fill:#fff3e0,stroke:#ff9800
    style User fill:#fce4ec,stroke:#e91e63

High faithfulness != a correct answer

A RAG system can be 100% faithful to the retrieved context and still give a wrong answer if the retrieved docs are irrelevant. Always evaluate retrieval quality AND generation quality separately. The RAGAS formula covers both: Context Precision/Recall + Faithfulness/Relevance.

Enterprise Priority

For enterprises, errors in retrieval or generation can mean compliance failures, reputational damage, or legal exposure. The question is not "does it work in tests?" but "will it hold up reliably at scale?"


1. Retrieval Metrics

| Metric | Description |
| --- | --- |
| Precision@K | Are the top-K documents relevant? |
| Recall@K | How much of the relevant info was retrieved? |
| MRR (Mean Reciprocal Rank) | Are correct docs ranked early? |
| nDCG (Normalized DCG) | Graded relevance with position weighting |
| Diversity metrics | Avoid repeatedly surfacing narrow content |

Target Ranges

| Metric | Good | Excellent |
| --- | --- | --- |
| Precision@5 | > 70% | > 85% |
| Recall@10 | > 75% | > 90% |
| MRR | > 0.6 | > 0.8 |
| nDCG@10 | > 0.7 | > 0.85 |
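
A minimal sketch of how these four retrieval metrics are computed over ranked doc ids; this is a plain reference implementation, not tied to any evaluation library:

import math

def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved doc ids that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant doc ids that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc; MRR is the mean over queries."""
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(retrieved, gains, k):
    """nDCG@k with graded relevance: `gains` maps doc id -> relevance grade."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: 5 retrieved chunks, 3 relevant in the collection, 2 of them found
retrieved = ["c1", "c7", "c3", "c9", "c4"]
relevant = {"c1", "c3", "c5"}
print(precision_at_k(retrieved, relevant, 5))   # 0.4
print(recall_at_k(retrieved, relevant, 5))      # 0.666...
print(reciprocal_rank(retrieved, relevant))     # 1.0 (first hit at rank 1)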

2. Generation Metrics

| Metric | Description |
| --- | --- |
| Faithfulness | Output grounded in retrieved docs? |
| Answer relevance | Does it address the query? |
| Citation coverage | Claims backed with sources? |
| Hallucination rate | Unsupported or fabricated text |
| Logical coherence | Does the answer make sense? |
| Completeness | Is the answer thorough? |

Targets

| Metric | Target |
| --- | --- |
| Faithfulness | > 90% |
| Hallucination rate | < 5% |
| Citation coverage | > 85% |
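
Faithfulness and hallucination rate are two sides of the same claim-level check. A rough sketch of the idea; the claim extraction and verification helpers are hypothetical, and in practice an LLM judge plays both roles:

# Claim-level faithfulness: share of answer claims supported by the retrieved context.
# `extract_claims` and `is_supported` are hypothetical hooks, usually backed by an LLM judge.

def faithfulness_score(answer: str, context: str, extract_claims, is_supported) -> float:
    claims = extract_claims(answer)           # split the answer into atomic claims
    if not claims:
        return 1.0                            # nothing asserted, nothing to contradict
    supported = sum(1 for c in claims if is_supported(c, context))
    return supported / len(claims)            # hallucination rate ~ 1 - faithfulness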

3. End-to-End Metrics

| Metric | Description |
| --- | --- |
| Correctness | Factually correct? |
| Factuality | Grounded in source material? |
| Latency | Response time under load |
| Cost | Compute spend per query |
| Safety/Compliance | Refusal rates, policy violations |

Production SLAs

| Metric | Typical SLA |
| --- | --- |
| Latency P50 | < 500ms |
| Latency P99 | < 2s |
| Availability | > 99.5% |

4. Tools

Landscape 2026

| Tool | Focus | License | Best For |
| --- | --- | --- | --- |
| RAGAS | RAG pipelines | Apache 2.0 | Research, RAG-only |
| DeepEval | LLM, RAG, agents | Apache 2.0 | Comprehensive testing |
| TruLens | RAG + tracing | MIT | Production monitoring |
| LangSmith | LangChain native | Proprietary | LangChain ecosystem |
| Phoenix (Arize) | Open-source observability | ELv2 | Full observability |

Feature Comparison

| Feature | RAGAS | DeepEval | TruLens | LangSmith |
| --- | --- | --- | --- | --- |
| RAG evaluation | Best | Yes | Yes | Yes |
| Agent evaluation | No | Best | Yes | Yes |
| Tracing | No | Limited | Best | Yes |
| Open source | Yes | Yes | Yes | No |
| Pytest integration | Custom | Native | No | No |
| Dashboard | No | Yes | Yes | Best |
| Cost | Free | Free | Free | $39/seat |

Metric Coverage

| Metric | RAGAS | DeepEval | TruLens |
| --- | --- | --- | --- |
| Faithfulness | Yes | Yes | Yes (groundedness) |
| Answer Relevance | Yes | Yes | Yes |
| Context Precision | Yes | Yes | Yes |
| Context Recall | Yes | Yes | Yes |
| Hallucination | No | Yes | Yes |
| Bias | No | Yes | Limited |
| Toxicity | No | Yes | Yes |
| Tool Correctness | No | Yes | No |

Tool Adoption (2026)

| Tool | GitHub Stars | Downloads/mo |
| --- | --- | --- |
| RAGAS | 8,000+ | 200,000+ |
| DeepEval | 3,000+ | 100,000+ |
| TruLens | 2,000+ | 80,000+ |

5. RAGAS

\[ \text{RAGAS} = \tfrac{1}{4}\,(\text{Faithfulness} + \text{Answer Relevance} + \text{Context Precision} + \text{Context Recall}) \]

| Metric | What It Measures | Range |
| --- | --- | --- |
| Faithfulness | Answer grounded in context | 0-1 |
| Answer Relevance | Answer addresses question | 0-1 |
| Context Precision | Relevant chunks retrieved | 0-1 |
| Context Recall | All needed info retrieved | 0-1 |
| Context Relevancy | Signal vs noise in context | 0-1 |

Strengths: purpose-built for RAG, no ground truth needed (LLM-as-judge), synthetic data generation. Limitations: RAG only (no agent/chatbot eval), metric opacity, no tracing.
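
A minimal usage sketch, assuming the classic ragas 0.1-style API (evaluate() over a Hugging Face Dataset with metric objects); the interface has shifted across releases, so treat the exact names as approximate:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One evaluation sample: question, generated answer, retrieved chunks, reference answer
data = {
    "question": ["What is RAG?"],
    "answer": ["RAG retrieves documents and conditions generation on them."],
    "contexts": [["RAG combines a retriever with an LLM generator..."]],
    "ground_truth": ["RAG augments LLM generation with retrieved documents."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores in [0, 1]; their mean is the RAGAS score above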


6. DeepEval

| Category | Metrics |
| --- | --- |
| RAG | Faithfulness, Answer Relevance, Contextual Recall, Contextual Precision |
| Generation | Hallucination, Bias, Toxicity |
| Agents | Tool Call Correctness, Task Completion |
| Conversation | Conversation Relevancy, Role Adherence |

Self-Explaining Metrics

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True  # Self-explaining: a natural-language reason is attached to the score
)

# A test case pairs the query, the generated answer, and the retrieved chunks
test_case = LLMTestCase(
    input="What is RAG?",
    actual_output="RAG retrieves documents and conditions generation on them.",
    retrieval_context=["RAG combines a retriever with an LLM generator..."],
)
evaluate([test_case], [metric])

# Output includes:
# Score: 0.85
# Reason: "The answer claims X but context only supports Y..."

Strengths: comprehensive (RAG + agents + chatbots), self-explaining, pytest native, CI/CD ready.


7. TruLens

| Component | Purpose |
| --- | --- |
| TruChain | LangChain integration |
| TruLlama | LlamaIndex integration |
| Feedback Functions | Evaluation metrics |
| Dashboard | Visual debugging |
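
A rough sketch of how these components fit together, assuming the older trulens_eval 0.x API (the package has since been renamed to trulens) and an existing LangChain chain called rag_chain; treat the exact names as assumptions and check the TruLens docs:

from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI as OpenAIProvider

provider = OpenAIProvider()

# Feedback function: answer relevance scored on the (input, output) pair
f_answer_relevance = Feedback(provider.relevance).on_input_output()

# Wrap an existing LangChain RAG chain (rag_chain is assumed to exist)
recorder = TruChain(rag_chain, app_id="rag_v1", feedbacks=[f_answer_relevance])

with recorder:
    rag_chain.invoke("What is RAG?")

Tru().run_dashboard()  # inspect traces and feedback scores visually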

Tracing

graph LR
    Q["Query: 'What is RAG?'"] --> R["Retrieval<br/>5 chunks, 0.12s"]
    R --> G["Generation<br/>gpt-4, 1245 tokens, 2.22s"]
    G --> FB["Feedback<br/>Relevance: 0.85<br/>Groundedness: 0.72<br/>Completeness: 0.90"]

    style Q fill:#e8eaf6,stroke:#3f51b5
    style R fill:#fff3e0,stroke:#ff9800
    style G fill:#e8f5e9,stroke:#4caf50
    style FB fill:#fce4ec,stroke:#e91e63

Strengths: best tracing/debugging, production monitoring focus.


8. Benchmarks

| Benchmark | Focus |
| --- | --- |
| RAGBench | General-purpose retrieval + generation |
| CRAG | Contextual relevance and grounding |
| LegalBench-RAG | Legal QA (compliance impact) |
| WixQA | Web-scale QA, factual grounding |
| T2-RAGBench | Multi-turn and task-oriented RAG |

9. Test Sets

| Type | Purpose | Quality |
| --- | --- | --- |
| Golden datasets | Foundation; auditable, reproducible | Highest |
| Synthetic (LLM-generated) | Scale coverage | Good |
| Production sampling | Real user queries | High |
| Adversarial | Edge cases, stress testing | Variable |
| Human-in-the-loop | Safety-critical, compliance | Non-negotiable |

Best Practices

  • Cover full scope of the system
  • Balance easy and hard queries
  • Freeze versions for comparability
  • Include governance rules for updates
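
For illustration, one way a single golden-dataset record could look; the field names are assumptions, not tied to any particular tool:

import json

# Hypothetical golden-dataset record (one JSONL line per query)
record = {
    "id": "q-0042",
    "question": "What does Precision@5 measure?",
    "ground_truth": "The share of relevant chunks among the top-5 retrieved.",
    "reference_context_ids": ["chunk-113", "chunk-287"],  # chunks that should be retrieved
    "tags": ["retrieval", "easy"],
    "dataset_version": "2026-01",  # frozen version, per the comparability rule above
}
print(json.dumps(record, ensure_ascii=False))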

10. LLM-as-Judge

When to Use

| Scenario | Use LLM-as-Judge? |
| --- | --- |
| No ground truth | Yes |
| Subjective quality | Yes |
| Fast iteration | Yes |
| High-stakes accuracy | No -- use human eval |
| Regulatory compliance | No -- use human eval |

Judge Model Selection

| Model | Quality | Speed | Cost |
| --- | --- | --- | --- |
| GPT-4o | Excellent | Medium | High |
| Claude 4 | Excellent | Medium | High |
| GPT-3.5-turbo | Good | Fast | Low |
| Local LLM | Variable | Fast | Free |

Judge Reliability

| Metric | GPT-4 Judge | Human Agreement |
| --- | --- | --- |
| Faithfulness | 85% | 90% |
| Relevance | 88% | 92% |
| Hallucination | 80% | 95% |
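
A hedged sketch of a pointwise judge with a fixed rubric plus a small ensemble of judge models to dampen single-judge inconsistency; the rubric, scoring scale, and model list are assumptions, and the call uses the openai>=1.0 chat API:

from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading whether the ANSWER is faithful to the CONTEXT.\n"
    "Reply with only a number from 0 (unsupported) to 1 (fully supported)."
)

def judge_faithfulness(question: str, answer: str, context: str, model: str = "gpt-4o") -> float:
    # Structured prompt: fixed rubric in the system role, fields clearly delimited
    user_msg = f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": user_msg},
        ],
    )
    return float(resp.choices[0].message.content.strip())

def ensemble_judge(question: str, answer: str, context: str,
                   models: tuple = ("gpt-4o", "gpt-4o-mini")) -> float:
    # Averaging several judges reduces variance; calibrate against human labels periodically
    scores = [judge_faithfulness(question, answer, context, model=m) for m in models]
    return sum(scores) / len(scores)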

11. CI/CD Integration

# .github/workflows/rag-eval.yml
name: RAG Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install deepeval ragas
      - name: Run evaluation
        run: pytest tests/eval/ --tb=short
      - name: Check thresholds
        run: deepeval test run --fail-on 0.7
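
A minimal sketch of the pytest suite the workflow above runs from tests/eval/; rag_pipeline stands in for the system under test and is an assumption:

# tests/eval/test_rag.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

GOLDEN = [
    {"question": "What is RAG?"},  # in practice, load the frozen golden dataset
]

@pytest.mark.parametrize("sample", GOLDEN)
def test_rag_quality(sample):
    answer, chunks = rag_pipeline(sample["question"])  # hypothetical: your RAG system under test
    test_case = LLMTestCase(
        input=sample["question"],
        actual_output=answer,
        retrieval_context=chunks,
    )
    # Fails the CI job when any metric score drops below its threshold
    assert_test(test_case, [
        FaithfulnessMetric(threshold=0.7),
        AnswerRelevancyMetric(threshold=0.7),
    ])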

Testing Strategy

| Stage | What to Test | Tool |
| --- | --- | --- |
| Development | Unit tests for components | DeepEval |
| Integration | Full pipeline | RAGAS |
| Production | Live monitoring | TruLens / LangSmith |
| Regression | Prevent degradation | All (CI/CD) |

12. Production

Trade-offs at Scale

| Decision | Trade-off |
| --- | --- |
| Raising K | Better recall vs slower responses and higher cost |
| Adding re-rankers | Higher precision vs multiplied latency |
| Multilingual pipelines | English performance vs degradation in other languages |

Evaluation Methods

| Method | Pros | Cons |
| --- | --- | --- |
| Computation-based (string match, embeddings) | Reproducible | Limited scope |
| LLM-as-a-judge | Captures nuance | Cost + variability |
| Human evaluation | Gold standard | Expensive, slow |

Best Practice: Blend all three. Computation for CI/CD, LLM-as-judge for coverage, human for calibration.


For Interviews

Q: "Как оценить RAG-систему?"

Three levels: (1) Retrieval: Precision@K, Recall@K, MRR, nDCG; the goal is relevant docs in the top-K. (2) Generation: faithfulness (>90%), hallucination rate (<5%), answer relevance, citation coverage. (3) End-to-end: correctness, latency (P50 <500ms), cost. Tools: RAGAS for RAG-specific metrics, DeepEval for comprehensive testing with pytest, TruLens for production tracing. RAGAS score = (Faithfulness + Answer Relevance + Context Precision + Context Recall) / 4. CI/CD: DeepEval in GitHub Actions, fail on threshold <0.7.

Q: "Сравните RAGAS, DeepEval, TruLens."

RAGAS -- RAG-only, Apache 2.0, 8K+ stars, best RAG metrics (faithfulness, context precision/recall), no ground truth needed. Limitations: no agents/chatbots, no tracing. DeepEval -- comprehensive (RAG + agents + chatbots), self-explaining metrics, pytest native, CI/CD ready. TruLens -- best tracing/debugging, MIT license, LangChain/LlamaIndex integration, visual dashboard. Strategy: RAGAS for research, DeepEval for CI/CD testing, TruLens for production monitoring.

Q: "Как использовать LLM-as-Judge для оценки?"

LLM-as-Judge: use LLM (GPT-4) to evaluate other LLM outputs. Best for: no ground truth, subjective quality, fast iteration. Reliability: GPT-4 agrees with humans 85% on faithfulness, 88% on relevance, 80% on hallucination. Key issues: position bias, self-preference, inconsistency. Mitigation: multiple judges (ensemble), calibrate with human feedback, structured prompts. NOT for: high-stakes accuracy, regulatory compliance -- use human eval.


Key Numbers

| Fact | Value |
| --- | --- |
| Faithfulness target | > 90% |
| Hallucination rate target | < 5% |
| Citation coverage target | > 85% |
| Precision@5 good | > 70% |
| Recall@10 good | > 75% |
| MRR good | > 0.6 |
| Latency P50 SLA | < 500ms |
| Latency P99 SLA | < 2s |
| RAGAS GitHub stars | 8,000+ |
| DeepEval GitHub stars | 3,000+ |
| Teams using automated eval | 65% |
| LLM-as-judge adoption | 75% |
| GPT-4 judge faithfulness agreement | 85% |
| Faithfulness industry median | 0.78 |
| Context Recall industry median | 0.70 |

Interview Questions

Conceptual:

  1. "Чем Precision@K отличается от Recall@K в контексте RAG?" -- Precision@K: доля релевантных среди top-K (из 5 чанков сколько полезных). Recall@K: доля найденных релевантных из всех существующих. При K=5 и 3 релевантных в коллекции: Precision@5=0.4 (⅖), Recall@5=0.67 (⅔). Precision важнее для quality, Recall для completeness.
  2. "Когда nDCG лучше MRR?" -- MRR учитывает только позицию первого релевантного результата. nDCG учитывает позиции всех релевантных + степень релевантности (graded). Для RAG где нужно несколько чанков -- nDCG предпочтительнее.
  3. "Как работает LLM-as-Judge и когда ему нельзя доверять?" -- LLM оценивает пары (вопрос, ответ) по rubric. Проблемы: position bias (предпочитает первый ответ), verbosity bias (длиннее = лучше), self-enhancement bias (модель оценивает себя выше). Для hallucination detection agreement с людьми только 80%.

System Design:

  1. "Спроектируйте pipeline для оценки RAG в CI/CD." -- DeepEval + pytest, curated test dataset (50-100 пар), thresholds: faithfulness >0.7, context recall >0.6. Synthetic augmentation через RAGAS для расширения dataset. Fail pipeline при regression >5%.

Common Mistakes

"Precision@K и Recall@K -- одинаково важны" -- Нет. В RAG низкий Precision отравляет контекст мусорными чанками (LLM галлюцинирует). Низкий Recall -- ответ неполный, но хотя бы корректный. Precision критичнее.

"nDCG@10 = 0.8 -- значит retrieval хороший" -- nDCG не учитывает, ЧТО именно найдено. Можно иметь nDCG 0.8, но все найденные чанки из одного документа (diversity problem). Всегда проверяй source diversity.

"Faithfulness = Answer Correctness" -- Faithfulness: ответ основан на контексте. Answer Correctness: ответ фактически верен. Можно быть faithful к неправильному контексту (retrieval ошибся, но generation честно его процитировала).


See Also


Sources

  1. LabelYourData -- "RAG Evaluation: 2026 Metrics and Benchmarks for Enterprise AI"
  2. DeepEval -- "The LLM Evaluation Framework" + "DeepEval vs Ragas"
  3. Maxim AI -- "The 5 Best RAG Evaluation Tools 2026"
  4. Comet ML -- "LLM Evaluation Frameworks: Head-to-Head Comparison"
  5. MLflow -- "Introducing DeepEval, RAGAS, and Phoenix Judges"
  6. Deepchecks -- "Best 9 RAG Evaluation Tools"
  7. arXiv -- "LLM-as-a-judge methods" (2412.05579)
  8. arXiv -- "RAG performance evaluation" (2411.03538)