
RAG System Evaluation Metrics


URL: RAGAS docs, DeepEval, TruLens, arXiv · Type: RAG evaluation metrics / retrieval / generation / benchmarks · Date: 2025-2026 · Collected: Ralph Research PHASE 5


Prerequisites: RAG techniques and vector databases, Chunking strategies

Why This Matters

A RAG pipeline without metrics is guesswork. Precision@5 shows the share of relevant chunks in the top 5 (70% means 1.5 of the 5 chunks are junk). Faithfulness measures what fraction of the claims in the answer is supported by the retrieved context (below 0.7 signals hallucination risk). nDCG penalizes results returned in the wrong order. The RAGAS Score is the mean of four metrics and gives a single number for comparing pipelines. Without these numbers you cannot tell what exactly is failing: retrieval, generation, or both.

Key Concepts

RAG evaluation = Retrieval quality + Generation quality + End-to-end metrics.

graph LR
    subgraph Retrieval
        CR["Context Recall"]
        CP["Context Precision"]
        MRR["MRR"]
        NDCG["nDCG"]
    end
    subgraph Generation
        F["Faithfulness"]
        AR["Answer Relevance"]
        AC["Answer Correctness"]
        AS["Answer Similarity"]
    end
    subgraph System
        LAT["Latency"]
        COST["Cost"]
        TPT["Throughput"]
    end
    subgraph User
        SAT["Satisfaction"]
        HELP["Helpfulness"]
        TRUST["Trust"]
    end

    style Retrieval fill:#e8eaf6,stroke:#3f51b5
    style Generation fill:#e8f5e9,stroke:#4caf50
    style System fill:#fff3e0,stroke:#ff9800
    style User fill:#fce4ec,stroke:#e91e63

High faithfulness != a correct answer

A RAG system can be 100% faithful to the retrieved context and still give a wrong answer if the retrieved docs are irrelevant. Always evaluate retrieval quality AND generation quality separately. The RAGAS formula covers both: Context Precision/Recall + Faithfulness/Relevance.

Enterprise Priority

For enterprises, errors in retrieval or generation can mean compliance failures, reputational damage, or legal exposure. The question is not "does it work in tests?" but "will it hold up reliably at scale?"


1. Retrieval Metrics

| Metric | Description |
| --- | --- |
| Precision@K | Are the top-K documents relevant? |
| Recall@K | How much of the relevant info was retrieved? |
| MRR (Mean Reciprocal Rank) | Are correct docs ranked early? |
| nDCG (Normalized DCG) | Graded relevance with position weighting |
| Diversity metrics | Avoid repeatedly surfacing narrow content |

Target Ranges

| Metric | Good | Excellent |
| --- | --- | --- |
| Precision@5 | > 70% | > 85% |
| Recall@10 | > 75% | > 90% |
| MRR | > 0.6 | > 0.8 |
| nDCG@10 | > 0.7 | > 0.85 |
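
A minimal sketch of how these four retrieval metrics are computed over ranked doc ids; this is a plain reference implementation, not tied to any evaluation library:

import math

def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved doc ids that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant doc ids that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc; MRR is the mean over queries."""
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(retrieved, gains, k):
    """nDCG@k with graded relevance: `gains` maps doc id -> relevance grade."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: 5 retrieved chunks, 3 relevant in the collection, 2 of them found
retrieved = ["c1", "c7", "c3", "c9", "c4"]
relevant = {"c1", "c3", "c5"}
print(precision_at_k(retrieved, relevant, 5))   # 0.4
print(recall_at_k(retrieved, relevant, 5))      # 0.666...
print(reciprocal_rank(retrieved, relevant))     # 1.0 (first hit at rank 1)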

2. Generation Metrics

| Metric | Description |
| --- | --- |
| Faithfulness | Output grounded in retrieved docs? |
| Answer relevance | Does it address the query? |
| Citation coverage | Claims backed with sources? |
| Hallucination rate | Unsupported or fabricated text |
| Logical coherence | Does the answer make sense? |
| Completeness | Is the answer thorough? |

Targets

| Metric | Target |
| --- | --- |
| Faithfulness | > 90% |
| Hallucination rate | < 5% |
| Citation coverage | > 85% |
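
Faithfulness and hallucination rate are two sides of the same claim-level check. A rough sketch of the idea; the claim extraction and verification helpers are hypothetical, and in practice an LLM judge plays both roles:

# Claim-level faithfulness: share of answer claims supported by the retrieved context.
# `extract_claims` and `is_supported` are hypothetical hooks, usually backed by an LLM judge.

def faithfulness_score(answer: str, context: str, extract_claims, is_supported) -> float:
    claims = extract_claims(answer)           # split the answer into atomic claims
    if not claims:
        return 1.0                            # nothing asserted, nothing to contradict
    supported = sum(1 for c in claims if is_supported(c, context))
    return supported / len(claims)            # hallucination rate ~ 1 - faithfulness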

3. End-to-End Metrics

| Metric | Description |
| --- | --- |
| Correctness | Factually correct? |
| Factuality | Grounded in source material? |
| Latency | Response time under load |
| Cost | Compute spend per query |
| Safety/Compliance | Refusal rates, policy violations |

Production SLAs

| Metric | Typical SLA |
| --- | --- |
| Latency P50 | < 500ms |
| Latency P99 | < 2s |
| Availability | > 99.5% |

4. Tools

Landscape 2026

| Tool | Focus | License | Best For |
| --- | --- | --- | --- |
| RAGAS | RAG pipelines | Apache 2.0 | Research, RAG-only |
| DeepEval | LLM, RAG, agents | Apache 2.0 | Comprehensive testing |
| TruLens | RAG + tracing | MIT | Production monitoring |
| LangSmith | LangChain native | Proprietary | LangChain ecosystem |
| Phoenix (Arize) | Open-source observability | ELv2 | Full observability |

Feature Comparison

| Feature | RAGAS | DeepEval | TruLens | LangSmith |
| --- | --- | --- | --- | --- |
| RAG evaluation | Best | Yes | Yes | Yes |
| Agent evaluation | No | Best | Yes | Yes |
| Tracing | No | Limited | Best | Yes |
| Open source | Yes | Yes | Yes | No |
| Pytest integration | Custom | Native | No | No |
| Dashboard | No | Yes | Yes | Best |
| Cost | Free | Free | Free | $39/seat |

Metric Coverage

| Metric | RAGAS | DeepEval | TruLens |
| --- | --- | --- | --- |
| Faithfulness | Yes | Yes | Yes (groundedness) |
| Answer Relevance | Yes | Yes | Yes |
| Context Precision | Yes | Yes | Yes |
| Context Recall | Yes | Yes | Yes |
| Hallucination | No | Yes | Yes |
| Bias | No | Yes | Limited |
| Toxicity | No | Yes | Yes |
| Tool Correctness | No | Yes | No |

Tool Adoption (2026)

| Tool | GitHub Stars | Downloads/mo |
| --- | --- | --- |
| RAGAS | 8,000+ | 200,000+ |
| DeepEval | 3,000+ | 100,000+ |
| TruLens | 2,000+ | 80,000+ |

5. RAGAS

\[ \text{RAGAS} = \tfrac{1}{4}\,(\text{Faithfulness} + \text{Answer Relevance} + \text{Context Precision} + \text{Context Recall}) \]

| Metric | What It Measures | Range |
| --- | --- | --- |
| Faithfulness | Answer grounded in context | 0-1 |
| Answer Relevance | Answer addresses question | 0-1 |
| Context Precision | Relevant chunks retrieved | 0-1 |
| Context Recall | All needed info retrieved | 0-1 |
| Context Relevancy | Signal vs noise in context | 0-1 |

Strengths: purpose-built for RAG, no ground truth needed (LLM-as-judge), synthetic data generation. Limitations: RAG only (no agent/chatbot eval), metric opacity, no tracing.
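
A minimal usage sketch, assuming the classic ragas 0.1-style API (evaluate() over a Hugging Face Dataset with metric objects); the interface has shifted across releases, so treat the exact names as approximate:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One evaluation sample: question, generated answer, retrieved chunks, reference answer
data = {
    "question": ["What is RAG?"],
    "answer": ["RAG retrieves documents and conditions generation on them."],
    "contexts": [["RAG combines a retriever with an LLM generator..."]],
    "ground_truth": ["RAG augments LLM generation with retrieved documents."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores in [0, 1]; their mean is the RAGAS score above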


6. DeepEval

| Category | Metrics |
| --- | --- |
| RAG | Faithfulness, Answer Relevance, Contextual Recall, Contextual Precision |
| Generation | Hallucination, Bias, Toxicity |
| Agents | Tool Call Correctness, Task Completion |
| Conversation | Conversation Relevancy, Role Adherence |

Self-Explaining Metrics

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True  # Self-explaining: a natural-language reason is attached to the score
)

# A test case pairs the query, the generated answer, and the retrieved chunks
test_case = LLMTestCase(
    input="What is RAG?",
    actual_output="RAG retrieves documents and conditions generation on them.",
    retrieval_context=["RAG combines a retriever with an LLM generator..."],
)
evaluate([test_case], [metric])

# Output includes:
# Score: 0.85
# Reason: "The answer claims X but context only supports Y..."

Strengths: comprehensive (RAG + agents + chatbots), self-explaining, pytest native, CI/CD ready.


7. TruLens

| Component | Purpose |
| --- | --- |
| TruChain | LangChain integration |
| TruLlama | LlamaIndex integration |
| Feedback Functions | Evaluation metrics |
| Dashboard | Visual debugging |
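
A rough sketch of how these components fit together, assuming the older trulens_eval 0.x API (the package has since been renamed to trulens) and an existing LangChain chain called rag_chain; treat the exact names as assumptions and check the TruLens docs:

from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI as OpenAIProvider

provider = OpenAIProvider()

# Feedback function: answer relevance scored on the (input, output) pair
f_answer_relevance = Feedback(provider.relevance).on_input_output()

# Wrap an existing LangChain RAG chain (rag_chain is assumed to exist)
recorder = TruChain(rag_chain, app_id="rag_v1", feedbacks=[f_answer_relevance])

with recorder:
    rag_chain.invoke("What is RAG?")

Tru().run_dashboard()  # inspect traces and feedback scores visually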

Tracing

graph LR
    Q["Query: 'What is RAG?'"] --> R["Retrieval<br/>5 chunks, 0.12s"]
    R --> G["Generation<br/>gpt-4, 1245 tokens, 2.22s"]
    G --> FB["Feedback<br/>Relevance: 0.85<br/>Groundedness: 0.72<br/>Completeness: 0.90"]

    style Q fill:#e8eaf6,stroke:#3f51b5
    style R fill:#fff3e0,stroke:#ff9800
    style G fill:#e8f5e9,stroke:#4caf50
    style FB fill:#fce4ec,stroke:#e91e63

Strengths: best tracing/debugging, production monitoring focus.


8. Benchmarks

| Benchmark | Focus |
| --- | --- |
| RAGBench | General-purpose retrieval + generation |
| CRAG | Contextual relevance and grounding |
| LegalBench-RAG | Legal QA (compliance impact) |
| WixQA | Web-scale QA, factual grounding |
| T2-RAGBench | Multi-turn and task-oriented RAG |

9. Test Sets

| Type | Purpose | Quality |
| --- | --- | --- |
| Golden datasets | Foundation; auditable, reproducible | Highest |
| Synthetic (LLM-generated) | Scale coverage | Good |
| Production sampling | Real user queries | High |
| Adversarial | Edge cases, stress testing | Variable |
| Human-in-the-loop | Safety-critical, compliance | Non-negotiable |

Best Practices

  • Cover full scope of the system
  • Balance easy and hard queries
  • Freeze versions for comparability
  • Include governance rules for updates
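
For illustration, one way a single golden-dataset record could look; the field names are assumptions, not tied to any particular tool:

import json

# Hypothetical golden-dataset record (one JSONL line per query)
record = {
    "id": "q-0042",
    "question": "What does Precision@5 measure?",
    "ground_truth": "The share of relevant chunks among the top-5 retrieved.",
    "reference_context_ids": ["chunk-113", "chunk-287"],  # chunks that should be retrieved
    "tags": ["retrieval", "easy"],
    "dataset_version": "2026-01",  # frozen version, per the comparability rule above
}
print(json.dumps(record, ensure_ascii=False))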

10. LLM-as-Judge

When to Use

| Scenario | Use LLM-as-Judge? |
| --- | --- |
| No ground truth | Yes |
| Subjective quality | Yes |
| Fast iteration | Yes |
| High-stakes accuracy | No -- use human eval |
| Regulatory compliance | No -- use human eval |

Judge Model Selection

| Model | Quality | Speed | Cost |
| --- | --- | --- | --- |
| GPT-4o | Excellent | Medium | High |
| Claude 4 | Excellent | Medium | High |
| GPT-3.5-turbo | Good | Fast | Low |
| Local LLM | Variable | Fast | Free |

Judge Reliability

| Metric | GPT-4 Judge | Human Agreement |
| --- | --- | --- |
| Faithfulness | 85% | 90% |
| Relevance | 88% | 92% |
| Hallucination | 80% | 95% |
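
A hedged sketch of a pointwise judge with a fixed rubric plus a small ensemble of judge models to dampen single-judge inconsistency; the rubric, scoring scale, and model list are assumptions, and the call uses the openai>=1.0 chat API:

from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading whether the ANSWER is faithful to the CONTEXT.\n"
    "Reply with only a number from 0 (unsupported) to 1 (fully supported)."
)

def judge_faithfulness(question: str, answer: str, context: str, model: str = "gpt-4o") -> float:
    # Structured prompt: fixed rubric in the system role, fields clearly delimited
    user_msg = f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": user_msg},
        ],
    )
    return float(resp.choices[0].message.content.strip())

def ensemble_judge(question: str, answer: str, context: str,
                   models: tuple = ("gpt-4o", "gpt-4o-mini")) -> float:
    # Averaging several judges reduces variance; calibrate against human labels periodically
    scores = [judge_faithfulness(question, answer, context, model=m) for m in models]
    return sum(scores) / len(scores)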

11. CI/CD Integration

# .github/workflows/rag-eval.yml
name: RAG Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install deepeval ragas
      - name: Run evaluation
        run: pytest tests/eval/ --tb=short
      - name: Check thresholds
        run: deepeval test run --fail-on 0.7
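
A minimal sketch of the pytest suite the workflow above runs from tests/eval/; rag_pipeline stands in for the system under test and is an assumption:

# tests/eval/test_rag.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

GOLDEN = [
    {"question": "What is RAG?"},  # in practice, load the frozen golden dataset
]

@pytest.mark.parametrize("sample", GOLDEN)
def test_rag_quality(sample):
    answer, chunks = rag_pipeline(sample["question"])  # hypothetical: your RAG system under test
    test_case = LLMTestCase(
        input=sample["question"],
        actual_output=answer,
        retrieval_context=chunks,
    )
    # Fails the CI job when any metric score drops below its threshold
    assert_test(test_case, [
        FaithfulnessMetric(threshold=0.7),
        AnswerRelevancyMetric(threshold=0.7),
    ])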

Testing Strategy

| Stage | What to Test | Tool |
| --- | --- | --- |
| Development | Unit tests for components | DeepEval |
| Integration | Full pipeline | RAGAS |
| Production | Live monitoring | TruLens / LangSmith |
| Regression | Prevent degradation | All (CI/CD) |

12. Production

Trade-offs at Scale

| Decision | Trade-off |
| --- | --- |
| Raising K | Better recall vs slower responses and higher cost |
| Adding re-rankers | Higher precision vs multiplied latency |
| Multilingual pipelines | English performance vs degradation in other languages |

Evaluation Methods

| Method | Pros | Cons |
| --- | --- | --- |
| Computation-based (string match, embeddings) | Reproducible | Limited scope |
| LLM-as-a-judge | Captures nuance | Cost + variability |
| Human evaluation | Gold standard | Expensive, slow |

Best Practice: Blend all three. Computation for CI/CD, LLM-as-judge for coverage, human for calibration.


For Interviews

Q: "Как оценить RAG-систему?"

Three levels: (1) Retrieval: Precision@K, Recall@K, MRR, nDCG; the goal is relevant docs in the top-K. (2) Generation: faithfulness (>90%), hallucination rate (<5%), answer relevance, citation coverage. (3) End-to-end: correctness, latency (P50 <500ms), cost. Tools: RAGAS for RAG-specific metrics, DeepEval for comprehensive testing with pytest, TruLens for production tracing. RAGAS score = (Faithfulness + Answer Relevance + Context Precision + Context Recall) / 4. CI/CD: DeepEval in GitHub Actions, fail on threshold <0.7.

Q: "Сравните RAGAS, DeepEval, TruLens."

RAGAS -- RAG-only, Apache 2.0, 8K+ stars, best RAG metrics (faithfulness, context precision/recall), no ground truth needed. Limitations: no agents/chatbots, no tracing. DeepEval -- comprehensive (RAG + agents + chatbots), self-explaining metrics, pytest native, CI/CD ready. TruLens -- best tracing/debugging, MIT license, LangChain/LlamaIndex integration, visual dashboard. Strategy: RAGAS for research, DeepEval for CI/CD testing, TruLens for production monitoring.

Q: "Как использовать LLM-as-Judge для оценки?"

LLM-as-Judge: use LLM (GPT-4) to evaluate other LLM outputs. Best for: no ground truth, subjective quality, fast iteration. Reliability: GPT-4 agrees with humans 85% on faithfulness, 88% on relevance, 80% on hallucination. Key issues: position bias, self-preference, inconsistency. Mitigation: multiple judges (ensemble), calibrate with human feedback, structured prompts. NOT for: high-stakes accuracy, regulatory compliance -- use human eval.


Key Numbers

| Fact | Value |
| --- | --- |
| Faithfulness target | > 90% |
| Hallucination rate target | < 5% |
| Citation coverage target | > 85% |
| Precision@5 good | > 70% |
| Recall@10 good | > 75% |
| MRR good | > 0.6 |
| Latency P50 SLA | < 500ms |
| Latency P99 SLA | < 2s |
| RAGAS GitHub stars | 8,000+ |
| DeepEval GitHub stars | 3,000+ |
| Teams using automated eval | 65% |
| LLM-as-judge adoption | 75% |
| GPT-4 judge faithfulness agreement | 85% |
| Faithfulness industry median | 0.78 |
| Context Recall industry median | 0.70 |

Interview Questions

Conceptual:

  1. "Чем Precision@K отличается от Recall@K в контексте RAG?" -- Precision@K: доля релевантных среди top-K (из 5 чанков сколько полезных). Recall@K: доля найденных релевантных из всех существующих. При K=5 и 3 релевантных в коллекции: Precision@5=0.4 (⅖), Recall@5=0.67 (⅔). Precision важнее для quality, Recall для completeness.
  2. "Когда nDCG лучше MRR?" -- MRR учитывает только позицию первого релевантного результата. nDCG учитывает позиции всех релевантных + степень релевантности (graded). Для RAG где нужно несколько чанков -- nDCG предпочтительнее.
  3. "Как работает LLM-as-Judge и когда ему нельзя доверять?" -- LLM оценивает пары (вопрос, ответ) по rubric. Проблемы: position bias (предпочитает первый ответ), verbosity bias (длиннее = лучше), self-enhancement bias (модель оценивает себя выше). Для hallucination detection agreement с людьми только 80%.

System Design:

  1. "Спроектируйте pipeline для оценки RAG в CI/CD." -- DeepEval + pytest, curated test dataset (50-100 пар), thresholds: faithfulness >0.7, context recall >0.6. Synthetic augmentation через RAGAS для расширения dataset. Fail pipeline при regression >5%.

Common Mistakes

"Precision@K и Recall@K -- одинаково важны" -- Нет. В RAG низкий Precision отравляет контекст мусорными чанками (LLM галлюцинирует). Низкий Recall -- ответ неполный, но хотя бы корректный. Precision критичнее.

"nDCG@10 = 0.8 -- значит retrieval хороший" -- nDCG не учитывает, ЧТО именно найдено. Можно иметь nDCG 0.8, но все найденные чанки из одного документа (diversity problem). Всегда проверяй source diversity.

"Faithfulness = Answer Correctness" -- Faithfulness: ответ основан на контексте. Answer Correctness: ответ фактически верен. Можно быть faithful к неправильному контексту (retrieval ошибся, но generation честно его процитировала).


See Also


Sources

  1. LabelYourData -- "RAG Evaluation: 2026 Metrics and Benchmarks for Enterprise AI"
  2. DeepEval -- "The LLM Evaluation Framework" + "DeepEval vs Ragas"
  3. Maxim AI -- "The 5 Best RAG Evaluation Tools 2026"
  4. Comet ML -- "LLM Evaluation Frameworks: Head-to-Head Comparison"
  5. MLflow -- "Introducing DeepEval, RAGAS, and Phoenix Judges"
  6. Deepchecks -- "Best 9 RAG Evaluation Tools"
  7. arXiv -- "LLM-as-a-judge methods" (2412.05579)
  8. arXiv -- "RAG performance evaluation" (2411.03538)