# LLM Evaluation Frameworks

~2 minute read

Prerequisites: LLM Evaluation Metrics | LLM Observability

DeepEval (14+ metrics, pytest, CI/CD) and Ragas (6 RAG metrics, synthetic data) are the two main open-source frameworks for automated LLM evaluation in 2026. DeepEval covers RAG, agents, and red teaming at roughly $0.50-1.50 per 1K evaluations (LLM judge calls). Ragas is simpler but limited to RAG. In production, both plug into CI/CD: every PR automatically runs Faithfulness (>0.8), Answer Relevancy (>0.7), and Hallucination (<0.1) checks. Without an evaluation pipeline, 40%+ of LLM applications degrade within 2-4 weeks due to prompt drift and data distribution shift.
## Part 1: Overview

### Why LLM Evaluation Matters in 2025-2026

Challenges:

- Non-deterministic outputs - the same input can produce different outputs
- Subjective quality - what counts as "good" varies by use case
- Emergent behaviors - models do unexpected things
- Cost of evaluation - human labeling is expensive

Evaluation Types:

| Type | Purpose | Examples |
|------|---------|----------|
| Deterministic | Exact matches, regex | Accuracy, F1 |
| LLM-based | Model-as-judge | G-Eval, DeepEval metrics |
| Human | Ground truth comparison | A/B testing, surveys |
| Behavioral | System-level performance | Task completion rate |
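The first two rows differ mainly in cost and robustness: deterministic checks are cheap and reproducible but brittle to paraphrases, while LLM-based judging (what DeepEval and Ragas automate below) handles free-form text at the price of extra LLM calls. A minimal sketch of the deterministic side (function names are illustrative):

```python
import re

def exact_match(prediction: str, reference: str) -> bool:
    # Deterministic: cheap and reproducible, but fails on valid paraphrases
    return prediction.strip().lower() == reference.strip().lower()

def matches_pattern(prediction: str, pattern: str) -> bool:
    # e.g. pattern=r"\bParis\b" for a capital-of-France question
    return re.search(pattern, prediction) is not None
```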
## Part 2: DeepEval

### 2.1 Overview

**Developer:** Confident AI
**Current Version:** 2.x (2025)
**License:** Apache 2.0

Key Features:

- 14+ built-in metrics
- Pytest integration
- LLM-as-judge with explainability
- Confident AI platform integration
- RAG evaluation (RAGAS-inspired)
- Red teaming support
- CI/CD ready
### 2.2 Installation & Setup
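Install from PyPI:

```bash
pip install deepeval
```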
```python
import os

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# DeepEval defaults to OpenAI models as the LLM judge
os.environ["OPENAI_API_KEY"] = "your-key"
```
### 2.3 Core Metrics
| Metric | What It Measures | Score Range |
|---|---|---|
| Answer Relevancy | Does answer address the question? | 0-1 |
| Faithfulness | Is answer grounded in context? | 0-1 |
| Contextual Recall | Is context relevant to expected output? | 0-1 |
| Contextual Precision | Is retrieved context noise-free? | 0-1 |
| Hallucination | Does output contain invented facts? | 0-1 (lower is better) |
| Bias | Is output biased or unfair? | 0-1 (lower is better) |
| Toxicity | Is output harmful or offensive? | 0-1 (lower is better) |
| Conversational | Multi-turn coherence | 0-1 |
| Tool Correctness | Tool calling accuracy | 0-1 |
| SQL Correctness | SQL query validity | 0-1 |
### 2.4 Usage Examples

**Basic Evaluation:**
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris"
)

metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```
**RAG Pipeline Evaluation:**
```python
from deepeval.metrics import FaithfulnessMetric, ContextualRecallMetric

test_case = LLMTestCase(
    input="What is ML?",
    actual_output="Machine Learning is a subset of AI...",
    # ContextualRecallMetric checks retrieval_context against expected_output,
    # so a reference answer is required here
    expected_output="ML is a field of AI where systems learn from data.",
    retrieval_context=[
        "Machine Learning is a field of AI that enables systems to learn from data.",
        "ML algorithms build models based on training data."
    ]
)

metrics = [
    FaithfulnessMetric(threshold=0.8),
    ContextualRecallMetric(threshold=0.7)
]
evaluate(test_cases=[test_case], metrics=metrics)
```
**Pytest Integration:**
```python
# test_rag.py
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# rag_pipeline and retriever are your application code under test (placeholders)
from my_app import rag_pipeline, retriever

@pytest.mark.parametrize("query,expected", [
    ("What is Python?", "A programming language"),
    ("What is ML?", "A subset of AI"),
])
def test_rag_quality(query, expected):
    actual_output = rag_pipeline(query)
    context = retriever.get_context(query)
    test_case = LLMTestCase(
        input=query,
        actual_output=actual_output,
        expected_output=expected,
        retrieval_context=context
    )
    metric = FaithfulnessMetric(threshold=0.7)
    assert_test(test_case, [metric])
```
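Run either with plain `pytest` or via the DeepEval CLI wrapper (`deepeval test run test_rag.py`), which adds parallel execution and result caching on top of pytest.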
### 2.5 Custom Metrics
```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomToneMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Custom logic: check for professional tone markers
        output = test_case.actual_output.lower()
        professional_words = ["please", "thank you", "would"]
        count = sum(1 for word in professional_words if word in output)
        self.score = min(count / 2, 1.0)
        self.success = self.score >= self.threshold
        self.reason = f"Found {count} professional markers"
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        # DeepEval reads the display name from __name__, not from self.name
        return "Tone"
```
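A custom metric plugs into `evaluate` and `assert_test` exactly like the built-ins; a minimal usage sketch:

```python
test_case = LLMTestCase(
    input="Can you help me?",
    actual_output="Of course! Would you please share more details? Thank you."
)
evaluate(test_cases=[test_case], metrics=[CustomToneMetric(threshold=0.5)])
```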
### 2.6 CI/CD Integration
```yaml
# .github/workflows/llm_eval.yml
name: LLM Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install deepeval pytest
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pytest tests/llm/ --tb=short
      - name: Upload results to Confident AI
        if: always()
        run: |
          deepeval login --api-key ${{ secrets.CONFIDENT_API_KEY }}
          deepeval push
```
## Part 3: Ragas

### 3.1 Overview

**Developer:** Exploding Gradients
**Current Version:** 0.2.x (2025)
**License:** Apache 2.0

Key Features:

- RAG-focused metrics
- Reference-free evaluation
- Test data generation
- Simple API
- Research-oriented
### 3.2 Installation
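Ragas installs from PyPI; the Hugging Face `datasets` package is used in the example below:

```bash
pip install ragas datasets
```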
### 3.3 Core Metrics
| Metric | What It Measures | Notes |
|---|---|---|
| Faithfulness | Is answer grounded in context? | Similar to DeepEval |
| Answer Relevancy | Does answer address question? | Uses LLM-as-judge |
| Context Precision | Is retrieved context relevant? | Signal-to-noise ratio |
| Context Recall | Is context comprehensive? | Coverage of ground truth |
| Answer Similarity | Semantic similarity to expected | Uses embeddings |
| Answer Correctness | Overall answer quality | Weighted combination |
### 3.4 Usage Example
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["What is AI?", "What is ML?"],
    "answer": [
        "AI is artificial intelligence...",
        "ML is machine learning..."
    ],
    "contexts": [
        ["AI refers to computer systems that can perform tasks requiring intelligence."],
        ["ML is a subset of AI that uses data to learn patterns."]
    ],
    "ground_truth": [
        "AI is intelligence demonstrated by machines.",
        "ML is learning from data."
    ]
}

dataset = Dataset.from_dict(data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)
print(result)
```
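`print(result)` shows the aggregated per-metric scores; per-sample scores are available as a DataFrame via the result's `to_pandas()` method:

```python
df = result.to_pandas()  # one row per sample, one column per metric
print(df[["faithfulness", "answer_relevancy"]].describe())
```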
## Part 4: DeepEval vs Ragas Comparison

### 4.1 Feature Comparison
| Feature | DeepEval | Ragas |
|---|---|---|
| RAG Metrics | ✅ 14+ | ✅ 6 |
| Agent Evaluation | ✅ | ❌ |
| Custom Metrics | ✅ Easy | ⚠️ Limited |
| Pytest Integration | ✅ Native | ❌ |
| CI/CD Ready | ✅ | ⚠️ Manual |
| Red Teaming | ✅ Built-in | ❌ |
| Synthetic Data | ⚠️ Via platform | ✅ Built-in |
| Dashboard | ✅ Confident AI | ❌ |
| Self-Hosted | ✅ | ✅ |
| Model Support | ✅ Multi-provider | ✅ OpenAI-focused |
### 4.2 When to Choose Which

Choose DeepEval when:

- You need pytest integration
- You are building production CI/CD pipelines
- You are evaluating AI agents, not just RAG
- You need custom metrics
- You want a comprehensive evaluation platform
- Red teaming is important

Choose Ragas when:

- You are focused purely on RAG evaluation
- You need synthetic test data generation
- You prefer a simpler API
- The focus is research/experimentation
- You don't need pytest integration
### 4.3 Performance Comparison
| Metric | DeepEval | Ragas |
|---|---|---|
| Setup complexity | Medium | Low |
| Execution speed | Medium (parallel support) | Medium |
| LLM calls per eval | 2-5 per metric | 2-3 per metric |
| Memory usage | Medium | Low |
## Part 5: Other Top Platforms 2026

### 5.1 Platform Comparison
| Platform | Focus | Pricing | Best For |
|---|---|---|---|
| DeepEval | Comprehensive eval | Free + Confident AI | Production apps |
| Ragas | RAG evaluation | Free | RAG pipelines |
| Deepchecks | ML monitoring | Free + Enterprise | ML ops |
| MLflow | ML lifecycle | Free | Experiment tracking |
| TruLens | Neural app eval | Free | RAG + agents |
| LangSmith | LLM observability | Paid | Production monitoring |
### 5.2 TruLens

Key Features:

- RAG triad: Context Relevance, Groundedness, Answer Relevance
- Feedback functions
- Chain/agent tracing
- Dashboard visualization
```python
# TruLens moved from the `trulens_eval` package to `trulens` in late 2024.
# Old: from trulens_eval import ... (deprecated)
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as fOpenAI

provider = fOpenAI()

# `chain` is your LangChain app; select its retrieved context from the record,
# then score groundedness of the output against that context
context = TruChain.select_context(chain)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
)

tru_recorder = TruChain(
    chain,
    feedbacks=[f_groundedness]
)
```
### 5.3 LangSmith

Key Features:

- Trace visualization
- Dataset management
- Evaluation automation
- Production monitoring
- Prompt management
```python
from langsmith import Client, evaluate

client = Client()

# Create a dataset (add examples separately, e.g. via client.create_examples)
dataset = client.create_dataset("my-eval-dataset")

# Run evaluation: the target is a callable under test; my_chain and the
# two evaluators are placeholders defined elsewhere in your codebase
results = evaluate(
    my_chain,
    data="my-eval-dataset",
    evaluators=[accuracy_evaluator, hallucination_evaluator]
)
```
## Part 6: Evaluation Best Practices

### 6.1 Test Data Strategy
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Golden dataset | Production | High quality | Expensive to create |
| Synthetic data | Early development | Fast, cheap | May not reflect reality |
| Production sampling | Ongoing | Real data | Privacy concerns |
| Adversarial examples | Red teaming | Tests robustness | Labor intensive |
### 6.2 Metric Selection Guide
| Use Case | Primary Metrics |
|---|---|
| RAG chatbot | Faithfulness, Answer Relevancy, Context Precision |
| Code generation | Correctness, Syntax Validity |
| Summarization | ROUGE, BERTScore, Hallucination |
| Translation | BLEU, COMET, Answer Similarity |
| Agent | Tool Correctness, Task Completion |
| Content moderation | Toxicity, Bias |
### 6.3 Evaluation Pipeline
```mermaid
graph TD
    A["Test Cases<br/>(Golden / Synthetic)"] --> B["LLM System"]
    B --> C["Actual Output"]
    C --> D["Evaluation Framework"]
    D --> E["Metrics<br/>(Multiple)"]
    D --> F["Results & Reports"]
    E --> G["Threshold Checks"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#f3e5f5,stroke:#9c27b0
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fce4ec,stroke:#c62828
```
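The "Threshold Checks" step is usually a simple gate over aggregated scores that fails the pipeline when any metric misses its target. A minimal sketch (threshold values and the `scores` input are illustrative):

```python
# Direction matters: hallucination-style metrics pass when BELOW threshold
THRESHOLDS = {"faithfulness": 0.8, "answer_relevancy": 0.7, "hallucination": 0.1}
HIGHER_IS_BETTER = {"faithfulness": True, "answer_relevancy": True, "hallucination": False}

def passes_gate(scores: dict) -> bool:
    for name, threshold in THRESHOLDS.items():
        score = scores[name]
        ok = score >= threshold if HIGHER_IS_BETTER[name] else score <= threshold
        if not ok:
            print(f"FAIL {name}: {score:.2f} vs threshold {threshold}")
            return False
    return True

print(passes_gate({"faithfulness": 0.85, "answer_relevancy": 0.74, "hallucination": 0.05}))  # True
```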
## Part 7: Interview-Relevant Numbers

### Framework Statistics
| Metric | DeepEval | Ragas |
|---|---|---|
| GitHub stars | 4,000+ | 7,000+ |
| Built-in metrics | 14+ | 6 |
| LLM providers | 5+ | 2-3 |
| Pytest support | ✅ | ❌ |
| CI/CD templates | 3+ | 0 |
### Evaluation Costs
| Metric Type | LLM Calls | Est. Cost (per 1K evals) |
|---|---|---|
| Answer Relevancy | 2-3 | $0.10-$0.30 |
| Faithfulness | 2-4 | $0.15-$0.40 |
| Hallucination | 1-2 | $0.05-$0.20 |
| Full RAG suite | 8-12 | $0.50-$1.50 |
### Industry Adoption
| Framework | Enterprise Adoption |
|---|---|
| DeepEval | Growing (2025) |
| Ragas | High (RAG-focused) |
| LangSmith | High (LangChain users) |
| TruLens | Medium |
**Misconception: Faithfulness > 0.8 means no hallucinations**

The Faithfulness metric (DeepEval/Ragas) verifies that the answer is grounded in the context, but it does not catch: (1) facts missing from the context (that is Context Recall's job), (2) logical errors built on top of correct facts, (3) subtle hallucinations inside otherwise correct sentences. In production, run at least 3 metrics together: Faithfulness + Hallucination + Answer Relevancy, as in the sketch below. A single metric is not enough for a 95%+ detection rate.
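A minimal DeepEval sketch of that three-metric combination (note that `HallucinationMetric` reads the test case's `context`, while `FaithfulnessMetric` reads `retrieval_context`):

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase

grounding = ["ML is a field of AI that learns patterns from data."]
test_case = LLMTestCase(
    input="What is ML?",
    actual_output="Machine Learning is a subset of AI...",
    retrieval_context=grounding,  # used by FaithfulnessMetric
    context=grounding,            # used by HallucinationMetric
)

evaluate(
    test_cases=[test_case],
    metrics=[
        FaithfulnessMetric(threshold=0.8),
        HallucinationMetric(threshold=0.1),  # score below threshold passes
        AnswerRelevancyMetric(threshold=0.7),
    ],
)
```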
**Misconception: synthetic test data replaces a golden dataset**

Ragas can generate synthetic test data, but it reflects the distribution of the generator model, not of real users. In production, 30-40% of edge cases come from queries that synthetic data does not cover: typos, mixed languages, adversarial inputs. A golden dataset of 200-500 real queries with human labels is mandatory for calibration. Synthetic data is for extending coverage, not for replacing real data.
**Misconception: DeepEval and Ragas produce the same Faithfulness scores**

Both libraries measure Faithfulness, but the implementations differ: DeepEval uses a claim extraction + verification chain (2-4 LLM calls), while Ragas uses statement extraction + NLI (2-3 LLM calls). On the same data, scores can diverge by 10-20%. Comparing absolute values across frameworks is invalid: pick one framework and track the trend.
## Interview Questions

Q: How do you build a CI/CD pipeline for evaluating an LLM application?

Red flag: "We run pytest with a few examples before deploying"

Strong answer: "Three levels: (1) Unit: DeepEval pytest integration, 50-100 test cases from the golden dataset, with Faithfulness >0.8, Answer Relevancy >0.7, Hallucination <0.1; runs on every PR. (2) Integration: full RAG pipeline eval on staging, 500+ test cases, Context Precision + Context Recall; weekly. (3) Production monitoring: 1-5% sampling via LangSmith/Langfuse, drift detection, alerting on >10% degradation. Costs: ~$0.50-1.50 per 1K evals for LLM-based metrics. GitHub Actions + deepeval push for the dashboard."
Q: DeepEval vs Ragas: when to choose which?

Red flag: "Ragas is more popular (7K+ stars), so it's better"

Strong answer: "DeepEval: 14+ metrics (RAG + agents + tool use + SQL), native pytest, CI/CD templates, red teaming, custom metrics, Confident AI dashboard. Ragas: 6 RAG metrics, synthetic test data generation, simpler API, research-oriented. The choice: DeepEval for production (pytest CI/CD + multiple use cases + custom metrics). Ragas for RAG-only projects at the experimentation stage that need synthetic data. They can be combined: Ragas for data generation, DeepEval for the evaluation pipeline."
Q: What thresholds should you set for RAG evaluation metrics?

Red flag: "Faithfulness and Relevancy must be > 0.9"

Strong answer: "It depends on the use case: (1) Medical/legal RAG: Faithfulness >0.95, Hallucination <0.02 (the cost of an error is high). (2) Customer support: Faithfulness >0.8, Answer Relevancy >0.7 (general answers are acceptable). (3) Internal search: Context Precision >0.6, Recall >0.7 (coverage matters more). Thresholds are calibrated on a human-labeled golden set: find the point where the metric correlates with human accept/reject decisions (typically the F1-optimal point on the golden set). Revisit quarterly as drift accumulates."
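A minimal sketch of the F1-optimal calibration described above, assuming you already have per-sample metric scores and human accept/reject labels for a golden set:

```python
import numpy as np

def f1_optimal_threshold(scores: np.ndarray, accepted: np.ndarray) -> float:
    """Sweep candidate thresholds and return the one maximizing F1
    against human accept/reject labels."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.linspace(0.0, 1.0, 101):
        pred = scores >= t                     # metric says "accept"
        tp = np.sum(pred & accepted)
        fp = np.sum(pred & ~accepted)
        fn = np.sum(~pred & accepted)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Toy golden set: faithfulness scores vs human labels
scores = np.array([0.95, 0.82, 0.60, 0.40, 0.88])
labels = np.array([True, True, False, False, True])
print(f1_optimal_threshold(scores, labels))    # ~0.61: smallest threshold with perfect F1
```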
## Sources
- DeepEval Official Documentation — https://deepeval.com (formerly docs.confident-ai.com)
- Ragas Documentation — https://docs.ragas.io
- Prompts.ai — "Top 5 LLM Evaluation Platforms 2026"
- Confident AI Blog — DeepEval vs Ragas Comparison
- TruLens Documentation — https://trulens.org
- LangSmith Documentation — https://docs.langchain.com/langsmith (formerly docs.smith.langchain.com)
## See Also

- LLM Evaluation Metrics -- BLEU, ROUGE, BERTScore, LLM-as-a-Judge: the metrics these frameworks implement
- LLM Evaluation Benchmarks -- MMLU, GSM8K, Chatbot Arena: standard benchmarks for comparing models
- LLM Observability -- Langfuse, LangSmith, Arize Phoenix: observability platforms that integrate with evaluation frameworks
- LLM Evaluation Guardrails -- safety evaluation tools (NeMo Guardrails, LlamaGuard) that complement quality evaluation
- RAG Architectures -- Ragas and DeepEval Faithfulness metrics directly measure RAG pipeline quality