# LLM Evaluation Frameworks

~2 minute read

Prerequisites: LLM Evaluation Metrics | LLM Observability

DeepEval (14+ metrics, pytest, CI/CD) and Ragas (6 RAG metrics, synthetic data) are the two main open-source frameworks for automated LLM evaluation in 2026. DeepEval covers RAG, agents, and red teaming at roughly $0.50-1.50 per 1K evaluations (LLM judge calls). Ragas is simpler but limited to RAG. In production, both plug into CI/CD: every PR automatically runs Faithfulness (>0.8), Answer Relevancy (>0.7), and Hallucination (<0.1) checks. Without an evaluation pipeline, 40%+ of LLM applications degrade within 2-4 weeks due to prompt drift and data distribution shift.
## Part 1: Overview

### Why LLM Evaluation Matters in 2025-2026

Challenges:

- Non-deterministic outputs - the same input can produce different outputs
- Subjective quality - what counts as "good" varies by use case
- Emergent behaviors - models do unexpected things
- Cost of evaluation - human labeling is expensive

Evaluation Types:

| Type | Purpose | Examples |
|------|---------|----------|
| Deterministic | Exact matches, regex | Accuracy, F1 |
| LLM-based | Model-as-judge | G-Eval, DeepEval metrics |
| Human | Ground truth comparison | A/B testing, surveys |
| Behavioral | System-level performance | Task completion rate |
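The first two rows differ mainly in cost and robustness: deterministic checks are cheap and reproducible but brittle to paraphrases, while LLM-based judging (what DeepEval and Ragas automate below) handles free-form text at the price of extra LLM calls. A minimal sketch of the deterministic side (function names are illustrative):

```python
import re

def exact_match(prediction: str, reference: str) -> bool:
    # Deterministic: cheap and reproducible, but fails on valid paraphrases
    return prediction.strip().lower() == reference.strip().lower()

def matches_pattern(prediction: str, pattern: str) -> bool:
    # e.g. pattern=r"\bParis\b" for a capital-of-France question
    return re.search(pattern, prediction) is not None
```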
## Part 2: DeepEval

### 2.1 Overview

**Developer:** Confident AI
**Current Version:** 2.x (2025)
**License:** Apache 2.0

Key Features:

- 14+ built-in metrics
- Pytest integration
- LLM-as-judge with explainability
- Confident AI platform integration
- RAG evaluation (RAGAS-inspired)
- Red teaming support
- CI/CD ready
### 2.2 Installation & Setup
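Install from PyPI:

```bash
pip install deepeval
```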
```python
import os

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# DeepEval defaults to OpenAI models as the LLM judge
os.environ["OPENAI_API_KEY"] = "your-key"
```
### 2.3 Core Metrics
| Metric | What It Measures | Score Range |
|---|---|---|
| Answer Relevancy | Does answer address the question? | 0-1 |
| Faithfulness | Is answer grounded in context? | 0-1 |
| Contextual Recall | Is context relevant to expected output? | 0-1 |
| Contextual Precision | Is retrieved context noise-free? | 0-1 |
| Hallucination | Does output contain invented facts? | 0-1 (lower is better) |
| Bias | Is output biased or unfair? | 0-1 (lower is better) |
| Toxicity | Is output harmful or offensive? | 0-1 (lower is better) |
| Conversational | Multi-turn coherence | 0-1 |
| Tool Correctness | Tool calling accuracy | 0-1 |
| SQL Correctness | SQL query validity | 0-1 |
### 2.4 Usage Examples

**Basic Evaluation:**
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris"
)

metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```
**RAG Pipeline Evaluation:**
```python
from deepeval.metrics import FaithfulnessMetric, ContextualRecallMetric

test_case = LLMTestCase(
    input="What is ML?",
    actual_output="Machine Learning is a subset of AI...",
    # ContextualRecallMetric checks retrieval_context against expected_output,
    # so a reference answer is required here
    expected_output="ML is a field of AI where systems learn from data.",
    retrieval_context=[
        "Machine Learning is a field of AI that enables systems to learn from data.",
        "ML algorithms build models based on training data."
    ]
)

metrics = [
    FaithfulnessMetric(threshold=0.8),
    ContextualRecallMetric(threshold=0.7)
]
evaluate(test_cases=[test_case], metrics=metrics)
```
**Pytest Integration:**
```python
# test_rag.py
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# rag_pipeline and retriever are your application code under test (placeholders)
from my_app import rag_pipeline, retriever

@pytest.mark.parametrize("query,expected", [
    ("What is Python?", "A programming language"),
    ("What is ML?", "A subset of AI"),
])
def test_rag_quality(query, expected):
    actual_output = rag_pipeline(query)
    context = retriever.get_context(query)
    test_case = LLMTestCase(
        input=query,
        actual_output=actual_output,
        expected_output=expected,
        retrieval_context=context
    )
    metric = FaithfulnessMetric(threshold=0.7)
    assert_test(test_case, [metric])
```
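Run either with plain `pytest` or via the DeepEval CLI wrapper (`deepeval test run test_rag.py`), which adds parallel execution and result caching on top of pytest.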
### 2.5 Custom Metrics
```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomToneMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Custom logic: check for professional tone markers
        output = test_case.actual_output.lower()
        professional_words = ["please", "thank you", "would"]
        count = sum(1 for word in professional_words if word in output)
        self.score = min(count / 2, 1.0)
        self.success = self.score >= self.threshold
        self.reason = f"Found {count} professional markers"
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        # DeepEval reads the display name from __name__, not from self.name
        return "Tone"
```
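A custom metric plugs into `evaluate` and `assert_test` exactly like the built-ins; a minimal usage sketch:

```python
test_case = LLMTestCase(
    input="Can you help me?",
    actual_output="Of course! Would you please share more details? Thank you."
)
evaluate(test_cases=[test_case], metrics=[CustomToneMetric(threshold=0.5)])
```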
### 2.6 CI/CD Integration
```yaml
# .github/workflows/llm_eval.yml
name: LLM Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install deepeval pytest
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pytest tests/llm/ --tb=short
      - name: Upload results to Confident AI
        if: always()
        run: |
          deepeval login --api-key ${{ secrets.CONFIDENT_API_KEY }}
          deepeval push
```
## Part 3: Ragas

### 3.1 Overview

**Developer:** Exploding Gradients
**Current Version:** 0.2.x (2025)
**License:** Apache 2.0

Key Features:

- RAG-focused metrics
- Reference-free evaluation
- Test data generation
- Simple API
- Research-oriented
### 3.2 Installation
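Ragas installs from PyPI; the Hugging Face `datasets` package is used in the example below:

```bash
pip install ragas datasets
```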
### 3.3 Core Metrics
| Metric | What It Measures | Notes |
|---|---|---|
| Faithfulness | Is answer grounded in context? | Similar to DeepEval |
| Answer Relevancy | Does answer address question? | Uses LLM-as-judge |
| Context Precision | Is retrieved context relevant? | Signal-to-noise ratio |
| Context Recall | Is context comprehensive? | Coverage of ground truth |
| Answer Similarity | Semantic similarity to expected | Uses embeddings |
| Answer Correctness | Overall answer quality | Weighted combination |
### 3.4 Usage Example
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["What is AI?", "What is ML?"],
    "answer": [
        "AI is artificial intelligence...",
        "ML is machine learning..."
    ],
    "contexts": [
        ["AI refers to computer systems that can perform tasks requiring intelligence."],
        ["ML is a subset of AI that uses data to learn patterns."]
    ],
    "ground_truth": [
        "AI is intelligence demonstrated by machines.",
        "ML is learning from data."
    ]
}

dataset = Dataset.from_dict(data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)
print(result)
```
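`print(result)` shows the aggregated per-metric scores; per-sample scores are available as a DataFrame via the result's `to_pandas()` method:

```python
df = result.to_pandas()  # one row per sample, one column per metric
print(df[["faithfulness", "answer_relevancy"]].describe())
```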
## Part 4: DeepEval vs Ragas Comparison

### 4.1 Feature Comparison
| Feature | DeepEval | Ragas |
|---|---|---|
| RAG Metrics | ✅ 14+ | ✅ 6 |
| Agent Evaluation | ✅ | ❌ |
| Custom Metrics | ✅ Easy | ⚠️ Limited |
| Pytest Integration | ✅ Native | ❌ |
| CI/CD Ready | ✅ | ⚠️ Manual |
| Red Teaming | ✅ Built-in | ❌ |
| Synthetic Data | ⚠️ Via platform | ✅ Built-in |
| Dashboard | ✅ Confident AI | ❌ |
| Self-Hosted | ✅ | ✅ |
| Model Support | ✅ Multi-provider | ✅ OpenAI-focused |
### 4.2 When to Choose Which

Choose DeepEval when:

- You need pytest integration
- You are building production CI/CD pipelines
- You are evaluating AI agents, not just RAG
- You need custom metrics
- You want a comprehensive evaluation platform
- Red teaming is important

Choose Ragas when:

- You are focused purely on RAG evaluation
- You need synthetic test data generation
- You prefer a simpler API
- The focus is research/experimentation
- You don't need pytest integration
### 4.3 Performance Comparison
| Metric | DeepEval | Ragas |
|---|---|---|
| Setup complexity | Medium | Low |
| Execution speed | Medium (parallel support) | Medium |
| LLM calls per eval | 2-5 per metric | 2-3 per metric |
| Memory usage | Medium | Low |
## Part 5: Other Top Platforms 2026

### 5.1 Platform Comparison
| Platform | Focus | Pricing | Best For |
|---|---|---|---|
| DeepEval | Comprehensive eval | Free + Confident AI | Production apps |
| Ragas | RAG evaluation | Free | RAG pipelines |
| Deepchecks | ML monitoring | Free + Enterprise | ML ops |
| MLflow | ML lifecycle | Free | Experiment tracking |
| TruLens | Neural app eval | Free | RAG + agents |
| LangSmith | LLM observability | Paid | Production monitoring |
### 5.2 TruLens

Key Features:

- RAG triad: Context Relevance, Groundedness, Answer Relevance
- Feedback functions
- Chain/agent tracing
- Dashboard visualization
```python
# TruLens moved from the `trulens_eval` package to `trulens` in late 2024.
# Old: from trulens_eval import ... (deprecated)
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as fOpenAI

provider = fOpenAI()

# `chain` is your LangChain app; select its retrieved context from the record,
# then score groundedness of the output against that context
context = TruChain.select_context(chain)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
)

tru_recorder = TruChain(
    chain,
    feedbacks=[f_groundedness]
)
```
### 5.3 LangSmith

Key Features:

- Trace visualization
- Dataset management
- Evaluation automation
- Production monitoring
- Prompt management
```python
from langsmith import Client, evaluate

client = Client()

# Create a dataset (add examples separately, e.g. via client.create_examples)
dataset = client.create_dataset("my-eval-dataset")

# Run evaluation: the target is a callable under test; my_chain and the
# two evaluators are placeholders defined elsewhere in your codebase
results = evaluate(
    my_chain,
    data="my-eval-dataset",
    evaluators=[accuracy_evaluator, hallucination_evaluator]
)
```
## Part 6: Evaluation Best Practices

### 6.1 Test Data Strategy
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Golden dataset | Production | High quality | Expensive to create |
| Synthetic data | Early development | Fast, cheap | May not reflect reality |
| Production sampling | Ongoing | Real data | Privacy concerns |
| Adversarial examples | Red teaming | Tests robustness | Labor intensive |
### 6.2 Metric Selection Guide
| Use Case | Primary Metrics |
|---|---|
| RAG chatbot | Faithfulness, Answer Relevancy, Context Precision |
| Code generation | Correctness, Syntax Validity |
| Summarization | ROUGE, BERTScore, Hallucination |
| Translation | BLEU, COMET, Answer Similarity |
| Agent | Tool Correctness, Task Completion |
| Content moderation | Toxicity, Bias |
### 6.3 Evaluation Pipeline
```mermaid
graph TD
    A["Test Cases<br/>(Golden / Synthetic)"] --> B["LLM System"]
    B --> C["Actual Output"]
    C --> D["Evaluation Framework"]
    D --> E["Metrics<br/>(Multiple)"]
    D --> F["Results & Reports"]
    E --> G["Threshold Checks"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#f3e5f5,stroke:#9c27b0
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fce4ec,stroke:#c62828
```
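The "Threshold Checks" step is usually a simple gate over aggregated scores that fails the pipeline when any metric misses its target. A minimal sketch (threshold values and the `scores` input are illustrative):

```python
# Direction matters: hallucination-style metrics pass when BELOW threshold
THRESHOLDS = {"faithfulness": 0.8, "answer_relevancy": 0.7, "hallucination": 0.1}
HIGHER_IS_BETTER = {"faithfulness": True, "answer_relevancy": True, "hallucination": False}

def passes_gate(scores: dict) -> bool:
    for name, threshold in THRESHOLDS.items():
        score = scores[name]
        ok = score >= threshold if HIGHER_IS_BETTER[name] else score <= threshold
        if not ok:
            print(f"FAIL {name}: {score:.2f} vs threshold {threshold}")
            return False
    return True

print(passes_gate({"faithfulness": 0.85, "answer_relevancy": 0.74, "hallucination": 0.05}))  # True
```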
## Part 7: Interview-Relevant Numbers

### Framework Statistics
| Metric | DeepEval | Ragas |
|---|---|---|
| GitHub stars | 4,000+ | 7,000+ |
| Built-in metrics | 14+ | 6 |
| LLM providers | 5+ | 2-3 |
| Pytest support | ✅ | ❌ |
| CI/CD templates | 3+ | 0 |
### Evaluation Costs
| Metric Type | LLM Calls | Est. Cost (per 1K evals) |
|---|---|---|
| Answer Relevancy | 2-3 | $0.10-$0.30 |
| Faithfulness | 2-4 | $0.15-$0.40 |
| Hallucination | 1-2 | $0.05-$0.20 |
| Full RAG suite | 8-12 | $0.50-$1.50 |
### Industry Adoption
| Framework | Enterprise Adoption |
|---|---|
| DeepEval | Growing (2025) |
| Ragas | High (RAG-focused) |
| LangSmith | High (LangChain users) |
| TruLens | Medium |
**Misconception: Faithfulness > 0.8 means no hallucinations**

The Faithfulness metric (DeepEval/Ragas) verifies that the answer is grounded in the context, but it does not catch: (1) facts missing from the context (that is Context Recall's job), (2) logical errors built on top of correct facts, (3) subtle hallucinations inside otherwise correct sentences. In production, run at least 3 metrics together: Faithfulness + Hallucination + Answer Relevancy, as in the sketch below. A single metric is not enough for a 95%+ detection rate.
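A minimal DeepEval sketch of that three-metric combination (note that `HallucinationMetric` reads the test case's `context`, while `FaithfulnessMetric` reads `retrieval_context`):

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase

grounding = ["ML is a field of AI that learns patterns from data."]
test_case = LLMTestCase(
    input="What is ML?",
    actual_output="Machine Learning is a subset of AI...",
    retrieval_context=grounding,  # used by FaithfulnessMetric
    context=grounding,            # used by HallucinationMetric
)

evaluate(
    test_cases=[test_case],
    metrics=[
        FaithfulnessMetric(threshold=0.8),
        HallucinationMetric(threshold=0.1),  # score below threshold passes
        AnswerRelevancyMetric(threshold=0.7),
    ],
)
```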
**Misconception: synthetic test data replaces a golden dataset**

Ragas can generate synthetic test data, but it reflects the distribution of the generator model, not of real users. In production, 30-40% of edge cases come from queries that synthetic data does not cover: typos, mixed languages, adversarial inputs. A golden dataset of 200-500 real queries with human labels is mandatory for calibration. Synthetic data is for extending coverage, not for replacing real data.
**Misconception: DeepEval and Ragas produce the same Faithfulness scores**

Both libraries measure Faithfulness, but the implementations differ: DeepEval uses a claim extraction + verification chain (2-4 LLM calls), while Ragas uses statement extraction + NLI (2-3 LLM calls). On the same data, scores can diverge by 10-20%. Comparing absolute values across frameworks is invalid: pick one framework and track the trend.
## Interview Questions

Q: How do you build a CI/CD pipeline for evaluating an LLM application?

Red flag: "We run pytest with a few examples before deploying"

Strong answer: "Three levels: (1) Unit: DeepEval pytest integration, 50-100 test cases from the golden dataset, with Faithfulness >0.8, Answer Relevancy >0.7, Hallucination <0.1; runs on every PR. (2) Integration: full RAG pipeline eval on staging, 500+ test cases, Context Precision + Context Recall; weekly. (3) Production monitoring: 1-5% sampling via LangSmith/Langfuse, drift detection, alerting on >10% degradation. Costs: ~$0.50-1.50 per 1K evals for LLM-based metrics. GitHub Actions + deepeval push for the dashboard."
Q: DeepEval vs Ragas: when to choose which?

Red flag: "Ragas is more popular (7K+ stars), so it's better"

Strong answer: "DeepEval: 14+ metrics (RAG + agents + tool use + SQL), native pytest, CI/CD templates, red teaming, custom metrics, Confident AI dashboard. Ragas: 6 RAG metrics, synthetic test data generation, simpler API, research-oriented. The choice: DeepEval for production (pytest CI/CD + multiple use cases + custom metrics). Ragas for RAG-only projects at the experimentation stage that need synthetic data. They can be combined: Ragas for data generation, DeepEval for the evaluation pipeline."
Q: What thresholds should you set for RAG evaluation metrics?

Red flag: "Faithfulness and Relevancy must be > 0.9"

Strong answer: "It depends on the use case: (1) Medical/legal RAG: Faithfulness >0.95, Hallucination <0.02 (the cost of an error is high). (2) Customer support: Faithfulness >0.8, Answer Relevancy >0.7 (general answers are acceptable). (3) Internal search: Context Precision >0.6, Recall >0.7 (coverage matters more). Thresholds are calibrated on a human-labeled golden set: find the point where the metric correlates with human accept/reject decisions (typically the F1-optimal point on the golden set). Revisit quarterly as drift accumulates."
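A minimal sketch of the F1-optimal calibration described above, assuming you already have per-sample metric scores and human accept/reject labels for a golden set:

```python
import numpy as np

def f1_optimal_threshold(scores: np.ndarray, accepted: np.ndarray) -> float:
    """Sweep candidate thresholds and return the one maximizing F1
    against human accept/reject labels."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.linspace(0.0, 1.0, 101):
        pred = scores >= t                     # metric says "accept"
        tp = np.sum(pred & accepted)
        fp = np.sum(pred & ~accepted)
        fn = np.sum(~pred & accepted)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Toy golden set: faithfulness scores vs human labels
scores = np.array([0.95, 0.82, 0.60, 0.40, 0.88])
labels = np.array([True, True, False, False, True])
print(f1_optimal_threshold(scores, labels))    # ~0.61: smallest threshold with perfect F1
```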
## Sources
- DeepEval Official Documentation — https://deepeval.com (formerly docs.confident-ai.com)
- Ragas Documentation — https://docs.ragas.io
- Prompts.ai — "Top 5 LLM Evaluation Platforms 2026"
- Confident AI Blog — DeepEval vs Ragas Comparison
- TruLens Documentation — https://trulens.org
- LangSmith Documentation — https://docs.langchain.com/langsmith (formerly docs.smith.langchain.com)
## See Also

- LLM Evaluation Metrics -- BLEU, ROUGE, BERTScore, LLM-as-a-Judge: the metrics these frameworks implement
- LLM Evaluation Benchmarks -- MMLU, GSM8K, Chatbot Arena: standard benchmarks for comparing models
- LLM Observability -- Langfuse, LangSmith, Arize Phoenix: observability platforms that integrate with evaluation frameworks
- LLM Evaluation Guardrails -- safety evaluation tools (NeMo Guardrails, LlamaGuard) that complement quality evaluation
- RAG Architectures -- Ragas and DeepEval Faithfulness metrics directly measure RAG pipeline quality