
LLM Evaluation Frameworks

~2 minute read

Prerequisites: LLM Evaluation Metrics | LLM Observability

DeepEval (14+ metrics, pytest, CI/CD) and Ragas (6 RAG metrics, synthetic data) are the two main open-source frameworks for automated LLM evaluation in 2026. DeepEval covers RAG + agents + red teaming at roughly $0.50-1.50 per 1K evaluations (LLM calls). Ragas is simpler but limited to RAG. In production both integrate into CI/CD: on every PR, Faithfulness (>0.8), Answer Relevancy (>0.7), and Hallucination (<0.1) checks run automatically. Without an evaluation pipeline, 40%+ of LLM applications degrade within 2-4 weeks due to prompt drift and data distribution shift.


Part 1: Overview

Why LLM Evaluation Matters in 2025-2026

Challenges:

  • Non-deterministic outputs - Same input, different outputs
  • Subjective quality - What's "good" varies by use case
  • Emergent behaviors - Models do unexpected things
  • Cost of evaluation - Human labeling is expensive

Evaluation Types:

| Type | Purpose | Examples |
|------|---------|----------|
| Deterministic | Exact matches, regex | Accuracy, F1 |
| LLM-based | Model-as-judge | G-Eval, DeepEval metrics |
| Human | Ground truth comparison | A/B testing, surveys |
| Behavioral | System-level performance | Task completion rate |
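
To make the "Deterministic" row concrete, here is a minimal sketch of LLM-free scoring with exact match and token-level F1 (function names are illustrative, not from any framework):

def exact_match(prediction: str, reference: str) -> float:
    # Strict, LLM-free check: 1.0 only on a case-insensitive exact match
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # Simplified token-overlap F1 (order-insensitive, no LLM calls)
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                              # 1.0
print(token_f1("Paris is the capital", "the capital is Paris"))   # 1.0 despite word order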


Part 2: DeepEval

2.1 Overview

Developer: Confident AI
Current Version: 2.x (2025)
License: Apache 2.0

Key Features:
  • 14+ built-in metrics
  • Pytest integration
  • LLM-as-judge with explainability
  • Confident AI platform integration
  • RAG evaluation (RAGAS-inspired)
  • Red teaming support
  • CI/CD ready

2.2 Installation & Setup

pip install deepeval

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# DeepEval uses an LLM as the judge by default; set the provider API key
import os
os.environ["OPENAI_API_KEY"] = "your-key"

2.3 Core Metrics

| Metric | What It Measures | Score Range |
|--------|------------------|-------------|
| Answer Relevancy | Does the answer address the question? | 0-1 |
| Faithfulness | Is the answer grounded in the context? | 0-1 |
| Contextual Recall | Is the context relevant to the expected output? | 0-1 |
| Contextual Precision | Is the retrieved context noise-free? | 0-1 |
| Hallucination | Does the output contain invented facts? | 0-1 |
| Bias | Is the output biased or unfair? | 0-1 |
| Toxicity | Is the output harmful or offensive? | 0-1 |
| Conversational | Multi-turn coherence | 0-1 |
| Tool Correctness | Tool calling accuracy | 0-1 |
| SQL Correctness | SQL query validity | 0-1 |
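
Beyond the built-ins above, DeepEval also exposes G-Eval for criteria-based LLM-as-judge scoring (mentioned in the Evaluation Types table). A minimal sketch; the criteria text and threshold are illustrative:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Criteria-based LLM-as-judge metric; the criteria string is an example
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris",
)
correctness.measure(test_case)
print(correctness.score, correctness.reason)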

2.4 Usage Examples

Basic Evaluation:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris"
)

metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])

RAG Pipeline Evaluation:

from deepeval.metrics import FaithfulnessMetric, ContextualRecallMetric

test_case = LLMTestCase(
    input="What is ML?",
    actual_output="Machine Learning is a subset of AI...",
    retrieval_context=[
        "Machine Learning is a field of AI that enables systems to learn from data.",
        "ML algorithms build models based on training data."
    ]
)

metrics = [
    FaithfulnessMetric(threshold=0.8),
    ContextualRecallMetric(threshold=0.7)
]

evaluate(test_cases=[test_case], metrics=metrics)

Pytest Integration:

# test_rag.py
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# rag_pipeline and retriever are placeholders for your own RAG components
from my_app import rag_pipeline, retriever

@pytest.mark.parametrize("query,expected", [
    ("What is Python?", "A programming language"),
    ("What is ML?", "A subset of AI"),
])
def test_rag_quality(query, expected):
    actual_output = rag_pipeline(query)
    context = retriever.get_context(query)

    test_case = LLMTestCase(
        input=query,
        actual_output=actual_output,
        expected_output=expected,
        retrieval_context=context
    )

    metric = FaithfulnessMetric(threshold=0.7)
    assert_test(test_case, [metric])
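
Tests written this way can be run with plain pytest (as in the CI workflow below) or with DeepEval's own runner, which adds per-metric reporting:

deepeval test run test_rag.py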

2.5 Custom Metrics

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomToneMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.name = "Tone"

    def measure(self, test_case: LLMTestCase) -> float:
        # Custom logic here
        output = test_case.actual_output

        # Example: Check for professional tone
        professional_words = ["please", "thank you", "would"]
        count = sum(1 for word in professional_words if word in output.lower())

        self.score = min(count / 2, 1.0)
        self.success = self.score >= self.threshold
        self.reason = f"Found {count} professional markers"

        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success
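
The custom metric plugs into the same evaluate / assert_test entry points as the built-ins; a quick usage sketch with an illustrative test case:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Can you reset my password?",
    actual_output="Of course, thank you for reaching out. Please follow the link below."
)

# The custom tone check can be mixed with any built-in metrics in one run
evaluate(test_cases=[test_case], metrics=[CustomToneMetric(threshold=0.5)])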

2.6 CI/CD Integration

# .github/workflows/llm_eval.yml
name: LLM Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install deepeval pytest

      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pytest tests/llm/ --tb=short

      - name: Upload results to Confident AI
        if: always()
        run: |
          deepeval login --api-key ${{ secrets.CONFIDENT_API_KEY }}
          deepeval push

Part 3: Ragas

3.1 Overview

Developer: Exploding Gradients
Current Version: 0.2.x (2025)
License: Apache 2.0

Key Features:
  • RAG-focused metrics
  • Reference-free evaluation
  • Test data generation
  • Simple API
  • Research-oriented

3.2 Installation

pip install ragas

3.3 Core Metrics

| Metric | What It Measures | Notes |
|--------|------------------|-------|
| Faithfulness | Is the answer grounded in the context? | Similar to DeepEval |
| Answer Relevancy | Does the answer address the question? | Uses LLM-as-judge |
| Context Precision | Is the retrieved context relevant? | Signal-to-noise ratio |
| Context Recall | Is the context comprehensive? | Coverage of ground truth |
| Answer Similarity | Semantic similarity to expected | Uses embeddings |
| Answer Correctness | Overall answer quality | Weighted combination |

3.4 Usage Example

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

data = {
    "question": ["What is AI?", "What is ML?"],
    "answer": [
        "AI is artificial intelligence...",
        "ML is machine learning..."
    ],
    "contexts": [
        ["AI refers to computer systems that can perform tasks requiring intelligence."],
        ["ML is a subset of AI that uses data to learn patterns."]
    ],
    "ground_truth": ["AI is intelligence demonstrated by machines.", "ML is learning from data."]
}

dataset = Dataset.from_dict(data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)

print(result)
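
The returned result object holds the aggregate scores; for per-sample inspection, recent Ragas versions expose a DataFrame conversion (column names follow the metric names; assumes pandas is installed):

# One row per sample, one column per metric
df = result.to_pandas()
print(df[["faithfulness", "answer_relevancy"]])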

Part 4: DeepEval vs Ragas Comparison

4.1 Feature Comparison

| Feature | DeepEval | Ragas |
|---------|----------|-------|
| RAG Metrics | ✅ 14+ | ✅ 6 |
| Agent Evaluation | ✅ | ❌ |
| Custom Metrics | ✅ Easy | ⚠️ Limited |
| Pytest Integration | ✅ Native | ❌ |
| CI/CD Ready | ✅ | ⚠️ Manual |
| Red Teaming | ✅ Built-in | ❌ |
| Synthetic Data | ⚠️ Via platform | ✅ Built-in |
| Dashboard | ✅ Confident AI | ❌ |
| Self-Hosted | ✅ | ✅ |
| Model Support | ✅ Multi-provider | ✅ OpenAI-focused |

4.2 When to Choose Which

Choose DeepEval when:
  • Need pytest integration
  • Building production CI/CD pipelines
  • Evaluating AI agents, not just RAG
  • Need custom metrics
  • Want a comprehensive evaluation platform
  • Red teaming is important

Choose Ragas when:
  • Focused purely on RAG evaluation
  • Need synthetic test data generation
  • Prefer a simpler API
  • Research/experimentation focus
  • Don't need pytest integration

4.3 Performance Comparison

| Metric | DeepEval | Ragas |
|--------|----------|-------|
| Setup complexity | Medium | Low |
| Execution speed | Medium (parallel support) | Medium |
| LLM calls per eval | 2-5 per metric | 2-3 per metric |
| Memory usage | Medium | Low |

Part 5: Other Top Platforms 2026

5.1 Platform Comparison

| Platform | Focus | Pricing | Best For |
|----------|-------|---------|----------|
| DeepEval | Comprehensive eval | Free + Confident AI | Production apps |
| Ragas | RAG evaluation | Free | RAG pipelines |
| Deepchecks | ML monitoring | Free + Enterprise | ML ops |
| MLflow | ML lifecycle | Free | Experiment tracking |
| TruLens | Neural app eval | Free | RAG + agents |
| LangSmith | LLM observability | Paid | Production monitoring |

5.2 TruLens

Key Features:
  • RAG triad: Context relevance, Groundedness, Answer relevance
  • Feedback functions
  • Chain/agent tracing
  • Dashboard visualization

# TruLens rebranded to `trulens` package in late 2024.
# Old: from trulens_eval import ... (deprecated)
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as fOpenAI

provider = fOpenAI()
f_groundedness = Feedback(provider.groundedness_measure_with_cot_reasons).on(
    input="query", output="response", context="context"
)

tru_recorder = TruChain(
    chain,
    feedbacks=[f_groundedness]
)

5.3 LangSmith

Key Features:
  • Trace visualization
  • Dataset management
  • Evaluation automation
  • Production monitoring
  • Prompt management

from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("my-eval-dataset")

# Run evaluation (accuracy_evaluator and hallucination_evaluator are
# placeholder evaluator functions defined elsewhere)
results = client.evaluate(
    "my-chain",
    data=dataset,
    evaluators=[accuracy_evaluator, hallucination_evaluator]
)

Part 6: Evaluation Best Practices

6.1 Test Data Strategy

| Strategy | When to Use | Pros | Cons |
|----------|-------------|------|------|
| Golden dataset | Production | High quality | Expensive to create |
| Synthetic data | Early development | Fast, cheap | May not reflect reality |
| Production sampling | Ongoing | Real data | Privacy concerns |
| Adversarial examples | Red teaming | Tests robustness | Labor intensive |
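
As an illustration of the "Production sampling" row, a minimal sketch of capturing a few percent of live traffic for later offline evaluation; the JSONL sink and function names are assumptions, and per the "Privacy concerns" column, PII should be scrubbed before storing:

import json
import random

SAMPLE_RATE = 0.02  # sample 2% of production traffic for offline eval

def maybe_log_for_eval(query: str, answer: str, contexts: list) -> None:
    # Randomly keep a small share of requests; the JSONL file stands in
    # for a real queue or database
    if random.random() < SAMPLE_RATE:
        record = {"input": query, "actual_output": answer, "retrieval_context": contexts}
        with open("eval_samples.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")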

6.2 Metric Selection Guide

| Use Case | Primary Metrics |
|----------|-----------------|
| RAG chatbot | Faithfulness, Answer Relevancy, Context Precision |
| Code generation | Correctness, Syntax Validity |
| Summarization | ROUGE, BERTScore, Hallucination |
| Translation | BLEU, COMET, Answer Similarity |
| Agent | Tool Correctness, Task Completion |
| Content moderation | Toxicity, Bias |

6.3 Evaluation Pipeline

graph TD
    A["Test Cases<br/>(Golden / Synthetic)"] --> B["LLM System"]
    B --> C["Actual Output"]
    C --> D["Evaluation Framework"]
    D --> E["Metrics<br/>(Multiple)"]
    D --> F["Results & Reports"]
    E --> G["Threshold Checks"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#f3e5f5,stroke:#9c27b0
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fce4ec,stroke:#c62828
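
The "Threshold Checks" node can be a small gate that compares aggregate scores to per-metric limits and fails the build; a sketch using the thresholds from the summary at the top of the page (the names and structure are illustrative):

THRESHOLDS = {
    "faithfulness": (0.8, "min"),       # must be >= 0.8
    "answer_relevancy": (0.7, "min"),   # must be >= 0.7
    "hallucination": (0.1, "max"),      # must be <= 0.1
}

def check_gates(scores: dict) -> list:
    # Collect human-readable descriptions of every violated gate
    failures = []
    for metric, (limit, kind) in THRESHOLDS.items():
        value = scores[metric]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            failures.append(f"{metric}={value:.2f} violates {kind} {limit}")
    return failures

failures = check_gates({"faithfulness": 0.84, "answer_relevancy": 0.65, "hallucination": 0.05})
if failures:
    # Non-zero exit fails the CI job
    raise SystemExit("Evaluation gate failed: " + "; ".join(failures))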

Part 7: Interview-Relevant Numbers

Framework Statistics

| Metric | DeepEval | Ragas |
|--------|----------|-------|
| GitHub stars | 4,000+ | 7,000+ |
| Built-in metrics | 14+ | 6 |
| LLM providers | 5+ | 2-3 |
| Pytest support | ✅ | ❌ |
| CI/CD templates | 3+ | 0 |

Evaluation Costs

| Metric Type | LLM Calls | Est. Cost (per 1K evals) |
|-------------|-----------|--------------------------|
| Answer Relevancy | 2-3 | $0.10-$0.30 |
| Faithfulness | 2-4 | $0.15-$0.40 |
| Hallucination | 1-2 | $0.05-$0.20 |
| Full RAG suite | 8-12 | $0.50-$1.50 |
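
A rough way to sanity-check these figures: cost per eval ≈ LLM calls per eval x tokens per call x price per token. The token count and price below are placeholder assumptions, not measured values:

# Placeholder assumptions, for illustration only
calls_per_eval = 10              # "Full RAG suite" row: 8-12 judge calls
tokens_per_call = 800            # prompt + completion tokens, assumed
price_per_1k_tokens = 0.00015    # assumed judge-model price in USD

cost_per_eval = calls_per_eval * (tokens_per_call / 1000) * price_per_1k_tokens
print(f"~${cost_per_eval * 1000:.2f} per 1K evals")  # ~$1.20 with these assumptions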

Industry Adoption

| Framework | Enterprise Adoption |
|-----------|---------------------|
| DeepEval | Growing (2025) |
| Ragas | High (RAG-focused) |
| LangSmith | High (LangChain users) |
| TruLens | Medium |


Misconception: Faithfulness > 0.8 means there are no hallucinations

The Faithfulness metric (DeepEval/Ragas) checks that the answer is grounded in the context, but it does not catch: (1) facts from the context that the answer omits (that is Context Recall), (2) logical errors built on top of correct facts, (3) subtle hallucinations inside otherwise correct sentences. In production, use at least three metrics together: Faithfulness + Hallucination + Answer Relevancy. A single metric is not enough for a 95%+ detection rate.
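
A minimal DeepEval sketch of that recommendation: run the three metrics on one test case. Note the assumption that HallucinationMetric grades against context (the ground-truth documents) while FaithfulnessMetric grades against retrieval_context:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

docs = ["Machine Learning is a field of AI that enables systems to learn from data."]

test_case = LLMTestCase(
    input="What is ML?",
    actual_output="Machine Learning is a subset of AI that learns patterns from data.",
    retrieval_context=docs,   # used by Faithfulness
    context=docs,             # used by Hallucination
)

evaluate(
    test_cases=[test_case],
    metrics=[
        FaithfulnessMetric(threshold=0.8),
        HallucinationMetric(threshold=0.1),
        AnswerRelevancyMetric(threshold=0.7),
    ],
)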

Misconception: synthetic test data replaces a golden dataset

Ragas can generate synthetic test data, but it reflects the distribution of the generator model, not of real users. In production, 30-40% of edge cases come from queries that synthetic data does not cover: typos, mixed languages, adversarial inputs. A golden dataset of 200-500 real queries with human labels is mandatory for calibration. Synthetic data is for extending coverage, not a replacement.

Misconception: DeepEval and Ragas produce the same Faithfulness scores

Both libraries measure Faithfulness, but the implementations differ: DeepEval uses a claim extraction + verification chain (2-4 LLM calls), while Ragas uses statement extraction + NLI (2-3 LLM calls). On the same data the scores can diverge by 10-20%. Comparing absolute values across frameworks is not meaningful; fix one framework and track the trend.


Interview Questions

Q: How do you build a CI/CD pipeline for evaluating an LLM application?

❌ Red flag: "We run pytest on a few examples before deploying"

✅ Strong answer: "Three levels: (1) Unit: DeepEval pytest integration, 50-100 test cases from the golden dataset, metrics Faithfulness >0.8, Answer Relevancy >0.7, Hallucination <0.1; runs on every PR. (2) Integration: full RAG pipeline eval on staging, 500+ test cases, Context Precision + Context Recall; weekly. (3) Production monitoring: 1-5% sampling via LangSmith/Langfuse, drift detection, alerting when degradation exceeds 10%. Costs: ~$0.50-1.50 per 1K evals for LLM-based metrics. GitHub Actions + deepeval push for the dashboard."

Q: DeepEval vs Ragas: when to choose which?

❌ Red flag: "Ragas is more popular (7K+ stars), so it's better"

✅ Strong answer: "DeepEval: 14+ metrics (RAG + agents + tool use + SQL), native pytest, CI/CD templates, red teaming, custom metrics, Confident AI dashboard. Ragas: 6 RAG metrics, synthetic test data generation, simpler API, research-oriented. The choice: DeepEval for production (pytest CI/CD + multiple use cases + custom metrics); Ragas for RAG-only projects at the experimentation stage that need synthetic data. They can also be combined: Ragas for data generation, DeepEval for the evaluation pipeline."

Q: What thresholds should you set for RAG evaluation metrics?

❌ Red flag: "Faithfulness and Relevancy must be > 0.9"

✅ Strong answer: "It depends on the use case: (1) Medical/legal RAG: Faithfulness >0.95, Hallucination <0.02 (the cost of an error is high). (2) Customer support: Faithfulness >0.8, Answer Relevancy >0.7 (general answers are acceptable). (3) Internal search: Context Precision >0.6, Recall >0.7 (coverage matters more). Thresholds are calibrated on a human-labeled golden set: find the point where the metric agrees with human accept/reject decisions (usually the F1-optimal point on the golden set). Revisit quarterly as drift accumulates."
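
A sketch of that calibration step: sweep candidate thresholds over a human-labeled golden set and keep the F1-optimal point (the scores and labels below are illustrative):

def f1_at_threshold(scores, labels, threshold):
    # labels: 1 = human accepted the answer, 0 = rejected
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

scores = [0.91, 0.42, 0.77, 0.88, 0.55, 0.95]   # metric scores on the golden set
labels = [1, 0, 1, 1, 0, 1]                      # human accept/reject decisions
# With this toy data several thresholds tie at F1 = 1.0; max keeps the first
best = max((t / 100 for t in range(50, 100)), key=lambda t: f1_at_threshold(scores, labels, t))
print(f"F1-optimal threshold: {best:.2f}")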


Sources

  1. DeepEval Official Documentation — https://deepeval.com (formerly docs.confident-ai.com)
  2. Ragas Documentation — https://docs.ragas.io
  3. Prompts.ai — "Top 5 LLM Evaluation Platforms 2026"
  4. Confident AI Blog — DeepEval vs Ragas Comparison
  5. TruLens Documentation — https://trulens.org
  6. LangSmith Documentation — https://docs.langchain.com/langsmith (formerly docs.smith.langchain.com)

See Also