LLM Evaluation Metrics


Prerequisites: LLM Evaluation Benchmarks | RAG Architectures

BLEU and ROUGE rank translation and summarization systems correctly, but on open-ended generation their correlation with human judgment drops to 0.3-0.4. LLM-as-a-Judge (GPT-4, Claude) reaches a correlation of 0.6-0.8 and needs no reference text, but without calibration it can completely invert a model ranking: arXiv 2512.11150 showed an uncalibrated proxy flipping the ranking for 3 of 5 pairs. Cost: BLEU is about $0.001 per 1K evaluations, LLM-Judge $3-5 per 1K, human review $50-200 per 1K. The 2026 trend is cascaded pipelines: cheap metrics filter 90% of cases, an LLM-Judge handles the hard ones, and humans calibrate.

Part 1: Overview

Executive Summary

Key Insight:

"Traditional metrics like BLEU, ROUGE fall short when evaluating modern LLMs. They assume a clean ground truth that most real-world LLM outputs don't have." LLM-as-a-Judge has emerged as the leading alternative, but requires calibration to avoid bias and rank inversions. The 2026 trend is toward causal judge evaluation with calibrated surrogate metrics.

2026 LLM Evaluation Metrics Landscape:

Metric Type	Examples	Best For	Limitation
Lexical	BLEU, ROUGE	Translation, Summarization	No semantic understanding
Embedding	BERTScore, MoverScore	Semantic similarity	Needs reference
LLM-as-Judge	GPT-4, Claude, Prometheus	Open-ended generation	Bias, cost
Task-specific	Exact Match, F1	QA, Classification	Task-limited
Human-aligned	Calibration, Preference	Production	Expensive

Part 2: Traditional Metrics Limitations

BLEU and ROUGE Problems

BLEU (Bilingual Evaluation Understudy) is n-gram precision against a reference. Problem: "The cat sat on mat" vs "A feline rested" scores BLEU = 0 (no n-grams match), although the meaning is the same.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is n-gram recall. Problem: it rewards length and does not assess reasoning or creativity.

Fundamental limitations:

They require a reference text (LLM outputs often have none)
Surface-level matching (no semantic understanding)
They do not assess reasoning, creativity, or safety
They were built for MT and summarization, not for open generation

BLEU Formula

\[\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)\]

Where:

BP = Brevity Penalty
$p_n$ = n-gram precision
$w_n$ = weights (typically $1/N$)
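The formula can be made concrete with a minimal sketch of unsmoothed sentence-level BLEU. The whitespace tokenization, lowercasing, single reference, and the function names `ngrams` and `sentence_bleu` are assumptions for illustration; production toolkits add smoothing and corpus-level aggregation, so their scores will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights w_n = 1/N."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # any zero precision collapses the geometric mean
        log_sum += math.log(matched / total) / max_n
    # Brevity penalty: 1 when the candidate is at least as long as the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum)


print(sentence_bleu("A feline rested", "The cat sat on mat"))
```

Because no unigram overlaps, the paraphrase pair scores exactly zero, which is the failure mode this part describes.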

ROUGE Variants

Variant	Description	Formula
ROUGE-N	N-gram overlap	$\frac{\sum \text{match}}{\sum \text{reference n-grams}}$
ROUGE-L	Longest Common Subsequence	LCS-based F1
ROUGE-S	Skip-bigram	Non-contiguous pairs
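As an illustration of the first two variants, the sketch below computes ROUGE-N recall and an LCS-based ROUGE-L F1. Whitespace tokenization, lowercasing, and the balanced F1 (equal weight on precision and recall) are simplifying assumptions; the function names are hypothetical, not a library API.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: matched n-grams / n-grams in the reference."""
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum(min(c, cand[g]) for g, c in ref.items()) / total


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_n("the cat sat", "the cat sat on the mat"))
print(rouge_l("the cat sat", "the cat sat on the mat"))
```

A candidate that copies more of the reference can only raise ROUGE-N recall, which is why the metric tends to reward length.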

Part 3: LLM-as-a-Judge

Core Concept

graph TD
    A["Input: Generated text + Criteria + Optional Reference"] --> B["Judge LLM<br/>(GPT-4, Claude, Prometheus)"]
    B --> C["Score (1-5 / pass-fail) + Reasoning"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50

Advantages: no reference required, semantic understanding, can assess reasoning, safety, and creativity, and scales to any output type.

Judge Evaluation Methods

Method	Description	When to Use
Pointwise	Single output score	Quality assessment
Pairwise	Compare two outputs	A/B testing
Listwise	Rank multiple outputs	Ranking systems

LLM-as-Judge can INVERT model rankings

GPT-4 as a judge prefers GPT-4 outputs, and Claude prefers Claude outputs (self-preference bias). In pairwise evaluation the A/B order affects the result (order bias). arXiv 2512.11150 showed that uncalibrated proxy metrics can completely flip a model ranking. Mitigations: (1) use several different judges and aggregate their verdicts, (2) always validate against a human-labeled calibration set, (3) in pairwise comparisons, randomize the order and average.
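One way to sketch the order-bias and multi-judge mitigations: query each judge in both orders and take a majority vote across judges. Evaluating both orders is a deterministic variant of randomizing and averaging. The `Judge` callable interface and the toy judges are hypothetical stand-ins; a real judge would wrap an LLM API call.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: takes (prompt, first_answer, second_answer)
# and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Query the judge in both orders; return "A", "B", or "tie"."""
    forward = judge(prompt, a, b)    # A shown first
    backward = judge(prompt, b, a)   # B shown first
    a_wins = (forward == "first") + (backward == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # verdict flipped with the order: treat as position bias


def ensemble_verdict(judges: Sequence[Judge], prompt: str, a: str, b: str) -> str:
    """Majority vote over several judges, each already order-debiased."""
    votes = Counter(debiased_pairwise(j, prompt, a, b) for j in judges)
    if votes["A"] > votes["B"]:
        return "A"
    if votes["B"] > votes["A"]:
        return "B"
    return "tie"


# Toy judges for demonstration only.
def always_first(prompt, first, second):
    return "first"  # pure position bias


def prefers_longer(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"


print(debiased_pairwise(always_first, "q", "x", "yy"))
```

A judge that always picks the first position ends up with a tie, so its position bias cancels out instead of deciding the comparison. Note that `prefers_longer` survives the swap, so length bias needs a separate control.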

2026 Best Practices (Reddit r/LangChain)

Avoid 1-10 scales — Use pass/fail or 1-5
Start with human labels — Calibrate before scaling
Use rubrics — Clear evaluation criteria
Add anchor examples — Include good/bad samples
Debias — Counter self-preference, vendor favoritism

Part 4: Prometheus and M-Prometheus

Prometheus

Aspect	Details
Developer	KAIST AI
Type	Open-source LLM judge
Training	Feedback collection
License	Open source

M-Prometheus (OpenReview 2025)

Aspect	Details
Languages	20+
Focus	Multilingual LLM judge
Performance	Outperforms SOTA on multilingual reward benchmarks

Part 5: Calibration and Bias

Bias Types

Bias	Description	Impact
Self-Preference	GPT-4 prefers GPT-4, Claude prefers Claude	Inverts rankings
Length Bias	Longer answers are rated higher	"More words = better" fallacy
Vendor Favoritism	Models from the same vendor get a bonus	Can change the ordering entirely
Order Bias (Pairwise)	The first or last position is preferred	A vs B != B vs A
Hinting Effects	Revealing the model's identity affects the score	"Tier preferences" emerge

Causal Judge Evaluation (arXiv 2512.11150)

\[\hat{Y} = \alpha + \beta \cdot \text{LLM\_Score} + \gamma \cdot \text{Controls} + \epsilon\]

Key insight: Uncalibrated proxies can invert rankings entirely.
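To show what fitting such a relation on a human-labeled set might look like, here is a minimal ordinary-least-squares sketch for the single-predictor case. It drops the Controls term and is not the estimator from the cited paper; the function names and the toy scores are assumptions for illustration.

```python
def fit_linear_calibration(judge_scores, human_scores):
    """Ordinary least squares fit of human = alpha + beta * judge."""
    n = len(judge_scores)
    if n != len(human_scores) or n < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    mean_x = sum(judge_scores) / n
    mean_y = sum(human_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in judge_scores)
    if var_x == 0:
        raise ValueError("judge scores are constant; beta is undefined")
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(judge_scores, human_scores))
    beta = cov_xy / var_x
    alpha = mean_y - beta * mean_x
    return alpha, beta


def calibrate(score, alpha, beta):
    """Map a raw judge score to the human-aligned scale."""
    return alpha + beta * score


# Toy calibration set: the judge's scale is compressed relative to the humans'.
alpha, beta = fit_linear_calibration([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta, calibrate(5, alpha, beta))
```

A single global line cannot repair a ranking inverted by a per-model bias, so in practice the fit would need model identity or similar covariates as controls.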

Calibration Techniques

Technique	Description
Human alignment	Train on human preference data
Bias correction	Subtract known bias terms
Cross-validation	Use multiple judges, aggregate
Anchor examples	Include calibrated references
Rubric design	Clear, specific criteria

Part 6: Embedding-Based Metrics

BERTScore

\[\text{BERTScore} = \frac{1}{|x|} \sum_{x_i \in x} \max_{y_j \in y} \cos(x_i, y_j)\]

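Uses contextual embeddings from BERT for semantic matching: each token of x is greedily matched to its most similar token in y. With x as the reference this is the recall component; precision swaps the roles of x and y, and the score is usually reported as their F1.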
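The greedy matching in the formula can be sketched without a model by passing token embeddings in directly. The 2-dimensional toy vectors below stand in for real BERT embeddings, and the function names are hypothetical; the sketch also omits the optional IDF weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match(x_vecs, y_vecs):
    """Average over tokens of x of the best cosine match in y."""
    if not x_vecs or not y_vecs:
        return 0.0
    return sum(max(cosine(x, y) for y in y_vecs) for x in x_vecs) / len(x_vecs)


def bertscore_f1(ref_vecs, cand_vecs):
    """Recall matches reference tokens into the candidate; precision the reverse."""
    recall = greedy_match(ref_vecs, cand_vecs)
    precision = greedy_match(cand_vecs, ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings: the candidate covers only one of two reference tokens.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0]]
print(bertscore_f1(reference, candidate))
```

Here precision is perfect but recall is halved, because one reference token has no counterpart in the candidate.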

Metric Comparison

Metric	Reference Needed	Semantic	Cost
BLEU	Yes	No	Low
ROUGE	Yes	No	Low
BERTScore	Yes	Yes	Medium
LLM-as-Judge	No	Yes	High
Human	No	Yes	Very High

Part 7: Evaluation Dimensions

7 Critical Dimensions (TechRxiv 2025)

Dimension	Metrics	Description
Accuracy	F1, EM, LLM-judge	Correctness
Efficiency	Latency, Tokens	Resource usage
Safety	Toxicity, Bias check	Harmful outputs
Faithfulness	Hallucination rate	Grounding
Coherence	Perplexity, LLM-judge	Logical flow
Relevance	Retrieval metrics	Context fit
Fluency	Grammar check	Language quality

Part 8: Production Evaluation Pipeline

graph TD
    subgraph L1["L1: Fast Checks < 100ms"]
        A1["Exact match / Regex"]
        A2["Length checks"]
        A3["Keyword presence"]
    end

    subgraph L2["L2: LLM-as-Judge 100-500ms"]
        B1["Quality scoring (1-5)"]
        B2["Safety check (pass/fail)"]
        B3["Format validation"]
    end

    subgraph L3["L3: Human Review (sampling)"]
        C1["Edge cases"]
        C2["Calibration set"]
        C3["Feedback collection"]
    end

    L1 --> L2 --> L3

    style L1 fill:#e8f5e9,stroke:#4caf50
    style L2 fill:#fff3e0,stroke:#ef6c00
    style L3 fill:#fce4ec,stroke:#c62828
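The three levels in the diagram could be wired together roughly as follows. The diagram does not specify routing rules, so everything concrete here is an assumption: the length bounds, the banned-phrase regex, the score thresholds, and the `judge` callable (a stand-in for an LLM call that returns a 1-5 score).

```python
import re
from typing import Callable, Optional


def fast_checks(output: str, min_len: int = 1, max_len: int = 2000,
                banned: str = r"(?i)as an ai language model") -> Optional[str]:
    """L1: cheap deterministic checks. Returns a failure reason or None."""
    if not (min_len <= len(output) <= max_len):
        return "length"
    if re.search(banned, output):
        return "banned_phrase"
    return None


def evaluate(output: str, judge: Callable[[str], int],
             review_queue: list, low: int = 2, high: int = 4) -> str:
    """Route one output through L1 -> L2 -> L3."""
    reason = fast_checks(output)
    if reason is not None:
        return f"fail:{reason}"          # rejected at L1, judge never called
    score = judge(output)                # L2: hypothetical 1-5 judge score
    if score >= high:
        return "pass"
    if score <= low:
        return "fail:judge"
    review_queue.append(output)          # L3: ambiguous score goes to a human
    return "needs_human_review"


queue = []
print(evaluate("A clear, grounded answer.", lambda text: 5, queue))
print(evaluate("A borderline answer.", lambda text: 3, queue))
```

Only outputs that pass L1 incur judge cost, and only mid-range scores reach the human queue, which is what keeps the expensive levels small.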

Part 9: Comparison Matrix

Evaluation Method Comparison

Feature	BLEU/ROUGE	BERTScore	LLM-Judge	Human
Semantic	No	Yes	Yes	Yes
No reference	No	No	Yes	Yes
Scales well	Yes	Yes	Medium	No
Cost	Low	Medium	High	Very High
Bias-free	Yes	Yes	No	Mostly

Cost Comparison

Method	Per 1K Evaluations	Time
BLEU	$0.001	1s
BERTScore	$0.01	5s
LLM-Judge (GPT-4)	$3-5	30s
Human	$50-200	Hours
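The per-1K prices in the table translate into a daily budget with simple arithmetic. The sketch below hard-codes the table's figures; the dictionary keys and the function name are hypothetical.

```python
# Per-1K-evaluation price ranges (USD) copied from the table above.
COST_PER_1K = {
    "bleu": (0.001, 0.001),
    "bertscore": (0.01, 0.01),
    "llm_judge": (3.0, 5.0),
    "human": (50.0, 200.0),
}


def daily_cost(method: str, requests_per_day: int, sample_rate: float):
    """Return the (low, high) daily cost in USD for a sampled evaluation."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    low, high = COST_PER_1K[method]
    evaluated = requests_per_day * sample_rate
    return evaluated / 1000 * low, evaluated / 1000 * high


# 100K requests per day with 5% routed to the LLM judge.
print(daily_cost("llm_judge", 100_000, 0.05))
```

At 100K requests per day and 5% sampling, 5K judge calls cost roughly $15-25 per day, while human review of the same 5K items would cost $250-1000.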

Part 10: Interview-Relevant Numbers

BLEU/ROUGE Benchmarks

Task	Good BLEU	Good ROUGE-L
Translation	30-40	-
Summarization	-	40-50
Chat (not applicable)	N/A	N/A

LLM-as-Judge Agreement

Judge Type	Human Agreement
GPT-4	70-80%
Claude 3	72-82%
Prometheus	65-75%
Human-Human	80-90%
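The table does not say which agreement statistic it reports. Assuming raw percent agreement, the sketch below computes it for a labeled sample and also shows Cohen's kappa, a common chance-corrected alternative that is an addition here, not something the source uses. The labels are toy data.

```python
from collections import Counter


def percent_agreement(judge_labels, human_labels):
    """Share of items where judge and human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty lists of equal length")
    same = sum(j == h for j, h in zip(judge_labels, human_labels))
    return same / len(judge_labels)


def cohen_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge_labels)
    observed = percent_agreement(judge_labels, human_labels)
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
print(percent_agreement(judge, human), cohen_kappa(judge, human))
```

With skewed label distributions raw agreement can look high by chance alone, which is why a chance-corrected figure is worth reporting next to it.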

Correlation with Human Judgment

Metric	Correlation
BLEU	0.3-0.4
ROUGE	0.35-0.45
BERTScore	0.5-0.6
LLM-as-Judge	0.6-0.8
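The table does not state which correlation coefficient is meant. Assuming a rank correlation, the sketch below computes Spearman's coefficient as the Pearson correlation of average ranks, using only the standard library. The function names and scores are illustrative.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


metric_scores = [0.12, 0.40, 0.35, 0.80]   # toy automatic-metric scores
human_scores = [1, 3, 4, 5]                # toy human ratings
print(spearman(metric_scores, human_scores))
```

Rank correlation ignores the scale of the metric, so it answers whether the metric orders outputs the way humans do, not whether the raw numbers match.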

Evaluation Frequency Recommendations

Stage	LLM-Judge	Human
Development	100%	10%
Staging	50%	5%
Production	1-5%	0.1%
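One way to apply these rates is deterministic hash-based sampling, so the same request is always in or out of the evaluated set. The rate dictionaries mirror the table, taking the upper bound of the 1-5% production range; the `in_sample` helper and the request ID format are assumptions.

```python
import hashlib

# Sampling rates from the table above; production uses the upper bound of 1-5%.
JUDGE_RATE = {"development": 1.0, "staging": 0.5, "production": 0.05}
HUMAN_RATE = {"development": 0.10, "staging": 0.05, "production": 0.001}


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select about `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


stage = "staging"
ids = [f"req-{i}" for i in range(1000)]
judged = [r for r in ids if in_sample(r, JUDGE_RATE[stage])]
print(len(judged))
```

Hashing instead of calling a random generator makes the sample reproducible across reruns and services. To decorrelate the judge and human samples, a salt could be added to the ID before hashing.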

Misconception: BLEU >= 30 means good translation quality

BLEU 30 was treated as the standard for MT in the 2010s, but the metric ignores semantics: "A feline rested on the mat" vs "The cat sat on the mat" yields BLEU ~0.3 despite identical meaning. BLEU correlates with human judgment at only 0.3-0.4 on open-ended generation. For LLM output BLEU is practically useless.

Misconception: LLM-as-Judge is more objective than a human

GPT-4 as a judge prefers its own answers 10-15% more often (self-preference bias). The A/B presentation order changes the result by 5-20% (order bias). In arXiv 2512.11150 an uncalibrated LLM judge completely inverted the ranking for 3 of 5 model pairs. Mitigation: several different judges, order randomization, and a human calibration set.

Misconception: BERTScore replaces human evaluation

BERTScore (correlation 0.5-0.6 with humans) beats BLEU but still requires a reference text. On creative writing and reasoning tasks it does not assess logical correctness: two semantically similar but logically opposite statements can score BERTScore > 0.9.

Interview Questions

Q: When should you use BLEU/ROUGE, and when LLM-as-Judge?

Red flag: "BLEU and ROUGE are obsolete, we always use LLM-Judge"

Strong answer: "BLEU/ROUGE are cheap ($0.001/1K) and work well for MT and summarization, where a reference text exists and surface-level matching is sufficient. LLM-Judge ($3-5/1K) is needed for open-ended generation, reasoning, and safety, where its correlation with humans is 0.6-0.8 vs 0.3-0.4 for BLEU. In production a cascade is used: L1 (regex/exact match, <100ms) -> L2 (LLM-Judge, 100-500ms) -> L3 (human sampling). BLEU fits as a fast check in L1."

Q: How do you deal with bias in LLM-as-Judge?

Red flag: "We use GPT-4 as the judge, it is objective enough"

Strong answer: "Three main biases: self-preference (GPT-4 prefers GPT-4 output by 10-15%), order bias (the A/B position shifts results by 5-20%), length bias (longer answers are ranked higher). Mitigations: (1) several different judges with aggregation, (2) order randomization in pairwise comparisons plus averaging, (3) calibration on a human-labeled set of at least 200 examples, (4) a rubric with concrete criteria instead of a 1-10 scale, preferably pass/fail or 1-5 with anchor examples."

Q: How do you build an evaluation pipeline in production?

Red flag: "We evaluate 100% of traffic with LLM-Judge"

Strong answer: "A three-level cascade. L1 is fast checks (<100ms): exact match, regex, length, keyword presence; it covers 90% of simple cases. L2 is LLM-Judge (100-500ms): quality scoring 1-5 and safety pass/fail for the 10% of non-obvious cases. L3 is human review (sampling): edge cases, calibration set, feedback collection, on 0.1-1% of traffic. Development: 100% LLM-Judge + 10% human. Production: 1-5% LLM-Judge + 0.1% human. Costs: L2 on 100K requests/day at 5% sampling = $15-25/day."

Sources

arXiv — "How to Correctly Report LLM-as-a-Judge Evaluations" (2511.21140)
arXiv — "Causal Judge Evaluation: Calibrated Surrogate Metrics" (2512.11150)
OpenReview — "M-Prometheus: Open Multilingual LLM Judges" (Atyk8lnIQQ)
Analytics Vidhya — "Top 15 LLM Evaluation Metrics to Explore in 2026"
Weights & Biases — "LLM Evaluation Benchmarking: Beyond BLEU and ROUGE"
Confident AI — "LLM Evaluation Metrics: Ultimate Guide" (Jan 2026)
Comet — "Introduction to LLM-as-a-Judge For Evals"
Reddit r/LangChain — "Best LLM-as-a-Judge Practices from 2025"
Medium — "LLMs as Judges: Measuring Bias, Hinting Effects"
Lazy Programmer — "LLM-as-a-Judge: Goodbye BLEU Scores and ROUGE Metrics"
