LLM Evaluation Benchmarks¶

Prerequisites: LLM Benchmarks Guide, LLM Evaluation Metrics

MMLU (57 subjects, about 16K questions) showed a gap of 17.5 percentage points between top models in 2024; by 2026 the gap is 1-2 points (GPT-5.2: 92.3%, Claude Opus 4.6: 91.8%), so the benchmark is saturated. GSM8K and HumanEval showed a gap of 31.6 percentage points in 2024, and top scores on both are now above 95%. Real differentiation has shifted to Chatbot Arena (Elo ratings from millions of real votes), GPQA Diamond (PhD-level, Google-proof questions), and SWE-bench Verified (real GitHub issues). The gap between closed and open-source models shrank from 8.04% in January 2024 to almost zero by 2025, making open-source models (DeepSeek R1, LLaMA) production-viable.

Key Ideas¶

Overview Table¶

Benchmark	What it tests	Format	Skills	Limitations
MMLU	Broad knowledge, 57 subjects	Multiple-choice	General intelligence, multi-task	Does not reflect real-world generation
GSM8K	Math reasoning	Word problems (grade school)	Multi-step logic, analytical thinking	Math only, narrow domain
HumanEval	Code generation	Programming problems	Algorithmic thinking, syntax	Programming only, 164 problems
MT-Bench	Instruction following	Multi-turn open-ended questions	Conversation, instruction following	Small question set; scores depend on the LLM judge

Details for Each Benchmark¶

MMLU (Massive Multitask Language Understanding)¶

57 subjects: STEM, humanities, social sciences
5-shot prompting: the model sees 5 worked examples before each question (standard MMLU protocol)
Score: percentage of correct answers
Problem: the multiple-choice format differs from real free-form answers
Gap 2023: 17.5 percentage points between top models
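
The 5-shot protocol and the accuracy score can be sketched as follows. This is a minimal illustration, not the official MMLU harness: the prompt layout, the `Answer:` suffix, and the helper names are assumptions chosen for readability.

```python
from typing import List, Optional, Sequence, Tuple

CHOICES = "ABCD"  # MMLU questions have four options

def format_question(question: str, options: Sequence[str],
                    answer: Optional[str] = None) -> str:
    """Render one question; include the answer letter only for few-shot examples."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(shots: List[Tuple[str, Sequence[str], str]],
                          question: str, options: Sequence[str]) -> str:
    """Prepend solved examples (5 in the standard setup) to the target question."""
    blocks = [format_question(q, o, a) for q, o, a in shots]
    blocks.append(format_question(question, options))
    return "\n\n".join(blocks)

def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted letters that match the gold letters."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

The model's next token after the final `Answer:` is compared with the gold letter, and `accuracy` aggregates those comparisons over the whole test set.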

GSM8K (Grade School Math 8K)¶

~8,500 problems: grade-school math word problems (the "8K" is the dataset size, not a grade level)
Multi-step reasoning: each problem requires a step-by-step solution
Chain-of-Thought: the most effective prompting approach for this benchmark
2024 gap: 31.6 percentage points between top models
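
Scoring a chain-of-thought answer on GSM8K comes down to comparing final numbers. The sketch below relies on the dataset convention that reference solutions end with `#### <answer>`; taking the last number in the model output as its answer is a common heuristic and an assumption here, not a fixed rule.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def reference_answer(solution: str) -> str:
    """Extract the final answer that follows '####' in a reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(model_output: str) -> Optional[str]:
    """Heuristic: treat the last number in the generated reasoning as the answer."""
    matches = NUMBER.findall(model_output)
    return matches[-1].replace(",", "") if matches else None

def is_correct(model_output: str, solution: str) -> bool:
    """Numeric exact match between predicted and reference answers."""
    pred = predicted_answer(model_output)
    return pred is not None and float(pred) == float(reference_answer(solution))
```

The heuristic breaks when the model appends extra numbers after its answer, so real harnesses usually ask for a fixed answer format in the prompt.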

HumanEval¶

164 problems: Python programming tasks
Pass@k: the probability that at least one of k sampled solutions passes the unit tests
Evaluation: functional correctness of the code
2024 gap: 31.6 percentage points (similar to GSM8K)
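
Pass@k is usually estimated by sampling n ≥ k solutions per problem, counting the c that pass, and computing 1 − C(n−c, k) / C(n, k). A minimal sketch of that estimator for a single problem (the function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which passed the unit tests."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems. Sampling more than k solutions and using the estimator gives a lower-variance result than literally drawing k samples once.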

MT-Bench¶

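Instruction following: 80 open-ended, multi-turn questions on practical tasks
Scoring: an LLM judge (GPT-4 in the original paper) rates each answer from 1 to 10
Latency is not part of the score; measure generation time separately if it matters
More realistic than MMLU/GSM8K/HumanEval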
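
In the MT-Bench paper (Zheng et al., 2023) answers are graded by an LLM judge on a 1-10 scale, and the benchmark score is the average rating. The sketch below aggregates judge replies; it assumes the judge was asked to wrap its rating as `[[rating]]`, and the function names are hypothetical.

```python
import re
from statistics import mean
from typing import List, Optional

# Assumed judge output convention: the rating appears as [[7]] or [[7.5]].
RATING = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")

def parse_rating(judgment: str) -> Optional[float]:
    """Return the last [[rating]] found in the judge's reply, or None."""
    found = RATING.findall(judgment)
    return float(found[-1]) if found else None

def mt_bench_score(judgments: List[str]) -> float:
    """Average all parseable ratings; replies without a rating are skipped."""
    ratings = [r for r in map(parse_rating, judgments) if r is not None]
    if not ratings:
        raise ValueError("no parseable ratings")
    return mean(ratings)
```

Skipping unparseable replies is a design choice; counting them and reporting the number alongside the score makes judge failures visible.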

Formulas and Math¶

Normalized score¶

\[ \text{Score} = \frac{\text{correct} - \text{random}}{\text{total} - \text{random}} \]
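
A worked example of the normalized score. The formula does not define "random"; the sketch assumes it is the expected number of correct answers from uniform guessing, that is `total / num_choices`.

```python
def normalized_score(correct: int, total: int, num_choices: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and a perfect run to 1."""
    random_correct = total / num_choices  # assumption: uniform guessing baseline
    return (correct - random_correct) / (total - random_correct)

# 70 correct out of 100 four-option questions: (70 - 25) / (100 - 25)
score = normalized_score(70, 100, 4)
```

On four-option questions a raw accuracy of 25% normalizes to 0, which makes scores comparable across benchmarks with different numbers of answer options.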

Performance gap (2024)¶

\[ \text{Gap}_{\text{2024}} = \text{Best Model} - \text{Median Open Model} \approx 20\text{-}30\% \]

Applications for the AI/LLM Engineer¶

Model evaluation¶

Comprehensive assessment: use all 4 benchmarks for a complete picture
Reasoning vs coding: different models excel at different tasks
Trade-off analysis: understand each model's strong and weak points

System design¶

Benchmark selection: which tests to run for an evaluation
Progress tracking: monitor improvement on benchmarks over time
A/B testing: compare models on production workloads

Related Work¶

EleutherAI benchmark framework
"Instruction Tuning for Large Language Models: A Comprehensive Analysis" (paper)
LLM-as-judge methodology
OpenAI evaluation methodology

Quotes¶

"Each benchmark tests different aspects of LLM capabilities, which is why they're commonly used together"

My Notes¶

Why this matters:
- LLM evaluation is core to the AI/LLM Engineer position
- Understanding model strengths and weaknesses is critical for system design
- Benchmark gaps are an opportunity for improvement (31% gap in 2024!)

Interview questions:
- "What are the key differences between MMLU, GSM8K, and HumanEval?"
- "Why is there a 31% performance gap between top models in 2024?"
- "How would you evaluate an LLM for production deployment?"
- "What is MT-Bench and why is it more realistic than other benchmarks?"

Further research:
- EleutherAI detailed benchmark documentation
- LLM-as-judge papers and implementation
- Specific model evaluation methodologies
- MT-Bench technical report

Misconception: MMLU is the best way to compare models

MMLU (multiple-choice, 57 subjects) is saturated in 2026: the top 5 models fall in the 87.5-92.3% range, and a difference of 1-2 points on about 16K questions is not a reliable signal once prompt sensitivity and contamination are taken into account. The multiple-choice format does not reflect real text generation. Data contamination is a further risk: models may have seen the questions in their training data. MMLU-Pro (12K harder questions) and GPQA Diamond (PhD-level, Google-proof) are more informative replacements. Rule of thumb: if the MMLU difference is under 3 points, treat it as not meaningful.
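
How much of a small MMLU gap could be pure sampling noise can be checked with a two-proportion z-test. The sketch assumes both models were scored on independent samples of n = 16,000 questions; that is a simplification, since in practice both answer the same questions and a paired test would fit better.

```python
from math import sqrt

def two_proportion_z(acc_a: float, acc_b: float, n: int) -> float:
    """z statistic for the difference between two accuracies, each measured on n questions."""
    standard_error = sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return (acc_a - acc_b) / standard_error

# The two top scores quoted in the summary, with n taken from the ~16K figure.
z_small_gap = two_proportion_z(0.923, 0.918, 16_000)

# A hypothetical 2-point gap at the same sample size, for comparison.
z_two_points = two_proportion_z(0.923, 0.903, 16_000)
```

Under these assumptions the 0.5-point gap gives a z below 1.96, so it cannot be told apart from noise at the 5% level, while a 2-point gap would exceed that threshold. Sampling error is only one source of noise: prompt format and contamination can move scores by more than this test accounts for.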

Misconception: Chatbot Arena is an objective evaluation

Chatbot Arena is a crowdsourced Elo ranking built on millions of votes, but: (1) selection bias: Arena users do not represent all use cases; (2) prompt distribution: the prompts are whatever users happen to ask, not your production tasks; (3) Elo is unstable for new models that have few battles. Chatbot Arena is better than static benchmarks, but for a production choice you need to test on 50-100 examples of your specific task.
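
When scoring a custom set of 50-100 examples, it helps to report an interval alongside the pass rate. The sketch below uses the Wilson score interval with z = 1.96 for roughly 95% coverage; the function name and the choice of interval are illustrative.

```python
from math import sqrt
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate measured on n examples."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(80, 100)
```

For 80 correct out of 100 the interval spans roughly 71% to 87%, so an evaluation set of this size can separate models that differ by a wide margin but not ones that differ by a few points.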

Misconception: GSM8K is a good test of mathematical ability

GSM8K consists of grade-school-level problems, and frontier models score 95-97%. The benchmark is saturated and no longer differentiates. The main problem, though, is that models can memorize solution patterns instead of actually reasoning. MATH-500 (competition math) and AIME 2024 (o3-mini: 83.3%) test reasoning much better. FrontierMath (research-level math) is not yet solved even by frontier models.

Interview Questions¶

Q: How do you choose the right benchmark to evaluate an LLM for a specific task?

Red flag: "We use MMLU because it is the best known"

Strong answer: "The choice depends on the use case: for a general assistant, MMLU + IFEval + Chatbot Arena; for coding, SWE-bench Verified + LiveCodeBench; for research/math, GPQA Diamond + AIME; for multimodal, MMMU + MathVista. On top of that, always test on your own evaluation set of 50-100 examples. No single benchmark covers every aspect: you need at least 3 benchmarks from different categories plus a domain-specific eval."

Q: What is the significance of the narrowing gap between open-source and closed-source models?

Red flag: "Open-source has caught up, so there is no difference now"

Strong answer: "The gap shrank from 8.04% in January 2024 to almost zero on MMLU by 2025. This means production deployments can use open-source models (DeepSeek R1, LLaMA) without a significant quality loss on general tasks. The nuance: on harder benchmarks (SWE-bench, GPQA Diamond) the gap is still ~10-18 points. Trade-off: open-source gives control over data, low latency, and 10-50x cost savings, but frontier capabilities (complex reasoning, agent tasks) still sit with closed-source models."

Q: Why are static benchmarks losing relevance, and what is replacing them?

Red flag: "Benchmarks work fine, they just need more tasks"

Strong answer: "Four problems with static benchmarks: (1) contamination: models are trained on benchmark data; (2) saturation: top models are within 1-2 points of each other; (3) narrow scope: multiple-choice does not reflect real-world usage; (4) gaming: models are optimized for specific formats. Trends for 2025-2026: human preference > automated metrics (Chatbot Arena), real-world tasks > synthetic (SWE-bench), multimodal > text-only (MMMU), agentic > static Q&A (MCPMark). Best practice: Chatbot Arena for general quality + a domain-specific benchmark + a custom evaluation set."
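
Contamination is often screened with an n-gram overlap check between benchmark items and training documents. The sketch below is a simplified version: whitespace tokenization, n = 8, and the function names are assumptions, and real pipelines use their own tokenizers and thresholds.

```python
from typing import Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the item's n-grams that also appear in the corpus document."""
    item_ngrams = ngrams(benchmark_item, n)
    if not item_ngrams:
        return 0.0  # item is shorter than n tokens
    return len(item_ngrams & ngrams(corpus_doc, n)) / len(item_ngrams)
```

A ratio close to 1.0 suggests the item appears nearly verbatim in the document. Paraphrased or translated leaks pass this check unnoticed, which is one reason contamination is hard to rule out.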

Sources¶

EleutherAI
Papers With Code - LLM Evaluation
ArXiv - MT-Bench (Zheng et al., 2023)
HuggingFace - Evaluate LLMs
LMSYS Chatbot Arena (formerly lmarena.ai, rebranded 2026)

2025-2026 Evaluation Advances (Updated Feb 2026)¶

New Key Benchmarks¶

Benchmark	Domain	Format	Significance
MMMU	Multimodal reasoning	Image+Text questions	Tests cross-modal understanding
GPQA	Graduate-level QA	Expert-level questions	Beyond internet knowledge
Chatbot Arena	Human preference	Elo ratings	Real user preferences
SWE-bench	Software engineering	GitHub issues	Real-world coding tasks
AGIEval	General intelligence	Multi-domain	AGI capability measurement

Chatbot Arena (LMSYS)¶

Methodology:
- Crowdsourced battles between models
- Randomized, blind comparisons
- Elo rating system (like chess)
- Millions of human preference votes

2025 Trend: $$ \text{Gap}_{\text{closed-open}} = 8.04\% \text{ (Jan 2024)} \rightarrow \text{Closing (Feb 2025)} $$

Top Models (Feb 2026):

| Rank | Model | Elo |
|------|-------|-----|
| 1 | GPT-4.5/o3 | ~1400+ |
| 2 | Claude (Opus 4 / Sonnet 4.5) | ~1380+ |
| 3 | Gemini 2.5 Pro | ~1370+ |
| 4 | DeepSeek R1 | ~1360+ |
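
The chess-style Elo update behind such ratings can be sketched in a few lines. The K-factor of 32 and the 400-point scale are the conventional chess values and are assumptions here; the leaderboard's actual fitting procedure may differ.

```python
from typing import Tuple

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A wins a battle against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one battle; score_a is 1 (win), 0.5 (tie) or 0 (loss)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Because each battle moves a rating by at most K points, a model with only a handful of votes can swing widely, which is why ratings for newly added models are unstable.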

MMMU (Massive Multitask Multimodal Understanding)¶

What it tests:
- Image + text comprehension
- Multi-domain knowledge with visual context
- College-level reasoning

Key insight: Tests ability to combine visual and textual information for complex reasoning.

GPQA (Graduate-Level Google-Proof Q&A)¶

Characteristics:
- Expert-level questions requiring PhD-level knowledge
- Resistant to web search (not easily Google-able)
- Tests deep understanding, not retrieval

Why it matters:
- Better proxy for true intelligence than MMLU
- Harder to game through training data contamination

SWE-bench (Software Engineering Benchmark)¶

What it tests:
- Real GitHub issues from popular repos
- Code generation + debugging
- Multi-file understanding

Format: $$ \text{Pass Rate} = \frac{\text{Issues Resolved}}{\text{Total Issues}} $$
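
The pass rate can be computed from per-issue test outcomes. In this sketch an issue counts as resolved when every test that was failing before the fix now passes and every previously passing test still passes; the dictionary keys `fail_to_pass` and `pass_to_pass` are hypothetical names for those two test groups.

```python
from typing import Dict, List

IssueResult = Dict[str, Dict[str, bool]]  # group name -> {test name: passed}

def is_resolved(result: IssueResult) -> bool:
    """An issue is resolved only if all tests in both groups pass after the patch."""
    return all(result["fail_to_pass"].values()) and all(result["pass_to_pass"].values())

def pass_rate(results: List[IssueResult]) -> float:
    """Issues resolved divided by total issues."""
    resolved = sum(is_resolved(r) for r in results)
    return resolved / len(results)
```

Requiring the previously passing tests to stay green means a patch that fixes the issue but breaks other behavior does not count as a success.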

Evaluation Framework Evolution¶

2025-2026 Trends:
1. Human preference > automated metrics
2. Real-world tasks > synthetic benchmarks
3. Multimodal > text-only
4. Agentic capabilities > static Q&A

Best Practice for Model Selection:

1. Check Chatbot Arena for human preference
2. Verify domain-specific benchmarks (SWE-bench for code)
3. Test on your actual use case (few-shot evaluation)
4. Consider cost-speed-quality tradeoffs

Interview Questions (2025-2026)¶

Q: "Why is Chatbot Arena considered more reliable than static benchmarks?"

Answer: Static benchmarks can be contaminated by training data and don't capture real-world preferences. Chatbot Arena uses blind, randomized human comparisons at scale, providing Elo ratings that better reflect how users actually experience model quality.

Q: "How would you evaluate an LLM for a customer support application?"

Answer:
1. Start with Chatbot Arena for general quality
2. Test domain-specific accuracy on support tickets
3. Measure hallucination rate with RAG grounding
4. A/B test with real users
5. Monitor production metrics (CSAT, resolution rate)

Q: "What is the significance of the closed-open model gap narrowing?"

Answer: It indicates open-source models are catching up to proprietary ones. The gap dropped from 8% to near-zero in 2025, meaning production deployments can use open models (DeepSeek, LLaMA) without major quality sacrifices.