LLM Benchmarks Guide

~10 minute read

Prerequisites: LLM Evaluation Metrics, LLM Evaluation Frameworks

The 30+ LLM benchmarks of 2026 fall into 6 categories: knowledge (MMLU, GPQA), reasoning (GSM8K, AIME, ARC-AGI), coding (HumanEval, SWE-bench), multimodal (MMMU, MathVista), safety (TruthfulQA), and long context (Needle in Haystack). The "easy" benchmarks are saturated: MMLU 92.3%, GSM8K 97%, HumanEval 93.7%. The "hard" ones still differentiate models: GPQA Diamond (best score 77%), AIME 2024 (83.3%), ARC-AGI (87.5%), plus FrontierMath and Humanity's Last Exam, which frontier models cannot yet solve. Critical fact: recall at 1M tokens of context drops to 26% (vs. 95%+ at 4K) -- a long context window does not mean a usable context.


Part 1: Overview

Executive Summary

Key Insight:

LLM evaluation in 2026 involves 30+ benchmarks across reasoning, coding, knowledge, safety, and multimodal capabilities. MMLU is saturating (top models all score 88%+), which makes newer benchmarks such as GPQA Diamond, SWE-bench Verified, and Humanity's Last Exam more discriminative for frontier models.

2026 Benchmark Categories:

Category Key Benchmarks What They Test
Knowledge MMLU, MMLU-Pro, GPQA Factual knowledge
Reasoning GSM8K, MATH, AIME Math and logic
Coding HumanEval, SWE-bench, MBPP Code generation
Multimodal MMMU, MathVista, VQA Vision-language
Safety TruthfulQA, RealToxicityPrompts Harmful outputs
Long Context Needle in Haystack, LongBench Context utilization

Part 2: Knowledge Benchmarks

MMLU (Massive Multitask Language Understanding)

Aspect Details
Subjects 57 academic subjects
Questions ~16,000 multiple choice
Difficulty Elementary to professional
Subjects include History, law, medicine, CS, physics, economics
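
Since MMLU is plain multiple choice, scoring reduces to letter accuracy over the ~16,000 items. A minimal sketch of that loop; the prompt template and the query_model callable are placeholder assumptions, not the official evaluation harness:

    def format_question(item: dict) -> str:
        """Render one MMLU-style item as a four-option prompt (format is an assumption)."""
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", item["choices"]))
        return (f"{item['question']}\n{options}\n"
                "Answer with a single letter: A, B, C, or D.")

    def mmlu_accuracy(items: list[dict], query_model) -> float:
        """Accuracy = fraction of items where the model's letter matches the gold letter."""
        correct = 0
        for item in items:
            reply = query_model(format_question(item)).strip().upper()
            predicted = next((ch for ch in reply if ch in "ABCD"), None)
            correct += predicted == item["answer"]  # gold answer is "A".."D"
        return correct / len(items)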

MMLU Leaderboard (2026):

Model MMLU Score
GPT-5.2 92.3%
Claude Opus 4.6 91.8%
Gemini 3 Pro 91.2%
DeepSeek V3 90.8%
Llama 4 70B 87.5%

MMLU-Pro (Harder Version)

Aspect Details
Focus More challenging reasoning
Questions 12,000+ complex questions
Difficulty Professional/expert level
Status Replacing original MMLU

GPQA (Graduate-Level Google-Proof Q&A)

Aspect Details
Focus PhD-level biology, physics, chemistry
Difficulty Expert-level reasoning
Google-proof Cannot be solved by search

GPQA Diamond Leaderboard:

Model GPQA Diamond
o3-mini 77.0%
Claude 3.7 Sonnet ~75%
DeepSeek R1 71.5%

Part 3: Reasoning Benchmarks

GSM8K (Grade School Math)

Aspect Details
Questions 8,500+ math word problems
Level Grade school math
Evaluation Exact answer match
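
GSM8K is scored by exact match on the final numeric answer. A minimal sketch of that extraction-and-compare step, assuming the model is prompted to end its chain of thought with the final number (the regex and prompting convention are assumptions):

    import re

    def extract_final_number(text: str) -> str | None:
        """Take the last number in the response as the model's final answer."""
        matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return matches[-1] if matches else None

    def gsm8k_exact_match(prediction: str, gold_answer: str) -> bool:
        """Exact answer match: compare final numbers as floats."""
        predicted = extract_final_number(prediction)
        return predicted is not None and float(predicted) == float(gold_answer)

    # Example: a chain-of-thought response whose last number is the answer.
    assert gsm8k_exact_match("She has 3 * 4 = 12 apples, so the answer is 12.", "12")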

GSM8K Leaderboard:

Model GSM8K Score
o3-mini 97%+
GPT-5.2 96.5%
Claude Opus 4.6 95.8%
Gemini 3 Pro 95.2%

MATH / MATH-500

Aspect Details
Focus Competition mathematics
Difficulty High school competition level
Subjects Algebra, geometry, number theory

MATH-500 Leaderboard:

Model MATH-500
DeepSeek R1 97.3%
o3-mini 96.8%
Claude 3.7 ~95%

AIME (American Invitational Mathematics Examination)

Aspect Details
Focus Advanced competition math
Difficulty Very high
Year tracked AIME 2024

AIME 2024 Leaderboard:

Model AIME 2024
o3-mini 83.3%
DeepSeek R1 79.8%
Claude 3.7 Thinking ~80%

ARC-AGI (Abstraction and Reasoning Corpus)

Aspect Details
Focus Abstract reasoning, pattern recognition
Novelty Tests generalization to unseen patterns
Human baseline ~85%

ARC-AGI Leaderboard:

Model ARC-AGI
o3-mini 87.5% (record)
o1 ~76%
GPT-4o ~50%

Part 4: Coding Benchmarks

HumanEval

Aspect Details
Tasks 164 Python coding problems
Format Function + docstring → implementation
Evaluation Unit test pass rate
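
HumanEval is typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A sketch of the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021), where n completions are sampled per problem and c of them pass:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: n samples per problem, c of them pass the tests."""
        if n - c < k:
            return 1.0  # fewer than k failures, so any k samples contain a pass
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 40 pass the unit tests.
    print(round(pass_at_k(n=200, c=40, k=1), 3))   # 0.2, same as c/n when k=1
    print(round(pass_at_k(n=200, c=40, k=10), 3))  # chance at least one of 10 passes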

HumanEval Leaderboard:

Model HumanEval
Claude Opus 4.6 93.7%
GPT-5.2 92.1%
DeepSeek V3 89.2%
Gemini 3 Pro 88.5%

SWE-bench (Software Engineering Benchmark)

Aspect Details
Tasks Real GitHub issues + PRs
Evaluation Can model fix real bugs?
Difficulty Production-level code

SWE-bench Verified Leaderboard:

Model SWE-bench Verified
Claude Opus 4.6 80.9%
o3-mini 71.7%
GPT-5.2 ~70%
DeepSeek V3 ~65%

MBPP (Mostly Basic Python Problems)

Aspect Details
Tasks 974 Python problems
Difficulty Basic to intermediate
Format Text description → code

MultiPL-E (Multilingual)

Aspect Details
Languages 18 programming languages
Based on HumanEval translations
Use case Cross-language comparison

Part 5: Multimodal Benchmarks

MMMU (Multimodal Multi-discipline Understanding)

Aspect Details
Focus College-level multimodal reasoning
Modalities Text, images, diagrams
Subjects 30+ academic disciplines

MMMU Leaderboard:

Model MMMU
GPT-5.2 85.4%
Gemini 3 Pro 83.5%
Claude Opus 4.6 82.1%
Qwen2.5-VL 75.2%

MathVista

Aspect Details
Focus Mathematical visual reasoning
Tasks Charts, diagrams, plots
Skills Math + visual understanding

MathVista Leaderboard:

Model MathVista
Gemini 3 Pro 73.1%
GPT-5.2 72.3%
Claude 4.6 71.8%

VQA (Visual Question Answering)

Variant Description
VQA v2 Natural images
TextVQA Text in images (OCR)
DocVQA Document understanding
ChartQA Chart interpretation

Part 6: Safety & Alignment Benchmarks

TruthfulQA

Aspect Details
Focus Truthfulness vs. common misconceptions
Questions 817 questions
Goal Avoid confident falsehoods

TruthfulQA Patterns:

Behavior Description
Imitate Model mimics common (false) beliefs
Truthful Model provides correct answer

RealToxicityPrompts

Aspect Details
Focus Toxicity in completions
Evaluation Perspective API scores
Goal Minimize harmful outputs
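
RealToxicityPrompts scores model continuations with the Perspective API. A minimal sketch of scoring a single completion, assuming the public comments:analyze REST endpoint and its TOXICITY attribute; check the current Perspective API documentation before relying on this exact request shape:

    import requests

    PERSPECTIVE_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
                       "comments:analyze")

    def toxicity_score(text: str, api_key: str) -> float:
        """Return the Perspective API TOXICITY score (0.0-1.0) for one completion."""
        body = {
            "comment": {"text": text},
            "requestedAttributes": {"TOXICITY": {}},
        }
        resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
        resp.raise_for_status()
        return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

    # Typical reporting: expected maximum toxicity over k completions per prompt,
    # and the fraction of prompts yielding any completion scored >= 0.5.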

BBQ (Bias Benchmark for Question Answering)

Aspect Details
Focus Social biases
Categories Age, gender, race, religion, etc.
Evaluation Disambiguated vs. ambiguous contexts

Part 7: Long Context Benchmarks

Needle In A Haystack

Aspect Details
Task Find specific fact in long context
Context lengths 1K to 1M+ tokens
Positions Various depths
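
A minimal sketch of how one needle-in-a-haystack trial is constructed: a "needle" fact is inserted at a chosen depth into filler text, and recall is checked with a simple substring match. The filler text, the needle, and the query_model callable are all placeholders, not the original benchmark code:

    def build_haystack(filler: str, needle: str, depth: float, target_chars: int) -> str:
        """Insert the needle at `depth` (0.0 = start, 1.0 = end) of a long filler context."""
        haystack = (filler * (target_chars // len(filler) + 1))[:target_chars]
        pos = int(len(haystack) * depth)
        return haystack[:pos] + "\n" + needle + "\n" + haystack[pos:]

    def needle_recall(query_model, needle: str, question: str, expected: str,
                      depths=(0.0, 0.25, 0.5, 0.75, 1.0), target_chars=400_000) -> float:
        """Fraction of insertion depths at which the model retrieves the expected answer."""
        filler = "The quick brown fox jumps over the lazy dog. "
        hits = 0
        for depth in depths:
            context = build_haystack(filler, needle, depth, target_chars)
            answer = query_model(f"{context}\n\nQuestion: {question}")
            hits += expected.lower() in answer.lower()
        return hits / len(depths)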

Context Recall Degradation:

Context Length Recall Rate
4K tokens 95%+
32K tokens 90%+
128K tokens 77%
512K tokens 45%
1M tokens 26%

LongBench

Aspect Details
Tasks Multi-document QA, summarization
Context Up to 32K tokens
Languages Bilingual (EN/CN)

RULER

Aspect Details
Focus Retrieval and reasoning in long context
Tasks Variable complexity
Max context 128K+

Part 8: Emerging Benchmarks (2026)

Humanity's Last Exam (HLE)

Aspect Details
Focus Hardest questions humans can answer
Difficulty Expert-level across domains
Status New frontier benchmark

FrontierMath

Aspect Details
Focus Research-level mathematics
Difficulty Not yet solved by frontier models
Goal Test mathematical creativity

IFEval (Instruction Following)

Aspect Details
Focus Strict instruction adherence
Constraints Format, length, keywords
Use case Reliability in production
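
IFEval scores verifiable constraints programmatically rather than with a judge model. A minimal sketch of that idea, checking a few constraint types against a response; the constraint schema here is an illustration, not IFEval's actual format:

    import json

    def check_constraints(response: str, constraints: dict) -> dict[str, bool]:
        """Verify a response against simple, programmatically checkable constraints."""
        results = {}
        if "max_words" in constraints:
            results["max_words"] = len(response.split()) <= constraints["max_words"]
        if "must_include" in constraints:
            results["must_include"] = all(kw.lower() in response.lower()
                                          for kw in constraints["must_include"])
        if constraints.get("format_json"):
            try:
                json.loads(response)
                results["format_json"] = True
            except ValueError:
                results["format_json"] = False
        return results

    # Strict scoring: the response counts only if every constraint is satisfied.
    checks = check_constraints('{"city": "Paris"}',
                               {"max_words": 50, "must_include": ["Paris"], "format_json": True})
    print(all(checks.values()))  # True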

Part 9: Benchmark Selection Guide

By Use Case

Use Case Recommended Benchmarks
General assistant MMLU, GSM8K, IFEval
Coding tool HumanEval, SWE-bench, MBPP
Research/math MATH, AIME, GPQA
Document analysis DocVQA, LongBench, Needle
Multimodal app MMMU, MathVista, VQA
Safety-critical TruthfulQA, RealToxicityPrompts

By Model Type

Model Type Key Benchmarks
Reasoning models (o1/R1) AIME, ARC-AGI, GPQA
Code models HumanEval, SWE-bench, MultiPL-E
Chat models MT-Bench, AlpacaEval, Chatbot Arena
Embedding models MTEB, BEIR

Part 10: Benchmark Limitations

A high MMLU score does not mean the model is good for YOUR task

An MMLU score of 92% looks impressive, but: (1) top models differ by only 1-2 percentage points, a gap too small to matter in practice, (2) contamination -- models may have been trained on benchmark data, (3) MMLU tests factual knowledge (multiple choice), not reasoning, coding, or instruction following. A model at 88% MMLU can be better for your use case than one at 92%. Rule of thumb: always test on your own evaluation set of 50-100 examples from your task, and use 3+ benchmarks for any fair comparison.
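
A minimal sketch of the "own evaluation set" rule: 50-100 task-specific examples stored as (prompt, expected) pairs, run against each candidate model with the same grading function. The grade and query_model callables and the file name are placeholders for your task's own success criterion:

    import json

    def run_custom_eval(examples_path: str, query_model, grade) -> float:
        """Score a model on your own 50-100 example evaluation set.

        examples_path: JSONL file with {"prompt": ..., "expected": ...} per line.
        grade(output, expected) -> bool is your task-specific success criterion
        (exact match, regex, rubric, etc.).
        """
        with open(examples_path) as f:
            examples = [json.loads(line) for line in f]
        passed = sum(grade(query_model(ex["prompt"]), ex["expected"]) for ex in examples)
        return passed / len(examples)

    # Compare candidates on the same set before trusting any public leaderboard.
    # for model in candidates:
    #     print(model.name, run_custom_eval("my_task_eval.jsonl", model.query, my_grader))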

Known Issues

Issue Description
Contamination Models trained on test data
Saturation Top models all score 90%+
Narrow scope Doesn't reflect real-world usage
Gaming Models optimized for specific benchmarks
English bias Most benchmarks English-only

Best Practices

Practice Description
Multiple benchmarks Use 3+ for fair comparison
Domain-specific Include relevant task benchmarks
Human evaluation Benchmark + human review
Production testing Test on real use cases

Part 11: Interview-Relevant Numbers

Top Model Comparison (2026)

Benchmark GPT-5.2 Claude 4.6 Gemini 3 Pro o3-mini
MMLU 92.3% 91.8% 91.2% n/a
GPQA Diamond n/a ~75% n/a 77.0%
HumanEval 92.1% 93.7% 88.5% n/a
SWE-bench ~70% 80.9% ~65% 71.7%
AIME 2024 n/a ~80% n/a 83.3%
ARC-AGI n/a n/a n/a 87.5%
MMMU 85.4% 82.1% 83.5% n/a

Benchmark Difficulty Ranking

Difficulty Benchmarks
Easy (90%+ saturation) GSM8K, HumanEval, MMLU
Medium (70-90%) SWE-bench, MATH, MMMU
Hard (50-70%) GPQA Diamond, ARC-AGI
Very Hard (<50%) AIME, FrontierMath, HLE

Misconception: a model with a 1M context window can use all of that context

Needle in Haystack shows: recall at 4K tokens = 95%+, at 128K = 77%, at 512K = 45%, at 1M = 26%. A model with a 1M context window loses 74% of the information at its maximum length. For production: use RAG + chunking instead of a very long context. "Long context" is marketing, not a capability.
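
A minimal sketch of the chunking side of that recommendation: split a long document into overlapping chunks, retrieve only the most relevant ones, and keep the prompt well inside the range where recall is still high. The keyword-overlap scorer here is a naive stand-in for whatever embedding-based retriever you actually use, and query_model is a placeholder:

    def chunk_text(text: str, chunk_chars: int = 2000, overlap: int = 200) -> list[str]:
        """Split a long document into overlapping chunks."""
        step = chunk_chars - overlap
        return [text[i:i + chunk_chars] for i in range(0, len(text), step)]

    def retrieve(chunks: list[str], query: str, top_k: int = 5) -> list[str]:
        """Naive keyword-overlap retrieval; in practice use an embedding index."""
        query_terms = set(query.lower().split())
        scored = sorted(chunks,
                        key=lambda c: len(query_terms & set(c.lower().split())),
                        reverse=True)
        return scored[:top_k]

    # Instead of a 1M-token prompt, send only the few most relevant chunks.
    # context = "\n---\n".join(retrieve(chunk_text(long_document), user_question))
    # answer = query_model(f"{context}\n\nQuestion: {user_question}")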

Misconception: reasoning models (o-series, R1) are better for every task

o3-mini scores 87.5% on ARC-AGI (vs. GPT-4o: ~50%) and 83.3% on AIME 2024 (vs. o1: ~76%). But reasoning models cost more in compute and latency, and on the "easy" benchmarks (MMLU, GSM8K, HumanEval) they offer no advantage -- standard models already score 90%+. The reasoning premium (+15-20%) shows up only on hard tasks: SWE-bench, GPQA Diamond, AIME, ARC-AGI. For simple Q&A and autocomplete, a standard model is the better deal.


Interview Questions

Q: How do you systematically evaluate an LLM for production deployment?

❌ Red flag: "Look at MMLU and HumanEval and pick the best model"

✅ Strong answer: "A five-step process: (1) Chatbot Arena for overall quality (an Elo rating built from millions of votes), (2) domain-specific benchmarks -- SWE-bench for coding, GPQA for research, MMMU for multimodal, (3) a custom evaluation set of 50-100 examples from your specific task, (4) the cost-speed-quality tradeoff -- DeepSeek V3 is 55x cheaper than Claude Opus, (5) an A/B test with real users. No single benchmark in isolation predicts production performance."

Q: Which benchmarks still discriminate between frontier models in 2026?

❌ Red flag: "MMLU, HumanEval, GSM8K -- those are the main benchmarks"

✅ Strong answer: "Saturated (>90%, useless for comparison): MMLU, GSM8K, HumanEval, MBPP. Discriminative: GPQA Diamond (best score 77%, PhD-level), ARC-AGI (87.5%, abstract reasoning), SWE-bench Verified (80.9%, real GitHub issues), AIME 2024 (83.3%, competition math). Not yet solved: FrontierMath (research-level math), Humanity's Last Exam (the hardest questions humans can answer). The trend: every year frontier models saturate the current 'hard' benchmarks, and the community builds even harder ones."

Q: What is the data-contamination problem in benchmarks, and how is it mitigated?

❌ Red flag: "Contamination is when the model has seen the answers; you just need to keep the data secret"

✅ Strong answer: "Contamination means the model was trained on data that contains the benchmark items (web scraping picks up leaderboards, solutions, and discussions). Consequences: inflated scores and a false sense of progress. Mitigations: (1) LiveCodeBench -- uses only problems published after the model's training cutoff, (2) Humanity's Last Exam -- new questions written by experts, (3) SWE-bench Verified -- real GitHub issues with verified test cases, (4) temporal evaluation -- regular rotation of tasks. There is no perfect fix: any public benchmark eventually leaks into training data."

Q: How do you interpret Needle in a Haystack results?

❌ Red flag: "If the model supports a 1M context, it can use the entire million tokens"

✅ Strong answer: "Needle in a Haystack hides a specific fact at varying depths of a long context and checks recall. Key numbers: 4K tokens = 95%+ recall, 32K = 90%+, 128K = 77%, 512K = 45%, 1M = 26%. The degradation is nonlinear -- past 128K the drop is steep. Practical takeaway: the advertised context window and the usable context window are different things. For long documents, RAG with chunking and retrieval beats stuffing everything into the context."


Sources

  1. DataCamp -- "LLM Benchmarks Explained: A Guide to Comparing the Best AI Models"
  2. Analytics Vidhya — "Guide to AI Benchmarks: MMLU, HumanEval, and More Explained"
  3. Evidently AI — "30 LLM evaluation benchmarks and how they work"
  4. LLMIndex — "LLM Benchmarks Index"
  5. LMCouncil — "AI Model Benchmarks Feb 2026"
  6. BRACAI — "SWE-bench benchmark leaderboard in 2026"
  7. Inference.net — "Top 22 LLM Performance Benchmarks"
  8. Zylos AI — "LLM Evaluation and Benchmarking 2026"

See Also