Guide to LLM Benchmarks¶

~10 min read

Prerequisites: LLM Evaluation Metrics, LLM Evaluation Frameworks

The 30+ LLM benchmarks in use in 2026 fall into 6 categories: knowledge (MMLU, GPQA), reasoning (GSM8K, AIME, ARC-AGI), coding (HumanEval, SWE-bench), multimodal (MMMU, MathVista), safety (TruthfulQA), and long context (Needle in a Haystack). The "easy" benchmarks are saturated: MMLU 92.3%, GSM8K 97%, HumanEval 93.7%. The "hard" ones are what differentiate models: GPQA Diamond (best score 77%), AIME 2024 (83.3%), ARC-AGI (87.5%), plus FrontierMath and Humanity's Last Exam, which frontier models have not yet solved. A critical fact: recall at 1M tokens of context drops to 26% (vs. 95%+ at 4K), so a long context is not necessarily a usable context.

Part 1: Overview¶

Executive Summary¶

Key Insight:

LLM evaluation in 2026 involves 30+ benchmarks across reasoning, coding, knowledge, safety, and multimodal capabilities. MMLU shows saturation (88%+ top models), making newer benchmarks like GPQA Diamond, SWE-bench Verified, and Humanity's Last Exam more discriminative for frontier models.

2026 Benchmark Categories:

Category	Key Benchmarks	What They Test
Knowledge	MMLU, MMLU-Pro, GPQA	Factual knowledge
Reasoning	GSM8K, MATH, AIME	Math and logic
Coding	HumanEval, SWE-bench, MBPP	Code generation
Multimodal	MMMU, MathVista, VQA	Vision-language
Safety	TruthfulQA, RealToxicityPrompts	Harmful outputs
Long Context	Needle in Haystack, LongBench	Context utilization

Part 2: Knowledge Benchmarks¶

MMLU (Massive Multitask Language Understanding)¶

Aspect	Details
Subjects	57 academic subjects
Questions	~16,000 multiple choice
Difficulty	Elementary to professional
Subjects include	History, law, medicine, CS, physics, economics

MMLU Leaderboard (2026):

Model	MMLU Score
GPT-5.2	92.3%
Claude Opus 4.6	91.8%
Gemini 3 Pro	91.2%
DeepSeek V3	90.8%
Llama 4 70B	87.5%

MMLU-Pro (Harder Version)¶

Aspect	Details
Focus	More challenging reasoning
Questions	12,000+ complex questions
Difficulty	Professional/expert level
Status	Replacing original MMLU

GPQA (Graduate-Level Google-Proof Q&A)¶

Aspect	Details
Focus	PhD-level biology, physics, chemistry
Difficulty	Expert-level reasoning
Google-proof	Cannot be solved by search

GPQA Diamond Leaderboard:

Model	GPQA Diamond
o3-mini	77.0%
Claude 3.7 Sonnet	~75%
DeepSeek R1	71.5%

Part 3: Reasoning Benchmarks¶

GSM8K (Grade School Math)¶

Aspect	Details
Questions	8,500+ math word problems
Level	Grade school math
Evaluation	Exact answer match

GSM8K Leaderboard:

Model	GSM8K Score
GPT-5.2	96.5%
Claude Opus 4.6	95.8%
Gemini 3 Pro	95.2%
o3-mini	97%+
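
The "exact answer match" evaluation can be sketched as follows. This is a minimal illustration, not the official scorer: it assumes reference solutions end with a `#### <answer>` marker (the convention used in the GSM8K dataset) and that the model's final answer is the last number in its output. The function names are hypothetical.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the final numeric answer in normalized form, or None if absent."""
    # Prefer the '#### <answer>' marker used by GSM8K reference solutions.
    marker = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    if marker:
        raw = marker.group(1)
    else:
        # Assumption: otherwise the last number in the text is the answer.
        numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
        if not numbers:
            return None
        raw = numbers[-1]
    # Drop thousands separators and a trailing sentence period.
    raw = raw.replace(",", "").rstrip(".")
    value = float(raw)
    return str(int(value)) if value.is_integer() else str(value)


def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions whose final number equals the reference answer."""
    hits = 0
    for prediction, reference in zip(predictions, references):
        expected = extract_final_number(reference)
        if expected is not None and extract_final_number(prediction) == expected:
            hits += 1
    return hits / len(references)


predictions = [
    "She has 18 eggs left. The answer is 18.",
    "Total: 1,250 dollars",
    "I am not sure.",
]
references = ["16 - 3 - 4 = 9 ... #### 18", "#### 1250", "#### 7"]
accuracy = exact_match_accuracy(predictions, references)
```

Normalizing both sides before comparison avoids penalizing formatting differences such as `1,250` vs. `1250`, which a raw string comparison would count as a miss.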

MATH / MATH-500¶

Aspect	Details
Focus	Competition mathematics
Difficulty	High school competition level
Subjects	Algebra, geometry, number theory

MATH-500 Leaderboard:

Model	MATH-500
DeepSeek R1	97.3%
o3-mini	96.8%
Claude 3.7	~95%

AIME (American Invitational Mathematics Examination)¶

Aspect	Details
Focus	Advanced competition math
Difficulty	Very high
Year tracked	AIME 2024

AIME 2024 Leaderboard:

Model	AIME 2024
o3-mini	83.3%
DeepSeek R1	79.8%
Claude 3.7 Thinking	~80%

ARC-AGI (Abstraction and Reasoning Corpus)¶

Aspect	Details
Focus	Abstract reasoning, pattern recognition
Novelty	Tests generalization to unseen patterns
Human baseline	~85%

ARC-AGI Leaderboard:

Model	ARC-AGI
o3-mini	87.5% (record)
o1	~76%
GPT-4o	~50%

Part 4: Coding Benchmarks¶

HumanEval¶

Aspect	Details
Tasks	164 Python coding problems
Format	Function + docstring → implementation
Evaluation	Unit test pass rate

HumanEval Leaderboard:

Model	HumanEval
Claude Opus 4.6	93.7%
GPT-5.2	92.1%
DeepSeek V3	89.2%
Gemini 3 Pro	88.5%
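
HumanEval's unit-test pass rate is usually reported as pass@k, a metric the tables above do not spell out. The sketch below assumes the standard unbiased estimator: generate n samples per problem, count the c samples that pass all unit tests, and estimate the probability that at least one of k randomly drawn samples passes. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed all unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Three hypothetical problems, 10 samples each, with 3, 0, and 10 passing.
score = mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

With k=1 the estimator reduces to the plain fraction of passing samples, which is why single-number HumanEval scores are typically read as pass@1.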

SWE-bench (Software Engineering Benchmark)¶

Aspect	Details
Tasks	Real GitHub issues + PRs
Evaluation	Can model fix real bugs?
Difficulty	Production-level code

SWE-bench Verified Leaderboard:

Model	SWE-bench Verified
Claude Opus 4.6	80.9%
o3-mini	71.7%
GPT-5.2	~70%
DeepSeek V3	~65%

MBPP (Mostly Basic Python Problems)¶

Aspect	Details
Tasks	974 Python problems
Difficulty	Basic to intermediate
Format	Text description → code

MultiPL-E (Multilingual)¶

Aspect	Details
Languages	18 programming languages
Based on	HumanEval translations
Use case	Cross-language comparison

Part 5: Multimodal Benchmarks¶

MMMU (Multimodal Multi-discipline Understanding)¶

Aspect	Details
Focus	College-level multimodal reasoning
Modalities	Text, images, diagrams
Subjects	30+ academic disciplines

MMMU Leaderboard:

Model	MMMU
GPT-5.2	85.4%
Gemini 3 Pro	83.5%
Claude Opus 4.6	82.1%
Qwen2.5-VL	75.2%

MathVista¶

Aspect	Details
Focus	Mathematical visual reasoning
Tasks	Charts, diagrams, plots
Skills	Math + visual understanding

MathVista Leaderboard:

Model	MathVista
Gemini 3 Pro	73.1%
GPT-5.2	72.3%
Claude Opus 4.6	71.8%

VQA (Visual Question Answering)¶

Variant	Description
VQA v2	Natural images
TextVQA	Text in images (OCR)
DocVQA	Document understanding
ChartQA	Chart interpretation

Part 6: Safety & Alignment Benchmarks¶

TruthfulQA¶

Aspect	Details
Focus	Truthfulness vs. common misconceptions
Questions	817 questions
Goal	Avoid confident falsehoods

TruthfulQA Patterns:

Behavior	Description
Imitate	Model mimics common (false) beliefs
Truthful	Model provides correct answer

RealToxicityPrompts¶

Aspect	Details
Focus	Toxicity in completions
Evaluation	Perspective API scores
Goal	Minimize harmful outputs

BBQ (Bias Benchmark for Question Answering)¶

Aspect	Details
Focus	Social biases
Categories	Age, gender, race, religion, etc.
Evaluation	Disambiguated vs. ambiguous contexts

Part 7: Long Context Benchmarks¶

Needle In A Haystack¶

Aspect	Details
Task	Find specific fact in long context
Context lengths	1K to 1M+ tokens
Positions	Various depths

Context Recall Degradation:

Context Length	Recall Rate
4K tokens	95%+
32K tokens	90%+
128K tokens	77%
512K tokens	45%
1M tokens	26%
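
The test procedure behind a recall table like this can be sketched in a few lines: place a "needle" sentence at a chosen fractional depth in filler text, ask the model about it, and record whether the expected fact appears in the answer. Everything here is an assumption for illustration: `ask_model` is a hypothetical callable standing in for a real LLM call, and `toy_model` simulates a model that only attends to the first 300 characters.

```python
def build_haystack(filler, needle, depth):
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def recall_by_depth(ask_model, filler, needle, question, expected, depths):
    """Map each depth to True if the model's answer contains the expected fact."""
    results = {}
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = ask_model(context, question)
        results[depth] = expected.lower() in answer.lower()
    return results


def toy_model(context, question):
    """Stand-in for an LLM: it only 'sees' the first 300 characters."""
    visible = context[:300]
    return "7421" if "7421" in visible else "I don't know"


FILLER = ["The sky was grey that morning."] * 50
NEEDLE = "The secret code is 7421."
hits = recall_by_depth(
    toy_model, FILLER, NEEDLE, "What is the secret code?", "7421", [0.0, 0.5, 1.0]
)
recall = sum(hits.values()) / len(hits)
```

A real run would repeat this over a grid of context lengths and depths, which is how a per-length recall figure is aggregated.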

LongBench¶

Aspect	Details
Tasks	Multi-document QA, summarization
Context	Up to 32K tokens
Languages	Bilingual (EN/CN)

RULER¶

Aspect	Details
Focus	Retrieval and reasoning in long context
Tasks	Variable complexity
Max context	128K+

Part 8: Emerging Benchmarks (2026)¶

Humanity's Last Exam (HLE)¶

Aspect	Details
Focus	Hardest questions humans can answer
Difficulty	Expert-level across domains
Status	New frontier benchmark

FrontierMath¶

Aspect	Details
Focus	Research-level mathematics
Difficulty	Original, unpublished problems with verifiable answers
Goal	Test mathematical creativity

IFEval (Instruction Following)¶

Aspect	Details
Focus	Strict instruction adherence
Constraints	Format, length, keywords
Use case	Reliability in production
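
Constraints on format, length, and keywords can be checked programmatically, with no judge model involved. The sketch below is written in the spirit of IFEval rather than taken from its implementation; the check names and the scoring function are hypothetical.

```python
import json


def max_words(limit):
    """Length constraint: the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def contains_keywords(keywords):
    """Keyword constraint: every keyword appears (case-insensitive)."""
    return lambda response: all(k.lower() in response.lower() for k in keywords)


def is_valid_json(response):
    """Format constraint: the whole response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def adherence(response, checks):
    """Fraction of constraints the response satisfies."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)


checks = [max_words(10), contains_keywords(["benchmarks"]), is_valid_json]
good = adherence('{"summary": "Benchmarks saturate quickly"}', checks)
bad = adherence("Plain text about latency", checks)
```

Because each check is deterministic, the same response always gets the same score, which is what makes this style of evaluation useful as a production reliability gate.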

Part 9: Benchmark Selection Guide¶

By Use Case¶

Use Case	Recommended Benchmarks
General assistant	MMLU, GSM8K, IFEval
Coding tool	HumanEval, SWE-bench, MBPP
Research/math	MATH, AIME, GPQA
Document analysis	DocVQA, LongBench, Needle
Multimodal app	MMMU, MathVista, VQA
Safety-critical	TruthfulQA, RealToxicityPrompts

By Model Type¶

Model Type	Key Benchmarks
Reasoning models (o1/R1)	AIME, ARC-AGI, GPQA
Code models	HumanEval, SWE-bench, MultiPL-E
Chat models	MT-Bench, AlpacaEval, Chatbot Arena
Embedding models	MTEB, BEIR

Part 10: Benchmark Limitations¶

A high MMLU score does not mean the model is good for YOUR task

An MMLU score of 92% looks impressive, but: (1) top models differ by only 1-2 points, a gap small enough to be outweighed by differences in prompt format and evaluation setup, (2) contamination: models may have been trained on benchmark data, (3) MMLU tests factual knowledge in multiple-choice form, not reasoning, coding, or instruction following. A model with 88% on MMLU may serve your use case better than one with 92%. Rule of thumb: always test on your own evaluation set of 50-100 examples from your task, and use 3+ benchmarks for a fair comparison.
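
When reading a score from a small custom evaluation set, it helps to know how wide the sampling error is. The sketch below uses the normal approximation to a binomial proportion and assumes questions are independent, which is a simplification; the function name is illustrative.

```python
from math import sqrt


def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of the ~95% confidence interval for an accuracy score."""
    return z * sqrt(accuracy * (1 - accuracy) / n_questions)


# A 100-example custom set: the interval is several points wide.
margin_custom = accuracy_margin(0.92, 100)

# The same accuracy measured on a 16,000-question benchmark.
margin_large = accuracy_margin(0.92, 16_000)
```

Under these assumptions the margin is roughly ±5 points for 100 examples and roughly ±0.4 points for 16,000 questions. A 50-100 example set is therefore good at exposing large differences between models on your task, but it cannot rank models that sit within a few points of each other.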

Known Issues¶

Issue	Description
Contamination	Models trained on test data
Saturation	Top models all score 90%+
Narrow scope	Doesn't reflect real-world usage
Gaming	Models optimized for specific benchmarks
English bias	Most benchmarks English-only

Best Practices¶

Practice	Description
Multiple benchmarks	Use 3+ for fair comparison
Domain-specific	Include relevant task benchmarks
Human evaluation	Benchmark + human review
Production testing	Test on real use cases

Part 11: Interview-Relevant Numbers¶

Top Model Comparison (2026)¶

Benchmark	GPT-5.2	Claude Opus 4.6	Gemini 3 Pro	o3-mini
MMLU	92.3%	91.8%	91.2%	—
GPQA Diamond	—	~75% (Claude 3.7 Sonnet)	—	77.0%
HumanEval	92.1%	93.7%	88.5%	—
SWE-bench Verified	~70%	80.9%	—	71.7%
AIME 2024	—	~80% (Claude 3.7 Thinking)	—	83.3%
ARC-AGI	—	—	—	87.5%
MMMU	85.4%	82.1%	83.5%	—

Benchmark Difficulty Ranking¶

Difficulty (by best score reported in this guide)	Benchmarks
Easy (90%+, saturated)	GSM8K, HumanEval, MMLU, MATH-500
Medium (80-90%)	SWE-bench Verified, AIME 2024, ARC-AGI, MMMU
Hard (70-80%)	GPQA Diamond, MathVista
Very Hard (far from saturated)	FrontierMath, HLE

Misconception: a model with a 1M context window can use the whole context

Needle in a Haystack shows recall of 95%+ at 4K tokens, 77% at 128K, 45% at 512K, and 26% at 1M. At maximum length, a model with a 1M context window fails to retrieve the target fact in 74% of cases. For production, use RAG + chunking instead of relying on a long context. "Long context" is marketing, not capability.

Misconception: reasoning models (o-series, R1) are better for every task

o3-mini scores 87.5% on ARC-AGI (vs. ~76% for o1 and ~50% for GPT-4o) and 83.3% on AIME 2024. But reasoning models cost more in compute and latency, and on the "easy" benchmarks (MMLU, GSM8K, HumanEval) they offer no advantage, because standard models already score 90%+. The reasoning premium (+15-20%) shows up only on hard tasks: SWE-bench, GPQA Diamond, AIME, ARC-AGI. For simple Q&A and autocomplete, a standard model is the better value.

Interview Questions¶

Q: How do you systematically evaluate an LLM for production deployment?

Red flag: "We'll look at MMLU and HumanEval and pick the best one"

Strong answer: "A five-step process: (1) Chatbot Arena for overall quality (an Elo rating built on millions of votes), (2) domain-specific benchmarks: SWE-bench for coding, GPQA for research, MMMU for multimodal, (3) a custom evaluation set of 50-100 examples from your specific task, (4) the cost-speed-quality tradeoff: DeepSeek V3 is 55x cheaper than Claude Opus, (5) an A/B test with real users. No single benchmark in isolation predicts production performance."
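
For readers unfamiliar with how pairwise votes turn into an Elo rating, here is a minimal sketch of the classic Elo update. It illustrates the general technique only; the K-factor of 32 and the starting rating of 1000 are assumptions, and this is not a description of how Chatbot Arena computes its leaderboard.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """Return new (A, B) ratings; score_a is 1 for a win, 0.5 tie, 0 loss."""
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# One human vote preferring model A over an equally rated model B.
new_a, new_b = update_elo(1000, 1000, score_a=1.0)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, which is why ratings stabilize only after many votes.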

Q: Which benchmarks still discriminate between frontier models in 2026?

Red flag: "MMLU, HumanEval, GSM8K are the main benchmarks"

Strong answer: "Saturated (>90%, useless for comparison): MMLU, GSM8K, HumanEval, MBPP. Discriminative: GPQA Diamond (best score 77%, PhD-level), ARC-AGI (87.5%, abstract reasoning), SWE-bench Verified (80.9%, real GitHub issues), AIME 2024 (83.3%, competition math). Not yet solved: FrontierMath (research-level math), Humanity's Last Exam (the hardest questions humans can answer). The trend: every year frontier models saturate the current 'hard' benchmarks, and the community builds even harder ones."

Q: What is the data contamination problem in benchmarks, and how is it addressed?

Red flag: "Contamination is when the model has seen the answers; you just need to keep the data secret"

Strong answer: "Contamination means the model was trained on data that contains benchmark tasks (web scraping captures leaderboards, solutions, and discussions). The consequences are inflated scores and a false sense of progress. Mitigations: (1) LiveCodeBench uses only tasks published after the model's training cutoff, (2) Humanity's Last Exam relies on new questions written by experts, (3) SWE-bench Verified uses real GitHub issues with verified test cases, (4) temporal evaluation rotates tasks regularly. There is no perfect solution: any public benchmark eventually leaks into training data."
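
The "tasks published after the training cutoff" idea reduces to a date filter. The sketch below is a generic illustration, not LiveCodeBench's code; the `Task` type, the task IDs, and the dates are made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Task:
    task_id: str
    published: date


def uncontaminated_tasks(tasks, training_cutoff: date):
    """Keep only tasks published strictly after the model's training cutoff."""
    return [task for task in tasks if task.published > training_cutoff]


tasks = [
    Task("two-sum-variant", date(2024, 3, 1)),
    Task("interval-merge", date(2025, 6, 15)),
    Task("graph-coloring", date(2025, 11, 2)),
]
# Hypothetical model whose training data ends on 2025-06-30.
fresh = uncontaminated_tasks(tasks, training_cutoff=date(2025, 6, 30))
fresh_ids = [task.task_id for task in fresh]
```

The practical consequence is that each model gets evaluated on a different subset, so scores are comparable only when restricted to tasks newer than the latest cutoff among the models being compared.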

Q: How should Needle in a Haystack results be interpreted?

Red flag: "If the model supports a 1M context, it can use the whole million"

Strong answer: "Needle in a Haystack hides a specific fact at varying depths in a long context and checks recall. Key numbers: 4K tokens = 95%+ recall, 32K = 90%+, 128K = 77%, 512K = 45%, 1M = 26%. The degradation is nonlinear: past 128K the drop is steep. Practical takeaway: the advertised context window and the usable context window are different things. For long documents, RAG with chunking + retrieval is more effective than stuffing everything into the context."

Sources¶

DataCamp, "LLM Benchmarks Explained: A Guide to Comparing the Best AI Models"
Analytics Vidhya, "Guide to AI Benchmarks: MMLU, HumanEval, and More Explained"
Evidently AI, "30 LLM evaluation benchmarks and how they work"
LLMIndex, "LLM Benchmarks Index"
LMCouncil, "AI Model Benchmarks Feb 2026"
BRACAI, "SWE-bench benchmark leaderboard in 2026"
Inference.net, "Top 22 LLM Performance Benchmarks"
Zylos AI, "LLM Evaluation and Benchmarking 2026"