Recent LLM Papers and Benchmarks¶
~6 min read
Prerequisites: LLM Benchmarks Guide | Efficient Transformers
By February 2026, SWE-bench has replaced HumanEval as the primary coding benchmark (HumanEval is saturated at 95%+), with Claude Opus 4.5 scoring 80.9%; the gap between open-source and closed models has narrowed to 19 percentage points (62% for Kimi K2.5). In parallel, test-time compute has become a standard technique, adding 5-10% quality via "thinking" mode. This article covers the key arXiv papers of February 2026 and the current benchmark leaderboards.
URL: LinkedIn LLM Papers, ToLearn Blog | Type: research / benchmarks / papers | Date: February 2026 | Collected by: Ralph Research PHASE 5
Part 1: Recent arXiv Papers (February 2026)¶
1. Learning to Discover at Test Time¶
Paper: arXiv:2601.16175
Key Innovation: Models that discover new patterns during inference, not just during training.
Core Concept:
- Traditional models: learn patterns during training, apply them during inference
- Test-Time Discovery: actively search for patterns during inference
- Enables handling of novel distributions without retraining
Implications (a minimal sketch of the idea follows):
- Better OOD (out-of-distribution) performance
- Adaptive to new domains
- Reduces the need for continuous retraining
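The paper's exact procedure is not reproduced here. As a flavor of the general mechanism (adapting on unlabeled test inputs during inference), below is a minimal entropy-minimization loop in the style of TENT-like test-time adaptation; `model` and `optimizer` are assumed PyTorch objects, and everything here is illustrative rather than the paper's algorithm.

```python
import torch.nn.functional as F

def adapt_on_batch(model, x, optimizer, steps=1):
    """Test-time adaptation sketch (illustrative, not the paper's
    algorithm): minimize prediction entropy on an unlabeled test
    batch so the model adapts to the new distribution at inference."""
    for _ in range(steps):
        logits = model(x)                        # forward pass on test data
        log_probs = F.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(-1).mean()
        optimizer.zero_grad()
        entropy.backward()                       # update weights at test time
        optimizer.step()
    return model(x).argmax(dim=-1)               # predict after adaptation
```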
2. End-to-End Test-Time Training for Long Context¶
Paper: arXiv:2512.23675
Key Innovation: Updating Transformer weights at inference time to handle extremely long contexts.
Method:
```mermaid
graph LR
    subgraph Standard
        A1["Forward pass"] --> A2["Output"]
    end
    subgraph T3["Test-Time Training"]
        B1["Forward pass"] --> B2["Update weights"]
        B2 --> B3["Forward pass"] --> B4["Output"]
    end
    style A1 fill:#e8eaf6,stroke:#3f51b5
    style A2 fill:#e8f5e9,stroke:#4caf50
    style B1 fill:#e8eaf6,stroke:#3f51b5
    style B2 fill:#fff3e0,stroke:#ef6c00
    style B3 fill:#e8eaf6,stroke:#3f51b5
    style B4 fill:#e8f5e9,stroke:#4caf50
```
Results:
- Handles sequences 10x longer than those seen in training
- Maintains quality without architecture changes
- Minimal compute overhead (5-10%)

Trade-offs (a code sketch follows):
- Slightly slower inference
- Requires careful gradient computation
- Memory overhead for weight updates
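A minimal sketch of the idea, assuming a Hugging Face-style causal LM and tokenizer: clone the model, take a few next-token gradient steps on the long context itself, then answer with the adapted copy. This is illustrative, not the paper's exact recipe.

```python
import copy
import torch

def answer_with_ttt(model, tokenizer, context, question, lr=1e-5, steps=4):
    """Test-time training sketch: adapt a throwaway copy of the model
    on the long context via next-token prediction, then answer with it.
    Assumes a Hugging Face-style causal LM; illustrative only."""
    adapted = copy.deepcopy(model)             # leave the base model untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    ids = tokenizer(context, return_tensors="pt").input_ids
    for _ in range(steps):
        loss = adapted(ids, labels=ids).loss   # self-supervised LM loss on the context
        opt.zero_grad()
        loss.backward()
        opt.step()
    prompt = tokenizer(context + "\n\n" + question, return_tensors="pt").input_ids
    return adapted.generate(prompt, max_new_tokens=200)
```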
3. Recursive Language Models¶
Paper: arXiv:2512.24601
Key Innovation: Handling arbitrarily long inputs via recursive model calls.
Architecture:
```mermaid
graph LR
    A["Input"] --> C1["Chunk 1"]
    C1 --> M1["Model Call"]
    M1 --> H1["Hidden State"]
    H1 --> C2["Chunk 2"]
    C2 --> M2["Model Call<br/>+ prev hidden state"]
    M2 --> H2["Hidden State"]
    H2 --> D["... --> Output"]
    style A fill:#e8eaf6,stroke:#3f51b5
    style C1 fill:#fff3e0,stroke:#ef6c00
    style C2 fill:#fff3e0,stroke:#ef6c00
    style M1 fill:#f3e5f5,stroke:#9c27b0
    style M2 fill:#f3e5f5,stroke:#9c27b0
    style H1 fill:#e8f5e9,stroke:#4caf50
    style H2 fill:#e8f5e9,stroke:#4caf50
    style D fill:#e8eaf6,stroke:#3f51b5
```
Advantages:
- No theoretical context limit
- Linear scaling with input length
- Composable with any base model

Applications (a minimal sketch follows):
- Document processing (1000+ pages)
- Code analysis (entire codebases)
- Multi-turn conversations with full history
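A minimal sketch of the recursive pattern, assuming a generic `llm(prompt) -> str` completion function. Since API models do not expose hidden states, a running text summary stands in for the carried state here; the paper's actual mechanism may differ.

```python
def recursive_answer(llm, document, question, chunk_size=4000):
    """Recursive-LM sketch: fold a long document chunk by chunk,
    carrying a compressed state into each next call. `llm` is an
    assumed `prompt -> str` completion function; a text summary
    stands in for the carried hidden state."""
    state = ""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    for chunk in chunks:
        state = llm(
            f"State so far:\n{state}\n\nNew chunk:\n{chunk}\n\n"
            f"Update the state, keeping facts relevant to: {question}"
        )
    return llm(f"State:\n{state}\n\nAnswer the question: {question}")
```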
4. STEM: Scaling Transformers with Embedding Modules¶
Paper: arXiv:2601.10639
Key Innovation: Fine-grained sparse transformer with learnable embedding modules.
Architecture:
```mermaid
graph LR
    subgraph Traditional["Traditional Transformer"]
        T1["All tokens"] --> T2["All attention heads"] --> T3["Dense computation"]
    end
    subgraph STEM_arch["STEM"]
        S1["Tokens"] --> S2["Router"] --> S3["Selected Embedding<br/>Modules"] --> S4["Sparse computation"]
    end
    style T1 fill:#e8eaf6,stroke:#3f51b5
    style T2 fill:#e8eaf6,stroke:#3f51b5
    style T3 fill:#fce4ec,stroke:#c62828
    style S1 fill:#e8eaf6,stroke:#3f51b5
    style S2 fill:#fff3e0,stroke:#ef6c00
    style S3 fill:#f3e5f5,stroke:#9c27b0
    style S4 fill:#e8f5e9,stroke:#4caf50
```
Key Metrics:

| Metric | Traditional | STEM |
|--------|-------------|------|
| Compute for 1M tokens | 100% | 15-25% |
| Quality (MMLU) | Baseline | -2% to +1% |
| Memory | 100% | 40-60% |

Embedding Module Types (a routing sketch follows the list):
1. Domain-specific: Code, Math, Reasoning, Creative
2. Task-specific: Classification, Generation, Summarization
3. Language-specific: Multilingual routing
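Below is a sketch of MoE-style top-k routing over embedding modules, the general mechanism the diagram describes; it is not the paper's exact architecture, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class EmbeddingModuleLayer(nn.Module):
    """STEM-flavored sketch: a router scores learnable embedding modules
    per token and only the top-k contribute, so compute is sparse.
    Illustrative, not the paper's exact architecture."""
    def __init__(self, d_model=512, n_modules=16, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_modules)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_modules)])
        self.k = k

    def forward(self, x):                                # x: [batch, seq, d_model]
        weights, idx = self.gate(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        # Naive reference loop: a real kernel dispatches only routed tokens.
        for m, expert in enumerate(self.experts):
            for j in range(self.k):
                mask = (idx[..., j] == m).unsqueeze(-1)  # tokens routed to module m
                if mask.any():
                    out = out + mask * weights[..., j:j + 1] * expert(x)
        return out
```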
5. Mixture-of-Agents (MoA) Papers¶
Papers: Multiple (February 2026)
Core Concept: Multiple LLM agents collaborating vs single model.
Architectures:
1. Sequential MoA: Agent 1 → Agent 2 → Agent 3 → Output
2. Parallel MoA: Agent 1, Agent 2, Agent 3 → Aggregator → Output
3. Hierarchical MoA: Manager → Workers → Aggregator → Output

Findings:
- 3-5 agents is optimal (diminishing returns after that)
- Assigning different models to different roles works best
- Aggregator model quality is critical

Production Results (from papers; a parallel-MoA sketch follows the table):

| Setup | Cost | Quality Gain |
|-------|------|--------------|
| 3x GPT-5.2-mini | 1.5x | +8% on complex tasks |
| 1x Claude + 1x GPT | 2x | +12% on reasoning |
| 5x Llama-70B | 0.5x | +5% on code |
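A minimal parallel-MoA sketch; `agents` and `aggregator` are assumed `fn(prompt) -> str` callables wrapping whatever models you use.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_moa(agents, aggregator, prompt):
    """Parallel MoA sketch: query several models on the same prompt,
    then have an aggregator synthesize a final answer. `agents` and
    `aggregator` are assumed `fn(prompt) -> str` callables."""
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(lambda agent: agent(prompt), agents))
    numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return aggregator(
        f"Task:\n{prompt}\n\nCandidate answers:\n{numbered}\n\n"
        "Synthesize the single best answer, resolving any disagreements."
    )
```

As the findings above note, the aggregator's quality matters more than the agent count, so it is usually the strongest model in the setup.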
6. Reinforcement Learning from AI Feedback (RLAIF) Advances¶
Papers: Multiple (February 2026)
Key Innovations:
1. Constitutional AI 2.0: Anthropic's improved approach
2. Self-RLAIF: Model generates its own feedback
3. Multi-objective RLAIF: Balance safety, helpfulness, and accuracy

Constitutional AI 2.0 Principles (a critique-revise sketch follows):
- Self-critique before response
- Principle-based reasoning
- Automatic principle generation
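A sketch of the classic critique-revise loop from the original Constitutional AI recipe; the 2.0 pipeline described above is presumably more elaborate. `llm` is an assumed `prompt -> str` completion function.

```python
def constitutional_reply(llm, principles, user_msg):
    """Critique-revise sketch in the spirit of the original Constitutional
    AI recipe (the 2.0 pipeline is presumably more elaborate). `llm` is
    an assumed `prompt -> str` completion function."""
    draft = llm(user_msg)
    rules = "\n".join(f"- {p}" for p in principles)
    critique = llm(
        f"Principles:\n{rules}\n\nResponse:\n{draft}\n\n"
        "List any ways the response violates the principles."
    )
    return llm(
        f"Original response:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so it fully satisfies the principles."
    )
```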
Part 2: LLM Coding Benchmarks (February 2026)¶
SWE-bench Verified Results¶
What it measures: Real GitHub issues + PR patches
| Rank | Model | Score | Notes |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% | SOTA |
| 2 | GPT-5.2 | 80.0% | Close second |
| 3 | Claude Sonnet 4.5 | 72.3% | Best value |
| 4 | Gemini 2.5 Pro | 63.8% | - |
| 5 | DeepSeek-V3 | 58.2% | Open-weight leader |
Key Findings:
- Claude Opus 4.5 leads GPT-5.2 by 0.9 percentage points
- Open-source models are catching up (DeepSeek at 58%)
- Test-time compute helps significantly (+5-10%)
LiveCodeBench Results (February 2026)¶
What it measures: Real-time coding challenges
| Model | Score | Notes |
|---|---|---|
| GPT-5.2 | ~89% | Top performer |
| GLM-4.7 Thinking | 89% | Chinese model |
| Claude Opus 4.5 | 85-88% | - |
| Qwen3-Coder-32B | 78% | Open-weight |
| Kimi K2.5 | 76% | Open-weight |
Key Findings (a best-of-n sketch follows the list):
- "Thinking" models outperform standard models by 5-10%
- Test-time search (MCTS, beam search) adds 3-5%
- The open-source gap is narrowing
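Best-of-n sampling with a verifier is the simplest form of the test-time search mentioned above; here is a hedged sketch, with `llm` and `verifier` as assumed callables.

```python
def best_of_n(llm, verifier, prompt, n=8):
    """Best-of-n sketch, the simplest test-time search: sample n
    candidates and keep the one the verifier scores highest. `llm`
    samples one completion; `verifier(prompt, sol) -> float` is an
    assumed scorer (e.g., fraction of unit tests passed for code)."""
    candidates = [llm(prompt) for _ in range(n)]
    return max(candidates, key=lambda sol: verifier(prompt, sol))
```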
Terminal-Bench Results¶
What it measures: Terminal/command-line tasks
| Model | Score | Notes |
|---|---|---|
| Claude Opus 4.5 | 92% | Best for sysadmin tasks |
| GPT-5.2 | 89% | - |
| Open-source best | 75% | Qwen3-Coder |
Unique Challenges:
- Multi-step command sequences
- Error recovery
- Environment awareness
- Security considerations
HumanEval and MBPP (Classic Benchmarks)¶
Status: Mostly saturated (90%+ for top models)
| Model | HumanEval | MBPP |
|---|---|---|
| GPT-5.2 | 96% | 94% |
| Claude Opus 4.5 | 95% | 93% |
| DeepSeek-V3 | 89% | 87% |
| Qwen3-Coder | 88% | 85% |
Key Finding: These benchmarks no longer differentiate top models; focus has shifted to SWE-bench and LiveCodeBench. HumanEval/MBPP scores are conventionally reported via pass@k, sketched below.
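For reference, the standard unbiased pass@k estimator (Chen et al., 2021), which underlies the HumanEval numbers above:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn from n generations of which c are
    correct, passes all tests."""
    if n - c < k:
        return 1.0                  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=37, k=10))  # e.g. 200 samples, 37 correct -> ~0.88
```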
Part 3: Open-Source Coding Models (February 2026)¶
Leaderboard¶
| Model | Params | License | Key Benchmark |
|---|---|---|---|
| Kimi K2.5 | 1.2T | Custom | SWE-bench: 62% |
| Qwen3-Coder-32B | 32B | Apache 2.0 | HumanEval: 88% |
| DeepSeek-V3 | 650B | Custom | SWE-bench: 58% |
| CodeLlama-70B | 70B | Llama License | HumanEval: 78% |
Qwen3-Coder Family¶
| Model | Size | Price | Best For |
|---|---|---|---|
| Qwen3-Coder-7B | 7B | Free | Edge deployment |
| Qwen3-Coder-14B | 14B | Free | Consumer hardware |
| Qwen3-Coder-32B | 32B | Free | Production use |
Notable: The Apache 2.0 license permits commercial use with only minimal conditions (attribution, patent grant).
Part 4: Benchmark Analysis for Interview Prep¶
What to Know for Interviews¶
Key Numbers (February 2026):
| Metric | Value |
|---|---|
| SWE-bench SOTA | 80.9% (Claude Opus 4.5) |
| LiveCodeBench SOTA | ~89% (GPT-5.2, GLM-4.7) |
| Open-source SWE-bench | 62% (Kimi K2.5) |
| HumanEval saturation | 95%+ for top models |
Trends to Discuss:
- SWE-bench is the new standard: HumanEval/MBPP are saturated
- Test-time compute matters: +5-10% from thinking/reasoning
- Open-source catching up: 62% SWE-bench from open models
- MoA approaches: 3-5 agents can beat single frontier model
Architecture Patterns:
- STEM (Scaling Transformers with Embedding Modules): 75-85% compute reduction
- Test-Time Training: Handle 10x longer contexts
- Recursive LMs: Unlimited context via recursion
- MoA (Mixture of Agents): Multiple models collaborating
Part 5: Implications for Practice¶
Model Selection (February 2026)¶
| Use Case | Best Model | Alternative |
|---|---|---|
| Code generation (SOTA) | Claude Opus 4.5 | GPT-5.2 |
| Code generation (value) | Claude Sonnet 4.5 | Qwen3-Coder-32B |
| Long context | Llama 4 Scout (10M) | Claude Opus 4.5 (1M) |
| Open source | Qwen3-Coder-32B | DeepSeek-V3 |
| Complex reasoning | Claude Opus 4.5 | GPT-5.2 |
Cost-Performance Trade-offs¶
| Tier | Cost ($/M tokens) | Quality (SWE-bench) | Use Case |
|---|---|---|---|
| Frontier | 15-60 | 80-81% | Critical tasks |
| Mid-tier | 3-10 | 70-75% | Production code |
| Open-source | 0.5-2 | 58-62% | High volume |
Misconception: HumanEval is a reliable indicator of a model's coding ability
HumanEval is saturated: top models score 95-96%, and the differences are within noise. Real differentiation happens on SWE-bench (real GitHub issues, multi-file patches) and LiveCodeBench (fresh problems absent from training data). HumanEval measures single-function completion, not engineering ability.
Misconception: Mixture-of-Agents always beats a single model
MoA yields +8-12% on complex tasks at 1.5-2x the cost, but on simple tasks the overhead does not pay off. The optimum is 3-5 agents (diminishing returns after that). The key factor is the quality of the aggregator model, not the number of agents: a weak aggregator degrades the result.
Misconception: open-source models are an order of magnitude worse than closed ones
The gap has narrowed to 19 percentage points on SWE-bench (62% for Kimi K2.5 vs 80.9% for Claude Opus 4.5), and to about 7 points on HumanEval. DeepSeek-V3, with its MIT license, competes on MMLU/MATH. With self-hosting (500K+ queries/mo), open-source delivers roughly 25% better ROI.
Interview Questions¶
Q: Why did SWE-bench replace HumanEval as the primary coding benchmark?
Red flag: "SWE-bench is just a newer benchmark with bigger tasks."
Strong answer: "HumanEval is saturated (95%+ for top models) and measures only single-function completion. SWE-bench uses real GitHub issues with multi-file patches, which is much closer to real engineering work. The key difference: SWE-bench requires understanding project context, navigating the codebase, and reasoning about dependencies. On top of that, LiveCodeBench provides fresh problems that have not leaked into training data."
Q: What is test-time compute and why does it matter?
Red flag: "It's when the model thinks longer and answers better."
Strong answer: "Test-time compute means extra computation at inference: thinking tokens (CoT), MCTS/beam search for reasoning, and test-time training (updating weights on the context). It yields +5-10% on SWE-bench, plus +3-5% from search. Example: T3 (test-time training) handles 10x longer contexts without architecture changes at 5-10% overhead. The trade-off is that latency and cost grow, so selective compute is applied: more for hard tasks."
Q: Which model would you choose for production code generation, and why?
Red flag: "The newest one, Claude Opus, because it has the best score."
Strong answer: "It depends on the use case. Critical tasks: Claude Opus 4.5 (80.9% SWE-bench) or GPT-5.2 (80.0%), at $15-60/M tokens. Production bulk: Claude Sonnet 4.5 (72.3%), the best cost/quality point. High volume: Qwen3-Coder-32B (Apache 2.0, HumanEval 88%, self-hosted at $0.35/M). For long context: Llama 4 Scout (10M tokens). The key metric is cost per resolved issue, not the benchmark score."
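As a back-of-envelope illustration of the "cost per resolved issue" metric from the answer above; all numbers here are hypothetical.

```python
def cost_per_resolved_issue(price_per_m_tokens, tokens_per_attempt, solve_rate):
    """Back-of-envelope 'cost per resolved issue': expected attempts
    per resolved issue is 1 / solve_rate. All numbers below are
    hypothetical, for illustration only."""
    cost_per_attempt = price_per_m_tokens * tokens_per_attempt / 1e6
    return cost_per_attempt / solve_rate

# Hypothetical: $15/M tokens, 60K tokens per attempt, 80% solve rate -> ~$1.13
print(cost_per_resolved_issue(15, 60_000, 0.80))
```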
Sources¶
- LinkedIn — "LLM Papers Reading Notes - February Week 2, 2026"
- ToLearn Blog — "The LLM Coding Benchmark Showdown 2026"
- SWE-bench Official Leaderboard
- LiveCodeBench Official Results
- arXiv papers (February 2026 batch)
See Also¶
- LLM Code Benchmarks: a detailed breakdown of SWE-bench, LiveCodeBench, and HumanEval, including the pass@k formula
- LLM Benchmarks Guide: a full taxonomy of benchmarks: knowledge, reasoning, coding, multimodal, safety
- Efficient Transformers: STEM (sparse transformer) and test-time training, the architectures behind the papers in this section
- Scaling Reasoning: test-time compute scaling and thinking models, related to the reasoning advances here
- Open-Source LLM Models: Qwen3-Coder, DeepSeek-V3, Kimi K2.5, the open-source models from these leaderboards