
Recent LLM Papers and Benchmarks

~6 minute read

Background reading: LLM Benchmarks Guide | Efficient Transformers

By February 2026, SWE-bench had replaced HumanEval as the primary coding benchmark (HumanEval is saturated at 95%+), with Claude Opus 4.5 scoring 80.9%; the gap between open-source and closed models has narrowed to 19 percentage points (62% for Kimi K2.5). In parallel, test-time compute has become a standard technique, adding 5-10% quality via a "thinking" mode. This article covers the key arXiv papers from February 2026 and the current benchmark leaderboards.

URL: LinkedIn LLM Papers, ToLearn Blog | Type: research / benchmarks / papers | Date: February 2026 | Collection: Ralph Research, PHASE 5


Part 1: Recent arXiv Papers (February 2026)

1. Learning to Discover at Test Time

Paper: arXiv:2601.16175

Key Innovation: Models that discover new patterns during inference, not just during training.

Core Concept:
- Traditional models: learn patterns during training, apply them during inference
- Test-Time Discovery: actively search for patterns during inference
- Enables handling of novel distributions without retraining

Implications:
- Better OOD (out-of-distribution) performance
- Adaptive to new domains
- Reduces need for continuous retraining
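
The paper's exact procedure is not reproduced here, but the general "adapt during inference" idea can be illustrated with a minimal test-time adaptation loop. The sketch below is an assumption-heavy example (entropy minimization over LayerNorm parameters in PyTorch, assuming `model(x)` returns classification logits), not the method of arXiv:2601.16175.

```python
# Illustrative sketch only: a generic test-time adaptation loop
# (entropy minimization over a test batch). This is NOT the method
# from arXiv:2601.16175; it just shows the "adapt during inference" idea.
import torch
import torch.nn.functional as F

def test_time_adapt(model, x, steps=3, lr=1e-4):
    """Run a few gradient steps on the unlabeled test batch x
    before producing the final prediction."""
    # Only adapt a small, safe subset of parameters (here: LayerNorm affine params).
    params = [p for n, p in model.named_parameters() if "norm" in n.lower()]
    optimizer = torch.optim.SGD(params, lr=lr)

    model.train()
    for _ in range(steps):
        logits = model(x)
        probs = F.softmax(logits, dim=-1)
        # Entropy of predictions: lower entropy means more confident, self-consistent outputs.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        return model(x)
```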


2. End-to-End Test-Time Training for Long Context

Paper: arXiv:2512.23675

Key Innovation: Updating Transformer weights at inference time to handle extremely long contexts.

Method:

```mermaid
graph LR
    subgraph Standard
        A1["Forward pass"] --> A2["Output"]
    end
    subgraph T3["Test-Time Training"]
        B1["Forward pass"] --> B2["Update weights"]
        B2 --> B3["Forward pass"] --> B4["Output"]
    end

    style A1 fill:#e8eaf6,stroke:#3f51b5
    style A2 fill:#e8f5e9,stroke:#4caf50
    style B1 fill:#e8eaf6,stroke:#3f51b5
    style B2 fill:#fff3e0,stroke:#ef6c00
    style B3 fill:#e8eaf6,stroke:#3f51b5
    style B4 fill:#e8f5e9,stroke:#4caf50
```

Results:
- Handles sequences 10x longer than those seen during training
- Maintains quality without architecture changes
- Minimal compute overhead (5-10%)

Trade-offs:
- Slightly slower inference
- Requires careful gradient computation
- Memory overhead for weight updates
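
As a rough illustration of the technique (not the paper's exact recipe), the sketch below takes a few gradient steps on a next-token loss over the long context itself, then answers the query with the temporarily adapted weights. It assumes a causal LM whose forward pass maps token ids to logits; the hyperparameters are placeholders.

```python
# Minimal sketch (assumption, not the exact recipe from arXiv:2512.23675):
# take a few gradient steps on a next-token-prediction loss over the long
# context itself, then answer the query with the temporarily updated weights.
import copy
import torch
import torch.nn.functional as F

def answer_with_ttt(model, context_ids, query_ids, steps=4, lr=5e-5):
    """context_ids, query_ids: LongTensor of shape (1, seq_len)."""
    adapted = copy.deepcopy(model)          # keep the original weights untouched
    optimizer = torch.optim.AdamW(adapted.parameters(), lr=lr)

    adapted.train()
    for _ in range(steps):
        # Self-supervised objective: predict each context token from its prefix.
        logits = adapted(context_ids[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            context_ids[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    adapted.eval()
    with torch.no_grad():
        # The adapted weights now "contain" the context; answer the query.
        return adapted(query_ids)
```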


3. Recursive Language Models

Paper: arXiv:2512.24601

Key Innovation: Handling arbitrarily long inputs via recursive model calls.

Architecture:

```mermaid
graph LR
    A["Input"] --> C1["Chunk 1"]
    C1 --> M1["Model Call"]
    M1 --> H1["Hidden State"]
    H1 --> C2["Chunk 2"]
    C2 --> M2["Model Call<br/>+ prev hidden state"]
    M2 --> H2["Hidden State"]
    H2 --> D["... --> Output"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style C1 fill:#fff3e0,stroke:#ef6c00
    style C2 fill:#fff3e0,stroke:#ef6c00
    style M1 fill:#f3e5f5,stroke:#9c27b0
    style M2 fill:#f3e5f5,stroke:#9c27b0
    style H1 fill:#e8f5e9,stroke:#4caf50
    style H2 fill:#e8f5e9,stroke:#4caf50
    style D fill:#e8eaf6,stroke:#3f51b5
```

Advantages:
- No theoretical context limit
- Linear scaling with input length
- Composable with any base model

Applications:
- Document processing (1000+ pages)
- Code analysis (entire codebases)
- Multi-turn conversations with full history
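
A hedged sketch of the recursive pattern: instead of passing a hidden-state tensor, this version carries a running text summary between calls, which approximates the idea with any chat-style model. The `llm(prompt) -> str` callable is a hypothetical placeholder, not a specific API.

```python
# Illustrative sketch, assuming a generic `llm(prompt) -> str` callable
# (hypothetical placeholder, not a specific API). The recursion from
# arXiv:2512.24601 is approximated here by carrying a running summary
# ("state") from one chunk to the next instead of a hidden-state tensor.
def recursive_answer(llm, document: str, question: str, chunk_size: int = 8000) -> str:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    state = ""  # carried context across model calls
    for chunk in chunks:
        state = llm(
            f"Previous notes:\n{state}\n\n"
            f"New chunk:\n{chunk}\n\n"
            f"Update the notes with anything relevant to: {question}"
        )
    # Final call: answer the question from the accumulated notes.
    return llm(f"Notes:\n{state}\n\nAnswer the question: {question}")
```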


4. STEM: Scaling Transformers with Embedding Modules

Paper: arXiv:2601.10639

Key Innovation: Fine-grained sparse transformer with learnable embedding modules.

Architecture:

```mermaid
graph LR
    subgraph Traditional["Traditional Transformer"]
        T1["All tokens"] --> T2["All attention heads"] --> T3["Dense computation"]
    end
    subgraph STEM_arch["STEM"]
        S1["Tokens"] --> S2["Router"] --> S3["Selected Embedding<br/>Modules"] --> S4["Sparse computation"]
    end

    style T1 fill:#e8eaf6,stroke:#3f51b5
    style T2 fill:#e8eaf6,stroke:#3f51b5
    style T3 fill:#fce4ec,stroke:#c62828
    style S1 fill:#e8eaf6,stroke:#3f51b5
    style S2 fill:#fff3e0,stroke:#ef6c00
    style S3 fill:#f3e5f5,stroke:#9c27b0
    style S4 fill:#e8f5e9,stroke:#4caf50
```

Key Metrics:

| Metric | Traditional | STEM |
|--------|-------------|------|
| Compute for 1M tokens | 100% | 15-25% |
| Quality (MMLU) | Baseline | -2% to +1% |
| Memory | 100% | 40-60% |

Embedding Module Types:

  1. Domain-specific: Code, Math, Reasoning, Creative
  2. Task-specific: Classification, Generation, Summarization
  3. Language-specific: Multilingual routing
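
To make the router-plus-modules idea concrete, here is a minimal MoE-style layer in PyTorch: a linear router picks top-k "embedding modules" per token, so only the selected modules run. This is an illustrative sketch of fine-grained sparse routing, not the actual STEM architecture from arXiv:2601.10639; the module count, top-k value, and MLP shape are assumptions.

```python
# Minimal sketch of token-level routing to sparse "embedding modules"
# (an MoE-style illustration of the idea; module count, top-k and MLP shape
# are assumptions, not the exact STEM design from arXiv:2601.10639).
import torch
import torch.nn as nn

class EmbeddingModuleLayer(nn.Module):
    def __init__(self, d_model=512, n_modules=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_modules)   # scores each module per token
        self.modules_ = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modules)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        scores = self.router(x)                        # (batch, seq, n_modules)
        weights, idx = scores.softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected modules run for each token -> sparse computation.
        for k in range(self.top_k):
            for m, module in enumerate(self.modules_):
                mask = idx[..., k] == m                # tokens routed to module m
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * module(x[mask])
        return out
```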


5. Mixture-of-Agents (MoA) Papers

Papers: Multiple (February 2026)

Core Concept: Multiple LLM agents collaborating vs single model.

Architectures:

  1. Sequential MoA: Agent 1 → Agent 2 → Agent 3 → Output
  2. Parallel MoA: Agent 1, Agent 2, Agent 3 → Aggregator → Output
  3. Hierarchical MoA: Manager → Workers → Aggregator → Output

Findings:
- 3-5 agents optimal (diminishing returns after that)
- Different models for different roles works best
- Aggregator model quality critical

Production Results (from papers):

| Setup | Cost | Quality Gain |
|-------|------|--------------|
| 3x GPT-5.2-mini | 1.5x | +8% on complex tasks |
| 1x Claude + 1x GPT | 2x | +12% on reasoning |
| 5x Llama-70B | 0.5x | +5% on code |
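
A parallel MoA step can be sketched in a few lines: several proposer models answer independently, and an aggregator model merges their drafts. The `call_llm(model_name, prompt) -> str` helper and the model names are hypothetical placeholders, not a specific vendor SDK.

```python
# Sketch of a parallel Mixture-of-Agents step, assuming a generic
# `call_llm(model_name, prompt) -> str` helper (hypothetical placeholder,
# not a specific vendor SDK).
def mixture_of_agents(call_llm, task: str,
                      proposers=("agent-a", "agent-b", "agent-c"),
                      aggregator="aggregator-model") -> str:
    # 1. Each proposer answers independently (parallel MoA).
    drafts = [call_llm(m, task) for m in proposers]

    # 2. The aggregator sees all drafts and writes the final answer;
    #    per the papers above, its quality is the critical factor.
    numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return call_llm(
        aggregator,
        f"Task:\n{task}\n\n{numbered}\n\n"
        "Combine the strongest parts of the drafts into one final answer.",
    )
```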


6. Reinforcement Learning from AI Feedback (RLAIF) Advances

Papers: Multiple Feb 2026

Key Innovations:

  1. Constitutional AI 2.0: Anthropic's improved approach
  2. Self-RLAIF: Model generates its own feedback
  3. Multi-objective RLAIF: Balance safety, helpfulness, accuracy

Constitutional AI 2.0 Principles:
- Self-critique before response
- Principle-based reasoning
- Automatic principle generation
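
A rough sketch of the self-critique loop behind Constitutional-AI-style feedback generation: draft, critique against a principle, revise. It reuses the same hypothetical `call_llm` helper as in the MoA sketch above; the principle texts are illustrative, not Anthropic's actual constitution.

```python
# Rough sketch of a Constitutional-AI-style self-critique loop, assuming the
# same generic `call_llm(model_name, prompt) -> str` helper as above. The
# principle texts are illustrative, not Anthropic's actual constitution.
PRINCIPLES = [
    "Avoid harmful or unsafe instructions.",
    "Prefer accurate, verifiable statements over speculation.",
]

def critique_and_revise(call_llm, model: str, prompt: str) -> str:
    draft = call_llm(model, prompt)
    for principle in PRINCIPLES:
        critique = call_llm(
            model,
            f"Principle: {principle}\nResponse:\n{draft}\n"
            "Point out any violations of the principle.",
        )
        draft = call_llm(
            model,
            f"Response:\n{draft}\nCritique:\n{critique}\n"
            "Rewrite the response to address the critique.",
        )
    return draft
```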


Part 2: LLM Coding Benchmarks (February 2026)

SWE-bench Verified Results

What it measures: Real GitHub issues + PR patches

| Rank | Model | Score | Notes |
|------|-------|-------|-------|
| 1 | Claude Opus 4.5 | 80.9% | SOTA |
| 2 | GPT-5.2 | 80.0% | Close second |
| 3 | Claude Sonnet 4.5 | 72.3% | Best value |
| 4 | Gemini 2.5 Pro | 63.8% | - |
| 5 | DeepSeek-V3 | 58.2% | Open-weight leader |

Key Findings:
- Claude Opus 4.5 leads GPT-5.2 by 0.9 points
- Open-source models are catching up (DeepSeek at 58%)
- Test-time compute helps significantly (+5-10%)


LiveCodeBench Results (February 2026)

What it measures: Real-time coding challenges

| Model | Score | Notes |
|-------|-------|-------|
| GPT-5.2 | ~89% | Top performer |
| GLM-4.7 Thinking | 89% | Chinese model |
| Claude Opus 4.5 | 85-88% | - |
| Qwen3-Coder-32B | 78% | Open-weight |
| Kimi K2.5 | 76% | Open-weight |

Key Findings:
- "Thinking" models outperform standard models by 5-10%
- Test-time search (MCTS, beam search) adds 3-5%
- Open-source gap narrowing


Terminal-Bench Results

What it measures: Terminal/command-line tasks

| Model | Score | Notes |
|-------|-------|-------|
| Claude Opus 4.5 | 92% | Best for sysadmin tasks |
| GPT-5.2 | 89% | - |
| Qwen3-Coder | 75% | Best open-source |

Unique Challenges:
- Multi-step command sequences
- Error recovery
- Environment awareness
- Security considerations


HumanEval and MBPP (Classic Benchmarks)

Status: Mostly saturated (90%+ for top models)

| Model | HumanEval | MBPP |
|-------|-----------|------|
| GPT-5.2 | 96% | 94% |
| Claude Opus 4.5 | 95% | 93% |
| DeepSeek-V3 | 89% | 87% |
| Qwen3-Coder | 88% | 85% |

Key Finding: These benchmarks no longer differentiate top models. Focus has shifted to SWE-bench and LiveCodeBench.


Part 3: Open-Source Coding Models (February 2026)

Leaderboard

| Model | Params | License | Key Benchmark |
|-------|--------|---------|---------------|
| Kimi K2.5 | 1.2T | Custom | SWE-bench: 62% |
| Qwen3-Coder-32B | 32B | Apache 2.0 | HumanEval: 88% |
| DeepSeek-V3 | 650B | Custom | SWE-bench: 58% |
| CodeLlama-70B | 70B | Llama License | HumanEval: 78% |

Qwen3-Coder Family

| Model | Size | Price | Best For |
|-------|------|-------|----------|
| Qwen3-Coder-7B | 7B | Free | Edge deployment |
| Qwen3-Coder-14B | 14B | Free | Consumer hardware |
| Qwen3-Coder-32B | 32B | Free | Production use |

Notable: Apache 2.0 license allows commercial use without restrictions.


Part 4: Benchmark Analysis for Interview Prep

What to Know for Interviews

Key Numbers (February 2026):

| Metric | Value |
|--------|-------|
| SWE-bench SOTA | 80.9% (Claude Opus 4.5) |
| LiveCodeBench SOTA | ~89% (GPT-5.2, GLM-4.7) |
| Open-source SWE-bench | 62% (Kimi K2.5) |
| HumanEval saturation | 95%+ for top models |

Trends to Discuss:

  1. SWE-bench is the new standard: HumanEval/MBPP are saturated
  2. Test-time compute matters: +5-10% from thinking/reasoning
  3. Open-source catching up: 62% SWE-bench from open models
  4. MoA approaches: 3-5 agents can beat single frontier model

Architecture Patterns:

  1. STEM (Scaling Transformers with Embedding Modules): 75-85% compute reduction
  2. Test-Time Training: Handle 10x longer contexts
  3. Recursive LMs: Unlimited context via recursion
  4. MoA (Mixture of Agents): Multiple models collaborating

Part 5: Implications for Practice

Model Selection (February 2026)

| Use Case | Best Model | Alternative |
|----------|------------|-------------|
| Code generation (SOTA) | Claude Opus 4.5 | GPT-5.2 |
| Code generation (value) | Claude Sonnet 4.5 | Qwen3-Coder-32B |
| Long context | Llama 4 Scout (10M) | Claude Opus 4.5 (1M) |
| Open source | Qwen3-Coder-32B | DeepSeek-V3 |
| Complex reasoning | Claude Opus 4.5 | GPT-5.2 |

Cost-Performance Trade-offs

| Tier | Cost (per M tokens) | Quality (SWE-bench) | Use Case |
|------|---------------------|---------------------|----------|
| Frontier | $15-60 | 80-81% | Critical tasks |
| Mid-tier | $3-10 | 70-75% | Production code |
| Open-source | $0.5-2 | 58-62% | High volume |
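
These tiers are easier to compare on cost per resolved issue than on raw token price. The back-of-the-envelope sketch below uses prices and resolve rates echoing the tables above; the tokens-per-attempt figure and the geometric retry assumption are illustrative, not measured.

```python
# Back-of-the-envelope comparison: cost per *resolved* issue, not per token.
# Prices/resolve rates echo the tables above; tokens-per-attempt is an
# assumed illustrative figure, not a measured one.
TIERS = {
    # name: ($ per 1M tokens, SWE-bench-style resolve rate)
    "frontier":    (30.0, 0.80),
    "mid-tier":    (6.0,  0.72),
    "open-source": (1.0,  0.60),
}
TOKENS_PER_ATTEMPT = 150_000  # assumption: context + patches per issue attempt

for name, (price_per_m, resolve_rate) in TIERS.items():
    cost_per_attempt = price_per_m * TOKENS_PER_ATTEMPT / 1_000_000
    # Expected attempts until success ~ 1 / resolve_rate (geometric assumption).
    cost_per_resolved = cost_per_attempt / resolve_rate
    print(f"{name:12s} ${cost_per_resolved:6.2f} per resolved issue")
```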

Misconception: HumanEval is a reliable indicator of a model's coding ability

HumanEval is saturated: top models score 95-96%, and the differences are within noise. Real differentiation happens on SWE-bench (real GitHub issues, multi-file patches) and LiveCodeBench (fresh problems absent from training data). HumanEval measures single-function completion, not engineering ability.

Misconception: Mixture-of-Agents is always better than a single model

MoA gives +8-12% on complex tasks at 1.5-2x the cost, but on simple tasks the overhead does not pay off. The optimum is 3-5 agents (diminishing returns beyond that). The key factor is the quality of the aggregator model, not the number of agents: a weak aggregator makes the result worse.

Misconception: open-source models are an order of magnitude worse than closed ones

The gap has narrowed to 19 percentage points on SWE-bench (62% for Kimi K2.5 vs 80.9% for Claude Opus 4.5). On HumanEval it is under 7 points. MIT-licensed DeepSeek-V3 is competitive on MMLU/MATH. With self-hosting (500K+ queries/month), open-source comes out cheaper, with roughly 25% better ROI.


Interview Questions

Q: Why did SWE-bench become the primary coding benchmark instead of HumanEval?

❌ Red flag: "SWE-bench is just a newer benchmark with bigger tasks."

✅ Strong answer: "HumanEval is saturated (95%+ for top models) and only measures single-function completion. SWE-bench uses real GitHub issues with multi-file patches, which is much closer to real engineering work. The key difference: SWE-bench requires understanding project context, navigating the codebase, and reasoning about dependencies. On top of that, LiveCodeBench provides fresh problems that have not leaked into training data."

Q: What is test-time compute and why does it matter?

❌ Red flag: "It's when the model thinks longer and answers better."

✅ Strong answer: "Test-time compute means extra computation at inference time: thinking tokens (CoT), MCTS/beam search for reasoning, and test-time training (updating weights on the context). It gives +5-10% on SWE-bench, and search adds another +3-5%. Example: test-time training (T3) handles 10x longer contexts without architecture changes at 5-10% overhead. The trade-off is higher latency and cost, so selective compute is used: more compute for harder tasks."

Q: Which model would you choose for production code generation, and why?

❌ Red flag: "The newest one, Claude Opus, because it has the best score."

✅ Strong answer: "It depends on the use case. Critical tasks: Claude Opus 4.5 (80.9% SWE-bench) or GPT-5.2 (80.0%), at $15-60/M tokens. Production bulk: Claude Sonnet 4.5 (72.3%), the best cost/quality balance. High volume: Qwen3-Coder-32B (Apache 2.0, HumanEval 88%, self-hosted at $0.35/M). For long context: Llama 4 Scout (10M tokens). The key metric is cost per resolved issue, not benchmark score."


Sources

  1. LinkedIn — "LLM Papers Reading Notes - February Week 2, 2026"
  2. ToLearn Blog — "The LLM Coding Benchmark Showdown 2026"
  3. SWE-bench Official Leaderboard
  4. LiveCodeBench Official Results
  5. arXiv papers (February 2026 batch)

See Also