
Recent LLM Papers and Benchmarks

~6 minute read

Background reading: LLM Benchmarks Guide | Efficient Transformers

By February 2026, SWE-bench had replaced HumanEval as the primary coding benchmark (HumanEval is saturated at 95%+), with Claude Opus 4.5 scoring 80.9%; the gap between open-source and closed models has narrowed to 19 percentage points (62% for Kimi K2.5). In parallel, test-time compute has become a standard technique, adding 5-10% quality via a "thinking" mode. This article covers the key arXiv papers from February 2026 and the current benchmark leaderboards.

URL: LinkedIn LLM Papers, ToLearn Blog | Type: research / benchmarks / papers | Date: February 2026 | Collection: Ralph Research, PHASE 5


Part 1: Recent arXiv Papers (February 2026)

1. Learning to Discover at Test Time

Paper: arXiv:2601.16175

Key Innovation: Models that discover new patterns during inference, not just during training.

Core Concept:
- Traditional models: learn patterns during training, apply them during inference
- Test-Time Discovery: actively search for patterns during inference
- Enables handling of novel distributions without retraining

Implications:
- Better OOD (out-of-distribution) performance
- Adaptive to new domains
- Reduces need for continuous retraining
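
The paper's exact procedure is not reproduced here, but the general "adapt during inference" idea can be illustrated with a minimal test-time adaptation loop. The sketch below is an assumption-heavy example (entropy minimization over LayerNorm parameters in PyTorch, assuming `model(x)` returns classification logits), not the method of arXiv:2601.16175.

```python
# Illustrative sketch only: a generic test-time adaptation loop
# (entropy minimization over a test batch). This is NOT the method
# from arXiv:2601.16175; it just shows the "adapt during inference" idea.
import torch
import torch.nn.functional as F

def test_time_adapt(model, x, steps=3, lr=1e-4):
    """Run a few gradient steps on the unlabeled test batch x
    before producing the final prediction."""
    # Only adapt a small, safe subset of parameters (here: LayerNorm affine params).
    params = [p for n, p in model.named_parameters() if "norm" in n.lower()]
    optimizer = torch.optim.SGD(params, lr=lr)

    model.train()
    for _ in range(steps):
        logits = model(x)
        probs = F.softmax(logits, dim=-1)
        # Entropy of predictions: lower entropy means more confident, self-consistent outputs.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        return model(x)
```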


2. End-to-End Test-Time Training for Long Context

Paper: arXiv:2512.23675

Key Innovation: Updating Transformer weights at inference time to handle extremely long contexts.

Method:

```mermaid
graph LR
    subgraph Standard
        A1["Forward pass"] --> A2["Output"]
    end
    subgraph T3["Test-Time Training"]
        B1["Forward pass"] --> B2["Update weights"]
        B2 --> B3["Forward pass"] --> B4["Output"]
    end

    style A1 fill:#e8eaf6,stroke:#3f51b5
    style A2 fill:#e8f5e9,stroke:#4caf50
    style B1 fill:#e8eaf6,stroke:#3f51b5
    style B2 fill:#fff3e0,stroke:#ef6c00
    style B3 fill:#e8eaf6,stroke:#3f51b5
    style B4 fill:#e8f5e9,stroke:#4caf50
```

Results:
- Handles sequences 10x longer than those seen during training
- Maintains quality without architecture changes
- Minimal compute overhead (5-10%)

Trade-offs:
- Slightly slower inference
- Requires careful gradient computation
- Memory overhead for weight updates
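
As a rough illustration of the technique (not the paper's exact recipe), the sketch below takes a few gradient steps on a next-token loss over the long context itself, then answers the query with the temporarily adapted weights. It assumes a causal LM whose forward pass maps token ids to logits; the hyperparameters are placeholders.

```python
# Minimal sketch (assumption, not the exact recipe from arXiv:2512.23675):
# take a few gradient steps on a next-token-prediction loss over the long
# context itself, then answer the query with the temporarily updated weights.
import copy
import torch
import torch.nn.functional as F

def answer_with_ttt(model, context_ids, query_ids, steps=4, lr=5e-5):
    """context_ids, query_ids: LongTensor of shape (1, seq_len)."""
    adapted = copy.deepcopy(model)          # keep the original weights untouched
    optimizer = torch.optim.AdamW(adapted.parameters(), lr=lr)

    adapted.train()
    for _ in range(steps):
        # Self-supervised objective: predict each context token from its prefix.
        logits = adapted(context_ids[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            context_ids[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    adapted.eval()
    with torch.no_grad():
        # The adapted weights now "contain" the context; answer the query.
        return adapted(query_ids)
```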


3. Recursive Language Models

Paper: arXiv:2512.24601

Key Innovation: Handling arbitrarily long inputs via recursive model calls.

Architecture:

```mermaid
graph LR
    A["Input"] --> C1["Chunk 1"]
    C1 --> M1["Model Call"]
    M1 --> H1["Hidden State"]
    H1 --> C2["Chunk 2"]
    C2 --> M2["Model Call<br/>+ prev hidden state"]
    M2 --> H2["Hidden State"]
    H2 --> D["... --> Output"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style C1 fill:#fff3e0,stroke:#ef6c00
    style C2 fill:#fff3e0,stroke:#ef6c00
    style M1 fill:#f3e5f5,stroke:#9c27b0
    style M2 fill:#f3e5f5,stroke:#9c27b0
    style H1 fill:#e8f5e9,stroke:#4caf50
    style H2 fill:#e8f5e9,stroke:#4caf50
    style D fill:#e8eaf6,stroke:#3f51b5
```

Advantages:
- No theoretical context limit
- Linear scaling with input length
- Composable with any base model

Applications:
- Document processing (1000+ pages)
- Code analysis (entire codebases)
- Multi-turn conversations with full history
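
A hedged sketch of the recursive pattern: instead of passing a hidden-state tensor, this version carries a running text summary between calls, which approximates the idea with any chat-style model. The `llm(prompt) -> str` callable is a hypothetical placeholder, not a specific API.

```python
# Illustrative sketch, assuming a generic `llm(prompt) -> str` callable
# (hypothetical placeholder, not a specific API). The recursion from
# arXiv:2512.24601 is approximated here by carrying a running summary
# ("state") from one chunk to the next instead of a hidden-state tensor.
def recursive_answer(llm, document: str, question: str, chunk_size: int = 8000) -> str:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    state = ""  # carried context across model calls
    for chunk in chunks:
        state = llm(
            f"Previous notes:\n{state}\n\n"
            f"New chunk:\n{chunk}\n\n"
            f"Update the notes with anything relevant to: {question}"
        )
    # Final call: answer the question from the accumulated notes.
    return llm(f"Notes:\n{state}\n\nAnswer the question: {question}")
```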


4. STEM: Scaling Transformers with Embedding Modules

Paper: arXiv:2601.10639

Key Innovation: Fine-grained sparse transformer with learnable embedding modules.

Architecture:

```mermaid
graph LR
    subgraph Traditional["Traditional Transformer"]
        T1["All tokens"] --> T2["All attention heads"] --> T3["Dense computation"]
    end
    subgraph STEM_arch["STEM"]
        S1["Tokens"] --> S2["Router"] --> S3["Selected Embedding<br/>Modules"] --> S4["Sparse computation"]
    end

    style T1 fill:#e8eaf6,stroke:#3f51b5
    style T2 fill:#e8eaf6,stroke:#3f51b5
    style T3 fill:#fce4ec,stroke:#c62828
    style S1 fill:#e8eaf6,stroke:#3f51b5
    style S2 fill:#fff3e0,stroke:#ef6c00
    style S3 fill:#f3e5f5,stroke:#9c27b0
    style S4 fill:#e8f5e9,stroke:#4caf50
```

Key Metrics:

| Metric | Traditional | STEM |
|--------|-------------|------|
| Compute for 1M tokens | 100% | 15-25% |
| Quality (MMLU) | Baseline | -2% to +1% |
| Memory | 100% | 40-60% |

Embedding Module Types:

  1. Domain-specific: Code, Math, Reasoning, Creative
  2. Task-specific: Classification, Generation, Summarization
  3. Language-specific: Multilingual routing
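
To make the router-plus-modules idea concrete, here is a minimal MoE-style layer in PyTorch: a linear router picks top-k "embedding modules" per token, so only the selected modules run. This is an illustrative sketch of fine-grained sparse routing, not the actual STEM architecture from arXiv:2601.10639; the module count, top-k value, and MLP shape are assumptions.

```python
# Minimal sketch of token-level routing to sparse "embedding modules"
# (an MoE-style illustration of the idea; module count, top-k and MLP shape
# are assumptions, not the exact STEM design from arXiv:2601.10639).
import torch
import torch.nn as nn

class EmbeddingModuleLayer(nn.Module):
    def __init__(self, d_model=512, n_modules=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_modules)   # scores each module per token
        self.modules_ = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modules)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        scores = self.router(x)                        # (batch, seq, n_modules)
        weights, idx = scores.softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected modules run for each token -> sparse computation.
        for k in range(self.top_k):
            for m, module in enumerate(self.modules_):
                mask = idx[..., k] == m                # tokens routed to module m
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * module(x[mask])
        return out
```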


5. Mixture-of-Agents (MoA) Papers

Papers: Multiple (February 2026)

Core Concept: Multiple LLM agents collaborating vs single model.

Architectures:

  1. Sequential MoA: Agent 1 → Agent 2 → Agent 3 → Output
  2. Parallel MoA: Agent 1, Agent 2, Agent 3 → Aggregator → Output
  3. Hierarchical MoA: Manager → Workers → Aggregator → Output

Findings:
- 3-5 agents optimal (diminishing returns after that)
- Different models for different roles works best
- Aggregator model quality critical

Production Results (from papers):

| Setup | Cost | Quality Gain |
|-------|------|--------------|
| 3x GPT-5.2-mini | 1.5x | +8% on complex tasks |
| 1x Claude + 1x GPT | 2x | +12% on reasoning |
| 5x Llama-70B | 0.5x | +5% on code |
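
A parallel MoA step can be sketched in a few lines: several proposer models answer independently, and an aggregator model merges their drafts. The `call_llm(model_name, prompt) -> str` helper and the model names are hypothetical placeholders, not a specific vendor SDK.

```python
# Sketch of a parallel Mixture-of-Agents step, assuming a generic
# `call_llm(model_name, prompt) -> str` helper (hypothetical placeholder,
# not a specific vendor SDK).
def mixture_of_agents(call_llm, task: str,
                      proposers=("agent-a", "agent-b", "agent-c"),
                      aggregator="aggregator-model") -> str:
    # 1. Each proposer answers independently (parallel MoA).
    drafts = [call_llm(m, task) for m in proposers]

    # 2. The aggregator sees all drafts and writes the final answer;
    #    per the papers above, its quality is the critical factor.
    numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return call_llm(
        aggregator,
        f"Task:\n{task}\n\n{numbered}\n\n"
        "Combine the strongest parts of the drafts into one final answer.",
    )
```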


6. Reinforcement Learning from AI Feedback (RLAIF) Advances

Papers: Multiple Feb 2026

Key Innovations:

  1. Constitutional AI 2.0: Anthropic's improved approach
  2. Self-RLAIF: Model generates its own feedback
  3. Multi-objective RLAIF: Balance safety, helpfulness, accuracy

Constitutional AI 2.0 Principles:
- Self-critique before response
- Principle-based reasoning
- Automatic principle generation
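
A rough sketch of the self-critique loop behind Constitutional-AI-style feedback generation: draft, critique against a principle, revise. It reuses the same hypothetical `call_llm` helper as in the MoA sketch above; the principle texts are illustrative, not Anthropic's actual constitution.

```python
# Rough sketch of a Constitutional-AI-style self-critique loop, assuming the
# same generic `call_llm(model_name, prompt) -> str` helper as above. The
# principle texts are illustrative, not Anthropic's actual constitution.
PRINCIPLES = [
    "Avoid harmful or unsafe instructions.",
    "Prefer accurate, verifiable statements over speculation.",
]

def critique_and_revise(call_llm, model: str, prompt: str) -> str:
    draft = call_llm(model, prompt)
    for principle in PRINCIPLES:
        critique = call_llm(
            model,
            f"Principle: {principle}\nResponse:\n{draft}\n"
            "Point out any violations of the principle.",
        )
        draft = call_llm(
            model,
            f"Response:\n{draft}\nCritique:\n{critique}\n"
            "Rewrite the response to address the critique.",
        )
    return draft
```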


Part 2: LLM Coding Benchmarks (February 2026)

SWE-bench Verified Results

What it measures: Real GitHub issues + PR patches

| Rank | Model | Score | Notes |
|------|-------|-------|-------|
| 1 | Claude Opus 4.5 | 80.9% | SOTA |
| 2 | GPT-5.2 | 80.0% | Close second |
| 3 | Claude Sonnet 4.5 | 72.3% | Best value |
| 4 | Gemini 2.5 Pro | 63.8% | - |
| 5 | DeepSeek-V3 | 58.2% | Open-weight leader |

Key Findings:
- Claude Opus 4.5 leads GPT-5.2 by 0.9 points
- Open-source models are catching up (DeepSeek at 58%)
- Test-time compute helps significantly (+5-10%)


LiveCodeBench Results (February 2026)

What it measures: Real-time coding challenges

| Model | Score | Notes |
|-------|-------|-------|
| GPT-5.2 | ~89% | Top performer |
| GLM-4.7 Thinking | 89% | Chinese model |
| Claude Opus 4.5 | 85-88% | - |
| Qwen3-Coder-32B | 78% | Open-weight |
| Kimi K2.5 | 76% | Open-weight |

Key Findings:
- "Thinking" models outperform standard models by 5-10%
- Test-time search (MCTS, beam search) adds 3-5%
- Open-source gap narrowing


Terminal-Bench Results

What it measures: Terminal/command-line tasks

| Model | Score | Notes |
|-------|-------|-------|
| Claude Opus 4.5 | 92% | Best for sysadmin tasks |
| GPT-5.2 | 89% | - |
| Qwen3-Coder | 75% | Best open-source |

Unique Challenges:
- Multi-step command sequences
- Error recovery
- Environment awareness
- Security considerations


HumanEval and MBPP (Classic Benchmarks)

Status: Mostly saturated (90%+ for top models)

| Model | HumanEval | MBPP |
|-------|-----------|------|
| GPT-5.2 | 96% | 94% |
| Claude Opus 4.5 | 95% | 93% |
| DeepSeek-V3 | 89% | 87% |
| Qwen3-Coder | 88% | 85% |

Key Finding: These benchmarks no longer differentiate top models. Focus has shifted to SWE-bench and LiveCodeBench.


Part 3: Open-Source Coding Models (February 2026)

Leaderboard

| Model | Params | License | Key Benchmark |
|-------|--------|---------|---------------|
| Kimi K2.5 | 1.2T | Custom | SWE-bench: 62% |
| Qwen3-Coder-32B | 32B | Apache 2.0 | HumanEval: 88% |
| DeepSeek-V3 | 650B | Custom | SWE-bench: 58% |
| CodeLlama-70B | 70B | Llama License | HumanEval: 78% |

Qwen3-Coder Family

| Model | Size | Price | Best For |
|-------|------|-------|----------|
| Qwen3-Coder-7B | 7B | Free | Edge deployment |
| Qwen3-Coder-14B | 14B | Free | Consumer hardware |
| Qwen3-Coder-32B | 32B | Free | Production use |

Notable: Apache 2.0 license allows commercial use without restrictions.


Part 4: Benchmark Analysis for Interview Prep

What to Know for Interviews

Key Numbers (February 2026):

| Metric | Value |
|--------|-------|
| SWE-bench SOTA | 80.9% (Claude Opus 4.5) |
| LiveCodeBench SOTA | ~89% (GPT-5.2, GLM-4.7) |
| Open-source SWE-bench | 62% (Kimi K2.5) |
| HumanEval saturation | 95%+ for top models |

Trends to Discuss:

  1. SWE-bench is the new standard: HumanEval/MBPP are saturated
  2. Test-time compute matters: +5-10% from thinking/reasoning
  3. Open-source catching up: 62% SWE-bench from open models
  4. MoA approaches: 3-5 agents can beat single frontier model

Architecture Patterns:

  1. STEM (Scaling Transformers with Embedding Modules): 75-85% compute reduction
  2. Test-Time Training: Handle 10x longer contexts
  3. Recursive LMs: Unlimited context via recursion
  4. MoA (Mixture of Agents): Multiple models collaborating

Part 5: Implications for Practice

Model Selection (February 2026)

| Use Case | Best Model | Alternative |
|----------|------------|-------------|
| Code generation (SOTA) | Claude Opus 4.5 | GPT-5.2 |
| Code generation (value) | Claude Sonnet 4.5 | Qwen3-Coder-32B |
| Long context | Llama 4 Scout (10M) | Claude Opus 4.5 (1M) |
| Open source | Qwen3-Coder-32B | DeepSeek-V3 |
| Complex reasoning | Claude Opus 4.5 | GPT-5.2 |

Cost-Performance Trade-offs

| Tier | Cost (per M tokens) | Quality (SWE-bench) | Use Case |
|------|---------------------|---------------------|----------|
| Frontier | $15-60 | 80-81% | Critical tasks |
| Mid-tier | $3-10 | 70-75% | Production code |
| Open-source | $0.5-2 | 58-62% | High volume |
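
These tiers are easier to compare on cost per resolved issue than on raw token price. The back-of-the-envelope sketch below uses prices and resolve rates echoing the tables above; the tokens-per-attempt figure and the geometric retry assumption are illustrative, not measured.

```python
# Back-of-the-envelope comparison: cost per *resolved* issue, not per token.
# Prices/resolve rates echo the tables above; tokens-per-attempt is an
# assumed illustrative figure, not a measured one.
TIERS = {
    # name: ($ per 1M tokens, SWE-bench-style resolve rate)
    "frontier":    (30.0, 0.80),
    "mid-tier":    (6.0,  0.72),
    "open-source": (1.0,  0.60),
}
TOKENS_PER_ATTEMPT = 150_000  # assumption: context + patches per issue attempt

for name, (price_per_m, resolve_rate) in TIERS.items():
    cost_per_attempt = price_per_m * TOKENS_PER_ATTEMPT / 1_000_000
    # Expected attempts until success ~ 1 / resolve_rate (geometric assumption).
    cost_per_resolved = cost_per_attempt / resolve_rate
    print(f"{name:12s} ${cost_per_resolved:6.2f} per resolved issue")
```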

Misconception: HumanEval is a reliable indicator of a model's coding ability

HumanEval is saturated: top models score 95-96%, and the differences are within noise. Real differentiation happens on SWE-bench (real GitHub issues, multi-file patches) and LiveCodeBench (fresh problems absent from training data). HumanEval measures single-function completion, not engineering ability.

Misconception: Mixture-of-Agents is always better than a single model

MoA gives +8-12% on complex tasks at 1.5-2x the cost, but on simple tasks the overhead does not pay off. The optimum is 3-5 agents (diminishing returns beyond that). The key factor is the quality of the aggregator model, not the number of agents: a weak aggregator makes the result worse.

Misconception: open-source models are an order of magnitude worse than closed ones

The gap has narrowed to 19 percentage points on SWE-bench (62% for Kimi K2.5 vs 80.9% for Claude Opus 4.5). On HumanEval it is under 7 points. MIT-licensed DeepSeek-V3 is competitive on MMLU/MATH. With self-hosting (500K+ queries/month), open-source comes out cheaper, with roughly 25% better ROI.


Interview Questions

Q: Why did SWE-bench become the primary coding benchmark instead of HumanEval?

❌ Red flag: "SWE-bench is just a newer benchmark with bigger tasks."

✅ Strong answer: "HumanEval is saturated (95%+ for top models) and only measures single-function completion. SWE-bench uses real GitHub issues with multi-file patches, which is much closer to real engineering work. The key difference: SWE-bench requires understanding project context, navigating the codebase, and reasoning about dependencies. On top of that, LiveCodeBench provides fresh problems that have not leaked into training data."

Q: What is test-time compute and why does it matter?

❌ Red flag: "It's when the model thinks longer and answers better."

✅ Strong answer: "Test-time compute means extra computation at inference time: thinking tokens (CoT), MCTS/beam search for reasoning, and test-time training (updating weights on the context). It gives +5-10% on SWE-bench, and search adds another +3-5%. Example: test-time training (T3) handles 10x longer contexts without architecture changes at 5-10% overhead. The trade-off is higher latency and cost, so selective compute is used: more compute for harder tasks."

Q: Which model would you choose for production code generation, and why?

❌ Red flag: "The newest one, Claude Opus, because it has the best score."

✅ Strong answer: "It depends on the use case. Critical tasks: Claude Opus 4.5 (80.9% SWE-bench) or GPT-5.2 (80.0%), at $15-60/M tokens. Production bulk: Claude Sonnet 4.5 (72.3%), the best cost/quality balance. High volume: Qwen3-Coder-32B (Apache 2.0, HumanEval 88%, self-hosted at $0.35/M). For long context: Llama 4 Scout (10M tokens). The key metric is cost per resolved issue, not benchmark score."


Sources

  1. LinkedIn — "LLM Papers Reading Notes - February Week 2, 2026"
  2. ToLearn Blog — "The LLM Coding Benchmark Showdown 2026"
  3. SWE-bench Official Leaderboard
  4. LiveCodeBench Official Results
  5. arXiv papers (February 2026 batch)

See Also