LLM Reasoning & Chain-of-Thought 2025-2026¶
~6 min read
Research: Advances and limitations in LLM reasoning capabilities · Date: 2026-02-11 · Priority: P0
Prerequisites: Reasoning Techniques, Prompt Engineering
Overview¶
Chain-of-Thought (CoT) prompting has become fundamental to LLM reasoning. Recent research reveals both advances and critical limitations.
Key Findings¶
1. Long CoT vs Short CoT (Mar 2025)¶
Paper: "Towards Reasoning Era: A Survey of Long Chain-of-Thought"
Key Distinctions:
| Aspect | Short CoT | Long CoT |
|---|---|---|
| Depth | Shallow reasoning | Deep reasoning |
| Exploration | Single path | Extensive exploration |
| Reflection | None | Feasible reflection |
| Examples | Standard CoT | OpenAI O1, DeepSeek-R1 |
Three Characteristics of Long CoT:

1. Deep Reasoning — multi-step logical deduction
2. Extensive Exploration — multiple solution paths
3. Feasible Reflection — self-correction capabilities
Phenomena:

- Overthinking — excessive computation on simple tasks
- Inference-time Scaling — performance improves with more compute
2. CoT is a Mirage (Aug 2025)¶
Paper: "Is Chain-of-Thought Reasoning of LLMs a Mirage?"
Hypothesis: CoT reasoning reflects structured inductive bias from training data, not genuine reasoning.
Key Insight: $$ P(\text{CoT success}) \propto \text{Similarity}(\text{test}, \text{train distribution}) $$
Findings:

- CoT is brittle when pushed beyond training distributions
- Fails systematically on out-of-distribution tasks
- Success depends on data-distribution alignment
Implication: CoT may not represent true generalizable reasoning.
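The proportionality above suggests a cheap deployment heuristic: flag test prompts that sit far from the training distribution before trusting their CoT. A minimal sketch, using bag-of-words cosine similarity as a stand-in for a real embedding model; the threshold `tau` and the similarity proxy are illustrative assumptions, not from the paper:

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def ood_flag(test_prompt: str, train_samples: list[str], tau: float = 0.3) -> bool:
    """Flag a prompt whose best similarity to any training sample is below tau."""
    best = max(bow_cosine(test_prompt, s) for s in train_samples)
    return best < tau

train = ["add the two numbers and report the sum",
         "multiply the numbers then take the remainder"]
print(ood_flag("add these numbers and give the sum", train))            # False: in-distribution
print(ood_flag("prove the map is a contraction on a Banach space", train))  # True: flag it
```

In production the bag-of-words proxy would be replaced by sentence embeddings, but the monitoring logic is the same.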
3. Unfaithful CoT in the Wild (Mar 2025)¶
Paper: "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful"
Discovery: Models produce coherent arguments that don't reflect actual reasoning.
Implicit Post-Hoc Rationalization:

- Models justify contradictory answers with "coherent" explanations
- "Is X > Y?" and "Is Y > X?" can both get "Yes" with different justifications
Unfaithfulness Rates:

| Model | Rate |
|---|---|
| GPT-4o-mini | 13% |
| Haiku 3.5 | 7% |
| Gemini 2.5 Flash | 2.17% |
| ChatGPT-4o | 0.49% |
| DeepSeek R1 | 0.37% |
| Sonnet 3.7 (thinking) | 0.04% |
Unfaithful Illogical Shortcuts: models use subtly illogical reasoning to make speculative answers seem rigorous.
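The contradictory-answer finding translates into a cheap faithfulness probe: ask the model a comparison and its reverse, and flag the pair if both come back "Yes". A sketch with a stub in place of a real LLM call; the `ask` interface is a hypothetical, not the paper's harness:

```python
def contradiction_probe(ask, x: str, y: str) -> bool:
    """Return True if the model answers Yes to both 'Is X > Y?' and 'Is Y > X?'.

    `ask` is any callable prompt -> answer string; here it stands in
    for an LLM call.
    """
    a = ask(f"Is {x} greater than {y}? Answer Yes or No.")
    b = ask(f"Is {y} greater than {x}? Answer Yes or No.")
    return a.strip().lower().startswith("yes") and \
           b.strip().lower().startswith("yes")

# Toy stand-in models: one agrees with any framing, one is consistent.
sycophant = lambda prompt: "Yes"
consistent = lambda prompt: "Yes" if "7 greater than 3" in prompt else "No"

print(contradiction_probe(sycophant, "7", "3"))   # True: unfaithful pattern
print(contradiction_probe(consistent, "7", "3"))  # False
```

Run over a batch of comparison pairs, the flagged fraction gives a rough analogue of the unfaithfulness rates in the table above.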
4. Markov Chain of Thought (Oct 2024)¶
Paper: "Markov Chain of Thought for Efficient Mathematical Reasoning"
Concept: "Derive, then reduce"
MCoT Formula: $$ S_{t+1} = f(S_t, A_t) \rightarrow Q_{\text{simplified}} $$
Where:

- \(S_t\) = current state (text + code)
- \(A_t\) = action (code execution)
- \(Q_{\text{simplified}}\) = compressed question for the next step
Benefits:

- Avoids lengthy KV cache
- Enables longer reasoning paths
- Self-correction via code interpreter
- Compresses previous steps into simplified questions
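The "derive, then reduce" loop can be sketched as a Markov update in which only a compressed state survives each step, so the context never grows with chain length. The toy task and state fields below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCoTState:
    question: str   # compressed question Q_simplified carried forward
    value: int      # accumulated partial result (stand-in for code output)

def mcot_run(step: Callable[[MCoTState], MCoTState],
             done: Callable[[MCoTState], bool],
             state: MCoTState, max_steps: int = 50) -> MCoTState:
    """S_{t+1} = f(S_t): only the compressed state is carried between
    steps, so the prompt (and KV cache) stays constant-size."""
    for _ in range(max_steps):
        if done(state):
            break
        state = step(state)
    return state

# Toy "derive, then reduce": compute sum 1..n by folding one term into
# the state and re-posing the rest as a strictly smaller question.
def step(s: MCoTState) -> MCoTState:
    n = int(s.question.split("..")[-1])
    return MCoTState(question=f"sum 1..{n - 1}", value=s.value + n)

final = mcot_run(step, lambda s: s.question.endswith("..0"),
                 MCoTState("sum 1..5", 0))
print(final.value)  # 15
```

In the real MCoT setting `step` would call a model plus a code interpreter; the structural point is that `MCoTState` is the *entire* context passed to the next step.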
5. Entropy-Guided Adaptive CoT (Jan 2026)¶
Paper: "LLMs for Game Theory: Entropy-Guided In-Context Learning"
Framework: Adaptive reasoning based on token-level uncertainty
Algorithm:
if entropy(tokens) < threshold:
→ Concise reasoning, minimal context
else:
→ Expanded multi-path CoT exploration
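The branch above, made runnable; the threshold value is an illustrative assumption:

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy H(X) = -sum p(x) log p(x) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_mode(probs: list[float], threshold: float = 1.0) -> str:
    """Low entropy: the model is confident, keep the CoT concise.
    High entropy: uncertain, expand multi-path exploration."""
    return "concise" if token_entropy(probs) < threshold else "multi-path"

print(choose_mode([0.9, 0.05, 0.05]))         # confident  -> "concise"
print(choose_mode([0.25, 0.25, 0.25, 0.25]))  # uniform    -> "multi-path"
```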
Results (Tic-Tac-Toe vs an algorithmic opponent):

- Baseline LLM: -11.6% average outcome
- Entropy-guided: +9.5% average outcome
- Correlation: negative between entropy and move optimality
Key Insight: Uncertainty-guided adaptation improves sequential decision-making.
6. Reasoning Model as Discriminator (May 2025)¶
Paper: "When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs"
Finding: Reasoning models excel as discriminators, not generators.
Results:

- DeepSeek-R1-1.5B vs CodeLlama-7B: 87% higher F1, 3.7% better accuracy
- DeepSeek-R1-1.5B vs CodeLlama-13B: 3.7% higher execution accuracy
Soft Score Extraction from CoT: $$ \text{Score}(c) = \frac{\text{CoT endorsement tokens for } c}{\text{Total endorsement tokens}} $$
Implication: Use reasoning models to evaluate, not generate.
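The soft-score idea can be sketched by counting candidate mentions inside endorsement sentences of the discriminator's CoT. The endorsement lexicon and sentence-level matching are assumptions for illustration, not the paper's exact token-level procedure:

```python
import re
from collections import Counter

def soft_scores(cot: str, candidates: list[str]) -> dict[str, float]:
    """Score(c) = endorsement tokens for c / total endorsement tokens."""
    endorse = {"correct", "correctly", "right", "valid", "better", "prefer"}
    counts: Counter = Counter()
    for sent in re.split(r"[.!?]", cot.lower()):
        words = sent.split()
        if endorse & set(words):                       # an endorsement sentence
            for c in candidates:
                counts[c] += sum(w == c.lower() for w in words)
    total = sum(counts.values())
    return {c: counts[c] / total if total else 0.0 for c in candidates}

cot = ("Candidate A handles the join correctly. "
       "Candidate B misses the filter. A is better than B.")
print(soft_scores(cot, ["A", "B"]))  # A outscores B
```

The resulting scores are soft preferences over candidates, which is exactly the discriminator role the paper advocates.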
7. Interactive Learning for LLM Reasoning (Sep 2025)¶
Paper: "Interactive Learning for LLM Reasoning"
Framework (ILR):

1. Dynamic Interaction — cooperative vs competitive based on difficulty
2. Perception Calibration — GRPO with cross-model reward integration
Idea3 Interaction Paradigm:

- Idea Sharing — exchange approaches
- Idea Analysis — critique each other
- Idea Fusion — combine the best elements
Results: +5% over strongest baseline on math/coding benchmarks
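The three Idea3 phases can be sketched as one interaction round. The agent interface (a callable from prompt to text) is an assumption for illustration, not the paper's training setup:

```python
from typing import Callable

Agent = Callable[[str], str]

def idea3_round(agents: list[Agent], problem: str) -> str:
    """One Idea3 round: Idea Sharing -> Idea Analysis -> Idea Fusion."""
    # 1. Idea Sharing: every agent drafts an approach independently.
    ideas = [a(f"Propose an approach: {problem}") for a in agents]
    pool = "\n".join(ideas)
    # 2. Idea Analysis: each agent critiques the pooled ideas.
    critiques = [a(f"Critique these approaches:\n{pool}") for a in agents]
    # 3. Idea Fusion: one agent combines the best elements of all.
    return agents[0]("Fuse the best elements:\n" + pool + "\n" + "\n".join(critiques))

# Toy deterministic agents that tag and echo the first line of the prompt:
echo = lambda tag: (lambda prompt: f"[{tag}] {prompt.splitlines()[0]}")
print(idea3_round([echo("math"), echo("code")], "minimize f(x)"))
```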
Comparison Table¶
| Approach | Key Innovation | When to Use |
|---|---|---|
| Long CoT | Deep + reflection | Complex problems |
| MCoT | State compression | Long reasoning chains |
| Entropy CoT | Uncertainty adaptation | Sequential decisions |
| Reasoning Discriminator | Evaluation focus | Generator-discriminator setup |
| Interactive Learning | Multi-agent | Collaborative problem solving |
Key Formulas¶
CoT Faithfulness Measure¶
No closed-form metric is given in the cited papers; unfaithfulness is measured empirically, e.g. as the fraction of reversed question pairs ("Is X > Y?" / "Is Y > X?") whose CoT justifies contradictory conclusions.
Entropy-Based Adaptation¶
$$ H(X) = -\sum_{x \in X} p(x) \log p(x) $$ If \(H(X) > \tau\): expand reasoning paths
MCoT State Transition¶
$$ S_{t+1} = f(S_t, A_t) \rightarrow Q_{\text{simplified}} $$
Soft Score from CoT¶
$$ \text{Score}(c) = \frac{\text{CoT endorsement tokens for } c}{\text{Total endorsement tokens}} $$
Interview Questions¶
Q1: CoT Limitations¶
"Explain why Chain-of-Thought reasoning might not reflect genuine reasoning capabilities."
Answer:

- CoT reflects inductive bias from training data
- Fails on out-of-distribution tasks
- Can produce post-hoc rationalizations
- Models may justify contradictory answers
Q2: Long vs Short CoT¶
"When should you use Long CoT over Short CoT?"
Answer:

- Long CoT: complex multi-step problems that need reflection/self-correction
- Short CoT: simple tasks, limited compute budget
- Trade-off: compute vs accuracy
Q3: Discriminator vs Generator¶
"Why might a smaller reasoning model outperform a larger model as a discriminator?"
Answer:

- Reasoning models are trained for evaluation
- Better at analyzing logical consistency
- CoT provides explainable scoring
- Size doesn't correlate with discrimination ability
Q4: System Design¶
"Design a system that uses LLM reasoning for code review."
Architecture:
Code Diff → Reasoning Discriminator → Issue Detection
↓
Entropy-Guided CoT
↓
Interactive Multi-Agent (if uncertain)
↓
Final Review Report
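The pipeline above as a sketch with every stage stubbed out; all callables and the escalation threshold are hypothetical interfaces, not a real code-review product:

```python
import math

def review(diff, discriminate, entropy_of, multi_agent, tau: float = 0.5) -> str:
    """Sketch of the architecture above.

    discriminate : diff -> (issues, confidence distribution over verdicts)
    entropy_of   : distribution -> float (entropy-guided escalation)
    multi_agent  : issues -> refined issues (only when the model is uncertain)
    """
    issues, dist = discriminate(diff)        # reasoning model as discriminator
    if entropy_of(dist) > tau:               # high uncertainty: escalate
        issues = multi_agent(issues)         # interactive multi-agent pass
    return "Review:\n" + "\n".join(f"- {i}" for i in issues)

# Toy stubs standing in for LLM calls:
disc = lambda d: (["possible off-by-one in loop bound"], [0.5, 0.5])
entropy = lambda p: -sum(x * math.log(x) for x in p if x > 0)
agents = lambda issues: issues + ["confirmed by second reviewer agent"]
print(review("@@ -1,3 +1,3 @@ ...", disc, entropy, agents))
```

The design point: the expensive multi-agent stage runs only when the discriminator's own uncertainty is high, mirroring the entropy-guided adaptation from section 5.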
CoT can be post-hoc rationalization rather than real reasoning
Models generate convincing reasoning chains that do NOT reflect their internal process. GPT-4o-mini: 13% unfaithful CoT. A model can justify opposite answers ("X > Y" and "Y > X") equally convincingly. Takeaway: CoT is structured output, not a debugger of the model's thinking. Verify the result independently of the "explanation".
More thinking tokens != a better answer
Overthinking is a real problem with Long CoT. On simple tasks the model burns excess compute and sometimes makes the result worse (self-correcting a right answer into a wrong one). The entropy-guided approach: if uncertainty is low, use a short CoT; if high, an expanded one. Don't set max_tokens=32000 "just in case".
Practical Implications¶
For Deployment¶
- Don't trust CoT explanations blindly — verify independently
- Use reasoning models as evaluators — not generators
- Monitor entropy — adapt compute based on uncertainty
- Compress reasoning chains — MCoT for efficiency
For Development¶
- Test on out-of-distribution data
- Measure faithfulness explicitly
- Consider generator-discriminator architecture
- Implement adaptive reasoning based on task complexity
Remaining Challenges¶
- Generalization — CoT beyond training distribution
- Faithfulness — Ensuring explanations reflect actual reasoning
- Efficiency — Long CoT is expensive
- Evaluation — How to measure true reasoning capability
Sources¶
- arXiv:2503.09567 — Long CoT Survey (Mar 2025)
- arXiv:2508.01191 — CoT is a Mirage (Aug 2025)
- arXiv:2503.08679 — Unfaithful CoT (Mar 2025)
- arXiv:2410.17635 — Markov CoT (Oct 2024)
- arXiv:2601.10775 — Entropy-Guided CoT (Jan 2026)
- arXiv:2505.03786 — Reasoning as Discriminator (May 2025)
- arXiv:2509.26306 — Interactive Learning (Sep 2025)
- arXiv:2410.03595 — Hopfieldian View of CoT (Oct 2024)
See Also¶
- Prompt Engineering -- CoT as a prompting technique, few-shot vs zero-shot CoT
- Reasoning Techniques -- Tree-of-Thought, self-consistency, and other reasoning methods
- Scaling Reasoning -- test-time compute scaling, inference-time reasoning
- Alignment Methods -- GRPO (DeepSeek R1) produced emergent reasoning via RL
- LLM Benchmarks -- GSM8K, MATH, HumanEval for evaluating reasoning capabilities