
LLM Reasoning & Chain-of-Thought 2025-2026

~6 min read

Research: Advances and limitations in LLM reasoning capabilities
Date: 2026-02-11
Priority: P0


Prerequisites: Reasoning Techniques, Prompt Engineering

Overview

Chain-of-Thought (CoT) prompting has become fundamental to LLM reasoning. Recent research reveals both advances and critical limitations.


Key Findings

1. Long CoT vs Short CoT (Mar 2025)

Paper: "Towards Reasoning Era: A Survey of Long Chain-of-Thought"

Key Distinctions:

| Aspect      | Short CoT         | Long CoT               |
|-------------|-------------------|------------------------|
| Depth       | Shallow reasoning | Deep reasoning         |
| Exploration | Single path       | Extensive exploration  |
| Reflection  | None              | Feasible reflection    |
| Examples    | Standard CoT      | OpenAI o1, DeepSeek-R1 |

Three Characteristics of Long CoT:

1. Deep Reasoning -- multi-step logical deduction
2. Extensive Exploration -- multiple solution paths
3. Feasible Reflection -- self-correction capabilities

Phenomena:

- Overthinking -- excessive computation on simple tasks
- Inference-time Scaling -- performance improves with more compute


2. CoT is a Mirage (Aug 2025)

Paper: "Is Chain-of-Thought Reasoning of LLMs a Mirage?"

Hypothesis: CoT reasoning reflects structured inductive bias from training data, not genuine reasoning.

Key Insight: $$ P(\text{CoT success}) \propto \text{Similarity}(\text{test}, \text{train distribution}) $$

Findings:

- CoT is brittle when pushed beyond training distributions
- Fails systematically on out-of-distribution tasks
- Success depends on data distribution alignment

Implication: CoT may not represent true generalizable reasoning.


3. Unfaithful CoT in the Wild (Mar 2025)

Paper: "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful"

Discovery: Models produce coherent arguments that don't reflect actual reasoning.

Implicit Post-Hoc Rationalization:

- Models justify contradictory answers with "coherent" explanations
- "Is X > Y?" and "Is Y > X?" can both get "Yes" with different justifications

Unfaithfulness Rates:

| Model | Rate |
|-------|------|
| GPT-4o-mini | 13% |
| Haiku 3.5 | 7% |
| Gemini 2.5 Flash | 2.17% |
| ChatGPT-4o | 0.49% |
| DeepSeek R1 | 0.37% |
| Sonnet 3.7 (thinking) | 0.04% |

Unfaithful Illogical Shortcuts: models use subtly illogical reasoning to make speculative answers seem rigorous.
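The "Is X > Y?" / "Is Y > X?" contradiction can be probed mechanically. A minimal sketch, assuming a hypothetical `ask(prompt) -> str` model call (the prompt wording and the simple Yes/No parsing are illustrative, not from the paper):

```python
def contradiction_probe(ask, x, y):
    """Flag implicit post-hoc rationalization: a model that answers
    'Yes' to both 'Is X > Y?' and 'Is Y > X?' is justifying
    contradictory conclusions, however coherent each CoT reads."""
    a = ask(f"Is {x} > {y}? Answer Yes or No.").strip().lower()
    b = ask(f"Is {y} > {x}? Answer Yes or No.").strip().lower()
    return a.startswith("yes") and b.startswith("yes")

def unfaithfulness_rate(ask, pairs):
    """Fraction of comparison pairs with contradictory 'Yes' answers."""
    flags = [contradiction_probe(ask, x, y) for x, y in pairs]
    return sum(flags) / len(flags)
```

Running this over a benchmark of ordered pairs gives a crude per-model rate in the spirit of the table above; the paper's actual measurement protocol is more involved.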


4. Markov Chain of Thought (Oct 2024)

Paper: "Markov Chain of Thought for Efficient Mathematical Reasoning"

Concept: "Derive, then reduce"

MCoT Formula: $$ S_{t+1} = f(S_t, A_t) \rightarrow Q_{simplified} $$

Where:

- \(S_t\) = current state (text + code)
- \(A_t\) = action (code execution)
- \(Q_{simplified}\) = compressed question for next step

Benefits:

- Avoids lengthy KV cache
- Enables longer reasoning paths
- Self-correction via code interpreter
- Compresses previous steps into simplified questions
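The "derive, then reduce" loop can be sketched as follows. The `derive`, `reduce_step`, and `is_solved` callables are illustrative stand-ins (the paper uses a code interpreter for the derive step); the point is that each iteration depends only on the current compressed question, so context never accumulates:

```python
def mcot_solve(question, derive, reduce_step, is_solved, max_steps=10):
    """Markov Chain of Thought: derive a partial result from the
    current state only (A_t), then compress everything into a new
    simplified question (Q_simplified) -- no growing KV cache."""
    state = question
    for _ in range(max_steps):
        result = derive(state)               # derive: one reasoning/code step
        if is_solved(result):
            return result
        state = reduce_step(state, result)   # reduce: next-step question
    return state
```

This is exactly the Markov property from the Key Formulas section: the next state conditions only on the current one.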


5. Entropy-Guided Adaptive CoT (Jan 2026)

Paper: "LLMs for Game Theory: Entropy-Guided In-Context Learning"

Framework: Adaptive reasoning based on token-level uncertainty

Algorithm:

if entropy(tokens) < threshold:
    → Concise reasoning, minimal context
else:
    → Expanded multi-path CoT exploration

Results (Tic-Tac-Toe vs algorithmic opponent):

- Baseline LLM: -11.6% average outcome
- Entropy-guided: +9.5% average outcome
- Negative correlation between token entropy and move optimality

Key Insight: Uncertainty-guided adaptation improves sequential decision-making.
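The entropy gate above can be sketched concretely. A minimal version, assuming access to per-token probability distributions; the averaging scheme and the threshold value `tau` are illustrative, not the paper's exact recipe:

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum p(x) log p(x), in nats; zero-probability terms drop."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reasoning_mode(token_probs, tau=1.0):
    """Entropy-guided adaptation: low average per-token entropy ->
    concise reasoning with minimal context; high entropy -> expanded
    multi-path CoT exploration."""
    avg_h = sum(shannon_entropy(p) for p in token_probs) / len(token_probs)
    return "concise" if avg_h < tau else "multi_path"
```

A confident model (near one-hot token distributions) stays in the cheap branch; a uniform distribution over four tokens has entropy log 4 ≈ 1.39 nats and triggers exploration.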


6. Reasoning Model as Discriminator (May 2025)

Paper: "When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs"

Finding: Reasoning models excel as discriminators, not generators.

Results:

- DeepSeek-R1-1.5B vs CodeLlama-7B: 87% higher F1, 3.7% better accuracy
- DeepSeek-R1-1.5B vs CodeLlama-13B: 3.7% higher execution accuracy

Soft Score Extraction from CoT: $$ \text{Score}(c) = \frac{\text{CoT endorsement tokens for } c}{\text{Total endorsement tokens}} $$

Implication: Use reasoning models to evaluate, not generate.
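The ratio above can be computed mechanically once endorsement tokens are identified. A sketch with an illustrative token-matching scheme (the paper's exact extraction procedure is not reproduced here):

```python
def soft_score(cot_tokens, endorsements):
    """Soft score per candidate c: endorsement tokens for c divided by
    all endorsement tokens in the CoT. `endorsements` maps each
    candidate to the set of tokens counted as endorsing it."""
    counts = {c: sum(t in toks for t in cot_tokens)
              for c, toks in endorsements.items()}
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in endorsements}  # CoT endorsed nothing
    return {c: n / total for c, n in counts.items()}
```

Because the score comes from the discriminator's own chain of thought, it doubles as an explainable ranking signal over candidate outputs.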


7. Interactive Learning for LLM Reasoning (Sep 2025)

Paper: "Interactive Learning for LLM Reasoning"

Framework (ILR):

1. Dynamic Interaction -- cooperative vs competitive, chosen by task difficulty
2. Perception Calibration -- GRPO with cross-model reward integration

Idea3 Interaction Paradigm:

- Idea Sharing -- exchange approaches
- Idea Analysis -- critique each other
- Idea Fusion -- combine best elements

Results: +5% over strongest baseline on math/coding benchmarks
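One Idea3 round can be sketched as a loop over agent callables. Every name here is illustrative, and ILR's actual GRPO-based training loop is not reproduced; this only shows the share/analyze/fuse control flow:

```python
def idea3_round(agents, problem, critique, fuse):
    """One Idea3 interaction round: agents propose (Idea Sharing),
    each proposal is critiqued against the others (Idea Analysis),
    and the best elements are combined (Idea Fusion)."""
    proposals = [agent(problem) for agent in agents]            # sharing
    reviews = [[critique(p, q) for q in proposals if q != p]
               for p in proposals]                              # analysis
    return fuse(proposals, reviews)                             # fusion
```

In the paper this round sits inside an RL loop where the interaction mode (cooperative vs competitive) is itself chosen by task difficulty.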


Comparison Table

| Approach | Key Innovation | When to Use |
|----------|----------------|-------------|
| Long CoT | Deep reasoning + reflection | Complex problems |
| MCoT | State compression | Long reasoning chains |
| Entropy CoT | Uncertainty adaptation | Sequential decisions |
| Reasoning Discriminator | Evaluation focus | Generator-discriminator setups |
| Interactive Learning | Multi-agent interaction | Collaborative problem solving |

Key Formulas

CoT Faithfulness Measure

\[ \text{Faithfulness} = 1 - \frac{\text{Contradictory justifications}}{\text{Total justifications}} \]

Entropy-Based Adaptation

$$ H(X) = -\sum_{x \in X} p(x) \log p(x) $$ If \(H(X) > \tau\): expand reasoning paths

MCoT State Transition

\[ P(S_{t+1} | S_t, S_{t-1}, \ldots, S_0) \approx P(S_{t+1} | S_t) \]

Soft Score from CoT

\[ \text{Score}(c) = \text{sigmoid}\left(\sum_{t} \mathbb{1}[\text{token}_t \in \text{endorse}(c)]\right) \]

Interview Questions

Q1: CoT Limitations

"Explain why Chain-of-Thought reasoning might not reflect genuine reasoning capabilities."

Answer:

- CoT reflects inductive bias from training data
- Fails on out-of-distribution tasks
- Can produce post-hoc rationalizations
- Models may justify contradictory answers

Q2: Long vs Short CoT

"When should you use Long CoT over Short CoT?"

Answer:

- Long CoT: complex multi-step problems that need reflection/self-correction
- Short CoT: simple tasks, limited compute budget
- Trade-off: compute vs accuracy

Q3: Discriminator vs Generator

"Why might a smaller reasoning model outperform a larger model as a discriminator?"

Answer:

- Reasoning models are trained for evaluation-style judgment
- Better at analyzing logical consistency
- CoT provides explainable scoring
- Size doesn't correlate with discrimination ability

Q4: System Design

"Design a system that uses LLM reasoning for code review."

Architecture:

    Code Diff
        → Reasoning Discriminator (entropy-guided CoT) → Issue Detection
        → Interactive Multi-Agent review (if uncertain)
        → Final Review Report


CoT may be post-hoc rationalization, not real reasoning

Models generate convincing reasoning chains that do NOT reflect their internal process. GPT-4o-mini: 13% unfaithful CoT. A model can justify opposite answers ("X > Y" and "Y > X") equally convincingly. Takeaway: CoT is structured output, not a debugger into the model's thinking. Verify the result independently of the "explanation".

More thinking tokens != a better answer

Overthinking is a real problem for Long CoT. On simple tasks the model burns excess compute and can even degrade the result (self-correcting a right answer into a wrong one). The entropy-guided approach: low uncertainty -- short CoT; high uncertainty -- expanded CoT. Don't set max_tokens=32000 "just in case".

Practical Implications

For Deployment

  1. Don't trust CoT explanations blindly — verify independently
  2. Use reasoning models as evaluators — not generators
  3. Monitor entropy — adapt compute based on uncertainty
  4. Compress reasoning chains — MCoT for efficiency

For Development

  1. Test on out-of-distribution data
  2. Measure faithfulness explicitly
  3. Consider generator-discriminator architecture
  4. Implement adaptive reasoning based on task complexity

Remaining Challenges

  1. Generalization — CoT beyond training distribution
  2. Faithfulness — Ensuring explanations reflect actual reasoning
  3. Efficiency — Long CoT is expensive
  4. Evaluation — How to measure true reasoning capability

Sources

  1. arXiv:2503.09567 — Long CoT Survey (Mar 2025)
  2. arXiv:2508.01191 — CoT is a Mirage (Aug 2025)
  3. arXiv:2503.08679 — Unfaithful CoT (Mar 2025)
  4. arXiv:2410.17635 — Markov CoT (Oct 2024)
  5. arXiv:2601.10775 — Entropy-Guided CoT (Jan 2026)
  6. arXiv:2505.03786 — Reasoning as Discriminator (May 2025)
  7. arXiv:2509.26306 — Interactive Learning (Sep 2025)
  8. arXiv:2410.03595 — Hopfieldian View of CoT (Oct 2024)

See Also

  • Prompt Engineering -- CoT as a prompting technique, few-shot vs zero-shot CoT
  • Reasoning Techniques -- Tree-of-Thought, self-consistency, and other reasoning methods
  • Scaling Reasoning -- test-time compute scaling, inference-time reasoning
  • Alignment Methods -- GRPO (DeepSeek R1) enabled the emergence of reasoning via RL
  • LLM Benchmarks -- GSM8K, MATH, HumanEval for evaluating reasoning capabilities