LLM Reasoning & Chain-of-Thought 2025-2026¶
~6 min read
Research: Advances and limitations in LLM reasoning capabilities · Date: 2026-02-11 · Priority: P0
Prerequisites: Reasoning Techniques, Prompt Engineering
Overview¶
Chain-of-Thought (CoT) prompting has become fundamental to LLM reasoning. Recent research reveals both advances and critical limitations.
Key Findings¶
1. Long CoT vs Short CoT (Mar 2025)¶
Paper: "Towards Reasoning Era: A Survey of Long Chain-of-Thought"
Key Distinctions:
| Aspect | Short CoT | Long CoT |
|---|---|---|
| Depth | Shallow reasoning | Deep reasoning |
| Exploration | Single path | Extensive exploration |
| Reflection | None | Feasible reflection |
| Examples | Standard CoT | OpenAI O1, DeepSeek-R1 |
Three Characteristics of Long CoT:

1. Deep Reasoning — multi-step logical deduction
2. Extensive Exploration — multiple solution paths
3. Feasible Reflection — self-correction capabilities
Phenomena:

- Overthinking — excessive computation on simple tasks
- Inference-time Scaling — performance improves with more compute
2. CoT is a Mirage (Aug 2025)¶
Paper: "Is Chain-of-Thought Reasoning of LLMs a Mirage?"
Hypothesis: CoT reasoning reflects structured inductive bias from training data, not genuine reasoning.
Key Insight: $$ P(\text{CoT success}) \propto \text{Similarity}(\text{test}, \text{train distribution}) $$
Findings:

- CoT is brittle when pushed beyond training distributions
- Fails systematically on out-of-distribution tasks
- Success depends on data-distribution alignment
Implication: CoT may not represent true generalizable reasoning.
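The proportionality above suggests a cheap deployment heuristic: flag test prompts that sit far from the training distribution before trusting their CoT. A minimal sketch, using bag-of-words cosine similarity as a stand-in for a real embedding model; the threshold `tau` and the similarity proxy are illustrative assumptions, not from the paper:

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def ood_flag(test_prompt: str, train_samples: list[str], tau: float = 0.3) -> bool:
    """Flag a prompt whose best similarity to any training sample is below tau."""
    best = max(bow_cosine(test_prompt, s) for s in train_samples)
    return best < tau

train = ["add the two numbers and report the sum",
         "multiply the numbers then take the remainder"]
print(ood_flag("add these numbers and give the sum", train))            # False: in-distribution
print(ood_flag("prove the map is a contraction on a Banach space", train))  # True: flag it
```

In production the bag-of-words proxy would be replaced by sentence embeddings, but the monitoring logic is the same.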
3. Unfaithful CoT in the Wild (Mar 2025)¶
Paper: "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful"
Discovery: Models produce coherent arguments that don't reflect actual reasoning.
Implicit Post-Hoc Rationalization:

- Models justify contradictory answers with "coherent" explanations
- "Is X > Y?" and "Is Y > X?" can both get "Yes" with different justifications
Unfaithfulness Rates:

| Model | Rate |
|---|---|
| GPT-4o-mini | 13% |
| Haiku 3.5 | 7% |
| Gemini 2.5 Flash | 2.17% |
| ChatGPT-4o | 0.49% |
| DeepSeek R1 | 0.37% |
| Sonnet 3.7 (thinking) | 0.04% |
Unfaithful Illogical Shortcuts: models use subtly illogical reasoning to make speculative answers seem rigorous.
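The contradictory-answer finding translates into a cheap faithfulness probe: ask the model a comparison and its reverse, and flag the pair if both come back "Yes". A sketch with a stub in place of a real LLM call; the `ask` interface is a hypothetical, not the paper's harness:

```python
def contradiction_probe(ask, x: str, y: str) -> bool:
    """Return True if the model answers Yes to both 'Is X > Y?' and 'Is Y > X?'.

    `ask` is any callable prompt -> answer string; here it stands in
    for an LLM call.
    """
    a = ask(f"Is {x} greater than {y}? Answer Yes or No.")
    b = ask(f"Is {y} greater than {x}? Answer Yes or No.")
    return a.strip().lower().startswith("yes") and \
           b.strip().lower().startswith("yes")

# Toy stand-in models: one agrees with any framing, one is consistent.
sycophant = lambda prompt: "Yes"
consistent = lambda prompt: "Yes" if "7 greater than 3" in prompt else "No"

print(contradiction_probe(sycophant, "7", "3"))   # True: unfaithful pattern
print(contradiction_probe(consistent, "7", "3"))  # False
```

Run over a batch of comparison pairs, the flagged fraction gives a rough analogue of the unfaithfulness rates in the table above.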
4. Markov Chain of Thought (Oct 2024)¶
Paper: "Markov Chain of Thought for Efficient Mathematical Reasoning"
Concept: "Derive, then reduce"
MCoT Formula: $$ S_{t+1} = f(S_t, A_t) \rightarrow Q_{\text{simplified}} $$
Where:

- \(S_t\) = current state (text + code)
- \(A_t\) = action (code execution)
- \(Q_{\text{simplified}}\) = compressed question for the next step
Benefits:

- Avoids lengthy KV cache
- Enables longer reasoning paths
- Self-correction via code interpreter
- Compresses previous steps into simplified questions
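The "derive, then reduce" loop can be sketched as a Markov update in which only a compressed state survives each step, so the context never grows with chain length. The toy task and state fields below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCoTState:
    question: str   # compressed question Q_simplified carried forward
    value: int      # accumulated partial result (stand-in for code output)

def mcot_run(step: Callable[[MCoTState], MCoTState],
             done: Callable[[MCoTState], bool],
             state: MCoTState, max_steps: int = 50) -> MCoTState:
    """S_{t+1} = f(S_t): only the compressed state is carried between
    steps, so the prompt (and KV cache) stays constant-size."""
    for _ in range(max_steps):
        if done(state):
            break
        state = step(state)
    return state

# Toy "derive, then reduce": compute sum 1..n by folding one term into
# the state and re-posing the rest as a strictly smaller question.
def step(s: MCoTState) -> MCoTState:
    n = int(s.question.split("..")[-1])
    return MCoTState(question=f"sum 1..{n - 1}", value=s.value + n)

final = mcot_run(step, lambda s: s.question.endswith("..0"),
                 MCoTState("sum 1..5", 0))
print(final.value)  # 15
```

In the real MCoT setting `step` would call a model plus a code interpreter; the structural point is that `MCoTState` is the *entire* context passed to the next step.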
5. Entropy-Guided Adaptive CoT (Jan 2026)¶
Paper: "LLMs for Game Theory: Entropy-Guided In-Context Learning"
Framework: Adaptive reasoning based on token-level uncertainty
Algorithm:
if entropy(tokens) < threshold:
→ Concise reasoning, minimal context
else:
→ Expanded multi-path CoT exploration
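The branch above, made runnable; the threshold value is an illustrative assumption:

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy H(X) = -sum p(x) log p(x) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_mode(probs: list[float], threshold: float = 1.0) -> str:
    """Low entropy: the model is confident, keep the CoT concise.
    High entropy: uncertain, expand multi-path exploration."""
    return "concise" if token_entropy(probs) < threshold else "multi-path"

print(choose_mode([0.9, 0.05, 0.05]))         # confident  -> "concise"
print(choose_mode([0.25, 0.25, 0.25, 0.25]))  # uniform    -> "multi-path"
```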
Results (Tic-Tac-Toe vs an algorithmic opponent):

- Baseline LLM: -11.6% average outcome
- Entropy-guided: +9.5% average outcome
- Correlation: negative between entropy and move optimality
Key Insight: Uncertainty-guided adaptation improves sequential decision-making.
6. Reasoning Model as Discriminator (May 2025)¶
Paper: "When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs"
Finding: Reasoning models excel as discriminators, not generators.
Results:

- DeepSeek-R1-1.5B vs CodeLlama-7B: 87% higher F1, 3.7% better accuracy
- DeepSeek-R1-1.5B vs CodeLlama-13B: 3.7% higher execution accuracy
Soft Score Extraction from CoT: $$ \text{Score}(c) = \frac{\text{CoT endorsement tokens for } c}{\text{Total endorsement tokens}} $$
Implication: Use reasoning models to evaluate, not generate.
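The soft-score idea can be sketched by counting candidate mentions inside endorsement sentences of the discriminator's CoT. The endorsement lexicon and sentence-level matching are assumptions for illustration, not the paper's exact token-level procedure:

```python
import re
from collections import Counter

def soft_scores(cot: str, candidates: list[str]) -> dict[str, float]:
    """Score(c) = endorsement tokens for c / total endorsement tokens."""
    endorse = {"correct", "correctly", "right", "valid", "better", "prefer"}
    counts: Counter = Counter()
    for sent in re.split(r"[.!?]", cot.lower()):
        words = sent.split()
        if endorse & set(words):                       # an endorsement sentence
            for c in candidates:
                counts[c] += sum(w == c.lower() for w in words)
    total = sum(counts.values())
    return {c: counts[c] / total if total else 0.0 for c in candidates}

cot = ("Candidate A handles the join correctly. "
       "Candidate B misses the filter. A is better than B.")
print(soft_scores(cot, ["A", "B"]))  # A outscores B
```

The resulting scores are soft preferences over candidates, which is exactly the discriminator role the paper advocates.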
7. Interactive Learning for LLM Reasoning (Sep 2025)¶
Paper: "Interactive Learning for LLM Reasoning"
Framework (ILR):

1. Dynamic Interaction — cooperative vs competitive based on difficulty
2. Perception Calibration — GRPO with cross-model reward integration
Idea3 Interaction Paradigm:

- Idea Sharing — exchange approaches
- Idea Analysis — critique each other
- Idea Fusion — combine the best elements
Results: +5% over strongest baseline on math/coding benchmarks
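The three Idea3 phases can be sketched as one interaction round. The agent interface (a callable from prompt to text) is an assumption for illustration, not the paper's training setup:

```python
from typing import Callable

Agent = Callable[[str], str]

def idea3_round(agents: list[Agent], problem: str) -> str:
    """One Idea3 round: Idea Sharing -> Idea Analysis -> Idea Fusion."""
    # 1. Idea Sharing: every agent drafts an approach independently.
    ideas = [a(f"Propose an approach: {problem}") for a in agents]
    pool = "\n".join(ideas)
    # 2. Idea Analysis: each agent critiques the pooled ideas.
    critiques = [a(f"Critique these approaches:\n{pool}") for a in agents]
    # 3. Idea Fusion: one agent combines the best elements of all.
    return agents[0]("Fuse the best elements:\n" + pool + "\n" + "\n".join(critiques))

# Toy deterministic agents that tag and echo the first line of the prompt:
echo = lambda tag: (lambda prompt: f"[{tag}] {prompt.splitlines()[0]}")
print(idea3_round([echo("math"), echo("code")], "minimize f(x)"))
```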
Comparison Table¶
| Approach | Key Innovation | When to Use |
|---|---|---|
| Long CoT | Deep + reflection | Complex problems |
| MCoT | State compression | Long reasoning chains |
| Entropy CoT | Uncertainty adaptation | Sequential decisions |
| Reasoning Discriminator | Evaluation focus | Generator-discriminator setup |
| Interactive Learning | Multi-agent | Collaborative problem solving |
Key Formulas¶
CoT Faithfulness Measure¶
No closed-form metric is given in the cited papers; unfaithfulness is measured empirically, e.g. as the fraction of reversed question pairs ("Is X > Y?" / "Is Y > X?") whose CoT justifies contradictory conclusions.
Entropy-Based Adaptation¶
$$ H(X) = -\sum_{x \in X} p(x) \log p(x) $$ If \(H(X) > \tau\): expand reasoning paths
MCoT State Transition¶
$$ S_{t+1} = f(S_t, A_t) \rightarrow Q_{\text{simplified}} $$
Soft Score from CoT¶
$$ \text{Score}(c) = \frac{\text{CoT endorsement tokens for } c}{\text{Total endorsement tokens}} $$
Interview Questions¶
Q1: CoT Limitations¶
"Explain why Chain-of-Thought reasoning might not reflect genuine reasoning capabilities."
Answer:

- CoT reflects inductive bias from training data
- Fails on out-of-distribution tasks
- Can produce post-hoc rationalizations
- Models may justify contradictory answers
Q2: Long vs Short CoT¶
"When should you use Long CoT over Short CoT?"
Answer:

- Long CoT: complex multi-step problems that need reflection/self-correction
- Short CoT: simple tasks, limited compute budget
- Trade-off: compute vs accuracy
Q3: Discriminator vs Generator¶
"Why might a smaller reasoning model outperform a larger model as a discriminator?"
Answer:

- Reasoning models are trained for evaluation
- Better at analyzing logical consistency
- CoT provides explainable scoring
- Size doesn't correlate with discrimination ability
Q4: System Design¶
"Design a system that uses LLM reasoning for code review."
Architecture:
Code Diff → Reasoning Discriminator → Issue Detection
↓
Entropy-Guided CoT
↓
Interactive Multi-Agent (if uncertain)
↓
Final Review Report
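The pipeline above as a sketch with every stage stubbed out; all callables and the escalation threshold are hypothetical interfaces, not a real code-review product:

```python
import math

def review(diff, discriminate, entropy_of, multi_agent, tau: float = 0.5) -> str:
    """Sketch of the architecture above.

    discriminate : diff -> (issues, confidence distribution over verdicts)
    entropy_of   : distribution -> float (entropy-guided escalation)
    multi_agent  : issues -> refined issues (only when the model is uncertain)
    """
    issues, dist = discriminate(diff)        # reasoning model as discriminator
    if entropy_of(dist) > tau:               # high uncertainty: escalate
        issues = multi_agent(issues)         # interactive multi-agent pass
    return "Review:\n" + "\n".join(f"- {i}" for i in issues)

# Toy stubs standing in for LLM calls:
disc = lambda d: (["possible off-by-one in loop bound"], [0.5, 0.5])
entropy = lambda p: -sum(x * math.log(x) for x in p if x > 0)
agents = lambda issues: issues + ["confirmed by second reviewer agent"]
print(review("@@ -1,3 +1,3 @@ ...", disc, entropy, agents))
```

The design point: the expensive multi-agent stage runs only when the discriminator's own uncertainty is high, mirroring the entropy-guided adaptation from section 5.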
CoT can be post-hoc rationalization rather than real reasoning
Models generate convincing reasoning chains that do NOT reflect their internal process. GPT-4o-mini: 13% unfaithful CoT. A model can justify opposite answers ("X > Y" and "Y > X") equally convincingly. Takeaway: CoT is structured output, not a debugger of the model's thinking. Verify the result independently of the "explanation".
More thinking tokens != a better answer
Overthinking is a real problem with Long CoT. On simple tasks the model burns excess compute and sometimes makes the result worse (self-correcting a right answer into a wrong one). The entropy-guided approach: if uncertainty is low, use a short CoT; if high, an expanded one. Don't set max_tokens=32000 "just in case".
Practical Implications¶
For Deployment¶
- Don't trust CoT explanations blindly — verify independently
- Use reasoning models as evaluators — not generators
- Monitor entropy — adapt compute based on uncertainty
- Compress reasoning chains — MCoT for efficiency
For Development¶
- Test on out-of-distribution data
- Measure faithfulness explicitly
- Consider generator-discriminator architecture
- Implement adaptive reasoning based on task complexity
Remaining Challenges¶
- Generalization — CoT beyond training distribution
- Faithfulness — Ensuring explanations reflect actual reasoning
- Efficiency — Long CoT is expensive
- Evaluation — How to measure true reasoning capability
Sources¶
- arXiv:2503.09567 — Long CoT Survey (Mar 2025)
- arXiv:2508.01191 — CoT is a Mirage (Aug 2025)
- arXiv:2503.08679 — Unfaithful CoT (Mar 2025)
- arXiv:2410.17635 — Markov CoT (Oct 2024)
- arXiv:2601.10775 — Entropy-Guided CoT (Jan 2026)
- arXiv:2505.03786 — Reasoning as Discriminator (May 2025)
- arXiv:2509.26306 — Interactive Learning (Sep 2025)
- arXiv:2410.03595 — Hopfieldian View of CoT (Oct 2024)
See Also¶
- Prompt Engineering -- CoT as a prompting technique, few-shot vs zero-shot CoT
- Reasoning Techniques -- Tree-of-Thought, self-consistency, and other reasoning methods
- Scaling Reasoning -- test-time compute scaling, inference-time reasoning
- Alignment Methods -- GRPO (DeepSeek R1) produced emergent reasoning via RL
- LLM Benchmarks -- GSM8K, MATH, HumanEval for evaluating reasoning capabilities