
Analyzing Contradictions Between Sources

~9 minute read

Prerequisites: Cross-reference topic map | Master preparation guide

Type: synthesis / cross-reference / contradictions. Date: February 2026

When preparing for LLM engineering interviews, candidates run into 10+ direct contradictions between papers, benchmarks, and framework documentation. For example, the ALiBi paper claims superiority over RoPE in extrapolation, yet 90%+ of production LLMs (LLaMA, Mistral, Qwen, Gemma) use RoPE. Such discrepancies are not errors but consequences of different contexts, metrics, and publication dates. This document resolves 10 key contradictions with concrete recommendations as of February 2026.


Overview

During the research synthesis, several contradictions emerged between different sources. This document identifies these contradictions, analyzes the context, and provides resolution guidance.


1. Positional Encoding: RoPE vs ALiBi

Contradiction

| Source | Claim |
|---|---|
| ALiBi paper (2021) | ALiBi extrapolates better than RoPE |
| Modern LLM adoption | RoPE used in LLaMA, Mistral, Qwen, Gemma |
| Some benchmarks | ALiBi shows better length extrapolation |

Resolution

Both claims are TRUE, but context matters:

  1. ALiBi's advantage:
     - Extrapolates to longer sequences without modification
     - Trained on 1024 → works on 8192+ out of the box
     - Simpler implementation
  2. RoPE's advantage:
     - Higher overall quality
     - Better suited for learned position patterns
     - Extensions (RoPE-Scaling, YaRN) close the extrapolation gap
  3. Industry choice: RoPE won because:
     - Quality > pure extrapolation in most use cases
     - Scaling methods solve extrapolation
     - Better compatibility with rotary attention optimizations

Conclusion: Use RoPE for production. Consider ALiBi only when you specifically need out-of-the-box length extrapolation with no scaling modifications.
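To make the trade-off concrete, here is a minimal NumPy sketch of both mechanisms: ALiBi adds a static, head-specific linear penalty to attention scores, while RoPE rotates query/key channel pairs by position-dependent angles. This illustrates the math only, not the fused kernels production stacks actually run.

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """ALiBi: a static, head-specific linear penalty added to QK^T scores.
    Slopes follow the paper's geometric sequence (n_heads a power of two)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)     # (H,)
    rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]  # j - i
    return slopes[:, None, None] * rel[None, :, :]  # (H, T, T); causal mask applied separately

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """RoPE: rotate each (even, odd) channel pair of x (shape (T, d))
    by a position-dependent angle; no extra parameters, no score bias."""
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)          # (d/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```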


2. SSM vs Transformer: Quality Claims

Contradiction

| Source | Claim |
|---|---|
| Mamba paper | "SSMs match Transformer quality" |
| Some benchmarks | Mamba-3 at 95-98% of Transformer quality |
| Industry adoption | Transformers still dominate |

Resolution

Context-dependent quality:

| Scenario | Winner | Reason |
|---|---|---|
| Short sequences (<1K) | Transformer | Full attention captures all dependencies |
| Medium sequences (1K-16K) | Hybrid (Jamba) | Balance of quality + efficiency |
| Very long sequences (>16K) | SSM or Hybrid | SSM's O(T) advantage dominates |
| Generation quality | Transformer | Subtle patterns better captured |

Key nuance: "Match quality" in papers means comparable on benchmarks, not identical. Production systems often need that extra 2-5% quality.

Conclusion: Pure SSM for efficiency-critical applications. Hybrid (1:7 Transformer:SSM) for production with quality requirements.
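A sketch of what the 1:7 ratio means in practice, assuming a Jamba-style interleaving; the exact layer placement in Jamba differs, this only illustrates the ratio:

```python
def hybrid_layout(n_layers: int = 32, attn_every: int = 8) -> list[str]:
    """Illustrative 1:7 attention:SSM interleaving -- one attention layer
    per block of eight, the rest SSM (Mamba) layers."""
    return ["attn" if i % attn_every == attn_every - 1 else "ssm"
            for i in range(n_layers)]

print(hybrid_layout(16))  # 7 ssm layers, then attn, repeated twice
```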


3. Quantization Quality: AWQ vs GPTQ vs GGUF

Contradiction

| Source | Claim |
|---|---|
| AWQ paper | AWQ achieves ~98% of original quality |
| GPTQ paper | GPTQ achieves 97-99% of original quality |
| llama.cpp docs | GGUF Q4_K_M achieves ~92% quality |

Resolution

Different quality metrics and use cases:

| Method | Quality (4-bit) | Speed | Best For |
|---|---|---|---|
| GPTQ | 97-99% | Fast on NVIDIA | GPU serving |
| AWQ | ~98% | Fast on all GPUs | Production APIs |
| GGUF Q4_K_M | ~92% | Variable | CPU/Mobile |

Why the numbers differ:

  1. Different benchmarks: some use perplexity, others downstream tasks
  2. Different models: larger models quantize better
  3. Different calibration data: affects per-channel scaling

Conclusion: For production GPU: AWQ or GPTQ (comparable). For CPU/edge: GGUF with appropriate quantization level.
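For intuition about why all three methods land in a similar quality range, here is an illustrative symmetric 4-bit group quantization in NumPy. This is only the shared core idea; the real methods differ in calibration (AWQ's activation-aware scaling, GPTQ's Hessian-based error correction, GGUF's k-quant block layouts):

```python
import numpy as np

def quantize_4bit(w: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit group quantization: one FP scale per group of weights."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-12  # int4 range: [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_4bit(w)
print("max abs error:", float(np.abs(dequantize_4bit(q, s) - w).max()))
```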


4. Inference Engine: vLLM vs SGLang Speed Claims

Contradiction

| Source | Claim |
|---|---|
| vLLM docs | "State-of-the-art throughput" |
| SGLang paper | "Up to 3.7× faster than vLLM" |
| Some benchmarks | vLLM faster on single requests |

Resolution

Performance is workload-dependent:

| Workload | Winner | Why |
|---|---|---|
| Single request | vLLM | Lower overhead, simpler path |
| Batch inference | SGLang | Better scheduling, RadixAttention |
| Agent workflows | SGLang | 3× faster due to prefix caching |
| Multi-turn chat | Similar | Both have prefix caching |

Key insight: SGLang's advantage comes from:

  1. RadixAttention (token-level prefix sharing)
  2. Structured output optimization
  3. Disaggregated inference (for multi-node)

Conclusion: Use SGLang for high-throughput/agent workloads. vLLM for simpler serving needs.
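A toy sketch of the prefix-sharing idea behind RadixAttention: requests that share a token prefix (e.g., a common system prompt) can reuse cached KV entries. Real SGLang code maps such tree nodes to GPU KV blocks and evicts with an LRU policy; this version only counts reusable tokens:

```python
class PrefixNode:
    """Toy radix tree over token IDs for KV-cache prefix sharing."""
    def __init__(self):
        self.children: dict[int, "PrefixNode"] = {}

    def insert(self, tokens: list[int]) -> int:
        """Insert a token sequence; return how many leading tokens were already cached."""
        node, reused, extending = self, 0, True
        for t in tokens:
            if extending and t in node.children:
                reused += 1              # still inside a previously cached prefix
            else:
                extending = False
                node.children.setdefault(t, PrefixNode())
            node = node.children[t]
        return reused

cache = PrefixNode()
system_prompt = list(range(100))                 # stand-in for a shared system prompt
print(cache.insert(system_prompt + [500, 501]))  # 0: cold cache
print(cache.insert(system_prompt + [600, 601]))  # 100: shared prefix reused
```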


5. Speculative Decoding: Speedup Claims

Contradiction

| Source | Claim |
|---|---|
| EAGLE paper | 2-4× speedup |
| Some production reports | 1.3-1.8× speedup |
| vLLM blog (Dec 2025) | 1.82× on Qwen3-32B |

Resolution

Speedup depends on acceptance rate:

| Acceptance Rate | Speedup |
|---|---|
| 33% (Qwen3-32B) | 1.82× |
| 50% | ~2.2× |
| 65% | ~2.5× |
| 80% | ~3×+ |

Why the variance:

  1. Draft model quality: better draft = higher acceptance
  2. Task type: repetitive tasks = higher acceptance
  3. Temperature: lower temperature = more predictable tokens
  4. Model size: smaller targets accept more

Conclusion: EAGLE-3 achieves 2-4× in ideal conditions. Real-world: expect 1.5-2.5× for most workloads.
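For planning, the idealized model from the original speculative decoding analysis (Leviathan et al., 2023) relates acceptance rate to speedup. Note that it assumes i.i.d. per-token acceptance and will not exactly reproduce measured numbers like the table above, because reported "acceptance rate" definitions vary between systems:

```python
def expected_speedup(alpha: float, gamma: int = 4, c: float = 0.1) -> float:
    """Idealized speculative-decoding speedup: alpha = per-token acceptance
    probability (assumed i.i.d.), gamma = draft tokens per step, c = cost of
    one draft pass relative to one target pass. Expected accepted tokens per
    target pass is (1 - alpha^(gamma+1)) / (1 - alpha)."""
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return tokens_per_step / (gamma * c + 1)

for a in (0.5, 0.65, 0.8, 0.9):
    print(f"acceptance {a:.2f}: ~{expected_speedup(a):.2f}x")
```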


6. Long Context: RAG vs Long Context Models

Contradiction

| Source | Claim |
|---|---|
| Some papers | "Long context will replace RAG" |
| RAG research | "RAG remains essential" |
| Production experience | Both are needed |

Resolution

Cost-quality tradeoff:

| Aspect | Long Context | RAG |
|---|---|---|
| Cost ratio | ~100× more expensive | Baseline |
| Fresh data | Requires retraining | Instant update |
| Accuracy | Full context | Retrieval quality dependent |
| "Lost in the middle" | Yes | Can be mitigated |

Key finding (2025-2026): \(\text{Cost(Long Context)} : \text{Cost(RAG)} \approx 100 : 1\)

Best practice:

  - Use RAG for retrieval (get relevant docs)
  - Use long context for reasoning (process the retrieved docs)
  - Hybrid approach for production

Conclusion: Long context and RAG are complementary, not competing. Use both.
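A minimal sketch of the hybrid pattern, with `retriever` and `llm` as placeholder interfaces rather than any specific library:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    score: float

def answer(query: str, retriever, llm, k: int = 20) -> str:
    """Hybrid RAG + long-context pattern: retrieval narrows the corpus,
    then a long-context model reasons over everything retrieved at once.
    `retriever.search` and `llm.generate` are hypothetical interfaces."""
    docs = retriever.search(query, top_k=k)        # cheap: vector/BM25 lookup
    context = "\n\n".join(d.text for d in docs)    # tens of docs fit in long context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.generate(prompt)                    # expensive: one long-context call
```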


7. MoE Load Balancing: Auxiliary Loss vs Loss-Free

Contradiction

| Source | Claim |
|---|---|
| Standard MoE (Mixtral) | Auxiliary loss is necessary |
| DeepSeek V3 | Loss-free balancing works better |
| SIMBAL paper | Similarity-preserving is 36% faster |

Resolution

Evolution of techniques:

| Method | Description | Pros | Cons |
|---|---|---|---|
| Auxiliary Loss | Penalty for imbalance | Simple | Interferes with the main loss |
| Loss-Free (DeepSeek V3) | Dynamic bias per expert | No interference | Complex implementation |
| SIMBAL | Similarity-preserving | 36% faster convergence | New (less proven) |

Status (Feb 2026):

  - Auxiliary loss: standard, well-tested
  - Loss-free: gaining adoption (DeepSeek V3 proves it works)
  - SIMBAL: promising but needs more validation

Conclusion: For new projects, consider loss-free balancing. For stability, auxiliary loss is the safe choice.
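A compact sketch of the two families, assuming the Switch-Transformer form of the auxiliary loss and a simplified sign-based bias update in the spirit of DeepSeek-V3's loss-free balancing (the exact V3 update rule differs):

```python
import numpy as np

def aux_loss(router_probs: np.ndarray, expert_assign: np.ndarray,
             alpha: float = 0.01) -> float:
    """Switch-style auxiliary load-balancing loss:
    L = alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens
    routed to expert i and P_i is expert i's mean router probability."""
    n_experts = router_probs.shape[1]
    f = np.bincount(expert_assign, minlength=n_experts) / len(expert_assign)
    P = router_probs.mean(axis=0)
    return alpha * n_experts * float(f @ P)

def update_bias(bias: np.ndarray, tokens_per_expert: np.ndarray,
                lr: float = 1e-3) -> np.ndarray:
    """Loss-free balancing sketch: after each batch, nudge a per-expert routing
    bias down for overloaded experts and up for underloaded ones. The bias is
    added to router scores for top-k selection only, so the main loss is
    untouched. (Illustration only; DeepSeek-V3's exact rule differs.)"""
    return bias - lr * np.sign(tokens_per_expert - tokens_per_expert.mean())
```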


8. Normalization: LayerNorm vs RMSNorm

Contradiction

| Source | Claim |
|---|---|
| Original Transformer | LayerNorm is standard |
| LLaMA paper | RMSNorm is better |
| Some studies | "No significant difference" |

Resolution

Both are correct in different contexts:

| Aspect | LayerNorm | RMSNorm |
|---|---|---|
| Computation | \(O(d)\) with mean subtraction | \(O(d)\), simpler |
| Quality | Established | Identical or better |
| Speed | Baseline | 15-25% faster |
| Adoption (2026) | Declining | Standard in all LLMs |

Why RMSNorm won:

  1. Simpler computation (no mean subtraction)
  2. Same or better quality
  3. Easier to implement efficiently on GPU
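The difference is easy to see side by side; a minimal NumPy version of both norms:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the standard deviation."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: skip centering entirely and normalize by the RMS alone.
    One fewer reduction pass, and no beta parameter (as in LLaMA)."""
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```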

Conclusion: Use RMSNorm for all new Transformer-based models. LayerNorm is legacy.


9. CoT Faithfulness: Do Models Really Reason?

Contradiction

| Source | Claim |
|---|---|
| Original CoT paper | CoT improves reasoning |
| "CoT is a Mirage" paper | CoT reflects training bias |
| DeepSeek R1 docs | CoT enables complex reasoning |

Resolution

Nuanced understanding:

| Finding | Implication |
|---|---|
| CoT helps solve problems | True for many tasks |
| CoT explanations are unfaithful | Models justify answers post-hoc |
| CoT faithfulness improves with scale | Larger models are more faithful |

Unfaithfulness rates (Jan 2026):

  - GPT-4o-mini: 13%
  - Claude Haiku: 7%
  - DeepSeek R1: 0.37%
  - Claude Sonnet (thinking): 0.04%

Conclusion: CoT improves performance but explanations may not reflect actual reasoning. Use for task solving, be skeptical of explanations.


10. FlashAttention Version Claims

Contradiction

| Source | Claim |
|---|---|
| FlashAttention-1 paper | 2-4× speedup |
| FlashAttention-2 paper | 2× faster than FA-1 |
| FlashAttention-3 paper | 1.5-2× faster than FA-2 |

Resolution

Cumulative improvements:

| Version | Key Innovation | Total Speedup |
|---|---|---|
| FA-1 | Memory-efficient tiling | 2-4× vs baseline |
| FA-2 | Better parallelization | 4-8× vs baseline |
| FA-3 | Hopper optimization, FP8 | 6-16× vs baseline |

Important caveat: FA-3 only works on H100/H200. For A100, FA-2 is still optimal.

Conclusion: Speedups are cumulative but hardware-dependent. Use latest version compatible with your hardware.
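The core trick all FA versions share is tiling with online softmax rescaling. Here is a minimal single-query NumPy sketch of that math; the real kernels fuse this into SRAM-resident GPU tiles and add causal masking, parallelization, and (in FA-3) FP8 paths:

```python
import numpy as np

def online_softmax_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray,
                             block: int = 64) -> np.ndarray:
    """Attention for one query vector, computed tile by tile with online
    softmax rescaling: no (T x T) score matrix is ever materialized."""
    d = q.shape[-1]
    m, l = -np.inf, 0.0                        # running max and normalizer
    acc = np.zeros(V.shape[-1], dtype=np.float64)
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q / np.sqrt(d)   # scores for this tile
        m_new = max(m, float(scores.max()))
        p = np.exp(scores - m_new)                 # tile's unnormalized probs
        corr = np.exp(m - m_new)                   # rescale previous tiles
        l = l * corr + p.sum()
        acc = acc * corr + p @ V[s:s + block]
        m = m_new
    return acc / l

T, d = 512, 64
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(d,)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
scores = K @ q / np.sqrt(d)
ref = np.exp(scores - scores.max())
ref = ref / ref.sum() @ V
print(np.allclose(online_softmax_attention(q, K, V), ref))  # True
```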


Common Misconceptions

Misconception: 'ALiBi is better than RoPE for long contexts'

ALiBi extrapolates without modification (1024 -> 8192+), but RoPE with scaling methods (YaRN, NTK-aware) closes this gap. On top of that, RoPE delivers 2-5% higher quality on standard benchmarks. The industry chose RoPE: LLaMA, Mistral, Qwen, and Gemma all use it. ALiBi has remained a niche solution.

Misconception: 'SSMs have already matched Transformer quality'

The Mamba paper claims to 'match quality', but that means comparable benchmark results, not identical ones. In production, a 2-5% gap is critical. That is why industry leaders use hybrids (Jamba, 1:7 Transformer:Mamba) rather than pure SSMs. Pure SSMs are justified only under strict latency constraints.

Misconception: 'Speculative decoding gives a 2-4x speedup'

That is the upper bound, reached at acceptance rates of 80%+. In real production with Qwen3-32B, vLLM shows a 33% acceptance rate and a 1.82x speedup. Expect 1.5-2.5x for most workloads. The speedup depends heavily on draft model quality, task type, and temperature.


Interview

Question: How do you resolve contradictions between different sources when choosing architectural solutions?

❌ "I just take the most recent paper and follow its recommendations"

❌ "You should use whatever scored best on benchmarks"

✅ "I check four factors: (1) publication date -- newer supersedes older, (2) usage context -- different tasks = different winners, (3) industry adoption -- production choices reflect reality, (4) metrics -- benchmarks can be cherry-picked. For example, ALiBi vs RoPE: ALiBi wins on pure extrapolation, but RoPE won in industry thanks to quality plus scaling methods. A contradiction is a trade-off, not an error."

Question: RoPE or ALiBi for a new LLM project in 2026?

❌ "ALiBi, because it is simpler to implement and extrapolates better"

✅ "RoPE with YaRN scaling. Reasons: (1) higher quality on standard tasks, (2) 90%+ of LLMs use RoPE, which means better ecosystem compatibility (FlashAttention, CUDA kernels), (3) scaling methods have closed the extrapolation gap. I would consider ALiBi only for an edge case with zero budget for adaptation."

Question: When is a pure SSM worth using instead of a Transformer?

❌ "Mamba showed that SSMs are no worse than Transformers, so you can always use them"

✅ "A pure SSM is justified with (1) very long sequences >16K, (2) strict latency and memory constraints, (3) tasks where a 2-5% quality loss is acceptable. For production with quality requirements, a Jamba hybrid (1:7 Transformer:SSM) delivers 99%+ of the quality at a 3x speedup."


Summary: Contradiction Resolution Framework

When encountering contradictions:

  1. Check publication date — Newer sources often supersede older
  2. Consider use case — Different contexts favor different solutions
  3. Look at adoption — Industry choices reflect production reality
  4. Verify with benchmarks — Numbers can be cherry-picked
  5. Understand trade-offs — Most "contradictions" are trade-offs

Quick Resolution Guide

| Topic | 2026 Recommendation |
|---|---|
| Positional Encoding | RoPE (with scaling for long context) |
| SSM vs Transformer | Hybrid (1:7) for production |
| Quantization | AWQ/GPTQ for GPU, GGUF for edge |
| Inference Engine | SGLang for agents, vLLM for simpler serving |
| Speculative Decoding | EAGLE-3 (expect 1.5-2.5× real-world) |
| Long Context vs RAG | Both (RAG + long-context reasoning) |
| MoE Load Balancing | Loss-free if possible, auxiliary loss otherwise |
| Normalization | RMSNorm |
| CoT | Use for solving, verify explanations |
| FlashAttention | Latest version for your hardware |