Analysis of Contradictions Between Sources¶
~9 minute read
Prerequisites: Cross-reference topic map | Master preparation guide
Type: synthesis / cross-reference / contradictions. Date: February 2026
When preparing for LLM engineering interviews, candidates run into 10+ direct contradictions between papers, benchmarks, and framework documentation. For example, the ALiBi paper claims superiority over RoPE in extrapolation, yet 90%+ of production LLMs (LLaMA, Mistral, Qwen, Gemma) use RoPE. Such discrepancies are not errors but consequences of differing contexts, metrics, and publication dates. This document resolves 10 key contradictions with concrete recommendations as of February 2026.
Overview¶
During the research synthesis, several contradictions emerged between different sources. This document identifies these contradictions, analyzes the context, and provides resolution guidance.
1. Positional Encoding: RoPE vs ALiBi¶
Contradiction¶
| Source | Claim |
|---|---|
| ALiBi paper (2021) | ALiBi extrapolates better than RoPE |
| Modern LLM adoption | RoPE used in LLaMA, Mistral, Qwen, Gemma |
| Some benchmarks | ALiBi shows better length extrapolation |
Resolution¶
Both claims are TRUE, but context matters:
- ALiBi's advantages:
  - Extrapolation to longer sequences without modification: trained on 1024 → works on 8192+ out of the box
  - Simpler implementation
- RoPE's advantages:
  - Higher overall quality
  - Better suited for learned position patterns
  - Extensions (RoPE-Scaling, YaRN) close the extrapolation gap
- Industry choice: RoPE won because:
  - Quality > pure extrapolation in most use cases
  - Scaling methods solve extrapolation
  - Better compatibility with rotary attention optimizations
Conclusion: Use RoPE for production. Consider ALiBi only when you specifically need out-of-the-box extrapolation with no model modification.
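For intuition, here is a minimal NumPy sketch of the two mechanisms: ALiBi adds a head-specific linear distance penalty to the attention scores, while RoPE rotates query/key feature pairs by a position-dependent angle. Shapes, the slope schedule, and the base frequency are illustrative assumptions, not tied to any particular model.

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """ALiBi: head-specific linear penalty added to attention scores before softmax."""
    # Geometric slope schedule per head (as in the ALiBi paper, for power-of-2 head counts).
    slopes = np.array([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    distance = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]  # element [i, j] = j - i
    # Symmetric penalty here for brevity; causal masking would remove the future half.
    return slopes[:, None, None] * -np.abs(distance)[None, :, :]

def rope_rotate(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """RoPE: rotate consecutive feature pairs by a position-dependent angle (split-half layout)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)               # per-pair rotation frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

bias = alibi_bias(seq_len=8, num_heads=4)    # added to q @ k.T before softmax
q_rot = rope_rotate(np.random.randn(8, 64))  # applied to q and k before the dot product
```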
2. SSM vs Transformer: Quality Claims¶
Contradiction¶
| Source | Claim |
|---|---|
| Mamba paper | "SSMs match Transformer quality" |
| Some benchmarks | Mamba-3 at 95-98% of Transformer quality |
| Industry adoption | Transformers still dominate |
Resolution¶
Context-dependent quality:
| Scenario | Winner | Reason |
|---|---|---|
| Short sequences (<1K) | Transformer | Full attention captures all dependencies |
| Medium sequences (1K-16K) | Hybrid (Jamba) | Balance quality + efficiency |
| Very long sequences (>16K) | SSM or Hybrid | SSM's O(T) advantage dominates |
| Generation quality | Transformer | Subtle patterns better captured |
Key nuance: "Match quality" in papers means comparable on benchmarks, not identical. Production systems often need that extra 2-5% quality.
Conclusion: Pure SSM for efficiency-critical applications. Hybrid (1:7 Transformer:SSM) for production with quality requirements.
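A back-of-the-envelope sketch of why the SSM advantage grows with sequence length. The FLOP formulas are simplified assumptions (attention score computation only, a fixed state size of 16), not measurements of any real implementation.

```python
# Illustrative O(T^2) vs O(T) scaling; constants are assumptions, real kernels differ.
def attention_score_flops(seq_len: int, d_model: int) -> int:
    # QK^T plus attention-weighted V: ~2 * T^2 * d multiply-adds per layer (projections ignored).
    return 2 * seq_len ** 2 * d_model

def ssm_scan_flops(seq_len: int, d_model: int, d_state: int = 16) -> int:
    # Selective-scan style recurrence: ~T * d * N work per layer.
    return seq_len * d_model * d_state

for T in (1_000, 16_000, 128_000):
    ratio = attention_score_flops(T, 4096) / ssm_scan_flops(T, 4096)
    print(f"T={T:>7}: attention/SSM cost ratio ≈ {ratio:,.0f}x")
```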
3. Quantization Quality: AWQ vs GPTQ vs GGUF¶
Contradiction¶
| Source | Claim |
|---|---|
| AWQ paper | AWQ achieves ~98% of original quality |
| GPTQ paper | GPTQ achieves 97-99% of original quality |
| llama.cpp docs | GGUF Q4_K_M achieves ~92% quality |
Resolution¶
Different quality metrics and use cases:
| Method | Quality (4-bit) | Speed | Best For |
|---|---|---|---|
| GPTQ | 97-99% | Fast on NVIDIA | GPU serving |
| AWQ | ~98% | Fast on all GPU | Production API |
| GGUF Q4_K_M | ~92% | Variable | CPU/Mobile |
Why different numbers:
1. Different benchmarks: some use perplexity, others downstream tasks
2. Different models: larger models quantize better
3. Different calibration data: affects per-channel scaling
Conclusion: For production GPU: AWQ or GPTQ (comparable). For CPU/edge: GGUF with appropriate quantization level.
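To make the numbers above concrete, here is a naive round-to-nearest group-wise 4-bit quantizer in NumPy. It is not AWQ or GPTQ (those add activation-aware scaling and Hessian-based error correction, respectively), but it shows where per-group scales come from and why calibration data and group size shift the reported quality figures.

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit group-wise quantization of a weight row (illustrative, not AWQ/GPTQ).
    Assumes the row length is a multiple of group_size."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0           # symmetric int4 range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_int4_groupwise(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs quantization error: {err:.4f}")
```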
4. Inference Engine: vLLM vs SGLang Speed Claims¶
Contradiction¶
| Source | Claim |
|---|---|
| vLLM docs | "State-of-the-art throughput" |
| SGLang paper | "Up to 3.7× faster than vLLM" |
| Some benchmarks | vLLM faster on single requests |
Resolution¶
Performance is workload-dependent:
| Workload | Winner | Why |
|---|---|---|
| Single request | vLLM | Lower overhead, simpler path |
| Batch inference | SGLang | Better scheduling, RadixAttention |
| Agent workflows | SGLang | 3× faster due to prefix caching |
| Multi-turn chat | Similar | Both have prefix caching |
Key insight: SGLang's advantage comes from:
1. RadixAttention (token-level prefix sharing)
2. Structured output optimization
3. Disaggregated inference (for multi-node)
Conclusion: Use SGLang for high-throughput/agent workloads. vLLM for simpler serving needs.
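A toy sketch of the prefix-reuse idea behind RadixAttention-style caching: if a new request shares a prefix (e.g. a long system prompt) with a previous one, the KV entries for that prefix need not be recomputed. Real engines share KV blocks in GPU memory via a radix tree; this sketch only counts reusable tokens, and all names are illustrative.

```python
class PrefixCache:
    """Toy prefix matcher: counts how many leading tokens overlap with a cached request."""
    def __init__(self):
        self.cached = []  # previously seen token-id sequences

    def reusable_prefix_len(self, tokens):
        best = 0
        for seq in self.cached:
            n = 0
            for a, b in zip(seq, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def insert(self, tokens):
        self.cached.append(list(tokens))

cache = PrefixCache()
system_prompt = list(range(500))                  # pretend token ids of a shared system prompt
cache.insert(system_prompt + [900, 901])          # first request
reuse = cache.reusable_prefix_len(system_prompt + [950, 951])  # second request, same prefix
print(f"tokens whose KV can be reused: {reuse} of {len(system_prompt) + 2}")
```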
5. Speculative Decoding: Speedup Claims¶
Contradiction¶
| Source | Claim |
|---|---|
| EAGLE paper | 2-4× speedup |
| Some production reports | 1.3-1.8× speedup |
| vLLM blog (Dec 2025) | 1.82× on Qwen3-32B |
Resolution¶
Speedup depends on acceptance rate:
| Acceptance Rate | Speedup |
|---|---|
| 33% (Qwen3-32B) | 1.82× |
| 50% | ~2.2× |
| 65% | ~2.5× |
| 80% | ~3×+ |
Why the variance:
1. Draft model quality: better draft = higher acceptance
2. Task type: repetitive tasks = higher acceptance
3. Temperature: lower temperature = more predictable
4. Model size: smaller targets accept more
Conclusion: EAGLE-3 achieves 2-4× in ideal conditions. Real-world: expect 1.5-2.5× for most workloads.
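The dependence on acceptance rate can be approximated with the standard speculative-decoding cost model: expected accepted tokens per verification step divided by the per-cycle cost. The draft length and draft-to-target cost ratio below are assumptions, and papers define "acceptance rate" differently (per token vs per draft tree), so these numbers will not exactly match the table above; the point is the shape of the curve.

```python
def expected_speedup(alpha: float, gamma: int, c: float = 0.05) -> float:
    """Simplified speculative-decoding speedup model.
    alpha: per-token acceptance rate, gamma: draft length, c: draft/target cost ratio."""
    expected_accepted = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # tokens produced per target pass
    cost_per_cycle = gamma * c + 1                                # gamma draft passes + 1 target pass
    return expected_accepted / cost_per_cycle

for alpha in (0.33, 0.5, 0.65, 0.8):
    print(f"acceptance {alpha:.0%}: ~{expected_speedup(alpha, gamma=4):.1f}x")
```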
6. Long Context: RAG vs Long Context Models¶
Contradiction¶
| Source | Claim |
|---|---|
| Some papers | "Long context will replace RAG" |
| RAG research | "RAG remains essential" |
| Production experience | Both are needed |
Resolution¶
Cost-quality tradeoff:
| Aspect | Long Context | RAG |
|---|---|---|
| Cost ratio | 100× more expensive | Baseline |
| Fresh data | Requires retraining | Instant update |
| Accuracy | Full context | Retrieval quality dependent |
| "Lost in middle" | Yes | Can be mitigated |
Key finding (2025-2026): \(\text{Cost(Long Context)} : \text{Cost(RAG)} \approx 100 : 1\)
Best practice:
- Use RAG for retrieval (get relevant docs)
- Use long context for reasoning (process the retrieved docs)
- Hybrid approach for production
Conclusion: Long context and RAG are complementary, not competing. Use both.
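A rough cost calculation illustrating where the ~100:1 ratio comes from: stuffing the whole corpus into the context on every query versus retrieving a few chunks. The token price and sizes are illustrative assumptions, not any provider's actual rates.

```python
PRICE_PER_1K_INPUT_TOKENS = 0.003  # assumed $/1K input tokens, for illustration only

def query_cost(context_tokens: int, queries: int) -> float:
    return queries * context_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

long_context = query_cost(context_tokens=200_000, queries=1_000)  # whole corpus on every query
rag = query_cost(context_tokens=2_000, queries=1_000)             # only top-k retrieved chunks
print(f"long context: ${long_context:,.2f}, RAG: ${rag:,.2f}, "
      f"ratio ≈ {long_context / rag:.0f}:1")
```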
7. MoE Load Balancing: Auxiliary Loss vs Loss-Free¶
Contradiction¶
| Source | Claim |
|---|---|
| Standard MoE (Mixtral) | Auxiliary loss is necessary |
| DeepSeek V3 | Loss-free balancing works better |
| SIMBAL paper | Similarity-preserving is 36% faster |
Resolution¶
Evolution of techniques:
| Method | Description | Pros | Cons |
|---|---|---|---|
| Auxiliary Loss | Penalty for imbalance | Simple | Interferes with main loss |
| Loss-Free (DeepSeek V3) | Dynamic bias per expert | No interference | Complex implementation |
| SIMBAL | Similarity-preserving | 36% faster convergence | New (less proven) |
Status (Feb 2026):
- Auxiliary loss: standard, well-tested
- Loss-free: gaining adoption (DeepSeek V3 proves it works)
- SIMBAL: promising but needs more validation
Conclusion: For new projects, consider loss-free balancing. For stability, auxiliary loss is safe choice.
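For reference, a minimal sketch of the standard auxiliary load-balancing loss (the "Auxiliary Loss" row above), which pushes the router toward uniform expert utilization; tensor shapes are illustrative. DeepSeek V3's loss-free alternative instead nudges a per-expert routing bias outside of backprop, which this sketch does not implement.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch Transformer style auxiliary loss: N * sum_i f_i * P_i.
    router_logits: (num_tokens, num_experts)."""
    probs = torch.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)
    # f_i: fraction of tokens dispatched to expert i; P_i: mean router probability of expert i.
    f = torch.bincount(top1, minlength=num_experts).float() / router_logits.shape[0]
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)  # minimized (≈1.0) when routing is uniform

logits = torch.randn(1024, 8)
print(float(load_balancing_loss(logits, num_experts=8)))
```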
8. Normalization: LayerNorm vs RMSNorm¶
Contradiction¶
| Source | Claim |
|---|---|
| Original Transformer | LayerNorm is standard |
| LLaMA paper | RMSNorm is better |
| Some studies | "No significant difference" |
Resolution¶
Both are correct in different contexts:
| Aspect | LayerNorm | RMSNorm |
|---|---|---|
| Computation | \(O(d)\), with mean subtraction | \(O(d)\), no mean subtraction |
| Quality | Established | Identical or better |
| Speed | Baseline | 15-25% faster |
| Adoption (2026) | Declining | Standard in modern LLMs |
Why RMSNorm won:
1. Simpler computation (no mean subtraction)
2. Same or better quality
3. Easier to implement efficiently on GPU
Conclusion: Use RMSNorm for all new Transformer-based models. LayerNorm is legacy.
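A side-by-side sketch of the two normalizations in PyTorch, showing exactly what RMSNorm drops (mean subtraction and the bias term); epsilon values and shapes are illustrative.

```python
import torch

def layer_norm(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float = 1e-5):
    # Center, rescale to unit variance, then apply learned affine parameters.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # No mean or variance: just root-mean-square rescaling, one fewer pass over x.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

x = torch.randn(4, 4096)
w, b = torch.ones(4096), torch.zeros(4096)
print(layer_norm(x, w, b).shape, rms_norm(x, w).shape)
```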
9. CoT Faithfulness: Do Models Really Reason?¶
Contradiction¶
| Source | Claim |
|---|---|
| Original CoT paper | CoT improves reasoning |
| "CoT is a Mirage" paper | CoT reflects training bias |
| DeepSeek R1 docs | CoT enables complex reasoning |
Resolution¶
Nuanced understanding:
| Finding | Implication |
|---|---|
| CoT helps solve problems | True for many tasks |
| CoT explanations are unfaithful | Models justify answers post-hoc |
| CoT improves with size | Larger models more faithful |
Unfaithfulness rates (Jan 2026):
- GPT-4o-mini: 13%
- Claude Haiku: 7%
- DeepSeek R1: 0.37%
- Claude Sonnet (thinking): 0.04%
Conclusion: CoT improves performance but explanations may not reflect actual reasoning. Use for task solving, be skeptical of explanations.
10. FlashAttention Version Claims¶
Contradiction¶
| Source | Claim |
|---|---|
| FlashAttention-1 paper | 2-4× speedup |
| FlashAttention-2 paper | 2× faster than FA-1 |
| FlashAttention-3 paper | 1.5-2× faster than FA-2 |
Resolution¶
Cumulative improvements:
| Version | Key Innovation | Total Speedup |
|---|---|---|
| FA-1 | Memory-efficient tiling | 2-4× vs baseline |
| FA-2 | Better parallelization | 4-8× vs baseline |
| FA-3 | Hopper optimization, FP8 | 6-16× vs baseline |
Important caveat: FA-3 only works on H100/H200. For A100, FA-2 is still optimal.
Conclusion: Speedups are cumulative but hardware-dependent. Use latest version compatible with your hardware.
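A hedged helper for the hardware caveat above: pick a FlashAttention generation from the GPU's CUDA compute capability. The mapping is a simplification (FA-3 targets Hopper/sm90, FA-2 covers Ampere and Ada); check the installed flash-attn release notes for the exact support matrix.

```python
import torch

def suggest_flash_attention_version() -> str:
    """Rough mapping from compute capability to FlashAttention generation (simplified)."""
    if not torch.cuda.is_available():
        return "no CUDA GPU: use a CPU attention path"
    major, _ = torch.cuda.get_device_capability()
    if major >= 9:   # H100/H200 (Hopper)
        return "FlashAttention-3"
    if major >= 8:   # A100, RTX 30/40 series (Ampere/Ada)
        return "FlashAttention-2"
    return "older GPU: FlashAttention-2 may be unsupported; fall back to memory-efficient attention"

print(suggest_flash_attention_version())
```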
Common Misconceptions¶
Misconception: 'ALiBi is better than RoPE for long contexts'
ALiBi extrapolates without modification (1024 -> 8192+), but RoPE with scaling methods (YaRN, NTK-aware) closes this gap. Meanwhile, RoPE delivers 2-5% higher quality on standard benchmarks. The industry chose RoPE: LLaMA, Mistral, Qwen, and Gemma all use it. ALiBi remains a niche solution.
Misconception: 'SSMs have already matched Transformer quality'
The Mamba paper claims to 'match quality', but that means comparable benchmark results, not identical ones. In production, a 2-5% gap matters. That is why industry leaders use hybrids (Jamba, 1:7 Transformer:Mamba) rather than pure SSMs. Pure SSMs are justified only under strict latency constraints.
Misconception: 'Speculative decoding gives a 2-4x speedup'
That is the upper bound at an acceptance rate of 80%+. In real production with Qwen3-32B, vLLM reports a 33% acceptance rate and a 1.82x speedup. Expect 1.5-2.5x for most workloads. Speedup depends heavily on draft-model quality, task type, and temperature.
Interview Questions¶
Question: How do you resolve contradictions between different sources when making architectural decisions?¶
"I just take the most recent paper and follow its recommendations"
"You should use whatever showed the best benchmarks"
"I check four factors: (1) publication date, since newer supersedes older, (2) usage context, since different tasks have different winners, (3) industry adoption, since production choices reflect reality, (4) metrics, since benchmarks can be cherry-picked. For example, ALiBi vs RoPE: ALiBi wins on pure extrapolation, but RoPE won in industry thanks to quality plus scaling methods. A contradiction is a trade-off, not an error."
Question: RoPE or ALiBi for a new LLM project in 2026?¶
"ALiBi, because it is simpler to implement and extrapolates better"
"RoPE with YaRN scaling. Reasons: (1) higher quality on standard tasks, (2) 90%+ of LLMs use RoPE, which means better compatibility with the ecosystem (FlashAttention, CUDA kernels), (3) scaling methods have closed the extrapolation gap. I would consider ALiBi only for an edge case with zero budget for adaptation."
Question: When is a pure SSM worth using instead of a Transformer?¶
"Mamba showed that SSMs are no worse than Transformers, so you can always use them"
"A pure SSM is justified for (1) very long sequences >16K, (2) strict latency and memory constraints, (3) tasks where a 2-5% quality loss is acceptable. For production with quality requirements, a Jamba-style hybrid (1:7 Transformer:SSM) gives 99%+ of the quality at a 3x speedup."
Summary: Contradiction Resolution Framework¶
When encountering contradictions:
- Check publication date — Newer sources often supersede older
- Consider use case — Different contexts favor different solutions
- Look at adoption — Industry choices reflect production reality
- Verify with benchmarks — Numbers can be cherry-picked
- Understand trade-offs — Most "contradictions" are trade-offs
Quick Resolution Guide¶
| Topic | 2026 Recommendation |
|---|---|
| Positional Encoding | RoPE (with scaling for long context) |
| SSM vs Transformer | Hybrid (1:7) for production |
| Quantization | AWQ/GPTQ for GPU, GGUF for edge |
| Inference Engine | SGLang for agents, vLLM for simple |
| Speculative Decoding | EAGLE-3 (expect 1.5-2.5× real-world) |
| Long Context vs RAG | Both (RAG + long context reasoning) |
| MoE Load Balancing | Loss-free if possible, auxiliary loss otherwise |
| Normalization | RMSNorm |
| CoT | Use for solving, verify explanations |
| FlashAttention | Latest version for your hardware |