LLM Topic Cross-Reference Map¶
~5 min read
Prerequisites: Master preparation guide | Contradictions analysis
Type: synthesis / cross-reference. Date: February 2026
LLM interview preparation spans 10+ major topics (Attention, MoE, KV Cache, Quantization, RAG, Training...) with 30+ critical connections between them. For example, choosing GQA directly affects KV cache size (an 8x difference versus MQA), which drives the inference engine configuration (vLLM vs SGLang), which in turn affects the speculative decoding strategy. This map visualizes those dependencies and helps you build answers that demonstrate systems-level understanding.
Topic Dependency Graph¶
```mermaid
graph TD
A["Transformer<br/>Architecture"] --> B["Attention<br/>Optimization"]
A --> C["MoE<br/>Architecture"]
A --> D["Long Context<br/>Methods"]
B --> E["FlashAttention<br/>MQA/GQA"]
C --> F["Load Balancing<br/>Expert Routing"]
D --> G["Ring Attention<br/>Infini-Attn"]
E --> H["KV Cache<br/>Management"]
E --> I["Inference<br/>Engines"]
I --> H
H --> J["PagedAttention<br/>Prefix Caching<br/>RadixAttention"]
J --> K["Speculative Decoding<br/>EAGLE, MTP, NGRAM"]
I --> K
K --> L["Quantization<br/>AWQ, GPTQ, GGUF, FP8"]
style A fill:#f3e5f5,stroke:#9c27b0
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#e8eaf6,stroke:#3f51b5
style D fill:#e8eaf6,stroke:#3f51b5
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#e8f5e9,stroke:#4caf50
style H fill:#fff3e0,stroke:#ef6c00
style I fill:#fff3e0,stroke:#ef6c00
style J fill:#fff3e0,stroke:#ef6c00
style K fill:#fce4ec,stroke:#c62828
style L fill:#fce4ec,stroke:#c62828
```
1. Attention → KV Cache → Inference¶
Connection Flow¶
```mermaid
graph TD
A["Attention: requires K,V vectors"] --> B["KV Cache: stores K,V<br/>to avoid recomputation"]
B --> C["Memory problem:<br/>KV cache grows with seq length"]
C --> D["PagedAttention<br/>block-based allocation"]
C --> E["Prefix caching<br/>hash shared prefixes"]
C --> F["RadixAttention<br/>token-level radix tree"]
D --> G["vLLM:<br/>PagedAttention + prefix cache"]
E --> G
F --> H["SGLang:<br/>RadixAttention, best for agents"]
D --> I["TensorRT-LLM:<br/>CUDA-optimized attention"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#fce4ec,stroke:#c62828
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#f3e5f5,stroke:#9c27b0
style H fill:#f3e5f5,stroke:#9c27b0
style I fill:#f3e5f5,stroke:#9c27b0
```
Key Formulas Connection¶
Attention: $\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
KV Cache Memory: $M = 2 \times L \times H \times d \times T \times 2\ \text{bytes}$, where $L$ = layers, $H$ = KV heads, $d$ = head dimension, $T$ = sequence length; the leading 2 covers K and V, and 2 bytes assumes FP16/BF16.
Connection: the $K$ and $V$ matrices in attention are exactly what gets cached. Memory grows linearly with $T$ (sequence length) and with the number of KV heads $H$, which is why MQA/GQA shrink the cache.
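As a quick sanity check on the formula, a minimal sketch in Python; the Llama-70B-like dimensions (80 layers, head dim 128, 64 query heads vs 8 KV heads) are illustrative assumptions:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """M = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)  # full multi-head attention
gqa = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=4096)  # GQA with 8 KV heads
print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB, ratio: {mha // gqa}x")
# -> roughly 10.7 GB vs 1.3 GB of KV cache per 4K-token request: the 8x gap referenced above
```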
2. MoE → Load Balancing → Inference¶
Connection Flow¶
```mermaid
graph TD
A["MoE: routes tokens<br/>to subset of experts"] --> B["Problem: uneven expert<br/>utilization, rich-get-richer"]
B --> C["Auxiliary loss<br/>standard, interferes with main loss"]
B --> D["Loss-Free, DeepSeek V3<br/>dynamic bias"]
B --> E["SIMBAL<br/>similarity-preserving, 36% faster"]
C --> F["Token dropping<br/>if expert overloaded"]
D --> F
E --> F
F --> G["Latency variance per token"]
F --> H["Throughput depends<br/>on balance quality"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fce4ec,stroke:#c62828
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#fff3e0,stroke:#ef6c00
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#fff3e0,stroke:#ef6c00
```
Expert Utilization Impact¶
| Balance Quality | Throughput | Latency |
|---|---|---|
| Poor (some experts 0%) | -30% | High variance |
| Good (all ~50%) | Baseline | Stable |
| Excellent (all ~50% with low variance) | +10% | Predictable |
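To make the "auxiliary loss" branch in the diagram concrete, here is a minimal sketch of a Switch-Transformer-style balancing loss; the tensor shapes and top-k choice are illustrative assumptions:

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int, top_k: int = 1) -> torch.Tensor:
    """Switch-style auxiliary loss: N * sum_i f_i * P_i, where f_i is the fraction of
    tokens routed to expert i and P_i is the mean router probability for expert i."""
    probs = torch.softmax(router_logits, dim=-1)            # (tokens, experts)
    top_experts = probs.topk(top_k, dim=-1).indices          # (tokens, top_k)
    one_hot = torch.zeros_like(probs).scatter_(1, top_experts, 1.0)
    f = one_hot.mean(dim=0)                                   # fraction of tokens per expert
    p = probs.mean(dim=0)                                     # mean router probability per expert
    return num_experts * torch.sum(f * p)

# Perfectly uniform routing gives a loss of ~1.0; collapse onto one expert pushes it toward num_experts.
logits = torch.randn(1024, 8)
print(load_balancing_loss(logits, num_experts=8))
```

This is the "standard" approach the diagram contrasts with loss-free balancing: the gradient of this term competes with the language-modeling loss, which is exactly the interference DeepSeek V3's dynamic-bias method avoids.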
3. Long Context → Ring Attention → Distributed¶
Connection Flow¶
```mermaid
graph TD
A["Standard attention:<br/>O(T^2) memory and compute"] --> B["Problem: can't fit 100K+<br/>context on single GPU"]
B --> C["FlashAttention<br/>memory efficient, still limited"]
B --> D["Ring Attention<br/>distributed across GPUs"]
B --> E["Infini-Attention<br/>compressive memory"]
B --> F["Star Attention<br/>block-sparse, 11x faster"]
D --> G["Multi-GPU setup<br/>ring topology"]
D --> H["Inter-GPU communication"]
D --> I["Distributed framework<br/>FSDP, Megatron"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fce4ec,stroke:#c62828
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#fff3e0,stroke:#ef6c00
style I fill:#fff3e0,stroke:#ef6c00
```
Context Length Scaling¶
| Method | Max Context | Hardware | Notes |
|---|---|---|---|
| FlashAttention-3 | 128K | H100 | Single GPU |
| Ring Attention | 1M+ | 8+ GPUs | Linear scaling |
| Infini-Attention | Unlimited | Any | Quality loss |
| Star Attention | 128K | Single GPU | 11× faster |
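A minimal single-process sketch of the block-wise accumulation that Ring Attention distributes across devices; here the "ring" is just a Python loop over KV blocks, and the shapes and block count are assumptions for illustration:

```python
import numpy as np

def blockwise_attention(q, kv_blocks):
    """Online-softmax accumulation over KV blocks, as each device would do
    while K/V shards rotate around the ring. q: (Tq, d); each block: (K, V) of shape (Tb, d)."""
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)       # running row-max of scores
    l = np.zeros(q.shape[0])                # running softmax denominator
    acc = np.zeros_like(q)                  # running un-normalized output
    for k, v in kv_blocks:                  # in real Ring Attention: one step per ring hop
        s = q @ k.T / np.sqrt(d)            # scores against this block only -> no (T x T) matrix
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)
        l = l * scale + p.sum(axis=-1)
        acc = acc * scale[:, None] + p @ v
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))
k_full = rng.standard_normal((32, 16))
v_full = rng.standard_normal((32, 16))
blocks = [(k_full[i:i + 8], v_full[i:i + 8]) for i in range(0, 32, 8)]

s = q @ k_full.T / np.sqrt(16)
weights = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (weights / weights.sum(axis=-1, keepdims=True)) @ v_full
assert np.allclose(blockwise_attention(q, blocks), ref)   # block-wise result equals full attention
```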
4. Speculative Decoding → Quantization → Inference¶
Connection Flow¶
```mermaid
graph TD
A["Speculative decoding<br/>adds draft model"] --> B["Draft model needs<br/>extra memory"]
B --> C["Solution: quantize draft<br/>model more aggressively"]
C --> D["EAGLE-3 draft model"]
D --> E["10K vocabulary<br/>vs 128K full"]
D --> F["Single transformer layer"]
D --> G["Can be further quantized"]
E --> H["Combined speedup:<br/>2-4x speed + 75% memory reduction"]
F --> H
G --> H
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fce4ec,stroke:#c62828
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#fff3e0,stroke:#ef6c00
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#f3e5f5,stroke:#9c27b0
```
Quantization + Speculative Speedup¶
| Configuration | Memory | Speedup |
|---|---|---|
| BF16, no spec | 100% | Baseline |
| BF16 + EAGLE-3 | 105% | 2.5× |
| AWQ-4 + EAGLE-3 | 30% | 2.2× |
| GGUF Q4 + NGRAM | 25% | 1.3× |
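A minimal greedy-verification sketch of the draft-then-verify loop; the draft_next/target_next callables and the acceptance rule are simplified assumptions (production systems verify in one batched forward pass and use rejection sampling over full distributions):

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """One speculative-decoding step: the draft proposes k tokens, the target
    verifies them position by position and keeps the longest agreeing prefix."""
    draft_tokens = []
    ctx = list(prefix)
    for _ in range(k):                       # cheap draft pass, k tokens ahead
        t = draft_next(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:                   # target verification
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)        # first disagreement: take the target's token and stop
            break
    else:
        accepted.append(target_next(ctx))    # all k accepted: target still emits one bonus token
    return prefix + accepted
```

Every step yields at least one target-quality token and up to k+1 when the draft agrees, which is where the 2-4× speedups in the table come from; a smaller, more aggressively quantized draft only changes the acceptance rate, not the output quality.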
5. SSM → Hybrid Architecture → Inference¶
Connection Flow¶
```mermaid
graph TD
A["State Space Models, Mamba<br/>O(T) complexity"] --> B["Pros: fast inference,<br/>constant memory"]
A --> C["Cons: lower quality<br/>than Transformers"]
B --> D["Jamba<br/>1:7 Transformer:Mamba"]
C --> D
B --> E["Bamba<br/>similar ratio"]
B --> F["Nemotron-H<br/>NVIDIA"]
D --> G["Need both attention<br/>and SSM kernels"]
D --> H["Memory = KV cache<br/>+ SSM state"]
D --> I["Different optimal<br/>batch sizes"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#fce4ec,stroke:#c62828
style D fill:#f3e5f5,stroke:#9c27b0
style E fill:#f3e5f5,stroke:#9c27b0
style F fill:#f3e5f5,stroke:#9c27b0
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#fff3e0,stroke:#ef6c00
style I fill:#fff3e0,stroke:#ef6c00
```
Hybrid vs Pure Comparison¶
| Architecture | Quality | Inference Speed | Memory |
|---|---|---|---|
| Pure Transformer | 100% | Baseline | Grows with T |
| Pure Mamba | 95-98% | 5× faster | Fixed |
| Hybrid (1:7) | 99%+ | 3× faster | Reduced |
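A minimal sketch of how a 1:7 attention:SSM interleaving might be declared; the block size of 8 and layer names are assumptions for illustration (Jamba, Bamba, and Nemotron-H each use their own exact pattern):

```python
def hybrid_layer_plan(num_layers: int = 32, attention_every: int = 8) -> list:
    """One attention layer per `attention_every` layers, the rest Mamba (SSM) blocks."""
    return ["attention" if i % attention_every == attention_every - 1 else "mamba"
            for i in range(num_layers)]

plan = hybrid_layer_plan()
print(plan[:8])                                   # seven 'mamba' layers, then one 'attention'
print(plan.count("attention"), "attention layers,", plan.count("mamba"), "mamba layers")
# Memory note: only the 4 attention layers contribute KV cache; the 28 Mamba layers keep a
# fixed-size recurrent state, which is why hybrid memory stays low at long context.
```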
6. RAG → Embeddings → Vector DB → LLM¶
Connection Flow¶
```mermaid
graph TD
A["Query needs context"] --> B["Embed query<br/>same model as indexed docs"]
B --> C["Vector similarity search<br/>in Vector DB"]
C --> D["Top-K chunks retrieved"]
D --> E["Optional: rerank<br/>with cross-encoder"]
E --> F["LLM generates<br/>with context"]
F --> G["Retrieved context adds<br/>to prompt length"]
F --> H["Longer context =<br/>more KV cache"]
F --> I["Prefix caching helps<br/>with repeated docs"]
F --> J["Semantic caching<br/>reduces LLM calls"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#f3e5f5,stroke:#9c27b0
style G fill:#fce4ec,stroke:#c62828
style H fill:#fce4ec,stroke:#c62828
style I fill:#e8f5e9,stroke:#4caf50
style J fill:#e8f5e9,stroke:#4caf50
```
End-to-End Latency¶
| Stage | Typical Time | Optimization |
|---|---|---|
| Embedding | 5-10ms | Batch, smaller model |
| Vector search | 5-20ms | HNSW, smaller dimension |
| Reranking | 20-50ms | Skip for simple queries |
| LLM generation | 50-500ms | Speculative, quantization |
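A minimal end-to-end sketch of the retrieve-then-generate flow above; embed_fn, rerank_fn, and llm_fn are placeholder callables, not a specific library's API, and brute-force cosine search stands in for an ANN index like HNSW:

```python
import numpy as np

def rag_answer(query: str,
               doc_texts: list,
               doc_vectors: np.ndarray,     # (num_docs, dim), pre-computed with the SAME embed_fn
               embed_fn, llm_fn,
               rerank_fn=None, top_k: int = 5) -> str:
    q = embed_fn(query)                                          # 1. embed the query with the indexing model
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-8)
    candidates = np.argsort(-sims)[:top_k * 2]                   # 2. vector search (HNSW in practice)
    if rerank_fn is not None:                                    # 3. optional cross-encoder rerank
        candidates = sorted(candidates, key=lambda i: -rerank_fn(query, doc_texts[i]))
    chunks = [doc_texts[i] for i in candidates[:top_k]]
    prompt = "Context:\n" + "\n---\n".join(chunks) + f"\n\nQuestion: {query}\nAnswer:"
    return llm_fn(prompt)                                        # 4. generate; the prompt (and KV cache) grew by top_k chunks
```

The last step is where the KV cache connection shows up: every retrieved chunk lengthens the prompt, so prefix caching of the shared system prompt and frequently retrieved documents directly reduces prefill cost.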
7. Training → Fine-tuning → Inference¶
Connection Flow¶
```mermaid
graph TD
A["Pre-training:<br/>full model on large corpus"] --> B["Full fine-tuning<br/>all params"]
A --> C["LoRA<br/>low-rank adapters"]
A --> D["QLoRA<br/>quantized + LoRA"]
B --> E["LoRA adds small overhead<br/>adapter merge"]
C --> E
D --> E
C --> F["Multi-LoRA serving<br/>adapter switching"]
D --> G["Quantized base + LoRA<br/>= efficient serving"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#f3e5f5,stroke:#9c27b0
style F fill:#f3e5f5,stroke:#9c27b0
style G fill:#f3e5f5,stroke:#9c27b0
```
LoRA Serving Patterns¶
Single LoRA: Merge into base weights (no overhead)
Multi-LoRA: Keep adapters separate, switch per request
Batched Multi: vLLM/SGLang support dynamic adapter loading
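A minimal sketch of the single-LoRA merge pattern; the shapes follow the standard LoRA formulation W' = W + (alpha/r)·B·A, and the dimensions are illustrative:

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray, alpha: float, r: int) -> np.ndarray:
    """Fold a LoRA adapter into the base weight: W' = W + (alpha / r) * B @ A.
    After merging, serving has zero adapter overhead (the single-LoRA pattern above)."""
    return W + (alpha / r) * (B @ A)

d_out, d_in, r = 4096, 4096, 16
W = np.random.randn(d_out, d_in).astype(np.float32)
A = np.random.randn(r, d_in).astype(np.float32) * 0.01   # (r, d_in), trained
B = np.zeros((d_out, r), dtype=np.float32)                # (d_out, r), zero-initialized at the start of training
W_merged = merge_lora(W, A, B, alpha=32, r=r)
# Multi-LoRA serving keeps A/B separate instead and computes
# x @ W.T + (alpha / r) * (x @ A.T) @ B.T per request, switching adapters dynamically.
```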
8. Cross-Cutting Concerns¶
Memory Optimization Techniques¶
| Technique | Where Applied | Reduction |
|---|---|---|
| Quantization | Model weights | 50-75% |
| KV Cache optimization | Attention | 60-80% waste → <4% |
| Activation checkpointing | Training | 50-70% |
| Gradient accumulation | Training | Simulates larger batch |
| Prefix caching | Inference | 50-80% for repeated prefixes |
Speed Optimization Techniques¶
| Technique | Where Applied | Speedup |
|---|---|---|
| FlashAttention | Attention | 2-4× |
| Speculative decoding | Inference | 1.8-4× |
| CUDA graphs | Inference | 1.2-1.5× |
| torch.compile | Training/Inference | 1.1-1.3× |
| Continuous batching | Inference | 2-3× |
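As one concrete example of the last two rows, a hedged sketch of enabling compilation with CUDA-graph capture in PyTorch; the model and input are placeholders, and `mode="reduce-overhead"` is the torch.compile mode that uses CUDA graphs to cut per-step launch overhead:

```python
import torch

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).cuda().eval()
# "reduce-overhead" compiles the model and captures CUDA graphs, which mostly helps
# small-batch, decode-style workloads where kernel launch overhead dominates.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 128, 512, device="cuda")
with torch.no_grad():
    for _ in range(3):       # first calls trigger compilation / graph capture (slow warmup)
        _ = compiled(x)
    out = compiled(x)        # steady-state calls replay the captured graph
```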
9. Decision Trees¶
"I need to deploy an LLM for inference"¶
```
Start
│
├── Memory constrained?
│ ├── Yes → Quantization (AWQ/GGUF)
│ └── No → BF16 or FP16
│
├── High throughput needed?
│ ├── Yes → SGLang + EAGLE-3 + RadixAttention
│ └── No → vLLM is simpler
│
├── Long context (>32K)?
│ ├── Yes → Ring Attention or FlashAttention-3
│ └── No → Standard attention
│
├── Multi-turn conversations?
│ ├── Yes → Prefix caching + KV cache optimization
│ └── No → Standard serving
│
└── Agent workflows?
├── Yes → SGLang with RadixAttention
    └── No → vLLM or TensorRT-LLM
```
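The same deployment tree expressed as a tiny checklist helper; the thresholds and recommendation strings simply restate the branches above, not a definitive rule:

```python
def deployment_recommendations(memory_constrained: bool,
                               high_throughput: bool,
                               max_context: int,
                               multi_turn: bool,
                               agent_workflows: bool) -> list:
    recs = []
    recs.append("Quantization (AWQ/GGUF)" if memory_constrained else "BF16 or FP16 weights")
    recs.append("SGLang + EAGLE-3 + RadixAttention" if high_throughput else "vLLM (simpler default)")
    if max_context > 32_000:
        recs.append("Ring Attention or FlashAttention-3 for long context")
    if multi_turn:
        recs.append("Prefix caching + KV cache optimization")
    if agent_workflows:
        recs.append("SGLang with RadixAttention")
    return recs

print(deployment_recommendations(True, True, 64_000, True, False))
```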
"I need to fine-tune an LLM"¶
```
Start
│
├── Full GPU cluster available?
│ ├── Yes → Full fine-tuning or FSDP2
│ └── No → LoRA or QLoRA
│
├── Quality critical?
│ ├── Yes → Full fine-tuning or high-rank LoRA (r=64)
│ └── No → LoRA (r=16) sufficient
│
├── Many tasks/domains?
│ ├── Yes → Multiple LoRA adapters
│ └── No → Single fine-tune
│
└── Budget constrained?
├── Yes → QLoRA (4-bit base + LoRA)
    └── No → BF16 LoRA or full fine-tune
```
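For the QLoRA branch, a hedged configuration sketch using the Hugging Face peft/bitsandbytes stack; the model id and hyperparameters (r=16, alpha=32, target_modules) are illustrative choices, not fixed requirements:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base model: the "quantized base" half of QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",               # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections: the trainable, low-rank half.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()            # typically well under 1% of total parameters
```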
10. Topic Quick Links¶
| Topic | Key Concepts | Related Files |
|---|---|---|
| Attention | FlashAttention, MQA/GQA | flash-attention-v2-v3.md, mqa-gqa-attention.md |
| KV Cache | PagedAttention, RadixAttention | kv-cache-optimization-2025-2026.md, vllm-paged-attention.md |
| MoE | Expert routing, load balancing | moe-advances-2025-2026.md, moe-load-balancing-2025-2026.md |
| Quantization | AWQ, GPTQ, GGUF, FP8 | llm-quantization-2025-2026.md, gptq-awq-gguf-quantization.md |
| Speculative | EAGLE-3, MTP, NGRAM | speculative-decoding-2025-2026.md |
| Long Context | Ring Attention, Infini-Attention | long-context-2025-2026.md, rope-long-context.md |
| SSM | Mamba, RWKV, hybrids | state-space-models-2025-2026.md, xlstm-architecture-2025-2026.md |
| Inference | vLLM, SGLang, TensorRT-LLM | inference-engines-comparison-2025-2026.md |
| RAG | Embeddings, Vector DB, Retrieval | rag-system-design-2025-2026.md, advanced-rag-patterns-2025.md |
| Training | FSDP2, ZeRO-3, LoRA | distributed-training-comparison.md, lora-qlora-implementation-2025.md |
Common Misconceptions¶
Misconception: 'The KV cache is just a cache and doesn't affect architecture choices'
The KV cache is the central bottleneck of the entire inference chain. For a Llama-70B-scale model without GQA (full multi-head attention), a single request at 4K context needs ~10 GB of KV cache alone; GQA with 8 KV heads cuts that roughly 8x. The MQA vs GQA choice therefore determines cache size, which drives batch size, throughput, and the choice of inference engine. PagedAttention reduces cache waste from 60-80% to under 4%.
Misconception: 'Quantization and speculative decoding are independent optimizations'
They are tightly coupled. The EAGLE-3 draft model can be quantized more aggressively (it uses a 10K draft vocabulary versus the 128K full vocabulary). Combining AWQ 4-bit with EAGLE-3 gives 30% of baseline memory at a 2.2x speedup. An optimal serving strategy always considers both techniques together.
Misconception: 'RAG only adds retrieval latency'
RAG affects the entire inference chain: retrieved context increases prompt length, which means more KV cache, more memory, a smaller batch size, and lower throughput. Optimizations (prefix caching for repeated documents, semantic caching for repeated queries) cut end-to-end latency by 50-80%.
Interview Questions¶
Question: How are attention optimization and the inference engine connected?¶
Weak answer: "FlashAttention speeds up attention, the inference engine serves the model -- they are independent."
Strong answer: "They are directly connected through the KV cache. FlashAttention/MQA/GQA determine the KV cache size per request. The inference engine (vLLM vs SGLang) implements the management of that cache: PagedAttention cuts waste from 60-80% to under 4%, and RadixAttention adds token-level sharing for agent workloads. GQA plus SGLang with RadixAttention is an optimal combination for multi-turn chat."
Question: Describe the Quantization -> Speculative Decoding -> Inference chain¶
Weak answer: "Quantization compresses the model, speculative decoding speeds up generation, and the inference engine serves it all."
Strong answer: "The three techniques form a pipeline: (1) quantization (AWQ 4-bit) shrinks the base model to ~25% of its memory, (2) the EAGLE-3 draft model with its 10K vocabulary adds only ~5% overhead, (3) SGLang coordinates verification batching. Result: AWQ-4 + EAGLE-3 gives 30% memory at a 2.2x speedup. Without understanding these connections you leave 40%+ of the achievable gain on the table."
Question: Why does RAG require understanding KV cache management?¶
Weak answer: "RAG is about retrieval, the KV cache is about inference -- they're unrelated."
Strong answer: "Retrieved context directly increases prompt length, which grows the KV cache. For 10 chunks of 512 tokens that is +5K extra tokens, which for large models means additional gigabytes of KV cache. Prefix caching is critical: repeated system prompts and base documents are cached, saving 50-80% of compute. Semantic caching at the query level additionally reduces LLM calls by 50-80%."