LLM Topic Cross-Reference Map¶
~5 min read
Prerequisites: Master preparation guide | Contradictions analysis
Type: synthesis / cross-reference. Date: February 2026
LLM interview preparation spans 10+ major topics (Attention, MoE, KV Cache, Quantization, RAG, Training...) with 30+ critical connections between them. For example, choosing GQA directly affects KV cache size (an 8x difference versus MQA), which drives the inference engine configuration (vLLM vs SGLang), which in turn affects the speculative decoding strategy. This map visualizes those dependencies and helps you build answers that demonstrate systems-level understanding.
Topic Dependency Graph¶
```mermaid
graph TD
A["Transformer<br/>Architecture"] --> B["Attention<br/>Optimization"]
A --> C["MoE<br/>Architecture"]
A --> D["Long Context<br/>Methods"]
B --> E["FlashAttention<br/>MQA/GQA"]
C --> F["Load Balancing<br/>Expert Routing"]
D --> G["Ring Attention<br/>Infini-Attn"]
E --> H["KV Cache<br/>Management"]
E --> I["Inference<br/>Engines"]
I --> H
H --> J["PagedAttention<br/>Prefix Caching<br/>RadixAttention"]
J --> K["Speculative Decoding<br/>EAGLE, MTP, NGRAM"]
I --> K
K --> L["Quantization<br/>AWQ, GPTQ, GGUF, FP8"]
style A fill:#f3e5f5,stroke:#9c27b0
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#e8eaf6,stroke:#3f51b5
style D fill:#e8eaf6,stroke:#3f51b5
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#e8f5e9,stroke:#4caf50
style H fill:#fff3e0,stroke:#ef6c00
style I fill:#fff3e0,stroke:#ef6c00
style J fill:#fff3e0,stroke:#ef6c00
style K fill:#fce4ec,stroke:#c62828
style L fill:#fce4ec,stroke:#c62828
```
1. Attention → KV Cache → Inference¶
Connection Flow¶
```mermaid
graph TD
A["Attention: requires K,V vectors"] --> B["KV Cache: stores K,V<br/>to avoid recomputation"]
B --> C["Memory problem:<br/>KV cache grows with seq length"]
C --> D["PagedAttention<br/>block-based allocation"]
C --> E["Prefix caching<br/>hash shared prefixes"]
C --> F["RadixAttention<br/>token-level radix tree"]
D --> G["vLLM:<br/>PagedAttention + prefix cache"]
E --> G
F --> H["SGLang:<br/>RadixAttention, best for agents"]
D --> I["TensorRT-LLM:<br/>CUDA-optimized attention"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#fce4ec,stroke:#c62828
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#f3e5f5,stroke:#9c27b0
style H fill:#f3e5f5,stroke:#9c27b0
style I fill:#f3e5f5,stroke:#9c27b0
```
Key Formulas Connection¶
Attention: $\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
KV Cache Memory: $M = 2 \times L \times H \times d \times T \times 2\ \text{bytes}$, where $L$ = layers, $H$ = KV heads, $d$ = head dimension, $T$ = sequence length; the leading 2 covers K and V, and 2 bytes assumes FP16/BF16.
Connection: the $K$ and $V$ matrices in attention are exactly what gets cached. Memory grows linearly with $T$ (sequence length) and with the number of KV heads $H$, which is why MQA/GQA shrink the cache.
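As a quick sanity check on the formula, a minimal sketch in Python; the Llama-70B-like dimensions (80 layers, head dim 128, 64 query heads vs 8 KV heads) are illustrative assumptions:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """M = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)  # full multi-head attention
gqa = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=4096)  # GQA with 8 KV heads
print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB, ratio: {mha // gqa}x")
# -> roughly 10.7 GB vs 1.3 GB of KV cache per 4K-token request: the 8x gap referenced above
```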
2. MoE → Load Balancing → Inference¶
Connection Flow¶
```mermaid
graph TD
A["MoE: routes tokens<br/>to subset of experts"] --> B["Problem: uneven expert<br/>utilization, rich-get-richer"]
B --> C["Auxiliary loss<br/>standard, interferes with main loss"]
B --> D["Loss-Free, DeepSeek V3<br/>dynamic bias"]
B --> E["SIMBAL<br/>similarity-preserving, 36% faster"]
C --> F["Token dropping<br/>if expert overloaded"]
D --> F
E --> F
F --> G["Latency variance per token"]
F --> H["Throughput depends<br/>on balance quality"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fce4ec,stroke:#c62828
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#fff3e0,stroke:#ef6c00
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#fff3e0,stroke:#ef6c00
```
Expert Utilization Impact¶
| Balance Quality | Throughput | Latency |
|---|---|---|
| Poor (some experts 0%) | -30% | High variance |
| Good (all ~50%) | Baseline | Stable |
| Excellent (all ~50% with low variance) | +10% | Predictable |
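To make the "auxiliary loss" branch in the diagram concrete, here is a minimal sketch of a Switch-Transformer-style balancing loss; the tensor shapes and top-k choice are illustrative assumptions:

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int, top_k: int = 1) -> torch.Tensor:
    """Switch-style auxiliary loss: N * sum_i f_i * P_i, where f_i is the fraction of
    tokens routed to expert i and P_i is the mean router probability for expert i."""
    probs = torch.softmax(router_logits, dim=-1)            # (tokens, experts)
    top_experts = probs.topk(top_k, dim=-1).indices          # (tokens, top_k)
    one_hot = torch.zeros_like(probs).scatter_(1, top_experts, 1.0)
    f = one_hot.mean(dim=0)                                   # fraction of tokens per expert
    p = probs.mean(dim=0)                                     # mean router probability per expert
    return num_experts * torch.sum(f * p)

# Perfectly uniform routing gives a loss of ~1.0; collapse onto one expert pushes it toward num_experts.
logits = torch.randn(1024, 8)
print(load_balancing_loss(logits, num_experts=8))
```

This is the "standard" approach the diagram contrasts with loss-free balancing: the gradient of this term competes with the language-modeling loss, which is exactly the interference DeepSeek V3's dynamic-bias method avoids.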
3. Long Context → Ring Attention → Distributed¶
Connection Flow¶
```mermaid
graph TD
A["Standard attention:<br/>O(T^2) memory and compute"] --> B["Problem: can't fit 100K+<br/>context on single GPU"]
B --> C["FlashAttention<br/>memory efficient, still limited"]
B --> D["Ring Attention<br/>distributed across GPUs"]
B --> E["Infini-Attention<br/>compressive memory"]
B --> F["Star Attention<br/>block-sparse, 11x faster"]
D --> G["Multi-GPU setup<br/>ring topology"]
D --> H["Inter-GPU communication"]
D --> I["Distributed framework<br/>FSDP, Megatron"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fce4ec,stroke:#c62828
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#fff3e0,stroke:#ef6c00
style I fill:#fff3e0,stroke:#ef6c00
```
Context Length Scaling¶
| Method | Max Context | Hardware | Notes |
|---|---|---|---|
| FlashAttention-3 | 128K | H100 | Single GPU |
| Ring Attention | 1M+ | 8+ GPUs | Linear scaling |
| Infini-Attention | Unlimited | Any | Quality loss |
| Star Attention | 128K | Single GPU | 11× faster |
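A minimal single-process sketch of the block-wise accumulation that Ring Attention distributes across devices; here the "ring" is just a Python loop over KV blocks, and the shapes and block count are assumptions for illustration:

```python
import numpy as np

def blockwise_attention(q, kv_blocks):
    """Online-softmax accumulation over KV blocks, as each device would do
    while K/V shards rotate around the ring. q: (Tq, d); each block: (K, V) of shape (Tb, d)."""
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)       # running row-max of scores
    l = np.zeros(q.shape[0])                # running softmax denominator
    acc = np.zeros_like(q)                  # running un-normalized output
    for k, v in kv_blocks:                  # in real Ring Attention: one step per ring hop
        s = q @ k.T / np.sqrt(d)            # scores against this block only -> no (T x T) matrix
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)
        l = l * scale + p.sum(axis=-1)
        acc = acc * scale[:, None] + p @ v
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))
k_full = rng.standard_normal((32, 16))
v_full = rng.standard_normal((32, 16))
blocks = [(k_full[i:i + 8], v_full[i:i + 8]) for i in range(0, 32, 8)]

s = q @ k_full.T / np.sqrt(16)
weights = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (weights / weights.sum(axis=-1, keepdims=True)) @ v_full
assert np.allclose(blockwise_attention(q, blocks), ref)   # block-wise result equals full attention
```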
4. Speculative Decoding → Quantization → Inference¶
Connection Flow¶
```mermaid
graph TD
A["Speculative decoding<br/>adds draft model"] --> B["Draft model needs<br/>extra memory"]
B --> C["Solution: quantize draft<br/>model more aggressively"]
C --> D["EAGLE-3 draft model"]
D --> E["10K vocabulary<br/>vs 128K full"]
D --> F["Single transformer layer"]
D --> G["Can be further quantized"]
E --> H["Combined speedup:<br/>2-4x speed + 75% memory reduction"]
F --> H
G --> H
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fce4ec,stroke:#c62828
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#fff3e0,stroke:#ef6c00
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#f3e5f5,stroke:#9c27b0
```
Quantization + Speculative Speedup¶
| Configuration | Memory | Speedup |
|---|---|---|
| BF16, no spec | 100% | Baseline |
| BF16 + EAGLE-3 | 105% | 2.5× |
| AWQ-4 + EAGLE-3 | 30% | 2.2× |
| GGUF Q4 + NGRAM | 25% | 1.3× |
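A minimal greedy-verification sketch of the draft-then-verify loop; the draft_next/target_next callables and the acceptance rule are simplified assumptions (production systems verify in one batched forward pass and use rejection sampling over full distributions):

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """One speculative-decoding step: the draft proposes k tokens, the target
    verifies them position by position and keeps the longest agreeing prefix."""
    draft_tokens = []
    ctx = list(prefix)
    for _ in range(k):                       # cheap draft pass, k tokens ahead
        t = draft_next(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:                   # target verification
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)        # first disagreement: take the target's token and stop
            break
    else:
        accepted.append(target_next(ctx))    # all k accepted: target still emits one bonus token
    return prefix + accepted
```

Every step yields at least one target-quality token and up to k+1 when the draft agrees, which is where the 2-4× speedups in the table come from; a smaller, more aggressively quantized draft only changes the acceptance rate, not the output quality.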
5. SSM → Hybrid Architecture → Inference¶
Connection Flow¶
```mermaid
graph TD
A["State Space Models, Mamba<br/>O(T) complexity"] --> B["Pros: fast inference,<br/>constant memory"]
A --> C["Cons: lower quality<br/>than Transformers"]
B --> D["Jamba<br/>1:7 Transformer:Mamba"]
C --> D
B --> E["Bamba<br/>similar ratio"]
B --> F["Nemotron-H<br/>NVIDIA"]
D --> G["Need both attention<br/>and SSM kernels"]
D --> H["Memory = KV cache<br/>+ SSM state"]
D --> I["Different optimal<br/>batch sizes"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#fce4ec,stroke:#c62828
style D fill:#f3e5f5,stroke:#9c27b0
style E fill:#f3e5f5,stroke:#9c27b0
style F fill:#f3e5f5,stroke:#9c27b0
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#fff3e0,stroke:#ef6c00
style I fill:#fff3e0,stroke:#ef6c00
```
Hybrid vs Pure Comparison¶
| Architecture | Quality | Inference Speed | Memory |
|---|---|---|---|
| Pure Transformer | 100% | Baseline | Grows with T |
| Pure Mamba | 95-98% | 5× faster | Fixed |
| Hybrid (1:7) | 99%+ | 3× faster | Reduced |
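A minimal sketch of how a 1:7 attention:SSM interleaving might be declared; the block size of 8 and layer names are assumptions for illustration (Jamba, Bamba, and Nemotron-H each use their own exact pattern):

```python
def hybrid_layer_plan(num_layers: int = 32, attention_every: int = 8) -> list:
    """One attention layer per `attention_every` layers, the rest Mamba (SSM) blocks."""
    return ["attention" if i % attention_every == attention_every - 1 else "mamba"
            for i in range(num_layers)]

plan = hybrid_layer_plan()
print(plan[:8])                                   # seven 'mamba' layers, then one 'attention'
print(plan.count("attention"), "attention layers,", plan.count("mamba"), "mamba layers")
# Memory note: only the 4 attention layers contribute KV cache; the 28 Mamba layers keep a
# fixed-size recurrent state, which is why hybrid memory stays low at long context.
```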
6. RAG → Embeddings → Vector DB → LLM¶
Connection Flow¶
```mermaid
graph TD
A["Query needs context"] --> B["Embed query<br/>same model as indexed docs"]
B --> C["Vector similarity search<br/>in Vector DB"]
C --> D["Top-K chunks retrieved"]
D --> E["Optional: rerank<br/>with cross-encoder"]
E --> F["LLM generates<br/>with context"]
F --> G["Retrieved context adds<br/>to prompt length"]
F --> H["Longer context =<br/>more KV cache"]
F --> I["Prefix caching helps<br/>with repeated docs"]
F --> J["Semantic caching<br/>reduces LLM calls"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#f3e5f5,stroke:#9c27b0
style G fill:#fce4ec,stroke:#c62828
style H fill:#fce4ec,stroke:#c62828
style I fill:#e8f5e9,stroke:#4caf50
style J fill:#e8f5e9,stroke:#4caf50
```
End-to-End Latency¶
| Stage | Typical Time | Optimization |
|---|---|---|
| Embedding | 5-10ms | Batch, smaller model |
| Vector search | 5-20ms | HNSW, smaller dimension |
| Reranking | 20-50ms | Skip for simple queries |
| LLM generation | 50-500ms | Speculative, quantization |
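A minimal end-to-end sketch of the retrieve-then-generate flow above; embed_fn, rerank_fn, and llm_fn are placeholder callables, not a specific library's API, and brute-force cosine search stands in for an ANN index like HNSW:

```python
import numpy as np

def rag_answer(query: str,
               doc_texts: list,
               doc_vectors: np.ndarray,     # (num_docs, dim), pre-computed with the SAME embed_fn
               embed_fn, llm_fn,
               rerank_fn=None, top_k: int = 5) -> str:
    q = embed_fn(query)                                          # 1. embed the query with the indexing model
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-8)
    candidates = np.argsort(-sims)[:top_k * 2]                   # 2. vector search (HNSW in practice)
    if rerank_fn is not None:                                    # 3. optional cross-encoder rerank
        candidates = sorted(candidates, key=lambda i: -rerank_fn(query, doc_texts[i]))
    chunks = [doc_texts[i] for i in candidates[:top_k]]
    prompt = "Context:\n" + "\n---\n".join(chunks) + f"\n\nQuestion: {query}\nAnswer:"
    return llm_fn(prompt)                                        # 4. generate; the prompt (and KV cache) grew by top_k chunks
```

The last step is where the KV cache connection shows up: every retrieved chunk lengthens the prompt, so prefix caching of the shared system prompt and frequently retrieved documents directly reduces prefill cost.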
7. Training → Fine-tuning → Inference¶
Connection Flow¶
```mermaid
graph TD
A["Pre-training:<br/>full model on large corpus"] --> B["Full fine-tuning<br/>all params"]
A --> C["LoRA<br/>low-rank adapters"]
A --> D["QLoRA<br/>quantized + LoRA"]
B --> E["LoRA adds small overhead<br/>adapter merge"]
C --> E
D --> E
C --> F["Multi-LoRA serving<br/>adapter switching"]
D --> G["Quantized base + LoRA<br/>= efficient serving"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#f3e5f5,stroke:#9c27b0
style F fill:#f3e5f5,stroke:#9c27b0
style G fill:#f3e5f5,stroke:#9c27b0
```
LoRA Serving Patterns¶
Single LoRA: Merge into base weights (no overhead)
Multi-LoRA: Keep adapters separate, switch per request
Batched Multi: vLLM/SGLang support dynamic adapter loading
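A minimal sketch of the single-LoRA merge pattern; the shapes follow the standard LoRA formulation W' = W + (alpha/r)·B·A, and the dimensions are illustrative:

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray, alpha: float, r: int) -> np.ndarray:
    """Fold a LoRA adapter into the base weight: W' = W + (alpha / r) * B @ A.
    After merging, serving has zero adapter overhead (the single-LoRA pattern above)."""
    return W + (alpha / r) * (B @ A)

d_out, d_in, r = 4096, 4096, 16
W = np.random.randn(d_out, d_in).astype(np.float32)
A = np.random.randn(r, d_in).astype(np.float32) * 0.01   # (r, d_in), trained
B = np.zeros((d_out, r), dtype=np.float32)                # (d_out, r), zero-initialized at the start of training
W_merged = merge_lora(W, A, B, alpha=32, r=r)
# Multi-LoRA serving keeps A/B separate instead and computes
# x @ W.T + (alpha / r) * (x @ A.T) @ B.T per request, switching adapters dynamically.
```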
8. Cross-Cutting Concerns¶
Memory Optimization Techniques¶
| Technique | Where Applied | Reduction |
|---|---|---|
| Quantization | Model weights | 50-75% |
| KV Cache optimization | Attention | 60-80% waste → <4% |
| Activation checkpointing | Training | 50-70% |
| Gradient accumulation | Training | Simulates larger batch |
| Prefix caching | Inference | 50-80% for repeated prefixes |
Speed Optimization Techniques¶
| Technique | Where Applied | Speedup |
|---|---|---|
| FlashAttention | Attention | 2-4× |
| Speculative decoding | Inference | 1.8-4× |
| CUDA graphs | Inference | 1.2-1.5× |
| torch.compile | Training/Inference | 1.1-1.3× |
| Continuous batching | Inference | 2-3× |
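As one concrete example of the last two rows, a hedged sketch of enabling compilation with CUDA-graph capture in PyTorch; the model and input are placeholders, and `mode="reduce-overhead"` is the torch.compile mode that uses CUDA graphs to cut per-step launch overhead:

```python
import torch

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).cuda().eval()
# "reduce-overhead" compiles the model and captures CUDA graphs, which mostly helps
# small-batch, decode-style workloads where kernel launch overhead dominates.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 128, 512, device="cuda")
with torch.no_grad():
    for _ in range(3):       # first calls trigger compilation / graph capture (slow warmup)
        _ = compiled(x)
    out = compiled(x)        # steady-state calls replay the captured graph
```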
9. Decision Trees¶
"I need to deploy an LLM for inference"¶
```
Start
│
├── Memory constrained?
│ ├── Yes → Quantization (AWQ/GGUF)
│ └── No → BF16 or FP16
│
├── High throughput needed?
│ ├── Yes → SGLang + EAGLE-3 + RadixAttention
│ └── No → vLLM is simpler
│
├── Long context (>32K)?
│ ├── Yes → Ring Attention or FlashAttention-3
│ └── No → Standard attention
│
├── Multi-turn conversations?
│ ├── Yes → Prefix caching + KV cache optimization
│ └── No → Standard serving
│
└── Agent workflows?
├── Yes → SGLang with RadixAttention
    └── No → vLLM or TensorRT-LLM
```
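The same deployment tree expressed as a tiny checklist helper; the thresholds and recommendation strings simply restate the branches above, not a definitive rule:

```python
def deployment_recommendations(memory_constrained: bool,
                               high_throughput: bool,
                               max_context: int,
                               multi_turn: bool,
                               agent_workflows: bool) -> list:
    recs = []
    recs.append("Quantization (AWQ/GGUF)" if memory_constrained else "BF16 or FP16 weights")
    recs.append("SGLang + EAGLE-3 + RadixAttention" if high_throughput else "vLLM (simpler default)")
    if max_context > 32_000:
        recs.append("Ring Attention or FlashAttention-3 for long context")
    if multi_turn:
        recs.append("Prefix caching + KV cache optimization")
    if agent_workflows:
        recs.append("SGLang with RadixAttention")
    return recs

print(deployment_recommendations(True, True, 64_000, True, False))
```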
"I need to fine-tune an LLM"¶
```
Start
│
├── Full GPU cluster available?
│ ├── Yes → Full fine-tuning or FSDP2
│ └── No → LoRA or QLoRA
│
├── Quality critical?
│ ├── Yes → Full fine-tuning or high-rank LoRA (r=64)
│ └── No → LoRA (r=16) sufficient
│
├── Many tasks/domains?
│ ├── Yes → Multiple LoRA adapters
│ └── No → Single fine-tune
│
└── Budget constrained?
├── Yes → QLoRA (4-bit base + LoRA)
    └── No → BF16 LoRA or full fine-tune
```
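For the QLoRA branch, a hedged configuration sketch using the Hugging Face peft/bitsandbytes stack; the model id and hyperparameters (r=16, alpha=32, target_modules) are illustrative choices, not fixed requirements:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base model: the "quantized base" half of QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",               # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections: the trainable, low-rank half.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()            # typically well under 1% of total parameters
```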
10. Topic Quick Links¶
| Topic | Key Concepts | Related Files |
|---|---|---|
| Attention | FlashAttention, MQA/GQA | flash-attention-v2-v3.md, mqa-gqa-attention.md |
| KV Cache | PagedAttention, RadixAttention | kv-cache-optimization-2025-2026.md, vllm-paged-attention.md |
| MoE | Expert routing, load balancing | moe-advances-2025-2026.md, moe-load-balancing-2025-2026.md |
| Quantization | AWQ, GPTQ, GGUF, FP8 | llm-quantization-2025-2026.md, gptq-awq-gguf-quantization.md |
| Speculative | EAGLE-3, MTP, NGRAM | speculative-decoding-2025-2026.md |
| Long Context | Ring Attention, Infini-Attention | long-context-2025-2026.md, rope-long-context.md |
| SSM | Mamba, RWKV, hybrids | state-space-models-2025-2026.md, xlstm-architecture-2025-2026.md |
| Inference | vLLM, SGLang, TensorRT-LLM | inference-engines-comparison-2025-2026.md |
| RAG | Embeddings, Vector DB, Retrieval | rag-system-design-2025-2026.md, advanced-rag-patterns-2025.md |
| Training | FSDP2, ZeRO-3, LoRA | distributed-training-comparison.md, lora-qlora-implementation-2025.md |
Common Misconceptions¶
Misconception: 'The KV cache is just a cache and doesn't affect architecture choices'
The KV cache is the central bottleneck of the entire inference chain. For a Llama-70B-scale model without GQA (full multi-head attention), a single request at 4K context needs ~10 GB of KV cache alone; GQA with 8 KV heads cuts that roughly 8x. The MQA vs GQA choice therefore determines cache size, which drives batch size, throughput, and the choice of inference engine. PagedAttention reduces cache waste from 60-80% to under 4%.
Misconception: 'Quantization and speculative decoding are independent optimizations'
They are tightly coupled. The EAGLE-3 draft model can be quantized more aggressively (it uses a 10K draft vocabulary versus the 128K full vocabulary). Combining AWQ 4-bit with EAGLE-3 gives 30% of baseline memory at a 2.2x speedup. An optimal serving strategy always considers both techniques together.
Misconception: 'RAG only adds retrieval latency'
RAG affects the entire inference chain: retrieved context increases prompt length, which means more KV cache, more memory, a smaller batch size, and lower throughput. Optimizations (prefix caching for repeated documents, semantic caching for repeated queries) cut end-to-end latency by 50-80%.
Interview Questions¶
Question: How are attention optimization and the inference engine connected?¶
Weak answer: "FlashAttention speeds up attention, the inference engine serves the model -- they are independent."
Strong answer: "They are directly connected through the KV cache. FlashAttention/MQA/GQA determine the KV cache size per request. The inference engine (vLLM vs SGLang) implements the management of that cache: PagedAttention cuts waste from 60-80% to under 4%, and RadixAttention adds token-level sharing for agent workloads. GQA plus SGLang with RadixAttention is an optimal combination for multi-turn chat."
Question: Describe the Quantization -> Speculative Decoding -> Inference chain¶
Weak answer: "Quantization compresses the model, speculative decoding speeds up generation, and the inference engine serves it all."
Strong answer: "The three techniques form a pipeline: (1) quantization (AWQ 4-bit) shrinks the base model to ~25% of its memory, (2) the EAGLE-3 draft model with its 10K vocabulary adds only ~5% overhead, (3) SGLang coordinates verification batching. Result: AWQ-4 + EAGLE-3 gives 30% memory at a 2.2x speedup. Without understanding these connections you leave 40%+ of the achievable gain on the table."
Question: Why does RAG require understanding KV cache management?¶
Weak answer: "RAG is about retrieval, the KV cache is about inference -- they're unrelated."
Strong answer: "Retrieved context directly increases prompt length, which grows the KV cache. For 10 chunks of 512 tokens that is +5K extra tokens, which for large models means additional gigabytes of KV cache. Prefix caching is critical: repeated system prompts and base documents are cached, saving 50-80% of compute. Semantic caching at the query level additionally reduces LLM calls by 50-80%."