Cross-Reference Map of LLM Topics

~5 min read

Prerequisites: Master preparation guide | Contradiction analysis

Type: synthesis / cross-reference · Date: February 2026

Preparing for LLM interviews covers 10+ major topics (Attention, MoE, KV Cache, Quantization, RAG, Training...) with 30+ critical connections between them. For example, the choice of GQA directly determines KV cache size (an 8x difference versus MQA), which drives the inference engine configuration (vLLM vs SGLang), which in turn shapes the speculative decoding strategy. This map visualizes all of these dependencies and helps you build answers that demonstrate systems-level understanding.


Topic Dependency Graph

graph TD
    A["Transformer<br/>Architecture"] --> B["Attention<br/>Optimization"]
    A --> C["MoE<br/>Architecture"]
    A --> D["Long Context<br/>Methods"]

    B --> E["FlashAttention<br/>MQA/GQA"]
    C --> F["Load Balancing<br/>Expert Routing"]
    D --> G["Ring Attention<br/>Infini-Attn"]

    E --> H["KV Cache<br/>Management"]
    E --> I["Inference<br/>Engines"]
    I --> H

    H --> J["PagedAttention<br/>Prefix Caching<br/>RadixAttention"]

    J --> K["Speculative Decoding<br/>EAGLE, MTP, NGRAM"]
    I --> K

    K --> L["Quantization<br/>AWQ, GPTQ, GGUF, FP8"]

    style A fill:#f3e5f5,stroke:#9c27b0
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#e8eaf6,stroke:#3f51b5
    style D fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#e8f5e9,stroke:#4caf50
    style H fill:#fff3e0,stroke:#ef6c00
    style I fill:#fff3e0,stroke:#ef6c00
    style J fill:#fff3e0,stroke:#ef6c00
    style K fill:#fce4ec,stroke:#c62828
    style L fill:#fce4ec,stroke:#c62828

1. Attention → KV Cache → Inference

Connection Flow

graph TD
    A["Attention: requires K,V vectors"] --> B["KV Cache: stores K,V<br/>to avoid recomputation"]
    B --> C["Memory problem:<br/>KV cache grows with seq length"]
    C --> D["PagedAttention<br/>block-based allocation"]
    C --> E["Prefix caching<br/>hash shared prefixes"]
    C --> F["RadixAttention<br/>token-level radix tree"]
    D --> G["vLLM:<br/>PagedAttention + prefix cache"]
    E --> G
    F --> H["SGLang:<br/>RadixAttention, best for agents"]
    D --> I["TensorRT-LLM:<br/>CUDA-optimized attention"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#fce4ec,stroke:#c62828
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#f3e5f5,stroke:#9c27b0
    style H fill:#f3e5f5,stroke:#9c27b0
    style I fill:#f3e5f5,stroke:#9c27b0

Key Formulas Connection

Attention: \(\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)

KV Cache Memory: \(M = 2 \times L \times H \times d \times T \times 2 \text{ bytes}\)

Connection: the \(K\) and \(V\) matrices from attention are exactly what gets cached. In the memory formula, \(L\) is the number of layers, \(H\) the number of KV heads, \(d\) the head dimension, and \(T\) the sequence length; the leading 2 counts K and V, and the trailing 2 bytes assumes FP16/BF16. Per request, memory grows linearly with \(T\).
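
A minimal sketch of this formula in Python (the 80-layer, 128-dim-head setup below is an illustrative 70B-class configuration, not taken from a specific model card):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """M = 2 (K and V) x layers x KV heads x head dim x tokens x bytes per value."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative 70B-class config: 80 layers, head_dim = 128, 4K context, FP16.
for name, kv_heads in [("MHA, 64 KV heads", 64),
                       ("GQA, 8 KV heads", 8),
                       ("MQA, 1 KV head", 1)]:
    gib = kv_cache_bytes(80, kv_heads, 128, 4096) / 2**30
    print(f"{name}: {gib:.2f} GiB per request")
# MHA ~10 GiB -> GQA ~1.25 GiB -> MQA ~0.16 GiB: the 8x steps discussed in this map.
```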


2. MoE → Load Balancing → Inference

Connection Flow

graph TD
    A["MoE: routes tokens<br/>to subset of experts"] --> B["Problem: uneven expert<br/>utilization, rich-get-richer"]
    B --> C["Auxiliary loss<br/>standard, interferes with main loss"]
    B --> D["Loss-Free, DeepSeek V3<br/>dynamic bias"]
    B --> E["SIMBAL<br/>similarity-preserving, 36% faster"]
    C --> F["Token dropping<br/>if expert overloaded"]
    D --> F
    E --> F
    F --> G["Latency variance per token"]
    F --> H["Throughput depends<br/>on balance quality"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fce4ec,stroke:#c62828
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#fff3e0,stroke:#ef6c00
    style G fill:#fff3e0,stroke:#ef6c00
    style H fill:#fff3e0,stroke:#ef6c00

Expert Utilization Impact

| Balance Quality | Throughput | Latency |
|---|---|---|
| Poor (some experts at 0%) | -30% | High variance |
| Good (all ~50%) | Baseline | Stable |
| Excellent (all ~50%, low variance) | +10% | Predictable |
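
For the "Auxiliary loss" node above, a minimal sketch of a Switch-Transformer-style balancing term (top-1 routing assumed; DeepSeek V3's loss-free approach replaces this with a per-expert bias adjusted outside the gradient):

```python
import numpy as np

def aux_load_balance_loss(router_probs: np.ndarray, expert_index: np.ndarray) -> float:
    """Switch-Transformer-style auxiliary loss: num_experts * sum_i(f_i * P_i).

    router_probs: [tokens, experts] softmax output of the router.
    expert_index: [tokens] top-1 expert chosen for each token.
    The loss is minimized when routed-token fractions (f) and mean router
    probabilities (P) are both uniform across experts.
    """
    num_experts = router_probs.shape[1]
    f = np.bincount(expert_index, minlength=num_experts) / len(expert_index)
    p = router_probs.mean(axis=0)
    return float(num_experts * np.sum(f * p))
```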

3. Long Context → Ring Attention → Distributed

Connection Flow

graph TD
    A["Standard attention:<br/>O&#40;T^2&#41; memory and compute"] --> B["Problem: can't fit 100K+<br/>context on single GPU"]
    B --> C["FlashAttention<br/>memory efficient, still limited"]
    B --> D["Ring Attention<br/>distributed across GPUs"]
    B --> E["Infini-Attention<br/>compressive memory"]
    B --> F["Star Attention<br/>block-sparse, 11x faster"]
    D --> G["Multi-GPU setup<br/>ring topology"]
    D --> H["Inter-GPU communication"]
    D --> I["Distributed framework<br/>FSDP, Megatron"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fce4ec,stroke:#c62828
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fff3e0,stroke:#ef6c00
    style H fill:#fff3e0,stroke:#ef6c00
    style I fill:#fff3e0,stroke:#ef6c00

Context Length Scaling

| Method | Max Context | Hardware | Notes |
|---|---|---|---|
| FlashAttention-3 | 128K | H100 | Single GPU |
| Ring Attention | 1M+ | 8+ GPUs | Linear scaling |
| Infini-Attention | Unlimited | Any | Quality loss |
| Star Attention | 128K | Single GPU | 11× faster |
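
To make the O(T²) bottleneck concrete, a back-of-the-envelope sketch of the memory a naive implementation needs just to materialize one T×T score matrix in FP16 (FlashAttention never materializes it; Ring Attention additionally shards the sequence across GPUs):

```python
def naive_score_matrix_gib(seq_len: int, bytes_per_value: int = 2) -> float:
    """Memory for a full T x T attention score matrix, per head and per layer."""
    return seq_len ** 2 * bytes_per_value / 2**30

for t in (8_192, 32_768, 131_072, 1_048_576):
    print(f"T = {t:>9,}: {naive_score_matrix_gib(t):>8,.1f} GiB per head per layer")
# 8K -> 0.1 GiB, 32K -> 2 GiB, 128K -> 32 GiB, 1M -> 2048 GiB
```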

4. Speculative Decoding → Quantization → Inference

Connection Flow

graph TD
    A["Speculative decoding<br/>adds draft model"] --> B["Draft model needs<br/>extra memory"]
    B --> C["Solution: quantize draft<br/>model more aggressively"]
    C --> D["EAGLE-3 draft model"]
    D --> E["10K vocabulary<br/>vs 128K full"]
    D --> F["Single transformer layer"]
    D --> G["Can be further quantized"]
    E --> H["Combined speedup:<br/>2-4x speed + 75% memory reduction"]
    F --> H
    G --> H

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fce4ec,stroke:#c62828
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#fff3e0,stroke:#ef6c00
    style G fill:#fff3e0,stroke:#ef6c00
    style H fill:#f3e5f5,stroke:#9c27b0

Quantization + Speculative Speedup

| Configuration | Memory | Speedup |
|---|---|---|
| BF16, no spec | 100% | Baseline |
| BF16 + EAGLE-3 | 105% | 2.5× |
| AWQ-4 + EAGLE-3 | 30% | 2.2× |
| GGUF Q4 + NGRAM | 25% | 1.3× |
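
The table above assumes the standard draft-then-verify loop. Below is a minimal sketch of greedy verification; real systems such as EAGLE-3 verify draft trees and use probabilistic acceptance when sampling, so this illustrates the control flow only:

```python
import numpy as np

def verify_draft_greedy(target_logits: np.ndarray, draft_tokens: list[int]) -> list[int]:
    """One speculative step under greedy decoding.

    target_logits: [len(draft_tokens) + 1, vocab_size], produced by a single
        target-model forward pass over the context plus the k draft tokens;
        row i predicts the token that should follow draft_tokens[:i].
    Returns the tokens emitted this step: the accepted prefix of the draft
    plus one token from the target (a correction, or a bonus token if every
    draft token was accepted).
    """
    target_choice = target_logits.argmax(axis=-1)
    emitted: list[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_choice[i]:
            emitted.append(tok)                       # draft matches target: accept
        else:
            emitted.append(int(target_choice[i]))     # first mismatch: take target's token
            return emitted
    emitted.append(int(target_choice[len(draft_tokens)]))  # all accepted: bonus token
    return emitted
```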

5. SSM → Hybrid Architecture → Inference

Connection Flow

graph TD
    A["State Space Models, Mamba<br/>O&#40;T&#41; complexity"] --> B["Pros: fast inference,<br/>constant memory"]
    A --> C["Cons: lower quality<br/>than Transformers"]
    B --> D["Jamba<br/>1:7 Transformer:Mamba"]
    C --> D
    B --> E["Bamba<br/>similar ratio"]
    B --> F["Nemotron-H<br/>NVIDIA"]
    D --> G["Need both attention<br/>and SSM kernels"]
    D --> H["Memory = KV cache<br/>+ SSM state"]
    D --> I["Different optimal<br/>batch sizes"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#fce4ec,stroke:#c62828
    style D fill:#f3e5f5,stroke:#9c27b0
    style E fill:#f3e5f5,stroke:#9c27b0
    style F fill:#f3e5f5,stroke:#9c27b0
    style G fill:#fff3e0,stroke:#ef6c00
    style H fill:#fff3e0,stroke:#ef6c00
    style I fill:#fff3e0,stroke:#ef6c00

Hybrid vs Pure Comparison

| Architecture | Quality | Inference Speed | Memory |
|---|---|---|---|
| Pure Transformer | 100% | Baseline | Grows with T |
| Pure Mamba | 95-98% | 5× faster | Fixed |
| Hybrid (1:7) | 99%+ | 3× faster | Reduced |

6. RAG → Embeddings → Vector DB → LLM

Connection Flow

graph TD
    A["Query needs context"] --> B["Embed query<br/>same model as indexed docs"]
    B --> C["Vector similarity search<br/>in Vector DB"]
    C --> D["Top-K chunks retrieved"]
    D --> E["Optional: rerank<br/>with cross-encoder"]
    E --> F["LLM generates<br/>with context"]
    F --> G["Retrieved context adds<br/>to prompt length"]
    F --> H["Longer context =<br/>more KV cache"]
    F --> I["Prefix caching helps<br/>with repeated docs"]
    F --> J["Semantic caching<br/>reduces LLM calls"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#f3e5f5,stroke:#9c27b0
    style G fill:#fce4ec,stroke:#c62828
    style H fill:#fce4ec,stroke:#c62828
    style I fill:#e8f5e9,stroke:#4caf50
    style J fill:#e8f5e9,stroke:#4caf50

End-to-End Latency

| Stage | Typical Time | Optimization |
|---|---|---|
| Embedding | 5-10 ms | Batch, smaller model |
| Vector search | 5-20 ms | HNSW, smaller dimension |
| Reranking | 20-50 ms | Skip for simple queries |
| LLM generation | 50-500 ms | Speculative decoding, quantization |
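
A minimal sketch of the "Vector search" stage: brute-force cosine similarity over an in-memory matrix (the HNSW entry in the table replaces this exhaustive scan with an approximate index):

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k chunks most similar to the query.

    query_vec:  [dim] embedding of the query (same embedding model as the docs).
    doc_matrix: [num_docs, dim] embeddings of the indexed chunks.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity against every chunk
    return list(np.argsort(-scores)[:k])
```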

7. Training → Fine-tuning → Inference

Connection Flow

graph TD
    A["Pre-training:<br/>full model on large corpus"] --> B["Full fine-tuning<br/>all params"]
    A --> C["LoRA<br/>low-rank adapters"]
    A --> D["QLoRA<br/>quantized + LoRA"]
    B --> E["LoRA adds small overhead<br/>adapter merge"]
    C --> E
    D --> E
    C --> F["Multi-LoRA serving<br/>adapter switching"]
    D --> G["Quantized base + LoRA<br/>= efficient serving"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#f3e5f5,stroke:#9c27b0
    style F fill:#f3e5f5,stroke:#9c27b0
    style G fill:#f3e5f5,stroke:#9c27b0

LoRA Serving Patterns

Single LoRA:      Merge into base weights (no overhead)
Multi-LoRA:       Keep adapters separate, switch per request
Batched Multi:    vLLM/SGLang support dynamic adapter loading
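
The "merge into base weights" pattern is just a rank-r matrix addition. A minimal sketch with the standard LoRA parameterization (shapes are assumptions for illustration):

```python
import numpy as np

def merge_lora(w_base: np.ndarray, lora_a: np.ndarray, lora_b: np.ndarray,
               alpha: float, r: int) -> np.ndarray:
    """Fold a LoRA adapter into the base weight: W' = W + (alpha / r) * B @ A.

    w_base: [out_dim, in_dim], lora_a: [r, in_dim], lora_b: [out_dim, r].
    After merging, serving has zero adapter overhead (the 'Single LoRA' row);
    Multi-LoRA serving keeps B @ A separate and applies it per request instead.
    """
    return w_base + (alpha / r) * (lora_b @ lora_a)
```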

8. Cross-Cutting Concerns

Memory Optimization Techniques

| Technique | Where Applied | Reduction |
|---|---|---|
| Quantization | Model weights | 50-75% |
| KV cache optimization | Attention | 60-80% waste → <4% |
| Activation checkpointing | Training | 50-70% |
| Gradient accumulation | Training | Simulates larger batches |
| Prefix caching | Inference | 50-80% for repeated prefixes |

Speed Optimization Techniques

| Technique | Where Applied | Speedup |
|---|---|---|
| FlashAttention | Attention | 2-4× |
| Speculative decoding | Inference | 1.8-4× |
| CUDA graphs | Inference | 1.2-1.5× |
| torch.compile | Training/Inference | 1.1-1.3× |
| Continuous batching | Inference | 2-3× |

9. Decision Trees

"I need to deploy an LLM for inference"

Start
  ├── Memory constrained?
  │     ├── Yes → Quantization (AWQ/GGUF)
  │     └── No → BF16 or FP16
  ├── High throughput needed?
  │     ├── Yes → SGLang + EAGLE-3 + RadixAttention
  │     └── No → vLLM is simpler
  ├── Long context (>32K)?
  │     ├── Yes → Ring Attention or FlashAttention-3
  │     └── No → Standard attention
  ├── Multi-turn conversations?
  │     ├── Yes → Prefix caching + KV cache optimization
  │     └── No → Standard serving
  └── Agent workflows?
        ├── Yes → SGLang with RadixAttention
        └── No → vLLM or TensorRT-LLM
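
As one concrete end point of this tree (memory constrained, multi-turn), a hedged sketch using the vLLM offline Python API; the checkpoint name is a placeholder, and argument availability (notably enable_prefix_caching) depends on the installed vLLM version:

```python
from vllm import LLM, SamplingParams

# Memory-constrained branch: AWQ-quantized weights; prefix caching for multi-turn chat.
llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",    # placeholder AWQ checkpoint
    quantization="awq",
    max_model_len=8192,
    enable_prefix_caching=True,          # assumption: available in recent vLLM releases
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize the trade-offs of GQA vs MQA."], params)
print(outputs[0].outputs[0].text)
```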

"I need to fine-tune an LLM"

Start
  ├── Full GPU cluster available?
  │     ├── Yes → Full fine-tuning or FSDP2
  │     └── No → LoRA or QLoRA
  ├── Quality critical?
  │     ├── Yes → Full fine-tuning or high-rank LoRA (r=64)
  │     └── No → LoRA (r=16) sufficient
  ├── Many tasks/domains?
  │     ├── Yes → Multiple LoRA adapters
  │     └── No → Single fine-tune
  └── Budget constrained?
        ├── Yes → QLoRA (4-bit base + LoRA)
        └── No → BF16 LoRA or full fine-tune
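
For the budget-constrained QLoRA branch, a hedged sketch with Hugging Face peft and bitsandbytes (the base model name is a placeholder; target modules and exact arguments vary by model and library version):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",                        # placeholder base model
    quantization_config=BitsAndBytesConfig(           # 4-bit NF4 base weights
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,           # r=16 per the tree above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()                    # typically well under 1% trainable
```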

| Topic | Key Concepts | Related Files |
|---|---|---|
| Attention | FlashAttention, MQA/GQA | flash-attention-v2-v3.md, mqa-gqa-attention.md |
| KV Cache | PagedAttention, RadixAttention | kv-cache-optimization-2025-2026.md, vllm-paged-attention.md |
| MoE | Expert routing, load balancing | moe-advances-2025-2026.md, moe-load-balancing-2025-2026.md |
| Quantization | AWQ, GPTQ, GGUF, FP8 | llm-quantization-2025-2026.md, gptq-awq-gguf-quantization.md |
| Speculative | EAGLE-3, MTP, NGRAM | speculative-decoding-2025-2026.md |
| Long Context | Ring Attention, Infini-Attention | long-context-2025-2026.md, rope-long-context.md |
| SSM | Mamba, RWKV, hybrids | state-space-models-2025-2026.md, xlstm-architecture-2025-2026.md |
| Inference | vLLM, SGLang, TensorRT-LLM | inference-engines-comparison-2025-2026.md |
| RAG | Embeddings, Vector DB, Retrieval | rag-system-design-2025-2026.md, advanced-rag-patterns-2025.md |
| Training | FSDP2, ZeRO-3, LoRA | distributed-training-comparison.md, lora-qlora-implementation-2025.md |

Common Misconceptions

Misconception: 'KV cache is just a cache; it doesn't affect architecture'

The KV cache is the central bottleneck of the entire inference chain. For Llama 70B at 4K context, a single request needs ~10GB for the KV cache alone. The choice of MQA vs GQA determines the cache size (an 8x difference), which affects batch size, throughput, and the choice of inference engine. PagedAttention reduces waste from 60-80% to <4%.

Misconception: 'Quantization and speculative decoding are independent optimizations'

They are tightly coupled. The EAGLE-3 draft model can be quantized more aggressively (10K vocabulary vs the 128K full vocabulary). Combining AWQ-4 + EAGLE-3 gives 30% of baseline memory at a 2.2x speedup. An optimal strategy always considers both techniques together.

Misconception: 'RAG only adds retrieval latency'

RAG affects the entire inference chain: retrieved context increases prompt length → more KV cache → more memory → smaller batch size → lower throughput. Optimizations (prefix caching for repeated docs, semantic caching for repeated queries) cut end-to-end latency by 50-80%.


Interview Questions

Question: How are attention optimization and the inference engine connected?

❌ "FlashAttention speeds up attention, the inference engine serves the model -- they are independent"

✅ "They are directly linked through the KV cache. FlashAttention/MQA/GQA determine the KV cache size per request. The inference engine (vLLM vs SGLang) implements the management of that cache: PagedAttention cuts waste from 60-80% to <4%, and RadixAttention adds token-level sharing for agent workloads. GQA + SGLang with RadixAttention is an optimal combination for multi-turn chat."

Question: Describe the chain Quantization → Speculative Decoding → Inference

❌ "Quantization compresses the model, speculative decoding speeds up generation, and the inference engine serves everything"

✅ "The three techniques form a pipeline: (1) quantization (AWQ 4-bit) shrinks the main model to 25% of its memory, (2) the EAGLE-3 draft model with a 10K vocabulary adds only 5% overhead, (3) SGLang coordinates verification batching. Result: AWQ-4 + EAGLE-3 = 30% memory, 2.2x speedup. Without understanding these connections you leave 40%+ of the achievable gain on the table."

Question: Why is KV cache management important for RAG?

❌ "RAG is about retrieval, KV cache is about inference -- they are unrelated"

✅ "Retrieved context directly increases prompt length, which grows the KV cache. For 10 chunks of 512 tokens each, that is +5K tokens, which means extra gigabytes of KV cache for large models. Prefix caching is critical: recurring system prompts and base documents get cached, saving 50-80% of compute. Semantic caching at the query level further reduces LLM calls by 50-80%."