LLM Interview Preparation Master Guide¶
~9 minute read
Prerequisites: Topic cross-reference map | Alignment & RLHF cheat sheet | ML System Design cheat sheet
Type: synthesis / master guide. Date: February 2026
A typical LLM interview covers 7 major areas: inference (KV cache, speculative decoding, quantization), architecture (attention, MoE, SSM), training (distributed, LoRA, alignment), RAG, system design, deep learning, and safety. Each area brings 3-5 questions at an expected depth of formulas from memory, concrete numbers (FlashAttention 2-4×, MQA ~8× memory reduction, speculative decoding 1.5-2.5× real-world), and trade-off analysis. This master guide is the single entry point, with formulas, decision guides, and the top 20 questions distilled from 89 synthesized sources.
Quick Topic Access¶
| Area | Key Topics | Cheat Sheet |
|---|---|---|
| LLM Inference | KV Cache, Speculative Decoding, Quantization | inference |
| RAG & Vector DB | Embeddings, Retrieval, Reranking | rag |
| ML Training | Distributed, Optimization, Regularization | training |
| Deep Learning | Normalization, Activations, SSM, Diffusion | deep-learning |
| System Design | Capacity, Patterns, Observability | system-design |
| Attention | RoPE, FlashAttention, MQA/GQA | attention |
| Alignment | RLHF, DPO, GRPO, RLAIF | alignment |
1. Key Numbers for the Interview¶
LLM Architecture¶
| Parameter | Value | Context |
|---|---|---|
| Attention complexity | \(O(T^2)\) | Full self-attention |
| FlashAttention speedup | 2-4× | FA-3: 6-16× на H100 |
| MQA memory reduction | ~8× | 1 K/V head vs. all heads |
| GQA quality loss | <1% | 8 groups vs full |
| RoPE extrapolation | 2-4× | Beyond training length |
Inference¶
| Metric | Value | Context |
|---|---|---|
| KV cache per request | ~10GB | Llama 70B @ 4K context |
| Speculative decoding | 1.5-2.5× | Real-world speedup |
| 4-bit quantization | ~75% memory savings | ~97-98% quality retained |
| PagedAttention waste | <4% | vs 60-80% static |
Training¶
| Metric | Value | Context |
|---|---|---|
| Mixed precision speedup | 2-3× | FP16/BF16 |
| Gradient accumulation | Virtual batch | Memory-efficient larger effective batch |
| LoRA parameters | <1% | Near-full fine-tuning quality |
| EMA smoothing | 0.999 | Typical decay |
System Design¶
| Metric | Formula | Example |
|---|---|---|
| QPS | DAU × requests/user/day ÷ 86400 | 10M DAU × 10 req = 1157 QPS |
| Cache hit rate | 50-80% | Semantic caching |
| Availability 99.9% | 8.76h downtime/year | SLA target |
2. Formulas to Memorize¶
Attention¶
\[ \text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]
(a NumPy sketch follows at the end of this section)
RoPE Rotation¶
\[ \begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d} \]
LayerNorm vs RMSNorm¶
\[ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \]
Softmax Temperature¶
\[ p_i = \frac{\exp(z_i/\tau)}{\sum_j \exp(z_j/\tau)} \]
Cross-Entropy Loss¶
\[ \mathcal{L}_{\text{CE}} = -\sum_i y_i \log \hat{y}_i \]
KL Divergence¶
\[ D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} \]
Perplexity¶
\[ \text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{<t})\right) \]
LoRA¶
\[ W = W_0 + \frac{\alpha}{r} BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k) \]
RLHF (Bradley-Terry)¶
\[ P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) \]
DPO Loss¶
\[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \]
QPS Formula¶
\[ \text{QPS} = \frac{\text{DAU} \times \text{requests per user per day}}{86400} \]
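To make the attention formula at the top of this section concrete, here is a minimal single-head NumPy sketch (no masking or batching; the toy sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attn(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (T, T) matrix -- the O(T^2) term
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
T, d_k = 8, 64                                # toy sizes
Q, K, V = (rng.standard_normal((T, d_k)) for _ in range(3))
out = attention(Q, K, V)                      # (T, d_k)
```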
3. What to Choose? (Decision Guide 2026)¶
Positional Encoding¶
| Need | Choice |
|---|---|
| Production LLM | RoPE (LLaMA, Mistral, Qwen) |
| Max extrapolation | ALiBi (1024→8192+) |
| Simple implementation | ALiBi |
Attention Optimization¶
| Scenario | Choice |
|---|---|
| Memory-constrained | MQA (8× less KV) |
| Balance quality/speed | GQA (<1% quality loss) |
| Long context (1M+) | Ring Attention or Star Attention |
| Maximum speed | FlashAttention-3 (H100) |
Inference Engine¶
| Workload | Choice |
|---|---|
| Simple serving | vLLM |
| Agent workflows | SGLang (3× faster) |
| NVIDIA-only | TensorRT-LLM |
| CPU/Mobile | llama.cpp (GGUF) |
Quantization¶
| Hardware | Method |
|---|---|
| NVIDIA GPU | AWQ or GPTQ (~98% quality) |
| CPU/Edge | GGUF Q4_K_M (~92% quality) |
| H100+ | FP8 (native support) |
Alignment Method¶
| Scenario | Method |
|---|---|
| Research/Quick | DPO (2-3× faster) |
| Production | RLHF (PPO or GRPO) |
| Reasoning models | GRPO (DeepSeek-R1) |
| Cost-constrained | RLAIF (70% cheaper) |
SSM vs Transformer¶
| Sequence Length | Choice |
|---|---|
| Short (<1K) | Transformer |
| Medium (1K-16K) | Hybrid (Jamba 1:7) |
| Very long (>16K) | SSM or Hybrid |
Long Context vs RAG¶
| Factor | Long Context | RAG |
|---|---|---|
| Cost | 100× more | Baseline |
| Fresh data | Retrain needed | Instant |
| Use case | Reasoning over docs | Retrieval |
Best practice: Use BOTH — RAG for retrieval, long context for reasoning.
4. Top 20 Interview Questions¶
Architecture¶
1. Explain the self-attention mechanism
   - \(QK^T\) dot product, softmax, multiply by \(V\)
   - \(O(T^2)\) complexity, captures all pairwise interactions
2. Why RoPE over learned positions? (see the sketch after this list)
   - Relative position encoding through rotation
   - Extrapolates 2-4× beyond the training length
   - Used in all modern LLMs (LLaMA, Mistral)
3. Compare MQA, GQA, MHA
   - MHA: full heads, max quality
   - MQA: 1 K/V head, 8× less memory, ~3% quality loss
   - GQA: grouped heads, <1% loss, best balance
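A minimal NumPy sketch of the rotation RoPE applies to query/key channels (the dimensions and usage check below are illustrative; d must be even):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate consecutive channel pairs of x (seq_len x d, d even) by position-dependent angles."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]            # (T, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)    # theta_i = base^(-2i/d), shape (d/2,)
    angles = pos * freqs                         # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]              # even/odd channel pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin           # standard 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.ones((4, 8))
print(rope(x)[0])   # position 0 has zero angles, so the first row is unchanged
```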
Training¶
4. What is gradient accumulation? (see the sketch after this list)
   - Simulate a larger batch by accumulating gradients
   - effective_batch = batch_size × accumulation_steps
5. Explain LoRA
   - Low-rank adaptation: \(W = W_0 + BA\)
   - <1% parameters, same quality as full fine-tuning
6. Why RMSNorm over LayerNorm?
   - 15-25% faster (no mean subtraction)
   - Same or better quality
   - Standard in all 2025-2026 LLMs
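A minimal runnable PyTorch sketch of gradient accumulation (the toy model and random batches are placeholders):

```python
import torch
from torch import nn

model = nn.Linear(16, 1)                              # toy stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accumulation_steps = 4                                # effective_batch = batch_size * accumulation_steps

optimizer.zero_grad()
for step in range(16):
    x, y = torch.randn(8, 16), torch.randn(8, 1)      # micro-batch of 8
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so summed grads average over the effective batch
    loss.backward()                                   # gradients accumulate in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                              # one update per effective batch of 32
        optimizer.zero_grad()
```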
Inference¶
7. What is a KV cache? (see the sketch after this list)
   - Store key/value tensors from previous tokens
   - Avoids recomputation in autoregressive generation
   - Memory: ~10GB for Llama 70B @ 4K context
8. Explain speculative decoding
   - Draft model generates tokens
   - Target model verifies them in parallel
   - 1.5-2.5× speedup (depends on acceptance rate)
9. Compare quantization methods
   - AWQ/GPTQ: ~98% quality, GPU
   - GGUF: ~92% quality, CPU
   - FP8: native on H100+, best quality
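A toy NumPy sketch of the KV-cache idea: past keys/values are computed once and reused, so each decode step attends with only the new token's query. In a real model q/k/v come from projecting the token's hidden state; random vectors stand in for them here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 64
rng = np.random.default_rng(0)
k_cache, v_cache = [], []                # grows by one entry per generated token

for step in range(5):                    # autoregressive decode loop
    q = rng.standard_normal(d_k)         # query for the newly generated token
    k = rng.standard_normal(d_k)         # its key/value, computed once...
    v = rng.standard_normal(d_k)
    k_cache.append(k)                    # ...then cached, never recomputed
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn_out = softmax(q @ K.T / np.sqrt(d_k)) @ V   # attend over all cached tokens
```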
Alignment¶
10. Explain the RLHF pipeline
    - SFT → reward model training → PPO fine-tuning
    - Human preferences → Bradley-Terry → RM loss
11. DPO vs RLHF?
    - DPO skips the reward model
    - 2-3× faster, simpler
    - Slightly lower quality on complex tasks
12. What is GRPO? (see the sketch after this list)
    - Group Relative Policy Optimization
    - DeepSeek-R1 method
    - No value function, 50% less memory
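The core of GRPO's "no value function" point is that advantages are normalized within a group of responses sampled for the same prompt. A minimal sketch (the reward values are made up):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within one group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)   # group statistics replace a learned value baseline

# e.g. 8 sampled completions for one prompt, scored by a reward model
print(group_relative_advantages([0.1, 0.9, 0.4, 0.4, 0.7, 0.2, 0.8, 0.5]))
```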
System Design¶
13. Design ChatGPT at scale
    - Load balancer → vLLM/SGLang → semantic cache
    - Model routing for cost optimization
    - Streaming responses, rate limiting
14. QPS calculation
    - \(\text{QPS} = \text{DAU} \times \text{Requests} / 86400\)
    - Example: 10M DAU × 10 req = 1157 QPS, peak ~3500
15. Semantic caching (see the sketch after this list)
    - Embed the query, check similarity before the LLM call
    - 50-80% hit rate, 70-90% cost savings
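A minimal sketch of the semantic-caching check. The embedding is assumed to come from some external model, the 0.92 threshold is illustrative, and a production system would use a vector index rather than this linear scan:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    def __init__(self, threshold=0.92):   # illustrative similarity threshold
        self.entries = []                 # list of (embedding, response) pairs
        self.threshold = threshold

    def get(self, query_emb):
        """Return a cached response if any stored query is similar enough."""
        for emb, response in self.entries:
            if cosine(query_emb, emb) >= self.threshold:
                return response
        return None                       # cache miss -> call the LLM

    def put(self, query_emb, response):
        self.entries.append((query_emb, response))

cache = SemanticCache()
emb = np.random.default_rng(0).standard_normal(384)   # stand-in for a real embedding
cache.put(emb, "cached answer")
print(cache.get(emb))                    # identical embedding -> similarity 1.0 -> hit
```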
Advanced¶
16. FlashAttention optimization (see the sketch after this list)
    - Block-wise computation, never materialize the \(T \times T\) matrix
    - \(O(T)\) memory instead of \(O(T^2)\)
    - FA-3 on H100: 6-16× speedup
17. MoE load balancing
    - Auxiliary loss: penalty for imbalance
    - Loss-free (DeepSeek V3): dynamic bias
    - SIMBAL: 36% faster convergence
18. Test-time compute scaling
    - o1 approach: more inference compute = better reasoning
    - Methods: parallel sampling, tree search, MCTS
    - Key result: 1B + scaling > 405B without scaling
19. RAG vs Long Context
    - RAG: cheaper, fresher data
    - Long Context: simpler, more expensive
    - Use both: RAG to retrieve → long context to reason
20. How do you handle prompt injection?
    - Multi-layer defense: input filter → intent classifier → output filter
    - Constitutional AI training for robustness
    - Continuous monitoring and red teaming
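For the FlashAttention item above, here is a pedagogical NumPy sketch of the online-softmax trick for a single query row: running max and normalizer are updated tile by tile, so the full \(T \times T\) score matrix is never built. The real kernel fuses this into GPU SRAM tiles; this only shows the math:

```python
import numpy as np

def tiled_attention_row(q, K, V, block=32):
    """One query row with FlashAttention-style online softmax over K/V tiles."""
    d_k = q.shape[-1]
    m, l = -np.inf, 0.0                                  # running max and softmax normalizer
    o = np.zeros(V.shape[-1])                            # running unnormalized output
    for start in range(0, K.shape[0], block):
        s = q @ K[start:start + block].T / np.sqrt(d_k)  # scores for this tile only
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)                         # rescale accumulators to the new max
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        o = o * corr + p @ V[start:start + block]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(64), rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
s = q @ K.T / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V   # full-matrix reference
assert np.allclose(tiled_attention_row(q, K, V), ref)         # tiled result matches exactly
```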
5. Code Red Flags (Code Review)¶
Common Mistakes¶
```python
import logging
import os

logger = logging.getLogger(__name__)

# ❌ WRONG: Hardcoded secrets
API_KEY = "sk-12345..."

# ✅ RIGHT: Environment variables
API_KEY = os.environ.get("API_KEY")

# ❌ WRONG: No error handling
response = model.generate(prompt)

# ✅ RIGHT: Graceful degradation
try:
    response = model.generate(prompt)
except Exception as e:
    logger.error(f"Generation failed: {e}")
    response = fallback_response()

# ❌ WRONG: SQL injection via f-string
query = f"SELECT * FROM users WHERE id = {user_id}"

# ✅ RIGHT: Parameterized query
query = "SELECT * FROM users WHERE id = %s"
cursor.execute(query, (user_id,))
```
ML-Specific Anti-Patterns¶
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# ❌ Data leakage: scaler fitted on ALL data, so test-set statistics leak into training
X_train, X_test = train_test_split(data, test_size=0.2)
scaler.fit(data)

# ✅ Correct order: fit on train only, then transform both splits
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# ❌ Ignoring imbalanced data (e.g. 99% class 0)
model.fit(X, y)

# ✅ Handle imbalance; in sklearn class_weight is a constructor argument, not a fit() argument
model = LogisticRegression(class_weight='balanced')
model.fit(X, y)
```
6. Behavioral Questions (STAR)¶
AI Safety Story¶
S: Working on content moderation system, noticed model producing false positives on legitimate content.
T: Reduce false positive rate while maintaining safety.
A: Analyzed failure cases, found pattern in edge cases, implemented multi-tier system (regex → ML → human review), added feedback loop.
R: False positive rate dropped from 15% to 3%, user complaints reduced by 80%.
Failure Story¶
S: Deployed model that performed well in testing but failed in production.
T: Understand root cause and fix deployment.
A: Investigated logs, found distribution shift (production data different from test), implemented monitoring, retrained with production data, added A/B testing.
R: Model now performs consistently, monitoring catches drift early.
Conflict Resolution¶
S: Disagreement with PM about model complexity vs latency budget.
T: Find solution that meets both quality and latency requirements.
A: Set up experiments with different model sizes, measured latency vs quality tradeoff, presented data-driven options, proposed model routing (simple queries → small model, complex → large).
R: Implemented tiered approach, reduced p95 latency by 60%, maintained quality.
7. Preparation Resources¶
Blogs (read regularly)¶
- Lilian Weng — lilianweng.github.io (attention, RLHF, transformers)
- Sebastian Raschka — magazine.sebastianraschka.com (training, fine-tuning)
- Chip Huyen — chiphuyen.com (MLOps, engineering)
- Eugene Yan — eugeneyan.com (applied ML patterns)
Papers (2025-2026 key reads)¶
- DeepSeek-R1 (GRPO)
- FlashAttention-3
- Mamba-2 (SSM)
- Ring Attention (10M+ context)
- EAGLE-3 (Speculative Decoding)
Practice¶
- Implement attention from scratch
- Implement BPE tokenizer
- Implement LoRA fine-tuning
- Implement KV cache with PagedAttention
- Build RAG pipeline with reranking
8. Pre-Interview Checklist¶
Technical¶
- Write the attention formula without prompting
- Explain RoPE in 3 different ways
- Draw the RLHF pipeline
- Compute QPS for a given DAU
- Name 3 quantization methods and their trade-offs
- Explain speculative decoding in 30 seconds
System Design¶
- Framework: RESHADED
- Know latency budgets
- Caching strategies
- Database choices by use case
- Monitoring metrics
Behavioral¶
- 3 STAR stories ready
- AI safety perspective
- Failure story with lessons learned
- Why this company / role
Common Misconceptions¶
Misconception: "LoRA uses <1% of the parameters, so quality must be much worse"
At an appropriate rank (r=16-64), LoRA reaches 95-100% of full fine-tuning quality on most tasks. The key factor is not the parameter count but which layers are adapted (the Q, K, V, O projections). QLoRA (4-bit base + LoRA) adds further memory savings with minimal quality loss.
Misconception: "DPO is always better than RLHF because it is simpler"
DPO is 2-3× faster and simpler to implement, but it trails RLHF/GRPO on complex reasoning tasks. DeepSeek-R1 uses GRPO (not DPO) precisely because reasoning models need exploration through sampling. DPO suits quick experiments; GRPO/PPO suit production reasoning.
Misconception: "Semantic caching saves 70-90%, so it is always worth deploying"
A 50-80% hit rate is reached only with predictable query patterns (customer support, FAQ). On creative/reasoning workloads the hit rate drops to 10-20%, and the embedding + vector search overhead (15-30ms) can add latency without any real savings. Analyze the query distribution before deploying.
Interview: Express Readiness Check¶
Question: Explain self-attention in 30 seconds¶
❌ Weak: "Self-attention multiplies the query by the key, then by the value"
✅ Strong: "Attn(Q,K,V) = softmax(QK^T / sqrt(d_k)) · V. Each token computes a dot product with every other token, giving O(T^2) complexity, which models dependencies at any distance. FlashAttention removes the memory bottleneck via tiling: O(T) memory instead of O(T^2), a 2-4× speedup."
Question: Which alignment method would you pick for a reasoning model?¶
❌ Weak: "DPO, it is simpler and faster"
✅ Strong: "GRPO (Group Relative Policy Optimization), as in DeepSeek-R1. Reasons: (1) no value function, so 50% memory savings; (2) group-relative advantages from sampling 4-16 responses per prompt provide the exploration that reasoning requires; (3) more stable than PPO on long chains of thought. DPO is fine for quick experiments, but reasoning needs exploration."
Question: Estimate QPS for a system with 10M DAU¶
❌ Weak: "A lot, you need a GPU cluster"
✅ Strong: "QPS = DAU × requests / 86400 = 10M × 10 / 86400 = 1157 QPS average, peak ~3500 QPS (3× factor). With semantic caching (50-80% hit rate) the real LLM load is 700-1750 QPS. With a 200ms latency budget and ~100 tokens/s throughput per GPU, that means roughly 35-175 GPU instances behind the load balancer."
Sources¶
All 10 synthesis cheat sheets + 89 source files (including Phase 5 additions).