
Master Guide to LLM Interview Prep

~9 min read

Prerequisites: Cross-reference topic map | Alignment & RLHF cheat sheet | ML System Design cheat sheet

Type: synthesis / master guide · Date: February 2026

A typical LLM interview covers 7 major areas: inference (KV cache, speculative decoding, quantization), architecture (attention, MoE, SSM), training (distributed training, LoRA, alignment), RAG, system design, deep learning, and safety. Each area yields 3-5 questions at an expected level of depth: formulas from memory, concrete numbers (FlashAttention 2-4×, MQA 8× memory reduction, speculative decoding 1.5-2.5× real-world), and trade-off analysis. This master guide is the single entry point, with formulas, decision guides, and the top 20 questions distilled from 89 synthesized sources.


Quick access to topics

| Area | Key topics | Cheat sheet |
|---|---|---|
| LLM Inference | KV Cache, Speculative Decoding, Quantization | inference |
| RAG & Vector DB | Embeddings, Retrieval, Reranking | rag |
| ML Training | Distributed, Optimization, Regularization | training |
| Deep Learning | Normalization, Activations, SSM, Diffusion | deep-learning |
| System Design | Capacity, Patterns, Observability | system-design |
| Attention | RoPE, FlashAttention, MQA/GQA | attention |
| Alignment | RLHF, DPO, GRPO, RLAIF | alignment |

1. Key numbers for interviews

LLM Architecture

| Parameter | Value | Context |
|---|---|---|
| Attention complexity | \(O(T^2)\) | Full self-attention |
| FlashAttention speedup | 2-4× | FA-3: 6-16× on H100 |
| MQA memory reduction | ~8× | 1 K/V head vs. all heads |
| GQA quality loss | <1% | 8 groups vs. full MHA |
| RoPE extrapolation | 2-4× | Beyond training length |

Inference

| Metric | Value | Context |
|---|---|---|
| KV cache per request | ~10GB | Llama 70B @ 4K context |
| Speculative decoding | 1.5-2.5× | Real-world speedup |
| 4-bit quantization | 75% memory saved | ~97-98% quality retained |
| PagedAttention waste | <4% | vs. 60-80% with static allocation |

Training

| Metric | Value | Context |
|---|---|---|
| Mixed precision speedup | 2-3× | FP16/BF16 |
| Gradient accumulation | Virtual batch | Memory-efficient large batches |
| LoRA parameters | <1% | Near full fine-tuning quality |
| EMA smoothing | 0.999 | Typical decay |

System Design

| Metric | Formula / Value | Example |
|---|---|---|
| QPS | DAU × RPS / 86400 | 10M DAU × 10 = 1157 QPS |
| Cache hit rate | 50-80% | Semantic caching |
| Availability | 99.9% | 8.76h downtime/year (SLA target) |

2. Formulas to memorize

Attention

\[\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
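
A minimal NumPy sketch of this formula (single head, no masking; shapes are illustrative assumptions):

import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head; Q, K: (T, d_k), V: (T, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (T, T) pairwise logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (T, d_v)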

RoPE Rotation

\[\text{RoPE}(x_m, m) = R_m x_m, \quad \theta_i = 10000^{-2i/d}\]
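
A sketch of the rotation applied to one vector, using the interleaved-pair convention (an assumption; some implementations pair the first and second halves of the dimensions instead):

import numpy as np

def rope(x, m):
    """Rotate dim pairs (x_{2i}, x_{2i+1}) of x (shape (d,), d even) by angle m * theta_i."""
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)  # theta_i from the formula above
    angle = m * theta                                # position m enters only via the angle
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=np.float64)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out  # dot(rope(q, m), rope(k, n)) depends only on m - n: relative encoding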

LayerNorm vs RMSNorm

\[\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta\]
\[\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum x_i^2}} \cdot \gamma\]
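
Side by side in NumPy (the eps placement is one common convention, assumed here):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return x / rms * gamma  # no mean subtraction and no beta: cheaper per token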

Softmax Temperature

\[\text{softmax}(z; T)_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}\]

Cross-Entropy Loss

\[\mathcal{L}_{CE} = -\sum_i y_i \log(\hat{y}_i)\]

KL Divergence

\[D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\]

Perplexity

\[\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^N \log p(x_i)\right)\]
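
The four probability formulas above in a few lines of NumPy (eps terms added for numerical safety):

import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))  # T > 1 flattens, T < 1 sharpens
    return e / e.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    return -np.sum(y * np.log(y_hat + eps))

def kl_divergence(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def perplexity(token_log_probs):
    """token_log_probs: log p(x_i) for each of the N tokens."""
    return np.exp(-np.mean(token_log_probs))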

LoRA

\[W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}\]
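
A minimal PyTorch sketch of a LoRA-wrapped linear layer; r and alpha values are illustrative, and the alpha/r scaling is the common convention rather than part of the formula above:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W_0 x + (alpha/r) * B A x, with W_0 frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W_0
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A: (r, k)
        self.B = nn.Parameter(torch.zeros(d, r))         # B = 0, so W = W_0 at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale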

RLHF (Bradley-Terry)

\[P(y_w > y_l) = \sigma(r(x, y_w) - r(x, y_l))\]

DPO Loss

\[\mathcal{L}_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]\]
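
A sketch of this loss given summed token log-probs under the policy and a frozen reference model (argument names are illustrative):

import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_*: sequence log-probs of chosen (w) and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()  # push the policy to prefer y_w over y_l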

QPS Formula

\[\text{QPS} = \frac{\text{DAU} \times \text{Requests per User}}{86400}\]
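
A back-of-the-envelope helper (the 3× peak factor is the rule of thumb used later in this guide):

def capacity(dau, req_per_user, peak_factor=3.0):
    avg_qps = dau * req_per_user / 86_400  # seconds per day
    return avg_qps, avg_qps * peak_factor

avg, peak = capacity(10_000_000, 10)       # ~1157 avg QPS, ~3472 peak QPS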

3. What to choose? (Decision Guide 2026)

Positional Encoding

| Need | Choice |
|---|---|
| Production LLM | RoPE (LLaMA, Mistral, Qwen) |
| Max extrapolation | ALiBi (1024→8192+) |
| Simple implementation | ALiBi |

Attention Optimization

| Scenario | Choice |
|---|---|
| Memory-constrained | MQA (8× less KV) |
| Balance quality/speed | GQA (<1% quality loss) |
| Long context (1M+) | Ring Attention or Star Attention |
| Maximum speed | FlashAttention-3 (H100) |

Inference Engine

| Workload | Choice |
|---|---|
| Simple serving | vLLM |
| Agent workflows | SGLang (3× faster) |
| NVIDIA-only | TensorRT-LLM |
| CPU/Mobile | llama.cpp (GGUF) |

Quantization

| Hardware | Method |
|---|---|
| NVIDIA GPU | AWQ or GPTQ (~98% quality) |
| CPU/Edge | GGUF Q4_K_M (~92% quality) |
| H100+ | FP8 (native support) |

Alignment Method

| Scenario | Method |
|---|---|
| Research / quick iteration | DPO (2-3× faster) |
| Production | RLHF (PPO or GRPO) |
| Reasoning models | GRPO (DeepSeek-R1) |
| Cost-constrained | RLAIF (70% cheaper) |

SSM vs Transformer

| Sequence length | Choice |
|---|---|
| Short (<1K) | Transformer |
| Medium (1K-16K) | Hybrid (Jamba 1:7) |
| Very long (>16K) | SSM or Hybrid |

Long Context vs RAG

| Factor | Long Context | RAG |
|---|---|---|
| Cost | 100× more | Baseline |
| Fresh data | Retrain needed | Instant |
| Use case | Reasoning over docs | Retrieval |

Best practice: Use BOTH — RAG for retrieval, long context for reasoning.


4. Top 20 Interview Questions

Architecture

  1. Explain the self-attention mechanism

    • \(QK^T\) dot product, softmax, multiply by \(V\)
    • \(O(T^2)\) complexity, captures all pairwise interactions

  2. Why RoPE over learned positions?

    • Relative position encoding through rotation
    • Extrapolates 2-4× beyond training length
    • Used in all modern LLMs (LLaMA, Mistral)

  3. Compare MQA, GQA, MHA

    • MHA: full heads, max quality
    • MQA: 1 K/V head, 8× less memory, ~3% quality loss
    • GQA: grouped heads, <1% loss, best balance

Training

  1. What is gradient accumulation? (see the sketch after this list)

    • Simulate a larger batch by accumulating gradients
    • effective_batch = batch_size × accumulation_steps

  2. Explain LoRA

    • Low-rank adaptation: \(W = W_0 + BA\)
    • <1% parameters, same quality as full fine-tuning

  3. Why RMSNorm over LayerNorm?

    • 15-25% faster (no mean subtraction)
    • Same or better quality
    • Standard in all 2025-2026 LLMs
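
A minimal gradient-accumulation loop in PyTorch, assuming model, optimizer, loader, and loss_fn already exist:

accum_steps = 8                                # effective_batch = batch_size × accum_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so summed grads match one big batch
    loss.backward()                            # gradients accumulate in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one optimizer step per virtual batch
        optimizer.zero_grad()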

Inference

  1. What is KV cache? (see the sketch after this list)

    • Store key/value tensors from previous tokens
    • Avoids recomputation in autoregressive generation
    • Memory: ~10GB for Llama 70B @ 4K context

  2. Explain speculative decoding

    • Draft model generates tokens
    • Target model verifies in parallel
    • 1.5-2.5× speedup (depends on acceptance rate)

  3. Compare quantization methods

    • AWQ/GPTQ: ~98% quality, GPU
    • GGUF: ~92% quality, CPU
    • FP8: native on H100+, best quality
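
A toy single-head decode step showing what the KV cache saves; projection matrices and shapes are illustrative:

import numpy as np

def decode_step(x_t, W_q, W_k, W_v, cache):
    """One autoregressive step: project only the new token, reuse cached K/V."""
    q = x_t @ W_q
    cache["K"].append(x_t @ W_k)  # O(1) new work instead of re-projecting all T tokens
    cache["V"].append(x_t @ W_v)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

cache = {"K": [], "V": []}        # grows linearly with generated length (the ~10GB above)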

Alignment

  1. Explain RLHF pipeline

    • SFT → Reward Model training → PPO fine-tuning
    • Human preferences → Bradley-Terry → RM loss
  2. DPO vs RLHF?

    • DPO skips reward model
    • 2-3× faster, simpler
    • Slightly lower quality on complex tasks
  3. What is GRPO? (see the sketch after this list)

    • Group Relative Policy Optimization
    • DeepSeek-R1 method
    • No value function, 50% less memory
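
A toy sketch of GRPO's group-relative advantages from item 3: sample G answers per prompt, score them, and z-score rewards within the group instead of learning a value function:

import numpy as np

def grpo_advantages(rewards):
    """rewards: scores of G sampled answers to the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # the baseline comes from the group itself

grpo_advantages([1.0, 0.0, 0.0, 1.0])         # correct answers get positive advantage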

System Design

  1. Design ChatGPT at scale

    • Load balancer → vLLM/SGLang → Semantic cache
    • Model routing for cost optimization
    • Streaming responses, rate limiting
  2. QPS calculation

    • \(\text{QPS} = \text{DAU} \times \text{Requests} / 86400\)
    • Example: 10M DAU × 10 req = 1157 QPS, peak 3500
  3. Semantic caching (see the sketch after this list)

    • Embed query, check similarity before LLM call
    • 50-80% hit rate, 70-90% cost savings
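
A toy semantic cache for item 3; a real system would use a vector DB with an ANN index instead of a linear scan, and the 0.92 threshold is an illustrative assumption:

import numpy as np

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.keys, self.values, self.threshold = [], [], threshold

    def get(self, query_emb):
        for k, v in zip(self.keys, self.values):
            sim = k @ query_emb / (np.linalg.norm(k) * np.linalg.norm(query_emb))
            if sim >= self.threshold:
                return v                     # cache hit: skip the LLM call entirely
        return None                          # miss: call the LLM, then put() the result

    def put(self, query_emb, response):
        self.keys.append(query_emb)
        self.values.append(response)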

Advanced

  1. FlashAttention optimization (see the sketch after this list)

    • Block-wise computation, never materialize the \(T \times T\) matrix
    • \(O(T)\) memory instead of \(O(T^2)\)
    • FA-3 on H100: 6-16× speedup
  2. MoE load balancing

    • Auxiliary loss: penalty for imbalance
    • Loss-free (DeepSeek V3): dynamic bias
    • SIMBAL: 36% faster convergence
  3. Test-time compute scaling

    • o1 approach: more inference = better reasoning
    • Methods: parallel sampling, tree search, MCTS
    • Key: 1B + scaling > 405B without scaling
  4. RAG vs Long Context

    • RAG: cheaper, fresher data
    • Long Context: simpler, more expensive
    • Use both: RAG retrieve → LC reason
  5. How to handle prompt injection?

    • Multi-layer defense: input filter → intent classifier → output filter
    • Constitutional AI training for robustness
    • Continuous monitoring and red teaming
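
A NumPy sketch of the online-softmax idea behind FlashAttention (item 1): stream K/V in blocks while keeping a running max and denominator, so the full T×T score matrix never exists. Real kernels also tile Q and fuse everything on-chip; this only illustrates the math.

import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Same result as softmax(Q K^T / sqrt(d)) V, but K/V processed block by block."""
    T, d = Q.shape
    out = np.zeros((T, V.shape[-1]))
    m = np.full(T, -np.inf)                  # running row-wise max
    l = np.zeros(T)                          # running softmax denominator
    for s in range(0, T, block):
        S = Q @ K[s:s+block].T / np.sqrt(d)  # (T, block) scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)            # rescale what was accumulated so far
        p = np.exp(S - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ V[s:s+block]
        m = m_new
    return out / l[:, None]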

5. Code red flags (Code Review)

Common mistakes

# ❌ WRONG: Hardcoded secrets
API_KEY = "sk-12345..."

# ✅ RIGHT: Environment variables
import os
API_KEY = os.environ.get("API_KEY")

# ❌ WRONG: No error handling
response = model.generate(prompt)

# ✅ RIGHT: Graceful degradation
try:
    response = model.generate(prompt)
except Exception as e:
    logger.error(f"Generation failed: {e}")
    response = fallback_response()

# ❌ WRONG: SQL injection
query = f"SELECT * FROM users WHERE id = {user_id}"

# ✅ RIGHT: Parameterized query
query = "SELECT * FROM users WHERE id = %s"
cursor.execute(query, (user_id,))

ML-Specific Anti-Patterns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ❌ Data leakage
X_train, X_test = train_test_split(data, test_size=0.2)
scaler = StandardScaler()
scaler.fit(data)  # Fitting on ALL data: test statistics leak into training!

# ✅ Correct order: fit the scaler on the training split only
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# ❌ Not handling imbalanced data
model.fit(X, y)  # 99% class 0

# ✅ Handle imbalance (in scikit-learn, class_weight is a constructor argument, not a fit argument)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
model.fit(X, y)

6. Behavioral Questions (STAR)

AI Safety Story

S: Working on content moderation system, noticed model producing false positives on legitimate content.

T: Reduce false positive rate while maintaining safety.

A: Analyzed failure cases, found pattern in edge cases, implemented multi-tier system (regex → ML → human review), added feedback loop.

R: False positive rate dropped from 15% to 3%, user complaints reduced by 80%.

Failure Story

S: Deployed model that performed well in testing but failed in production.

T: Understand root cause and fix deployment.

A: Investigated logs, found distribution shift (production data different from test), implemented monitoring, retrained with production data, added A/B testing.

R: Model now performs consistently, monitoring catches drift early.

Conflict Resolution

S: Disagreement with PM about model complexity vs latency budget.

T: Find solution that meets both quality and latency requirements.

A: Set up experiments with different model sizes, measured latency vs quality tradeoff, presented data-driven options, proposed model routing (simple queries → small model, complex → large).

R: Implemented tiered approach, reduced p95 latency by 60%, maintained quality.


7. Preparation resources

Blogs (read regularly)

  • Lilian Weng — lilianweng.github.io (attention, RLHF, transformers)
  • Sebastian Raschka — magazine.sebastianraschka.com (training, fine-tuning)
  • Chip Huyen — chiphuyen.com (MLOps, engineering)
  • Eugene Yan — eugeneyan.com (applied ML patterns)

Papers (2025-2026 key reads)

  • DeepSeek-R1 (GRPO)
  • FlashAttention-3
  • Mamba-2 (SSM)
  • Ring Attention (10M+ context)
  • EAGLE-3 (Speculative Decoding)

Practice

  • Implement attention from scratch
  • Implement BPE tokenizer
  • Implement LoRA fine-tuning
  • Implement KV cache with PagedAttention
  • Build RAG pipeline with reranking

8. Pre-interview checklist

Technical

  • Attention formula from memory, no hints
  • Explain RoPE in 3 different ways
  • Draw the RLHF pipeline
  • Compute QPS for a given DAU
  • Name 3 quantization methods and their trade-offs
  • Explain speculative decoding in 30 seconds

System Design

  • Framework: RESHADED
  • Know latency budgets
  • Caching strategies
  • Database choices by use case
  • Monitoring metrics

Behavioral

  • 3 STAR stories ready
  • AI safety perspective
  • Failure story with lessons learned
  • Why this company / role

Common misconceptions

Misconception: 'LoRA trains <1% of the parameters, so quality must be much worse'

With a well-chosen rank (r=16-64), LoRA reaches 95-100% of full fine-tuning quality on most tasks. The key factor is not the parameter count but which layers are adapted (the Q, K, V, O projections). QLoRA (4-bit base + LoRA) adds further memory savings with minimal quality loss.

Misconception: 'DPO is always better than RLHF because it is simpler'

DPO is 2-3× faster and simpler to implement, but it trails RLHF/GRPO on complex reasoning tasks. DeepSeek-R1 uses GRPO (not DPO) precisely because reasoning models need exploration through sampling. DPO is for quick experiments; GRPO/PPO is for production reasoning.

Misconception: 'Semantic caching saves 70-90%, so it is always worth deploying'

A 50-80% hit rate is reached only with predictable query patterns (customer support, FAQ). For creative/reasoning workloads the hit rate drops to 10-20%, and the embedding + vector search overhead (15-30ms) can increase latency without any real savings. Analyze the query distribution before deploying.


Interview: quick readiness check

Question: Explain self-attention in 30 seconds

❌ "Self-attention multiplies the query by the key, then by the value"

✅ "Attn(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. Each token computes a dot product with every other token, which gives O(T^2) complexity and lets the model capture dependencies at any distance. FlashAttention removes the memory bottleneck via tiling: O(T) memory instead of O(T^2), with a 2-4× speedup."

Question: Which alignment method would you pick for a reasoning model?

❌ "DPO, because it is simpler and faster"

✅ "GRPO (Group Relative Policy Optimization), as in DeepSeek-R1. Reasons: (1) it needs no value function, saving 50% of memory; (2) group-relative advantages from sampling 4-16 answers per prompt give the exploration that reasoning requires; (3) it is more stable than PPO on long chains of thought. DPO is fine for quick experiments, but reasoning needs exploration."

Question: Compute QPS for a system with 10M DAU

❌ "A lot; you need a GPU cluster"

✅ "QPS = DAU × Requests / 86400 = 10M × 10 / 86400 = 1157 QPS average, ~3500 QPS peak (3× factor). With semantic caching (50-80% hit rate) the real LLM load is 700-1750 QPS. With a 200ms latency budget and ~100 tokens/s throughput per GPU, that is roughly 35-175 GPU instances behind a load balancer."


Sources

All 10 synthesis cheat sheets + 89 source files (including Phase 5 additions).