LLM Interview Preparation Master Guide¶
~9 minute read
Prerequisites: Topic cross-reference map | Alignment & RLHF cheat sheet | ML System Design cheat sheet
Type: synthesis / master guide. Date: February 2026
A typical LLM interview covers 7 major areas: inference (KV cache, speculative decoding, quantization), architecture (attention, MoE, SSM), training (distributed, LoRA, alignment), RAG, system design, deep learning, and safety. Each area brings 3-5 questions at an expected depth of formulas from memory, concrete numbers (FlashAttention 2-4×, MQA ~8× memory reduction, speculative decoding 1.5-2.5× real-world), and trade-off analysis. This master guide is the single entry point, with formulas, decision guides, and the top 20 questions distilled from 89 synthesized sources.
Quick Topic Access¶
| Area | Key Topics | Cheat Sheet |
|---|---|---|
| LLM Inference | KV Cache, Speculative Decoding, Quantization | inference |
| RAG & Vector DB | Embeddings, Retrieval, Reranking | rag |
| ML Training | Distributed, Optimization, Regularization | training |
| Deep Learning | Normalization, Activations, SSM, Diffusion | deep-learning |
| System Design | Capacity, Patterns, Observability | system-design |
| Attention | RoPE, FlashAttention, MQA/GQA | attention |
| Alignment | RLHF, DPO, GRPO, RLAIF | alignment |
1. Key Numbers for the Interview¶
LLM Architecture¶
| Parameter | Value | Context |
|---|---|---|
| Attention complexity | \(O(T^2)\) | Full self-attention |
| FlashAttention speedup | 2-4× | FA-3: 6-16× на H100 |
| MQA memory reduction | ~8× | 1 K/V head vs. all heads |
| GQA quality loss | <1% | 8 groups vs full |
| RoPE extrapolation | 2-4× | Beyond training length |
Inference¶
| Metric | Value | Context |
|---|---|---|
| KV cache per request | ~10GB | Llama 70B @ 4K context |
| Speculative decoding | 1.5-2.5× | Real-world speedup |
| 4-bit quantization | ~75% memory savings | ~97-98% quality retained |
| PagedAttention waste | <4% | vs 60-80% static |
Training¶
| Metric | Value | Context |
|---|---|---|
| Mixed precision speedup | 2-3× | FP16/BF16 |
| Gradient accumulation | Virtual batch | Memory-efficient larger effective batch |
| LoRA parameters | <1% | Near-full fine-tuning quality |
| EMA smoothing | 0.999 | Typical decay |
System Design¶
| Metric | Formula | Example |
|---|---|---|
| QPS | DAU × requests/user/day ÷ 86400 | 10M DAU × 10 req = 1157 QPS |
| Cache hit rate | 50-80% | Semantic caching |
| Availability 99.9% | 8.76h downtime/year | SLA target |
2. Formulas to Memorize¶
Attention¶
\[ \text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]
(a NumPy sketch follows at the end of this section)
RoPE Rotation¶
\[ \begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d} \]
LayerNorm vs RMSNorm¶
\[ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \]
Softmax Temperature¶
\[ p_i = \frac{\exp(z_i/\tau)}{\sum_j \exp(z_j/\tau)} \]
Cross-Entropy Loss¶
\[ \mathcal{L}_{\text{CE}} = -\sum_i y_i \log \hat{y}_i \]
KL Divergence¶
\[ D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} \]
Perplexity¶
\[ \text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{<t})\right) \]
LoRA¶
\[ W = W_0 + \frac{\alpha}{r} BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k) \]
RLHF (Bradley-Terry)¶
\[ P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) \]
DPO Loss¶
\[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \]
QPS Formula¶
\[ \text{QPS} = \frac{\text{DAU} \times \text{requests per user per day}}{86400} \]
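To make the attention formula at the top of this section concrete, here is a minimal single-head NumPy sketch (no masking or batching; the toy sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attn(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (T, T) matrix -- the O(T^2) term
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
T, d_k = 8, 64                                # toy sizes
Q, K, V = (rng.standard_normal((T, d_k)) for _ in range(3))
out = attention(Q, K, V)                      # (T, d_k)
```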
3. What to Choose? (Decision Guide 2026)¶
Positional Encoding¶
| Need | Choice |
|---|---|
| Production LLM | RoPE (LLaMA, Mistral, Qwen) |
| Max extrapolation | ALiBi (1024→8192+) |
| Simple implementation | ALiBi |
Attention Optimization¶
| Scenario | Choice |
|---|---|
| Memory-constrained | MQA (8× less KV) |
| Balance quality/speed | GQA (<1% quality loss) |
| Long context (1M+) | Ring Attention or Star Attention |
| Maximum speed | FlashAttention-3 (H100) |
Inference Engine¶
| Workload | Choice |
|---|---|
| Simple serving | vLLM |
| Agent workflows | SGLang (3× faster) |
| NVIDIA-only | TensorRT-LLM |
| CPU/Mobile | llama.cpp (GGUF) |
Quantization¶
| Hardware | Method |
|---|---|
| NVIDIA GPU | AWQ or GPTQ (~98% quality) |
| CPU/Edge | GGUF Q4_K_M (~92% quality) |
| H100+ | FP8 (native support) |
Alignment Method¶
| Scenario | Method |
|---|---|
| Research/Quick | DPO (2-3× faster) |
| Production | RLHF (PPO or GRPO) |
| Reasoning models | GRPO (DeepSeek-R1) |
| Cost-constrained | RLAIF (70% cheaper) |
SSM vs Transformer¶
| Sequence Length | Choice |
|---|---|
| Short (<1K) | Transformer |
| Medium (1K-16K) | Hybrid (Jamba 1:7) |
| Very long (>16K) | SSM or Hybrid |
Long Context vs RAG¶
| Factor | Long Context | RAG |
|---|---|---|
| Cost | 100× more | Baseline |
| Fresh data | Retrain needed | Instant |
| Use case | Reasoning over docs | Retrieval |
Best practice: Use BOTH — RAG for retrieval, long context for reasoning.
4. Top 20 Interview Questions¶
Architecture¶
1. Explain the self-attention mechanism
   - \(QK^T\) dot product, softmax, multiply by \(V\)
   - \(O(T^2)\) complexity, captures all pairwise interactions
2. Why RoPE over learned positions? (see the sketch after this list)
   - Relative position encoding through rotation
   - Extrapolates 2-4× beyond the training length
   - Used in all modern LLMs (LLaMA, Mistral)
3. Compare MQA, GQA, MHA
   - MHA: full heads, max quality
   - MQA: 1 K/V head, 8× less memory, ~3% quality loss
   - GQA: grouped heads, <1% loss, best balance
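A minimal NumPy sketch of the rotation RoPE applies to query/key channels (the dimensions and usage check below are illustrative; d must be even):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate consecutive channel pairs of x (seq_len x d, d even) by position-dependent angles."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]            # (T, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)    # theta_i = base^(-2i/d), shape (d/2,)
    angles = pos * freqs                         # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]              # even/odd channel pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin           # standard 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.ones((4, 8))
print(rope(x)[0])   # position 0 has zero angles, so the first row is unchanged
```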
Training¶
4. What is gradient accumulation? (see the sketch after this list)
   - Simulate a larger batch by accumulating gradients
   - effective_batch = batch_size × accumulation_steps
5. Explain LoRA
   - Low-rank adaptation: \(W = W_0 + BA\)
   - <1% parameters, same quality as full fine-tuning
6. Why RMSNorm over LayerNorm?
   - 15-25% faster (no mean subtraction)
   - Same or better quality
   - Standard in all 2025-2026 LLMs
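A minimal runnable PyTorch sketch of gradient accumulation (the toy model and random batches are placeholders):

```python
import torch
from torch import nn

model = nn.Linear(16, 1)                              # toy stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accumulation_steps = 4                                # effective_batch = batch_size * accumulation_steps

optimizer.zero_grad()
for step in range(16):
    x, y = torch.randn(8, 16), torch.randn(8, 1)      # micro-batch of 8
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so summed grads average over the effective batch
    loss.backward()                                   # gradients accumulate in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                              # one update per effective batch of 32
        optimizer.zero_grad()
```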
Inference¶
7. What is a KV cache? (see the sketch after this list)
   - Store key/value tensors from previous tokens
   - Avoids recomputation in autoregressive generation
   - Memory: ~10GB for Llama 70B @ 4K context
8. Explain speculative decoding
   - Draft model generates tokens
   - Target model verifies them in parallel
   - 1.5-2.5× speedup (depends on acceptance rate)
9. Compare quantization methods
   - AWQ/GPTQ: ~98% quality, GPU
   - GGUF: ~92% quality, CPU
   - FP8: native on H100+, best quality
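A toy NumPy sketch of the KV-cache idea: past keys/values are computed once and reused, so each decode step attends with only the new token's query. In a real model q/k/v come from projecting the token's hidden state; random vectors stand in for them here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 64
rng = np.random.default_rng(0)
k_cache, v_cache = [], []                # grows by one entry per generated token

for step in range(5):                    # autoregressive decode loop
    q = rng.standard_normal(d_k)         # query for the newly generated token
    k = rng.standard_normal(d_k)         # its key/value, computed once...
    v = rng.standard_normal(d_k)
    k_cache.append(k)                    # ...then cached, never recomputed
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn_out = softmax(q @ K.T / np.sqrt(d_k)) @ V   # attend over all cached tokens
```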
Alignment¶
10. Explain the RLHF pipeline
    - SFT → reward model training → PPO fine-tuning
    - Human preferences → Bradley-Terry → RM loss
11. DPO vs RLHF?
    - DPO skips the reward model
    - 2-3× faster, simpler
    - Slightly lower quality on complex tasks
12. What is GRPO? (see the sketch after this list)
    - Group Relative Policy Optimization
    - DeepSeek-R1 method
    - No value function, 50% less memory
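The core of GRPO's "no value function" point is that advantages are normalized within a group of responses sampled for the same prompt. A minimal sketch (the reward values are made up):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards within one group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)   # group statistics replace a learned value baseline

# e.g. 8 sampled completions for one prompt, scored by a reward model
print(group_relative_advantages([0.1, 0.9, 0.4, 0.4, 0.7, 0.2, 0.8, 0.5]))
```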
System Design¶
13. Design ChatGPT at scale
    - Load balancer → vLLM/SGLang → semantic cache
    - Model routing for cost optimization
    - Streaming responses, rate limiting
14. QPS calculation
    - \(\text{QPS} = \text{DAU} \times \text{Requests} / 86400\)
    - Example: 10M DAU × 10 req = 1157 QPS, peak ~3500
15. Semantic caching (see the sketch after this list)
    - Embed the query, check similarity before the LLM call
    - 50-80% hit rate, 70-90% cost savings
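A minimal sketch of the semantic-caching check. The embedding is assumed to come from some external model, the 0.92 threshold is illustrative, and a production system would use a vector index rather than this linear scan:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    def __init__(self, threshold=0.92):   # illustrative similarity threshold
        self.entries = []                 # list of (embedding, response) pairs
        self.threshold = threshold

    def get(self, query_emb):
        """Return a cached response if any stored query is similar enough."""
        for emb, response in self.entries:
            if cosine(query_emb, emb) >= self.threshold:
                return response
        return None                       # cache miss -> call the LLM

    def put(self, query_emb, response):
        self.entries.append((query_emb, response))

cache = SemanticCache()
emb = np.random.default_rng(0).standard_normal(384)   # stand-in for a real embedding
cache.put(emb, "cached answer")
print(cache.get(emb))                    # identical embedding -> similarity 1.0 -> hit
```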
Advanced¶
16. FlashAttention optimization (see the sketch after this list)
    - Block-wise computation, never materialize the \(T \times T\) matrix
    - \(O(T)\) memory instead of \(O(T^2)\)
    - FA-3 on H100: 6-16× speedup
17. MoE load balancing
    - Auxiliary loss: penalty for imbalance
    - Loss-free (DeepSeek V3): dynamic bias
    - SIMBAL: 36% faster convergence
18. Test-time compute scaling
    - o1 approach: more inference compute = better reasoning
    - Methods: parallel sampling, tree search, MCTS
    - Key result: 1B + scaling > 405B without scaling
19. RAG vs Long Context
    - RAG: cheaper, fresher data
    - Long Context: simpler, more expensive
    - Use both: RAG to retrieve → long context to reason
20. How do you handle prompt injection?
    - Multi-layer defense: input filter → intent classifier → output filter
    - Constitutional AI training for robustness
    - Continuous monitoring and red teaming
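For the FlashAttention item above, here is a pedagogical NumPy sketch of the online-softmax trick for a single query row: running max and normalizer are updated tile by tile, so the full \(T \times T\) score matrix is never built. The real kernel fuses this into GPU SRAM tiles; this only shows the math:

```python
import numpy as np

def tiled_attention_row(q, K, V, block=32):
    """One query row with FlashAttention-style online softmax over K/V tiles."""
    d_k = q.shape[-1]
    m, l = -np.inf, 0.0                                  # running max and softmax normalizer
    o = np.zeros(V.shape[-1])                            # running unnormalized output
    for start in range(0, K.shape[0], block):
        s = q @ K[start:start + block].T / np.sqrt(d_k)  # scores for this tile only
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)                         # rescale accumulators to the new max
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        o = o * corr + p @ V[start:start + block]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(64), rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
s = q @ K.T / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V   # full-matrix reference
assert np.allclose(tiled_attention_row(q, K, V), ref)         # tiled result matches exactly
```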
5. Code Red Flags (Code Review)¶
Common Mistakes¶
```python
import logging
import os

logger = logging.getLogger(__name__)

# ❌ WRONG: Hardcoded secrets
API_KEY = "sk-12345..."

# ✅ RIGHT: Environment variables
API_KEY = os.environ.get("API_KEY")

# ❌ WRONG: No error handling
response = model.generate(prompt)

# ✅ RIGHT: Graceful degradation
try:
    response = model.generate(prompt)
except Exception as e:
    logger.error(f"Generation failed: {e}")
    response = fallback_response()

# ❌ WRONG: SQL injection via f-string
query = f"SELECT * FROM users WHERE id = {user_id}"

# ✅ RIGHT: Parameterized query
query = "SELECT * FROM users WHERE id = %s"
cursor.execute(query, (user_id,))
```
ML-Specific Anti-Patterns¶
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# ❌ Data leakage: scaler fitted on ALL data, so test-set statistics leak into training
X_train, X_test = train_test_split(data, test_size=0.2)
scaler.fit(data)

# ✅ Correct order: fit on train only, then transform both splits
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# ❌ Ignoring imbalanced data (e.g. 99% class 0)
model.fit(X, y)

# ✅ Handle imbalance; in sklearn class_weight is a constructor argument, not a fit() argument
model = LogisticRegression(class_weight='balanced')
model.fit(X, y)
```
6. Behavioral Questions (STAR)¶
AI Safety Story¶
S: Working on content moderation system, noticed model producing false positives on legitimate content.
T: Reduce false positive rate while maintaining safety.
A: Analyzed failure cases, found pattern in edge cases, implemented multi-tier system (regex → ML → human review), added feedback loop.
R: False positive rate dropped from 15% to 3%, user complaints reduced by 80%.
Failure Story¶
S: Deployed model that performed well in testing but failed in production.
T: Understand root cause and fix deployment.
A: Investigated logs, found distribution shift (production data different from test), implemented monitoring, retrained with production data, added A/B testing.
R: Model now performs consistently, monitoring catches drift early.
Conflict Resolution¶
S: Disagreement with PM about model complexity vs latency budget.
T: Find solution that meets both quality and latency requirements.
A: Set up experiments with different model sizes, measured latency vs quality tradeoff, presented data-driven options, proposed model routing (simple queries → small model, complex → large).
R: Implemented tiered approach, reduced p95 latency by 60%, maintained quality.
7. Preparation Resources¶
Blogs (read regularly)¶
- Lilian Weng — lilianweng.github.io (attention, RLHF, transformers)
- Sebastian Raschka — magazine.sebastianraschka.com (training, fine-tuning)
- Chip Huyen — chiphuyen.com (MLOps, engineering)
- Eugene Yan — eugeneyan.com (applied ML patterns)
Papers (2025-2026 key reads)¶
- DeepSeek-R1 (GRPO)
- FlashAttention-3
- Mamba-2 (SSM)
- Ring Attention (10M+ context)
- EAGLE-3 (Speculative Decoding)
Practice¶
- Implement attention from scratch
- Implement BPE tokenizer
- Implement LoRA fine-tuning
- Implement KV cache with PagedAttention
- Build RAG pipeline with reranking
8. Pre-Interview Checklist¶
Technical¶
- Write the attention formula without prompting
- Explain RoPE in 3 different ways
- Draw the RLHF pipeline
- Compute QPS for a given DAU
- Name 3 quantization methods and their trade-offs
- Explain speculative decoding in 30 seconds
System Design¶
- Framework: RESHADED
- Know latency budgets
- Caching strategies
- Database choices by use case
- Monitoring metrics
Behavioral¶
- 3 STAR stories ready
- AI safety perspective
- Failure story with lessons learned
- Why this company / role
Common Misconceptions¶
Misconception: "LoRA uses <1% of the parameters, so quality must be much worse"
At an appropriate rank (r=16-64), LoRA reaches 95-100% of full fine-tuning quality on most tasks. The key factor is not the parameter count but which layers are adapted (the Q, K, V, O projections). QLoRA (4-bit base + LoRA) adds further memory savings with minimal quality loss.
Misconception: "DPO is always better than RLHF because it is simpler"
DPO is 2-3× faster and simpler to implement, but it trails RLHF/GRPO on complex reasoning tasks. DeepSeek-R1 uses GRPO (not DPO) precisely because reasoning models need exploration through sampling. DPO suits quick experiments; GRPO/PPO suit production reasoning.
Misconception: "Semantic caching saves 70-90%, so it is always worth deploying"
A 50-80% hit rate is reached only with predictable query patterns (customer support, FAQ). On creative/reasoning workloads the hit rate drops to 10-20%, and the embedding + vector search overhead (15-30ms) can add latency without any real savings. Analyze the query distribution before deploying.
Interview: Express Readiness Check¶
Question: Explain self-attention in 30 seconds¶
❌ Weak: "Self-attention multiplies the query by the key, then by the value"
✅ Strong: "Attn(Q,K,V) = softmax(QK^T / sqrt(d_k)) · V. Each token computes a dot product with every other token, giving O(T^2) complexity, which models dependencies at any distance. FlashAttention removes the memory bottleneck via tiling: O(T) memory instead of O(T^2), a 2-4× speedup."
Question: Which alignment method would you pick for a reasoning model?¶
❌ Weak: "DPO, it is simpler and faster"
✅ Strong: "GRPO (Group Relative Policy Optimization), as in DeepSeek-R1. Reasons: (1) no value function, so 50% memory savings; (2) group-relative advantages from sampling 4-16 responses per prompt provide the exploration that reasoning requires; (3) more stable than PPO on long chains of thought. DPO is fine for quick experiments, but reasoning needs exploration."
Question: Estimate QPS for a system with 10M DAU¶
❌ Weak: "A lot, you need a GPU cluster"
✅ Strong: "QPS = DAU × requests / 86400 = 10M × 10 / 86400 = 1157 QPS average, peak ~3500 QPS (3× factor). With semantic caching (50-80% hit rate) the real LLM load is 700-1750 QPS. With a 200ms latency budget and ~100 tokens/s throughput per GPU, that means roughly 35-175 GPU instances behind the load balancer."
Sources¶
All 10 synthesis cheat sheets + 89 source files (including Phase 5 additions).