
ML Practice: Speed Run Sheet (Layer 7)

~21 min read

60-second refresher before the interview. Key formulas, common mistakes, one-liner answers. Updated: 2026-02-11


ML Math Quick Reference

Calculus

Gradient Descent:  w = w - lr * dL/dw

Sigmoid derivative:  sigma'(x) = sigma(x) * (1 - sigma(x))
ReLU derivative:     ReLU'(x) = 1 if x > 0 else 0
Tanh derivative:     tanh'(x) = 1 - tanh^2(x)

Softmax:             softmax(x_i) = exp(x_i) / sum(exp(x_j))
Cross-entropy:       CE = -sum(y * log(p))

Chain rule:          dL/dw = dL/da * da/dz * dz/dw

One-liners: - Q: "Why ReLU over sigmoid?" → A: No vanishing gradient for positive values, sparse activation, faster computation. - Q: "Why softmax for classification?" → A: Outputs valid probability distribution (sums to 1, all positive).

Linear Algebra

L2 Norm:            ||x||_2 = sqrt(sum(x_i^2))
Cosine Similarity:  cos(a,b) = (a · b) / (||a|| * ||b||)

Softmax stable:     softmax(x) = softmax(x - max(x))

SVD:                X = U * S * V^T
PCA:                X' = X @ V[:,:k]  (top-k components)

One-liners: - Q: "When to use cosine vs L2?" → A: Cosine for similarity regardless of magnitude, L2 for distance. - Q: "Why subtract max in softmax?" → A: Numerical stability, prevents overflow.

Statistics

Mean:               mu = (1/n) * sum(x_i)
Variance:           var = (1/n) * sum((x_i - mu)^2)
Std Dev:            sigma = sqrt(var)

Correlation:        r = cov(X,Y) / (sigma_X * sigma_Y)

P-value:            P(obs | H0) — probability of result under null hypothesis
Confidence Interval: mu +/- z * (sigma / sqrt(n))

T-test:             t = (x_bar - mu) / (s / sqrt(n))
Chi-square:         chi2 = sum((O - E)^2 / E)

One-liners: - Q: "What is p-value?" → A: Probability of observing data this extreme if H0 is true. Low p-value = reject H0. - Q: "When to use t-test vs z-test?" → A: t-test when n < 30 or population variance unknown.

Information Theory

Entropy:            H(X) = -sum(p(x) * log(p(x)))
Cross-entropy:      H(P,Q) = -sum(p(x) * log(q(x)))
KL Divergence:      KL(P||Q) = sum(p(x) * log(p(x)/q(x)))
Information Gain:   IG = H(parent) - weighted_avg(H(children))
Gini:               Gini = 1 - sum(p_i^2)

One-liners: - Q: "Gini vs Entropy?" → A: Both work similarly. Gini faster (no log), Entropy more interpretable (bits). - Q: "What is KL divergence?" → A: How much information lost when Q approximates P. Not symmetric.


Classical ML Quick Reference

Loss Functions

MSE:                L = (1/n) * sum((y - y_hat)^2)
MAE:                L = (1/n) * sum(|y - y_hat|)
Binary CE:          L = -[y*log(p) + (1-y)*log(1-p)]
Hinge (SVM):        L = max(0, 1 - y * f(x))

Metrics

Precision:          TP / (TP + FP)
Recall:             TP / (TP + FN)
F1:                 2 * P * R / (P + R)
Accuracy:           (TP + TN) / Total

ROC-AUC:            Area under TPR vs FPR curve
PR-AUC:             Area under Precision vs Recall curve

One-liners: - Q: "Precision vs Recall?" → A: Precision = avoid false positives, Recall = catch all positives. - Q: "When ROC-AUC vs PR-AUC?" → A: PR-AUC for imbalanced data, ROC-AUC for balanced.

Regularization

L1 (Lasso):         L + lambda * sum(|w|)
L2 (Ridge):         L + lambda * sum(w^2)
Elastic Net:        L + lambda1 * L1 + lambda2 * L2
Dropout:            During training, randomly zero p% of activations

One-liners: - Q: "L1 vs L2?" → A: L1 produces sparse weights (feature selection), L2 shrinks all weights smoothly. - Q: "Why Dropout?" → A: Prevents co-adaptation, acts as ensemble averaging.

Tree Algorithms

Decision Tree split:  argmax IG(feature)
Random Forest:         Bagging + random feature subset per split
GBDT:                  F(x) = F_prev(x) + lr * h(x), where h fits residuals
XGBoost:               GBDT + second-order gradients + regularization

One-liners: - Q: "RF vs GBDT?" → A: RF parallel (independent trees), GBDT sequential (corrects errors). GBDT usually better. - Q: "Why random features in RF?" → A: De-correlates trees, reduces variance.


Deep Learning Quick Reference

Optimizers

SGD:                w = w - lr * grad
Momentum:           v = beta*v + grad; w = w - lr*v
RMSprop:            v = beta*v + (1-beta)*grad^2
                    w = w - lr * grad / sqrt(v + eps)
Adam:               m = beta1*m + (1-beta1)*grad
                    v = beta2*v + (1-beta2)*grad^2
                    m_hat = m/(1-beta1^t); v_hat = v/(1-beta2^t)  (bias correction)
                    w = w - lr * m_hat / (sqrt(v_hat) + eps)

One-liners: - Q: "Adam vs SGD?" → A: Adam adaptive, faster convergence. SGD + momentum often better final accuracy with LR scheduling. - Q: "Why beta1=0.9, beta2=0.999?" → A: Exponential moving average of gradient (0.9) and squared gradient (0.999).

Weight Initialization

Xavier:             W ~ N(0, 2/(fan_in+fan_out))  — for tanh/sigmoid
He:                 W ~ N(0, 2/fan_in)           — for ReLU

One-liners: - Q: "Why not zero initialization?" → A: All neurons compute same output, no learning (symmetry). - Q: "He vs Xavier?" → A: He for ReLU (2x variance), Xavier for tanh/sigmoid.

Normalization

BatchNorm:          x_norm = (x - mu_batch) / sqrt(var_batch + eps)
                    y = gamma * x_norm + beta

LayerNorm:          Normalize across features (not batch)
RMSNorm:            x / sqrt(mean(x^2) + eps)  — no mean subtraction

One-liners: - Q: "BatchNorm vs LayerNorm?" → A: BatchNorm = per-feature across batch, LayerNorm = per-sample across features. LayerNorm for Transformers. - Q: "Why gamma, beta?" → A: Learnable scale and shift, restore representation power.

Attention

Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V

Multi-Head: concat(head_1, ..., head_h) @ W_o
where head_i = Attention(Q @ W_q, K @ W_k, V @ W_v)

One-liners: - Q: "Why sqrt(d_k)?" → A: Scales dot product to prevent softmax from having extremely small gradients. - Q: "Why multi-head?" → A: Different heads learn different relationships, richer representations.


LLM Engineering Quick Reference

Tokenization

BPE:                Merge most frequent pairs iteratively
WordPiece:          Merge pairs maximizing likelihood
SentencePiece:      Language-agnostic, works on raw text (no pre-tokenization)

One-liners: - Q: "BPE vs WordPiece?" → A: BPE merges most frequent pairs, WordPiece maximizes likelihood. WordPiece uses ## prefix. - Q: "Why subword?" → A: No OOV, smaller vocabulary, handles morphology.

Decoding

Greedy:             argmax at each step
Beam Search:        Keep top-k sequences at each step
Top-k:              Sample from top-k tokens
Top-p (nucleus):    Sample from smallest set with cumulative prob >= p
Temperature:        logits = logits / T  (T<1 = sharper, T>1 = flatter)

One-liners: - Q: "Top-k vs Top-p?" → A: Top-k fixed number, Top-p adaptive to distribution shape. - Q: "When beam search?" → A: When you want most likely sequence, not diverse generation.

RAG

BM25:               TF-IDF with saturation + length normalization
Dense Retrieval:    similarity(query_emb, doc_emb)
Hybrid:             alpha * BM25 + (1-alpha) * Dense

Reranking:          Cross-encoder (slow, accurate) vs Bi-encoder (fast)

One-liners: - Q: "BM25 vs Dense?" → A: BM25 exact keyword match, Dense semantic similarity. Use hybrid. - Q: "When to rerank?" → A: Rerank top-100 from retrieval for better precision.

LoRA

LoRA:               W' = W + B @ A, where B: d x r, A: r x d
                    Only train A, B (r << d)

QLoRA:              4-bit quantized base + LoRA adapters

One-liners: - Q: "Why LoRA?" → A: Train 0.1% parameters, no catastrophic forgetting, easy to switch adapters. - Q: "LoRA rank choice?" → A: Start with r=8-16. Higher rank = more capacity but overfitting risk.

Quantization

PTQ (Post-Training):  Quantize after training
QAT (Quantization-Aware): Train with quantization simulation
GPTQ:                 One-shot quantization using Hessian
AWQ:                  Activation-aware weight quantization

One-liners: - Q: "INT8 vs FP16?" → A: INT8 2x smaller, faster inference, ~1% accuracy drop acceptable. - Q: "GPTQ vs AWQ?" → A: Both INT4, AWQ faster inference, better accuracy retention.


ML System Design Quick Reference

Model Serving

Latency targets:    P50 < 50ms, P99 < 200ms

Optimization:
1. Batching        — combine requests
2. Quantization    — INT8/INT4
3. Caching         — cache popular predictions
4. Async           — non-blocking inference

One-liners: - Q: "Reduce latency 2x?" → A: Quantization, batching, caching, model distillation.

A/B Testing

Sample size:        n = 16 * sigma^2 / delta^2  (95% CI, 80% power)

Significance:       p-value < 0.05 → reject H0
                    z = (p_A - p_B) / sqrt(pooled * (1-pooled) * (1/n_A + 1/n_B))

One-liners: - Q: "Sample size for A/B test?" → A: n = 16 * p(1-p) / delta^2 per variant. - Q: "When A/B test invalid?" → A: Network effects, temporal effects, sample ratio mismatch.

Drift Detection

PSI < 0.1:          No significant change
PSI 0.1-0.25:       Moderate change
PSI > 0.25:         Significant change (investigate!)

KS-test:            max |CDF_A(x) - CDF_B(x)|

One-liners: - Q: "PSI vs KS-test?" → A: PSI for binned distributions (interpretable), KS for continuous (statistical significance). - Q: "Drift detected. What next?" → A: Investigate cause, evaluate model on new data, consider retraining.

Calibration

Platt Scaling:      P(calibrated) = sigmoid(A * score + B)
Isotonic:           Piecewise constant function
Brier Score:        BS = mean((p - y)^2)  — lower is better

One-liners: - Q: "Platt vs Isotonic?" → A: Platt parametric (2 params), Isotonic non-parametric (needs more data). - Q: "Why calibrate?" → A: When you need accurate probabilities (medical, risk scoring).

RecSys

Two-Tower:          user_emb · item_emb = similarity
Cold Start:         Content-based, popularity, exploration (bandits)

Metrics:            NDCG@k, MRR, CTR Lift

One-liners: - Q: "Two-Tower architecture?" → A: Separate embeddings for user and item, similarity = dot product. - Q: "Cold start solution?" → A: Content features, popularity, bandit exploration, LLM preferences.


AI Agents Quick Reference

ReAct Pattern

Loop:
1. Thought:  Analyze situation
2. Action:   Call tool
3. Observe:  See result
4. Repeat until done

One-liners: - Q: "What is ReAct?" → A: Interleaves reasoning (Thought) with tool use (Action) in a loop. - Q: "Why ReAct over just prompting?" → A: Can use external tools, transparent reasoning, recover from errors.

Framework Comparison

LangGraph:          State machine, production, human-in-loop
AutoGen:            Multi-agent, Microsoft, conversational
CrewAI:             Role-based, simple API, task orchestration

One-liners: - Q: "LangGraph vs AutoGen?" → A: LangGraph = stateful workflows, AutoGen = multi-agent collaboration. - Q: "When CrewAI?" → A: Simple role-based tasks, quick prototyping, less control needed.


Common Mistakes to Avoid

Topic            Mistake                           Correct
Preprocessing    fit_transform on test data        fit on train, transform test
Split            random split for time series      temporal split
Leakage          using future data                 check temporal ordering
Eval             train accuracy only               use validation/test
Regularization   L2 on bias terms                  only on weights
BatchNorm        using batch stats at inference    use running stats
Dropout          using at inference                only during training
Learning Rate    same LR throughout                use scheduling
Softmax          softmax for multi-label           per-class sigmoid
Loss             MSE for classification            cross-entropy

Code Snippets to Memorize

import numpy as np
import torch
from torch.utils.data import DataLoader

# Stable softmax
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# Cross-entropy loss
def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred + 1e-9))

# Accuracy
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

# Mini-batch (dataset is any torch Dataset)
for X_batch, y_batch in DataLoader(dataset, batch_size=32, shuffle=True):
    ...  # forward, loss, backward, optimizer step

# Gradient clipping (call between backward() and optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Linear learning rate warmup (base_lr, current_step, warmup_steps are placeholders)
for param_group in optimizer.param_groups:
    param_group['lr'] = min(base_lr, current_step / warmup_steps * base_lr)

Advanced Topics Quick Reference

Vision Transformers (ViT)

Pipeline: Image → Patches (16x16) → Linear Project → +Pos Embed → Transformer → [CLS] → MLP Head

Patch Embedding: Conv2d(3, 768, kernel=16, stride=16)
Positional: Learnable (196 patches + 1 CLS)
[CLS] Token: Learnable, aggregates global info for classification

Memory: O(N²) attention = 196² = 38K for 224x224
vs CNN: ViT needs more data (JFT-300M), CNN good with ImageNet
Swin Transformer: Window attention = O(N), hierarchical = CNN-like

One-liners: - Q: "ViT vs CNN?" → A: CNN has inductive bias (locality), ViT learns it from data. ViT needs more data but scales better. - Q: "Why [CLS] token?" → A: Learns to aggregate global representation. Alternative is global avg pooling.

Diffusion Models

Forward: x_t = sqrt(α_t) * x_{t-1} + sqrt(1-α_t) * ε  (add noise)
Reverse: Learn ε_θ(x_t, t) to predict noise added

Training Loss: E[||ε - ε_θ(x_t, t)||²]

DDPM: Stochastic, 1000 steps
DDIM: Deterministic, 10-50 steps, same seed = same output

Latent Diffusion (Stable Diffusion):
  Image (512x512) → VAE encode → Latent (64x64) → Diffuse → VAE decode
  16x faster, runs on consumer GPU

One-liners: - Q: "DDPM vs DDIM?" → A: DDPM stochastic/slow, DDIM deterministic/fast. DDIM 20x fewer steps. - Q: "Classifier-Free Guidance?" → A: Interpolate conditional and unconditional predictions. Scale s > 1 = more prompt adherence.

Long Context & KV Cache

KV-Cache: Store K,V for previous tokens, only compute for new token
  Memory: 2 * layers * hidden_size * seq_len * 2 bytes (FP16)
  70B @ 128K context ≈ 100GB KV-cache

RoPE Scaling:
  Linear: Multiply positions by factor (simple, loses detail)
  NTK-aware: Adaptive frequency scaling
  YaRN: NTK + linear + temperature = SOTA

GQA (Grouped-Query Attention): Groups of heads share KV
  Llama-3-70B: 64 query heads, 8 KV heads = 8x memory saving

One-liners: - Q: "Why RoPE over absolute PE?" → A: Extrapolates to longer sequences, relative position naturally, no learned params. - Q: "1M context on 8x A100?" → A: GQA-8 + FlashAttention-3 + KV-cache eviction + offloading to CPU.

Mixture of Experts (MoE)

Architecture: Dense FFN → N Experts (small FFNs) + Router
Routing: Router(x) → probs → Top-K selection → Weighted sum of expert outputs
  y = Σ p_i * E_i(x)  for i in Top-K

Key Trade-offs:
  Total params vs Active params (Mixtral: 46.7B total, ~13B active)
  Compute efficiency vs Memory (need all experts loaded)
  Specialization vs Load balance

Expert Collapse Problem:
  Router selects same experts always → others "die"
  Fix: Load balancing loss, capacity limits, noise injection

Load Balancing Loss:
  L_aux = n * Σ f_i * P_i  (f_i = token fraction, P_i = prob mass)
  Minimize when f_i ≈ P_i (uniform usage)

DeepSeek-V3 Innovations:
  - 256 fine-grained experts (vs 8 in Mixtral)
  - Top-8 routing (vs Top-2)
  - Shared experts (always active) + routed experts
  - Auxiliary-loss-free routing (dynamic bias instead)

One-liners: - Q: "MoE vs Dense inference?" → A: MoE: 3-10x less compute, same quality. But needs all experts in memory. - Q: "Expert collapse?" → A: Router over-selects same experts. Fix: load balancing loss, capacity limits. - Q: "When NOT to use MoE?" → A: Small scale (<7B), single-domain, latency-critical, memory-constrained edge.

AI Agent Memory

4 Memory Types:
  1. Internal Knowledge (model weights) - immutable
  2. Context Window (128K tokens) - current conversation
  3. Short-term Memory (Redis) - session, TTL 24h
  4. Long-term Memory (Vector DB) - persistent across sessions

Episodic: Events ("User asked about X at 3pm")
Semantic: Facts ("User prefers Python over JS")
Procedural: Workflows ("Successful resolution paths")

One-liners: - Q: "Agent vs LLM?" → A: Agent = LLM + Tools + Memory + Planning. - Q: "Memory architecture for customer service?" → A: Redis (session) + Vector DB (episodic) + Knowledge Graph (semantic).


Time Series Quick Reference

Stationarity: Mean, variance, autocovariance constant over time
ADF Test: H0 = non-stationary, reject if p < 0.05

ARIMA(p,d,q):
  p = AR order (PACF cutoff)
  d = Differencing order (ADF until stationary)
  q = MA order (ACF cutoff)

ACF: Direct + indirect correlation at lag k
PACF: Direct correlation only (partial out intermediate)

Seasonal decomposition:
  Additive: Y = Trend + Seasonal + Residual
  Multiplicative: Y = Trend × Seasonal × Residual
  Log transform: Multiplicative → Additive

Cointegration: Two non-stationary series with stationary linear combination
  Test: Engle-Granger (regress, then ADF on residuals)
  Application: Pairs trading

One-liners: - Q: "ARIMA parameter selection?" → A: ADF for d, PACF cutoff for p, ACF cutoff for q. - Q: "Prophet vs ARIMA?" → A: Prophet for multiple seasonalities + holidays, ARIMA for clean univariate.

Advanced Time Series (Deep Learning)

DeepAR (Amazon):
  Autoregressive RNN with probabilistic output
  Learns from multiple time series (global model)
  Output: Distribution (mean + std), not point estimate
  Good for: Many related series, cold start with covariates

TFT (Temporal Fusion Transformer):
  Variable Selection Network → which features matter
  Static covariate encoder → time-invariant features
  Gated Residual Network → skip connections + gating
  Multi-head attention → interpretability (which past steps matter)
  Quantile regression → prediction intervals

  Three types of inputs:
    - Static: product category, store location
    - Known future: holidays, promotions
    - Historical: past sales, weather

Prophet (Meta):
  y(t) = g(t) + s(t) + h(t) + ε_t
  g(t) = trend (piecewise linear or logistic)
  s(t) = seasonality (Fourier series)
  h(t) = holiday effects
  Good for: Business forecasting with seasonality + holidays

N-BEATS:
  Stack of FC blocks with forward/backward residuals
  Interpretable: trend + seasonality decomposition
  Pure deep learning, no hand-crafted features

Cross-validation (Time Series):
  Rolling origin: Train on [0:t], test on [t:t+h], expand t
  NOT random split! Temporal order must be preserved

One-liners: - Q: "DeepAR vs ARIMA?" → A: DeepAR for multiple related series with covariates, learns globally. ARIMA for single series. - Q: "TFT key innovation?" → A: Variable Selection + attention for interpretability. Knows which features and past steps matter. - Q: "Time series CV?" → A: Rolling origin. Never random split — temporal order matters. - Q: "Prophet components?" → A: Trend (piecewise) + Seasonality (Fourier) + Holidays + Error.


Causal Inference Quick Reference

Association ≠ Causation: Ice cream & crime correlated (confounder: heat)

Potential Outcomes:
  Y_i(1) = outcome if treated, Y_i(0) = outcome if not
  ITE = Y_i(1) - Y_i(0) [never observed!]
  ATE = E[Y(1) - Y(0)]

Confounder: Affects both treatment and outcome → spurious correlation

Propensity Score: P(T=1|X), probability of treatment given covariates
  Matching: Pair treated/controls by similar PS
  Assumptions: Unconfoundedness, Overlap (0 < PS < 1)

Methods:
  PSM: Propensity Score Matching
  RDD: Cutoff-based assignment (compare just above/below)
  IV: Instrument affects treatment but not outcome directly
  DiD: Difference-in-Differences (trend comparison)

Uplift Modeling: Individual treatment effect prediction
  Persuadables: Respond only if treated
  Sleeping dogs: Respond worse if treated
  Sure things: Respond regardless
  Lost causes: Never respond

One-liners: - Q: "RCT vs observational?" → A: RCT randomizes treatment, observational needs confounding control. - Q: "ATE vs ATT?" → A: ATE = effect on everyone, ATT = effect on treated population. - Q: "Valid instrument?" → A: Relevant (affects treatment), Exogenous (no direct effect), Excludable.


Bayesian ML & Uncertainty Quick Reference

Epistemic Uncertainty: Model ignorance (reducible with more data)
Aleatoric Uncertainty: Data noise (irreducible)

BNN: Weights as distributions, not point estimates
  P(y|x,D) = ∫ P(y|x,w) P(w|D) dw

Variational Inference: Approximate posterior with tractable q(w)
  ELBO = E_q[log P(D|w)] - KL(q(w) || P(w))

MC Dropout: Dropout at inference ≈ variational approximation
  Enable dropout: model.train() at inference!
  Predictive uncertainty: Var(y|x) ≈ (1/T) Σ ŷ_t² - ((1/T) Σ ŷ_t)²

Deep Ensembles: Train M models with different init, variance = uncertainty
  Better calibrated than MC Dropout in practice
  Total variance = average variance + variance of means

Calibration: Predicted probability ≈ actual accuracy
  ECE (Expected Calibration Error): Σ (n_i/N) |acc_i - conf_i|
  Temperature scaling: p' = softmax(z/T), learn T on validation set
  Reliability diagram: bin samples by confidence, plot accuracy

Conformal Prediction: Distribution-free coverage guarantee
  P(Y_{n+1} ∈ Ĉ(X_{n+1})) ≥ 1 - α (under exchangeability)
  Split conformal: use calibration set, find quantile q
  Prediction set: Ĉ(x) = {y: s(x,y) ≤ q} where s is score function

OOD Detection: Identify out-of-distribution samples
  MSP (Max Softmax Probability): max(softmax(f(x)))
  Energy score: E(x) = -T log Σ exp(f_i(x)/T)
  Mahalanobis: (x-μ)ᵀ Σ⁻¹ (x-μ) in feature space
  ODIN: Input perturbation + temperature scaling

Selective Prediction: Abstain when uncertain
  Trade-off: Coverage ↓ → Accuracy ↑
  SelectiveNet: Dedicated selection head, g(x) threshold
  Conformal selective: Guarantee (1-α) coverage on accepted

One-liners: - Q: "Epistemic vs aleatoric?" → A: Epistemic = model uncertainty (fix with data), aleatoric = noise (inherent). - Q: "MC Dropout how?" → A: Keep dropout ON at inference, sample N predictions, std = uncertainty. - Q: "MC Dropout vs Ensembles?" → A: Ensembles better calibrated, MC Dropout cheaper (one model). - Q: "What is calibration?" → A: Predicted prob ≈ actual accuracy. Fix with temperature scaling. - Q: "Conformal prediction guarantee?" → A: P(Y ∈ Ĉ(X)) ≥ 1-α under exchangeability, no assumptions about data distribution. - Q: "OOD detection baselines?" → A: MSP (max softmax), Energy (lower = OOD), Mahalanobis distance in feature space. - Q: "Selective prediction?" → A: Model can say "I don't know", trade coverage for accuracy.


Second-Order Optimization Quick Reference

Newton's Method: θ = θ - H⁻¹∇L
  Uses curvature (Hessian), not just gradient
  Quadratic convergence vs SGD linear
  Problem: H⁻¹ is O(D³), infeasible for deep learning

BFGS: Quasi-Newton, approximates H⁻¹ iteratively
  Memory: O(D²)

L-BFGS: Limited-memory BFGS
  Stores only last m gradient differences
  Memory: O(mD), practical for large problems

Natural Gradient: ∇̃L = F⁻¹∇L
  F = Fisher Information Matrix
  Invariant to reparameterization
  K-FAC: Kronecker approximation

Conjugate Gradient: Directions conjugate to previous
  d_t = -g_t + β_t·d_{t-1}
  Guaranteed convergence in ≤ D steps for quadratic
  No Hessian storage needed

One-liners: - Q: "Why not Newton in DL?" → A: Hessian is O(D²) memory, O(D³) inverse. Use L-BFGS or Adam. - Q: "L-BFGS vs BFGS?" → A: L-BFGS stores only last m updates, BFGS stores full inverse approximation. - Q: "Natural gradient intuition?" → A: Steepest descent in distribution space, not parameter space.


Explainable AI (XAI) Quick Reference

SHAP (SHapley Additive exPlanations):
  Theory: Game theory Shapley values
  Formula: φ_i = Σ [|S|!(M-|S|-1)!/M!] × [f(S∪{i}) - f(S)]

  Guarantees:
    Efficiency: Σ φ_i = f(x) - E[f(X)]
    Symmetry: Equal contribution = equal SHAP
    Dummy: No impact = SHAP = 0
    Additivity: SHAP_ensemble = sum of SHAPs

  Variants:
    TreeSHAP: Exact for trees, O(TLD²), fast (~65ms)
    DeepSHAP: Neural networks via DeepLIFT
    KernelSHAP: Model-agnostic, slow (~450ms)

LIME (Local Interpretable Model-agnostic Explanations):
  Theory: Local surrogate model
  Formula: ξ(x) = argmin L(f, g, π_x) + Ω(g)

  Process:
    1. Perturb input → synthetic samples
    2. Weight by proximity: π_x(z) = exp(-D(x,z)²/σ²)
    3. Fit interpretable model (linear)
    4. Coefficients = explanation

Performance Comparison:
  Method         Speed    Memory   Stability
  SHAP (Tree)    65ms     78MB     95%
  SHAP (Kernel)  450ms    680MB    96%
  LIME           85ms     92MB     82%

When to use:
  SHAP: Tree models, global explanations, regulated industries
  LIME: Novel architectures, quick single explanations, non-technical stakeholders

Failure modes:
  - Correlated features → underestimated importance
  - OOD samples → unreliable (40%+ error)
  - Feature interactions → linear models miss them

One-liners: - Q: "SHAP vs LIME?" → A: SHAP = theoretical rigor (game theory), LIME = flexible but unstable. Use SHAP for regulated, LIME for quick. - Q: "SHAP guarantees?" → A: Efficiency (sum = prediction-baseline), Symmetry, Dummy, Additivity. Only method with all four. - Q: "LIME unstability fix?" → A: Multiple runs + average, increase num_samples, or use SHAP instead. - Q: "XAI for deep learning?" → A: DeepSHAP (backprop-based) or GradientSHAP. TreeSHAP only for trees.


Neural Architecture Search (NAS) Quick Reference

3 Components:
  Search Space: What architectures are allowed (ops, connections)
  Search Strategy: How to explore (RL, EA, gradient, Bayesian)
  Performance Estimation: How to evaluate candidates fast

Search Strategies:
  RL:          RNN controller, reward = accuracy. 1800 GPU-days (NASNet)
  Evolution:   Population + mutation + crossover. Simple but expensive
  DARTS:       Continuous relaxation, gradient on α params. 1 GPU-day
  One-Shot:    Train supernet, sample subnets. 10000x faster

DARTS Formula:
  ō(x) = Σ_o softmax(α_o) × o(x)
  After training: argmax(α) → discrete architecture

Cell-Based Search:
  Search small reusable cell, not whole network
  Normal cell: same resolution
  Reduction cell: halve resolution
  Stack N cells → full network

Hardware-Aware NAS:
  Loss = Accuracy - λ × Latency
  Optimize for target device (mobile, edge)
  OFA: Train once, specialize for many devices

When NOT to use NAS:
  - Small scale (<7B params)
  - Single-domain task
  - Strong baseline exists (ResNet/EfficientNet)
  - Limited compute budget

One-liners: - Q: "NAS vs manual design?" → A: NAS finds novel architectures, but 100+ GPU-days. Use when unique constraints or novel task. - Q: "DARTS vs RL NAS?" → A: DARTS = gradient-based, 1 GPU-day. RL = 1800 GPU-days but more thorough. - Q: "Hardware-aware NAS?" → A: Add latency/memory to loss. Optimize for deployment target, not just accuracy. - Q: "Cell-based vs macro search?" → A: Cell = search small block, repeat. Macro = search whole network. Cell faster, transferable.


Generative Models (VAE/GAN) Quick Reference

Autoencoder: x → encoder → z → decoder → x̂
  Loss: ||x - x̂||² + regularization
  Use: Dimensionality reduction, denoising, anomaly detection

VAE: Probabilistic latent space
  z ~ q(z|x) = N(μ(x), σ²(x))
  Loss: Reconstruction + KL(q(z|x) || p(z))
  Reparameterization: z = μ + σ·ε, ε ~ N(0,I)
  Issue: Blurry outputs (averages modes)

GAN: Adversarial training
  min_G max_D E[log D(x)] + E[log(1-D(G(z)))]
  Mode collapse: G produces limited variety
  Solutions: WGAN, feature matching, spectral norm

WGAN: Wasserstein distance
  No sigmoid in D (critic)
  Gradient penalty or weight clipping
  Meaningful loss curve

DCGAN rules:
  - Strided conv instead of pooling
  - BatchNorm (except G output, D input)
  - ReLU in G, LeakyReLU in D
  - No FC layers

One-liners: - Q: "VAE vs AE?" → A: VAE learns distribution, generative. AE learns compression, not generative. - Q: "Why VAE blurry?" → A: Averages over modes when uncertain. GAN forces sharp outputs. - Q: "Mode collapse?" → A: Generator produces limited variety. Fix with WGAN, mini-batch discrimination. - Q: "WGAN vs GAN?" → A: WGAN uses Wasserstein distance, stable gradients, meaningful loss.


NLP & Word Embeddings Quick Reference

Word2Vec:
  CBOW: Context → Center word (faster, good for frequent words)
  Skip-gram: Center word → Context (better for rare words)

Negative Sampling:
  Replace softmax (expensive) with binary classification
  Loss = log(σ(v_context · v_center)) + Σ log(σ(-v_neg · v_center))
  K = 5-20 negative samples per positive
  Sampling: P(w)^0.75 (boost rare words)

GloVe:
  Count-based, global co-occurrence statistics
  Loss = Σ f(X_ij) (w_i · w̃_j + b_i + b̃_j - log X_ij)²
  Better for analogies, Word2Vec better for fine-tuning

Analogies:
  king - man + woman ≈ queen (vector arithmetic)
  Limitation: Polysemy (bank = river/financial → same vector)

One-liners: - Q: "CBOW vs Skip-gram?" → A: CBOW predicts center from context (fast), Skip-gram predicts context from center (rare words). - Q: "Why negative sampling?" → A: Softmax over 100K vocab is expensive. Binary classification on K negatives is O(K) vs O(V). - Q: "Word2Vec vs GloVe?" → A: Word2Vec = predictive (local), GloVe = count-based (global). Both learn similar embeddings.

NER & Sequence Labeling

BIO Tagging:
  B-PER: Begin person entity
  I-PER: Inside person entity
  O: Outside any entity

NER Evaluation:
  Token-level: P/R/F1 per class
  Entity-level: Exact match required (stricter)
  CoNLL-2003: Entity-level F1 standard

CRF for NER:
  Learns transition constraints (I-PER follows B-PER, not I-ORG)
  P(y|x) = (1/Z) exp(Σ θ · features)

BiLSTM-CRF:
  BiLSTM: Contextual representations
  CRF: Valid transition sequences

BERT for NER:
  Fine-tune + linear classifier
  Use first subword token for entity label
  SOTA: 93+ F1 on CoNLL-2003

One-liners: - Q: "CRF vs BiLSTM for NER?" → A: CRF learns tag transitions, BiLSTM learns context. BiLSTM-CRF = best of both. - Q: "NER entity vs token F1?" → A: Entity-level requires exact boundaries + type match. Stricter but more realistic. - Q: "BERT for NER subwords?" → A: Use first subword label, ignore rest. "Califor##nia" → B-LOC on "Califor".

POS Tagging

HMM Tagger:
  P(t|w) ∝ P(w|t) · P(t|t_prev)
  Emission: Word given tag
  Transition: Tag bigram
  Viterbi: Best path through tags

CRF Tagger:
  Global normalization over sequences
  Features: word, suffix, prefix, neighboring tags
  Better than HMM (no independence assumption)

Modern: BERT fine-tuning
  Token → BERT → Linear classifier
  97%+ accuracy on Penn Treebank

One-liners: - Q: "HMM vs CRF for POS?" → A: HMM generative (P(x,y)), CRF discriminative (P(y|x)). CRF more flexible features. - Q: "Why contextual embeddings for POS?" → A: "can" (verb vs noun) determined by context. BERT captures this.


Recommendation Systems Quick Reference

Collaborative Filtering:
  User-based: Find similar users → recommend their items
  Item-based: Find similar items → recommend (preferred for scale)
  Matrix: R ≈ U × V^T (factorization)

Matrix Factorization:
  SGD: u_i += η(e_ui × v_j - λ×u_i)
  ALS: Alternating least squares
  Libraries: Implicit, LightFM

Cold Start Solutions:
  New User: Content-based, ask preferences, popularity
  New Item: Content features, bandits, side info

Two-Tower Architecture:
  User Tower: Features → Embedding
  Item Tower: Features → Embedding
  Score: Dot product → ANN search (FAISS)

RecSys Pipeline:
  1. Retrieval: ANN → 1000 candidates
  2. Ranking: GBDT/Deep → Top 100
  3. Re-ranking: Diversity, business rules → Final 20

One-liners: - Q: "CF vs Content-Based?" → A: CF uses user-item interactions (discovery), CB uses item features (explainability). - Q: "Item-based vs User-based?" → A: Item-based more stable, pre-computable, better for production scale. - Q: "Two-Tower advantage?" → A: Decoupled inference, pre-compute item embeddings, ANN search for millions of items. - Q: "Cold start for new user?" → A: Content-based first, popularity baseline, ask preferences, smooth to CF as data accumulates.


Hyperparameter Optimization Quick Reference

Parameters vs Hyperparameters:
  Parameters: Learned from data (weights, biases)
  Hyperparameters: Set before training (lr, batch_size, layers)

Grid Search: Exhaustive over all combinations
  O(n^k) for k params with n values each
  Use for small search spaces

Random Search: Sample random combinations
  Often better than grid (explores more values for important params)
  Paper: Bergstra & Bengio 2012

Bayesian Optimization:
  Build surrogate model (Gaussian Process) of objective
  Acquisition function (EI, UCB) guides search
  Trade-off: exploration vs exploitation
  Use when evaluations are expensive

Optuna: Single-node, TPE sampler, pruning
Ray Tune: Distributed, PBT, ASHA, Hyperband

Priority for tuning:
  1. Learning rate (biggest impact)
  2. Batch size
  3. Optimizer (Adam vs SGD)
  4. Architecture (layers, units)
  5. Regularization (dropout, weight decay)

Early Stopping in HPO:
  Median pruning: Stop if worse than median at step k
  ASHA/Hyperband: Promote top performers, stop rest

One-liners: - Q: "Grid vs Random search?" → A: Random often better — explores more values for important params. Grid wastes trials on unimportant dimensions. - Q: "When Bayesian?" → A: Expensive evaluations + low-dimensional space + smooth objective. Otherwise random is fine. - Q: "Optuna vs Ray Tune?" → A: Optuna simpler, single-node. Ray Tune distributed, PBT, ASHA. - Q: "What to tune first?" → A: Learning rate → batch size → optimizer → architecture → regularization. - Q: "Multi-objective HPO?" → A: Pareto front — solutions where no objective can improve without worsening another.


Reinforcement Learning Quick Reference

Value-based vs Policy-based:
  Value: Learn Q(s,a), choose argmax (DQN)
  Policy: Learn π(a|s) directly (REINFORCE, PPO)

Q-Learning:
  Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') - Q(s,a)]
  Bellman equation, tabular doesn't scale

DQN Innovations:
  1. Experience Replay: Break sample correlation
  2. Target Network: Stable learning targets

Double DQN: Decouple action selection from evaluation
Dueling DQN: Q = V(s) + A(s,a)

Policy Gradient:
  ∇J(θ) = E[∇log π(a|s) · G]
  G = discounted return

REINFORCE: Vanilla policy gradient, high variance
  Use advantage A = Q - V as baseline

PPO: Clip policy ratio to prevent large updates
  L = min(r·A, clip(r, 1-ε, 1+ε)·A)
  Most robust general-purpose algorithm

Actor-Critic: Actor (policy) + Critic (value)
  A2C: Synchronous, A3C: Async
  SAC: Entropy bonus, continuous actions, off-policy
  TD3: Twin critics, delayed updates

One-liners: - Q: "DQN vs PPO?" → A: DQN discrete only, off-policy, sample efficient. PPO both, on-policy, more stable. - Q: "Why experience replay?" → A: Breaks correlation between samples, enables off-policy learning from past experience. - Q: "Why target network?" → A: Prevents oscillations from chasing moving target. Update periodically, not every step. - Q: "PPO popularity?" → A: Clipped updates = stable. Works for discrete and continuous. Simple to implement. - Q: "SAC vs TD3?" → A: SAC has entropy bonus (better exploration), TD3 simpler but needs more tuning. Both continuous. - Q: "Which algorithm first?" → A: PPO. Most robust, works for most problems. Tune from there.


Object Detection Quick Reference

One-stage vs Two-stage:
  Two-stage (Faster R-CNN): RPN → RoI → Classify (accurate, slow)
  One-stage (YOLO, SSD): Direct prediction (fast, slightly less accurate)

Anchor Boxes: Pre-defined box templates
  Each location: K anchors (3 scales × 3 ratios = 9)
  Model predicts offsets, not absolute boxes
  Modern trend: Anchor-free (FCOS, CenterNet)

IoU (Intersection over Union):
  IoU = Area(A ∩ B) / Area(A ∪ B)
  Variants: GIoU (handles non-overlap), DIoU (+distance), CIoU (+aspect ratio)

NMS (Non-Maximum Suppression):
  Sort boxes by score, keep highest
  Remove overlapping boxes (IoU > threshold)
  Soft-NMS: Reduce score instead of remove

mAP Calculation:
  AP = Area under Precision-Recall curve (per class)
  mAP = Mean of AP across all classes
  COCO: mAP@0.5:0.95 (average over 10 thresholds)

One-liners: - Q: "YOLO vs Faster R-CNN?" → A: YOLO single-shot, real-time (~45 FPS). Faster R-CNN two-stage, more accurate (~5 FPS). - Q: "What is anchor box?" → A: Pre-defined template at each location. Model predicts offset from anchor, not absolute coordinates. - Q: "NMS purpose?" → A: Remove duplicate detections of same object. Keep highest score, suppress overlapping boxes. - Q: "mAP@0.5 vs mAP@0.5:0.95?" → A: mAP@0.5 uses IoU threshold 0.5. mAP@0.5:0.95 averages over thresholds 0.5-0.95 (COCO standard).

YOLO Evolution

YOLOv1: 7×7 grid, 2 boxes per cell, no anchors
YOLOv2: Anchors, batch norm, multi-scale training
YOLOv3: Feature pyramid (3 scales), Darknet-53
YOLOv5: PyTorch, auto-learning anchors, Mosaic augmentation
YOLOv8: Anchor-free, decoupled head
YOLOv10: NMS-free training, consistent dual assignments

Loss: λ_coord × L_loc + L_conf + λ_noobj × L_noobj + L_cls

Contrastive & Self-Supervised Learning Quick Reference

Contrastive Learning:
  Pull positive pairs together, push negative pairs apart
  Loss = -log(exp(sim(z_i,z_j)/τ) / Σ exp(sim(z_i,z_k)/τ))

SimCLR:
  Augment → Encoder → Projection head → NT-Xent loss
  Needs large batch (4096+) for enough negatives
  Key: Strong augmentation (crop + color + blur)

CLIP:
  Image encoder + Text encoder → Joint embedding
  Contrastive loss on 400M (image, text) pairs
  Zero-shot: "a photo of a {class}" → classify

MoCo:
  Memory bank (queue) instead of large batch
  Momentum encoder for stable targets
  Smaller batch, same performance

BYOL:
  No negative pairs!
  Online network + Target network (momentum)
  Predictor prevents collapse

One-liners: - Q: "Contrastive learning intuition?" → A: Pull augmented views of same image together, push different images apart in embedding space. - Q: "SimCLR vs MoCo?" → A: SimCLR needs large batch (4096+) for negatives. MoCo uses memory queue, works with smaller batches. - Q: "BYOL without negatives?" → A: Momentum target network + predictor. Stop-gradient prevents collapse. - Q: "CLIP zero-shot?" → A: Encode class names as text ("a photo of a dog"), compute similarity with image embedding. - Q: "Linear probe evaluation?" → A: Freeze encoder, train only linear classifier. Measures feature quality.

Self-Supervised Pretext Tasks

Contrastive: SimCLR, MoCo, BYOL, SimSiam
Masked: MAE (mask 75%, reconstruct), BEiT (predict tokens)
Pretext (older): Rotation, Jigsaw, Colorization, Inpainting

2025-2026 Trend: MAE + contrastive hybrid

Model Compression & Quantization Quick Reference

Compression Techniques:
  Quantization: FP32 → INT8/INT4 (4x-8x smaller)
  Pruning: Remove weights/neurons (2x smaller)
  Distillation: Train small from large (10-100x smaller)

PTQ vs QAT:
  PTQ: Post-training, fast (minutes), may lose accuracy
  QAT: During training, slow (full retrain), preserves accuracy

Quantization Types:
  Symmetric: Zero-point = 0, simpler (weights)
  Asymmetric: Zero-point ≠ 0, more accurate (activations)

Pruning:
  Unstructured: Remove individual weights (needs sparse hardware)
  Structured: Remove channels/filters (speedup on any hardware)

Knowledge Distillation:
  L = α * L_hard + (1-α) * T² * L_soft
  Temperature T: Higher = softer distribution
  "Dark knowledge": Which classes teacher thinks are similar

LLM Quantization:
  GPTQ: Post-training, layer-by-layer, INT4
  AWQ: Activation-aware, protects salient weights
  GGUF: llama.cpp format, CPU-optimized

One-liners: - Q: "PTQ vs QAT?" → A: PTQ is fast post-training, QAT is during training but preserves accuracy better. - Q: "Structured vs unstructured pruning?" → A: Structured removes entire channels (actual speedup), unstructured removes individual weights (needs sparse hardware). - Q: "Why temperature in distillation?" → A: Higher T softens probability distribution, revealing which classes teacher thinks are similar ("dark knowledge"). - Q: "GPTQ vs AWQ?" → A: Both INT4 post-training. AWQ protects important weights based on activations, slightly better for low bits. - Q: "When to compress?" → A: Edge deployment, cost reduction, inference speed critical.

Active Learning Quick Reference

Active Learning Loop:
  1. Train on labeled L
  2. Query most informative from unlabeled U
  3. Add to L, repeat

Query Strategies:
  Least Confidence:  max(1 - P(ŷ|x))
  Margin Sampling:   min(P(y₁|x) - P(y₂|x))
  Entropy:           max(-Σ P(y|x) log P(y|x))

Query-by-Committee (QBC):
  Train multiple models
  Query samples with highest disagreement
  Measure: Vote entropy, KL divergence

Expected Model Change:
  Query sample with largest expected gradient
  EGL = E[||∇L(x,y)||]

Diversity Sampling:
  Balance uncertainty with coverage
  k-Center: Cover unlabeled pool
  BADGE: Gradient embeddings + k-means++

When Active Learning FAILS:
  - Very small initial set (model too weak)
  - Highly imbalanced data
  - Noisy labels amplify errors
  - Budget too small (<100 samples)

Best Practice:
  70% uncertainty + 30% diversity
  Cold start: First 50-100 random

One-liners: - Q: "Active learning goal?" → A: Achieve target accuracy with minimum labeling cost by selecting most informative samples. - Q: "Uncertainty sampling methods?" → A: Least confidence (max uncertainty), margin (closest top 2), entropy (highest distribution uncertainty). - Q: "Query-by-Committee?" → A: Train multiple models, query samples where they disagree most. - Q: "When NOT to use active learning?" → A: Very small initial set, noisy labels, budget <100 samples, highly imbalanced data. - Q: "Combine uncertainty + diversity?" → A: 70% uncertain samples + 30% diverse samples prevents redundant queries.


LLM Inference Optimization Quick Reference

Quantization Types:
  Per-tensor: Single scale for entire tensor (simpler)
  Per-channel: Scale per output channel (more accurate)

GPTQ:
  Post-training INT4 with Hessian-based compensation
  Works on 70B+ models with <1% perplexity loss
  Formula: delta_F = -(w_q - quant(w_q)) / [H^-1]_qq * (H^-1)_{:,q}

Speculative Decoding:
  Draft model proposes k tokens
  Main model verifies in single forward pass
  Accept until mismatch, reject rest
  Speedup: 2-3x for memory-bound generation

Flash Attention:
  Tiled computation (never materialize N×N matrix)
  Memory: O(N) vs O(N²) standard
  Speedup: 2-4x faster, 10x less memory
  v2: Better parallelization, v3: H100 FP8

Hardware Optimization:
  CPU: INT8/INT4, ONNX Runtime, llama.cpp
  GPU: Flash Attention, vLLM, TensorRT
  TPU: XLA, JAX optimization

One-liners: - Q: "Per-tensor vs per-channel?" → A: Per-tensor = one scale for all weights (simpler). Per-channel = scale per channel (more accurate for varying importance). - Q: "GPTQ advantage?" → A: INT4 quantization with Hessian compensation, <1% perplexity loss, works on 70B+ models. - Q: "Speculative decoding?" → A: Draft model proposes tokens, main model verifies in one pass. 2-3x speedup. - Q: "Flash Attention memory?" → A: O(N) vs O(N²) by tiling computation, never materializing full attention matrix. - Q: "CPU vs GPU for inference?" → A: CPU for edge/low-latency with quantization. GPU for throughput with Flash Attention + batching.

Gradient Checkpointing Quick Reference

Core Trade-off: Memory ↔ Compute
  Don't store activations → Recompute during backward
  Savings: O(n) → O(√n) memory
  Cost: ~20-30% more compute

PyTorch APIs:
  checkpoint(fn, *args)         # Single function
  checkpoint_sequential(mod, n, x)  # N segments

Memory Formula:
  Peak memory = Static weights + Peak activations
  With checkpointing: Peak activations ≈ √n segments

Best Practices:
  - Checkpoint heavy layers (attention, wide linear)
  - Avoid checkpointing Dropout, BatchNorm
  - Combine with mixed precision for max savings

One-liners: - Q: "Gradient checkpointing?" → A: Trade compute for memory by recomputing activations during backward instead of storing them. - Q: "Memory savings?" → A: 50-70% reduction in peak memory for ~20-30% more training time. - Q: "When to use?" → A: OOM errors, need larger batch size, training deep models on limited GPU memory. - Q: "Checkpointing vs accumulation?" → A: Checkpointing reduces memory, accumulation simulates larger batch with same memory. - Q: "What NOT to checkpoint?" → A: Dropout (RNG issues), BatchNorm (running stats updated twice), in-place ops.

Mixed Precision Quick Reference

Core Idea: FP16/BF16 compute, FP32 master weights

FP16 vs BF16:
  FP16: 5-bit exp, 10-bit mantissa, max 65504
  BF16: 8-bit exp, 7-bit mantissa, same range as FP32

Loss Scaling (FP16 only):
  Scale loss before backward: L * scale
  Unscale before optimizer: grad / scale
  Dynamic: Reduce on inf/nan, increase when stable

PyTorch AMP:
  from torch.cuda.amp import autocast, GradScaler
  scaler = GradScaler()
  with autocast(dtype=torch.float16):  # or bfloat16
      loss = model(x)
  scaler.scale(loss).backward()
  scaler.step(optimizer)
  scaler.update()

Benefits:
  Memory: ~2x reduction
  Speed: 2-3x on Tensor Cores
  Accuracy: Minimal loss with proper scaling

One-liners: - Q: "Mixed precision?" → A: FP16/BF16 for compute (2x memory savings, 2-3x speedup), FP32 for master weights (precision). - Q: "FP16 vs BF16?" → A: FP16 = better precision, risk of overflow. BF16 = same range as FP32, no overflow, but lower precision. - Q: "Why loss scaling?" → A: Prevent gradient underflow in FP16 by scaling up loss, then unscaling gradients before optimizer. - Q: "When BF16?" → A: Ampere+ GPUs (A100, H100), no loss scaling needed, simpler than FP16. - Q: "GradScaler?" → A: Automatically scales loss, detects inf/nan, adjusts scale dynamically for FP16 training.

Distributed Training Quick Reference

Parallelism Types:
  Data Parallel:   Model replicated, data split (DDP)
  Model Parallel:  Model split across GPUs
  Pipeline:        Different layers on different GPUs
  Tensor:          Split individual operations (e.g., big matmul)

ZeRO Stages (DeepSpeed):
  ZeRO-1: Shard optimizer states only (~4x memory)
  ZeRO-2: + gradients (~8x memory)
  ZeRO-3: + parameters (~N× memory, N = GPUs)

Memory Math (7B model):
  Standard DDP:  28GB params + 28GB grads + 56GB opt = 112GB
  ZeRO-3 (8 GPU): ~14GB per GPU

FSDP vs DeepSpeed:
  FSDP:      PyTorch native, 90% memory savings, simpler
  DeepSpeed: 95% savings, more features, complex setup

Gradient Accumulation:
  Effective batch = actual_batch × accum_steps
  loss = model(x) / accum_steps  # Scale down
  optimizer.step() every N batches

Pipeline Parallelism:
  Model split into stages on different GPUs
  Micro-batches fill pipeline bubbles
  Best combined with tensor parallel (3D parallelism)

One-liners: - Q: "Data vs Model parallel?" → A: Data parallel = same model on all GPUs with different data. Model parallel = different parts of model on different GPUs. - Q: "ZeRO-3 savings?" → A: Shards params + grads + optimizer states. ~N× memory savings where N = number of GPUs. - Q: "FSDP vs DeepSpeed?" → A: FSDP = PyTorch native, simpler, 90% savings. DeepSpeed = more features, 95% savings, complex. - Q: "Gradient accumulation?" → A: Simulate larger batch by accumulating gradients over N steps before optimizer update. - Q: "When pipeline parallel?" → A: Models too large for data parallel alone, combined with ZeRO/tensor parallel for 3D parallelism.

Feature Stores Quick Reference

Core Purpose:
  Consistency: Same features for training and serving
  Reusability: Share features across models
  Time-travel: Point-in-time correct joins
  Freshness: Real-time feature serving

Offline vs Online:
  Offline: Training, hours latency, Parquet/Delta
  Online:  Inference, <10ms, Redis/DynamoDB

Point-in-Time Join:
  Join feature by entity_id + timestamp
  Prevents data leakage (feature value as of event time)

Feast vs Tecton vs Hopsworks:
  Feast:     OSS, free, limited real-time
  Tecton:    Managed SaaS, enterprise, $$$
  Hopsworks: Hybrid, mid-market, $$

Key Concepts:
  Materialization: Offline → Online (batch/streaming)
  Feature Groups:  Group by freshness requirements
  Monitoring:      Freshness alerts, latency P99

One-liners: - Q: "Why feature store?" → A: Ensures consistency between training and serving, prevents data leakage, enables feature reuse. - Q: "Offline vs Online store?" → A: Offline = training (Parquet, batch). Online = inference (Redis, <10ms latency). - Q: "Point-in-time join?" → A: Join features as they existed at event time, preventing future data leakage. - Q: "Feast vs Tecton?" → A: Feast = OSS, free, DIY. Tecton = managed, enterprise, expensive.

Uplift Modeling Quick Reference

Core Formula:
  Uplift = P(Y|T=1) - P(Y|T=0)
  Incremental effect of treatment on individual

User Segments:
  Persuadables:  Positive uplift → Target!
  Sure Things:   Buy anyway → Don't waste treatment
  Lost Causes:   Won't buy → Don't target
  Sleeping Dogs: Negative uplift → Avoid!

Model Types:
  T-Learner: Two models (treatment, control), difference
  S-Learner: Single model with treatment as feature
  X-Learner: Propensity-weighted, handles imbalanced groups

Evaluation (no ground truth!):
  AUUC: Area Under Uplift Curve (rank by predicted uplift)
  Qini: Cumulative treatment effect vs random
  Uplift@k: Effect in top-k predictions

One-liners: - Q: "Uplift modeling?" → A: Estimate incremental effect of treatment on individual users, not just average effect. - Q: "Persuadables vs Sleeping Dogs?" → A: Persuadables = treatment helps (target!). Sleeping Dogs = treatment hurts (avoid!). - Q: "T-Learner vs S-Learner?" → A: T-Learner = two separate models for treatment/control. S-Learner = one model with treatment as feature. - Q: "How to evaluate without ground truth?" → A: AUUC, Qini coefficient, uplift-at-k using held-out A/B test data.

LLM Alignment Quick Reference (RLHF, DPO, GRPO)

RLHF Pipeline:
  1. SFT on quality examples
  2. Train reward model on preferences
  3. PPO optimize with reward model

PPO vs DPO:
  PPO: 4× models (policy, reward, critic, reference)
       Higher quality, unstable, complex
  DPO: 2× models (policy, reference)
       Simpler, stable, lower compute

GRPO (DeepSeek-R1):
  No critic model (like DPO)
  Group-relative advantages
  93% less compute than PPO
  Pure RL - reasoning emerges

Reward Hacking:
  Model exploits reward without solving task
  Solution: Sparse rewards, adversarial training, human eval

When to Use:
  Style/tone:    DPO
  New knowledge: RAG
  Reasoning:     PPO/GRPO
  Safety:        PPO + Constitutional AI

One-liners: - Q: "RLHF purpose?" → A: Align LLM with human preferences - helpful, harmless, honest. - Q: "PPO vs DPO?" → A: PPO = 4× models, higher quality, complex. DPO = 2× models, simpler, good for style tasks. - Q: "GRPO advantage?" → A: No critic model, 93% less compute than PPO, group-relative ranking. - Q: "Reward hacking?" → A: Model exploits reward signal without solving task. Fix with sparse rewards, adversarial training. - Q: "RLHF vs RAG?" → A: RAG for new knowledge, RLHF for reasoning/style improvement.

GNN Quick Reference

Message Passing:
  h_v^(l+1) = UPDATE(h_v^l, AGGREGATE({h_u^l : u in N(v)}))
  Aggregate: sum, mean, max, attention-weighted
  Update: MLP, GRU, identity

Architecture Comparison:
  GCN:        Fixed weights (D^-0.5 A D^-0.5), transductive
  GAT:        Learned attention, transductive
  GraphSAGE:  Sample neighbors, INDUCTIVE (new nodes!)
  GIN:        Sum aggregation = WL-equivalent

Key Problems:
  Over-smoothing: All nodes same after many layers
    Fix: JK-Net, residual, PairNorm, fewer layers

  Heterogeneous: Different node/edge types
    Fix: R-GCN (separate weights), HAN (metapath attention)

One-hop = neighbors, Two-hop = neighbors of neighbors

One-liners: - Q: "GCN vs GAT?" → A: GCN = fixed aggregation weights. GAT = learned attention per edge. - Q: "GraphSAGE advantage?" → A: INDUCTIVE - works on unseen nodes without retraining (neighbor sampling). - Q: "Over-smoothing?" → A: After many layers, all node representations become identical. Fix: JK-Net, residual, 2-3 layers max. - Q: "GIN expressiveness?" → A: GIN with sum aggregation is as powerful as WL graph isomorphism test. Mean/max are not. - Q: "Heterogeneous graphs?" → A: R-GCN (different weights per edge type), HAN (metapath attention).

Diffusion Models Quick Reference

Core Process:
  Forward:  x_0 -> x_T (add noise gradually)
  Reverse:  x_T -> x_0 (denoise with neural net)

  Training: Predict noise epsilon from x_t
  Loss: E[||epsilon - epsilon_theta(x_t, t)||^2]

DDPM vs DDIM:
  DDPM: Stochastic, 1000 steps, higher quality
  DDIM: Deterministic (eta=0), 10-50 steps, 10-100x faster

Classifier-Free Guidance:
  epsilon_tilde = epsilon(uncond) + s * (epsilon(c) - epsilon(uncond))
  s=1: no guidance (pure conditional), s=7-15: typical for Stable Diffusion

Latent Diffusion:
  Compress image with VAE -> diffuse in latent space -> decode
  16x+ faster, lower memory

Architectures:
  U-Net: Conv + attention + AdaGN for time
  DiT:   Pure transformer, patchify, scales better

One-liners: - Q: "Diffusion training objective?" → A: Predict noise epsilon from noisy image x_t at timestep t. - Q: "DDPM vs DDIM?" → A: DDPM = stochastic Markov chain, 1000 steps. DDIM = deterministic, 10-50 steps, much faster. - Q: "Classifier-Free Guidance?" → A: Combine conditional and unconditional predictions with guidance scale s. Higher s = more faithful, less diverse. - Q: "Latent Diffusion?" → A: Diffuse in compressed latent space (VAE), not pixels. 16x+ faster training. - Q: "U-Net vs DiT?" → A: U-Net = conv + attention, inductive bias for locality. DiT = pure transformer, better scaling. - Q: "Consistency Models?" → A: Map any x_t directly to x_0 in one step. Distill from diffusion or train from scratch.

Reinforcement Learning Quick Reference

Algorithm Types:
  Value-based:    Learn Q(s,a), greedy action selection
  Policy-based:   Learn pi(a|s) directly, high variance
  Actor-Critic:   Both - lower variance, best of both

Q-Learning Update:
  Q(s,a) <- Q(s,a) + alpha[r + gamma * max Q(s',a') - Q(s,a)]

DQN Key Components:
  Experience Replay: Break temporal correlation
  Target Network:    Stabilize training
  Double DQN:        Reduce overestimation

Algorithm Selection:
  Discrete:       DQN, PPO
  Continuous:     PPO, SAC, TD3
  Sample Efficient: SAC (off-policy)
  Stable/Simple:  PPO (default choice)

Exploration Strategies:
  epsilon-greedy:  Random action with prob epsilon
  Entropy bonus:   -beta * sum(pi * log(pi))
  UCB:             Q + c * sqrt(ln N / n)

One-liners: - Q: "Value-based vs Policy-based?" → A: Value-based learns Q(s,a), policy-based learns pi(a|s) directly. Actor-Critic combines both. - Q: "Q-Learning update?" → A: Q(s,a) = Q(s,a) + alpha[r + gamma*max Q(s',a') - Q(s,a)]. TD learning. - Q: "PPO advantage?" → A: Clipped objective prevents large policy updates. Stable, simple, works for discrete/continuous. - Q: "SAC vs PPO?" → A: SAC = off-policy, entropy regularization, more sample efficient. PPO = on-policy, simpler, more stable. - Q: "Exploration vs exploitation?" → A: Explore to discover, exploit to maximize. Balance with epsilon-greedy, entropy bonus, UCB.

VAE Quick Reference

Architecture:
  Encoder: x -> (mu, sigma)
  Sample:  z = mu + sigma * eps (reparameterization trick)
  Decoder: z -> x_hat

ELBO Objective:
  log p(x) >= E[log p(x|z)] - KL(q(z|x) || p(z))
  = Reconstruction + KL regularization

Loss:
  L = BCE(x, x_hat) + beta * KL(N(mu,sigma) || N(0,1))
  KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))

Reparameterization Trick:
  Can't backprop through z ~ N(mu, sigma)
  Solution: z = mu + sigma * eps, eps ~ N(0,I)
  Gradients flow through mu, sigma

Posterior Collapse:
  Decoder ignores z, KL -> 0, becomes regular AE
  Fix: KL annealing, free bits, weaker decoder

beta-VAE:
  beta > 1: More disentangled, worse reconstruction
  beta = 4: Good tradeoff

One-liners: - Q: "VAE vs Autoencoder?" → A: VAE learns distribution over latents (probabilistic), AE learns point estimates (deterministic). - Q: "ELBO components?" → A: Reconstruction term (decoder quality) + KL term (latent close to prior N(0,1)). - Q: "Reparameterization trick?" → A: z = mu + sigma*eps. Enables backprop through random sampling. - Q: "Posterior collapse?" → A: Decoder ignores latent z. Fix: KL annealing, free bits, beta-VAE. - Q: "beta-VAE effect?" → A: beta > 1 forces disentangled representations but hurts reconstruction quality.


Dimensionality Reduction Quick Reference

PCA:
  Linear, maximize variance preserved
  Eigendecomposition of covariance matrix
  Use for: Preprocessing, visualization, noise reduction

t-SNE:
  Non-linear, preserve local structure
  High-dim similarities -> low-dim probabilities
  Perplexity: 5-50 (effective neighbors)
  Run: PCA first (50 dims), then t-SNE
  NOT for: Feature engineering (no inverse transform)

UMAP:
  Non-linear, preserves global + local
  Faster than t-SNE, supports inverse transform
  n_neighbors: 5-50 (local vs global)
  min_dist: 0.0-0.99 (clustering tightness)
  Use for: Visualization, preprocessing, clustering

Autoencoder:
  Non-linear compression
  x -> encoder -> z -> decoder -> x_hat
  Use for: Anomaly detection, denoising, feature learning

One-liners: - Q: "PCA vs t-SNE vs UMAP?" → A: PCA = linear, fast, interpretable. t-SNE = non-linear, local structure, slow. UMAP = non-linear, global+local, faster than t-SNE. - Q: "t-SNE perplexity?" → A: Effective number of neighbors. Low (5) = local clusters, High (50) = more global structure. Default 30. - Q: "Why PCA before t-SNE?" → A: Reduce dimensions first (50), speeds up t-SNE 10-100x, removes noise. - Q: "UMAP n_neighbors?" → A: Low (5) = local structure focus. High (50) = global structure focus. Default 15. - Q: "t-SNE for features?" → A: No inverse transform, can't apply to new data. Use PCA or autoencoder for feature engineering.


Multi-Armed Bandits Quick Reference

Core Trade-off: Exploration vs Exploitation
  Explore: Try new arms to learn their rewards
  Exploit: Choose best-known arm to maximize reward

A/B Testing vs MAB:
  A/B: Fixed allocation, statistical significance, wastes traffic
  MAB: Dynamic allocation, maximize reward, minimizes regret

Algorithms:
  Epsilon-Greedy: With prob epsilon explore random, else exploit best
  UCB: Select arm with highest upper confidence bound
    UCB = mean_reward + sqrt(2*ln(n) / n_arm)
  Thompson Sampling: Sample from posterior, select highest
    Bayesian approach, works well for Bernoulli rewards

When MAB over A/B:
  - Maximize reward during experiment (ads, recommendations)
  - Non-stationary environment (preferences change)
  - Many variants (A/B slow with many arms)
  - Short experiment acceptable

When A/B over MAB:
  - Need statistical rigor (regulation, scientific)
  - Want to learn about user behavior
  - Potential negative impact from exploration

One-liners: - Q: "MAB vs A/B testing?" → A: A/B = fixed split, statistical rigor, wastes traffic on losers. MAB = dynamic, maximizes reward, minimizes regret. - Q: "Epsilon-Greedy vs UCB?" → A: E-Greedy = fixed exploration rate (simple). UCB = adaptive exploration based on uncertainty (no tuning). - Q: "Thompson Sampling?" → A: Bayesian sampling from posterior. Natural exploration, works well for Bernoulli rewards. - Q: "When to use MAB?" → A: Ads, recommendations, any scenario where you want to maximize reward while learning. - Q: "Evaluate MAB offline?" → A: Counterfactual evaluation with IPS (Inverse Propensity Scoring), replay method.


Model Drift Detection Quick Reference

Drift Types:
  Data Drift:   P(X) changes - input distribution shifts
  Concept Drift: P(Y|X) changes - relationship changes
  Label Drift:  P(Y) changes - outcome distribution shifts

Metrics:
  PSI (Population Stability Index):
    PSI < 0.1: OK
    0.1-0.25: Investigate
    > 0.25: Action needed

  KS-test: Statistical significance for continuous
  Wasserstein: Robust, geometric interpretation
  JS/KL: Information-theoretic divergence

Monitoring Setup:
  Baselines: Training, healthy production, seasonal
  Windows: Short (1h/1d), Medium (7d), Long (30d)
  Slicing: Country, device, user segment

Alerting:
  Warning (PSI > 0.1): Investigate within N hours
  Critical (PSI > 0.25): Mitigate immediately
  Persistence: Alert if N consecutive windows drift

Response Playbook:
  1. Triage: Check data pipeline, recent changes, localize slice
  2. Mitigate: Rollback, increase fallback, route to human
  3. Investigate: Compare failures to baseline, feature-level drift
  4. Resolve: Targeted labeling, retrain, calibration refresh

One-liners: - Q: "Data vs concept drift?" → A: Data = input distribution changes (new users, devices). Concept = P(Y|X) changes (fraud patterns evolve). - Q: "PSI interpretation?" → A: <0.1 = stable, 0.1-0.25 = investigate, >0.25 = action needed. - Q: "Drift detected - first step?" → A: Check data integrity (pipeline, schema, nulls) before assuming model issue. - Q: "Drift without performance drop?" → A: Benign drift - model still works. Monitor performance, not just inputs. - Q: "LLM drift sources?" → A: Prompt changes, retrieval corpus updates, embedding model changes, tool API changes.


Use this sheet 30 minutes before the interview for a quick review.