ML Practice: Speed Run Sheet (Layer 7)¶
~21 min read
A 60-second refresher before the interview: key formulas, common mistakes, one-liner answers. Updated: 2026-02-11
ML Math Quick Reference¶
Calculus¶
Gradient Descent: w = w - lr * dL/dw
Sigmoid derivative: sigma'(x) = sigma(x) * (1 - sigma(x))
ReLU derivative: ReLU'(x) = 1 if x > 0 else 0
Tanh derivative: tanh'(x) = 1 - tanh^2(x)
Softmax: softmax(x_i) = exp(x_i) / sum(exp(x_j))
Cross-entropy: CE = -sum(y * log(p))
Chain rule: dL/dw = dL/da * da/dz * dz/dw
One-liners:
- Q: "Why ReLU over sigmoid?" → A: No vanishing gradient for positive values, sparse activation, faster computation.
- Q: "Why softmax for classification?" → A: Outputs valid probability distribution (sums to 1, all positive).
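A quick numpy sanity check for the derivative identities above (a minimal sketch; the finite-difference comparison is purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Check sigma'(x) = sigma(x) * (1 - sigma(x)) against a central finite difference
x, h = 0.7, 1e-5
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert abs(analytic - numeric) < 1e-8
```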
Linear Algebra¶
L2 Norm: ||x||_2 = sqrt(sum(x_i^2))
Cosine Similarity: cos(a,b) = (a · b) / (||a|| * ||b||)
Softmax stable: softmax(x) = softmax(x - max(x))
SVD: X = U * S * V^T
PCA: X' = X @ V[:,:k] (top-k components)
One-liners:
- Q: "When to use cosine vs L2?" → A: Cosine for similarity regardless of magnitude, L2 for distance.
- Q: "Why subtract max in softmax?" → A: Numerical stability, prevents overflow.
Statistics¶
Mean: mu = (1/n) * sum(x_i)
Variance: var = (1/n) * sum((x_i - mu)^2)
Std Dev: sigma = sqrt(var)
Correlation: r = cov(X,Y) / (sigma_X * sigma_Y)
P-value: P(obs | H0) — probability of result under null hypothesis
Confidence Interval: mu +/- z * (sigma / sqrt(n))
T-test: t = (x_bar - mu) / (s / sqrt(n))
Chi-square: chi2 = sum((O - E)^2 / E)
One-liners:
- Q: "What is p-value?" → A: Probability of observing data this extreme if H0 is true. Low p-value = reject H0.
- Q: "When to use t-test vs z-test?" → A: t-test when n < 30 or population variance unknown.
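A minimal scipy sketch of the confidence-interval and one-sample t-test formulas above (the data is made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(loc=5.2, scale=1.0, size=40)

# 95% confidence interval for the mean: x_bar +/- z * s / sqrt(n)
half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
ci = (x.mean() - half, x.mean() + half)

# One-sample t-test against H0: mu = 5.0
t, p = stats.ttest_1samp(x, popmean=5.0)
print(ci, t, p)
```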
Information Theory¶
Entropy: H(X) = -sum(p(x) * log(p(x)))
Cross-entropy: H(P,Q) = -sum(p(x) * log(q(x)))
KL Divergence: KL(P||Q) = sum(p(x) * log(p(x)/q(x)))
Information Gain: IG = H(parent) - weighted_avg(H(children))
Gini: Gini = 1 - sum(p_i^2)
One-liners:
- Q: "Gini vs Entropy?" → A: Both work similarly. Gini faster (no log), Entropy more interpretable (bits).
- Q: "What is KL divergence?" → A: How much information lost when Q approximates P. Not symmetric.
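A short numpy sketch tying these formulas together; it verifies the identity H(P,Q) = H(P) + KL(P||Q) on toy distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

H_p = -np.sum(p * np.log(p))      # entropy H(P)
CE = -np.sum(p * np.log(q))       # cross-entropy H(P, Q)
KL = np.sum(p * np.log(p / q))    # KL(P || Q)
assert np.isclose(CE, H_p + KL)   # H(P,Q) = H(P) + KL(P||Q)
```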
Classical ML Quick Reference¶
Loss Functions¶
MSE: L = (1/n) * sum((y - y_hat)^2)
MAE: L = (1/n) * sum(|y - y_hat|)
Binary CE: L = -[y*log(p) + (1-y)*log(1-p)]
Hinge (SVM): L = max(0, 1 - y * f(x))
Metrics¶
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1: 2 * P * R / (P + R)
Accuracy: (TP + TN) / Total
ROC-AUC: Area under TPR vs FPR curve
PR-AUC: Area under Precision vs Recall curve
One-liners:
- Q: "Precision vs Recall?" → A: Precision = avoid false positives, Recall = catch all positives.
- Q: "When ROC-AUC vs PR-AUC?" → A: PR-AUC for imbalanced data, ROC-AUC for balanced.
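A minimal numpy sketch computing precision/recall/F1 straight from the confusion counts above (toy arrays for illustration):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
```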
Regularization¶
L1 (Lasso): L + lambda * sum(|w|)
L2 (Ridge): L + lambda * sum(w^2)
Elastic Net: L + lambda1 * L1 + lambda2 * L2
Dropout: During training, randomly zero p% of activations
One-liners:
- Q: "L1 vs L2?" → A: L1 produces sparse weights (feature selection), L2 shrinks all weights smoothly.
- Q: "Why Dropout?" → A: Prevents co-adaptation, acts as ensemble averaging.
Tree Algorithms¶
Decision Tree split: argmax IG(feature)
Random Forest: Bagging + random feature subset per split
GBDT: F(x) = F_prev(x) + lr * h(x), where h fits residuals
XGBoost: GBDT + second-order gradients + regularization
One-liners:
- Q: "RF vs GBDT?" → A: RF parallel (independent trees), GBDT sequential (corrects errors). GBDT usually better.
- Q: "Why random features in RF?" → A: De-correlates trees, reduces variance.
Deep Learning Quick Reference¶
Optimizers¶
SGD: w = w - lr * grad
Momentum: v = beta*v + grad; w = w - lr*v
RMSprop: w = w - lr * grad / sqrt(v + eps)
Adam: m = beta1*m + (1-beta1)*grad
v = beta2*v + (1-beta2)*grad^2
w = w - lr * m / (sqrt(v) + eps)
One-liners:
- Q: "Adam vs SGD?" → A: Adam adaptive, faster convergence. SGD + momentum often better final accuracy with LR scheduling.
- Q: "Why beta1=0.9, beta2=0.999?" → A: Exponential moving average of gradient (0.9) and squared gradient (0.999).
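A minimal numpy sketch of one Adam step; note the bias-correction terms (m_hat, v_hat), which the short formula above omits (function name and loop are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias correction for the zero-initialized moments
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    grad = 2 * w                 # gradient of ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)
print(w)                         # shrinks toward 0
```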
Weight Initialization¶
One-liners:
- Q: "Why not zero initialization?" → A: All neurons compute same output, no learning (symmetry).
- Q: "He vs Xavier?" → A: He for ReLU (2x variance), Xavier for tanh/sigmoid.
Normalization¶
BatchNorm: x_norm = (x - mu_batch) / sqrt(var_batch + eps)
y = gamma * x_norm + beta
LayerNorm: Normalize across features (not batch)
RMSNorm: x / sqrt(mean(x^2) + eps) — no mean subtraction
One-liners:
- Q: "BatchNorm vs LayerNorm?" → A: BatchNorm = per-feature across batch, LayerNorm = per-sample across features. LayerNorm for Transformers.
- Q: "Why gamma, beta?" → A: Learnable scale and shift, restore representation power.
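A minimal numpy sketch of LayerNorm vs RMSNorm per the formulas above (shapes are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each sample across its features, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # no mean subtraction, only rescaling by the root-mean-square
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(4, 8)  # (batch, features)
print(layer_norm(x, np.ones(8), np.zeros(8)).shape, rms_norm(x, np.ones(8)).shape)
```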
Attention¶
Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V
Multi-Head: concat(head_1, ..., head_h) @ W_o
where head_i = Attention(Q @ W_q^i, K @ W_k^i, V @ W_v^i)
One-liners:
- Q: "Why sqrt(d_k)?" → A: Scales dot product to prevent softmax from having extremely small gradients.
- Q: "Why multi-head?" → A: Different heads learn different relationships, richer representations.
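A minimal numpy sketch of single-head scaled dot-product attention (shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_q, n_k), scaled by sqrt(d_k)
    return softmax(scores) @ V       # (n_q, d_v)

Q = np.random.randn(4, 16)
K = np.random.randn(6, 16)
V = np.random.randn(6, 32)
print(attention(Q, K, V).shape)      # (4, 32)
```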
LLM Engineering Quick Reference¶
Tokenization¶
BPE: Merge most frequent pairs iteratively
WordPiece: Merge pairs maximizing likelihood
SentencePiece: Language-agnostic, handles any language
One-liners:
- Q: "BPE vs WordPiece?" → A: BPE merges most frequent pairs, WordPiece maximizes likelihood. WordPiece uses ## prefix.
- Q: "Why subword?" → A: No OOV, smaller vocabulary, handles morphology.
Decoding¶
Greedy: argmax at each step
Beam Search: Keep top-k sequences at each step
Top-k: Sample from top-k tokens
Top-p (nucleus): Sample from smallest set with cumulative prob >= p
Temperature: logits = logits / T (T<1 = sharper, T>1 = flatter)
One-liners:
- Q: "Top-k vs Top-p?" → A: Top-k fixed number, Top-p adaptive to distribution shape.
- Q: "When beam search?" → A: When you want most likely sequence, not diverse generation.
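A minimal numpy sketch combining temperature and nucleus (top-p) sampling as defined above (the function name is illustrative):

```python
import numpy as np

def sample_top_p(logits, p=0.9, temperature=1.0, rng=np.random.default_rng()):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]  # smallest set with cumulative prob >= p
    kept = probs[keep] / probs[keep].sum()       # renormalize over the nucleus
    return rng.choice(keep, p=kept)

print(sample_top_p(np.array([2.0, 1.0, 0.5, -1.0]), p=0.9))
```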
RAG¶
BM25: TF-IDF with saturation + length normalization
Dense Retrieval: similarity(query_emb, doc_emb)
Hybrid: alpha * BM25 + (1-alpha) * Dense
Reranking: Cross-encoder (slow, accurate) vs Bi-encoder (fast)
One-liners:
- Q: "BM25 vs Dense?" → A: BM25 exact keyword match, Dense semantic similarity. Use hybrid.
- Q: "When to rerank?" → A: Rerank top-100 from retrieval for better precision.
LoRA¶
LoRA: W' = W + B @ A, where B: d x r, A: r x d
Only train A, B (r << d)
QLoRA: 4-bit quantized base + LoRA adapters
One-liners:
- Q: "Why LoRA?" → A: Train 0.1% parameters, no catastrophic forgetting, easy to switch adapters.
- Q: "LoRA rank choice?" → A: Start with r=8-16. Higher rank = more capacity but overfitting risk.
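A minimal PyTorch sketch of a LoRA linear layer, assuming the alpha/r scaling used in the LoRA paper (the class name and init constants are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # freeze W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init => adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # W' x = W x + scale * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(512, 512)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```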
Quantization¶
PTQ (Post-Training): Quantize after training
QAT (Quantization-Aware): Train with quantization simulation
GPTQ: One-shot quantization using Hessian
AWQ: Activation-aware weight quantization
One-liners:
- Q: "INT8 vs FP16?" → A: INT8 2x smaller, faster inference, ~1% accuracy drop acceptable.
- Q: "GPTQ vs AWQ?" → A: Both INT4, AWQ faster inference, better accuracy retention.
ML System Design Quick Reference¶
Model Serving¶
Latency targets: P50 < 50ms, P99 < 200ms
Optimization:
1. Batching — combine requests
2. Quantization — INT8/INT4
3. Caching — cache popular predictions
4. Async — non-blocking inference
One-liners:
- Q: "Reduce latency 2x?" → A: Quantization, batching, caching, model distillation.
A/B Testing¶
Sample size: n = 16 * sigma^2 / delta^2 (95% CI, 80% power)
Significance: p-value < 0.05 → reject H0
z = (p_A - p_B) / sqrt(pooled * (1-pooled) * (1/n_A + 1/n_B))
One-liners:
- Q: "Sample size for A/B test?" → A: n = 16 * p(1-p) / delta^2 per variant.
- Q: "When A/B test invalid?" → A: Network effects, temporal effects, sample ratio mismatch.
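A minimal Python sketch of the sample-size rule of thumb and the pooled z-test above (the counts are made up):

```python
import numpy as np
from scipy import stats

# Rule-of-thumb sample size per variant (95% CI, 80% power)
p, delta = 0.10, 0.01                    # baseline rate, minimal detectable effect
n = 16 * p * (1 - p) / delta**2          # ~14,400 per variant

# Two-proportion z-test
x_a, n_a, x_b, n_b = 1050, 10_000, 1135, 10_000
p_a, p_b = x_a / n_a, x_b / n_b
pooled = (x_a + x_b) / (n_a + n_b)
z = (p_b - p_a) / np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(n, z, p_value)
```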
Drift Detection¶
PSI < 0.1: No significant change
PSI 0.1-0.25: Moderate change
PSI > 0.25: Significant change (investigate!)
KS-test: max |CDF_A(x) - CDF_B(x)|
One-liners:
- Q: "PSI vs KS-test?" → A: PSI for binned distributions (interpretable), KS for continuous (statistical significance).
- Q: "Drift detected. What next?" → A: Investigate cause, evaluate model on new data, consider retraining.
Calibration¶
Platt Scaling: P(calibrated) = sigmoid(A * score + B)
Isotonic: Piecewise constant function
Brier Score: BS = mean((p - y)^2) — lower is better
One-liners:
- Q: "Platt vs Isotonic?" → A: Platt parametric (2 params), Isotonic non-parametric (needs more data).
- Q: "Why calibrate?" → A: When you need accurate probabilities (medical, risk scoring).
RecSys¶
Two-Tower: user_emb · item_emb = similarity
Cold Start: Content-based, popularity, exploration (bandits)
Metrics: NDCG@k, MRR, CTR Lift
One-liners:
- Q: "Two-Tower architecture?" → A: Separate embeddings for user and item, similarity = dot product.
- Q: "Cold start solution?" → A: Content features, popularity, bandit exploration, LLM preferences.
AI Agents Quick Reference¶
ReAct Pattern¶
Loop:
1. Thought: Analyze situation
2. Action: Call tool
3. Observe: See result
4. Repeat until done
One-liners:
- Q: "What is ReAct?" → A: Interleaves reasoning (Thought) with tool use (Action) in a loop.
- Q: "Why ReAct over just prompting?" → A: Can use external tools, transparent reasoning, recover from errors.
Framework Comparison¶
LangGraph: State machine, production, human-in-loop
AutoGen: Multi-agent, Microsoft, conversational
CrewAI: Role-based, simple API, task orchestration
One-liners:
- Q: "LangGraph vs AutoGen?" → A: LangGraph = stateful workflows, AutoGen = multi-agent collaboration.
- Q: "When CrewAI?" → A: Simple role-based tasks, quick prototyping, less control needed.
Common Mistakes to Avoid¶
| Topic | Mistake | Correct |
|---|---|---|
| Preprocessing | fit_transform on test data | fit on train, transform test |
| Split | random split for time series | temporal split |
| Leakage | using future data | check temporal ordering |
| Eval | train accuracy only | use validation/test |
| Regularization | L2 on bias terms | only on weights |
| BatchNorm | using batch stats at inference | use running stats |
| Dropout | using at inference | only during training |
| Learning Rate | same LR throughout | use scheduling |
| Softmax | softmax for multi-label | sigmoid per label |
| Loss | MSE for classification | cross-entropy |
Code Snippets to Memorize¶
```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Stable softmax
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# Cross-entropy loss
def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred + 1e-9))

# Accuracy
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

# Mini-batch loop
for X_batch, y_batch in DataLoader(dataset, batch_size=32, shuffle=True):
    ...  # training step

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Learning rate warmup
for param_group in optimizer.param_groups:
    param_group['lr'] = min(base_lr, current_step / warmup_steps * base_lr)
```
Advanced Topics Quick Reference¶
Vision Transformers (ViT)¶
Pipeline: Image → Patches (16x16) → Linear Project → +Pos Embed → Transformer → [CLS] → MLP Head
Patch Embedding: Conv2d(3, 768, kernel=16, stride=16)
Positional: Learnable (196 patches + 1 CLS)
[CLS] Token: Learnable, aggregates global info for classification
Memory: O(N²) attention = 196² = 38K for 224x224
vs CNN: ViT needs more data (JFT-300M), CNN good with ImageNet
Swin Transformer: Window attention = O(N), hierarchical = CNN-like
One-liners:
- Q: "ViT vs CNN?" → A: CNN has inductive bias (locality), ViT learns it from data. ViT needs more data but scales better.
- Q: "Why [CLS] token?" → A: Learns to aggregate global representation. Alternative is global avg pooling.
Diffusion Models¶
Forward: x_t = sqrt(α_t) * x_{t-1} + sqrt(1-α_t) * ε (add noise)
Reverse: Learn ε_θ(x_t, t) to predict noise added
Training Loss: E[||ε - ε_θ(x_t, t)||²]
DDPM: Stochastic, 1000 steps
DDIM: Deterministic, 10-50 steps, same seed = same output
Latent Diffusion (Stable Diffusion):
Image (512x512) → VAE encode → Latent (64x64) → Diffuse → VAE decode
16x faster, runs on consumer GPU
One-liners:
- Q: "DDPM vs DDIM?" → A: DDPM stochastic/slow, DDIM deterministic/fast. DDIM 20x fewer steps.
- Q: "Classifier-Free Guidance?" → A: Interpolate conditional and unconditional predictions. Scale s > 1 = more prompt adherence.
Long Context & KV Cache¶
KV-Cache: Store K,V for previous tokens, only compute for new token
Memory: 2 * layers * hidden_size * seq_len * 2 bytes (FP16)
70B @ 128K context ≈ 100GB KV-cache
RoPE Scaling:
Linear: Multiply positions by factor (simple, loses detail)
NTK-aware: Adaptive frequency scaling
YaRN: NTK + linear + temperature = SOTA
GQA (Grouped-Query Attention): Groups of heads share KV
Llama-3-70B: 64 query heads, 8 KV heads = 8x memory saving
One-liners:
- Q: "Why RoPE over absolute PE?" → A: Extrapolates to longer sequences, relative position naturally, no learned params.
- Q: "1M context on 8x A100?" → A: GQA-8 + FlashAttention-3 + KV-cache eviction + offloading to CPU.
Mixture of Experts (MoE)¶
Architecture: Dense FFN → N Experts (small FFNs) + Router
Routing: Router(x) → probs → Top-K selection → Weighted sum of expert outputs
y = Σ p_i * E_i(x) for i in Top-K
Key Trade-offs:
Total params vs Active params (Mixtral: 46.7B total, ~13B active)
Compute efficiency vs Memory (need all experts loaded)
Specialization vs Load balance
Expert Collapse Problem:
Router selects same experts always → others "die"
Fix: Load balancing loss, capacity limits, noise injection
Load Balancing Loss:
L_aux = n * Σ f_i * P_i (f_i = token fraction, P_i = prob mass)
Minimize when f_i ≈ P_i (uniform usage)
DeepSeek-V3 Innovations:
- 256 fine-grained experts (vs 8 in Mixtral)
- Top-8 routing (vs Top-2)
- Shared experts (always active) + routed experts
- Auxiliary-loss-free routing (dynamic bias instead)
One-liners:
- Q: "MoE vs Dense inference?" → A: MoE: 3-10x less compute, same quality. But needs all experts in memory.
- Q: "Expert collapse?" → A: Router over-selects same experts. Fix: load balancing loss, capacity limits.
- Q: "When NOT to use MoE?" → A: Small scale (<7B), single-domain, latency-critical, memory-constrained edge.
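A minimal numpy sketch of top-k routing with renormalized gate weights (the expert functions here are toy linear maps):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    # router logits -> softmax probs -> top-k experts -> weighted sum of outputs
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]             # indices of the top-k experts
    weights = probs[top] / probs[top].sum()  # renormalize over the selected k
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d, n_experts = 16, 4
experts = [lambda x, W=np.random.randn(d, d): x @ W for _ in range(n_experts)]
x = np.random.randn(d)
print(moe_forward(x, np.random.randn(d, n_experts), experts).shape)  # (16,)
```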
AI Agent Memory¶
4 Memory Types:
1. Internal Knowledge (model weights) - immutable
2. Context Window (128K tokens) - current conversation
3. Short-term Memory (Redis) - session, TTL 24h
4. Long-term Memory (Vector DB) - persistent across sessions
Episodic: Events ("User asked about X at 3pm")
Semantic: Facts ("User prefers Python over JS")
Procedural: Workflows ("Successful resolution paths")
One-liners:
- Q: "Agent vs LLM?" → A: Agent = LLM + Tools + Memory + Planning.
- Q: "Memory architecture for customer service?" → A: Redis (session) + Vector DB (episodic) + Knowledge Graph (semantic).
Time Series Quick Reference¶
Stationarity: Mean, variance, autocovariance constant over time
ADF Test: H0 = non-stationary, reject if p < 0.05
ARIMA(p,d,q):
p = AR order (PACF cutoff)
d = Differencing order (ADF until stationary)
q = MA order (ACF cutoff)
ACF: Direct + indirect correlation at lag k
PACF: Direct correlation only (partial out intermediate)
Seasonal decomposition:
Additive: Y = Trend + Seasonal + Residual
Multiplicative: Y = Trend × Seasonal × Residual
Log transform: Multiplicative → Additive
Cointegration: Two non-stationary series with stationary linear combination
Test: Engle-Granger (regress, then ADF on residuals)
Application: Pairs trading
One-liners:
- Q: "ARIMA parameter selection?" → A: ADF for d, PACF cutoff for p, ACF cutoff for q.
- Q: "Prophet vs ARIMA?" → A: Prophet for multiple seasonalities + holidays, ARIMA for clean univariate.
Advanced Time Series (Deep Learning)¶
DeepAR (Amazon):
Autoregressive RNN with probabilistic output
Learns from multiple time series (global model)
Output: Distribution (mean + std), not point estimate
Good for: Many related series, cold start with covariates
TFT (Temporal Fusion Transformer):
Variable Selection Network → which features matter
Static covariate encoder → time-invariant features
Gated Residual Network → skip connections + gating
Multi-head attention → interpretability (which past steps matter)
Quantile regression → prediction intervals
Three types of inputs:
- Static: product category, store location
- Known future: holidays, promotions
- Historical: past sales, weather
Prophet (Meta):
y(t) = g(t) + s(t) + h(t) + ε_t
g(t) = trend (piecewise linear or logistic)
s(t) = seasonality (Fourier series)
h(t) = holiday effects
Good for: Business forecasting with seasonality + holidays
N-BEATS:
Stack of FC blocks with forward/backward residuals
Interpretable: trend + seasonality decomposition
Pure deep learning, no hand-crafted features
Cross-validation (Time Series):
Rolling origin: Train on [0:t], test on [t:t+h], expand t
NOT random split! Temporal order must be preserved
One-liners:
- Q: "DeepAR vs ARIMA?" → A: DeepAR for multiple related series with covariates, learns globally. ARIMA for single series.
- Q: "TFT key innovation?" → A: Variable Selection + attention for interpretability. Knows which features and past steps matter.
- Q: "Time series CV?" → A: Rolling origin. Never random split — temporal order matters.
- Q: "Prophet components?" → A: Trend (piecewise) + Seasonality (Fourier) + Holidays + Error.
Causal Inference Quick Reference¶
Association ≠ Causation: Ice cream & crime correlated (confounder: heat)
Potential Outcomes:
Y_i(1) = outcome if treated, Y_i(0) = outcome if not
ITE = Y_i(1) - Y_i(0) [never observed!]
ATE = E[Y(1) - Y(0)]
Confounder: Affects both treatment and outcome → spurious correlation
Propensity Score: P(T=1|X), probability of treatment given covariates
Matching: Pair treated/controls by similar PS
Assumptions: Unconfoundedness, Overlap (0 < PS < 1)
Methods:
PSM: Propensity Score Matching
RDD: Cutoff-based assignment (compare just above/below)
IV: Instrument affects treatment but not outcome directly
DiD: Difference-in-Differences (trend comparison)
Uplift Modeling: Individual treatment effect prediction
Persuadables: Respond only if treated
Sleeping dogs: Respond worse if treated
Sure things: Respond regardless
Lost causes: Never respond
One-liners:
- Q: "RCT vs observational?" → A: RCT randomizes treatment, observational needs confounding control.
- Q: "ATE vs ATT?" → A: ATE = effect on everyone, ATT = effect on treated population.
- Q: "Valid instrument?" → A: Relevant (affects treatment), Exogenous (no direct effect), Excludable.
Bayesian ML & Uncertainty Quick Reference¶
Epistemic Uncertainty: Model ignorance (reducible with more data)
Aleatoric Uncertainty: Data noise (irreducible)
BNN: Weights as distributions, not point estimates
P(y|x,D) = ∫ P(y|x,w) P(w|D) dw
Variational Inference: Approximate posterior with tractable q(w)
ELBO = E_q[log P(D|w)] - KL(q(w) || P(w))
MC Dropout: Dropout at inference ≈ variational approximation
Enable dropout: model.train() at inference!
Predictive uncertainty: Var(y|x) ≈ (1/T) Σ ŷ_t² - ((1/T) Σ ŷ_t)²
Deep Ensembles: Train M models with different init, variance = uncertainty
Better calibrated than MC Dropout in practice
Total variance = average variance + variance of means
Calibration: Predicted probability ≈ actual accuracy
ECE (Expected Calibration Error): Σ (n_i/N) |acc_i - conf_i|
Temperature scaling: p' = softmax(z/T), learn T on validation set
Reliability diagram: bin samples by confidence, plot accuracy
Conformal Prediction: Distribution-free coverage guarantee
P(Y_{n+1} ∈ Ĉ(X_{n+1})) ≥ 1 - α (under exchangeability)
Split conformal: use calibration set, find quantile q
Prediction set: Ĉ(x) = {y: s(x,y) ≤ q} where s is score function
OOD Detection: Identify out-of-distribution samples
MSP (Max Softmax Probability): max(softmax(f(x)))
Energy score: E(x) = -T log Σ exp(f_i(x)/T)
Mahalanobis: (x-μ)ᵀ Σ⁻¹ (x-μ) in feature space
ODIN: Input perturbation + temperature scaling
Selective Prediction: Abstain when uncertain
Trade-off: Coverage ↓ → Accuracy ↑
SelectiveNet: Dedicated selection head, g(x) threshold
Conformal selective: Guarantee (1-α) coverage on accepted
One-liners:
- Q: "Epistemic vs aleatoric?" → A: Epistemic = model uncertainty (fix with data), aleatoric = noise (inherent).
- Q: "MC Dropout how?" → A: Keep dropout ON at inference, sample N predictions, std = uncertainty.
- Q: "MC Dropout vs Ensembles?" → A: Ensembles better calibrated, MC Dropout cheaper (one model).
- Q: "What is calibration?" → A: Predicted prob ≈ actual accuracy. Fix with temperature scaling.
- Q: "Conformal prediction guarantee?" → A: P(Y ∈ Ĉ(X)) ≥ 1-α under exchangeability, no assumptions about data distribution.
- Q: "OOD detection baselines?" → A: MSP (max softmax), Energy (higher = OOD), Mahalanobis distance in feature space.
- Q: "Selective prediction?" → A: Model can say "I don't know", trade coverage for accuracy.
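A minimal numpy sketch of the ECE formula above (bin edges and toy data are illustrative):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    # Expected Calibration Error: sum_i (n_i / N) * |acc_i - conf_i|
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return total

conf = np.array([0.9, 0.8, 0.7, 0.95, 0.6])
correct = np.array([1, 1, 0, 1, 1])
print(ece(conf, correct))
```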
Second-Order Optimization Quick Reference¶
Newton's Method: θ = θ - H⁻¹∇L
Uses curvature (Hessian), not just gradient
Quadratic convergence vs SGD linear
Problem: H⁻¹ is O(D³), infeasible for deep learning
BFGS: Quasi-Newton, approximates H⁻¹ iteratively
Memory: O(D²)
L-BFGS: Limited-memory BFGS
Stores only last m gradient differences
Memory: O(mD), practical for large problems
Natural Gradient: ∇̃L = F⁻¹∇L
F = Fisher Information Matrix
Invariant to reparameterization
K-FAC: Kronecker approximation
Conjugate Gradient: Directions conjugate to previous
d_t = -g_t + β_t·d_{t-1}
Guaranteed convergence in ≤ D steps for quadratic
No Hessian storage needed
One-liners:
- Q: "Why not Newton in DL?" → A: Hessian is O(D²) memory, O(D³) inverse. Use L-BFGS or Adam.
- Q: "L-BFGS vs BFGS?" → A: L-BFGS stores only last m updates, BFGS stores full inverse approximation.
- Q: "Natural gradient intuition?" → A: Steepest descent in distribution space, not parameter space.
Explainable AI (XAI) Quick Reference¶
SHAP (SHapley Additive exPlanations):
Theory: Game theory Shapley values
Formula: φ_i = Σ [|S|!(M-|S|-1)!/M!] × [f(S∪{i}) - f(S)]
Guarantees:
Efficiency: Σ φ_i = f(x) - E[f(X)]
Symmetry: Equal contribution = equal SHAP
Dummy: No impact = SHAP = 0
Additivity: SHAP_ensemble = sum of SHAPs
Variants:
TreeSHAP: Exact for trees, O(TLD²), fast (~65ms)
DeepSHAP: Neural networks via DeepLIFT
KernelSHAP: Model-agnostic, slow (~450ms)
LIME (Local Interpretable Model-agnostic Explanations):
Theory: Local surrogate model
Formula: ξ(x) = argmin L(f, g, π_x) + Ω(g)
Process:
1. Perturb input → synthetic samples
2. Weight by proximity: π_x(z) = exp(-D(x,z)²/σ²)
3. Fit interpretable model (linear)
4. Coefficients = explanation
Performance Comparison:
Method Speed Memory Stability
SHAP (Tree) 65ms 78MB 95%
SHAP (Kernel) 450ms 680MB 96%
LIME 85ms 92MB 82%
When to use:
SHAP: Tree models, global explanations, regulated industries
LIME: Novel architectures, quick single explanations, non-technical stakeholders
Failure modes:
- Correlated features → underestimated importance
- OOD samples → unreliable (40%+ error)
- Feature interactions → linear models miss them
One-liners:
- Q: "SHAP vs LIME?" → A: SHAP = theoretical rigor (game theory), LIME = flexible but unstable. Use SHAP for regulated, LIME for quick.
- Q: "SHAP guarantees?" → A: Efficiency (sum = prediction-baseline), Symmetry, Dummy, Additivity. Only method with all four.
- Q: "LIME instability fix?" → A: Multiple runs + average, increase num_samples, or use SHAP instead.
- Q: "XAI for deep learning?" → A: DeepSHAP (backprop-based) or GradientSHAP. TreeSHAP only for trees.
Neural Architecture Search (NAS) Quick Reference¶
3 Components:
Search Space: What architectures are allowed (ops, connections)
Search Strategy: How to explore (RL, EA, gradient, Bayesian)
Performance Estimation: How to evaluate candidates fast
Search Strategies:
RL: RNN controller, reward = accuracy. 1800 GPU-days (NASNet)
Evolution: Population + mutation + crossover. Simple but expensive
DARTS: Continuous relaxation, gradient on α params. 1 GPU-day
One-Shot: Train supernet, sample subnets. 10000x faster
DARTS Formula:
ō(x) = Σ_o softmax(α)_o × o(x)
After training: argmax(α) → discrete architecture
Cell-Based Search:
Search small reusable cell, not whole network
Normal cell: same resolution
Reduction cell: halve resolution
Stack N cells → full network
Hardware-Aware NAS:
Loss = Accuracy - λ × Latency
Optimize for target device (mobile, edge)
OFA: Train once, specialize for many devices
When NOT to use NAS:
- Small scale (<7B params)
- Single-domain task
- Strong baseline exists (ResNet/EfficientNet)
- Limited compute budget
One-liners:
- Q: "NAS vs manual design?" → A: NAS finds novel architectures, but 100+ GPU-days. Use when unique constraints or novel task.
- Q: "DARTS vs RL NAS?" → A: DARTS = gradient-based, 1 GPU-day. RL = 1800 GPU-days but more thorough.
- Q: "Hardware-aware NAS?" → A: Add latency/memory to loss. Optimize for deployment target, not just accuracy.
- Q: "Cell-based vs macro search?" → A: Cell = search small block, repeat. Macro = search whole network. Cell faster, transferable.
Generative Models (VAE/GAN) Quick Reference¶
Autoencoder: x → encoder → z → decoder → x̂
Loss: ||x - x̂||² + regularization
Use: Dimensionality reduction, denoising, anomaly detection
VAE: Probabilistic latent space
z ~ q(z|x) = N(μ(x), σ²(x))
Loss: Reconstruction + KL(q(z|x) || p(z))
Reparameterization: z = μ + σ·ε, ε ~ N(0,I)
Issue: Blurry outputs (averages modes)
GAN: Adversarial training
min_G max_D E[log D(x)] + E[log(1-D(G(z)))]
Mode collapse: G produces limited variety
Solutions: WGAN, feature matching, spectral norm
WGAN: Wasserstein distance
No sigmoid in D (critic)
Gradient penalty or weight clipping
Meaningful loss curve
DCGAN rules:
- Strided conv instead of pooling
- BatchNorm (except G output, D input)
- ReLU in G, LeakyReLU in D
- No FC layers
One-liners:
- Q: "VAE vs AE?" → A: VAE learns distribution, generative. AE learns compression, not generative.
- Q: "Why VAE blurry?" → A: Averages over modes when uncertain. GAN forces sharp outputs.
- Q: "Mode collapse?" → A: Generator produces limited variety. Fix with WGAN, mini-batch discrimination.
- Q: "WGAN vs GAN?" → A: WGAN uses Wasserstein distance, stable gradients, meaningful loss.
NLP & Word Embeddings Quick Reference¶
Word2Vec:
CBOW: Context → Center word (faster, good for frequent words)
Skip-gram: Center word → Context (better for rare words)
Negative Sampling:
Replace softmax (expensive) with binary classification
Loss = log(σ(v_context · v_center)) + Σ log(σ(-v_neg · v_center))
K = 5-20 negative samples per positive
Sampling: P(w)^0.75 (boost rare words)
GloVe:
Count-based, global co-occurrence statistics
Loss = Σ f(X_ij) (w_i · w̃_j + b_i + b̃_j - log X_ij)²
Better for analogies, Word2Vec better for fine-tuning
Analogies:
king - man + woman ≈ queen (vector arithmetic)
Limitation: Polysemy (bank = river/financial → same vector)
One-liners:
- Q: "CBOW vs Skip-gram?" → A: CBOW predicts center from context (fast), Skip-gram predicts context from center (rare words).
- Q: "Why negative sampling?" → A: Softmax over 100K vocab is expensive. Binary classification on K negatives is O(K) vs O(V).
- Q: "Word2Vec vs GloVe?" → A: Word2Vec = predictive (local), GloVe = count-based (global). Both learn similar embeddings.
NER & Sequence Labeling¶
BIO Tagging:
B-PER: Begin person entity
I-PER: Inside person entity
O: Outside any entity
NER Evaluation:
Token-level: P/R/F1 per class
Entity-level: Exact match required (stricter)
CoNLL-2003: Entity-level F1 standard
CRF for NER:
Learns transition constraints (I-PER follows B-PER, not I-ORG)
P(y|x) = (1/Z) exp(Σ θ · features)
BiLSTM-CRF:
BiLSTM: Contextual representations
CRF: Valid transition sequences
BERT for NER:
Fine-tune + linear classifier
Use first subword token for entity label
SOTA: 93+ F1 on CoNLL-2003
One-liners:
- Q: "CRF vs BiLSTM for NER?" → A: CRF learns tag transitions, BiLSTM learns context. BiLSTM-CRF = best of both.
- Q: "NER entity vs token F1?" → A: Entity-level requires exact boundaries + type match. Stricter but more realistic.
- Q: "BERT for NER subwords?" → A: Use first subword label, ignore rest. "Califor##nia" → B-LOC on "Califor".
POS Tagging¶
HMM Tagger:
P(t|w) ∝ P(w|t) · P(t|t_prev)
Emission: Word given tag
Transition: Tag bigram
Viterbi: Best path through tags
CRF Tagger:
Global normalization over sequences
Features: word, suffix, prefix, neighboring tags
Better than HMM (no independence assumption)
Modern: BERT fine-tuning
Token → BERT → Linear classifier
97%+ accuracy on Penn Treebank
One-liners:
- Q: "HMM vs CRF for POS?" → A: HMM generative (P(x,y)), CRF discriminative (P(y|x)). CRF more flexible features.
- Q: "Why contextual embeddings for POS?" → A: "can" (verb vs noun) determined by context. BERT captures this.
Recommendation Systems Quick Reference¶
Collaborative Filtering:
User-based: Find similar users → recommend their items
Item-based: Find similar items → recommend (preferred for scale)
Matrix: R ≈ U × V^T (factorization)
Matrix Factorization:
SGD: u_i += η(e_ui × v_j - λ×u_i)
ALS: Alternating least squares
Libraries: Implicit, LightFM
Cold Start Solutions:
New User: Content-based, ask preferences, popularity
New Item: Content features, bandits, side info
Two-Tower Architecture:
User Tower: Features → Embedding
Item Tower: Features → Embedding
Score: Dot product → ANN search (FAISS)
RecSys Pipeline:
1. Retrieval: ANN → 1000 candidates
2. Ranking: GBDT/Deep → Top 100
3. Re-ranking: Diversity, business rules → Final 20
One-liners:
- Q: "CF vs Content-Based?" → A: CF uses user-item interactions (discovery), CB uses item features (explainability).
- Q: "Item-based vs User-based?" → A: Item-based more stable, pre-computable, better for production scale.
- Q: "Two-Tower advantage?" → A: Decoupled inference, pre-compute item embeddings, ANN search for millions of items.
- Q: "Cold start for new user?" → A: Content-based first, popularity baseline, ask preferences, smooth to CF as data accumulates.
Hyperparameter Optimization Quick Reference¶
Parameters vs Hyperparameters:
Parameters: Learned from data (weights, biases)
Hyperparameters: Set before training (lr, batch_size, layers)
Grid Search: Exhaustive over all combinations
O(n^k) for k params with n values each
Use for small search spaces
Random Search: Sample random combinations
Often better than grid (explores more values for important params)
Paper: Bergstra & Bengio 2012
Bayesian Optimization:
Build surrogate model (Gaussian Process) of objective
Acquisition function (EI, UCB) guides search
Trade-off: exploration vs exploitation
Use when evaluations are expensive
Optuna: Single-node, TPE sampler, pruning
Ray Tune: Distributed, PBT, ASHA, Hyperband
Priority for tuning:
1. Learning rate (biggest impact)
2. Batch size
3. Optimizer (Adam vs SGD)
4. Architecture (layers, units)
5. Regularization (dropout, weight decay)
Early Stopping in HPO:
Median pruning: Stop if worse than median at step k
ASHA/Hyperband: Promote top performers, stop rest
One-liners:
- Q: "Grid vs Random search?" → A: Random often better — explores more values for important params. Grid wastes trials on unimportant dimensions.
- Q: "When Bayesian?" → A: Expensive evaluations + low-dimensional space + smooth objective. Otherwise random is fine.
- Q: "Optuna vs Ray Tune?" → A: Optuna simpler, single-node. Ray Tune distributed, PBT, ASHA.
- Q: "What to tune first?" → A: Learning rate → batch size → optimizer → architecture → regularization.
- Q: "Multi-objective HPO?" → A: Pareto front — solutions where no objective can improve without worsening another.
Reinforcement Learning Quick Reference¶
Value-based vs Policy-based:
Value: Learn Q(s,a), choose argmax (DQN)
Policy: Learn π(a|s) directly (REINFORCE, PPO)
Q-Learning:
Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') - Q(s,a)]
Bellman equation, tabular doesn't scale
DQN Innovations:
1. Experience Replay: Break sample correlation
2. Target Network: Stable learning targets
Double DQN: Decouple action selection from evaluation
Dueling DQN: Q = V(s) + A(s,a)
Policy Gradient:
∇J(θ) = E[∇log π(a|s) · G]
G = discounted return
REINFORCE: Vanilla policy gradient, high variance
Use advantage A = Q - V as baseline
PPO: Clip policy ratio to prevent large updates
L = min(r·A, clip(r, 1-ε, 1+ε)·A)
Most robust general-purpose algorithm
Actor-Critic: Actor (policy) + Critic (value)
A2C: Synchronous, A3C: Async
SAC: Entropy bonus, continuous actions, off-policy
TD3: Twin critics, delayed updates
One-liners:
- Q: "DQN vs PPO?" → A: DQN discrete only, off-policy, sample efficient. PPO both, on-policy, more stable.
- Q: "Why experience replay?" → A: Breaks correlation between samples, enables off-policy learning from past experience.
- Q: "Why target network?" → A: Prevents oscillations from chasing moving target. Update periodically, not every step.
- Q: "PPO popularity?" → A: Clipped updates = stable. Works for discrete and continuous. Simple to implement.
- Q: "SAC vs TD3?" → A: SAC has entropy bonus (better exploration), TD3 simpler but needs more tuning. Both continuous.
- Q: "Which algorithm first?" → A: PPO. Most robust, works for most problems. Tune from there.
Object Detection Quick Reference¶
One-stage vs Two-stage:
Two-stage (Faster R-CNN): RPN → RoI → Classify (accurate, slow)
One-stage (YOLO, SSD): Direct prediction (fast, slightly less accurate)
Anchor Boxes: Pre-defined box templates
Each location: K anchors (3 scales × 3 ratios = 9)
Model predicts offsets, not absolute boxes
Modern trend: Anchor-free (FCOS, CenterNet)
IoU (Intersection over Union):
IoU = Area(A ∩ B) / Area(A ∪ B)
Variants: GIoU (handles non-overlap), DIoU (+distance), CIoU (+aspect ratio)
NMS (Non-Maximum Suppression):
Sort boxes by score, keep highest
Remove overlapping boxes (IoU > threshold)
Soft-NMS: Reduce score instead of remove
mAP Calculation:
AP = Area under Precision-Recall curve (per class)
mAP = Mean of AP across all classes
COCO: mAP@0.5:0.95 (average over 10 thresholds)
One-liners:
- Q: "YOLO vs Faster R-CNN?" → A: YOLO single-shot, real-time (~45 FPS). Faster R-CNN two-stage, more accurate (~5 FPS).
- Q: "What is anchor box?" → A: Pre-defined template at each location. Model predicts offset from anchor, not absolute coordinates.
- Q: "NMS purpose?" → A: Remove duplicate detections of same object. Keep highest score, suppress overlapping boxes.
- Q: "mAP@0.5 vs mAP@0.5:0.95?" → A: mAP@0.5 uses IoU threshold 0.5. mAP@0.5:0.95 averages over thresholds 0.5-0.95 (COCO standard).
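A minimal numpy sketch of IoU plus greedy NMS as described above (toy boxes for illustration):

```python
import numpy as np

def iou(a, b):
    # boxes as [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        mask = np.array([iou(boxes[best], boxes[i]) <= thresh for i in rest], dtype=bool)
        order = rest[mask]            # drop boxes overlapping the kept one
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]])
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2]
```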
YOLO Evolution¶
YOLOv1: 7×7 grid, 2 boxes per cell, no anchors
YOLOv2: Anchors, batch norm, multi-scale training
YOLOv3: Feature pyramid (3 scales), Darknet-53
YOLOv5: PyTorch, auto-learning anchors, Mosaic augmentation
YOLOv8: Anchor-free, decoupled head
YOLOv10: NMS-free training, consistent dual assignments
Loss: λ_coord × L_loc + L_conf + λ_noobj × L_noobj + L_cls
Contrastive & Self-Supervised Learning Quick Reference¶
Contrastive Learning:
Pull positive pairs together, push negative pairs apart
Loss = -log(exp(sim(z_i,z_j)/τ) / Σ exp(sim(z_i,z_k)/τ))
SimCLR:
Augment → Encoder → Projection head → NT-Xent loss
Needs large batch (4096+) for enough negatives
Key: Strong augmentation (crop + color + blur)
CLIP:
Image encoder + Text encoder → Joint embedding
Contrastive loss on 400M (image, text) pairs
Zero-shot: "a photo of a {class}" → classify
MoCo:
Memory bank (queue) instead of large batch
Momentum encoder for stable targets
Smaller batch, same performance
BYOL:
No negative pairs!
Online network + Target network (momentum)
Predictor prevents collapse
One-liners:
- Q: "Contrastive learning intuition?" → A: Pull augmented views of same image together, push different images apart in embedding space.
- Q: "SimCLR vs MoCo?" → A: SimCLR needs large batch (4096+) for negatives. MoCo uses memory queue, works with smaller batches.
- Q: "BYOL without negatives?" → A: Momentum target network + predictor. Stop-gradient prevents collapse.
- Q: "CLIP zero-shot?" → A: Encode class names as text ("a photo of a dog"), compute similarity with image embedding.
- Q: "Linear probe evaluation?" → A: Freeze encoder, train only linear classifier. Measures feature quality.
Self-Supervised Pretext Tasks¶
Contrastive: SimCLR, MoCo, BYOL, SimSiam
Masked: MAE (mask 75%, reconstruct), BEiT (predict tokens)
Pretext (older): Rotation, Jigsaw, Colorization, Inpainting
2025-2026 Trend: MAE + contrastive hybrid
Model Compression & Quantization Quick Reference¶
Compression Techniques:
Quantization: FP32 → INT8/INT4 (4x-8x smaller)
Pruning: Remove weights/neurons (2x smaller)
Distillation: Train small from large (10-100x smaller)
PTQ vs QAT:
PTQ: Post-training, fast (minutes), may lose accuracy
QAT: During training, slow (full retrain), preserves accuracy
Quantization Types:
Symmetric: Zero-point = 0, simpler (weights)
Asymmetric: Zero-point ≠ 0, more accurate (activations)
Pruning:
Unstructured: Remove individual weights (needs sparse hardware)
Structured: Remove channels/filters (speedup on any hardware)
Knowledge Distillation:
L = α * L_hard + (1-α) * T² * L_soft
Temperature T: Higher = softer distribution
"Dark knowledge": Which classes teacher thinks are similar
LLM Quantization:
GPTQ: Post-training, layer-by-layer, INT4
AWQ: Activation-aware, protects salient weights
GGUF: llama.cpp format, CPU-optimized
One-liners:
- Q: "PTQ vs QAT?" → A: PTQ is fast post-training, QAT is during training but preserves accuracy better.
- Q: "Structured vs unstructured pruning?" → A: Structured removes entire channels (actual speedup), unstructured removes individual weights (needs sparse hardware).
- Q: "Why temperature in distillation?" → A: Higher T softens probability distribution, revealing which classes teacher thinks are similar ("dark knowledge").
- Q: "GPTQ vs AWQ?" → A: Both INT4 post-training. AWQ protects important weights based on activations, slightly better for low bits.
- Q: "When to compress?" → A: Edge deployment, cost reduction, inference speed critical.
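A minimal PyTorch sketch of the distillation loss L = α·L_hard + (1-α)·T²·L_soft from above (the constants are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),  # student log-probs at temperature T
        F.softmax(teacher_logits / T, dim=-1),      # teacher soft targets
        reduction="batchmean",
    )
    return alpha * hard + (1 - alpha) * T**2 * soft  # T^2 keeps gradient scale comparable

s = torch.randn(8, 10)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```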
Active Learning Quick Reference¶
Active Learning Loop:
1. Train on labeled L
2. Query most informative from unlabeled U
3. Add to L, repeat
Query Strategies:
Least Confidence: max(1 - P(ŷ|x))
Margin Sampling: min(P(y₁|x) - P(y₂|x))
Entropy: max(-Σ P(y|x) log P(y|x))
Query-by-Committee (QBC):
Train multiple models
Query samples with highest disagreement
Measure: Vote entropy, KL divergence
Expected Model Change:
Query sample with largest expected gradient
EGL = E[||∇L(x,y)||]
Diversity Sampling:
Balance uncertainty with coverage
k-Center: Cover unlabeled pool
BADGE: Gradient embeddings + k-means++
When Active Learning FAILS:
- Very small initial set (model too weak)
- Highly imbalanced data
- Noisy labels amplify errors
- Budget too small (<100 samples)
Best Practice:
70% uncertainty + 30% diversity
Cold start: First 50-100 random
One-liners:
- Q: "Active learning goal?" → A: Achieve target accuracy with minimum labeling cost by selecting most informative samples.
- Q: "Uncertainty sampling methods?" → A: Least confidence (max uncertainty), margin (closest top 2), entropy (highest distribution uncertainty).
- Q: "Query-by-Committee?" → A: Train multiple models, query samples where they disagree most.
- Q: "When NOT to use active learning?" → A: Very small initial set, noisy labels, budget <100 samples, highly imbalanced data.
- Q: "Combine uncertainty + diversity?" → A: 70% uncertain samples + 30% diverse samples prevents redundant queries.
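A minimal numpy sketch of entropy-based uncertainty sampling from the strategies above (budget and data are illustrative):

```python
import numpy as np

def entropy_query(probs, budget=10):
    # probs: (n_unlabeled, n_classes) predicted class distributions
    H = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(H)[-budget:]  # indices of the most uncertain samples

probs = np.random.dirichlet(np.ones(3), size=100)
print(entropy_query(probs, budget=5))
```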
LLM Inference Optimization Quick Reference¶
Quantization Types:
Per-tensor: Single scale for entire tensor (simpler)
Per-channel: Scale per output channel (more accurate)
GPTQ:
Post-training INT4 with Hessian-based compensation
Works on 70B+ models with <1% perplexity loss
Update: delta = -(w_q - quant(w_q)) / [H^-1]_qq * (H^-1)_{:,q}
Speculative Decoding:
Draft model proposes k tokens
Main model verifies in single forward pass
Accept until mismatch, reject rest
Speedup: 2-3x for memory-bound generation
Flash Attention:
Tiled computation (never materialize N×N matrix)
Memory: O(N) vs O(N²) standard
Speedup: 2-4x faster, 10x less memory
v2: Better parallelization, v3: H100 FP8
Hardware Optimization:
CPU: INT8/INT4, ONNX Runtime, llama.cpp
GPU: Flash Attention, vLLM, TensorRT
TPU: XLA, JAX optimization
One-liners:
- Q: "Per-tensor vs per-channel?" → A: Per-tensor = one scale for all weights (simpler). Per-channel = scale per channel (more accurate for varying importance).
- Q: "GPTQ advantage?" → A: INT4 quantization with Hessian compensation, <1% perplexity loss, works on 70B+ models.
- Q: "Speculative decoding?" → A: Draft model proposes tokens, main model verifies in one pass. 2-3x speedup.
- Q: "Flash Attention memory?" → A: O(N) vs O(N²) by tiling computation, never materializing full attention matrix.
- Q: "CPU vs GPU for inference?" → A: CPU for edge/low-latency with quantization. GPU for throughput with Flash Attention + batching.
Gradient Checkpointing Quick Reference¶
Core Trade-off: Memory ↔ Compute
Don't store activations → Recompute during backward
Savings: O(n) → O(√n) memory
Cost: ~20-30% more compute
PyTorch APIs:
checkpoint(fn, *args) # Single function
checkpoint_sequential(mod, n, x) # N segments
Memory Formula:
Peak memory = Static weights + Peak activations
With checkpointing: Peak activations ≈ √n segments
Best Practices:
- Checkpoint heavy layers (attention, wide linear)
- Avoid checkpointing Dropout, BatchNorm
- Combine with mixed precision for max savings
One-liners:
- Q: "Gradient checkpointing?" → A: Trade compute for memory by recomputing activations during backward instead of storing them.
- Q: "Memory savings?" → A: 50-70% reduction in peak memory for ~20-30% more training time.
- Q: "When to use?" → A: OOM errors, need larger batch size, training deep models on limited GPU memory.
- Q: "Checkpointing vs accumulation?" → A: Checkpointing reduces memory, accumulation simulates larger batch with same memory.
- Q: "What NOT to checkpoint?" → A: Dropout (RNG issues), BatchNorm (running stats updated twice), in-place ops.
Mixed Precision Quick Reference¶
Core Idea: FP16/BF16 compute, FP32 master weights
FP16 vs BF16:
FP16: 5-bit exp, 10-bit mantissa, max 65504
BF16: 8-bit exp, 7-bit mantissa, same range as FP32
Loss Scaling (FP16 only):
Scale loss before backward: L * scale
Unscale before optimizer: grad / scale
Dynamic: Reduce on inf/nan, increase when stable
PyTorch AMP:
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast(dtype=torch.float16): # or torch.bfloat16
loss = model(x)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Benefits:
Memory: ~2x reduction
Speed: 2-3x on Tensor Cores
Accuracy: Minimal loss with proper scaling
One-liners:
- Q: "Mixed precision?" → A: FP16/BF16 for compute (2x memory savings, 2-3x speedup), FP32 for master weights (precision).
- Q: "FP16 vs BF16?" → A: FP16 = better precision, risk of overflow. BF16 = same range as FP32, no overflow, but lower precision.
- Q: "Why loss scaling?" → A: Prevent gradient underflow in FP16 by scaling up loss, then unscaling gradients before optimizer.
- Q: "When BF16?" → A: Ampere+ GPUs (A100, H100), no loss scaling needed, simpler than FP16.
- Q: "GradScaler?" → A: Automatically scales loss, detects inf/nan, adjusts scale dynamically for FP16 training.
Distributed Training Quick Reference¶
Parallelism Types:
Data Parallel: Model replicated, data split (DDP)
Model Parallel: Model split across GPUs
Pipeline: Different layers on different GPUs
Tensor: Split individual operations (e.g., big matmul)
ZeRO Stages (DeepSpeed):
ZeRO-1: Shard optimizer states only (~4x memory)
ZeRO-2: + gradients (~8x memory)
ZeRO-3: + parameters (~N× memory, N = GPUs)
Memory Math (7B model):
Standard DDP: 28GB params + 28GB grads + 56GB opt = 112GB
ZeRO-3 (8 GPU): ~14GB per GPU
FSDP vs DeepSpeed:
FSDP: PyTorch native, 90% memory savings, simpler
DeepSpeed: 95% savings, more features, complex setup
Gradient Accumulation:
Effective batch = actual_batch × accum_steps
loss = model(x) / accum_steps # Scale down
optimizer.step() every N batches
Pipeline Parallelism:
Model split into stages on different GPUs
Micro-batches fill pipeline bubbles
Best combined with tensor parallel (3D parallelism)
One-liners:
- Q: "Data vs Model parallel?" → A: Data parallel = same model on all GPUs with different data. Model parallel = different parts of model on different GPUs.
- Q: "ZeRO-3 savings?" → A: Shards params + grads + optimizer states. ~N× memory savings where N = number of GPUs.
- Q: "FSDP vs DeepSpeed?" → A: FSDP = PyTorch native, simpler, 90% savings. DeepSpeed = more features, 95% savings, complex.
- Q: "Gradient accumulation?" → A: Simulate larger batch by accumulating gradients over N steps before optimizer update.
- Q: "When pipeline parallel?" → A: Models too large for data parallel alone, combined with ZeRO/tensor parallel for 3D parallelism.
Feature Stores Quick Reference¶
Core Purpose:
Consistency: Same features for training and serving
Reusability: Share features across models
Time-travel: Point-in-time correct joins
Freshness: Real-time feature serving
Offline vs Online:
Offline: Training, hours latency, Parquet/Delta
Online: Inference, <10ms, Redis/DynamoDB
Point-in-Time Join:
Join feature by entity_id + timestamp
Prevents data leakage (feature value as of event time)
Feast vs Tecton vs Hopsworks:
Feast: OSS, free, limited real-time
Tecton: Managed SaaS, enterprise, $$$
Hopsworks: Hybrid, mid-market, $$
Key Concepts:
Materialization: Offline → Online (batch/streaming)
Feature Groups: Group by freshness requirements
Monitoring: Freshness alerts, latency P99
One-liners:
- Q: "Why feature store?" → A: Ensures consistency between training and serving, prevents data leakage, enables feature reuse.
- Q: "Offline vs Online store?" → A: Offline = training (Parquet, batch). Online = inference (Redis, <10ms latency).
- Q: "Point-in-time join?" → A: Join features as they existed at event time, preventing future data leakage.
- Q: "Feast vs Tecton?" → A: Feast = OSS, free, DIY. Tecton = managed, enterprise, expensive.
Uplift Modeling Quick Reference¶
Core Formula:
Uplift = P(Y|T=1) - P(Y|T=0)
Incremental effect of treatment on individual
User Segments:
Persuadables: Positive uplift → Target!
Sure Things: Buy anyway → Don't waste treatment
Lost Causes: Won't buy → Don't target
Sleeping Dogs: Negative uplift → Avoid!
Model Types:
T-Learner: Two models (treatment, control), difference
S-Learner: Single model with treatment as feature
X-Learner: Propensity-weighted, handles imbalanced groups
Evaluation (no ground truth!):
AUUC: Area Under Uplift Curve (rank by predicted uplift)
Qini: Cumulative treatment effect vs random
Uplift@k: Effect in top-k predictions
One-liners:
- Q: "Uplift modeling?" → A: Estimate incremental effect of treatment on individual users, not just average effect.
- Q: "Persuadables vs Sleeping Dogs?" → A: Persuadables = treatment helps (target!). Sleeping Dogs = treatment hurts (avoid!).
- Q: "T-Learner vs S-Learner?" → A: T-Learner = two separate models for treatment/control. S-Learner = one model with treatment as feature.
- Q: "How to evaluate without ground truth?" → A: AUUC, Qini coefficient, uplift-at-k using held-out A/B test data.
LLM Alignment Quick Reference (RLHF, DPO, GRPO)¶
RLHF Pipeline:
1. SFT on quality examples
2. Train reward model on preferences
3. PPO optimize with reward model
PPO vs DPO:
PPO: 4× models (policy, reward, critic, reference)
Higher quality, unstable, complex
DPO: 2× models (policy, reference)
Simpler, stable, lower compute
GRPO (DeepSeek-R1):
No critic model (like DPO)
Group-relative advantages
93% less compute than PPO
Pure RL - reasoning emerges
Reward Hacking:
Model exploits reward without solving task
Solution: Sparse rewards, adversarial training, human eval
When to Use:
Style/tone: DPO
New knowledge: RAG
Reasoning: PPO/GRPO
Safety: PPO + Constitutional AI
One-liners:
- Q: "RLHF purpose?" → A: Align LLM with human preferences - helpful, harmless, honest.
- Q: "PPO vs DPO?" → A: PPO = 4× models, higher quality, complex. DPO = 2× models, simpler, good for style tasks.
- Q: "GRPO advantage?" → A: No critic model, 93% less compute than PPO, group-relative ranking.
- Q: "Reward hacking?" → A: Model exploits reward signal without solving task. Fix with sparse rewards, adversarial training.
- Q: "RLHF vs RAG?" → A: RAG for new knowledge, RLHF for reasoning/style improvement.
GNN Quick Reference¶
Message Passing:
h_v^(l+1) = UPDATE(h_v^l, AGGREGATE({h_u^l : u in N(v)}))
Aggregate: sum, mean, max, attention-weighted
Update: MLP, GRU, identity
Architecture Comparison:
GCN: Fixed weights (D^-0.5 A D^-0.5), transductive
GAT: Learned attention, transductive
GraphSAGE: Sample neighbors, INDUCTIVE (new nodes!)
GIN: Sum aggregation = WL-equivalent
Key Problems:
Over-smoothing: All nodes same after many layers
Fix: JK-Net, residual, PairNorm, fewer layers
Heterogeneous: Different node/edge types
Fix: R-GCN (separate weights), HAN (metapath attention)
One-hop = neighbors, Two-hop = neighbors of neighbors
One-liners:
- Q: "GCN vs GAT?" → A: GCN = fixed aggregation weights. GAT = learned attention per edge.
- Q: "GraphSAGE advantage?" → A: INDUCTIVE - works on unseen nodes without retraining (neighbor sampling).
- Q: "Over-smoothing?" → A: After many layers, all node representations become identical. Fix: JK-Net, residual, 2-3 layers max.
- Q: "GIN expressiveness?" → A: GIN with sum aggregation is as powerful as WL graph isomorphism test. Mean/max are not.
- Q: "Heterogeneous graphs?" → A: R-GCN (different weights per edge type), HAN (metapath attention).
Diffusion Models Quick Reference¶
Core Process:
Forward: x_0 -> x_T (add noise gradually)
Reverse: x_T -> x_0 (denoise with neural net)
Training: Predict noise epsilon from x_t
Loss: E[||epsilon - epsilon_theta(x_t, t)||^2]
DDPM vs DDIM:
DDPM: Stochastic, 1000 steps, higher quality
DDIM: Deterministic (eta=0), 10-50 steps, 10-100x faster
Classifier-Free Guidance:
epsilon_tilde = epsilon(uncond) + s * (epsilon(c) - epsilon(uncond))
s=1: no guidance, s=7-15: typical for Stable Diffusion
Latent Diffusion:
Compress image with VAE -> diffuse in latent space -> decode
16x+ faster, lower memory
Architectures:
U-Net: Conv + attention + AdaGN for time
DiT: Pure transformer, patchify, scales better
One-liners:
- Q: "Diffusion training objective?" → A: Predict noise epsilon from noisy image x_t at timestep t.
- Q: "DDPM vs DDIM?" → A: DDPM = stochastic Markov chain, 1000 steps. DDIM = deterministic, 10-50 steps, much faster.
- Q: "Classifier-Free Guidance?" → A: Combine conditional and unconditional predictions with guidance scale s. Higher s = more faithful, less diverse.
- Q: "Latent Diffusion?" → A: Diffuse in compressed latent space (VAE), not pixels. 16x+ faster training.
- Q: "U-Net vs DiT?" → A: U-Net = conv + attention, inductive bias for locality. DiT = pure transformer, better scaling.
- Q: "Consistency Models?" → A: Map any x_t directly to x_0 in one step. Distill from diffusion or train from scratch.
Reinforcement Learning Quick Reference¶
Algorithm Types:
Value-based: Learn Q(s,a), greedy action selection
Policy-based: Learn pi(a|s) directly, high variance
Actor-Critic: Both - lower variance, best of both
Q-Learning Update:
Q(s,a) <- Q(s,a) + alpha[r + gamma * max Q(s',a') - Q(s,a)]
DQN Key Components:
Experience Replay: Break temporal correlation
Target Network: Stabilize training
Double DQN: Reduce overestimation
Algorithm Selection:
Discrete: DQN, PPO
Continuous: PPO, SAC, TD3
Sample Efficient: SAC (off-policy)
Stable/Simple: PPO (default choice)
Exploration Strategies:
epsilon-greedy: Random action with prob epsilon
Entropy bonus: -beta * sum(pi * log(pi))
UCB: Q + c * sqrt(ln N / n)
One-liners:
- Q: "Value-based vs Policy-based?" → A: Value-based learns Q(s,a), policy-based learns pi(a|s) directly. Actor-Critic combines both.
- Q: "Q-Learning update?" → A: Q(s,a) = Q(s,a) + alpha[r + gamma*max Q(s',a') - Q(s,a)]. TD learning.
- Q: "PPO advantage?" → A: Clipped objective prevents large policy updates. Stable, simple, works for discrete/continuous.
- Q: "SAC vs PPO?" → A: SAC = off-policy, entropy regularization, more sample efficient. PPO = on-policy, simpler, more stable.
- Q: "Exploration vs exploitation?" → A: Explore to discover, exploit to maximize. Balance with epsilon-greedy, entropy bonus, UCB.
VAE Quick Reference¶
Architecture:
Encoder: x -> (mu, sigma)
Sample: z = mu + sigma * eps (reparameterization trick)
Decoder: z -> x_hat
ELBO Objective:
log p(x) >= E[log p(x|z)] - KL(q(z|x) || p(z))
= Reconstruction + KL regularization
Loss:
L = BCE(x, x_hat) + beta * KL(N(mu,sigma) || N(0,1))
KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
Reparameterization Trick:
Can't backprop through z ~ N(mu, sigma)
Solution: z = mu + sigma * eps, eps ~ N(0,I)
Gradients flow through mu, sigma
Posterior Collapse:
Decoder ignores z, KL -> 0, becomes regular AE
Fix: KL annealing, free bits, weaker decoder
beta-VAE:
beta > 1: More disentangled, worse reconstruction
beta = 4: Good tradeoff
One-liners:
- Q: "VAE vs Autoencoder?" → A: VAE learns distribution over latents (probabilistic), AE learns point estimates (deterministic).
- Q: "ELBO components?" → A: Reconstruction term (decoder quality) + KL term (latent close to prior N(0,1)).
- Q: "Reparameterization trick?" → A: z = mu + sigma*eps. Enables backprop through random sampling.
- Q: "Posterior collapse?" → A: Decoder ignores latent z. Fix: KL annealing, free bits, beta-VAE.
- Q: "beta-VAE effect?" → A: beta > 1 forces disentangled representations but hurts reconstruction quality.
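A minimal PyTorch sketch of the reparameterization trick and the closed-form KL term above (the helper name is illustrative):

```python
import torch

def vae_sample_and_kl(mu, log_var):
    # z = mu + sigma * eps  (reparameterization trick)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    # KL(N(mu, sigma^2) || N(0, 1)) per sample, closed form
    kl = -0.5 * torch.sum(1 + log_var - mu**2 - log_var.exp(), dim=-1)
    return z, kl

mu = torch.zeros(4, 20, requires_grad=True)
log_var = torch.zeros(4, 20, requires_grad=True)
z, kl = vae_sample_and_kl(mu, log_var)
kl.mean().backward()               # gradients flow through mu and log_var
print(z.shape, kl.mean().item())   # torch.Size([4, 20]) 0.0
```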
Dimensionality Reduction Quick Reference¶
PCA:
Linear, maximize variance preserved
Eigendecomposition of covariance matrix
Use for: Preprocessing, visualization, noise reduction
t-SNE:
Non-linear, preserve local structure
High-dim similarities -> low-dim probabilities
Perplexity: 5-50 (effective neighbors)
Run: PCA first (50 dims), then t-SNE
NOT for: Feature engineering (no inverse transform)
UMAP:
Non-linear, preserves global + local
Faster than t-SNE, supports inverse transform
n_neighbors: 5-50 (local vs global)
min_dist: 0.0-0.99 (clustering tightness)
Use for: Visualization, preprocessing, clustering
Autoencoder:
Non-linear compression
x -> encoder -> z -> decoder -> x_hat
Use for: Anomaly detection, denoising, feature learning
One-liners:
- Q: "PCA vs t-SNE vs UMAP?" → A: PCA = linear, fast, interpretable. t-SNE = non-linear, local structure, slow. UMAP = non-linear, global+local, faster than t-SNE.
- Q: "t-SNE perplexity?" → A: Effective number of neighbors. Low (5) = local clusters, High (50) = more global structure. Default 30.
- Q: "Why PCA before t-SNE?" → A: Reduce dimensions first (50), speeds up t-SNE 10-100x, removes noise.
- Q: "UMAP n_neighbors?" → A: Low (5) = local structure focus. High (50) = global structure focus. Default 15.
- Q: "t-SNE for features?" → A: No inverse transform, can't apply to new data. Use PCA or autoencoder for feature engineering.
Multi-Armed Bandits Quick Reference¶
Core Trade-off: Exploration vs Exploitation
Explore: Try new arms to learn their rewards
Exploit: Choose best-known arm to maximize reward
A/B Testing vs MAB:
A/B: Fixed allocation, statistical significance, wastes traffic
MAB: Dynamic allocation, maximize reward, minimizes regret
Algorithms:
Epsilon-Greedy: With prob epsilon explore random, else exploit best
UCB: Select arm with highest upper confidence bound
UCB = mean_reward + sqrt(2*ln(n) / n_arm)
Thompson Sampling: Sample from posterior, select highest
Bayesian approach, works well for Bernoulli rewards
When MAB over A/B:
- Maximize reward during experiment (ads, recommendations)
- Non-stationary environment (preferences change)
- Many variants (A/B slow with many arms)
- Short experiment acceptable
When A/B over MAB:
- Need statistical rigor (regulation, scientific)
- Want to learn about user behavior
- Potential negative impact from exploration
One-liners:
- Q: "MAB vs A/B testing?" → A: A/B = fixed split, statistical rigor, wastes traffic on losers. MAB = dynamic, maximizes reward, minimizes regret.
- Q: "Epsilon-Greedy vs UCB?" → A: E-Greedy = fixed exploration rate (simple). UCB = adaptive exploration based on uncertainty (no tuning).
- Q: "Thompson Sampling?" → A: Bayesian sampling from posterior. Natural exploration, works well for Bernoulli rewards.
- Q: "When to use MAB?" → A: Ads, recommendations, any scenario where you want to maximize reward while learning.
- Q: "Evaluate MAB offline?" → A: Counterfactual evaluation with IPS (Inverse Propensity Scoring), replay method.
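A minimal numpy sketch of Thompson Sampling with Beta posteriors for Bernoulli rewards (the CTRs are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = [0.03, 0.05, 0.04]               # hidden per-arm reward rates
alpha = np.ones(3)                          # Beta posterior: successes + 1
beta = np.ones(3)                           # Beta posterior: failures + 1

for _ in range(10_000):
    arm = np.argmax(rng.beta(alpha, beta))  # sample from each posterior, pick the max
    reward = rng.random() < true_ctr[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))               # posterior means concentrate on the best arm
```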
Model Drift Detection Quick Reference¶
Drift Types:
Data Drift: P(X) changes - input distribution shifts
Concept Drift: P(Y|X) changes - relationship changes
Label Drift: P(Y) changes - outcome distribution shifts
Metrics:
PSI (Population Stability Index):
PSI < 0.1: OK
0.1-0.25: Investigate
> 0.25: Action needed
KS-test: Statistical significance for continuous
Wasserstein: Robust, geometric interpretation
JS/KL: Information-theoretic divergence
Monitoring Setup:
Baselines: Training, healthy production, seasonal
Windows: Short (1h/1d), Medium (7d), Long (30d)
Slicing: Country, device, user segment
Alerting:
Warning (PSI > 0.1): Investigate within N hours
Critical (PSI > 0.25): Mitigate immediately
Persistence: Alert if N consecutive windows drift
Response Playbook:
1. Triage: Check data pipeline, recent changes, localize slice
2. Mitigate: Rollback, increase fallback, route to human
3. Investigate: Compare failures to baseline, feature-level drift
4. Resolve: Targeted labeling, retrain, calibration refresh
One-liners:
- Q: "Data vs concept drift?" → A: Data = input distribution changes (new users, devices). Concept = P(Y|X) changes (fraud patterns evolve).
- Q: "PSI interpretation?" → A: <0.1 = stable, 0.1-0.25 = investigate, >0.25 = action needed.
- Q: "Drift detected - first step?" → A: Check data integrity (pipeline, schema, nulls) before assuming model issue.
- Q: "Drift without performance drop?" → A: Benign drift - model still works. Monitor performance, not just inputs.
- Q: "LLM drift sources?" → A: Prompt changes, retrieval corpus updates, embedding model changes, tool API changes.
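A minimal numpy sketch of PSI with quantile bins built on a baseline window (bin count and data are illustrative):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    # PSI = sum (a_i - e_i) * ln(a_i / e_i) over shared bins
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch values outside the baseline range
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return np.sum((a - e) * np.log(a / e))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
drifted = rng.normal(0.5, 1, 10_000)
print(psi(baseline, baseline[:5000]), psi(baseline, drifted))  # ~0 vs clearly larger
```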
Use this sheet 30 minutes before an interview for a quick review.