Mathematics for ML: Study Materials¶
~35 minutes of reading
Prerequisites: basic math background (school-level derivatives, matrices, probability)
The section covers 41 problems across 22 topics, from calculus and linear algebra to information geometry and spectral methods. According to Nareshka statistics (2025-2026), 73% of ML math interview questions fall into 5 topics: Statistics (17 problems), Linear Algebra (4), Information Theory (4), Calculus (3) and Probability (3). The materials below cover both the fundamentals and gap fillers for senior positions.
Updated: 2026-02-11
1. Calculus (3 problems)¶
Best Resources¶
Courses: - Mathematics for Machine Learning Specialization (Imperial College London) — Linear Algebra, Multivariate Calculus, PCA - MIT 18.06SC Linear Algebra — the canonical course - 3Blue1Brown Essence of Calculus — visual intuition
Blogs: - Lil'Log - Calculus for ML — gradients, optimization - d2l.ai Chapter 2 — Calculus for Deep Learning
YouTube: - Andrej Karpathy: micrograd — backprop from scratch - StatQuest: Gradient Descent — intuitive explanation
Key Concepts¶
| Topic | Formula | Application |
|---|---|---|
| Gradient | \(\nabla f = [\frac{\partial f}{\partial x_1}, ...]\) | Direction of steepest ascent |
| Sigmoid' | \(\sigma'(x) = \sigma(x)(1-\sigma(x))\) | Backprop |
| ReLU' | \(\text{ReLU}'(x) = 1\) if \(x>0\) else \(0\) | Backprop |
| MSE gradient | \(\frac{\partial L}{\partial \hat{y}} = \frac{2}{n}(\hat{y}-y)\) | Regression |
| BCE gradient | \(\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\) | Classification |
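These derivatives come up constantly in backprop questions; a minimal NumPy sketch that checks the sigmoid derivative against a finite difference and evaluates the MSE gradient (the input values are made-up examples):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Analytic derivative from the table: sigma'(x) = sigma(x) * (1 - sigma(x))
x = 0.7
analytic = sigmoid(x) * (1 - sigmoid(x))

# Central finite-difference check
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-8)  # True

# MSE gradient w.r.t. predictions: dL/dy_hat = 2/n * (y_hat - y)
y_hat = np.array([0.2, 0.8, 0.5])
y = np.array([0.0, 1.0, 1.0])
grad_mse = 2 / len(y) * (y_hat - y)
print(grad_mse)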
2. Linear Algebra (4 problems)¶
Best Resources¶
Books: - Mathematics for Machine Learning (Deisenroth, Faisal, Ong) — free, Chapters 2-4 - Deep Learning Book - Linear Algebra (Goodfellow) — Chapter 2
Courses: - MIT 18.06 Linear Algebra — Gilbert Strang - Linear Algebra for ML (Coursera)
YouTube: - 3Blue1Brown Essence of Linear Algebra — MUST WATCH
Key Concepts¶
| Topic | Formula | Application |
|---|---|---|
| L2 norm | \(\|x\|_2 = \sqrt{\sum_i x_i^2}\) | Regularization |
| Softmax | \(\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\) | Multiclass |
| Cosine similarity | \(\cos(a,b) = \frac{a \cdot b}{\|a\|\|b\|}\) | Embeddings |
| SVD | \(X = U\Sigma V^T\) | PCA, recommender systems |
Softmax stability trick:
import numpy as np

def softmax(x):
    x_max = np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x - x_max)  # subtract the max for numerical stability
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
3. Probability (3 problems)¶
Best Resources¶
Books: - Think Stats 2e — free, hands-on approach - Probability for Machine Learning — interactive course
Blogs: - Seeing Theory — probability visualizations - Lil'Log: Probability
Key Concepts¶
| Topic | Formula | Application |
|---|---|---|
| Bayes | \(P(A\|B) = \frac{P(B\|A)P(A)}{P(B)}\) | Naive Bayes, belief updating |
| A/B test sample size | \(n = \frac{(z_{\alpha/2}+z_\beta)^2 \cdot 2p(1-p)}{\delta^2}\) | Experiments |
| Laplace smoothing | \(P(w\|c) = \frac{\text{count}(w,c)+1}{\text{count}(c)+\|V\|}\) | Naive Bayes |
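As a worked example of the A/B sample-size formula above, a minimal sketch for a two-sided test; the baseline rate p, minimum detectable effect delta, significance level and power are assumed inputs:

from scipy.stats import norm

def ab_sample_size(p, delta, alpha=0.05, power=0.8):
    """Per-group sample size: n = (z_{a/2} + z_b)^2 * 2 p (1-p) / delta^2"""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2

# e.g. baseline conversion 10%, want to detect a 2 percentage-point lift
print(round(ab_sample_size(p=0.10, delta=0.02)))  # ~3532 per group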
4. Statistics (17 problems)¶
Best Resources¶
Courses: - Harvard Stats 110 — Probability & Statistics (Joe Blitzstein) - Khan Academy Statistics
Books: - Practical Statistics for Data Scientists — practical focus - ISLR Chapters 2-3 — Statistical Learning
Blogs: - StatQuest (Josh Starmer) — MUST WATCH for intuition - Seeing Theory — visualizations
Key Concepts¶
| Topic | Formula | Interview importance |
|---|---|---|
| CLT | \(\bar{X} \sim N(\mu, \sigma^2/n)\) | HIGH |
| p-value | \(P(D\|H_0)\) | HIGH |
| t-statistic | \(t = \frac{\bar{x}-\mu}{s/\sqrt{n}}\) | HIGH |
| Bootstrap CI | \([\hat{\theta}_{\alpha/2}, \hat{\theta}_{1-\alpha/2}]\) | MEDIUM |
| VIF | \(\text{VIF}_j = \frac{1}{1-R_j^2}\) | MEDIUM |
| KS test | \(D = \sup_x \|F_n(x)-F(x)\|\) | LOW |
Bootstrap algorithm:
import numpy as np

def bootstrap_ci(data, stat_fn, n_bootstrap=10000, alpha=0.05):
stats = [stat_fn(np.random.choice(data, len(data), replace=True))
for _ in range(n_bootstrap)]
lower = np.percentile(stats, 100 * alpha/2)
upper = np.percentile(stats, 100 * (1-alpha/2))
return lower, upper
5. Information Theory (4 problems)¶
Best Resources¶
Books: - Information Theory, Inference, and Learning Algorithms (MacKay) — free - Elements of Information Theory (Cover & Thomas)
Blogs: - Lil'Log: Information Theory - Count Bayesie: Entropy
Key Concepts¶
| Topic | Formula | Application |
|---|---|---|
| Entropy | \(H(X) = -\sum p_i \log_2 p_i\) | Decision trees |
| Cross-Entropy | \(H(p,q) = -\sum p_i \log q_i\) | Loss function |
| KL Divergence | \(D_{KL}(P\|Q) = \sum p_i \log \frac{p_i}{q_i}\) | VAE, distribution matching |
| Information Gain | \(IG(S,A) = H(S) - \sum \frac{\lvert S_v \rvert}{\lvert S \rvert}H(S_v)\) | Decision trees |
| Gini | \(G = 1 - \sum p_k^2\) | Random Forest |
Entropy vs Gini:
Entropy: more sensitive to the class distribution, uses a log scale
Gini: computationally cheaper (no log), the sklearn default
In practice the difference is minimal
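A minimal sketch comparing the two impurity measures on a class-probability vector (probabilities are assumed to be given directly):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # avoid log(0)
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0, 0.5  (maximum impurity for 2 classes)
print(entropy([0.9, 0.1]), gini([0.9, 0.1]))  # ~0.47, 0.18 (purer node, both measures drop)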
6. Classification Metrics (2 problems)¶
Best Resources¶
Blogs: - Google ML Crash Course: Classification - Scikit-learn: Metrics
YouTube: - StatQuest: ROC & AUC
Key Concepts¶
| Metric | Formula | When to use |
|---|---|---|
| Precision | \(\frac{TP}{TP+FP}\) | False positives are costly |
| Recall | \(\frac{TP}{TP+FN}\) | False negatives are costly |
| F1 | \(\frac{2 \cdot P \cdot R}{P+R}\) | Balance of P and R |
| ROC-AUC | Area under TPR/FPR curve | Ranking quality |
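A minimal scikit-learn sketch computing these metrics; the label and score arrays are made-up examples:

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard labels -> precision/recall/F1
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2, 0.9, 0.3]   # probabilities -> ROC-AUC

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))   # threshold-free ranking quality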
7. Regularization (3 problems)¶
Best Resources¶
Blogs: - Regularization in ML - Dropout Paper
YouTube: - StatQuest: Ridge & Lasso
Key Concepts¶
| Method | Effect | When to use |
|---|---|---|
| L1 (Lasso) | Sparsity, feature selection | Many irrelevant features |
| L2 (Ridge) | Shrinkage, small weights | Multicollinearity |
| ElasticNet | L1 + L2 | When you want both |
| Dropout | Random deactivation | Deep learning, overfitting |
| BatchNorm | Normalize activations | Faster training, regularization |
L1 vs L2 geometry: the L1 constraint region is a diamond whose corners lie on the coordinate axes, so the loss contours often touch it at a corner and some weights become exactly zero (sparsity); the L2 region is a circle, so weights are shrunk smoothly but rarely reach exactly zero. A quick empirical check is sketched below.
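A minimal scikit-learn sketch of that effect on synthetic data (the alpha values are arbitrary illustration choices):

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# 50 features, only 5 of them informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # most weights exactly 0
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0: shrunk, not zeroed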
8. Ensemble Methods (2 problems)¶
Best Resources¶
Books: - ESL Chapters 8-10 — Boosting, Trees - XGBoost Documentation
Blogs: - Gradient Boosting from scratch - CatBoost vs XGBoost vs LightGBM
Key Concepts¶
| Method | Key Idea | Bias-Variance |
|---|---|---|
| Bagging | Parallel, bootstrap samples | Reduces variance |
| Boosting | Sequential, fit residuals | Reduces bias |
| Random Forest | Bagging + feature randomization | Low variance |
| GBDT | Gradient descent in function space | Low bias |
RF vs GBDT:
RF: trees built in parallel, robust to overfitting, simple to tune
GBDT: trees built sequentially, usually stronger, needs careful tuning
In practice: start with RF, then move to XGBoost/LightGBM
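A minimal scikit-learn sketch contrasting the two families on the same synthetic data (near-default hyperparameters; scores will differ on real datasets):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)        # bagging + feature randomization
gbdt = GradientBoostingClassifier(n_estimators=200, random_state=0)  # sequential, fits residuals

print("RF  :", cross_val_score(rf, X, y, cv=5).mean())
print("GBDT:", cross_val_score(gbdt, X, y, cv=5).mean())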
9. NumPy & Pandas (3 problems)¶
Best Resources¶
Documentation: - NumPy User Guide - Pandas User Guide
Cheat Sheets: - NumPy Cheat Sheet - Pandas Cheat Sheet
Key Concepts¶
| Operation | Example | Keep in mind |
|---|---|---|
| loc vs iloc | df.loc['A':'B'] vs df.iloc[0:2] | loc is inclusive, iloc is exclusive |
| groupby.agg | df.groupby('cat').agg({'val': ['mean', 'std']}) | Multiple aggregations |
| merge | pd.merge(df1, df2, on='key', how='left') | SQL-style JOINs |
| transform | df.groupby('cat')['val'].transform('mean') | Same shape as input |
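A minimal pandas sketch of the operations above on a toy DataFrame:

import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b", "b"],
                   "val": [1.0, 3.0, 2.0, 4.0]},
                  index=["A", "B", "C", "D"])

print(df.loc["A":"B"])   # label-based, end-INCLUSIVE -> rows A and B
print(df.iloc[0:2])      # position-based, end-exclusive -> rows 0 and 1

agg = df.groupby("cat").agg({"val": ["mean", "std"]})          # one row per group
df["cat_mean"] = df.groupby("cat")["val"].transform("mean")    # same length as df
print(agg)
print(df)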
10. LLM Scaling Laws (Gap Filler 2026)¶
Best Resources¶
Papers: - Scaling Laws for Neural Language Models (Kaplan et al., 2020) — the first scaling laws - Training Compute-Optimal Large Language Models (Chinchilla, Hoffmann et al., 2022) — compute-optimal - Beyond Chinchilla (Porian et al., 2025) — includes inference costs
Blogs: - LLM Scaling Laws 2026: Complete Guide — AIMultiple (Feb 2026) - The Chinchilla Papers Implications — LessWrong
Key Formulas¶
Kaplan (2020): \(L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}\)
Loss decreases as a power law with model size.
Chinchilla (2022): \(L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E\)
Where: - \(N\) = number of parameters - \(D\) = training tokens - \(\alpha \approx 0.34, \beta \approx 0.28, E \approx 1.69\)
Compute-optimal: \(N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}\)
Rule of thumb: ~20 tokens per parameter for compute-optimal training.
Evolution 2020-2026¶
| Year | Paper | Key Insight |
|---|---|---|
| 2020 | Kaplan | Power-law scaling with model size |
| 2022 | Chinchilla | Compute-optimal: balance N and D |
| 2024 | Llama 3 | Overtrain small models (15T tokens on 8B) |
| 2025 | Beyond Chinchilla | Include inference cost in optimization |
| 2025 | Sloth | Latent skills for benchmark prediction |
| 2025 | Densing Law | Capability density rising exponentially |
Practical Implications¶
# Chinchilla-optimal tokens for model size
def compute_optimal_tokens(params_billions: float) -> float:
"""Calculate Chinchilla-optimal training tokens"""
# Rule: 20 tokens per parameter
return params_billions * 1e9 * 20
# Example: 7B model
tokens_7b = compute_optimal_tokens(7) # 140B tokens
# But Llama 3 trained 15T tokens on 8B = 1875 tokens/param
# Why? Inference cost dominates at scale
Interview Questions:
1. "Why did Llama 3 overtrain an 8B model with 15T tokens instead of the Chinchilla-optimal ~160B?"
2. "How does inference cost change the optimal training strategy?"
3. "What is Densing Law and why does it matter for model selection?"
11. VC Dimension & PAC Learning (Gap Filler 2026)¶
Best Resources¶
Books: - Understanding Machine Learning (Shalev-Shwartz & Ben-David) — free - Foundations of Machine Learning (Mohri) — Bloomberg course
Blogs: - PAC Theory and VC Dimension — Cole Mei (2025) - VC Dimension Deep Dive — Tang Shusen
Key Concepts¶
PAC Learning (Probably Approximately Correct): \(\Pr\left[ |E_{in}(g) - E_{out}(g)| > \epsilon \right] \leq \delta\)
Where: - \(E_{in}(g)\) = training error (empirical risk) - \(E_{out}(g)\) = test error (true risk) - \(\epsilon\) = accuracy parameter - \(\delta\) = confidence parameter
VC Dimension Definition: - The maximum number of points that can be shattered by hypothesis class \(\mathcal{H}\) - "Shattered" = all \(2^n\) labelings possible - For linear classifiers in \(\mathbb{R}^d\): VC dimension = \(d + 1\)
Generalization Bound: \(E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{8}{N} \ln \left( \frac{4(2N)^{VC(\mathcal{H})}}{\delta} \right)}\)
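A minimal sketch that evaluates this bound numerically to see how the gap shrinks with sample size (the VC dimension, delta, and sample sizes are arbitrary illustration values):

import numpy as np

def vc_generalization_gap(n, vc_dim, delta=0.05):
    """Upper bound on E_out - E_in from the VC bound above."""
    return np.sqrt(8.0 / n * np.log(4.0 * (2.0 * n) ** vc_dim / delta))

for n in [1_000, 10_000, 100_000]:
    print(n, round(vc_generalization_gap(n, vc_dim=10), 3))
# The gap shrinks roughly like sqrt(vc_dim * log(n) / n)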
Trade-off Diagram¶
graph LR
subgraph "VC Dimension Trade-off"
LOW["Low VC dim<br/>High E_in<br/>Low Omega<br/>(underfitting)"]
OPT["Optimal<br/>Balance E_in<br/>and Omega"]
HIGH["High VC dim<br/>Low E_in<br/>High Omega<br/>(overfitting)"]
end
LOW --> OPT --> HIGH
style LOW fill:#e8eaf6,stroke:#3f51b5
style OPT fill:#e8f5e9,stroke:#4caf50
style HIGH fill:#fce4ec,stroke:#c62828
- Low VC dim: high \(E_{in}\), low \(\Omega\) (underfitting)
- High VC dim: low \(E_{in}\), high \(\Omega\) (overfitting)
- Optimal: balance \(E_{in}\) and \(\Omega\)
Interview Questions¶
Q: What is the VC dimension of a decision tree?
A: Theoretically infinite (can shatter any dataset with enough depth). In practice, bounded by number of leaves.
Q: Why is VC dimension useful?
A: It provides theoretical guarantees on generalization without knowing the true distribution.
Q: What's the VC dimension of a linear classifier in 2D?
A: 3 (can shatter 3 non-collinear points, but not 4 in general position).
12. Causal Inference Basics (Gap Filler 2026)¶
Best Resources¶
Books: - Causal Inference: The Mixtape (Cunningham) — free - Elements of Causal Inference (Peters et al.)
Blogs: - Causal Inference Interview Questions — InterviewGemini (2025)
Key Concepts¶
Association vs Causation: - Association: \(P(Y|X)\) — statistical relationship - Causation: \(P(Y|do(X))\) — intervention changes distribution
Potential Outcomes Framework: \(\tau_i = Y_i(1) - Y_i(0)\)
Where: - \(Y_i(1)\) = outcome if treated - \(Y_i(0)\) = outcome if not treated - We only observe one!
ATE (Average Treatment Effect): \(\text{ATE} = \mathbb{E}[Y(1) - Y(0)]\)
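A minimal simulation sketch of the potential-outcomes setup: both potential outcomes are generated (possible only in simulation), treatment is randomized, and the difference in means recovers the ATE (all numbers are made up):

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

y0 = rng.normal(0.0, 1.0, n)         # outcome without treatment
y1 = y0 + 2.0                        # outcome with treatment (true effect = 2)
true_ate = np.mean(y1 - y0)

t = rng.integers(0, 2, n)            # randomized treatment assignment
y_obs = np.where(t == 1, y1, y0)     # in reality we only observe one potential outcome

est_ate = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(true_ate, round(est_ate, 3))   # difference in means ≈ 2 under randomization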
Confounding & DAGs¶
Confounding: Third variable affects both treatment and outcome.
DAG (Directed Acyclic Graph): Visual representation of causal relationships.
Methods for Observational Studies¶
| Method | Key Idea | When to Use |
|---|---|---|
| Propensity Score Matching | Match similar treated/untreated | Selection on observables |
| Regression Discontinuity | Exploit cutoff in treatment | Sharp threshold exists |
| Instrumental Variables | Use exogenous variation | Hidden confounders |
| Difference-in-Differences | Compare trends over time | Panel data available |
Interview Questions¶
Q: Explain the difference between RCTs and observational studies.
A: RCTs randomize treatment → no confounding by design. Observational studies observe natural treatment assignment → must control for confounders.
Q: What is SUTVA?
A: Stable Unit Treatment Value Assumption — treatment of one unit doesn't affect outcomes of other units. Violated in network effects, spillovers.
Q: What are the key assumptions for causal inference?
A: (1) Consistency, (2) Ignorability (no unmeasured confounding), (3) Positivity (overlap), (4) SUTVA.
Q: How do you assess validity of causal claims?
A: (1) Temporal precedence (cause before effect), (2) Covariation, (3) No spurious correlation, (4) Plausible mechanism, (5) Sensitivity analysis.
13. MCMC Sampling Methods (Gap Filler 2026)¶
Best Resources¶
Books: - Bayesian Data Analysis (Gelman et al.) — Chapter 11 - MCMC Handbook — free
Blogs: - Mastering MCMC Methods — NumberAnalytics - MCMC for ML — Lilian Weng
Key Algorithms¶
| Algorithm | Key Idea | When to Use |
|---|---|---|
| Metropolis-Hastings | Propose → Accept/Reject | General purpose |
| Gibbs Sampling | Sample each variable conditionally | Known conditionals |
| Hamiltonian MC | Use gradients for efficient proposals | High-dimensional |
| Slice Sampling | Sample under density curve | No tuning needed |
Metropolis-Hastings Algorithm¶
import numpy as np

def metropolis_hastings(target_pdf, proposal_fn, n_samples):
    """Simple sampler; assumes a symmetric proposal, so the q-ratio cancels"""
    samples = []
    theta = proposal_fn.initial()
    for _ in range(n_samples):
        theta_prop = proposal_fn.sample(theta)
        # Acceptance ratio; with a symmetric proposal q(theta|theta')/q(theta'|theta) = 1
        alpha = min(1, target_pdf(theta_prop) / target_pdf(theta))
if np.random.random() < alpha:
theta = theta_prop # Accept
samples.append(theta)
return samples
Acceptance Probability: \(\alpha = \min\left(1, \frac{\pi(\theta') q(\theta|\theta')}{\pi(\theta) q(\theta'|\theta)}\right)\)
Convergence Diagnostics¶
| Diagnostic | What to Check | Good Sign |
|---|---|---|
| Trace Plot | Stationary around mean | No drift, good mixing |
| ACF | Autocorrelation decay | Quick decay |
| ESS | Effective Sample Size | ESS > 400 |
| \(\hat{R}\) | Gelman-Rubin statistic | \(\hat{R} < 1.1\) |
Effective Sample Size: \(ESS = \frac{N}{1 + 2\sum_{lag=1}^{\infty} \rho(lag)}\)
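A minimal sketch of estimating ESS from a chain's sample autocorrelations, truncating the sum at the first non-positive term (a simple heuristic; libraries such as ArviZ provide more robust estimators):

import numpy as np

def effective_sample_size(chain):
    """ESS = N / (1 + 2 * sum of positive-lag autocorrelations)"""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.var(x) * n)  # rho(0), rho(1), ...
    rho_sum = 0.0
    for rho in acf[1:]:
        if rho <= 0:          # truncate at the first non-positive autocorrelation
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

# Example: AR(1) chain with strong autocorrelation -> ESS much smaller than N
rng = np.random.default_rng(0)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = 0.9 * x[t - 1] + rng.normal()
print(len(x), round(effective_sample_size(x)))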
Interview Questions¶
Q: Why use MCMC instead of analytical solutions?
A: Many posteriors are intractable (no closed form). MCMC samples without computing the normalizing constant.
Q: What's a good acceptance rate for MH?
A: 20-40% for random walk proposals. Too high = slow exploration, too low = wasted computation.
14. Time Series Statistics (Gap Filler 2026)¶
Best Resources¶
Books: - Time Series Analysis and Its Applications (Shumway & Stoffer) — free - Forecasting: Principles and Practice (Hyndman) — free online
Blogs: - Quant Interview FAQ: Time Series — BagelQuant (2025)
Stationarity¶
Weak Stationarity: \(E[y_t] = \mu, \quad Var(y_t) = \sigma^2, \quad Cov(y_t, y_{t-k}) = \gamma_k\)
Tests:
| Test | H₀ | H₁ | Stationary if |
|---|---|---|---|
| ADF | Unit root | Stationary | p < 0.05 |
| KPSS | Stationary | Unit root | p > 0.05 |
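A minimal statsmodels sketch running both tests on a simulated random walk:

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=500))       # random walk: non-stationary

adf_stat, adf_p, *_ = adfuller(y)
kpss_stat, kpss_p, *_ = kpss(y, regression="c", nlags="auto")

print(f"ADF  p-value: {adf_p:.3f}")    # large p -> cannot reject unit root
print(f"KPSS p-value: {kpss_p:.3f}")   # small p -> reject stationarity

# First differencing usually makes the series stationary
dy = np.diff(y)
print(f"ADF on diff: {adfuller(dy)[1]:.3f}")  # small p -> stationary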
ARIMA Model¶
ARIMA(p, d, q): - \(p\) = AR lags (autoregressive) - \(d\) = differencing order (integration) - \(q\) = MA lags (moving average)
Model equations: \(\text{AR(p): } y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t\)
ACF/PACF for Model Selection¶
| Pattern | Indicates | Order |
|---|---|---|
| ACF cuts off after lag q | MA(q) | q lags |
| PACF cuts off after lag p | AR(p) | p lags |
| Both tail off | ARMA | Use AIC/BIC |
Key Interview Questions¶
Q: How do you make a non-stationary series stationary?
A: (1) Differencing (once or more), (2) Log transformation (stabilize variance), (3) Detrending.
Q: What is cointegration?
A: Two non-stationary series are cointegrated if a linear combination is stationary. Basis for pairs trading.
Q: AR vs MA vs ARMA?
A: AR models persistence in levels. MA models shock propagation. ARMA combines both.
Common Pitfalls¶
- Overfitting with excessive lags
- Ignoring non-stationarity
- Misinterpreting correlation as causation
- Not using robust standard errors
- Ignoring structural breaks
15. Matrix Calculus: Jacobian & Hessian (Gap Filler 2026)¶
Best Resources¶
Tutorials: - DataCamp: What is the Hessian Matrix — comprehensive guide with Python - GeeksforGeeks: Jacobian and Hessian Matrix — formulas and applications
Jacobian Matrix¶
Definition: For \(f: \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian is: \(J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\)
Example: \(f(x,y) = [x^2 + y, xy]\): \(J = \begin{bmatrix} 2x & 1 \\ y & x \end{bmatrix}\)
Hessian Matrix¶
Definition: For \(f: \mathbb{R}^n \to \mathbb{R}\), the Hessian is: \(H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}\)
Critical Point Classification:
| Hessian Eigenvalues | Type | Behavior |
|---|---|---|
| All positive | Minimum | Convex locally |
| All negative | Maximum | Concave locally |
| Mixed signs | Saddle | Neither |
| Some zero | Inconclusive | Need higher-order |
Second Derivative Test: \(D = \begin{vmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{vmatrix}\)
- \(D > 0, f_{xx} > 0\): Local minimum
- \(D > 0, f_{xx} < 0\): Local maximum
- \(D < 0\): Saddle point
Comparison Table¶
| Concept | Input → Output | Matrix Shape | Use Case |
|---|---|---|---|
| Gradient | \(\mathbb{R}^n \to \mathbb{R}\) | \(n \times 1\) | First-order optimization |
| Jacobian | \(\mathbb{R}^n \to \mathbb{R}^m\) | \(m \times n\) | Backprop, change of variables |
| Hessian | \(\mathbb{R}^n \to \mathbb{R}\) | \(n \times n\) | Second-order optimization |
Trace Tricks for Matrix Derivatives¶
Example Interview Question:
Q: Derive \(\nabla_X \text{tr}(X^TAX)\)
A: Using trace properties and symmetry: $\(\nabla_X \text{tr}(X^TAX) = AX + A^TX\)$
If \(A\) is symmetric: \(\nabla_X = 2AX\)
Applications in Neural Networks¶
Hessian in optimization: - Newton's method: \(\theta_{t+1} = \theta_t - H^{-1}\nabla L\) - In practice: Hessian too expensive, use approximations (BFGS, L-BFGS)
Jacobian in backprop: - Chain rule through layers: \(\frac{\partial L}{\partial W} = J_{output} \cdot J_{hidden} \cdot ...\)
Python Implementation¶
import numpy as np
from scipy.optimize import approx_fprime
def numerical_jacobian(f, x, eps=1e-8):
"""Compute Jacobian numerically"""
n, m = len(x), len(f(x))
J = np.zeros((m, n))
for i in range(n):
x_plus = x.copy()
x_plus[i] += eps
J[:, i] = (f(x_plus) - f(x)) / eps
return J
def numerical_hessian(f, x, eps=1e-5):
"""Compute Hessian numerically"""
n = len(x)
H = np.zeros((n, n))
fx = f(x)
for i in range(n):
for j in range(n):
x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps**2)
return H
# Example
def f(x):
return x[0]**2 + x[1]**2 # Simple quadratic
x = np.array([1.0, 2.0])
H = numerical_hessian(f, x)
print("Hessian:", H)
# [[2. 0.]
# [0. 2.]] → Positive definite (minimum at origin)
Key Interview Questions¶
Q: Why is Newton's method rarely used in deep learning?
A: (1) Hessian is \(O(n^2)\) memory for \(n\) parameters, (2) Inverting Hessian is \(O(n^3)\), (3) For large networks (millions of params), this is intractable.
Q: What's the relationship between Hessian and conditioning?
A: The condition number of the Hessian (ratio of largest to smallest eigenvalue) determines optimization difficulty. High condition number = ill-conditioned = slow convergence.
Q: How does Jacobian relate to backpropagation?
A: Backprop computes vector-Jacobian products (VJP) efficiently: \(\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot J\). PyTorch/TF compute VJPs without materializing full Jacobians.
16. Numerical Stability (Gap Filler 2026)¶
Best Resources¶
Tutorials: - Mastering Softmax with Numerical Stability — Best AI Tools - Gradient Clipping in PyTorch — GeeksforGeeks
Log-Sum-Exp Trick¶
Problem: Computing \(\log(\sum_i e^{x_i})\) directly causes overflow/underflow.
Solution: \(\log\left(\sum_i e^{x_i}\right) = c + \log\left(\sum_i e^{x_i - c}\right) \text{ where } c = \max_i x_i\)
Why it works: - \(c\) shifts all values so maximum becomes 0 - \(e^{x_i - c} \leq 1\) → no overflow - At least one term = \(e^0 = 1\) → sum \(\geq 1\) → no underflow
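A minimal sketch of the trick, checked against scipy.special.logsumexp:

import numpy as np
from scipy.special import logsumexp

x = np.array([1000.0, 1001.0, 1002.0])

# Naive: exp overflows, result is inf
naive = np.log(np.sum(np.exp(x)))           # RuntimeWarning: overflow

# Log-sum-exp trick
c = np.max(x)
stable = c + np.log(np.sum(np.exp(x - c)))  # finite, ~1002.41

print(naive, stable, logsumexp(x))          # inf, 1002.407..., 1002.407...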
Softmax Numerical Stability¶
Naive (unstable):
def softmax_naive(x):
    exp_x = np.exp(x)  # overflows for large inputs, e.g. x = 1000
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
Stable version:
def softmax_stable(x):
x_shifted = x - np.max(x, axis=-1, keepdims=True)
exp_x = np.exp(x_shifted)
return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
Gradient Clipping Methods¶
| Method | PyTorch Function | What it does |
|---|---|---|
| Clip by Value | clip_grad_value_(params, clip_value) | Clips each gradient to \([-v, v]\) |
| Clip by Norm | clip_grad_norm_(params, max_norm) | Scales gradient vector to max norm |
| Clip by Hook | register_hook(lambda g: torch.clamp(g, min, max)) | Custom clipping logic |
When to use which: - Clip by norm: Standard for RNNs/Transformers (preserves direction) - Clip by value: When you need component-wise limits - Clip by hook: Asymmetric clipping, layer-specific
Gradient Clipping Code¶
import torch
import torch.nn as nn
# Method 1: Clip by norm (most common)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
# Method 2: Clip by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
# Method 3: Custom hook for asymmetric clipping
for p in model.parameters():
p.register_hook(lambda grad: torch.clamp(grad, -0.1, 1.0))
Common Numerical Issues¶
| Issue | Cause | Solution |
|---|---|---|
| NaN in softmax | Overflow from large inputs | Log-sum-exp trick |
| Exploding gradients | Deep networks, RNNs | Gradient clipping |
| Vanishing gradients | Sigmoid/tanh in deep nets | ReLU, residual connections |
| Division by zero | Sum of exponentials \(\approx 0\) | Add epsilon (\(10^{-8}\)) |
| Log(0) | Computing \(\log(0)\) | Use np.log(x + eps) or np.logaddexp |
FP16/BF16 Considerations¶
Mixed precision training:
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for data, target in dataloader:
optimizer.zero_grad()
with autocast():
output = model(data)
loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.unscale_(optimizer) # Unscale before clipping!
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
Key Interview Questions¶
Q: Why subtract max before softmax instead of just using float64?
A: Float64 only delays overflow. Even with 64-bit, \(e^{1000}\) overflows. The log-sum-exp trick is mathematically exact and works with any precision.
Q: Clip by value vs clip by norm — when to use which?
A: Clip by norm preserves gradient direction (preferred for optimization). Clip by value can change direction, useful when certain parameters shouldn't exceed a threshold regardless of others.
Q: How does gradient clipping affect convergence?
A: Clipping acts as a regularizer. Too aggressive clipping → slow learning. Too weak → instability. Typical values: max_norm=1.0 for transformers, 5.0 for RNNs.
Q: What's the difference between exploding and vanishing gradients?
A: Exploding: gradients grow exponentially through layers → NaN weights. Vanishing: gradients shrink to near-zero → no learning in early layers. Solutions differ: clipping for exploding, ReLU/residuals/architecture changes for vanishing.
17. Dimensionality Reduction: t-SNE & UMAP (Gap Filler 2026)¶
Best Resources¶
Tutorials: - PCA vs t-SNE vs UMAP — Medium (May 2025) - How to Use t-SNE Effectively — Distill.pub (canonical) - UMAP Original Paper — McInnes et al. (2018)
Comparison Table¶
| Feature | PCA | t-SNE | UMAP |
|---|---|---|---|
| Type | Linear | Non-linear | Non-linear |
| Preserves | Global structure | Local neighborhoods | Local + some global |
| Speed | Fast | Slow | Fast |
| Scalability | Excellent | Poor (\(O(n^2)\)) | Good (\(O(n \log n)\)) |
| Use in ML Models | ✅ Yes | ❌ No (viz only) | ✅ Yes |
| Inverse Transform | ✅ Yes | ❌ No | ❌ No |
| Interpretability | ✅ High | ❌ Low | ⚠️ Medium |
| Deterministic | ✅ Yes | ❌ No (random init) | ❌ No |
t-SNE (t-Distributed Stochastic Neighbor Embedding)¶
How it works: 1. Compute pairwise similarities in high-dim space (Gaussian) 2. Model similarities in low-dim space (Student-t with heavy tails) 3. Minimize KL divergence between both distributions
Key hyperparameters:
- perplexity (5-50): Effective number of neighbors. Lower = more local structure.
- n_iter (250-1000): More iterations = more stable.
- learning_rate (10-1000): Too high = points cluster in ball; too low = compressed cloud.
When to use t-SNE: - Cluster visualization (CNN embeddings, single-cell RNA) - Exploratory data analysis - Small to medium datasets (< 50k samples)
UMAP (Uniform Manifold Approximation and Projection)¶
How it works: 1. Build fuzzy simplicial set (graph) in high-dim space 2. Build similar graph in low-dim space 3. Minimize cross-entropy between fuzzy set representations
Key hyperparameters:
- n_neighbors (5-200): Local vs global structure balance
- min_dist (0.0-0.99): How tightly points cluster
- metric: Distance metric (euclidean, cosine, etc.)
When to use UMAP: - Large datasets (> 50k samples) - When you need speed - When global structure matters - Feature extraction before ML
Python Implementation¶
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt
# Load and scale data
X = StandardScaler().fit_transform(load_iris().data)
y = load_iris().target
# PCA (fast, linear)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"PCA explained variance: {pca.explained_variance_ratio_.sum():.2%}")
# t-SNE (slow, non-linear, visualization)
tsne = TSNE(n_components=2, perplexity=30, n_iter=300, random_state=42)
X_tsne = tsne.fit_transform(X)
# UMAP (fast, non-linear, good for ML)
umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = umap_model.fit_transform(X)
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, data, title in [(axes[0], X_pca, 'PCA'),
(axes[1], X_tsne, 't-SNE'),
(axes[2], X_umap, 'UMAP')]:
ax.scatter(data[:, 0], data[:, 1], c=y, cmap='viridis', alpha=0.7)
ax.set_title(title)
Decision Guide¶
Need dimensionality reduction?
├── For ML model input ➝ PCA or UMAP
│ ├── Interpretability needed? ➝ PCA
│ └── Non-linear patterns? ➝ UMAP
├── For visualization?
│ ├── Small dataset (<50k) ➝ t-SNE
│ └── Large dataset (>50k) ➝ UMAP
└── Need inverse transform? ➝ PCA only
Key Interview Questions¶
Q: Why does t-SNE use Student-t distribution in low-dim space?
A: Heavy tails of Student-t alleviate "crowding problem" — in high dimensions, there's more "room" for points. Gaussian in low-dim would compress all points together. Heavy tails push dissimilar points apart.
Q: What's the perplexity parameter in t-SNE?
A: It's a smooth measure of effective neighbors. Perplexity=30 means each point pays attention to ~30 neighbors. Low perplexity = local structure; high = more global.
Q: Why can't t-SNE be used for ML preprocessing?
A: (1) No inverse transform, (2) No out-of-sample extension (can't transform new data), (3) Stochastic results vary between runs, (4) \(O(n^2)\) complexity.
Q: UMAP vs t-SNE — when to use which?
A: UMAP is faster, scales better, and can transform new data. Use t-SNE only for small dataset visualization where interpretability of local clusters is critical. Use UMAP for everything else.
Q: Can t-SNE/UMAP distances be interpreted?
A: No! Only relative ordering matters. Cluster sizes and distances between clusters in t-SNE/UMAP plots are meaningless. Use PCA for interpretable distances.
Common Pitfalls¶
- Interpreting cluster sizes: Larger clusters in t-SNE ≠ more data
- Interpreting distances: Gap between clusters ≠ dissimilarity
- Random seed sensitivity: Different seeds = different plots
- Not scaling data: Always StandardScaler before t-SNE/UMAP
- Wrong perplexity: Rule of thumb: perplexity < \(\sqrt{n}\)
18. Constrained Optimization: Lagrange & KKT (Gap Filler 2026)¶
Best Resources¶
Books: - Convex Optimization (Boyd & Vandenberghe) — canonical text - Stanford CS229 Lecture Notes — SVM and optimization
Tutorials: - KKT Conditions: Ultimate Guide — NumberAnalytics - Convex Optimization and Lagrange Multipliers — CircuitLabs (2025)
General Optimization Problem¶
\(\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0,\ i = 1, \dots, m, \qquad h_j(x) = 0,\ j = 1, \dots, p\)
Lagrange Multipliers (Equality Constraints)¶
For problem with only equality constraints \(h_j(x) = 0\):
Lagrangian: \(L(x, \mu) = f(x) + \sum_{j=1}^{p} \mu_j h_j(x)\)
Optimality condition: \(\nabla_x L(x^*, \mu^*) = 0\)
Intuition: At optimum, gradient of \(f\) is a linear combination of constraint gradients.
KKT Conditions (Inequality + Equality)¶
For general problem with both constraints, \(x^*\) is optimal iff:
| Condition | Formula | Meaning |
|---|---|---|
| Primal feasibility | \(g_i(x^*) \leq 0\), \(h_j(x^*) = 0\) | Point satisfies all constraints |
| Dual feasibility | \(\lambda_i \geq 0\) | Multipliers non-negative |
| Stationarity | \(\nabla f + \sum \lambda_i \nabla g_i + \sum \mu_j \nabla h_j = 0\) | Gradient of Lagrangian = 0 |
| Complementary slackness | \(\lambda_i g_i(x^*) = 0\) | Active constraints only |
Full Lagrangian: \(L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \mu_j h_j(x)\)
Complementary Slackness Intuition¶
- If constraint is inactive (\(g_i(x^*) < 0\)), then \(\lambda_i = 0\)
- If constraint is active (\(g_i(x^*) = 0\)), then \(\lambda_i \geq 0\)
- Constraint "pushes back" with force \(\lambda_i\)
SVM Dual Formulation¶
Primal (hard margin): \(\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1\)
Lagrangian: \(L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i [y_i(w^T x_i + b) - 1]\)
Dual: \(\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j\)
Why dual?
1. Kernel trick: replace \(x_i^T x_j\) with \(K(x_i, x_j)\)
2. The solution is sparse: most \(\alpha_i = 0\) (support vectors)
Python Implementation¶
import numpy as np
from scipy.optimize import minimize
# Example: Minimize x^2 + y^2 subject to x + y = 1
def objective(x):
return x[0]**2 + x[1]**2
def eq_constraint(x):
return x[0] + x[1] - 1 # x + y = 1
# Solve with scipy
result = minimize(
objective,
x0=[0.5, 0.5],
constraints={'type': 'eq', 'fun': eq_constraint}
)
print(f"Optimal: x={result.x}, f={result.fun}")
# x=[0.5, 0.5], f=0.5
# Verify with Lagrange multiplier
# L = x^2 + y^2 + λ(x + y - 1)
# ∂L/∂x = 2x + λ = 0 → x = -λ/2
# ∂L/∂y = 2y + λ = 0 → y = -λ/2
# Constraint: -λ/2 - λ/2 = 1 → λ = -1
# Solution: x = y = 0.5 ✓
Convexity Matters¶
| Problem Type | Local = Global? | KKT Conditions |
|---|---|---|
| Convex | ✅ Yes | Necessary AND sufficient |
| Non-convex | ❌ No | Necessary only |
Convex problem requires: - Objective \(f(x)\) is convex - Inequality constraints \(g_i(x)\) are convex - Equality constraints \(h_j(x)\) are affine
Key Interview Questions¶
Q: What's the geometric interpretation of Lagrange multipliers?
A: At optimum, gradient of objective is perpendicular to constraint surface. \(\nabla f = -\sum \lambda_i \nabla g_i\) means \(\nabla f\) is in the span of constraint gradients.
Q: When are KKT conditions sufficient (not just necessary)?
A: When the problem is convex: convex objective, convex inequality constraints, affine equality constraints. Then KKT conditions guarantee global optimum.
Q: What does complementary slackness tell us in SVM?
A: \(\alpha_i (y_i(w^Tx_i + b) - 1) = 0\). Either \(\alpha_i = 0\) (point not on margin) or \(y_i(w^Tx_i + b) = 1\) (point on margin = support vector). This is why SVM has sparse solution.
Q: Why formulate SVM as a constrained optimization problem?
A: (1) Margin maximization is naturally a constrained problem, (2) Dual formulation enables kernel trick, (3) Guarantees global optimum (convex), (4) Solution is sparse (only support vectors matter).
Q: What's the difference between primal and dual problem?
A: Primal: optimize over original variables. Dual: optimize over Lagrange multipliers. For convex problems, strong duality holds (optimal values equal). Dual often easier (kernel trick in SVM).
19. Multiple Testing Correction (Gap Filler 2026)¶
Best Resources¶
Blogs: - Multiple Comparison Corrections in A/B Testing — Statsig (Oct 2025) - Benjamini-Hochberg Procedure — GeeksforGeeks (2025)
Books: - Multiple Comparisons (Hochberg & Tamhane) - Statistical Inference (Casella & Berger) — Chapter 8
The Multiple Testing Problem¶
Problem: With \(m\) independent tests at \(\alpha = 0.05\): \(P(\text{at least one false positive}) = 1 - (1 - \alpha)^m\)
| Tests (m) | P(at least 1 FP) |
|---|---|
| 1 | 5% |
| 5 | 23% |
| 10 | 40% |
| 20 | 64% |
| 100 | 99% |
FWER vs FDR¶
| Metric | Definition | Methods |
|---|---|---|
| FWER (Family-Wise Error Rate) | \(P(\geq 1\) false positive) | Bonferroni, Dunnett, Holm |
| FDR (False Discovery Rate) | \(E[\frac{\text{false positives}}{\text{all positives}}]\) | Benjamini-Hochberg |
FWER = conservative, controls ANY false positive. FDR = balanced, controls PROPORTION of false positives.
Bonferroni Correction¶
Formula: \(\alpha_{\text{adjusted}} = \frac{\alpha}{m}\)
Example: 10 tests at \(\alpha = 0.05\): adjusted threshold \(\frac{0.05}{10} = 0.005\), so reject \(H_0\) only if \(p < 0.005\).
Pros: simple, controls FWER. Cons: too conservative, low power.
Benjamini-Hochberg Procedure¶
Algorithm:
1. Sort p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\)
2. For each \(i\), compute the threshold \(t_i = \frac{i}{m} \cdot \alpha\)
3. Find the largest \(k\) such that \(p_{(k)} \leq t_k\)
4. Reject \(H_{(1)}, H_{(2)}, \ldots, H_{(k)}\)
Example: 5 tests at \(\alpha = 0.05\)
| Rank | p-value | Threshold \((i/m \cdot \alpha)\) | Significant? |
|---|---|---|---|
| 1 | 0.001 | 0.01 | ✅ |
| 2 | 0.008 | 0.02 | ✅ |
| 3 | 0.020 | 0.03 | ✅ (p ≤ 0.03) |
| 4 | 0.045 | 0.04 | ❌ (p > 0.04) |
| 5 | 0.120 | 0.05 | ❌ |
Result: Reject first 3 hypotheses.
Comparison Table¶
| Method | Controls | Power | Best For |
|---|---|---|---|
| Bonferroni | FWER | Low | Few tests, strict control needed |
| Holm | FWER | Medium | Step-down improvement |
| Dunnett | FWER | Medium-High | Multiple vs. control |
| BH | FDR | High | Many tests, some FP OK |
Python Implementation¶
import numpy as np
from scipy import stats
# Example: 10 hypothesis tests
p_values = np.array([0.001, 0.008, 0.02, 0.03, 0.04,
0.05, 0.08, 0.12, 0.15, 0.30])
# Bonferroni correction
alpha = 0.05
bonferroni_threshold = alpha / len(p_values)
bonferroni_significant = p_values < bonferroni_threshold
print(f"Bonferroni threshold: {bonferroni_threshold:.4f}")
print(f"Bonferroni significant: {bonferroni_significant.sum()} tests")
# Benjamini-Hochberg correction
def benjamini_hochberg(p_values, alpha=0.05):
m = len(p_values)
sorted_idx = np.argsort(p_values)
sorted_p = p_values[sorted_idx]
# Calculate thresholds
thresholds = (np.arange(1, m + 1) / m) * alpha
# Find largest k where p[k] <= threshold[k]
below_threshold = sorted_p <= thresholds
if not below_threshold.any():
return np.zeros(m, dtype=bool)
k = np.max(np.where(below_threshold)[0])
# Create result array
significant = np.zeros(m, dtype=bool)
significant[sorted_idx[:k+1]] = True
return significant
bh_significant = benjamini_hochberg(p_values, alpha)
print(f"BH significant: {bh_significant.sum()} tests")
# Using statsmodels
from statsmodels.stats.multitest import multipletests
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05,
method='fdr_bh') # or 'bonferroni'
print(f"statsmodels BH: {reject.sum()} tests")
When to Use Which¶
Multiple tests needed?
├── Few tests (< 10) and strict control?
│ └── Bonferroni or Holm
├── Many tests (> 10) or exploratory?
│ └── Benjamini-Hochberg
├── All groups vs. control?
│ └── Dunnett's test
└── Sequential testing (peeking)?
└── Sequential testing methods (not BH)
Key Interview Questions¶
Q: Why does Bonferroni reduce power?
A: By dividing \(\alpha\) by \(m\), the threshold becomes very strict. True effects with moderate p-values (e.g., 0.02) may be missed when testing many hypotheses.
Q: When would you choose FDR over FWER?
A: FDR (Benjamini-Hochberg) when: (1) Many tests (10+), (2) Exploratory analysis where some false positives are acceptable, (3) Need more power to detect true effects. Use FWER when even one false positive is costly (e.g., drug approval).
Q: What happens to FWER if you run 100 A/B tests at α=0.05?
A: \(P(\text{at least 1 FP}) = 1 - (1 - 0.05)^{100} \approx 99.4\%\). Without correction, you're almost guaranteed at least one false positive!
Q: Why can't BH be used for sequential testing (peeking)?
A: BH requires all p-values at once for ranking. In sequential testing, tests are run one at a time based on interim results, so the full set isn't available. Use group sequential methods or alpha-spending instead.
Q: What's the difference between Bonferroni and Holm?
A: Both control FWER. Holm is a step-down procedure that's uniformly more powerful: it compares the smallest p-value to \(\alpha/m\), then the second smallest to \(\alpha/(m-1)\), etc. Always prefer Holm over Bonferroni.
20. Spectral Methods: Graph Laplacian (Gap Filler 2026)¶
Best Resources¶
Tutorials: - Spectral Clustering in Machine Learning — GeeksforGeeks (2025) - A Tutorial on Spectral Clustering — Von Luxburg (canonical)
Spectral Clustering Overview¶
Core idea: Use eigenvalues/eigenvectors of the graph Laplacian to find clusters in data with non-convex shapes.
When to use: Non-convex clusters, manifold structure, image segmentation, graph data.
Graph Construction¶
| Method | Description | Result |
|---|---|---|
| ε-neighborhood | Connect points within radius ε | Unweighted graph |
| k-NN graph | Connect to k nearest neighbors | Directed/weighted graph |
| Fully connected | Connect all pairs with Gaussian weight | Weighted graph |
Similarity (affinity) matrix: \(W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)\)
Graph Laplacian Variants¶
| Type | Formula | Properties |
|---|---|---|
| Unnormalized | \(L = D - W\) | Simple, eigenvector for \(k\) clusters |
| Normalized (symmetric) | \(L_{sym} = D^{-1/2}LD^{-1/2} = I - D^{-1/2}WD^{-1/2}\) | Better numerical properties |
| Normalized (random walk) | \(L_{rw} = D^{-1}L = I - D^{-1}W\) | Connection to random walks |
Where \(D\) is the degree matrix: \(D_{ii} = \sum_j W_{ij}\)
Spectral Clustering Algorithm¶
- Build similarity graph from data
- Compute Laplacian matrix \(L\)
- Find k smallest eigenvectors of \(L\)
- Stack eigenvectors as columns in matrix \(U\)
- Run k-means on rows of \(U\)
Python Implementation¶
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
# Generate non-convex data
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)
# Spectral clustering
sc = SpectralClustering(
n_clusters=2,
affinity='rbf', # or 'nearest_neighbors'
gamma=1.0, # kernel bandwidth
random_state=42
)
labels = sc.fit_predict(X)
# Compare with k-means
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=42)
km_labels = km.fit_predict(X)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
axes[0].set_title('Spectral Clustering')
axes[1].scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis')
axes[1].set_title('K-Means (fails on non-convex)')
Eigenvalue Interpretation¶
- 0 eigenvalue: Corresponds to constant eigenvector (all nodes connected)
- Small eigenvalues: Near-0 eigenvectors indicate connected components
- Gap in spectrum: Number of clusters ≈ number of near-zero eigenvalues
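A minimal sketch of the eigengap heuristic: build a Gaussian affinity matrix for two well-separated blobs, form the unnormalized Laplacian \(L = D - W\), and inspect its smallest eigenvalues (the gamma value is an arbitrary illustration choice):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)

W = rbf_kernel(X, gamma=1.0)       # Gaussian affinity matrix
D = np.diag(W.sum(axis=1))         # degree matrix
L = D - W                          # unnormalized Laplacian

eigvals = np.linalg.eigvalsh(L)    # sorted ascending
print(eigvals[:5])                 # roughly two near-zero eigenvalues, then a jump -> k = 2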
K-Means vs Spectral Clustering¶
| Aspect | K-Means | Spectral |
|---|---|---|
| Cluster shape | Convex only | Any shape |
| Complexity | \(O(nkd)\) | \(O(n^3)\) for eigendecomposition |
| Scalability | Excellent | Poor for large n |
| Preprocessing | Scale features | Build affinity matrix |
Key Interview Questions¶
Q: Why does spectral clustering work for non-convex clusters?
A: It transforms data into eigenvector space where the Laplacian's spectral properties separate clusters linearly. The transformation "unfolds" non-convex manifolds into separable representations.
Q: What's the role of the graph Laplacian's eigenvectors?
A: The smallest eigenvectors form an embedding where points in the same cluster are close. Running k-means in this space finds clusters that may be non-convex in original space.
Q: How do you choose the number of clusters k?
A: (1) Look for a gap in the eigenvalue spectrum (eigengap heuristic), (2) Use silhouette score, (3) Domain knowledge. The eigengap method: find k where \(\lambda_{k+1} - \lambda_k\) is largest.
Q: Why is spectral clustering slow for large datasets?
A: Building the full affinity matrix is \(O(n^2)\) memory. Eigendecomposition is \(O(n^3)\). Solutions: use sparse affinity matrices, approximate eigensolvers (Lanczos), or Nyström approximation.
Q: What's the difference between L, L_sym, and L_rw?
A: All find the same clusters in theory. \(L_{sym}\) (symmetric normalized) has better numerical stability. \(L_{rw}\) (random walk) relates to Markov chains on graphs. Always use normalized Laplacian in practice.
Advantages & Disadvantages¶
Pros: - Handles non-convex clusters - No assumption about cluster shape - Mathematically elegant (spectral graph theory)
Cons: - \(O(n^3)\) complexity - \(O(n^2)\) memory for affinity matrix - Sensitive to affinity parameters (σ, k) - Requires specifying number of clusters
21. Test-Time Compute Scaling (2025 Hot Topic)¶
Best Resources¶
Blogs: - Inference-Time Compute Scaling Methods — Sebastian Raschka (Mar 2025) — CANONICAL - What is Test-Time Scaling? — Adaline Labs (Apr 2025)
Papers: - The Art of Scaling Test-Time Compute for LLMs — Dec 2025 - s1: Simple Test-Time Scaling — Jan 2025
What is Test-Time Compute?¶
Definition: Spending more computational resources during inference to improve model outputs, rather than just during training.
Analogy: Humans give better responses when given more time to "think". Similarly, LLMs can improve with techniques that encourage more "thought" during generation.
Two Strategies for Better Reasoning¶
| Strategy | When | Cost |
|---|---|---|
| Train-time compute | During training/fine-tuning | Fixed upfront cost |
| Test-time compute | During inference | Per-query cost, scalable |
Methods for Test-Time Scaling¶
| Method | Description | Trade-off |
|---|---|---|
| Chain-of-Thought (CoT) | "Think step by step" prompting | More tokens = more cost |
| Best-of-N sampling | Generate N responses, pick best | N× compute |
| Majority voting | Generate N, take most common answer | N× compute |
| Beam search | Explore multiple paths, keep top-k | k× compute |
| Self-correction | Model reviews and fixes own output | 2× tokens |
| "Wait" tokens | Force model to continue thinking | Budget forcing |
"Wait" Token Technique (s1 Paper)¶
def budget_forcing(model, prompt, max_thinking_tokens=1000):
"""Force model to think longer by adding 'Wait' tokens"""
response = model.generate(prompt)
while len(response) < max_thinking_tokens:
# If model tries to stop, inject "Wait" to continue
if response.endswith("</answer>"):
response = response.replace("</answer>", "")
response += "\nWait, let me reconsider..."
response = model.generate(response, max_new_tokens=100)
return response
Best-of-N with Verification¶
def best_of_n_with_verifier(model, verifier, prompt, n=5):
"""Generate N candidates, use verifier to pick best"""
candidates = [model.generate(prompt) for _ in range(n)]
scores = [verifier.score(prompt, c) for c in candidates]
best_idx = max(range(n), key=lambda i: scores[i])
return candidates[best_idx]
Compute-Optimal Inference¶
Key insight (Snell et al., 2024): There's an optimal allocation of test-time compute: - Easy problems: little extra compute needed - Hard problems: benefit from more sampling/verification - Very hard problems: diminishing returns
Rule of thumb: \(\text{Test-time budget} \propto \text{problem difficulty}\)
When Test-Time Compute Helps¶
| Scenario | Effective? | Why |
|---|---|---|
| Math problems | ✅ Yes | Verifiable answers |
| Code generation | ✅ Yes | Tests can verify |
| Creative writing | ⚠️ Limited | No objective "best" |
| Simple Q&A | ❌ No | Overkill |
| Time-critical apps | ❌ No | Latency too high |
Key Interview Questions¶
Q: What's the difference between train-time and test-time compute?
A: Train-time: invest compute upfront during training to improve model weights. Test-time: spend compute during inference on each query. Test-time is more flexible (can scale per-problem difficulty) but costs more per query.
Q: When would you use Best-of-N vs. Chain-of-Thought?
A: Best-of-N when you have a verifier (e.g., code tests, math checker). CoT when you want the model to reason explicitly. Both can be combined: generate N CoT responses, pick the most consistent.
Q: What are "wait" tokens in test-time scaling?
A: A technique from the s1 paper where you force the model to continue generating by appending "Wait, let me reconsider" when it tries to stop. This effectively increases inference compute for harder problems.
Q: Why did DeepSeek R1 NOT use explicit test-time scaling?
A: Their model already generates longer responses naturally (implicit scaling). They noted that test-time scaling is typically an application-layer concern and can be added post-hoc without modifying the model.
Q: What's the cost problem with test-time compute?
A: Linear or super-linear increase in inference cost per query. For high-traffic applications, this can be prohibitively expensive. Need to balance accuracy gains vs. latency and cost constraints.
Cost-Benefit Trade-off¶
More test-time compute →
├── Better accuracy (up to a point)
├── Higher latency (bad for real-time)
├── Higher cost (per query)
└── Diminishing returns for very high compute
Practical guidance: - Start with CoT prompting (free) - Add Best-of-N for verifiable domains (2-5× cost) - Use process reward models for critical applications - Set compute budget based on problem value
22. Information Geometry (Gap Filler 2026)¶
Best Resources¶
Blogs: - Information Geometry Part 3: Applications in ML — Ji-Ha Kim (Jun 2025) - Curvature as a Teacher: Information Geometry in Neural Network Learning — Satyam Mishra (Aug 2025) - Information Geometry in Variational Inference — AI Under the Hood
Books: - Amari, Information Geometry and Its Applications — canonical reference - Murphy, Probabilistic Machine Learning — Chapter 10
Core Concept: Probability Manifold¶
Key insight: Probability distributions form a curved manifold, not a flat Euclidean space.
| Concept | Euclidean | Information Geometry |
|---|---|---|
| Space | Flat | Curved manifold |
| Distance | \(\|x - y\|\) | KL divergence |
| Metric | Identity \(I\) | Fisher Information \(I(\theta)\) |
| Gradient | \(\nabla L\) | Natural gradient \(\tilde{\nabla} L\) |
Fisher Information Matrix (FIM)¶
Definition: \(I(\theta)_{ij} = \mathbb{E}\left[\frac{\partial \log p(x|\theta)}{\partial \theta_i} \frac{\partial \log p(x|\theta)}{\partial \theta_j}\right]\)
Properties:
- Positive semi-definite: \(I(\theta) \succeq 0\)
- Captures the curvature of the log-likelihood
- KL divergence approximation: \(D_{KL}(p_\theta \| p_{\theta+\delta}) \approx \frac{1}{2} \delta^T I(\theta) \delta\)
Natural Gradient¶
Standard gradient: \(\nabla L\) — assumes flat Euclidean space
Natural gradient: \(\tilde{\nabla} L = I(\theta)^{-1} \nabla L\)
Update rule: \(\theta_{t+1} = \theta_t - \eta \cdot I(\theta_t)^{-1} \nabla L(\theta_t)\)
Benefits:
- Reparametrization invariant: same direction regardless of parameterization
- Faster convergence: follows geodesics (shortest paths on the manifold)
- Stable: respects manifold curvature
Intuition: Why Natural Gradient?¶
Imagine riding a bike on Earth vs. a flat parking lot: - On the parking lot (Euclidean), shortest path is a straight line. - On Earth (curved), shortest path is a great circle.
Natural gradient = GPS navigation that knows the Earth is round.
Approximations for Large Models¶
Problem: For \(d\) parameters, computing \(I(\theta)^{-1}\) is \(O(d^3)\) — infeasible for deep nets.
| Approximation | Complexity | Trade-off |
|---|---|---|
| Diagonal | \(O(d)\) | Ignores correlations, like Adam |
| Block-diagonal (K-FAC) | \(O(d \cdot k^2)\) | Per-layer blocks, Kronecker factors |
| Low-rank | \(O(dr)\) | Capture dominant directions |
| Matrix-free (CG) | \(O(d \cdot k)\) | Solve \(I(\theta)s = \nabla L\) iteratively |
Adam as Approximate Natural Gradient¶
Adam update: \(\theta_t = \theta_{t-1} - \eta \cdot \frac{g_t}{\sqrt{v_t} + \epsilon}\)
Interpretation: \(\sqrt{v_t}\) (second moment) ≈ diagonal of empirical Fisher.
import torch
def natural_gradient_update(params, grad, fisher_diag, lr=0.01, damping=1e-4):
"""Natural gradient with diagonal Fisher approximation"""
# F^-1 * grad (element-wise for diagonal)
nat_grad = grad / (fisher_diag + damping)
return params - lr * nat_grad
def empirical_fisher_diag(model, data_loader, n_samples=100):
"""Estimate diagonal of empirical Fisher"""
fisher_diag = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
for i, (x, y) in enumerate(data_loader):
if i >= n_samples:
break
log_prob = model.log_prob(x, y)
for name, p in model.named_parameters():
grad = torch.autograd.grad(log_prob, p, retain_graph=True)[0]
fisher_diag[name] += grad ** 2
# Average
for name in fisher_diag:
fisher_diag[name] /= n_samples
return fisher_diag
K-FAC (Kronecker-Factored Approximate Curvature)¶
For a layer with weights \(W \in \mathbb{R}^{m \times n}\), K-FAC approximates the Fisher block as a Kronecker product: \(I \approx A \otimes G\)
where: - \(A = \mathbb{E}[aa^T]\) — activation covariance (input) - \(G = \mathbb{E}[gg^T]\) — gradient covariance (output) - \(\otimes\) — Kronecker product
Benefit: Inverting \((mn \times mn)\) matrix → inverting two \((m \times m)\) and \((n \times n)\) matrices.
Applications in ML¶
| Application | Role of Information Geometry |
|---|---|
| VAEs | Optimize \(q(z\mid x)\) to minimize KL divergence |
| Bayesian Neural Networks | Natural gradient for posterior inference |
| Reinforcement Learning (TRPO) | Fisher metric constrains policy updates |
| Optimization | Adam ≈ diagonal natural gradient |
Key Interview Questions¶
Q: Why is natural gradient better than standard gradient?
A: (1) Reparametrization invariant — same optimization path regardless of how you parameterize the model. (2) Faster convergence — follows geodesics on the probability manifold. (3) Curvature-aware — adapts to sensitivity of parameters.
Q: How is Adam related to natural gradient?
A: Adam's second moment estimate \(v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\) approximates the diagonal of the empirical Fisher Information Matrix. The update \(g_t / \sqrt{v_t}\) is essentially a natural gradient with diagonal approximation.
Q: What's the computational problem with natural gradient for deep learning?
A: Computing and inverting the full Fisher Information Matrix is \(O(d^2)\) storage and \(O(d^3)\) inversion, where \(d\) is the number of parameters (millions to billions for modern networks). Practical methods use diagonal, block-diagonal (K-FAC), or low-rank approximations.
Q: When would you use K-FAC over Adam?
A: K-FAC when: (1) Training is unstable with Adam, (2) You need faster convergence (fewer steps), (3) Second-order information is valuable (e.g., small datasets, research). Adam is better for: (1) Most practical deep learning, (2) Limited compute, (3) Large-scale training.
Q: What does "curvature" mean in information geometry?
A: Curvature of the statistical manifold defined by the Fisher metric. High curvature regions → parameters more sensitive, need smaller steps. Low curvature → less sensitive, can take larger steps. Natural gradient automatically adapts step sizes based on curvature.
Summary Table¶
| Concept | Formula | Intuition |
|---|---|---|
| Fisher Matrix | \(I(\theta) = \mathbb{E}[\nabla \log p \cdot \nabla \log p^T]\) | Curvature of log-likelihood |
| Natural Gradient | \(\tilde{\nabla} L = I^{-1} \nabla L\) | Steepest descent on manifold |
| KL ≈ Fisher | \(D_{KL} \approx \frac{1}{2} \delta^T I \delta\) | Local distance on manifold |
| Adam | \(g / \sqrt{v}\) | Diagonal natural gradient |
| K-FAC | \(I \approx A \otimes G\) | Layer-wise Kronecker factors |
24. Advanced Optimization: Newton, BFGS, L-BFGS (Gap Filler 2026)¶
Best Resources¶
Blogs: - Why Not Newton? The Real Reason We're Stuck with SGD — Dang Truong (2025) - Second-Order Optimization Methods — GeeksforGeeks - L-BFGS for Advanced Optimization — NumberAnalytics
Courses: - EPFL Optimization for ML — lecture notes (2025) - Boyd Convex Optimization — canonical book
First-Order vs Second-Order¶
| Method | Uses | Convergence | Cost per Step |
|---|---|---|---|
| SGD | Gradient \(\nabla f\) | Sub-linear | \(O(n)\) |
| Newton | Gradient + Hessian \(\nabla f, H\) | Quadratic | \(O(n^3)\) |
| BFGS | Approximate inverse Hessian | Superlinear | \(O(n^2)\) |
| L-BFGS | Limited memory BFGS | Superlinear | \(O(n)\) |
Newton's Method¶
Update rule (from the Taylor expansion): \(\theta_{t+1} = \theta_t - H^{-1} \nabla f(\theta_t)\)
Where: - \(H = \nabla^2 f(\theta)\) — Hessian matrix of second derivatives - \(H_{ij} = \frac{\partial^2 f}{\partial \theta_i \partial \theta_j}\)
Why quadratic convergence?
- Taylor expansion: \(f(\theta) \approx f(\theta^*) + \nabla f(\theta^*)(\theta - \theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)\)
- The Newton step minimizes this quadratic approximation exactly
- Near the minimum, the error is roughly squared at each iteration
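A minimal sketch of pure Newton steps on the 2D Rosenbrock function, with the gradient and Hessian written out analytically (the starting point is the classic benchmark value):

import numpy as np

# Rosenbrock: f(x, y) = 100(y - x^2)^2 + (1 - x)^2, minimum at (1, 1)
def grad(x, y):
    return np.array([-400 * x * (y - x**2) - 2 * (1 - x),
                     200 * (y - x**2)])

def hess(x, y):
    return np.array([[1200 * x**2 - 400 * y + 2, -400 * x],
                     [-400 * x, 200.0]])

theta = np.array([-1.2, 1.0])                 # classic starting point
for _ in range(20):
    step = np.linalg.solve(hess(*theta), grad(*theta))
    theta = theta - step                      # Newton step: theta - H^{-1} grad
print(theta)                                   # ≈ [1. 1.]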
Why NOT Use Newton for Deep Learning?¶
Level 1: Memory Problem - ResNet-50: \(N = 25\) million parameters - Hessian: \(N \times N = 625\) trillion entries ≈ 2.5 petabytes in FP32 - H100 GPU: 80 GB → roughly 31,000 H100s would be needed just to store it!
Level 2: Stochastic Problem - We use mini-batches (64 samples), not full dataset - Mini-batch gradient: reasonable estimate of true gradient - Mini-batch Hessian: garbage — too noisy to trust - Even with L-BFGS, stochastic noise destroys quadratic convergence
Level 3: Non-Convex Problem - Deep learning landscapes: millions of saddle points, sharp valleys - Newton converges to nearest critical point (could be saddle!) - SGD noise helps escape sharp minima → better generalization
Insight: We don't want a scalpel (Newton). We need a jackhammer (SGD).
BFGS: Quasi-Newton Method¶
Key idea: Approximate \(H^{-1}\) iteratively, don't compute it directly.
Update: \(\theta_{t+1} = \theta_t - \alpha_t M_t \nabla f(\theta_t)\)
Where \(M_t \approx H^{-1}\) is updated using:
- \(s_t = \theta_{t+1} - \theta_t\) (parameter change)
- \(y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t)\) (gradient change)
- Rank-2 update: \(M_{t+1} = (I - \rho s y^T) M_t (I - \rho y s^T) + \rho s s^T\)
Cost: \(O(n^2)\) per step (vs \(O(n^3)\) for Newton)
L-BFGS: Limited-Memory BFGS¶
Key idea: Don't store \(n \times n\) matrix. Store last \(k\) pairs \((s_i, y_i)\).
Memory: \(O(nk)\) instead of \(O(n^2)\)
Typical: \(k = 10-20\) pairs
import numpy as np
from scipy.optimize import minimize
def rosenbrock(x):
"""Rosenbrock function: classic optimization benchmark"""
return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)
def rosenbrock_grad(x):
"""Gradient of Rosenbrock"""
xm = x[1:-1]
xm_m1 = x[:-2]
xm_p1 = x[2:]
der = np.zeros_like(x)
der[1:-1] = 200*(xm-xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1-xm)
der[0] = -400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0])
der[-1] = 200*(x[-1]-x[-2]**2)
return der
# L-BFGS optimization
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
result = minimize(rosenbrock, x0, method='L-BFGS-B', jac=rosenbrock_grad)
print(f"Minimum at: {result.x}")
print(f"Function value: {result.fun:.6f}")
print(f"Iterations: {result.nit}")
Comparison for Deep Learning¶
| Method | Memory | Works with Mini-batch? | Good for DL? |
|---|---|---|---|
| Newton | \(O(n^2)\) | No | ❌ Impossible |
| BFGS | \(O(n^2)\) | Poorly | ❌ Too much memory |
| L-BFGS | \(O(n)\) | Poorly | ⚠️ Sometimes (small models) |
| SGD | \(O(n)\) | Yes | ✅ Standard |
| Adam | \(O(n)\) | Yes | ✅ Best default |
When L-BFGS IS Used¶
- Classic ML: SVM, logistic regression (convex problems)
- Small neural networks: < 10K parameters
- Fine-tuning: When full-batch is feasible
- SciPy / PyTorch: scipy.optimize.minimize(method='L-BFGS-B') in SciPy, torch.optim.LBFGS in PyTorch (see the sketch below)
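PyTorch's torch.optim.LBFGS is closure-based: the optimizer may re-evaluate the loss several times per step. A minimal sketch with a placeholder model and synthetic full-batch data:
import torch

model = torch.nn.Linear(10, 1)                      # tiny model, full-batch setting
X, y = torch.randn(100, 10), torch.randn(100, 1)
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=20, history_size=10)

def closure():
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)                         # each step may call closure() repeatedly
print(closure().item())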
Key Interview Questions¶
Q: Why doesn't deep learning use Newton's method?
A: Three reasons: (1) Memory: Hessian is \(O(n^2)\) — petabytes for modern networks. (2) Stochastic: Mini-batch Hessians are too noisy, destroying quadratic convergence benefits. (3) Non-convex: Newton can converge to saddles or sharp minima; SGD's noise finds flat, generalizable minima.
Q: What's the difference between BFGS and L-BFGS?
A: BFGS stores full \(n \times n\) inverse Hessian approximation (\(O(n^2)\) memory). L-BFGS stores only last \(k\) gradient/parameter difference pairs (\(O(nk)\) memory). L-BFGS achieves similar convergence with far less memory.
Q: When would you actually use L-BFGS?
A: For convex problems (SVM, logistic regression), small models (<10K params), or when you can afford full-batch gradients. Not suitable for large-scale stochastic deep learning.
Q: If Adam is "fake second-order," what does it approximate?
A: Adam's second moment \(v_t\) approximates the diagonal of the empirical Fisher Information Matrix. This is a very rough curvature estimate — just enough to adapt step sizes per parameter, but missing the full Hessian information.
Q: What's quadratic convergence?
A: Error roughly squares each iteration: \(|\theta_{t+1} - \theta^*| \approx C|\theta_t - \theta^*|^2\). Near minimum, correct digits double each step. Newton has this; SGD doesn't.
Summary Table¶
| Method | Update Rule | Key Property |
|---|---|---|
| Newton | \(\theta - H^{-1}\nabla f\) | Quadratic convergence, \(O(n^3)\) |
| BFGS | \(\theta - M_t \nabla f\) | Superlinear, \(O(n^2)\) memory |
| L-BFGS | Same, implicit \(M_t\) | \(O(n)\) memory, needs full-batch |
| SGD | \(\theta - \eta \nabla f\) | Noisy, good for non-convex |
| Adam | \(\theta - \eta \frac{m}{\sqrt{v}+\epsilon}\) | Diagonal approx, adaptive |
Johnson-Lindenstrauss Lemma & Random Projections¶
Best Sources¶
Blogs: - Johnson-Lindenstrauss Lemma & Random Projections — AI Under the Hood - Ultimate Guide to Johnson-Lindenstrauss Lemma — NumberAnalytics
Papers: - Johnson & Lindenstrauss (1984) — original paper - Dasgupta & Gupta (2003) — elementary proof - Achlioptas (2003) — database-friendly sparse projections
Books: - Vempala, The Random Projection Method — canonical reference
Johnson-Lindenstrauss Lemma: Statement¶
For \(n\) points in \(\mathbb{R}^d\) and \(\varepsilon \in (0,1)\), there exists a linear map \(f: \mathbb{R}^d \to \mathbb{R}^k\) with
\(k \geq \frac{8 \ln n}{\varepsilon^2}\)
such that for all pairs \(x_i, x_j\):
\((1-\varepsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1+\varepsilon)\|x_i - x_j\|^2\)
Key insight: \(k\) depends on \(n\) and \(\varepsilon\), NOT on original dimension \(d\)!
Why This Matters¶
| Without JL | With JL |
|---|---|
| \(d = 1,000,000\) | \(k = O(\log n / \varepsilon^2)\) |
| Expensive to store | Compressed storage |
| Slow distance computations | Fast computations |
| Curse of dimensionality | Mitigated |
Random Projection Methods¶
| Method | Matrix entries | Speed | Notes |
|---|---|---|---|
| Gaussian | \(R_{ij} \sim \mathcal{N}(0, 1/k)\) | Slow | Theoretical guarantee |
| Rademacher | \(R_{ij} = \pm 1/\sqrt{k}\) with \(p=0.5\) | Medium | Simpler |
| Sparse (Achlioptas) | \(0\) (⅔), \(\pm\sqrt{3}/\sqrt{k}\) (⅙ each) | Fast | 3× speedup |
Python Implementation¶
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

def random_projection_custom(X, k, method='gaussian'):
    """
    Random projection from d dimensions to k dimensions.
    Args:
        X: (n, d) data matrix
        k: target dimension
        method: 'gaussian', 'rademacher', or 'sparse'
    """
    n, d = X.shape
    if method == 'gaussian':
        R = np.random.randn(k, d) / np.sqrt(k)
    elif method == 'rademacher':
        R = np.random.choice([-1, 1], size=(k, d)) / np.sqrt(k)
    elif method == 'sparse':
        # Achlioptas: 0 with prob 2/3, ±√3 with prob 1/6 each
        R = np.random.choice([0, np.sqrt(3), -np.sqrt(3)],
                             size=(k, d), p=[2/3, 1/6, 1/6]) / np.sqrt(k)
    return X @ R.T

def jl_target_dimension(n, epsilon=0.1):
    """Compute minimum target dimension for JL guarantee."""
    return int(np.ceil(8 * np.log(n) / epsilon**2))

# Example usage
n, d = 1000, 5000  # 1000 points in 5000 dimensions
X = np.random.randn(n, d)
epsilon = 0.3  # with the conservative 8·ln(n)/ε² bound, ε=0.1 would give k ≈ 5500 > d for n=1000
k = jl_target_dimension(n, epsilon)
print(f"Original: {d}D → Projected: {k}D (for n={n}, ε={epsilon})")
X_proj = random_projection_custom(X, k, method='sparse')

# Verify distance preservation
i, j = 0, 1
orig_dist = np.linalg.norm(X[i] - X[j])
proj_dist = np.linalg.norm(X_proj[i] - X_proj[j])
ratio = proj_dist / orig_dist
print(f"Original dist: {orig_dist:.3f}, Projected: {proj_dist:.3f}, Ratio: {ratio:.3f}")
print(f"Within bounds? {(1-epsilon) <= ratio**2 <= (1+epsilon)}")

# Using sklearn (recommended for production)
transformer = SparseRandomProjection(n_components=k, random_state=42)
X_proj_sklearn = transformer.fit_transform(X)
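scikit-learn also exposes johnson_lindenstrauss_min_dim, which computes a safe target dimension from a closely related bound (its constant differs slightly from 8 ln n / ε²); a quick illustrative check:
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# The required k grows only logarithmically in the number of points n.
for n in (1_000, 100_000, 10_000_000):
    print(n, johnson_lindenstrauss_min_dim(n_samples=n, eps=0.1))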
Applications in ML¶
| Application | How JL Helps |
|---|---|
| k-NN Search | Reduce dimensions, keep approximate neighbors |
| Clustering | Faster k-means in lower dimensions |
| SVM | Kernel approximation for large datasets |
| LSH (Locality-Sensitive Hashing) | Efficient approximate nearest neighbors |
| Compressed sensing | Signal recovery from fewer measurements |
JL vs PCA¶
| Aspect | Random Projection | PCA |
|---|---|---|
| Speed | \(O(nkd)\) | \(O(nd^2)\) for SVD |
| Data-dependent? | No (random matrix) | Yes (eigenvalue decomposition) |
| Preserves | Pairwise distances | Variance |
| Best for | Very high \(d\), streaming data | When variance matters most |
| Guarantees | Theoretical (with prob.) | Optimal for variance |
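A rough speed comparison on synthetic data (timings are hardware-dependent; this is an illustrative sketch, not a benchmark):
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import SparseRandomProjection

X = np.random.randn(2_000, 10_000)   # n=2000 points in d=10000 dimensions

for name, reducer in [("PCA", PCA(n_components=500)),
                      ("Sparse RP", SparseRandomProjection(n_components=500))]:
    t0 = time.time()
    reducer.fit_transform(X)
    print(f"{name}: {time.time() - t0:.2f}s")
# Random projection skips the eigendecomposition, so it is typically much faster
# at high d, at the cost of some distance distortion.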
When to Use Random Projections¶
High-dimensional data (d > 1000)?
├── Need fast dimensionality reduction?
│ └── Random Projection ✅
├── Need interpretable components?
│ └── PCA ✅
├── Streaming / online data?
│ └── Random Projection ✅ (no fit needed)
└── Distance-based algorithms (k-NN, clustering)?
└── Random Projection ✅ (preserves distances)
Key Interview Questions¶
Q: What does the Johnson-Lindenstrauss lemma guarantee?
A: For any \(n\) points, we can project to \(k = O(\log n / \varepsilon^2)\) dimensions while preserving all pairwise distances within factor \((1 \pm \varepsilon)\) with high probability. Crucially, \(k\) doesn't depend on original dimension \(d\).
Q: Why is JL useful for ML?
A: (1) Mitigates curse of dimensionality, (2) Speeds up distance-based algorithms (k-NN, clustering), (3) Compresses high-dimensional data (embeddings), (4) Enables efficient approximate nearest neighbor search.
Q: Why use sparse random projections over Gaussian?
A: Sparse matrices (Achlioptas) are faster to compute: many entries are zero, reducing multiplications by ~3×. Still provides same theoretical guarantees with high probability.
Q: When would you use random projection vs PCA?
A: Random projection when: (1) Data dimension is very high (\(d > 1000\)), (2) Need fast online/streaming projection, (3) Don't need interpretable components, (4) Preserving distances is more important than variance. PCA when you need optimal variance preservation and interpretability.
Q: What's the catch with random projections?
A: (1) Adds randomness — results not deterministic without seed, (2) Some distortion (\(\varepsilon\)), (3) Doesn't capture data structure like PCA, (4) Not optimal for any specific dataset — it's a general-purpose tool.
Summary Table¶
| Concept | Formula | Purpose |
|---|---|---|
| Target dimension | \(k \geq \frac{8 \ln n}{\varepsilon^2}\) | Minimum for JL guarantee |
| Gaussian projection | \(R_{ij} \sim \mathcal{N}(0, 1/k)\) | Theoretical standard |
| Sparse projection | \(R_{ij} \in \{0, \pm\sqrt{3}\}/\sqrt{k}\) | Fast computation |
| Distance preservation | \((1-\varepsilon)\|x\|^2 \leq \|Rx\|^2 \leq (1+\varepsilon)\|x\|^2\) | Core guarantee |
Misconception: softmax is stable without the max-subtraction trick
For \(x_i > 709\) (float64) or \(x_i > 88\) (float32), \(e^{x_i}\) overflows to inf. Even float64 does not save you. Always subtract the max: exp(x - max(x)) -- mathematically identical, but numerically stable. This is asked in 90% of ML interviews.
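A quick demonstration of the overflow and the fix (illustrative logits):
import numpy as np

x = np.array([10.0, 1000.0, 2000.0])                # large logits
naive = np.exp(x) / np.exp(x).sum()                  # exp(1000) = inf, so 0/inf and inf/inf appear
stable = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(naive)    # [ 0. nan nan] plus an overflow RuntimeWarning
print(stable)   # [0. 0. 1.]: the same softmax, computed stably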
Misconception: PCA can be replaced by t-SNE for data preprocessing
t-SNE has no inverse transform, no out-of-sample extension, \(O(n^2)\) complexity, and stochastic results. For preprocessing use PCA or UMAP. t-SNE is only for visualizing small datasets (<50k points). Distances and cluster sizes on a t-SNE plot are meaningless.
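A common pattern consistent with this advice (a sketch; the dataset and parameters are placeholders): reduce with PCA first, use t-SNE only for the final 2-D visualization.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.randn(5_000, 784)                  # e.g. flattened images (placeholder data)

X_pca = PCA(n_components=50).fit_transform(X)    # preprocessing: fast, reusable, has inverse_transform
X_vis = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(X_pca)
# X_vis is for plotting only: TSNE has no transform() for new points,
# and distances / cluster sizes in the embedding are not meaningful.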
Misconception: Newton's method is better than SGD for deep learning
The Hessian for ResNet-50 (25M parameters) has 625 trillion entries ≈ 2.5 petabytes. Inverting the matrix is \(O(n^3)\). Even L-BFGS copes poorly with stochastic mini-batches. SGD plus its noise helps find flat minima with better generalization.
Cross-References Between Topics¶
graph TD
A[Calculus<br/>gradient] --> B[Gradient Descent]
B --> C[Optimizers]
C --> D[Deep Learning]
E[Linear Algebra<br/>matrices] --> F[Neural Networks]
F --> G[Backprop]
H[Probability<br/>Bayes] --> I[Naive Bayes]
H --> J[Information Theory]
J --> K[Entropy]
L[Statistics<br/>hypothesis testing] --> M[A/B Tests]
K --> N[Information Gain]
N --> O[Decision Trees]
O --> P[Random Forest]
P --> Q[Classification Metrics]
R[Regularization<br/>L1/L2] --> S[Gradient Boosting]
style A fill:#e8eaf6,stroke:#3f51b5
style E fill:#e8eaf6,stroke:#3f51b5
style H fill:#f3e5f5,stroke:#9c27b0
style L fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
style J fill:#fff3e0,stroke:#ef6c00
style Q fill:#e8f5e9,stroke:#4caf50
Recommended Study Order¶
- Week 1-2: Linear Algebra + Calculus (3Blue1Brown)
- Week 3-4: Probability + Statistics (StatQuest, practice problems)
- Week 5: Information Theory (connection to Loss Functions)
- Week 6: Classification Metrics + Regularization
- Week 7: Ensemble Methods
- Week 8: NumPy/Pandas practice (Kaggle)