Mathematics for ML: Study Materials

~35 minutes reading time

Prerequisites: a basic math background (school-level derivatives, matrices, probability)

This section covers 41 problems across 22 topics, from calculus and linear algebra to information geometry and spectral methods. According to Nareshka statistics (2025-2026), 73% of ML math interview questions fall into 5 topics: Statistics (17 problems), Linear Algebra (4), Information Theory (4), Calculus (3), and Probability (3). The materials below cover both the fundamentals and gap fillers for senior positions.

Updated: 2026-02-11


1. Calculus (3 problems)

Best Sources

Courses: - Mathematics for Machine Learning Specialization (Imperial College London) — Linear Algebra, Multivariate Calculus, PCA - MIT 18.06SC Linear Algebra — canonical course - 3Blue1Brown Essence of Calculus — visual intuition

Blogs: - Lil'Log - Calculus for ML — gradients, optimization - d2l.ai Chapter 2 — Calculus for Deep Learning

YouTube: - Andrej Karpathy: micrograd — backprop from scratch - StatQuest: Gradient Descent — intuitive explanation

Key Concepts

Topic Formula Application
Gradient \(\nabla f = [\frac{\partial f}{\partial x_1}, ...]\) Direction of steepest ascent
Sigmoid' \(\sigma'(x) = \sigma(x)(1-\sigma(x))\) Backprop
ReLU' \(\text{ReLU}'(x) = 1\) if \(x>0\) else \(0\) Backprop
MSE gradient \(\frac{\partial L}{\partial \hat{y}} = \frac{2}{n}(\hat{y}-y)\) Regression
BCE gradient \(\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\) Classification
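A quick way to internalize these formulas is to check them against finite differences. A minimal sketch (NumPy only; the toy values are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.7, 1e-6
analytic = sigmoid(x) * (1 - sigmoid(x))                      # sigma'(x) from the table
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central finite difference
print(abs(analytic - numeric))                                # ~1e-11

# MSE gradient from the table: dL/dy_hat = 2/n * (y_hat - y)
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.8, 0.2, 0.6])
mse_grad = 2 / len(y) * (y_hat - y)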

2. Linear Algebra (4 problems)

Best Sources

Books: - Mathematics for Machine Learning (Deisenroth, Faisal, Ong) — free, Chapters 2-4 - Deep Learning Book - Linear Algebra (Goodfellow) — Chapter 2

Courses: - MIT 18.06 Linear Algebra — Gilbert Strang - Linear Algebra for ML (Coursera)

YouTube: - 3Blue1Brown Essence of Linear Algebra — MUST WATCH

Key Concepts

Topic Formula Application
L2 norm \(\|x\|_2 = \sqrt{\sum_i x_i^2}\) Regularization
Softmax \(\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\) Multiclass classification
Cosine similarity \(\cos(a,b) = \frac{a \cdot b}{\|a\|\|b\|}\) Embeddings
SVD \(X = U\Sigma V^T\) PCA, recommender systems

Trick for softmax stability:

def softmax(x):
    x_max = np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x - x_max)  # subtract max for numerical stability
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
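The SVD row above maps directly onto PCA: the principal directions are the right singular vectors of the centered data matrix. A minimal sketch (NumPy only; the toy data is illustrative):

import numpy as np

X = np.random.randn(100, 5)            # toy data: 100 samples, 5 features
X_centered = X - X.mean(axis=0)        # PCA assumes centered data

# SVD: X = U @ diag(S) @ Vt; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

X_pca = X_centered @ Vt[:2].T          # project onto the top-2 components
explained_var = S**2 / (len(X) - 1)    # eigenvalues of the covariance matrix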


3. Probability (3 problems)

Best Sources

Books: - Think Stats 2e — free, practical approach - Probability for Machine Learning — interactive course

Blogs: - Seeing Theory — probability visualizations - Lil'Log: Probability

Key Concepts

Topic Formula Application
Bayes \(P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}\) Naive Bayes, belief updating
A/B test sample size \(n = \frac{(z_{\alpha/2}+z_\beta)^2 \cdot 2p(1-p)}{\delta^2}\) Experiments
Laplace smoothing \(P(w \mid c) = \frac{\text{count}(w,c)+1}{\text{count}(c)+\lvert V \rvert}\) Naive Bayes
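The A/B sample-size formula from the table is easy to turn into a helper. A minimal sketch (assuming SciPy; the baseline rate and effect size are illustrative):

from scipy.stats import norm

def ab_sample_size(p, delta, alpha=0.05, power=0.8):
    """Per-group sample size from the formula above (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # z_{alpha/2}
    z_beta = norm.ppf(power)            # z_{beta}
    return (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2

# Detect a 2 percentage-point lift from a 10% baseline conversion rate
print(round(ab_sample_size(p=0.10, delta=0.02)))  # ~3,530 per group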

4. Statistics (17 problems)

Best Sources

Courses: - Harvard Stats 110 — Probability & Statistics (Joe Blitzstein) - Khan Academy Statistics

Books: - Practical Statistics for Data Scientists — practical focus - ISLR Chapters 2-3 — Statistical Learning

Blogs: - StatQuest (Josh Starmer) — MUST WATCH for intuition - Seeing Theory — visualizations

Key Concepts

Topic Formula Interview Importance
CLT \(\bar{X} \sim N(\mu, \sigma^2/n)\) HIGH
p-value \(P(D \mid H_0)\) HIGH
t-statistic \(t = \frac{\bar{x}-\mu}{s/\sqrt{n}}\) HIGH
Bootstrap CI \([\hat{\theta}_{\alpha/2}, \hat{\theta}_{1-\alpha/2}]\) MEDIUM
VIF \(\text{VIF}_j = \frac{1}{1-R_j^2}\) MEDIUM
KS test \(D = \sup_x \lvert F_n(x)-F(x) \rvert\) LOW

Bootstrap algorithm:

def bootstrap_ci(data, stat_fn, n_bootstrap=10000, alpha=0.05):
    stats = [stat_fn(np.random.choice(data, len(data), replace=True))
             for _ in range(n_bootstrap)]
    lower = np.percentile(stats, 100 * alpha/2)
    upper = np.percentile(stats, 100 * (1-alpha/2))
    return lower, upper
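A minimal usage example for bootstrap_ci above (assuming NumPy is imported as np; the skewed toy sample is illustrative):

data = np.random.exponential(scale=2.0, size=500)   # skewed toy sample
lower, upper = bootstrap_ci(data, np.mean)           # 95% CI for the mean
print(f"mean = {data.mean():.2f}, CI = [{lower:.2f}, {upper:.2f}]")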


5. Information Theory (4 problems)

Best Sources

Books: - Information Theory, Inference, and Learning Algorithms (MacKay) — free - Elements of Information Theory (Cover & Thomas)

Blogs: - Lil'Log: Information Theory - Count Bayesie: Entropy

Key Concepts

Topic Formula Application
Entropy \(H(X) = -\sum p_i \log_2 p_i\) Decision trees
Cross-Entropy \(H(p,q) = -\sum p_i \log q_i\) Loss function
KL Divergence \(D_{KL}(P\|Q) = \sum p_i \log \frac{p_i}{q_i}\) VAE, distribution matching
Information Gain \(IG(S,A) = H(S) - \sum \frac{\lvert S_v\rvert}{\lvert S \rvert}H(S_v)\) Decision trees
Gini \(G = 1 - \sum p_k^2\) Random Forest

Entropy vs Gini:

Entropy: more sensitive to the class distribution, log scale
Gini: computationally cheaper (no log), the default in sklearn
In practice the difference is minimal
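A minimal sketch computing the impurity measures and KL divergence from the table above (NumPy only; the distributions are illustrative):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1 - np.sum(np.asarray(p, dtype=float) ** 2)

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))    # assumes p, q strictly positive

p = [0.5, 0.5]
print(entropy(p), gini(p))                     # 1.0, 0.5 -- maximal impurity for 2 classes
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.368 nats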


6. Classification Metrics (2 problems)

Best Sources

Blogs: - Google ML Crash Course: Classification - Scikit-learn: Metrics

YouTube: - StatQuest: ROC & AUC

Key Concepts

Metric Formula When to Use
Precision \(\frac{TP}{TP+FP}\) False positives are costly
Recall \(\frac{TP}{TP+FN}\) False negatives are costly
F1 \(\frac{2 \cdot P \cdot R}{P+R}\) Balance between P and R
ROC-AUC Area under TPR/FPR curve Ranking quality
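A minimal sketch computing these metrics with scikit-learn (the toy labels and scores are illustrative):

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard labels at threshold 0.5
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))          # harmonic mean = 0.75
print(roc_auc_score(y_true, y_score))    # uses scores, not hard labels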

7. Regularization (3 problems)

Best Sources

Blogs: - Regularization in ML - Dropout Paper

YouTube: - StatQuest: Ridge & Lasso

Key Concepts

Method Effect When to Use
L1 (Lasso) Sparsity, feature selection Many irrelevant features
L2 (Ridge) Shrinkage, small weights Multicollinearity
ElasticNet L1 + L2 Combination of both
Dropout Random deactivation Deep learning, overfitting
BatchNorm Normalize activations Faster training, regularization

L1 vs L2 geometry:

L1: diamond-shaped constraint region → corners on the axes → sparse solutions
L2: circular constraint region → weights shrink close to 0, but not exactly 0
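The sparsity effect is easy to see on a toy regression where most features are irrelevant. A minimal sketch assuming scikit-learn (the dataset and alpha values are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero weights:", np.sum(lasso.coef_ == 0))   # most are exactly 0
print("Ridge zero weights:", np.sum(ridge.coef_ == 0))   # typically 0: shrunk, not zeroed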


8. Ensemble Methods (2 problems)

Best Sources

Books: - ESL Chapters 8-10 — Boosting, Trees - XGBoost Documentation

Blogs: - Gradient Boosting from scratch - CatBoost vs XGBoost vs LightGBM

Key Concepts

Method Key Idea Bias-Variance
Bagging Parallel, bootstrap samples Reduces variance
Boosting Sequential, fit residuals Reduces bias
Random Forest Bagging + feature randomization Low variance
GBDT Gradient descent in function space Low bias

RF vs GBDT:

RF: parallel, robust to overfitting, simple
GBDT: sequential, more powerful, needs tuning
In practice: start with RF, then move to XGBoost/LightGBM
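A minimal sketch comparing the two on a synthetic dataset (assuming scikit-learn; hyperparameters are illustrative defaults):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)        # bagging
gbdt = GradientBoostingClassifier(n_estimators=200, random_state=0)  # boosting

print("RF  :", cross_val_score(rf, X, y, cv=5).mean())
print("GBDT:", cross_val_score(gbdt, X, y, cv=5).mean())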


9. NumPy & Pandas (3 problems)

Best Sources

Documentation: - NumPy User Guide - Pandas User Guide

Cheat Sheets: - NumPy Cheat Sheet - Pandas Cheat Sheet

Key Concepts

Operation Example Key Point
loc vs iloc df.loc['A':'B'] vs df.iloc[0:2] loc inclusive, iloc exclusive
groupby.agg df.groupby('cat').agg({'val': ['mean', 'std']}) Multiple aggregations
merge pd.merge(df1, df2, on='key', how='left') SQL-style JOINs
transform df.groupby('cat')['val'].transform('mean') Same shape as input
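A minimal sketch of the operations from the table (assuming pandas; the toy frame is illustrative):

import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b", "b", "b"],
                   "val": [1, 3, 2, 4, 6]})

summary = df.groupby("cat").agg({"val": ["mean", "std"]})   # one row per group

# transform keeps the original shape -- handy for group-wise normalization
df["val_centered"] = df["val"] - df.groupby("cat")["val"].transform("mean")

first_two = df.iloc[0:2]   # position-based, end-exclusive
by_label = df.loc[0:1]     # label-based, end-inclusive (same rows here)
print(df)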

10. LLM Scaling Laws (Gap Filler 2026)

Best Sources

Papers: - Scaling Laws for Neural Language Models (Kaplan et al., 2020) — the first scaling laws - Training Compute-Optimal Large Language Models (Chinchilla, Hoffmann et al., 2022) — compute-optimal - Beyond Chinchilla (Porian et al., 2025) — includes inference costs

Blogs: - LLM Scaling Laws 2026: Complete Guide — AIMultiple (Feb 2026) - The Chinchilla Papers Implications — LessWrong

Key Formulas

Kaplan (2020): \(L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}\)

Loss decreases as power law with model size.

Chinchilla (2022): \(L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E\)

Where: - \(N\) = number of parameters - \(D\) = training tokens - \(\alpha \approx 0.34, \beta \approx 0.28, E \approx 1.69\)

Compute-optimal: \(N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}\)

Rule of thumb: ~20 tokens per parameter for compute-optimal training.

Evolution 2020-2026

Year Paper Key Insight
2020 Kaplan Power-law scaling with model size
2022 Chinchilla Compute-optimal: balance N and D
2024 Llama 3 Overtrain small models (15T tokens on 8B)
2025 Beyond Chinchilla Include inference cost in optimization
2025 Sloth Latent skills for benchmark prediction
2025 Densing Law Capability density rising exponentially

Practical Implications

# Chinchilla-optimal tokens for model size
def compute_optimal_tokens(params_billions: float) -> float:
    """Calculate Chinchilla-optimal training tokens"""
    # Rule: 20 tokens per parameter
    return params_billions * 1e9 * 20

# Example: 7B model
tokens_7b = compute_optimal_tokens(7)  # 140B tokens

# But Llama 3 trained 15T tokens on 8B = 1875 tokens/param
# Why? Inference cost dominates at scale

Interview Questions: 1. "Why did Llama 3 overtrain 8B model with 15T tokens instead of Chinchilla-optimal ~160B?" 2. "How does inference cost change the optimal training strategy?" 3. "What is Densing Law and why does it matter for model selection?"


11. VC Dimension & PAC Learning (Gap Filler 2026)

Best Sources

Books: - Understanding Machine Learning (Shalev-Shwartz & Ben-David) — free - Foundations of Machine Learning (Mohri) — Bloomberg course

Blogs: - PAC Theory and VC Dimension — Cole Mei (2025) - VC Dimension Deep Dive — Tang Shusen

Key Concepts

PAC Learning (Probably Approximately Correct): \(\Pr\left[ |E_{in}(g) - E_{out}(g)| > \epsilon \right] \leq \delta\)

Where: - \(E_{in}(g)\) = training error (empirical risk) - \(E_{out}(g)\) = test error (true risk) - \(\epsilon\) = accuracy parameter - \(\delta\) = confidence parameter

VC Dimension Definition: - The maximum number of points that can be shattered by hypothesis class \(\mathcal{H}\) - "Shattered" = all \(2^n\) labelings possible - For linear classifiers in \(\mathbb{R}^d\): VC dimension = \(d + 1\)

Generalization Bound: \(E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{8}{N} \ln \left( \frac{4(2N)^{VC(\mathcal{H})}}{\delta} \right)}\)

Trade-off Diagram

graph LR
    subgraph "VC Dimension Trade-off"
        LOW["Low VC dim<br/>High E_in<br/>Low Omega<br/>(underfitting)"]
        OPT["Optimal<br/>Balance E_in<br/>and Omega"]
        HIGH["High VC dim<br/>Low E_in<br/>High Omega<br/>(overfitting)"]
    end
    LOW --> OPT --> HIGH

    style LOW fill:#e8eaf6,stroke:#3f51b5
    style OPT fill:#e8f5e9,stroke:#4caf50
    style HIGH fill:#fce4ec,stroke:#c62828
  • Low VC dim: high \(E_{in}\), low \(\Omega\) (underfitting)
  • High VC dim: low \(E_{in}\), high \(\Omega\) (overfitting)
  • Optimal: balance \(E_{in}\) and \(\Omega\)
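A minimal sketch plugging numbers into the generalization bound above (NumPy only; the sample size and VC dimension are illustrative):

import numpy as np

def vc_complexity_term(n, vc_dim, delta=0.05):
    """Omega term from the generalization bound: E_out <= E_in + Omega."""
    return np.sqrt(8.0 / n * np.log(4.0 * (2.0 * n) ** vc_dim / delta))

# Linear classifier in R^10 (VC dim = 11) trained on 10,000 samples
print(vc_complexity_term(n=10_000, vc_dim=11))   # ~0.30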

Interview Questions

Q: What is the VC dimension of a decision tree?

A: Theoretically infinite (can shatter any dataset with enough depth). In practice, bounded by number of leaves.

Q: Why is VC dimension useful?

A: It provides theoretical guarantees on generalization without knowing the true distribution.

Q: What's the VC dimension of a linear classifier in 2D?

A: 3 (can shatter 3 non-collinear points, but not 4 in general position).


12. Causal Inference Basics (Gap Filler 2026)

Best Sources

Books: - Causal Inference: The Mixtape (Cunningham) — free - Elements of Causal Inference (Peters et al.)

Blogs: - Causal Inference Interview Questions — InterviewGemini (2025)

Key Concepts

Association vs Causation: - Association: \(P(Y|X)\) — statistical relationship - Causation: \(P(Y|do(X))\) — intervention changes distribution

Potential Outcomes Framework: \(\tau_i = Y_i(1) - Y_i(0)\)

Where: - \(Y_i(1)\) = outcome if treated - \(Y_i(0)\) = outcome if not treated - We only observe one!

ATE (Average Treatment Effect): \(\text{ATE} = \mathbb{E}[Y(1) - Y(0)]\)

Confounding & DAGs

Confounding: Third variable affects both treatment and outcome.

Ice Cream Sales ← Temperature → Crime Rate
(Spurious correlation, not causal!)

DAG (Directed Acyclic Graph): Visual representation of causal relationships.

Methods for Observational Studies

Method Key Idea When to Use
Propensity Score Matching Match similar treated/untreated Selection on observables
Regression Discontinuity Exploit cutoff in treatment Sharp threshold exists
Instrumental Variables Use exogenous variation Hidden confounders
Difference-in-Differences Compare trends over time Panel data available
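A minimal simulation illustrating confounding and why adjustment matters (assuming NumPy, pandas, and statsmodels; the data-generating process is an illustrative assumption):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
conf = rng.normal(size=n)                              # confounder: affects both T and Y
treat = (rng.normal(size=n) + conf > 0).astype(int)    # treatment depends on confounder
y = 2.0 * treat + 3.0 * conf + rng.normal(size=n)      # true ATE = 2.0

df = pd.DataFrame({"treat": treat, "conf": conf, "y": y})

# Naive difference in means is biased upward by the confounder (~5.4 here)
naive_ate = df.loc[df.treat == 1, "y"].mean() - df.loc[df.treat == 0, "y"].mean()

# Adjusting for the observed confounder via regression recovers ~2.0
ols = sm.OLS(df["y"], sm.add_constant(df[["treat", "conf"]])).fit()
print(naive_ate, ols.params["treat"])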

Interview Questions

Q: Explain the difference between RCTs and observational studies.

A: RCTs randomize treatment → no confounding by design. Observational studies observe natural treatment assignment → must control for confounders.

Q: What is SUTVA?

A: Stable Unit Treatment Value Assumption — treatment of one unit doesn't affect outcomes of other units. Violated in network effects, spillovers.

Q: What are the key assumptions for causal inference?

A: (1) Consistency, (2) Ignorability (no unmeasured confounding), (3) Positivity (overlap), (4) SUTVA.

Q: How do you assess validity of causal claims?

A: (1) Temporal precedence (cause before effect), (2) Covariation, (3) No spurious correlation, (4) Plausible mechanism, (5) Sensitivity analysis.


13. MCMC Sampling Methods (Gap Filler 2026)

Best Sources

Books: - Bayesian Data Analysis (Gelman et al.) — Chapter 11 - MCMC Handbook — free

Blogs: - Mastering MCMC Methods — NumberAnalytics - MCMC for ML — Lilian Weng

Key Algorithms

Algorithm Key Idea When to Use
Metropolis-Hastings Propose → Accept/Reject General purpose
Gibbs Sampling Sample each variable conditionally Known conditionals
Hamiltonian MC Use gradients for efficient proposals High-dimensional
Slice Sampling Sample under density curve No tuning needed

Metropolis-Hastings Algorithm

import numpy as np

def metropolis_hastings(target_pdf, n_samples, theta_init=0.0, proposal_std=1.0):
    """Random-walk Metropolis-Hastings sampler (symmetric Gaussian proposal)."""
    samples = []
    theta = theta_init

    for _ in range(n_samples):
        # Symmetric proposal, so q cancels in the acceptance ratio
        theta_prop = theta + np.random.normal(scale=proposal_std)
        alpha = min(1, target_pdf(theta_prop) / target_pdf(theta))

        if np.random.random() < alpha:
            theta = theta_prop  # Accept (otherwise keep the current value)

        samples.append(theta)

    return np.array(samples)

Acceptance Probability: \(\alpha = \min\left(1, \frac{\pi(\theta') q(\theta|\theta')}{\pi(\theta) q(\theta'|\theta)}\right)\)

Convergence Diagnostics

Diagnostic What to Check Good Sign
Trace Plot Stationary around mean No drift, good mixing
ACF Autocorrelation decay Quick decay
ESS Effective Sample Size ESS > 400
\(\hat{R}\) Gelman-Rubin statistic \(\hat{R} < 1.1\)

Effective Sample Size: \(ESS = \frac{N}{1 + 2\sum_{lag=1}^{\infty} \rho(lag)}\)

Interview Questions

Q: Why use MCMC instead of analytical solutions?

A: Many posteriors are intractable (no closed form). MCMC samples without computing the normalizing constant.

Q: What's a good acceptance rate for MH?

A: 20-40% for random walk proposals. Too high = slow exploration, too low = wasted computation.


14. Time Series Statistics (Gap Filler 2026)

Best Sources

Books: - Time Series Analysis and Its Applications (Shumway & Stoffer) — free - Forecasting: Principles and Practice (Hyndman) — online

Blogs: - Quant Interview FAQ: Time Series — BagelQuant (2025)

Stationarity

Weak Stationarity: \(E[y_t] = \mu, \quad Var(y_t) = \sigma^2, \quad Cov(y_t, y_{t-k}) = \gamma_k\)

Tests:

| Test | H₀ | H₁ | Stationary if |
|------|-----|-----|---------------|
| ADF | Unit root | Stationary | p < 0.05 |
| KPSS | Stationary | Unit root | p > 0.05 |

ARIMA Model

ARIMA(p, d, q): - \(p\) = AR lags (autoregressive) - \(d\) = differencing order (integration) - \(q\) = MA lags (moving average)

Model equations:

\[\text{AR(p): } y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t\]

\[\text{MA(q): } y_t = c + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \epsilon_t\]

ACF/PACF for Model Selection

Pattern Indicates Order
ACF cuts off after lag q MA(q) q lags
PACF cuts off after lag p AR(p) p lags
Both tail off ARMA Use AIC/BIC
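A minimal sketch of the ADF/KPSS tests on a simulated random walk (assuming statsmodels; the series is illustrative):

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(42)
random_walk = np.cumsum(rng.normal(size=500))       # unit root -> non-stationary

adf_p = adfuller(random_walk)[1]
kpss_p = kpss(random_walk, regression="c", nlags="auto")[1]
print(f"ADF p={adf_p:.3f} (large -> cannot reject unit root)")
print(f"KPSS p={kpss_p:.3f} (small -> reject stationarity)")

# Differencing once typically makes a random walk stationary
print(f"ADF after differencing: p={adfuller(np.diff(random_walk))[1]:.4f}")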

Key Interview Questions

Q: How do you make a non-stationary series stationary?

A: (1) Differencing (once or more), (2) Log transformation (stabilize variance), (3) Detrending.

Q: What is cointegration?

A: Two non-stationary series are cointegrated if a linear combination is stationary. Basis for pairs trading.

Q: AR vs MA vs ARMA?

A: AR models persistence in levels. MA models shock propagation. ARMA combines both.

Common Pitfalls

  1. Overfitting with excessive lags
  2. Ignoring non-stationarity
  3. Misinterpreting correlation as causation
  4. Not using robust standard errors
  5. Ignoring structural breaks

15. Matrix Calculus: Jacobian & Hessian (Gap Filler 2026)

Best Sources

Tutorials: - DataCamp: What is the Hessian Matrix — comprehensive guide with Python - GeeksforGeeks: Jacobian and Hessian Matrix — formulas and applications

Jacobian Matrix

Definition: For \(f: \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian is: \(J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\)

Example: \(f(x,y) = [x^2 + y, xy]\) → \(J = \begin{bmatrix} 2x & 1 \\ y & x \end{bmatrix}\)

Hessian Matrix

Definition: For \(f: \mathbb{R}^n \to \mathbb{R}\), the Hessian is: \(H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}\)

Critical Point Classification:

| Hessian Eigenvalues | Type | Behavior |
|---------------------|------|----------|
| All positive | Minimum | Convex locally |
| All negative | Maximum | Concave locally |
| Mixed signs | Saddle | Neither |
| Some zero | Inconclusive | Need higher-order |

Second Derivative Test: \(D = \begin{vmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{vmatrix}\)

  • \(D > 0, f_{xx} > 0\): Local minimum
  • \(D > 0, f_{xx} < 0\): Local maximum
  • \(D < 0\): Saddle point

Comparison Table

Concept Input → Output Matrix Shape Use Case
Gradient \(\mathbb{R}^n \to \mathbb{R}\) \(n \times 1\) First-order optimization
Jacobian \(\mathbb{R}^n \to \mathbb{R}^m\) \(m \times n\) Backprop, change of variables
Hessian \(\mathbb{R}^n \to \mathbb{R}\) \(n \times n\) Second-order optimization

Trace Tricks for Matrix Derivatives

\[\nabla_A \text{tr}(AB) = B^T\]
\[\nabla_A \text{tr}(A^TBA) = (B + B^T)A\]

Example Interview Question:

Q: Derive \(\nabla_X \text{tr}(X^TAX)\)

A: Using trace properties and symmetry: $\(\nabla_X \text{tr}(X^TAX) = AX + A^TX\)$

If \(A\) is symmetric: \(\nabla_X = 2AX\)
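A finite-difference check of this identity, as a minimal sketch (NumPy only; matrix sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3))

analytic = A @ X + A.T @ X            # claimed gradient of tr(X^T A X)

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        numeric[i, j] = (np.trace(Xp.T @ A @ Xp) - np.trace(Xm.T @ A @ Xm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-9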

Applications in Neural Networks

Hessian in optimization: - Newton's method: \(\theta_{t+1} = \theta_t - H^{-1}\nabla L\) - In practice: Hessian too expensive, use approximations (BFGS, L-BFGS)

Jacobian in backprop: - Chain rule through layers: \(\frac{\partial L}{\partial W} = J_{output} \cdot J_{hidden} \cdot ...\)

Python Implementation

import numpy as np

def numerical_jacobian(f, x, eps=1e-8):
    """Compute Jacobian numerically"""
    n, m = len(x), len(f(x))
    J = np.zeros((m, n))
    for i in range(n):
        x_plus = x.copy()
        x_plus[i] += eps
        J[:, i] = (f(x_plus) - f(x)) / eps
    return J

def numerical_hessian(f, x, eps=1e-5):
    """Compute Hessian numerically"""
    n = len(x)
    H = np.zeros((n, n))
    fx = f(x)
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
            x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
            x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
            x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps**2)
    return H

# Example
def f(x):
    return x[0]**2 + x[1]**2  # Simple quadratic

x = np.array([1.0, 2.0])
H = numerical_hessian(f, x)
print("Hessian:", H)
# [[2. 0.]
#  [0. 2.]] → Positive definite (minimum at origin)

Key Interview Questions

Q: Why is Newton's method rarely used in deep learning?

A: (1) Hessian is \(O(n^2)\) memory for \(n\) parameters, (2) Inverting Hessian is \(O(n^3)\), (3) For large networks (millions of params), this is intractable.

Q: What's the relationship between Hessian and conditioning?

A: The condition number of the Hessian (ratio of largest to smallest eigenvalue) determines optimization difficulty. High condition number = ill-conditioned = slow convergence.

Q: How does Jacobian relate to backpropagation?

A: Backprop computes vector-Jacobian products (VJP) efficiently: \(\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot J\). PyTorch/TF compute VJPs without materializing full Jacobians.


16. Numerical Stability (Gap Filler 2026)

Best Sources

Tutorials: - Mastering Softmax with Numerical Stability — Best AI Tools - Gradient Clipping in PyTorch — GeeksforGeeks

Log-Sum-Exp Trick

Problem: Computing \(\log(\sum_i e^{x_i})\) directly causes overflow/underflow.

Solution: \(\log\left(\sum_i e^{x_i}\right) = c + \log\left(\sum_i e^{x_i - c}\right) \text{ where } c = \max_i x_i\)

Why it works: - \(c\) shifts all values so maximum becomes 0 - \(e^{x_i - c} \leq 1\) → no overflow - At least one term = \(e^0 = 1\) → sum \(\geq 1\) → no underflow
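A minimal sketch of the trick (assuming SciPy for the reference implementation; the values are chosen to overflow naive exponentiation):

import numpy as np
from scipy.special import logsumexp

x = np.array([1000.0, 1001.0, 1002.0])

naive = np.log(np.sum(np.exp(x)))           # exp(1000) overflows to inf (with a warning)
c = np.max(x)
stable = c + np.log(np.sum(np.exp(x - c)))  # log-sum-exp trick from above
print(naive, stable, logsumexp(x))          # inf, ~1002.41, ~1002.41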

Softmax Numerical Stability

Naive (unstable):

def softmax_unstable(x):
    exp_x = np.exp(x)  # Can overflow!
    return exp_x / np.sum(exp_x)

Stable version:

def softmax_stable(x):
    x_shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x_shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Gradient Clipping Methods

Method PyTorch Function What it does
Clip by Value clip_grad_value_(params, clip_value) Clips each gradient to \([-v, v]\)
Clip by Norm clip_grad_norm_(params, max_norm) Scales gradient vector to max norm
Clip by Hook register_hook(lambda g: torch.clamp(g, min, max)) Custom clipping logic

When to use which: - Clip by norm: Standard for RNNs/Transformers (preserves direction) - Clip by value: When you need component-wise limits - Clip by hook: Asymmetric clipping, layer-specific

Gradient Clipping Code

import torch
import torch.nn as nn

# Method 1: Clip by norm (most common)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Method 2: Clip by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

# Method 3: Custom hook for asymmetric clipping
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -0.1, 1.0))

Common Numerical Issues

Issue Cause Solution
NaN in softmax Overflow from large inputs Log-sum-exp trick
Exploding gradients Deep networks, RNNs Gradient clipping
Vanishing gradients Sigmoid/tanh in deep nets ReLU, residual connections
Division by zero Sum of exponentials \(\approx 0\) Add epsilon (\(10^{-8}\))
Log(0) Computing \(\log(0)\) Use np.log(x + eps) or np.logaddexp

FP16/BF16 Considerations

Mixed precision training:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()
    with autocast():
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # Unscale before clipping!
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

Key Interview Questions

Q: Why subtract max before softmax instead of just using float64?

A: Float64 only delays overflow. Even with 64-bit, \(e^{1000}\) overflows. The log-sum-exp trick is mathematically exact and works with any precision.

Q: Clip by value vs clip by norm — when to use which?

A: Clip by norm preserves gradient direction (preferred for optimization). Clip by value can change direction, useful when certain parameters shouldn't exceed a threshold regardless of others.

Q: How does gradient clipping affect convergence?

A: Clipping acts as a regularizer. Too aggressive clipping → slow learning. Too weak → instability. Typical values: max_norm=1.0 for transformers, 5.0 for RNNs.

Q: What's the difference between exploding and vanishing gradients?

A: Exploding: gradients grow exponentially through layers → NaN weights. Vanishing: gradients shrink to near-zero → no learning in early layers. Solutions differ: clipping for exploding, ReLU/residuals/architecture changes for vanishing.


17. Dimensionality Reduction: t-SNE & UMAP (Gap Filler 2026)

Best Sources

Tutorials: - PCA vs t-SNE vs UMAP — Medium (May 2025) - How to Use t-SNE Effectively — Distill.pub (canonical) - UMAP Original Paper — McInnes et al. (2018)

Comparison Table

Feature PCA t-SNE UMAP
Type Linear Non-linear Non-linear
Preserves Global structure Local neighborhoods Local + some global
Speed Fast Slow Fast
Scalability Excellent Poor (\(O(n^2)\)) Good (\(O(n \log n)\))
Use in ML Models ✅ Yes ❌ No (viz only) ✅ Yes
Inverse Transform ✅ Yes ❌ No ❌ No
Interpretability ✅ High ❌ Low ⚠️ Medium
Deterministic ✅ Yes ❌ No (random init) ❌ No

t-SNE (t-Distributed Stochastic Neighbor Embedding)

How it works: 1. Compute pairwise similarities in high-dim space (Gaussian) 2. Model similarities in low-dim space (Student-t with heavy tails) 3. Minimize KL divergence between both distributions

Key hyperparameters: - perplexity (5-50): Effective number of neighbors. Lower = more local structure. - n_iter (250-1000): More iterations = more stable. - learning_rate (10-1000): Too high = points cluster in ball; too low = compressed cloud.

When to use t-SNE: - Cluster visualization (CNN embeddings, single-cell RNA) - Exploratory data analysis - Small to medium datasets (< 50k samples)

UMAP (Uniform Manifold Approximation and Projection)

How it works: 1. Build fuzzy simplicial set (graph) in high-dim space 2. Build similar graph in low-dim space 3. Minimize cross-entropy between fuzzy set representations

Key hyperparameters: - n_neighbors (5-200): Local vs global structure balance - min_dist (0.0-0.99): How tightly points cluster - metric: Distance metric (euclidean, cosine, etc.)

When to use UMAP: - Large datasets (> 50k samples) - When you need speed - When global structure matters - Feature extraction before ML

Python Implementation

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt

# Load and scale data
X = StandardScaler().fit_transform(load_iris().data)
y = load_iris().target

# PCA (fast, linear)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"PCA explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# t-SNE (slow, non-linear, visualization)
tsne = TSNE(n_components=2, perplexity=30, n_iter=300, random_state=42)
X_tsne = tsne.fit_transform(X)

# UMAP (fast, non-linear, good for ML)
umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = umap_model.fit_transform(X)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, data, title in [(axes[0], X_pca, 'PCA'),
                         (axes[1], X_tsne, 't-SNE'),
                         (axes[2], X_umap, 'UMAP')]:
    ax.scatter(data[:, 0], data[:, 1], c=y, cmap='viridis', alpha=0.7)
    ax.set_title(title)

Decision Guide

Need dimensionality reduction?
├── For ML model input ➝ PCA or UMAP
│   ├── Interpretability needed? ➝ PCA
│   └── Non-linear patterns? ➝ UMAP
├── For visualization?
│   ├── Small dataset (<50k) ➝ t-SNE
│   └── Large dataset (>50k) ➝ UMAP
└── Need inverse transform? ➝ PCA only

Key Interview Questions

Q: Why does t-SNE use Student-t distribution in low-dim space?

A: Heavy tails of Student-t alleviate "crowding problem" — in high dimensions, there's more "room" for points. Gaussian in low-dim would compress all points together. Heavy tails push dissimilar points apart.

Q: What's the perplexity parameter in t-SNE?

A: It's a smooth measure of effective neighbors. Perplexity=30 means each point pays attention to ~30 neighbors. Low perplexity = local structure; high = more global.

Q: Why can't t-SNE be used for ML preprocessing?

A: (1) No inverse transform, (2) No out-of-sample extension (can't transform new data), (3) Stochastic results vary between runs, (4) \(O(n^2)\) complexity.

Q: UMAP vs t-SNE — when to use which?

A: UMAP is faster, scales better, and can transform new data. Use t-SNE only for small dataset visualization where interpretability of local clusters is critical. Use UMAP for everything else.

Q: Can t-SNE/UMAP distances be interpreted?

A: No! Only relative ordering matters. Cluster sizes and distances between clusters in t-SNE/UMAP plots are meaningless. Use PCA for interpretable distances.

Common Pitfalls

  1. Interpreting cluster sizes: Larger clusters in t-SNE ≠ more data
  2. Interpreting distances: Gap between clusters ≠ dissimilarity
  3. Random seed sensitivity: Different seeds = different plots
  4. Not scaling data: Always StandardScaler before t-SNE/UMAP
  5. Wrong perplexity: Rule of thumb: perplexity < \(\sqrt{n}\)

18. Constrained Optimization: Lagrange & KKT (Gap Filler 2026)

Best Sources

Books: - Convex Optimization (Boyd & Vandenberghe) — canonical text - Stanford CS229 Lecture Notes — SVM and optimization

Tutorials: - KKT Conditions: Ultimate Guide — NumberAnalytics - Convex Optimization and Lagrange Multipliers — CircuitLabs (2025)

General Optimization Problem

\[\begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m \\ & h_j(x) = 0, \quad j = 1, \ldots, p \end{aligned}\]

Lagrange Multipliers (Equality Constraints)

For problem with only equality constraints \(h_j(x) = 0\):

Lagrangian: \(L(x, \mu) = f(x) + \sum_{j=1}^{p} \mu_j h_j(x)\)

Optimality condition: \(\nabla_x L(x^*, \mu^*) = 0\)

Intuition: At optimum, gradient of \(f\) is a linear combination of constraint gradients.

KKT Conditions (Inequality + Equality)

For general problem with both constraints, \(x^*\) is optimal iff:

Condition Formula Meaning
Primal feasibility \(g_i(x^*) \leq 0\), \(h_j(x^*) = 0\) Point satisfies all constraints
Dual feasibility \(\lambda_i \geq 0\) Multipliers non-negative
Stationarity \(\nabla f + \sum \lambda_i \nabla g_i + \sum \mu_j \nabla h_j = 0\) Gradient of Lagrangian = 0
Complementary slackness \(\lambda_i g_i(x^*) = 0\) Active constraints only

Full Lagrangian: \(L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \mu_j h_j(x)\)

Complementary Slackness Intuition

\[\lambda_i g_i(x^*) = 0\]
  • If constraint is inactive (\(g_i(x^*) < 0\)), then \(\lambda_i = 0\)
  • If constraint is active (\(g_i(x^*) = 0\)), then \(\lambda_i \geq 0\)
  • Constraint "pushes back" with force \(\lambda_i\)

SVM Dual Formulation

Primal (hard margin): \(\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1\)

Lagrangian: \(L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i [y_i(w^T x_i + b) - 1]\)

Dual: \(\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j\)

\[\text{s.t.} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0\]

Why dual? 1. Kernel trick: replace \(x_i^T x_j\) with \(K(x_i, x_j)\) 2. Solution sparse: most \(\alpha_i = 0\) (support vectors)

Python Implementation

import numpy as np
from scipy.optimize import minimize

# Example: Minimize x^2 + y^2 subject to x + y = 1
def objective(x):
    return x[0]**2 + x[1]**2

def eq_constraint(x):
    return x[0] + x[1] - 1  # x + y = 1

# Solve with scipy
result = minimize(
    objective,
    x0=[0.5, 0.5],
    constraints={'type': 'eq', 'fun': eq_constraint}
)
print(f"Optimal: x={result.x}, f={result.fun}")
# x=[0.5, 0.5], f=0.5

# Verify with Lagrange multiplier
# L = x^2 + y^2 + λ(x + y - 1)
# ∂L/∂x = 2x + λ = 0  →  x = -λ/2
# ∂L/∂y = 2y + λ = 0  →  y = -λ/2
# Constraint: -λ/2 - λ/2 = 1  →  λ = -1
# Solution: x = y = 0.5 ✓

Convexity Matters

Problem Type Local = Global? KKT Conditions
Convex ✅ Yes Necessary AND sufficient
Non-convex ❌ No Necessary only

Convex problem requires: - Objective \(f(x)\) is convex - Inequality constraints \(g_i(x)\) are convex - Equality constraints \(h_j(x)\) are affine

Key Interview Questions

Q: What's the geometric interpretation of Lagrange multipliers?

A: At optimum, gradient of objective is perpendicular to constraint surface. \(\nabla f = -\sum \lambda_i \nabla g_i\) means \(\nabla f\) is in the span of constraint gradients.

Q: When are KKT conditions sufficient (not just necessary)?

A: When the problem is convex: convex objective, convex inequality constraints, affine equality constraints. Then KKT conditions guarantee global optimum.

Q: What does complementary slackness tell us in SVM?

A: \(\alpha_i (y_i(w^Tx_i + b) - 1) = 0\). Either \(\alpha_i = 0\) (point not on margin) or \(y_i(w^Tx_i + b) = 1\) (point on margin = support vector). This is why SVM has sparse solution.

Q: Why formulate SVM as a constrained optimization problem?

A: (1) Margin maximization is naturally a constrained problem, (2) Dual formulation enables kernel trick, (3) Guarantees global optimum (convex), (4) Solution is sparse (only support vectors matter).

Q: What's the difference between primal and dual problem?

A: Primal: optimize over original variables. Dual: optimize over Lagrange multipliers. For convex problems, strong duality holds (optimal values equal). Dual often easier (kernel trick in SVM).


19. Multiple Testing Correction (Gap Filler 2026)

Best Sources

Blogs: - Multiple Comparison Corrections in A/B Testing — Statsig (Oct 2025) - Benjamini-Hochberg Procedure — GeeksforGeeks (2025)

Books: - Multiple Comparisons (Hochberg & Tamhane) - Statistical Inference (Casella & Berger) — Chapter 8

The Multiple Testing Problem

Problem: With \(m\) independent tests at \(\alpha = 0.05\): \(P(\text{at least one false positive}) = 1 - (1 - \alpha)^m\)

Tests (m) P(at least 1 FP)
1 5%
5 23%
10 40%
20 64%
100 99%

FWER vs FDR

Metric Definition Methods
FWER (Family-Wise Error Rate) \(P(\geq 1\) false positive) Bonferroni, Dunnett, Holm
FDR (False Discovery Rate) \(E[\frac{\text{false positives}}{\text{all positives}}]\) Benjamini-Hochberg

FWER = conservative, controls ANY false positive. FDR = balanced, controls PROPORTION of false positives.

Bonferroni Correction

Formula: \(\alpha_{\text{adjusted}} = \frac{\alpha}{m}\)

Example: 10 tests at \(\alpha = 0.05\) - Adjusted threshold: \(\frac{0.05}{10} = 0.005\) - Reject \(H_0\) only if \(p < 0.005\)

Pros: Simple, controls FWER
Cons: Too conservative, low power

Benjamini-Hochberg Procedure

Algorithm: 1. Sort p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\) 2. For each \(i\), compute threshold: \(t_i = \frac{i}{m} \cdot \alpha\) 3. Find largest \(k\) such that \(p_{(k)} \leq t_k\) 4. Reject \(H_{(1)}, H_{(2)}, \ldots, H_{(k)}\)

Example: 5 tests at \(\alpha = 0.05\)

Rank p-value Threshold \((i/m \cdot \alpha)\) Significant?
1 0.001 0.01 ✅ (p ≤ 0.01)
2 0.008 0.02 ✅ (p ≤ 0.02)
3 0.020 0.03 ✅ (p ≤ 0.03)
4 0.045 0.04 ❌ (p > 0.04)
5 0.120 0.05 ❌ (p > 0.05)

Result: Reject first 3 hypotheses.

Comparison Table

Method Controls Power Best For
Bonferroni FWER Low Few tests, strict control needed
Holm FWER Medium Step-down improvement
Dunnett FWER Medium-High Multiple vs. control
BH FDR High Many tests, some FP OK

Python Implementation

import numpy as np
from scipy import stats

# Example: 10 hypothesis tests
p_values = np.array([0.001, 0.008, 0.02, 0.03, 0.04,
                      0.05, 0.08, 0.12, 0.15, 0.30])

# Bonferroni correction
alpha = 0.05
bonferroni_threshold = alpha / len(p_values)
bonferroni_significant = p_values < bonferroni_threshold
print(f"Bonferroni threshold: {bonferroni_threshold:.4f}")
print(f"Bonferroni significant: {bonferroni_significant.sum()} tests")

# Benjamini-Hochberg correction
def benjamini_hochberg(p_values, alpha=0.05):
    m = len(p_values)
    sorted_idx = np.argsort(p_values)
    sorted_p = p_values[sorted_idx]

    # Calculate thresholds
    thresholds = (np.arange(1, m + 1) / m) * alpha

    # Find largest k where p[k] <= threshold[k]
    below_threshold = sorted_p <= thresholds
    if not below_threshold.any():
        return np.zeros(m, dtype=bool)

    k = np.max(np.where(below_threshold)[0])

    # Create result array
    significant = np.zeros(m, dtype=bool)
    significant[sorted_idx[:k+1]] = True
    return significant

bh_significant = benjamini_hochberg(p_values, alpha)
print(f"BH significant: {bh_significant.sum()} tests")

# Using statsmodels
from statsmodels.stats.multitest import multipletests
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05,
                                           method='fdr_bh')  # or 'bonferroni'
print(f"statsmodels BH: {reject.sum()} tests")

When to Use Which

Multiple tests needed?
├── Few tests (< 10) and strict control?
│   └── Bonferroni or Holm
├── Many tests (> 10) or exploratory?
│   └── Benjamini-Hochberg
├── All groups vs. control?
│   └── Dunnett's test
└── Sequential testing (peeking)?
    └── Sequential testing methods (not BH)

Key Interview Questions

Q: Why does Bonferroni reduce power?

A: By dividing \(\alpha\) by \(m\), the threshold becomes very strict. True effects with moderate p-values (e.g., 0.02) may be missed when testing many hypotheses.

Q: When would you choose FDR over FWER?

A: FDR (Benjamini-Hochberg) when: (1) Many tests (10+), (2) Exploratory analysis where some false positives are acceptable, (3) Need more power to detect true effects. Use FWER when even one false positive is costly (e.g., drug approval).

Q: What happens to FWER if you run 100 A/B tests at α=0.05?

A: \(P(\text{at least 1 FP}) = 1 - (1 - 0.05)^{100} \approx 99.4\%\). Without correction, you're almost guaranteed at least one false positive!

Q: Why can't BH be used for sequential testing (peeking)?

A: BH requires all p-values at once for ranking. In sequential testing, tests are run one at a time based on interim results, so the full set isn't available. Use group sequential methods or alpha-spending instead.

Q: What's the difference between Bonferroni and Holm?

A: Both control FWER. Holm is a step-down procedure that's uniformly more powerful: it compares the smallest p-value to \(\alpha/m\), then the second smallest to \(\alpha/(m-1)\), etc. Always prefer Holm over Bonferroni.


20. Spectral Methods: Graph Laplacian (Gap Filler 2026)

Best Sources

Tutorials: - Spectral Clustering in Machine Learning — GeeksforGeeks (2025) - A Tutorial on Spectral Clustering — Von Luxburg (canonical)

Spectral Clustering Overview

Core idea: Use eigenvalues/eigenvectors of the graph Laplacian to find clusters in data with non-convex shapes.

When to use: Non-convex clusters, manifold structure, image segmentation, graph data.

Graph Construction

Method Description Result
ε-neighborhood Connect points within radius ε Unweighted graph
k-NN graph Connect to k nearest neighbors Directed/weighted graph
Fully connected Connect all pairs with Gaussian weight Weighted graph

Similarity (affinity) matrix: \(W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)\)

Graph Laplacian Variants

Type Formula Properties
Unnormalized \(L = D - W\) Simple, eigenvector for \(k\) clusters
Normalized (symmetric) \(L_{sym} = D^{-1/2}LD^{-1/2} = I - D^{-1/2}WD^{-1/2}\) Better numerical properties
Normalized (random walk) \(L_{rw} = D^{-1}L = I - D^{-1}W\) Connection to random walks

Where \(D\) is the degree matrix: \(D_{ii} = \sum_j W_{ij}\)

Spectral Clustering Algorithm

  1. Build similarity graph from data
  2. Compute Laplacian matrix \(L\)
  3. Find k smallest eigenvectors of \(L\)
  4. Stack eigenvectors as columns in matrix \(U\)
  5. Run k-means on rows of \(U\)

Python Implementation

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate non-convex data
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# Spectral clustering
sc = SpectralClustering(
    n_clusters=2,
    affinity='rbf',  # or 'nearest_neighbors'
    gamma=1.0,       # kernel bandwidth
    random_state=42
)
labels = sc.fit_predict(X)

# Compare with k-means
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=42)
km_labels = km.fit_predict(X)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
axes[0].set_title('Spectral Clustering')
axes[1].scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis')
axes[1].set_title('K-Means (fails on non-convex)')

Eigenvalue Interpretation

  • 0 eigenvalue: Corresponds to constant eigenvector (all nodes connected)
  • Small eigenvalues: Near-0 eigenvectors indicate connected components
  • Gap in spectrum: Number of clusters ≈ number of near-zero eigenvalues

K-Means vs Spectral Clustering

Aspect K-Means Spectral
Cluster shape Convex only Any shape
Complexity \(O(nkd)\) \(O(n^3)\) for eigendecomposition
Scalability Excellent Poor for large n
Preprocessing Scale features Build affinity matrix

Key Interview Questions

Q: Why does spectral clustering work for non-convex clusters?

A: It transforms data into eigenvector space where the Laplacian's spectral properties separate clusters linearly. The transformation "unfolds" non-convex manifolds into separable representations.

Q: What's the role of the graph Laplacian's eigenvectors?

A: The smallest eigenvectors form an embedding where points in the same cluster are close. Running k-means in this space finds clusters that may be non-convex in original space.

Q: How do you choose the number of clusters k?

A: (1) Look for a gap in the eigenvalue spectrum (eigengap heuristic), (2) Use silhouette score, (3) Domain knowledge. The eigengap method: find k where \(\lambda_{k+1} - \lambda_k\) is largest.

Q: Why is spectral clustering slow for large datasets?

A: Building the full affinity matrix is \(O(n^2)\) memory. Eigendecomposition is \(O(n^3)\). Solutions: use sparse affinity matrices, approximate eigensolvers (Lanczos), or Nyström approximation.

Q: What's the difference between L, L_sym, and L_rw?

A: All find the same clusters in theory. \(L_{sym}\) (symmetric normalized) has better numerical stability. \(L_{rw}\) (random walk) relates to Markov chains on graphs. Always use normalized Laplacian in practice.

Advantages & Disadvantages

Pros: - Handles non-convex clusters - No assumption about cluster shape - Mathematically elegant (spectral graph theory)

Cons: - \(O(n^3)\) complexity - \(O(n^2)\) memory for affinity matrix - Sensitive to affinity parameters (σ, k) - Requires specifying number of clusters


21. Test-Time Compute Scaling (2025 Hot Topic)

Лучшие источники

Blogs: - Inference-Time Compute Scaling Methods — Sebastian Raschka (Mar 2025) — CANONICAL - What is Test-Time Scaling? — Adaline Labs (Apr 2025)

Papers: - The Art of Scaling Test-Time Compute for LLMs — Dec 2025 - s1: Simple Test-Time Scaling — Jan 2025

What is Test-Time Compute?

Definition: Spending more computational resources during inference to improve model outputs, rather than just during training.

Analogy: Humans give better responses when given more time to "think". Similarly, LLMs can improve with techniques that encourage more "thought" during generation.

Two Strategies for Better Reasoning

Strategy When Cost
Train-time compute During training/fine-tuning Fixed upfront cost
Test-time compute During inference Per-query cost, scalable

Methods for Test-Time Scaling

Method Description Trade-off
Chain-of-Thought (CoT) "Think step by step" prompting More tokens = more cost
Best-of-N sampling Generate N responses, pick best N× compute
Majority voting Generate N, take most common answer N× compute
Beam search Explore multiple paths, keep top-k k× compute
Self-correction Model reviews and fixes own output 2× tokens
"Wait" tokens Force model to continue thinking Budget forcing

"Wait" Token Technique (s1 Paper)

def budget_forcing(model, prompt, max_thinking_tokens=1000):
    """Force model to think longer by adding 'Wait' tokens"""
    response = model.generate(prompt)

    while len(response) < max_thinking_tokens:
        # If model tries to stop, inject "Wait" to continue
        if response.endswith("</answer>"):
            response = response.replace("</answer>", "")
            response += "\nWait, let me reconsider..."

        response = model.generate(response, max_new_tokens=100)

    return response

Best-of-N with Verification

def best_of_n_with_verifier(model, verifier, prompt, n=5):
    """Generate N candidates, use verifier to pick best"""
    candidates = [model.generate(prompt) for _ in range(n)]
    scores = [verifier.score(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]

Compute-Optimal Inference

Key insight (Snell et al., 2024): There's an optimal allocation of test-time compute: - Easy problems: little extra compute needed - Hard problems: benefit from more sampling/verification - Very hard problems: diminishing returns

Rule of thumb: \(\text{Test-time budget} \propto \text{problem difficulty}\)

When Test-Time Compute Helps

Scenario Effective? Why
Math problems ✅ Yes Verifiable answers
Code generation ✅ Yes Tests can verify
Creative writing ⚠️ Limited No objective "best"
Simple Q&A ❌ No Overkill
Time-critical apps ❌ No Latency too high

Key Interview Questions

Q: What's the difference between train-time and test-time compute?

A: Train-time: invest compute upfront during training to improve model weights. Test-time: spend compute during inference on each query. Test-time is more flexible (can scale per-problem difficulty) but costs more per query.

Q: When would you use Best-of-N vs. Chain-of-Thought?

A: Best-of-N when you have a verifier (e.g., code tests, math checker). CoT when you want the model to reason explicitly. Both can be combined: generate N CoT responses, pick the most consistent.

Q: What are "wait" tokens in test-time scaling?

A: A technique from the s1 paper where you force the model to continue generating by appending "Wait, let me reconsider" when it tries to stop. This effectively increases inference compute for harder problems.

Q: Why did DeepSeek R1 NOT use explicit test-time scaling?

A: Their model already generates longer responses naturally (implicit scaling). They noted that test-time scaling is typically an application-layer concern and can be added post-hoc without modifying the model.

Q: What's the cost problem with test-time compute?

A: Linear or super-linear increase in inference cost per query. For high-traffic applications, this can be prohibitively expensive. Need to balance accuracy gains vs. latency and cost constraints.

Cost-Benefit Trade-off

More test-time compute →
├── Better accuracy (up to a point)
├── Higher latency (bad for real-time)
├── Higher cost (per query)
└── Diminishing returns for very high compute

Practical guidance: - Start with CoT prompting (free) - Add Best-of-N for verifiable domains (2-5× cost) - Use process reward models for critical applications - Set compute budget based on problem value


22. Information Geometry (Gap Filler 2026)

Best Sources

Blogs: - Information Geometry Part 3: Applications in ML — Ji-Ha Kim (Jun 2025) - Curvature as a Teacher: Information Geometry in Neural Network Learning — Satyam Mishra (Aug 2025) - Information Geometry in Variational Inference — AI Under the Hood

Books: - Amari, Information Geometry and Its Applications — canonical reference - Murphy, Probabilistic Machine Learning — Chapter 10

Core Concept: Probability Manifold

Key insight: Probability distributions form a curved manifold, not a flat Euclidean space.

Concept Euclidean Information Geometry
Space Flat Curved manifold
Distance \(\|x - y\|\) KL divergence
Metric Identity \(I\) Fisher Information \(I(\theta)\)
Gradient \(\nabla L\) Natural gradient \(\tilde{\nabla} L\)

Fisher Information Matrix (FIM)

Definition: \(I(\theta)_{ij} = \mathbb{E}\left[\frac{\partial \log p(x|\theta)}{\partial \theta_i} \frac{\partial \log p(x|\theta)}{\partial \theta_j}\right]\)

Properties: - Positive semi-definite: \(I(\theta) \succeq 0\) - Captures curvature of log-likelihood - KL divergence approximation: \(D_{KL}(p_\theta \| p_{\theta+\delta}) \approx \frac{1}{2} \delta^T I(\theta) \delta\)

Natural Gradient

Standard gradient: \(\nabla L\) — assumes flat Euclidean space

Natural gradient: \(\tilde{\nabla} L = I(\theta)^{-1} \nabla L\)

Update rule: \(\theta_{t+1} = \theta_t - \eta \cdot I(\theta_t)^{-1} \nabla L(\theta_t)\)

Benefits: - Reparametrization invariant: same direction regardless of parameterization - Faster convergence: follows geodesics (shortest paths) - Stable: respects manifold curvature

Intuition: Why Natural Gradient?

Imagine riding a bike on Earth vs. a flat parking lot: - On the parking lot (Euclidean), shortest path is a straight line. - On Earth (curved), shortest path is a great circle.

Natural gradient = GPS navigation that knows the Earth is round.

Approximations for Large Models

Problem: For \(d\) parameters, computing \(I(\theta)^{-1}\) is \(O(d^3)\) — infeasible for deep nets.

Approximation Complexity Trade-off
Diagonal \(O(d)\) Ignores correlations, like Adam
Block-diagonal (K-FAC) \(O(d \cdot k^2)\) Per-layer blocks, Kronecker factors
Low-rank \(O(dr)\) Capture dominant directions
Matrix-free (CG) \(O(d \cdot k)\) Solve \(I(\theta)s = \nabla L\) iteratively

Adam as Approximate Natural Gradient

Adam update: \(\theta_t = \theta_{t-1} - \eta \cdot \frac{g_t}{\sqrt{v_t} + \epsilon}\)

Interpretation: \(\sqrt{v_t}\) (second moment) ≈ diagonal of empirical Fisher.

import torch

def natural_gradient_update(params, grad, fisher_diag, lr=0.01, damping=1e-4):
    """Natural gradient with diagonal Fisher approximation"""
    # F^-1 * grad (element-wise for diagonal)
    nat_grad = grad / (fisher_diag + damping)
    return params - lr * nat_grad

def empirical_fisher_diag(model, data_loader, n_samples=100):
    """Estimate diagonal of empirical Fisher"""
    fisher_diag = {name: torch.zeros_like(p) for name, p in model.named_parameters()}

    for i, (x, y) in enumerate(data_loader):
        if i >= n_samples:
            break

        log_prob = model.log_prob(x, y)
        for name, p in model.named_parameters():
            grad = torch.autograd.grad(log_prob, p, retain_graph=True)[0]
            fisher_diag[name] += grad ** 2

    # Average
    for name in fisher_diag:
        fisher_diag[name] /= n_samples

    return fisher_diag

K-FAC (Kronecker-Factored Approximate Curvature)

For a layer with weights \(W \in \mathbb{R}^{m \times n}\):

\[I_W \approx A \otimes G\]

where: - \(A = \mathbb{E}[aa^T]\) — activation covariance (input) - \(G = \mathbb{E}[gg^T]\) — gradient covariance (output) - \(\otimes\) — Kronecker product

Benefit: Inverting \((mn \times mn)\) matrix → inverting two \((m \times m)\) and \((n \times n)\) matrices.

Applications in ML

Application Role of Information Geometry
VAEs Optimize \(q(z\mid x)\) to minimize KL divergence
Bayesian Neural Networks Natural gradient for posterior inference
Reinforcement Learning (TRPO) Fisher metric constrains policy updates
Optimization Adam ≈ diagonal natural gradient

Key Interview Questions

Q: Why is natural gradient better than standard gradient?

A: (1) Reparametrization invariant — same optimization path regardless of how you parameterize the model. (2) Faster convergence — follows geodesics on the probability manifold. (3) Curvature-aware — adapts to sensitivity of parameters.

Q: How is Adam related to natural gradient?

A: Adam's second moment estimate \(v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\) approximates the diagonal of the empirical Fisher Information Matrix. The update \(g_t / \sqrt{v_t}\) is essentially a natural gradient with diagonal approximation.

Q: What's the computational problem with natural gradient for deep learning?

A: Computing and inverting the full Fisher Information Matrix is \(O(d^2)\) storage and \(O(d^3)\) inversion, where \(d\) is the number of parameters (millions to billions for modern networks). Practical methods use diagonal, block-diagonal (K-FAC), or low-rank approximations.

Q: When would you use K-FAC over Adam?

A: K-FAC when: (1) Training is unstable with Adam, (2) You need faster convergence (fewer steps), (3) Second-order information is valuable (e.g., small datasets, research). Adam is better for: (1) Most practical deep learning, (2) Limited compute, (3) Large-scale training.

Q: What does "curvature" mean in information geometry?

A: Curvature of the statistical manifold defined by the Fisher metric. High curvature regions → parameters more sensitive, need smaller steps. Low curvature → less sensitive, can take larger steps. Natural gradient automatically adapts step sizes based on curvature.

Summary Table

Concept Formula Intuition
Fisher Matrix \(I(\theta) = \mathbb{E}[\nabla \log p \cdot \nabla \log p^T]\) Curvature of log-likelihood
Natural Gradient \(\tilde{\nabla} L = I^{-1} \nabla L\) Steepest descent on manifold
KL ≈ Fisher \(D_{KL} \approx \frac{1}{2} \delta^T I \delta\) Local distance on manifold
Adam \(g / \sqrt{v}\) Diagonal natural gradient
K-FAC \(I \approx A \otimes G\) Layer-wise Kronecker factors

24. Advanced Optimization: Newton, BFGS, L-BFGS (Gap Filler 2026)

Best Sources

Blogs: - Why Not Newton? The Real Reason We're Stuck with SGD — Dang Truong (2025) - Second-Order Optimization Methods — GeeksforGeeks - L-BFGS for Advanced Optimization — NumberAnalytics

Courses: - EPFL Optimization for ML — lecture notes (2025) - Boyd Convex Optimization — canonical book

First-Order vs Second-Order

Method Uses Convergence Cost per Step
SGD Gradient \(\nabla f\) Sub-linear \(O(n)\)
Newton Gradient + Hessian \(\nabla f, H\) Quadratic \(O(n^3)\)
BFGS Approximate inverse Hessian Superlinear \(O(n^2)\)
L-BFGS Limited memory BFGS Superlinear \(O(n)\)

Newton's Method

Update rule (from Taylor expansion): \(\theta_{t+1} = \theta_t - H^{-1} \nabla f(\theta_t)\)

Where: - \(H = \nabla^2 f(\theta)\) — Hessian matrix of second derivatives - \(H_{ij} = \frac{\partial^2 f}{\partial \theta_i \partial \theta_j}\)

Why quadratic convergence? - Taylor expansion: \(f(\theta) \approx f(\theta^*) + \nabla f(\theta^*)(\theta - \theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)\) - Newton step minimizes the quadratic approximation exactly - Near minimum: error roughly squared each iteration

Why NOT Use Newton for Deep Learning?

Level 1: Memory Problem - ResNet-50: \(N = 25\) million parameters - Hessian: \(N \times N = 625\) trillion entries = 2.5 Petabytes - H100 GPU: 80 GB → would need over 31,000 GPUs just to store it!

Level 2: Stochastic Problem - We use mini-batches (64 samples), not full dataset - Mini-batch gradient: reasonable estimate of true gradient - Mini-batch Hessian: garbage — too noisy to trust - Even with L-BFGS, stochastic noise destroys quadratic convergence

Level 3: Non-Convex Problem - Deep learning landscapes: millions of saddle points, sharp valleys - Newton converges to nearest critical point (could be saddle!) - SGD noise helps escape sharp minima → better generalization

Insight: We don't want a scalpel (Newton). We need a jackhammer (SGD).

BFGS: Quasi-Newton Method

Key idea: Approximate \(H^{-1}\) iteratively, don't compute it directly.

Update: \(\theta_{t+1} = \theta_t - \alpha_t M_t \nabla f(\theta_t)\)

Where \(M_t \approx H^{-1}\) is updated using: - \(s_t = \theta_{t+1} - \theta_t\) (parameter change) - \(y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t)\) (gradient change) - Rank-2 update: \(M_{t+1} = (I - \rho s y^T) M_t (I - \rho y s^T) + \rho s s^T\)

Cost: \(O(n^2)\) per step (vs \(O(n^3)\) for Newton)

L-BFGS: Limited-Memory BFGS

Key idea: Don't store \(n \times n\) matrix. Store last \(k\) pairs \((s_i, y_i)\).

Memory: \(O(nk)\) instead of \(O(n^2)\)

Typical: \(k = 10-20\) pairs

import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """Rosenbrock function: classic optimization benchmark"""
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

def rosenbrock_grad(x):
    """Gradient of Rosenbrock"""
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200*(xm-xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1-xm)
    der[0] = -400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0])
    der[-1] = 200*(x[-1]-x[-2]**2)
    return der

# L-BFGS optimization
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
result = minimize(rosenbrock, x0, method='L-BFGS-B', jac=rosenbrock_grad)
print(f"Minimum at: {result.x}")
print(f"Function value: {result.fun:.6f}")
print(f"Iterations: {result.nit}")

Comparison for Deep Learning

| Method | Memory | Works with mini-batches? | Good for DL? |
|---|---|---|---|
| Newton | \(O(n^2)\) | No | ❌ Impossible |
| BFGS | \(O(n^2)\) | Poorly | ❌ Too much memory |
| L-BFGS | \(O(n)\) | Poorly | ⚠️ Sometimes (small models) |
| SGD | \(O(n)\) | Yes | ✅ Standard |
| Adam | \(O(n)\) | Yes | ✅ Best default |

When L-BFGS IS Used

  • Classic ML: SVM, logistic regression (convex problems)
  • Small neural networks: < 10K parameters
  • Fine-tuning: When full-batch is feasible
  • Scipy/PyTorch: scipy.optimize.minimize(method='L-BFGS-B') or torch.optim.LBFGS (see the sketch below)
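A hedged sketch of the PyTorch API: torch.optim.LBFGS is full-batch style and requires a closure that re-evaluates the loss; the toy linear model and data here are made up for illustration.

import torch

model = torch.nn.Linear(10, 1)
X, y = torch.randn(256, 10), torch.randn(256, 1)
opt = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=20, history_size=10)

def closure():
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    return loss

for _ in range(5):          # each step runs up to max_iter L-BFGS iterations internally
    opt.step(closure)
print(f"final loss: {closure().item():.4f}")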

Key Interview Questions

Q: Why doesn't deep learning use Newton's method?

A: Three reasons: (1) Memory: Hessian is \(O(n^2)\) — petabytes for modern networks. (2) Stochastic: Mini-batch Hessians are too noisy, destroying quadratic convergence benefits. (3) Non-convex: Newton can converge to saddles or sharp minima; SGD's noise finds flat, generalizable minima.

Q: What's the difference between BFGS and L-BFGS?

A: BFGS stores full \(n \times n\) inverse Hessian approximation (\(O(n^2)\) memory). L-BFGS stores only last \(k\) gradient/parameter difference pairs (\(O(nk)\) memory). L-BFGS achieves similar convergence with far less memory.

Q: When would you actually use L-BFGS?

A: For convex problems (SVM, logistic regression), small models (<10K params), or when you can afford full-batch gradients. Not suitable for large-scale stochastic deep learning.

Q: If Adam is "fake second-order," what does it approximate?

A: Adam's second moment \(v_t\) approximates the diagonal of the empirical Fisher Information Matrix. This is a very rough curvature estimate — just enough to adapt step sizes per parameter, but missing the full Hessian information.

Q: What's quadratic convergence?

A: Error roughly squares each iteration: \(|\theta_{t+1} - \theta^*| \approx C|\theta_t - \theta^*|^2\). Near minimum, correct digits double each step. Newton has this; SGD doesn't.

Summary Table

| Method | Update rule | Key property |
|---|---|---|
| Newton | \(\theta - H^{-1}\nabla f\) | Quadratic convergence, \(O(n^3)\) per step |
| BFGS | \(\theta - \alpha M_t \nabla f\) | Superlinear, \(O(n^2)\) memory |
| L-BFGS | Same, with implicit \(M_t\) | \(O(n)\) memory, needs full-batch gradients |
| SGD | \(\theta - \eta \nabla f\) | Noisy, good for non-convex problems |
| Adam | \(\theta - \eta \frac{m}{\sqrt{v}+\epsilon}\) | Diagonal approximation, adaptive step sizes |

25. Johnson-Lindenstrauss Lemma & Random Projections

Best Sources

Blogs: - Johnson-Lindenstrauss Lemma & Random Projections — AI Under the Hood - Ultimate Guide to Johnson-Lindenstrauss Lemma — NumberAnalytics

Papers: - Johnson & Lindenstrauss (1984) — original paper - Dasgupta & Gupta (2003) — elementary proof - Achlioptas (2003) — database-friendly sparse projections

Books: - Vempala, The Random Projection Method — canonical reference

Johnson-Lindenstrauss Lemma: Statement

For \(n\) points in \(\mathbb{R}^d\) and \(\varepsilon \in (0,1)\), there exists a linear map \(f: \mathbb{R}^d \to \mathbb{R}^k\) with:

\[k \geq \frac{8 \ln n}{\varepsilon^2}\]

such that for all pairs \(x_i, x_j\):

\[(1 - \varepsilon) \|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \varepsilon) \|x_i - x_j\|^2\]

Key insight: \(k\) depends on \(n\) and \(\varepsilon\), NOT on original dimension \(d\)!

Why This Matters

| Without JL | With JL |
|---|---|
| \(d = 1{,}000{,}000\) | \(k = O(\log n / \varepsilon^2)\) |
| Expensive to store | Compressed storage |
| Slow distance computations | Fast computations |
| Curse of dimensionality | Mitigated |

Random Projection Methods

| Method | Matrix entries | Speed | Notes |
|---|---|---|---|
| Gaussian | \(R_{ij} \sim \mathcal{N}(0, 1/k)\) | Slow | Theoretical guarantee |
| Rademacher | \(R_{ij} = \pm 1/\sqrt{k}\), each with \(p = 0.5\) | Medium | Simpler |
| Sparse (Achlioptas) | \(0\) (prob. 2/3), \(\pm\sqrt{3}/\sqrt{k}\) (prob. 1/6 each) | Fast | ~3× speedup |

Python Implementation

import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

def random_projection_custom(X, k, method='gaussian'):
    """
    Random projection from d dimensions to k dimensions.

    Args:
        X: (n, d) data matrix
        k: target dimension
        method: 'gaussian', 'rademacher', or 'sparse'
    """
    n, d = X.shape

    if method == 'gaussian':
        R = np.random.randn(k, d) / np.sqrt(k)
    elif method == 'rademacher':
        R = np.random.choice([-1, 1], size=(k, d)) / np.sqrt(k)
    elif method == 'sparse':
        # Achlioptas: 0 with prob 2/3, ±√3 with prob 1/6 each
        R = np.random.choice([0, np.sqrt(3), -np.sqrt(3)],
                             size=(k, d), p=[2/3, 1/6, 1/6]) / np.sqrt(k)
    else:
        raise ValueError(f"Unknown method: {method}")

    return X @ R.T

def jl_target_dimension(n, epsilon=0.1):
    """Compute minimum target dimension for JL guarantee."""
    return int(np.ceil(8 * np.log(n) / epsilon**2))

# Example usage
n, d = 1000, 5000  # 1000 points in 5000 dimensions
X = np.random.randn(n, d)

epsilon = 0.1
k = jl_target_dimension(n, epsilon)
print(f"Original: {d}D → Projected: {k}D (for n={n}, ε={epsilon})")

X_proj = random_projection_custom(X, k, method='sparse')

# Verify distance preservation
i, j = 0, 1
orig_dist = np.linalg.norm(X[i] - X[j])
proj_dist = np.linalg.norm(X_proj[i] - X_proj[j])
ratio = proj_dist / orig_dist
print(f"Original dist: {orig_dist:.3f}, Projected: {proj_dist:.3f}, Ratio: {ratio:.3f}")
print(f"Within bounds? {(1-epsilon) <= ratio**2 <= (1+epsilon)}")

# Using sklearn (recommended for production)
transformer = SparseRandomProjection(n_components=k, random_state=42)
X_proj_sklearn = transformer.fit_transform(X)

Applications in ML

| Application | How JL helps |
|---|---|
| k-NN search | Reduce dimensions while keeping approximate neighbors |
| Clustering | Faster k-means in lower dimensions |
| SVM | Kernel approximation for large datasets |
| LSH (Locality-Sensitive Hashing) | Efficient approximate nearest neighbors |
| Compressed sensing | Signal recovery from fewer measurements |
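As a sanity check of the k-NN row, a small sketch comparing exact neighbours in the original space with neighbours found after a sparse random projection (the 1000 → 128 dimensions and the recall metric are illustrative choices, not from the sources above):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.random_projection import SparseRandomProjection

# Clustered toy data: 2000 points, 1000 dimensions, 20 clusters
X, _ = make_blobs(n_samples=2000, n_features=1000, centers=20, random_state=0)

# Exact 10-NN in the original space (first 100 query points)
_, idx_full = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X[:100])

# 10-NN after a sparse random projection to 128 dimensions
X_low = SparseRandomProjection(n_components=128, random_state=0).fit_transform(X)
_, idx_low = NearestNeighbors(n_neighbors=10).fit(X_low).kneighbors(X_low[:100])

# Fraction of exact neighbours recovered in the projected space
recall = np.mean([len(set(a) & set(b)) / 10 for a, b in zip(idx_full, idx_low)])
print(f"10-NN recall after 1000 -> 128 projection: {recall:.2f}")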

JL vs PCA

| Aspect | Random Projection | PCA |
|---|---|---|
| Speed | \(O(nkd)\) | \(O(nd^2)\) for SVD |
| Data-dependent? | No (random matrix) | Yes (eigenvalue decomposition) |
| Preserves | Pairwise distances | Variance |
| Best for | Very high \(d\), streaming data | When variance matters most |
| Guarantees | Theoretical (with high probability) | Optimal for variance |
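A rough timing sketch of the speed row (absolute numbers depend on hardware and BLAS, so treat the comparison qualitatively; the data sizes below are arbitrary):

import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3000))
k = 100

t0 = time.perf_counter()
GaussianRandomProjection(n_components=k, random_state=0).fit_transform(X)
t_rp = time.perf_counter() - t0

t0 = time.perf_counter()
PCA(n_components=k).fit_transform(X)
t_pca = time.perf_counter() - t0

print(f"Random projection: {t_rp:.2f}s   PCA: {t_pca:.2f}s")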

When to Use Random Projections

High-dimensional data (d > 1000)?
├── Need fast dimensionality reduction?
│   └── Random Projection ✅
├── Need interpretable components?
│   └── PCA ✅
├── Streaming / online data?
│   └── Random Projection ✅ (no fit needed)
└── Distance-based algorithms (k-NN, clustering)?
    └── Random Projection ✅ (preserves distances)

Key Interview Questions

Q: What does the Johnson-Lindenstrauss lemma guarantee?

A: For any \(n\) points, we can project to \(k = O(\log n / \varepsilon^2)\) dimensions while preserving all pairwise distances within factor \((1 \pm \varepsilon)\) with high probability. Crucially, \(k\) doesn't depend on original dimension \(d\).

Q: Why is JL useful for ML?

A: (1) Mitigates curse of dimensionality, (2) Speeds up distance-based algorithms (k-NN, clustering), (3) Compresses high-dimensional data (embeddings), (4) Enables efficient approximate nearest neighbor search.

Q: Why use sparse random projections over Gaussian?

A: Sparse matrices (Achlioptas) are faster to compute: many entries are zero, reducing multiplications by ~3×. Still provides same theoretical guarantees with high probability.

Q: When would you use random projection vs PCA?

A: Random projection when: (1) Data dimension is very high (\(d > 1000\)), (2) Need fast online/streaming projection, (3) Don't need interpretable components, (4) Preserving distances is more important than variance. PCA when you need optimal variance preservation and interpretability.

Q: What's the catch with random projections?

A: (1) Adds randomness — results not deterministic without seed, (2) Some distortion (\(\varepsilon\)), (3) Doesn't capture data structure like PCA, (4) Not optimal for any specific dataset — it's a general-purpose tool.

Summary Table

| Concept | Formula | Purpose |
|---|---|---|
| Target dimension | \(k \geq \frac{8 \ln n}{\varepsilon^2}\) | Minimum for the JL guarantee |
| Gaussian projection | \(R_{ij} \sim \mathcal{N}(0, 1/k)\) | Theoretical standard |
| Sparse projection | \(R_{ij} \in \{0, \pm\sqrt{3}\}/\sqrt{k}\) | Fast computation |
| Distance preservation | \((1-\varepsilon)\lVert x \rVert^2 \leq \lVert Rx \rVert^2 \leq (1+\varepsilon)\lVert x \rVert^2\) | Core guarantee |

Misconception: softmax is numerically stable without the max-subtraction trick

For \(x_i \gtrsim 709\) (float64) or \(x_i \gtrsim 88\) (float32), \(e^{x_i}\) overflows to inf -- even float64 does not save you. Always subtract the max: exp(x - max(x)) is mathematically identical but numerically stable. This comes up in 90% of ML interviews.

Misconception: t-SNE can replace PCA for data preprocessing

t-SNE has no inverse transform and no out-of-sample extension (see the sketch below), has \(O(n^2)\) complexity, and gives stochastic results. For preprocessing use PCA or UMAP. t-SNE is only for visualizing small datasets (<50k points). Distances and cluster sizes in a t-SNE plot are meaningless.
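A minimal sketch of why t-SNE does not fit a preprocessing pipeline: sklearn's PCA exposes transform and inverse_transform, while TSNE only offers fit_transform on the full dataset (the toy data shapes here are arbitrary):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.standard_normal((500, 50))
X_new = rng.standard_normal((10, 50))

pca = PCA(n_components=2).fit(X_train)
Z_new = pca.transform(X_new)            # out-of-sample projection: works
X_back = pca.inverse_transform(Z_new)   # approximate reconstruction: works

tsne = TSNE(n_components=2, random_state=0)
Z_train = tsne.fit_transform(X_train)   # only fit_transform on the whole dataset
# tsne.transform(X_new)                 # AttributeError: no out-of-sample extension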

Misconception: Newton's method is better than SGD for deep learning

The Hessian for ResNet-50 (25M parameters) has 625 trillion entries ≈ 2.5 petabytes. Inverting it costs \(O(n^3)\). Even L-BFGS works poorly with stochastic mini-batches. SGD plus its noise helps find flat minima with better generalization.

Connections Between Topics (Cross-References)

graph TD
    A[Calculus<br/>gradient] --> B[Gradient Descent]
    B --> C[Optimizers]
    C --> D[Deep Learning]
    E[Linear Algebra<br/>matrices] --> F[Neural Networks]
    F --> G[Backprop]
    H[Probability<br/>Bayes] --> I[Naive Bayes]
    H --> J[Information Theory]
    J --> K[Entropy]
    L[Statistics<br/>hypothesis testing] --> M[A/B Tests]
    K --> N[Information Gain]
    N --> O[Decision Trees]
    O --> P[Random Forest]
    P --> Q[Classification Metrics]
    R[Regularization<br/>L1/L2] --> S[Gradient Boosting]

    style A fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8eaf6,stroke:#3f51b5
    style H fill:#f3e5f5,stroke:#9c27b0
    style L fill:#f3e5f5,stroke:#9c27b0
    style D fill:#e8f5e9,stroke:#4caf50
    style J fill:#fff3e0,stroke:#ef6c00
    style Q fill:#e8f5e9,stroke:#4caf50

Recommended Study Order

  1. Week 1-2: Linear Algebra + Calculus (3Blue1Brown)
  2. Week 3-4: Probability + Statistics (StatQuest, practice problems)
  3. Week 5: Information Theory (connection to Loss Functions)
  4. Week 6: Classification Metrics + Regularization
  5. Week 7: Ensemble Methods
  6. Week 8: NumPy/Pandas practice (Kaggle)