Mathematics for ML: Study Materials

~35 minutes reading time

Prerequisites: a basic math background (school-level derivatives, matrices, probability)

This section covers 41 problems across 22 topics, from calculus and linear algebra to information geometry and spectral methods. According to Nareshka statistics (2025-2026), 73% of ML math interview questions fall into 5 topics: Statistics (17 problems), Linear Algebra (4), Information Theory (4), Calculus (3), and Probability (3). The materials below cover both the fundamentals and gap fillers for senior positions.

Updated: 2026-02-11


1. Calculus (3 problems)

Best Sources

Courses: - Mathematics for Machine Learning Specialization (Imperial College London) — Linear Algebra, Multivariate Calculus, PCA - MIT 18.06SC Linear Algebra — canonical course - 3Blue1Brown Essence of Calculus — visual intuition

Blogs: - Lil'Log - Calculus for ML — gradients, optimization - d2l.ai Chapter 2 — Calculus for Deep Learning

YouTube: - Andrej Karpathy: micrograd — backprop from scratch - StatQuest: Gradient Descent — intuitive explanation

Key Concepts

Topic Formula Application
Gradient \(\nabla f = [\frac{\partial f}{\partial x_1}, ...]\) Direction of steepest ascent
Sigmoid' \(\sigma'(x) = \sigma(x)(1-\sigma(x))\) Backprop
ReLU' \(\text{ReLU}'(x) = 1\) if \(x>0\) else \(0\) Backprop
MSE gradient \(\frac{\partial L}{\partial \hat{y}} = \frac{2}{n}(\hat{y}-y)\) Regression
BCE gradient \(\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\) Classification
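A quick way to internalize these formulas is to check them against finite differences. A minimal sketch (NumPy only; the toy values are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.7, 1e-6
analytic = sigmoid(x) * (1 - sigmoid(x))                      # sigma'(x) from the table
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central finite difference
print(abs(analytic - numeric))                                # ~1e-11

# MSE gradient from the table: dL/dy_hat = 2/n * (y_hat - y)
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.8, 0.2, 0.6])
mse_grad = 2 / len(y) * (y_hat - y)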

2. Linear Algebra (4 problems)

Best Sources

Books: - Mathematics for Machine Learning (Deisenroth, Faisal, Ong) — free, Chapters 2-4 - Deep Learning Book - Linear Algebra (Goodfellow) — Chapter 2

Courses: - MIT 18.06 Linear Algebra — Gilbert Strang - Linear Algebra for ML (Coursera)

YouTube: - 3Blue1Brown Essence of Linear Algebra — MUST WATCH

Key Concepts

Topic Formula Application
L2 norm \(\|x\|_2 = \sqrt{\sum_i x_i^2}\) Regularization
Softmax \(\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\) Multiclass classification
Cosine similarity \(\cos(a,b) = \frac{a \cdot b}{\|a\|\|b\|}\) Embeddings
SVD \(X = U\Sigma V^T\) PCA, recommender systems

Trick for softmax stability:

def softmax(x):
    x_max = np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x - x_max)  # subtract max for numerical stability
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
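The SVD row above maps directly onto PCA: the principal directions are the right singular vectors of the centered data matrix. A minimal sketch (NumPy only; the toy data is illustrative):

import numpy as np

X = np.random.randn(100, 5)            # toy data: 100 samples, 5 features
X_centered = X - X.mean(axis=0)        # PCA assumes centered data

# SVD: X = U @ diag(S) @ Vt; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

X_pca = X_centered @ Vt[:2].T          # project onto the top-2 components
explained_var = S**2 / (len(X) - 1)    # eigenvalues of the covariance matrix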


3. Probability (3 problems)

Best Sources

Books: - Think Stats 2e — free, practical approach - Probability for Machine Learning — interactive course

Blogs: - Seeing Theory — probability visualizations - Lil'Log: Probability

Key Concepts

Topic Formula Application
Bayes \(P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}\) Naive Bayes, belief updating
A/B test sample size \(n = \frac{(z_{\alpha/2}+z_\beta)^2 \cdot 2p(1-p)}{\delta^2}\) Experiments
Laplace smoothing \(P(w \mid c) = \frac{\text{count}(w,c)+1}{\text{count}(c)+\lvert V \rvert}\) Naive Bayes
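The A/B sample-size formula from the table is easy to turn into a helper. A minimal sketch (assuming SciPy; the baseline rate and effect size are illustrative):

from scipy.stats import norm

def ab_sample_size(p, delta, alpha=0.05, power=0.8):
    """Per-group sample size from the formula above (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # z_{alpha/2}
    z_beta = norm.ppf(power)            # z_{beta}
    return (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2

# Detect a 2 percentage-point lift from a 10% baseline conversion rate
print(round(ab_sample_size(p=0.10, delta=0.02)))  # ~3,530 per group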

4. Statistics (17 problems)

Best Sources

Courses: - Harvard Stats 110 — Probability & Statistics (Joe Blitzstein) - Khan Academy Statistics

Books: - Practical Statistics for Data Scientists — practical focus - ISLR Chapters 2-3 — Statistical Learning

Blogs: - StatQuest (Josh Starmer) — MUST WATCH for intuition - Seeing Theory — visualizations

Key Concepts

Topic Formula Interview Importance
CLT \(\bar{X} \sim N(\mu, \sigma^2/n)\) HIGH
p-value \(P(D \mid H_0)\) HIGH
t-statistic \(t = \frac{\bar{x}-\mu}{s/\sqrt{n}}\) HIGH
Bootstrap CI \([\hat{\theta}_{\alpha/2}, \hat{\theta}_{1-\alpha/2}]\) MEDIUM
VIF \(\text{VIF}_j = \frac{1}{1-R_j^2}\) MEDIUM
KS test \(D = \sup_x \lvert F_n(x)-F(x) \rvert\) LOW

Bootstrap algorithm:

def bootstrap_ci(data, stat_fn, n_bootstrap=10000, alpha=0.05):
    stats = [stat_fn(np.random.choice(data, len(data), replace=True))
             for _ in range(n_bootstrap)]
    lower = np.percentile(stats, 100 * alpha/2)
    upper = np.percentile(stats, 100 * (1-alpha/2))
    return lower, upper
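A minimal usage example for bootstrap_ci above (assuming NumPy is imported as np; the skewed toy sample is illustrative):

data = np.random.exponential(scale=2.0, size=500)   # skewed toy sample
lower, upper = bootstrap_ci(data, np.mean)           # 95% CI for the mean
print(f"mean = {data.mean():.2f}, CI = [{lower:.2f}, {upper:.2f}]")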


5. Information Theory (4 problems)

Best Sources

Books: - Information Theory, Inference, and Learning Algorithms (MacKay) — free - Elements of Information Theory (Cover & Thomas)

Blogs: - Lil'Log: Information Theory - Count Bayesie: Entropy

Key Concepts

Topic Formula Application
Entropy \(H(X) = -\sum p_i \log_2 p_i\) Decision trees
Cross-Entropy \(H(p,q) = -\sum p_i \log q_i\) Loss function
KL Divergence \(D_{KL}(P\|Q) = \sum p_i \log \frac{p_i}{q_i}\) VAE, distribution matching
Information Gain \(IG(S,A) = H(S) - \sum \frac{\lvert S_v\rvert}{\lvert S \rvert}H(S_v)\) Decision trees
Gini \(G = 1 - \sum p_k^2\) Random Forest

Entropy vs Gini:

Entropy: more sensitive to the class distribution, log scale
Gini: computationally cheaper (no log), the default in sklearn
In practice the difference is minimal
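A minimal sketch computing the impurity measures and KL divergence from the table above (NumPy only; the distributions are illustrative):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1 - np.sum(np.asarray(p, dtype=float) ** 2)

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))    # assumes p, q strictly positive

p = [0.5, 0.5]
print(entropy(p), gini(p))                     # 1.0, 0.5 -- maximal impurity for 2 classes
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.368 nats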


6. Classification Metrics (2 problems)

Best Sources

Blogs: - Google ML Crash Course: Classification - Scikit-learn: Metrics

YouTube: - StatQuest: ROC & AUC

Key Concepts

Metric Formula When to Use
Precision \(\frac{TP}{TP+FP}\) False positives are costly
Recall \(\frac{TP}{TP+FN}\) False negatives are costly
F1 \(\frac{2 \cdot P \cdot R}{P+R}\) Balance between P and R
ROC-AUC Area under TPR/FPR curve Ranking quality
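A minimal sketch computing these metrics with scikit-learn (the toy labels and scores are illustrative):

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard labels at threshold 0.5
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))          # harmonic mean = 0.75
print(roc_auc_score(y_true, y_score))    # uses scores, not hard labels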

7. Regularization (3 problems)

Best Sources

Blogs: - Regularization in ML - Dropout Paper

YouTube: - StatQuest: Ridge & Lasso

Key Concepts

Method Effect When to Use
L1 (Lasso) Sparsity, feature selection Many irrelevant features
L2 (Ridge) Shrinkage, small weights Multicollinearity
ElasticNet L1 + L2 Combination of both
Dropout Random deactivation Deep learning, overfitting
BatchNorm Normalize activations Faster training, regularization

L1 vs L2 geometry:

L1: diamond-shaped constraint region → corners on the axes → sparse solutions
L2: circular constraint region → weights shrink close to 0, but not exactly 0
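The sparsity effect is easy to see on a toy regression where most features are irrelevant. A minimal sketch assuming scikit-learn (the dataset and alpha values are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero weights:", np.sum(lasso.coef_ == 0))   # most are exactly 0
print("Ridge zero weights:", np.sum(ridge.coef_ == 0))   # typically 0: shrunk, not zeroed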


8. Ensemble Methods (2 problems)

Best Sources

Books: - ESL Chapters 8-10 — Boosting, Trees - XGBoost Documentation

Blogs: - Gradient Boosting from scratch - CatBoost vs XGBoost vs LightGBM

Key Concepts

Method Key Idea Bias-Variance
Bagging Parallel, bootstrap samples Reduces variance
Boosting Sequential, fit residuals Reduces bias
Random Forest Bagging + feature randomization Low variance
GBDT Gradient descent in function space Low bias

RF vs GBDT:

RF: parallel, robust to overfitting, simple
GBDT: sequential, more powerful, needs tuning
In practice: start with RF, then move to XGBoost/LightGBM
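A minimal sketch comparing the two on a synthetic dataset (assuming scikit-learn; hyperparameters are illustrative defaults):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)        # bagging
gbdt = GradientBoostingClassifier(n_estimators=200, random_state=0)  # boosting

print("RF  :", cross_val_score(rf, X, y, cv=5).mean())
print("GBDT:", cross_val_score(gbdt, X, y, cv=5).mean())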


9. NumPy & Pandas (3 problems)

Best Sources

Documentation: - NumPy User Guide - Pandas User Guide

Cheat Sheets: - NumPy Cheat Sheet - Pandas Cheat Sheet

Key Concepts

Operation Example Key Point
loc vs iloc df.loc['A':'B'] vs df.iloc[0:2] loc inclusive, iloc exclusive
groupby.agg df.groupby('cat').agg({'val': ['mean', 'std']}) Multiple aggregations
merge pd.merge(df1, df2, on='key', how='left') SQL-style JOINs
transform df.groupby('cat')['val'].transform('mean') Same shape as input
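A minimal sketch of the operations from the table (assuming pandas; the toy frame is illustrative):

import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b", "b", "b"],
                   "val": [1, 3, 2, 4, 6]})

summary = df.groupby("cat").agg({"val": ["mean", "std"]})   # one row per group

# transform keeps the original shape -- handy for group-wise normalization
df["val_centered"] = df["val"] - df.groupby("cat")["val"].transform("mean")

first_two = df.iloc[0:2]   # position-based, end-exclusive
by_label = df.loc[0:1]     # label-based, end-inclusive (same rows here)
print(df)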

10. LLM Scaling Laws (Gap Filler 2026)

Best Sources

Papers: - Scaling Laws for Neural Language Models (Kaplan et al., 2020) — the first scaling laws - Training Compute-Optimal Large Language Models (Chinchilla, Hoffmann et al., 2022) — compute-optimal - Beyond Chinchilla (Porian et al., 2025) — includes inference costs

Blogs: - LLM Scaling Laws 2026: Complete Guide — AIMultiple (Feb 2026) - The Chinchilla Papers Implications — LessWrong

Key Formulas

Kaplan (2020): \(L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}\)

Loss decreases as power law with model size.

Chinchilla (2022): \(L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E\)

Where: - \(N\) = number of parameters - \(D\) = training tokens - \(\alpha \approx 0.34, \beta \approx 0.28, E \approx 1.69\)

Compute-optimal: \(N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}\)

Rule of thumb: ~20 tokens per parameter for compute-optimal training.

Evolution 2020-2026

Year Paper Key Insight
2020 Kaplan Power-law scaling with model size
2022 Chinchilla Compute-optimal: balance N and D
2024 Llama 3 Overtrain small models (15T tokens on 8B)
2025 Beyond Chinchilla Include inference cost in optimization
2025 Sloth Latent skills for benchmark prediction
2025 Densing Law Capability density rising exponentially

Practical Implications

# Chinchilla-optimal tokens for model size
def compute_optimal_tokens(params_billions: float) -> float:
    """Calculate Chinchilla-optimal training tokens"""
    # Rule: 20 tokens per parameter
    return params_billions * 1e9 * 20

# Example: 7B model
tokens_7b = compute_optimal_tokens(7)  # 140B tokens

# But Llama 3 trained 15T tokens on 8B = 1875 tokens/param
# Why? Inference cost dominates at scale

Interview Questions: 1. "Why did Llama 3 overtrain 8B model with 15T tokens instead of Chinchilla-optimal ~160B?" 2. "How does inference cost change the optimal training strategy?" 3. "What is Densing Law and why does it matter for model selection?"


11. VC Dimension & PAC Learning (Gap Filler 2026)

Best Sources

Books: - Understanding Machine Learning (Shalev-Shwartz & Ben-David) — free - Foundations of Machine Learning (Mohri) — Bloomberg course

Blogs: - PAC Theory and VC Dimension — Cole Mei (2025) - VC Dimension Deep Dive — Tang Shusen

Key Concepts

PAC Learning (Probably Approximately Correct): \(\Pr\left[ |E_{in}(g) - E_{out}(g)| > \epsilon \right] \leq \delta\)

Where: - \(E_{in}(g)\) = training error (empirical risk) - \(E_{out}(g)\) = test error (true risk) - \(\epsilon\) = accuracy parameter - \(\delta\) = confidence parameter

VC Dimension Definition: - The maximum number of points that can be shattered by hypothesis class \(\mathcal{H}\) - "Shattered" = all \(2^n\) labelings possible - For linear classifiers in \(\mathbb{R}^d\): VC dimension = \(d + 1\)

Generalization Bound: \(E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{8}{N} \ln \left( \frac{4(2N)^{VC(\mathcal{H})}}{\delta} \right)}\)

Trade-off Diagram

graph LR
    subgraph "VC Dimension Trade-off"
        LOW["Low VC dim<br/>High E_in<br/>Low Omega<br/>(underfitting)"]
        OPT["Optimal<br/>Balance E_in<br/>and Omega"]
        HIGH["High VC dim<br/>Low E_in<br/>High Omega<br/>(overfitting)"]
    end
    LOW --> OPT --> HIGH

    style LOW fill:#e8eaf6,stroke:#3f51b5
    style OPT fill:#e8f5e9,stroke:#4caf50
    style HIGH fill:#fce4ec,stroke:#c62828
  • Low VC dim: high \(E_{in}\), low \(\Omega\) (underfitting)
  • High VC dim: low \(E_{in}\), high \(\Omega\) (overfitting)
  • Optimal: balance \(E_{in}\) and \(\Omega\)
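A minimal sketch plugging numbers into the generalization bound above (NumPy only; the sample size and VC dimension are illustrative):

import numpy as np

def vc_complexity_term(n, vc_dim, delta=0.05):
    """Omega term from the generalization bound: E_out <= E_in + Omega."""
    return np.sqrt(8.0 / n * np.log(4.0 * (2.0 * n) ** vc_dim / delta))

# Linear classifier in R^10 (VC dim = 11) trained on 10,000 samples
print(vc_complexity_term(n=10_000, vc_dim=11))   # ~0.30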

Interview Questions

Q: What is the VC dimension of a decision tree?

A: Theoretically infinite (can shatter any dataset with enough depth). In practice, bounded by number of leaves.

Q: Why is VC dimension useful?

A: It provides theoretical guarantees on generalization without knowing the true distribution.

Q: What's the VC dimension of a linear classifier in 2D?

A: 3 (can shatter 3 non-collinear points, but not 4 in general position).


12. Causal Inference Basics (Gap Filler 2026)

Best Sources

Books: - Causal Inference: The Mixtape (Cunningham) — free - Elements of Causal Inference (Peters et al.)

Blogs: - Causal Inference Interview Questions — InterviewGemini (2025)

Key Concepts

Association vs Causation: - Association: \(P(Y|X)\) — statistical relationship - Causation: \(P(Y|do(X))\) — intervention changes distribution

Potential Outcomes Framework: \(\tau_i = Y_i(1) - Y_i(0)\)

Where: - \(Y_i(1)\) = outcome if treated - \(Y_i(0)\) = outcome if not treated - We only observe one!

ATE (Average Treatment Effect): \(\text{ATE} = \mathbb{E}[Y(1) - Y(0)]\)

Confounding & DAGs

Confounding: Third variable affects both treatment and outcome.

Ice Cream Sales ← Temperature → Crime Rate
(Spurious correlation, not causal!)

DAG (Directed Acyclic Graph): Visual representation of causal relationships.

Methods for Observational Studies

Method Key Idea When to Use
Propensity Score Matching Match similar treated/untreated Selection on observables
Regression Discontinuity Exploit cutoff in treatment Sharp threshold exists
Instrumental Variables Use exogenous variation Hidden confounders
Difference-in-Differences Compare trends over time Panel data available
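A minimal simulation illustrating confounding and why adjustment matters (assuming NumPy, pandas, and statsmodels; the data-generating process is an illustrative assumption):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
conf = rng.normal(size=n)                              # confounder: affects both T and Y
treat = (rng.normal(size=n) + conf > 0).astype(int)    # treatment depends on confounder
y = 2.0 * treat + 3.0 * conf + rng.normal(size=n)      # true ATE = 2.0

df = pd.DataFrame({"treat": treat, "conf": conf, "y": y})

# Naive difference in means is biased upward by the confounder (~5.4 here)
naive_ate = df.loc[df.treat == 1, "y"].mean() - df.loc[df.treat == 0, "y"].mean()

# Adjusting for the observed confounder via regression recovers ~2.0
ols = sm.OLS(df["y"], sm.add_constant(df[["treat", "conf"]])).fit()
print(naive_ate, ols.params["treat"])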

Interview Questions

Q: Explain the difference between RCTs and observational studies.

A: RCTs randomize treatment → no confounding by design. Observational studies observe natural treatment assignment → must control for confounders.

Q: What is SUTVA?

A: Stable Unit Treatment Value Assumption — treatment of one unit doesn't affect outcomes of other units. Violated in network effects, spillovers.

Q: What are the key assumptions for causal inference?

A: (1) Consistency, (2) Ignorability (no unmeasured confounding), (3) Positivity (overlap), (4) SUTVA.

Q: How do you assess validity of causal claims?

A: (1) Temporal precedence (cause before effect), (2) Covariation, (3) No spurious correlation, (4) Plausible mechanism, (5) Sensitivity analysis.


13. MCMC Sampling Methods (Gap Filler 2026)

Best Sources

Books: - Bayesian Data Analysis (Gelman et al.) — Chapter 11 - MCMC Handbook — free

Blogs: - Mastering MCMC Methods — NumberAnalytics - MCMC for ML — Lilian Weng

Key Algorithms

Algorithm Key Idea When to Use
Metropolis-Hastings Propose → Accept/Reject General purpose
Gibbs Sampling Sample each variable conditionally Known conditionals
Hamiltonian MC Use gradients for efficient proposals High-dimensional
Slice Sampling Sample under density curve No tuning needed

Metropolis-Hastings Algorithm

import numpy as np

def metropolis_hastings(target_pdf, n_samples, theta_init=0.0, proposal_std=1.0):
    """Random-walk Metropolis-Hastings sampler (symmetric Gaussian proposal)."""
    samples = []
    theta = theta_init

    for _ in range(n_samples):
        # Symmetric proposal, so q cancels in the acceptance ratio
        theta_prop = theta + np.random.normal(scale=proposal_std)
        alpha = min(1, target_pdf(theta_prop) / target_pdf(theta))

        if np.random.random() < alpha:
            theta = theta_prop  # Accept (otherwise keep the current value)

        samples.append(theta)

    return np.array(samples)

Acceptance Probability: \(\alpha = \min\left(1, \frac{\pi(\theta') q(\theta|\theta')}{\pi(\theta) q(\theta'|\theta)}\right)\)

Convergence Diagnostics

Diagnostic What to Check Good Sign
Trace Plot Stationary around mean No drift, good mixing
ACF Autocorrelation decay Quick decay
ESS Effective Sample Size ESS > 400
\(\hat{R}\) Gelman-Rubin statistic \(\hat{R} < 1.1\)

Effective Sample Size: \(ESS = \frac{N}{1 + 2\sum_{lag=1}^{\infty} \rho(lag)}\)

Interview Questions

Q: Why use MCMC instead of analytical solutions?

A: Many posteriors are intractable (no closed form). MCMC samples without computing the normalizing constant.

Q: What's a good acceptance rate for MH?

A: 20-40% for random walk proposals. Too high = slow exploration, too low = wasted computation.


14. Time Series Statistics (Gap Filler 2026)

Best Sources

Books: - Time Series Analysis and Its Applications (Shumway & Stoffer) — free - Forecasting: Principles and Practice (Hyndman) — online

Blogs: - Quant Interview FAQ: Time Series — BagelQuant (2025)

Stationarity

Weak Stationarity: \(E[y_t] = \mu, \quad Var(y_t) = \sigma^2, \quad Cov(y_t, y_{t-k}) = \gamma_k\)

Tests:

| Test | H₀ | H₁ | Stationary if |
|------|-----|-----|---------------|
| ADF | Unit root | Stationary | p < 0.05 |
| KPSS | Stationary | Unit root | p > 0.05 |

ARIMA Model

ARIMA(p, d, q): - \(p\) = AR lags (autoregressive) - \(d\) = differencing order (integration) - \(q\) = MA lags (moving average)

Model equations:

\[\text{AR(p): } y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t\]

\[\text{MA(q): } y_t = c + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \epsilon_t\]

ACF/PACF for Model Selection

Pattern Indicates Order
ACF cuts off after lag q MA(q) q lags
PACF cuts off after lag p AR(p) p lags
Both tail off ARMA Use AIC/BIC
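A minimal sketch of the ADF/KPSS tests on a simulated random walk (assuming statsmodels; the series is illustrative):

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(42)
random_walk = np.cumsum(rng.normal(size=500))       # unit root -> non-stationary

adf_p = adfuller(random_walk)[1]
kpss_p = kpss(random_walk, regression="c", nlags="auto")[1]
print(f"ADF p={adf_p:.3f} (large -> cannot reject unit root)")
print(f"KPSS p={kpss_p:.3f} (small -> reject stationarity)")

# Differencing once typically makes a random walk stationary
print(f"ADF after differencing: p={adfuller(np.diff(random_walk))[1]:.4f}")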

Key Interview Questions

Q: How do you make a non-stationary series stationary?

A: (1) Differencing (once or more), (2) Log transformation (stabilize variance), (3) Detrending.

Q: What is cointegration?

A: Two non-stationary series are cointegrated if a linear combination is stationary. Basis for pairs trading.

Q: AR vs MA vs ARMA?

A: AR models persistence in levels. MA models shock propagation. ARMA combines both.

Common Pitfalls

  1. Overfitting with excessive lags
  2. Ignoring non-stationarity
  3. Misinterpreting correlation as causation
  4. Not using robust standard errors
  5. Ignoring structural breaks

15. Matrix Calculus: Jacobian & Hessian (Gap Filler 2026)

Best Sources

Tutorials: - DataCamp: What is the Hessian Matrix — comprehensive guide with Python - GeeksforGeeks: Jacobian and Hessian Matrix — formulas and applications

Jacobian Matrix

Definition: For \(f: \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian is: \(J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\)

Example: \(f(x,y) = [x^2 + y, xy]\) → \(J = \begin{bmatrix} 2x & 1 \\ y & x \end{bmatrix}\)

Hessian Matrix

Definition: For \(f: \mathbb{R}^n \to \mathbb{R}\), the Hessian is: \(H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}\)

Critical Point Classification:

| Hessian Eigenvalues | Type | Behavior |
|---------------------|------|----------|
| All positive | Minimum | Convex locally |
| All negative | Maximum | Concave locally |
| Mixed signs | Saddle | Neither |
| Some zero | Inconclusive | Need higher-order |

Second Derivative Test: \(D = \begin{vmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{vmatrix}\)

  • \(D > 0, f_{xx} > 0\): Local minimum
  • \(D > 0, f_{xx} < 0\): Local maximum
  • \(D < 0\): Saddle point

Comparison Table

Concept Input → Output Matrix Shape Use Case
Gradient \(\mathbb{R}^n \to \mathbb{R}\) \(n \times 1\) First-order optimization
Jacobian \(\mathbb{R}^n \to \mathbb{R}^m\) \(m \times n\) Backprop, change of variables
Hessian \(\mathbb{R}^n \to \mathbb{R}\) \(n \times n\) Second-order optimization

Trace Tricks for Matrix Derivatives

\[\nabla_A \text{tr}(AB) = B^T\]
\[\nabla_A \text{tr}(A^TBA) = (B + B^T)A\]

Example Interview Question:

Q: Derive \(\nabla_X \text{tr}(X^TAX)\)

A: Using trace properties and symmetry: $\(\nabla_X \text{tr}(X^TAX) = AX + A^TX\)$

If \(A\) is symmetric: \(\nabla_X = 2AX\)
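A finite-difference check of this identity, as a minimal sketch (NumPy only; matrix sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3))

analytic = A @ X + A.T @ X            # claimed gradient of tr(X^T A X)

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        numeric[i, j] = (np.trace(Xp.T @ A @ Xp) - np.trace(Xm.T @ A @ Xm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-9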

Applications in Neural Networks

Hessian in optimization: - Newton's method: \(\theta_{t+1} = \theta_t - H^{-1}\nabla L\) - In practice: Hessian too expensive, use approximations (BFGS, L-BFGS)

Jacobian in backprop: - Chain rule through layers: \(\frac{\partial L}{\partial W} = J_{output} \cdot J_{hidden} \cdot ...\)

Python Implementation

import numpy as np

def numerical_jacobian(f, x, eps=1e-8):
    """Compute Jacobian numerically"""
    n, m = len(x), len(f(x))
    J = np.zeros((m, n))
    for i in range(n):
        x_plus = x.copy()
        x_plus[i] += eps
        J[:, i] = (f(x_plus) - f(x)) / eps
    return J

def numerical_hessian(f, x, eps=1e-5):
    """Compute Hessian numerically"""
    n = len(x)
    H = np.zeros((n, n))
    fx = f(x)
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
            x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
            x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
            x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps**2)
    return H

# Example
def f(x):
    return x[0]**2 + x[1]**2  # Simple quadratic

x = np.array([1.0, 2.0])
H = numerical_hessian(f, x)
print("Hessian:", H)
# [[2. 0.]
#  [0. 2.]] → Positive definite (minimum at origin)

Key Interview Questions

Q: Why is Newton's method rarely used in deep learning?

A: (1) Hessian is \(O(n^2)\) memory for \(n\) parameters, (2) Inverting Hessian is \(O(n^3)\), (3) For large networks (millions of params), this is intractable.

Q: What's the relationship between Hessian and conditioning?

A: The condition number of the Hessian (ratio of largest to smallest eigenvalue) determines optimization difficulty. High condition number = ill-conditioned = slow convergence.

Q: How does Jacobian relate to backpropagation?

A: Backprop computes vector-Jacobian products (VJP) efficiently: \(\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot J\). PyTorch/TF compute VJPs without materializing full Jacobians.


16. Numerical Stability (Gap Filler 2026)

Best Sources

Tutorials: - Mastering Softmax with Numerical Stability — Best AI Tools - Gradient Clipping in PyTorch — GeeksforGeeks

Log-Sum-Exp Trick

Problem: Computing \(\log(\sum_i e^{x_i})\) directly causes overflow/underflow.

Solution: \(\log\left(\sum_i e^{x_i}\right) = c + \log\left(\sum_i e^{x_i - c}\right) \text{ where } c = \max_i x_i\)

Why it works: - \(c\) shifts all values so maximum becomes 0 - \(e^{x_i - c} \leq 1\) → no overflow - At least one term = \(e^0 = 1\) → sum \(\geq 1\) → no underflow
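A minimal sketch of the trick (assuming SciPy for the reference implementation; the values are chosen to overflow naive exponentiation):

import numpy as np
from scipy.special import logsumexp

x = np.array([1000.0, 1001.0, 1002.0])

naive = np.log(np.sum(np.exp(x)))           # exp(1000) overflows to inf (with a warning)
c = np.max(x)
stable = c + np.log(np.sum(np.exp(x - c)))  # log-sum-exp trick from above
print(naive, stable, logsumexp(x))          # inf, ~1002.41, ~1002.41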

Softmax Numerical Stability

Naive (unstable):

def softmax_unstable(x):
    exp_x = np.exp(x)  # Can overflow!
    return exp_x / np.sum(exp_x)

Stable version:

def softmax_stable(x):
    x_shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x_shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Gradient Clipping Methods

Method PyTorch Function What it does
Clip by Value clip_grad_value_(params, clip_value) Clips each gradient to \([-v, v]\)
Clip by Norm clip_grad_norm_(params, max_norm) Scales gradient vector to max norm
Clip by Hook register_hook(lambda g: torch.clamp(g, min, max)) Custom clipping logic

When to use which: - Clip by norm: Standard for RNNs/Transformers (preserves direction) - Clip by value: When you need component-wise limits - Clip by hook: Asymmetric clipping, layer-specific

Gradient Clipping Code

import torch
import torch.nn as nn

# Method 1: Clip by norm (most common)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Method 2: Clip by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

# Method 3: Custom hook for asymmetric clipping
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -0.1, 1.0))

Common Numerical Issues

Issue Cause Solution
NaN in softmax Overflow from large inputs Log-sum-exp trick
Exploding gradients Deep networks, RNNs Gradient clipping
Vanishing gradients Sigmoid/tanh in deep nets ReLU, residual connections
Division by zero Sum of exponentials \(\approx 0\) Add epsilon (\(10^{-8}\))
Log(0) Computing \(\log(0)\) Use np.log(x + eps) or np.logaddexp

FP16/BF16 Considerations

Mixed precision training:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()
    with autocast():
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # Unscale before clipping!
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

Key Interview Questions

Q: Why subtract max before softmax instead of just using float64?

A: Float64 only delays overflow. Even with 64-bit, \(e^{1000}\) overflows. The log-sum-exp trick is mathematically exact and works with any precision.

Q: Clip by value vs clip by norm — when to use which?

A: Clip by norm preserves gradient direction (preferred for optimization). Clip by value can change direction, useful when certain parameters shouldn't exceed a threshold regardless of others.

Q: How does gradient clipping affect convergence?

A: Clipping acts as a regularizer. Too aggressive clipping → slow learning. Too weak → instability. Typical values: max_norm=1.0 for transformers, 5.0 for RNNs.

Q: What's the difference between exploding and vanishing gradients?

A: Exploding: gradients grow exponentially through layers → NaN weights. Vanishing: gradients shrink to near-zero → no learning in early layers. Solutions differ: clipping for exploding, ReLU/residuals/architecture changes for vanishing.


17. Dimensionality Reduction: t-SNE & UMAP (Gap Filler 2026)

Best Sources

Tutorials: - PCA vs t-SNE vs UMAP — Medium (May 2025) - How to Use t-SNE Effectively — Distill.pub (canonical) - UMAP Original Paper — McInnes et al. (2018)

Comparison Table

Feature PCA t-SNE UMAP
Type Linear Non-linear Non-linear
Preserves Global structure Local neighborhoods Local + some global
Speed Fast Slow Fast
Scalability Excellent Poor (\(O(n^2)\)) Good (\(O(n \log n)\))
Use in ML Models ✅ Yes ❌ No (viz only) ✅ Yes
Inverse Transform ✅ Yes ❌ No ❌ No
Interpretability ✅ High ❌ Low ⚠️ Medium
Deterministic ✅ Yes ❌ No (random init) ❌ No

t-SNE (t-Distributed Stochastic Neighbor Embedding)

How it works: 1. Compute pairwise similarities in high-dim space (Gaussian) 2. Model similarities in low-dim space (Student-t with heavy tails) 3. Minimize KL divergence between both distributions

Key hyperparameters: - perplexity (5-50): Effective number of neighbors. Lower = more local structure. - n_iter (250-1000): More iterations = more stable. - learning_rate (10-1000): Too high = points cluster in ball; too low = compressed cloud.

When to use t-SNE: - Cluster visualization (CNN embeddings, single-cell RNA) - Exploratory data analysis - Small to medium datasets (< 50k samples)

UMAP (Uniform Manifold Approximation and Projection)

How it works: 1. Build fuzzy simplicial set (graph) in high-dim space 2. Build similar graph in low-dim space 3. Minimize cross-entropy between fuzzy set representations

Key hyperparameters: - n_neighbors (5-200): Local vs global structure balance - min_dist (0.0-0.99): How tightly points cluster - metric: Distance metric (euclidean, cosine, etc.)

When to use UMAP: - Large datasets (> 50k samples) - When you need speed - When global structure matters - Feature extraction before ML

Python Implementation

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt

# Load and scale data
X = StandardScaler().fit_transform(load_iris().data)
y = load_iris().target

# PCA (fast, linear)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"PCA explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# t-SNE (slow, non-linear, visualization)
tsne = TSNE(n_components=2, perplexity=30, n_iter=300, random_state=42)
X_tsne = tsne.fit_transform(X)

# UMAP (fast, non-linear, good for ML)
umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = umap_model.fit_transform(X)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, data, title in [(axes[0], X_pca, 'PCA'),
                         (axes[1], X_tsne, 't-SNE'),
                         (axes[2], X_umap, 'UMAP')]:
    ax.scatter(data[:, 0], data[:, 1], c=y, cmap='viridis', alpha=0.7)
    ax.set_title(title)

Decision Guide

Need dimensionality reduction?
├── For ML model input ➝ PCA or UMAP
│   ├── Interpretability needed? ➝ PCA
│   └── Non-linear patterns? ➝ UMAP
├── For visualization?
│   ├── Small dataset (<50k) ➝ t-SNE
│   └── Large dataset (>50k) ➝ UMAP
└── Need inverse transform? ➝ PCA only

Key Interview Questions

Q: Why does t-SNE use Student-t distribution in low-dim space?

A: Heavy tails of Student-t alleviate "crowding problem" — in high dimensions, there's more "room" for points. Gaussian in low-dim would compress all points together. Heavy tails push dissimilar points apart.

Q: What's the perplexity parameter in t-SNE?

A: It's a smooth measure of effective neighbors. Perplexity=30 means each point pays attention to ~30 neighbors. Low perplexity = local structure; high = more global.

Q: Why can't t-SNE be used for ML preprocessing?

A: (1) No inverse transform, (2) No out-of-sample extension (can't transform new data), (3) Stochastic results vary between runs, (4) \(O(n^2)\) complexity.

Q: UMAP vs t-SNE — when to use which?

A: UMAP is faster, scales better, and can transform new data. Use t-SNE only for small dataset visualization where interpretability of local clusters is critical. Use UMAP for everything else.

Q: Can t-SNE/UMAP distances be interpreted?

A: No! Only relative ordering matters. Cluster sizes and distances between clusters in t-SNE/UMAP plots are meaningless. Use PCA for interpretable distances.

Common Pitfalls

  1. Interpreting cluster sizes: Larger clusters in t-SNE ≠ more data
  2. Interpreting distances: Gap between clusters ≠ dissimilarity
  3. Random seed sensitivity: Different seeds = different plots
  4. Not scaling data: Always StandardScaler before t-SNE/UMAP
  5. Wrong perplexity: Rule of thumb: perplexity < \(\sqrt{n}\)

18. Constrained Optimization: Lagrange & KKT (Gap Filler 2026)

Best Sources

Books: - Convex Optimization (Boyd & Vandenberghe) — canonical text - Stanford CS229 Lecture Notes — SVM and optimization

Tutorials: - KKT Conditions: Ultimate Guide — NumberAnalytics - Convex Optimization and Lagrange Multipliers — CircuitLabs (2025)

General Optimization Problem

\[\begin{aligned} \min_{x} \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m \\ & h_j(x) = 0, \quad j = 1, \ldots, p \end{aligned}\]

Lagrange Multipliers (Equality Constraints)

For problem with only equality constraints \(h_j(x) = 0\):

Lagrangian: \(L(x, \mu) = f(x) + \sum_{j=1}^{p} \mu_j h_j(x)\)

Optimality condition: \(\nabla_x L(x^*, \mu^*) = 0\)

Intuition: At optimum, gradient of \(f\) is a linear combination of constraint gradients.

KKT Conditions (Inequality + Equality)

For general problem with both constraints, \(x^*\) is optimal iff:

Condition Formula Meaning
Primal feasibility \(g_i(x^*) \leq 0\), \(h_j(x^*) = 0\) Point satisfies all constraints
Dual feasibility \(\lambda_i \geq 0\) Multipliers non-negative
Stationarity \(\nabla f + \sum \lambda_i \nabla g_i + \sum \mu_j \nabla h_j = 0\) Gradient of Lagrangian = 0
Complementary slackness \(\lambda_i g_i(x^*) = 0\) Active constraints only

Full Lagrangian: \(L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \mu_j h_j(x)\)

Complementary Slackness Intuition

\[\lambda_i g_i(x^*) = 0\]
  • If constraint is inactive (\(g_i(x^*) < 0\)), then \(\lambda_i = 0\)
  • If constraint is active (\(g_i(x^*) = 0\)), then \(\lambda_i \geq 0\)
  • Constraint "pushes back" with force \(\lambda_i\)

SVM Dual Formulation

Primal (hard margin): \(\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1\)

Lagrangian: \(L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i [y_i(w^T x_i + b) - 1]\)

Dual: \(\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j\)

\[\text{s.t.} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0\]

Why dual? 1. Kernel trick: replace \(x_i^T x_j\) with \(K(x_i, x_j)\) 2. Solution sparse: most \(\alpha_i = 0\) (support vectors)

Python Implementation

import numpy as np
from scipy.optimize import minimize

# Example: Minimize x^2 + y^2 subject to x + y = 1
def objective(x):
    return x[0]**2 + x[1]**2

def eq_constraint(x):
    return x[0] + x[1] - 1  # x + y = 1

# Solve with scipy
result = minimize(
    objective,
    x0=[0.5, 0.5],
    constraints={'type': 'eq', 'fun': eq_constraint}
)
print(f"Optimal: x={result.x}, f={result.fun}")
# x=[0.5, 0.5], f=0.5

# Verify with Lagrange multiplier
# L = x^2 + y^2 + λ(x + y - 1)
# ∂L/∂x = 2x + λ = 0  →  x = -λ/2
# ∂L/∂y = 2y + λ = 0  →  y = -λ/2
# Constraint: -λ/2 - λ/2 = 1  →  λ = -1
# Solution: x = y = 0.5 ✓

Convexity Matters

Problem Type Local = Global? KKT Conditions
Convex ✅ Yes Necessary AND sufficient
Non-convex ❌ No Necessary only

Convex problem requires: - Objective \(f(x)\) is convex - Inequality constraints \(g_i(x)\) are convex - Equality constraints \(h_j(x)\) are affine

Key Interview Questions

Q: What's the geometric interpretation of Lagrange multipliers?

A: At optimum, gradient of objective is perpendicular to constraint surface. \(\nabla f = -\sum \lambda_i \nabla g_i\) means \(\nabla f\) is in the span of constraint gradients.

Q: When are KKT conditions sufficient (not just necessary)?

A: When the problem is convex: convex objective, convex inequality constraints, affine equality constraints. Then KKT conditions guarantee global optimum.

Q: What does complementary slackness tell us in SVM?

A: \(\alpha_i (y_i(w^Tx_i + b) - 1) = 0\). Either \(\alpha_i = 0\) (point not on margin) or \(y_i(w^Tx_i + b) = 1\) (point on margin = support vector). This is why SVM has sparse solution.

Q: Why formulate SVM as a constrained optimization problem?

A: (1) Margin maximization is naturally a constrained problem, (2) Dual formulation enables kernel trick, (3) Guarantees global optimum (convex), (4) Solution is sparse (only support vectors matter).

Q: What's the difference between primal and dual problem?

A: Primal: optimize over original variables. Dual: optimize over Lagrange multipliers. For convex problems, strong duality holds (optimal values equal). Dual often easier (kernel trick in SVM).


19. Multiple Testing Correction (Gap Filler 2026)

Best Sources

Blogs: - Multiple Comparison Corrections in A/B Testing — Statsig (Oct 2025) - Benjamini-Hochberg Procedure — GeeksforGeeks (2025)

Books: - Multiple Comparisons (Hochberg & Tamhane) - Statistical Inference (Casella & Berger) — Chapter 8

The Multiple Testing Problem

Problem: With \(m\) independent tests at \(\alpha = 0.05\): \(P(\text{at least one false positive}) = 1 - (1 - \alpha)^m\)

Tests (m) P(at least 1 FP)
1 5%
5 23%
10 40%
20 64%
100 99%

FWER vs FDR

Metric Definition Methods
FWER (Family-Wise Error Rate) \(P(\geq 1\) false positive) Bonferroni, Dunnett, Holm
FDR (False Discovery Rate) \(E[\frac{\text{false positives}}{\text{all positives}}]\) Benjamini-Hochberg

FWER = conservative, controls ANY false positive. FDR = balanced, controls PROPORTION of false positives.

Bonferroni Correction

Formula: \(\alpha_{\text{adjusted}} = \frac{\alpha}{m}\)

Example: 10 tests at \(\alpha = 0.05\) - Adjusted threshold: \(\frac{0.05}{10} = 0.005\) - Reject \(H_0\) only if \(p < 0.005\)

Pros: Simple, controls FWER
Cons: Too conservative, low power

Benjamini-Hochberg Procedure

Algorithm: 1. Sort p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\) 2. For each \(i\), compute threshold: \(t_i = \frac{i}{m} \cdot \alpha\) 3. Find largest \(k\) such that \(p_{(k)} \leq t_k\) 4. Reject \(H_{(1)}, H_{(2)}, \ldots, H_{(k)}\)

Example: 5 tests at \(\alpha = 0.05\)

Rank p-value Threshold \((i/m \cdot \alpha)\) Significant?
1 0.001 0.01 ✅ (p ≤ 0.01)
2 0.008 0.02 ✅ (p ≤ 0.02)
3 0.020 0.03 ✅ (p ≤ 0.03)
4 0.045 0.04 ❌ (p > 0.04)
5 0.120 0.05 ❌ (p > 0.05)

Result: Reject first 3 hypotheses.

Comparison Table

Method Controls Power Best For
Bonferroni FWER Low Few tests, strict control needed
Holm FWER Medium Step-down improvement
Dunnett FWER Medium-High Multiple vs. control
BH FDR High Many tests, some FP OK

Python Implementation

import numpy as np
from scipy import stats

# Example: 10 hypothesis tests
p_values = np.array([0.001, 0.008, 0.02, 0.03, 0.04,
                      0.05, 0.08, 0.12, 0.15, 0.30])

# Bonferroni correction
alpha = 0.05
bonferroni_threshold = alpha / len(p_values)
bonferroni_significant = p_values < bonferroni_threshold
print(f"Bonferroni threshold: {bonferroni_threshold:.4f}")
print(f"Bonferroni significant: {bonferroni_significant.sum()} tests")

# Benjamini-Hochberg correction
def benjamini_hochberg(p_values, alpha=0.05):
    m = len(p_values)
    sorted_idx = np.argsort(p_values)
    sorted_p = p_values[sorted_idx]

    # Calculate thresholds
    thresholds = (np.arange(1, m + 1) / m) * alpha

    # Find largest k where p[k] <= threshold[k]
    below_threshold = sorted_p <= thresholds
    if not below_threshold.any():
        return np.zeros(m, dtype=bool)

    k = np.max(np.where(below_threshold)[0])

    # Create result array
    significant = np.zeros(m, dtype=bool)
    significant[sorted_idx[:k+1]] = True
    return significant

bh_significant = benjamini_hochberg(p_values, alpha)
print(f"BH significant: {bh_significant.sum()} tests")

# Using statsmodels
from statsmodels.stats.multitest import multipletests
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05,
                                           method='fdr_bh')  # or 'bonferroni'
print(f"statsmodels BH: {reject.sum()} tests")

When to Use Which

Multiple tests needed?
├── Few tests (< 10) and strict control?
│   └── Bonferroni or Holm
├── Many tests (> 10) or exploratory?
│   └── Benjamini-Hochberg
├── All groups vs. control?
│   └── Dunnett's test
└── Sequential testing (peeking)?
    └── Sequential testing methods (not BH)

Key Interview Questions

Q: Why does Bonferroni reduce power?

A: By dividing \(\alpha\) by \(m\), the threshold becomes very strict. True effects with moderate p-values (e.g., 0.02) may be missed when testing many hypotheses.

Q: When would you choose FDR over FWER?

A: FDR (Benjamini-Hochberg) when: (1) Many tests (10+), (2) Exploratory analysis where some false positives are acceptable, (3) Need more power to detect true effects. Use FWER when even one false positive is costly (e.g., drug approval).

Q: What happens to FWER if you run 100 A/B tests at α=0.05?

A: \(P(\text{at least 1 FP}) = 1 - (1 - 0.05)^{100} \approx 99.4\%\). Without correction, you're almost guaranteed at least one false positive!

Q: Why can't BH be used for sequential testing (peeking)?

A: BH requires all p-values at once for ranking. In sequential testing, tests are run one at a time based on interim results, so the full set isn't available. Use group sequential methods or alpha-spending instead.

Q: What's the difference between Bonferroni and Holm?

A: Both control FWER. Holm is a step-down procedure that's uniformly more powerful: it compares the smallest p-value to \(\alpha/m\), then the second smallest to \(\alpha/(m-1)\), etc. Always prefer Holm over Bonferroni.


20. Spectral Methods: Graph Laplacian (Gap Filler 2026)

Best Sources

Tutorials: - Spectral Clustering in Machine Learning — GeeksforGeeks (2025) - A Tutorial on Spectral Clustering — Von Luxburg (canonical)

Spectral Clustering Overview

Core idea: Use eigenvalues/eigenvectors of the graph Laplacian to find clusters in data with non-convex shapes.

When to use: Non-convex clusters, manifold structure, image segmentation, graph data.

Graph Construction

Method Description Result
ε-neighborhood Connect points within radius ε Unweighted graph
k-NN graph Connect to k nearest neighbors Directed/weighted graph
Fully connected Connect all pairs with Gaussian weight Weighted graph

Similarity (affinity) matrix: \(W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)\)

Graph Laplacian Variants

Type Formula Properties
Unnormalized \(L = D - W\) Simple, eigenvector for \(k\) clusters
Normalized (symmetric) \(L_{sym} = D^{-1/2}LD^{-1/2} = I - D^{-1/2}WD^{-1/2}\) Better numerical properties
Normalized (random walk) \(L_{rw} = D^{-1}L = I - D^{-1}W\) Connection to random walks

Where \(D\) is the degree matrix: \(D_{ii} = \sum_j W_{ij}\)

Spectral Clustering Algorithm

  1. Build similarity graph from data
  2. Compute Laplacian matrix \(L\)
  3. Find k smallest eigenvectors of \(L\)
  4. Stack eigenvectors as columns in matrix \(U\)
  5. Run k-means on rows of \(U\)

Python Implementation

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate non-convex data
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# Spectral clustering
sc = SpectralClustering(
    n_clusters=2,
    affinity='rbf',  # or 'nearest_neighbors'
    gamma=1.0,       # kernel bandwidth
    random_state=42
)
labels = sc.fit_predict(X)

# Compare with k-means
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=42)
km_labels = km.fit_predict(X)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
axes[0].set_title('Spectral Clustering')
axes[1].scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis')
axes[1].set_title('K-Means (fails on non-convex)')

Eigenvalue Interpretation

  • 0 eigenvalue: Corresponds to constant eigenvector (all nodes connected)
  • Small eigenvalues: Near-0 eigenvectors indicate connected components
  • Gap in spectrum: Number of clusters ≈ number of near-zero eigenvalues

K-Means vs Spectral Clustering

Aspect K-Means Spectral
Cluster shape Convex only Any shape
Complexity \(O(nkd)\) \(O(n^3)\) for eigendecomposition
Scalability Excellent Poor for large n
Preprocessing Scale features Build affinity matrix

Key Interview Questions

Q: Why does spectral clustering work for non-convex clusters?

A: It transforms data into eigenvector space where the Laplacian's spectral properties separate clusters linearly. The transformation "unfolds" non-convex manifolds into separable representations.

Q: What's the role of the graph Laplacian's eigenvectors?

A: The smallest eigenvectors form an embedding where points in the same cluster are close. Running k-means in this space finds clusters that may be non-convex in original space.

Q: How do you choose the number of clusters k?

A: (1) Look for a gap in the eigenvalue spectrum (eigengap heuristic), (2) Use silhouette score, (3) Domain knowledge. The eigengap method: find k where \(\lambda_{k+1} - \lambda_k\) is largest.

Q: Why is spectral clustering slow for large datasets?

A: Building the full affinity matrix is \(O(n^2)\) memory. Eigendecomposition is \(O(n^3)\). Solutions: use sparse affinity matrices, approximate eigensolvers (Lanczos), or Nyström approximation.

Q: What's the difference between L, L_sym, and L_rw?

A: All find the same clusters in theory. \(L_{sym}\) (symmetric normalized) has better numerical stability. \(L_{rw}\) (random walk) relates to Markov chains on graphs. Always use normalized Laplacian in practice.

Advantages & Disadvantages

Pros: - Handles non-convex clusters - No assumption about cluster shape - Mathematically elegant (spectral graph theory)

Cons: - \(O(n^3)\) complexity - \(O(n^2)\) memory for affinity matrix - Sensitive to affinity parameters (σ, k) - Requires specifying number of clusters


21. Test-Time Compute Scaling (2025 Hot Topic)

Лучшие источники

Blogs: - Inference-Time Compute Scaling Methods — Sebastian Raschka (Mar 2025) — CANONICAL - What is Test-Time Scaling? — Adaline Labs (Apr 2025)

Papers: - The Art of Scaling Test-Time Compute for LLMs — Dec 2025 - s1: Simple Test-Time Scaling — Jan 2025

What is Test-Time Compute?

Definition: Spending more computational resources during inference to improve model outputs, rather than just during training.

Analogy: Humans give better responses when given more time to "think". Similarly, LLMs can improve with techniques that encourage more "thought" during generation.

Two Strategies for Better Reasoning

Strategy When Cost
Train-time compute During training/fine-tuning Fixed upfront cost
Test-time compute During inference Per-query cost, scalable

Methods for Test-Time Scaling

Method Description Trade-off
Chain-of-Thought (CoT) "Think step by step" prompting More tokens = more cost
Best-of-N sampling Generate N responses, pick best N× compute
Majority voting Generate N, take most common answer N× compute
Beam search Explore multiple paths, keep top-k k× compute
Self-correction Model reviews and fixes own output 2× tokens
"Wait" tokens Force model to continue thinking Budget forcing

"Wait" Token Technique (s1 Paper)

def budget_forcing(model, prompt, max_thinking_tokens=1000):
    """Force model to think longer by adding 'Wait' tokens"""
    response = model.generate(prompt)

    while len(response) < max_thinking_tokens:
        # If model tries to stop, inject "Wait" to continue
        if response.endswith("</answer>"):
            response = response.replace("</answer>", "")
            response += "\nWait, let me reconsider..."

        response = model.generate(response, max_new_tokens=100)

    return response

Best-of-N with Verification

def best_of_n_with_verifier(model, verifier, prompt, n=5):
    """Generate N candidates, use verifier to pick best"""
    candidates = [model.generate(prompt) for _ in range(n)]
    scores = [verifier.score(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]

Compute-Optimal Inference

Key insight (Snell et al., 2024): There's an optimal allocation of test-time compute: - Easy problems: little extra compute needed - Hard problems: benefit from more sampling/verification - Very hard problems: diminishing returns

Rule of thumb: \(\text{Test-time budget} \propto \text{problem difficulty}\)

When Test-Time Compute Helps

Scenario Effective? Why
Math problems ✅ Yes Verifiable answers
Code generation ✅ Yes Tests can verify
Creative writing ⚠️ Limited No objective "best"
Simple Q&A ❌ No Overkill
Time-critical apps ❌ No Latency too high

Key Interview Questions

Q: What's the difference between train-time and test-time compute?

A: Train-time: invest compute upfront during training to improve model weights. Test-time: spend compute during inference on each query. Test-time is more flexible (can scale per-problem difficulty) but costs more per query.

Q: When would you use Best-of-N vs. Chain-of-Thought?

A: Best-of-N when you have a verifier (e.g., code tests, math checker). CoT when you want the model to reason explicitly. Both can be combined: generate N CoT responses, pick the most consistent.

Q: What are "wait" tokens in test-time scaling?

A: A technique from the s1 paper where you force the model to continue generating by appending "Wait, let me reconsider" when it tries to stop. This effectively increases inference compute for harder problems.

Q: Why did DeepSeek R1 NOT use explicit test-time scaling?

A: Their model already generates longer responses naturally (implicit scaling). They noted that test-time scaling is typically an application-layer concern and can be added post-hoc without modifying the model.

Q: What's the cost problem with test-time compute?

A: Linear or super-linear increase in inference cost per query. For high-traffic applications, this can be prohibitively expensive. Need to balance accuracy gains vs. latency and cost constraints.

Cost-Benefit Trade-off

More test-time compute →
├── Better accuracy (up to a point)
├── Higher latency (bad for real-time)
├── Higher cost (per query)
└── Diminishing returns for very high compute

Practical guidance: - Start with CoT prompting (free) - Add Best-of-N for verifiable domains (2-5× cost) - Use process reward models for critical applications - Set compute budget based on problem value


22. Information Geometry (Gap Filler 2026)

Best Sources

Blogs: - Information Geometry Part 3: Applications in ML — Ji-Ha Kim (Jun 2025) - Curvature as a Teacher: Information Geometry in Neural Network Learning — Satyam Mishra (Aug 2025) - Information Geometry in Variational Inference — AI Under the Hood

Books: - Amari, Information Geometry and Its Applications — canonical reference - Murphy, Probabilistic Machine Learning — Chapter 10

Core Concept: Probability Manifold

Key insight: Probability distributions form a curved manifold, not a flat Euclidean space.

Concept Euclidean Information Geometry
Space Flat Curved manifold
Distance \(\|x - y\|\) KL divergence
Metric Identity \(I\) Fisher Information \(I(\theta)\)
Gradient \(\nabla L\) Natural gradient \(\tilde{\nabla} L\)

Fisher Information Matrix (FIM)

Definition: \(I(\theta)_{ij} = \mathbb{E}\left[\frac{\partial \log p(x|\theta)}{\partial \theta_i} \frac{\partial \log p(x|\theta)}{\partial \theta_j}\right]\)

Properties: - Positive semi-definite: \(I(\theta) \succeq 0\) - Captures curvature of log-likelihood - KL divergence approximation: \(D_{KL}(p_\theta \| p_{\theta+\delta}) \approx \frac{1}{2} \delta^T I(\theta) \delta\)

Natural Gradient

Standard gradient: \(\nabla L\) — assumes flat Euclidean space

Natural gradient: \(\tilde{\nabla} L = I(\theta)^{-1} \nabla L\)

Update rule: \(\theta_{t+1} = \theta_t - \eta \cdot I(\theta_t)^{-1} \nabla L(\theta_t)\)

Benefits: - Reparametrization invariant: same direction regardless of parameterization - Faster convergence: follows geodesics (shortest paths) - Stable: respects manifold curvature

Intuition: Why Natural Gradient?

Imagine riding a bike on Earth vs. a flat parking lot: - On the parking lot (Euclidean), shortest path is a straight line. - On Earth (curved), shortest path is a great circle.

Natural gradient = GPS navigation that knows the Earth is round.

Approximations for Large Models

Problem: For \(d\) parameters, computing \(I(\theta)^{-1}\) is \(O(d^3)\) — infeasible for deep nets.

Approximation Complexity Trade-off
Diagonal \(O(d)\) Ignores correlations, like Adam
Block-diagonal (K-FAC) \(O(d \cdot k^2)\) Per-layer blocks, Kronecker factors
Low-rank \(O(dr)\) Capture dominant directions
Matrix-free (CG) \(O(d \cdot k)\) Solve \(I(\theta)s = \nabla L\) iteratively

Adam as Approximate Natural Gradient

Adam update: \(\theta_t = \theta_{t-1} - \eta \cdot \frac{g_t}{\sqrt{v_t} + \epsilon}\)

Interpretation: \(\sqrt{v_t}\) (second moment) ≈ diagonal of empirical Fisher.

import torch

def natural_gradient_update(params, grad, fisher_diag, lr=0.01, damping=1e-4):
    """Natural gradient with diagonal Fisher approximation"""
    # F^-1 * grad (element-wise for diagonal)
    nat_grad = grad / (fisher_diag + damping)
    return params - lr * nat_grad

def empirical_fisher_diag(model, data_loader, n_samples=100):
    """Estimate diagonal of empirical Fisher"""
    fisher_diag = {name: torch.zeros_like(p) for name, p in model.named_parameters()}

    for i, (x, y) in enumerate(data_loader):
        if i >= n_samples:
            break

        log_prob = model.log_prob(x, y)
        for name, p in model.named_parameters():
            grad = torch.autograd.grad(log_prob, p, retain_graph=True)[0]
            fisher_diag[name] += grad ** 2

    # Average
    for name in fisher_diag:
        fisher_diag[name] /= n_samples

    return fisher_diag

K-FAC (Kronecker-Factored Approximate Curvature)

For a layer with weights \(W \in \mathbb{R}^{m \times n}\):

\[I_W \approx A \otimes G\]

where: - \(A = \mathbb{E}[aa^T]\) — activation covariance (input) - \(G = \mathbb{E}[gg^T]\) — gradient covariance (output) - \(\otimes\) — Kronecker product

Benefit: Inverting \((mn \times mn)\) matrix → inverting two \((m \times m)\) and \((n \times n)\) matrices.

Applications in ML

Application Role of Information Geometry
VAEs Optimize \(q(z\mid x)\) to minimize KL divergence
Bayesian Neural Networks Natural gradient for posterior inference
Reinforcement Learning (TRPO) Fisher metric constrains policy updates
Optimization Adam ≈ diagonal natural gradient

Key Interview Questions

Q: Why is natural gradient better than standard gradient?

A: (1) Reparametrization invariant — same optimization path regardless of how you parameterize the model. (2) Faster convergence — follows geodesics on the probability manifold. (3) Curvature-aware — adapts to sensitivity of parameters.

Q: How is Adam related to natural gradient?

A: Adam's second moment estimate \(v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\) approximates the diagonal of the empirical Fisher Information Matrix. The update \(g_t / \sqrt{v_t}\) is essentially a natural gradient with diagonal approximation.

Q: What's the computational problem with natural gradient for deep learning?

A: Computing and inverting the full Fisher Information Matrix is \(O(d^2)\) storage and \(O(d^3)\) inversion, where \(d\) is the number of parameters (millions to billions for modern networks). Practical methods use diagonal, block-diagonal (K-FAC), or low-rank approximations.

Q: When would you use K-FAC over Adam?

A: K-FAC when: (1) Training is unstable with Adam, (2) You need faster convergence (fewer steps), (3) Second-order information is valuable (e.g., small datasets, research). Adam is better for: (1) Most practical deep learning, (2) Limited compute, (3) Large-scale training.

Q: What does "curvature" mean in information geometry?

A: Curvature of the statistical manifold defined by the Fisher metric. High curvature regions → parameters more sensitive, need smaller steps. Low curvature → less sensitive, can take larger steps. Natural gradient automatically adapts step sizes based on curvature.

Summary Table

Concept Formula Intuition
Fisher Matrix \(I(\theta) = \mathbb{E}[\nabla \log p \cdot \nabla \log p^T]\) Curvature of log-likelihood
Natural Gradient \(\tilde{\nabla} L = I^{-1} \nabla L\) Steepest descent on manifold
KL ≈ Fisher \(D_{KL} \approx \frac{1}{2} \delta^T I \delta\) Local distance on manifold
Adam \(g / \sqrt{v}\) Diagonal natural gradient
K-FAC \(I \approx A \otimes G\) Layer-wise Kronecker factors

24. Advanced Optimization: Newton, BFGS, L-BFGS (Gap Filler 2026)

Best Sources

Blogs: - Why Not Newton? The Real Reason We're Stuck with SGD — Dang Truong (2025) - Second-Order Optimization Methods — GeeksforGeeks - L-BFGS for Advanced Optimization — NumberAnalytics

Courses: - EPFL Optimization for ML — lecture notes (2025) - Boyd Convex Optimization — canonical book

First-Order vs Second-Order

Method Uses Convergence Cost per Step
SGD Gradient \(\nabla f\) Sub-linear \(O(n)\)
Newton Gradient + Hessian \(\nabla f, H\) Quadratic \(O(n^3)\)
BFGS Approximate inverse Hessian Superlinear \(O(n^2)\)
L-BFGS Limited memory BFGS Superlinear \(O(n)\)

Newton's Method

Update rule (from Taylor expansion): \(\theta_{t+1} = \theta_t - H^{-1} \nabla f(\theta_t)\)

Where: - \(H = \nabla^2 f(\theta)\) — Hessian matrix of second derivatives - \(H_{ij} = \frac{\partial^2 f}{\partial \theta_i \partial \theta_j}\)

Why quadratic convergence? - Taylor expansion: \(f(\theta) \approx f(\theta^*) + \nabla f(\theta^*)(\theta - \theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)\) - Newton step minimizes the quadratic approximation exactly - Near minimum: error roughly squared each iteration

Why NOT Use Newton for Deep Learning?

Level 1: Memory Problem - ResNet-50: \(N = 25\) million parameters - Hessian: \(N \times N = 625\) trillion entries = 2.5 Petabytes - H100 GPU: 80 GB → would need over 31,000 GPUs just to store it!

Level 2: Stochastic Problem - We use mini-batches (64 samples), not full dataset - Mini-batch gradient: reasonable estimate of true gradient - Mini-batch Hessian: garbage — too noisy to trust - Even with L-BFGS, stochastic noise destroys quadratic convergence

Level 3: Non-Convex Problem - Deep learning landscapes: millions of saddle points, sharp valleys - Newton converges to nearest critical point (could be saddle!) - SGD noise helps escape sharp minima → better generalization

Insight: We don't want a scalpel (Newton). We need a jackhammer (SGD).

BFGS: Quasi-Newton Method

Key idea: Approximate \(H^{-1}\) iteratively, don't compute it directly.

Update: \(\theta_{t+1} = \theta_t - \alpha_t M_t \nabla f(\theta_t)\)

Where \(M_t \approx H^{-1}\) is updated using: - \(s_t = \theta_{t+1} - \theta_t\) (parameter change) - \(y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t)\) (gradient change) - Rank-2 update: \(M_{t+1} = (I - \rho s y^T) M_t (I - \rho y s^T) + \rho s s^T\)

Cost: \(O(n^2)\) per step (vs \(O(n^3)\) for Newton)

L-BFGS: Limited-Memory BFGS

Key idea: Don't store \(n \times n\) matrix. Store last \(k\) pairs \((s_i, y_i)\).

Memory: \(O(nk)\) instead of \(O(n^2)\)

Typical: \(k = 10-20\) pairs

import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """Rosenbrock function: classic optimization benchmark"""
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

def rosenbrock_grad(x):
    """Gradient of Rosenbrock"""
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200*(xm-xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1-xm)
    der[0] = -400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0])
    der[-1] = 200*(x[-1]-x[-2]**2)
    return der

# L-BFGS optimization
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
result = minimize(rosenbrock, x0, method='L-BFGS-B', jac=rosenbrock_grad)
print(f"Minimum at: {result.x}")
print(f"Function value: {result.fun:.6f}")
print(f"Iterations: {result.nit}")

Comparison for Deep Learning

| Method | Memory | Works with mini-batches? | Good for DL? |
|---|---|---|---|
| Newton | \(O(n^2)\) | No | ❌ Impossible |
| BFGS | \(O(n^2)\) | Poorly | ❌ Too much memory |
| L-BFGS | \(O(n)\) | Poorly | ⚠️ Sometimes (small models) |
| SGD | \(O(n)\) | Yes | ✅ Standard |
| Adam | \(O(n)\) | Yes | ✅ Best default |

When L-BFGS IS Used

  • Classic ML: SVM, logistic regression (convex problems)
  • Small neural networks: < 10K parameters
  • Fine-tuning: When full-batch is feasible
  • Scipy/PyTorch: scipy.optimize.minimize(method='L-BFGS-B') or torch.optim.LBFGS (see the sketch below)
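A hedged sketch of the PyTorch API: torch.optim.LBFGS is full-batch style and requires a closure that re-evaluates the loss; the toy linear model and data here are made up for illustration.

import torch

model = torch.nn.Linear(10, 1)
X, y = torch.randn(256, 10), torch.randn(256, 1)
opt = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=20, history_size=10)

def closure():
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    return loss

for _ in range(5):          # each step runs up to max_iter L-BFGS iterations internally
    opt.step(closure)
print(f"final loss: {closure().item():.4f}")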

Key Interview Questions

Q: Why doesn't deep learning use Newton's method?

A: Three reasons: (1) Memory: Hessian is \(O(n^2)\) — petabytes for modern networks. (2) Stochastic: Mini-batch Hessians are too noisy, destroying quadratic convergence benefits. (3) Non-convex: Newton can converge to saddles or sharp minima; SGD's noise finds flat, generalizable minima.

Q: What's the difference between BFGS and L-BFGS?

A: BFGS stores full \(n \times n\) inverse Hessian approximation (\(O(n^2)\) memory). L-BFGS stores only last \(k\) gradient/parameter difference pairs (\(O(nk)\) memory). L-BFGS achieves similar convergence with far less memory.

Q: When would you actually use L-BFGS?

A: For convex problems (SVM, logistic regression), small models (<10K params), or when you can afford full-batch gradients. Not suitable for large-scale stochastic deep learning.

Q: If Adam is "fake second-order," what does it approximate?

A: Adam's second moment \(v_t\) approximates the diagonal of the empirical Fisher Information Matrix. This is a very rough curvature estimate — just enough to adapt step sizes per parameter, but missing the full Hessian information.

Q: What's quadratic convergence?

A: Error roughly squares each iteration: \(|\theta_{t+1} - \theta^*| \approx C|\theta_t - \theta^*|^2\). Near minimum, correct digits double each step. Newton has this; SGD doesn't.

Summary Table

| Method | Update rule | Key property |
|---|---|---|
| Newton | \(\theta - H^{-1}\nabla f\) | Quadratic convergence, \(O(n^3)\) per step |
| BFGS | \(\theta - \alpha M_t \nabla f\) | Superlinear, \(O(n^2)\) memory |
| L-BFGS | Same, with implicit \(M_t\) | \(O(n)\) memory, needs full-batch gradients |
| SGD | \(\theta - \eta \nabla f\) | Noisy, good for non-convex problems |
| Adam | \(\theta - \eta \frac{m}{\sqrt{v}+\epsilon}\) | Diagonal approximation, adaptive step sizes |

25. Johnson-Lindenstrauss Lemma & Random Projections

Best Sources

Blogs: - Johnson-Lindenstrauss Lemma & Random Projections — AI Under the Hood - Ultimate Guide to Johnson-Lindenstrauss Lemma — NumberAnalytics

Papers: - Johnson & Lindenstrauss (1984) — original paper - Dasgupta & Gupta (2003) — elementary proof - Achlioptas (2003) — database-friendly sparse projections

Books: - Vempala, The Random Projection Method — canonical reference

Johnson-Lindenstrauss Lemma: Statement

For \(n\) points in \(\mathbb{R}^d\) and \(\varepsilon \in (0,1)\), there exists a linear map \(f: \mathbb{R}^d \to \mathbb{R}^k\) with:

\[k \geq \frac{8 \ln n}{\varepsilon^2}\]

such that for all pairs \(x_i, x_j\):

\[(1 - \varepsilon) \|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \varepsilon) \|x_i - x_j\|^2\]

Key insight: \(k\) depends on \(n\) and \(\varepsilon\), NOT on original dimension \(d\)!

Why This Matters

| Without JL | With JL |
|---|---|
| \(d = 1{,}000{,}000\) | \(k = O(\log n / \varepsilon^2)\) |
| Expensive to store | Compressed storage |
| Slow distance computations | Fast computations |
| Curse of dimensionality | Mitigated |

Random Projection Methods

| Method | Matrix entries | Speed | Notes |
|---|---|---|---|
| Gaussian | \(R_{ij} \sim \mathcal{N}(0, 1/k)\) | Slow | Theoretical guarantee |
| Rademacher | \(R_{ij} = \pm 1/\sqrt{k}\), each with \(p = 0.5\) | Medium | Simpler |
| Sparse (Achlioptas) | \(0\) (prob. 2/3), \(\pm\sqrt{3}/\sqrt{k}\) (prob. 1/6 each) | Fast | ~3× speedup |

Python Implementation

import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

def random_projection_custom(X, k, method='gaussian'):
    """
    Random projection from d dimensions to k dimensions.

    Args:
        X: (n, d) data matrix
        k: target dimension
        method: 'gaussian', 'rademacher', or 'sparse'
    """
    n, d = X.shape

    if method == 'gaussian':
        R = np.random.randn(k, d) / np.sqrt(k)
    elif method == 'rademacher':
        R = np.random.choice([-1, 1], size=(k, d)) / np.sqrt(k)
    elif method == 'sparse':
        # Achlioptas: 0 with prob 2/3, ±√3 with prob 1/6 each
        R = np.random.choice([0, np.sqrt(3), -np.sqrt(3)],
                             size=(k, d), p=[2/3, 1/6, 1/6]) / np.sqrt(k)
    else:
        raise ValueError(f"Unknown method: {method}")

    return X @ R.T

def jl_target_dimension(n, epsilon=0.1):
    """Compute minimum target dimension for JL guarantee."""
    return int(np.ceil(8 * np.log(n) / epsilon**2))

# Example usage
n, d = 1000, 5000  # 1000 points in 5000 dimensions
X = np.random.randn(n, d)

epsilon = 0.1
k = jl_target_dimension(n, epsilon)
print(f"Original: {d}D → Projected: {k}D (for n={n}, ε={epsilon})")

X_proj = random_projection_custom(X, k, method='sparse')

# Verify distance preservation
i, j = 0, 1
orig_dist = np.linalg.norm(X[i] - X[j])
proj_dist = np.linalg.norm(X_proj[i] - X_proj[j])
ratio = proj_dist / orig_dist
print(f"Original dist: {orig_dist:.3f}, Projected: {proj_dist:.3f}, Ratio: {ratio:.3f}")
print(f"Within bounds? {(1-epsilon) <= ratio**2 <= (1+epsilon)}")

# Using sklearn (recommended for production)
transformer = SparseRandomProjection(n_components=k, random_state=42)
X_proj_sklearn = transformer.fit_transform(X)

Applications in ML

| Application | How JL helps |
|---|---|
| k-NN search | Reduce dimensions while keeping approximate neighbors |
| Clustering | Faster k-means in lower dimensions |
| SVM | Kernel approximation for large datasets |
| LSH (Locality-Sensitive Hashing) | Efficient approximate nearest neighbors |
| Compressed sensing | Signal recovery from fewer measurements |
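As a sanity check of the k-NN row, a small sketch comparing exact neighbours in the original space with neighbours found after a sparse random projection (the 1000 → 128 dimensions and the recall metric are illustrative choices, not from the sources above):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.random_projection import SparseRandomProjection

# Clustered toy data: 2000 points, 1000 dimensions, 20 clusters
X, _ = make_blobs(n_samples=2000, n_features=1000, centers=20, random_state=0)

# Exact 10-NN in the original space (first 100 query points)
_, idx_full = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X[:100])

# 10-NN after a sparse random projection to 128 dimensions
X_low = SparseRandomProjection(n_components=128, random_state=0).fit_transform(X)
_, idx_low = NearestNeighbors(n_neighbors=10).fit(X_low).kneighbors(X_low[:100])

# Fraction of exact neighbours recovered in the projected space
recall = np.mean([len(set(a) & set(b)) / 10 for a, b in zip(idx_full, idx_low)])
print(f"10-NN recall after 1000 -> 128 projection: {recall:.2f}")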

JL vs PCA

| Aspect | Random Projection | PCA |
|---|---|---|
| Speed | \(O(nkd)\) | \(O(nd^2)\) for SVD |
| Data-dependent? | No (random matrix) | Yes (eigenvalue decomposition) |
| Preserves | Pairwise distances | Variance |
| Best for | Very high \(d\), streaming data | When variance matters most |
| Guarantees | Theoretical (with high probability) | Optimal for variance |
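A rough timing sketch of the speed row (absolute numbers depend on hardware and BLAS, so treat the comparison qualitatively; the data sizes below are arbitrary):

import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3000))
k = 100

t0 = time.perf_counter()
GaussianRandomProjection(n_components=k, random_state=0).fit_transform(X)
t_rp = time.perf_counter() - t0

t0 = time.perf_counter()
PCA(n_components=k).fit_transform(X)
t_pca = time.perf_counter() - t0

print(f"Random projection: {t_rp:.2f}s   PCA: {t_pca:.2f}s")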

When to Use Random Projections

High-dimensional data (d > 1000)?
├── Need fast dimensionality reduction?
│   └── Random Projection ✅
├── Need interpretable components?
│   └── PCA ✅
├── Streaming / online data?
│   └── Random Projection ✅ (no fit needed)
└── Distance-based algorithms (k-NN, clustering)?
    └── Random Projection ✅ (preserves distances)

Key Interview Questions

Q: What does the Johnson-Lindenstrauss lemma guarantee?

A: For any \(n\) points, we can project to \(k = O(\log n / \varepsilon^2)\) dimensions while preserving all pairwise distances within factor \((1 \pm \varepsilon)\) with high probability. Crucially, \(k\) doesn't depend on original dimension \(d\).

Q: Why is JL useful for ML?

A: (1) Mitigates curse of dimensionality, (2) Speeds up distance-based algorithms (k-NN, clustering), (3) Compresses high-dimensional data (embeddings), (4) Enables efficient approximate nearest neighbor search.

Q: Why use sparse random projections over Gaussian?

A: Sparse matrices (Achlioptas) are faster to compute: many entries are zero, reducing multiplications by ~3×. Still provides same theoretical guarantees with high probability.

Q: When would you use random projection vs PCA?

A: Random projection when: (1) Data dimension is very high (\(d > 1000\)), (2) Need fast online/streaming projection, (3) Don't need interpretable components, (4) Preserving distances is more important than variance. PCA when you need optimal variance preservation and interpretability.

Q: What's the catch with random projections?

A: (1) Adds randomness — results not deterministic without seed, (2) Some distortion (\(\varepsilon\)), (3) Doesn't capture data structure like PCA, (4) Not optimal for any specific dataset — it's a general-purpose tool.

Summary Table

| Concept | Formula | Purpose |
|---|---|---|
| Target dimension | \(k \geq \frac{8 \ln n}{\varepsilon^2}\) | Minimum for the JL guarantee |
| Gaussian projection | \(R_{ij} \sim \mathcal{N}(0, 1/k)\) | Theoretical standard |
| Sparse projection | \(R_{ij} \in \{0, \pm\sqrt{3}\}/\sqrt{k}\) | Fast computation |
| Distance preservation | \((1-\varepsilon)\lVert x \rVert^2 \leq \lVert Rx \rVert^2 \leq (1+\varepsilon)\lVert x \rVert^2\) | Core guarantee |

Misconception: softmax is numerically stable without the max-subtraction trick

For \(x_i \gtrsim 709\) (float64) or \(x_i \gtrsim 88\) (float32), \(e^{x_i}\) overflows to inf -- even float64 does not save you. Always subtract the max: exp(x - max(x)) is mathematically identical but numerically stable. This comes up in 90% of ML interviews.

Misconception: t-SNE can replace PCA for data preprocessing

t-SNE has no inverse transform and no out-of-sample extension (see the sketch below), has \(O(n^2)\) complexity, and gives stochastic results. For preprocessing use PCA or UMAP. t-SNE is only for visualizing small datasets (<50k points). Distances and cluster sizes in a t-SNE plot are meaningless.
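A minimal sketch of why t-SNE does not fit a preprocessing pipeline: sklearn's PCA exposes transform and inverse_transform, while TSNE only offers fit_transform on the full dataset (the toy data shapes here are arbitrary):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.standard_normal((500, 50))
X_new = rng.standard_normal((10, 50))

pca = PCA(n_components=2).fit(X_train)
Z_new = pca.transform(X_new)            # out-of-sample projection: works
X_back = pca.inverse_transform(Z_new)   # approximate reconstruction: works

tsne = TSNE(n_components=2, random_state=0)
Z_train = tsne.fit_transform(X_train)   # only fit_transform on the whole dataset
# tsne.transform(X_new)                 # AttributeError: no out-of-sample extension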

Misconception: Newton's method is better than SGD for deep learning

The Hessian for ResNet-50 (25M parameters) has 625 trillion entries ≈ 2.5 petabytes. Inverting it costs \(O(n^3)\). Even L-BFGS works poorly with stochastic mini-batches. SGD plus its noise helps find flat minima with better generalization.

Connections Between Topics (Cross-References)

graph TD
    A[Calculus<br/>gradient] --> B[Gradient Descent]
    B --> C[Optimizers]
    C --> D[Deep Learning]
    E[Linear Algebra<br/>matrices] --> F[Neural Networks]
    F --> G[Backprop]
    H[Probability<br/>Bayes] --> I[Naive Bayes]
    H --> J[Information Theory]
    J --> K[Entropy]
    L[Statistics<br/>hypothesis testing] --> M[A/B Tests]
    K --> N[Information Gain]
    N --> O[Decision Trees]
    O --> P[Random Forest]
    P --> Q[Classification Metrics]
    R[Regularization<br/>L1/L2] --> S[Gradient Boosting]

    style A fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8eaf6,stroke:#3f51b5
    style H fill:#f3e5f5,stroke:#9c27b0
    style L fill:#f3e5f5,stroke:#9c27b0
    style D fill:#e8f5e9,stroke:#4caf50
    style J fill:#fff3e0,stroke:#ef6c00
    style Q fill:#e8f5e9,stroke:#4caf50

Recommended Study Order

  1. Week 1-2: Linear Algebra + Calculus (3Blue1Brown)
  2. Week 3-4: Probability + Statistics (StatQuest, practice problems)
  3. Week 5: Information Theory (connection to Loss Functions)
  4. Week 6: Classification Metrics + Regularization
  5. Week 7: Ensemble Methods
  6. Week 8: NumPy/Pandas practice (Kaggle)