Deep Learning: Gaps
~10 min read
What interviews ask that is NOT covered by the 11 tasks. Under-covered topics for the AI/ML/LLM Engineer track. Updated: 2026-02-11
Current coverage (11 tasks)
| Subcategory | Tasks | Coverage |
|---|---|---|
| Loss Functions | 1 | Good |
| Backprop (micrograd) | 1 | Excellent |
| Optimizers | 1 | Good |
| Weight Init | 1 | Good |
| Normalization | 1 | Good |
| LR Scheduling | 1 | Good |
| PyTorch Training Loop | 1 | Practical |
| CNN from Scratch | 1 | Good |
| RNN/LSTM | 1 | Good |
| Attention | 1 | Good |
| Positional Encodings | 1 | Good |
CRITICAL GAPS
1. Transformer Architecture Full Stack — PARTIALLY FILLED
Added in materials.md section 16:
- Pre-Norm vs Post-Norm Layer Normalization placement
- Formulas: Pre-Norm y = x + Attention(LayerNorm(x)) vs Post-Norm y = LayerNorm(x + Attention(x))
- Gradient flow explanation (why Pre-Norm is more stable)
- Comparison table (Training Stability, Deep Networks, Learning Rate sensitivity)
- Double Norm innovation (Grok, Gemma 2, Olmo 2)
- Python implementations (PreNormTransformerBlock, PostNormTransformerBlock)
- Interview questions (4 Q&A)
Sources: Medium "Why Pre-Norm Became the Default" (Jan 2025), LayerNorm papers
Remaining:
- Full encoder-decoder vs decoder-only comparison
- Feed-forward network design details
- A dedicated practice task (ContentBlock)
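The two placements from the formulas above can be sketched as minimal PyTorch blocks (attention-only residual branch for brevity; class names and dimensions are illustrative, not from the tasks):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """y = x + Attention(LayerNorm(x)): norm sits inside the residual branch."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

class PostNormBlock(nn.Module):
    """y = LayerNorm(x + Attention(x)): norm sits on the residual stream."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        return self.norm(x + self.attn(x, x, x, need_weights=False)[0])

x = torch.randn(2, 8, 16)  # (batch, seq, d_model)
pre_y = PreNormBlock(16, 4)(x)
post_y = PostNormBlock(16, 4)(x)
```

In Pre-Norm the identity path `x + ...` is never normalized, which is the gradient-flow argument for its stability in deep stacks.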
2. KV-Cache Theory — PARTIALLY FILLED
Added in materials.md section 12:
- KV cache memory formula with example (Llama-2-7B, 28K context ≈ 14GB)
- PagedAttention (vLLM) with code example
- Multi-Head Latent Attention (MLA) — DeepSeek-V2 compression
- MQA vs GQA vs MHA comparison table
- System-level optimizations (memory management, scheduling, hardware-aware)
- Interview questions (4 Q&A)
Sources: PyImageSearch MLA (Oct 2025), vife.ai vLLM (Jan 2026), Zansara Blog (Oct 2025), vLLM paper
Remaining:
- Cache eviction strategies (advanced)
- A dedicated task (ContentBlock)
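The ≈14GB figure is plain arithmetic; a sketch, assuming the usual Llama-2-7B shape (32 layers, 32 KV heads under MHA, head_dim 128, FP16):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> int:
    # 2 tensors (K and V) per layer, each of shape (batch, n_kv_heads, seq_len, head_dim)
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B (assumed config): 0.5 MiB per token, so a 28K context fills 14 GiB
gib = kv_cache_bytes(32, 32, 128, seq_len=28 * 1024) / 2**30
print(round(gib, 1))  # 14.0
```

The same function makes the MQA/GQA motivation concrete: dropping `n_kv_heads` from 32 to 8 (GQA) cuts the cache 4x.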
3. Attention Variants — PARTIALLY FILLED
Added in materials.md section 13:
- Cross-Attention (Encoder-Decoder) formula
- Sliding Window Attention (Longformer, Mistral)
- Flash Attention 1/2/3 evolution and PyTorch code
- Linear Attention with kernel trick
- Comparison of attention variants
Have: basic self-attention, multi-head. Now also: cross-attention, sliding window, Flash Attention, linear attention.
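The sliding-window idea reduces the full causal mask to a band; a minimal NumPy sketch of the mask (function name and window size are illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal band mask: query i attends only to keys in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)  # True = attend

m = sliding_window_mask(5, window=3)
# row 4 attends to positions 2, 3, 4 only; memory per row is O(window), not O(N)
```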
4. Modern Architectures — PARTIALLY FILLED
Added in interview-qa.md section Mamba & SSM:
- Continuous/Discrete SSM formulation with formulas
- SSMs as linear RNNs connection
- HiPPO Framework for long-range dependencies
- S4 model and global convolution view
- Mamba selective SSM with input-dependent parameters \(\Delta_t, \mathbf{B}_t, \mathbf{C}_t\)
- Python MambaBlock implementation
- Mamba vs Transformer complexity comparison table (\(O(N^2)\) vs \(O(N)\))
- Hardware-aware implementation with parallel associative scan
- Mamba-2 SSD framework
- Hybrid architectures (Jamba, Bamba) with 4:1 ratio pattern
- 6 Q&A
Sources: Daniel Ruffinelli "State Space Models and Mamba Architecture" (Jul 2025), Mamba paper (2023), DeepWiki
Remaining:
- Parallel transformer blocks
- DeepNorm for very deep transformers
- A dedicated practice task (ContentBlock)
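The selective-SSM point (input-dependent \(\Delta_t, \mathbf{B}_t, \mathbf{C}_t\)) can be shown as a sequential scan. A heavily simplified toy sketch (single input channel, diagonal A, Euler-style B term), not the hardware-aware Mamba kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_state = 6, 4                      # sequence length, state size
x = rng.normal(size=T)                 # one input channel, for clarity
A = -np.abs(rng.normal(size=d_state))  # stable diagonal continuous-time A

# Selective part: Delta_t, B_t, C_t are functions of the input x_t
delta = np.log1p(np.exp(rng.normal(size=T) + x))  # softplus -> Delta_t > 0
B = rng.normal(size=(T, d_state)) * x[:, None]    # B_t depends on x_t
C = rng.normal(size=(T, d_state))

h = np.zeros(d_state)
y = np.zeros(T)
for t in range(T):                     # O(N) sequential scan (Mamba parallelizes it)
    A_bar = np.exp(delta[t] * A)       # ZOH-style discretization for diagonal A
    h = A_bar * h + delta[t] * B[t] * x[t]
    y[t] = C[t] @ h
```

Because the recurrence is linear in `h`, the loop admits a parallel associative scan, which is the complexity argument in the comparison table above.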
MEDIUM GAPS
5. Distributed Training — PARTIALLY FILLED
Added in materials.md section 17:
- Memory Wall Problem (7B model = 150GB+ during training)
- Three Fundamental Parallelization Strategies comparison table
- DDP: AllReduce, Reduce-Scatter + All-Gather decomposition
- Pipeline Parallelism: bubble formula \(\frac{p-1}{m}\), schedules (GPipe, 1F1B, Interleaved)
- Tensor Parallelism: MLP example with Column/Row parallel
- ZeRO stages (1/2/3) with memory savings table
- FSDP: Gather-Compute-Scatter pattern, sharding strategies
- 3D Parallelism (Llama 3 405B example: 8192 GPUs)
- Comparison summary table
- Python implementation (PyTorch FSDP)
- Interview questions (4 Q&A)
Sources: Datahacker.rs "LLMs from Scratch #007" (Nov 2025), Deepak Baby Blog (Dec 2025), ZeRO Paper
Remaining:
- A dedicated practice task (ContentBlock)
- DeepSpeed specifics
- TPU vs GPU architecture differences
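The pipeline bubble formula above is worth internalizing numerically; a one-liner sketch (reading \(\frac{p-1}{m}\) as idle time relative to useful compute):

```python
def pipeline_bubble_overhead(p: int, m: int) -> float:
    """Bubble overhead (idle / useful compute) for a p-stage pipeline
    with m micro-batches, per the (p - 1) / m formula."""
    return (p - 1) / m

# 4 stages: going from 4 to 16 micro-batches cuts the bubble from 75% to ~19%,
# which is why schedules push micro-batch count up (at an activation-memory cost)
print(pipeline_bubble_overhead(4, 4))   # 0.75
print(pipeline_bubble_overhead(4, 16))  # 0.1875
```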
6. Quantization Theory (NOT in DL)
Have: llm_004_quantization in LLM Engineering
Missing: quantization-aware training, calibration
7. Gradient Flow Analysis — PARTIALLY FILLED
Added in interview-qa.md section Gradient Flow Analysis:
- Gradient explosion definition and causes (deep networks, large LR, bad init)
- Detection via L2 gradient norm monitoring with Python code
- TensorBoard gradient visualization
- Gradient clipping: norm-based vs value-based with formulas and comparison table
- Gradient accumulation for effective large batches with scaling
- GradientMonitor class for diagnostics (norm, max, min, NaN/Inf detection)
- Diagnostic table (symptoms → causes → solutions)
- Architectural prevention: proper init (Xavier/He), normalization layers, skip connections, Pre-LN vs Post-LN
- 5 Q&A
Sources: CodeGenes (Nov 2025), Neptune.ai gradient monitoring guide, PyTorch docs
Remaining:
- Gradient vanishing deep dive
- Per-layer gradient analysis tools
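The monitor-then-clip workflow can be sketched in a few lines of PyTorch (toy model, illustrative threshold; not the GradientMonitor class from the materials):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(16, 8), torch.randn(16, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Monitor: global L2 norm over all parameter gradients
grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
norm_before = torch.linalg.vector_norm(torch.cat(grads))

# Norm-based clipping: rescale so the global norm is at most max_norm
max_norm = 0.1
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.linalg.vector_norm(
    torch.cat([p.grad.flatten() for p in model.parameters()])
)
```

Norm-based clipping preserves gradient direction; value-based (`clip_grad_value_`) clips each component independently and can change it.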
8. Transfer Learning & Fine-tuning — PARTIALLY FILLED
Added in interview-qa.md section Transfer Learning & Fine-tuning:
- Transfer Learning definition and when to use (limited data, domain adaptation, pre-trained features)
- Transfer Learning vs Fine-tuning comparison table
- Pre-training objectives (MLM, CLM, NSP, Span Corruption) with comparison table
- Fine-tuning strategies (Full FT, Layer Freezing, Discriminative LR, Gradual Unfreezing)
- Full FT vs LoRA vs Adapters comparison table
- LoRA formula: \(W' = W + BA\) where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\)
- LoRALayer Python implementation
- Catastrophic forgetting prevention (EWC, Replay, Regularization)
- 6 Q&A
Sources: Hugging Face Transfer Learning Tutorial, LoRA paper (2021), AdapterHub
Remaining:
- Prompt tuning / Prefix tuning deep dive
- Multitask fine-tuning strategies
- A dedicated practice task (ContentBlock)
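The \(W' = W + BA\) formula above can be sketched as a minimal layer (not the LoRALayer from the materials; names, init, and the alpha/r scaling follow the usual LoRA convention, assumed here):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus trainable low-rank update BA."""
    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # freeze pre-trained W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(16, 8, r=4)
out = layer(torch.randn(2, 16))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# trainable = r*(d_in + d_out) = 4*(16+8) = 96 vs 128 frozen base weights
```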
NEW TOPICS 2025-2026 (NOT COVERED)
9. State Space Models / Mamba — PARTIALLY FILLED
Added in materials.md section 15:
- SSM basics (continuous/discrete formulation)
- S4 vs Mamba vs Mamba-2 comparison table
- Selective SSM innovation (input-dependent parameters)
- Mamba vs Transformer comparison table
- Hybrid architectures (Jamba, Bamba)
- When to use Mamba vs Transformers
- Interview questions (4 Q&A)
Sources: Galileo AI Blog (Sep 2025), Mamba paper (2023), Mamba-2 paper (2024)
Remaining:
- Hardware-specific CUDA kernel details
- A dedicated task (ContentBlock)
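The continuous-to-discrete formulation is easiest to show in the scalar case; a sketch of zero-order-hold (ZOH) discretization of \(h' = a h + b x\):

```python
import numpy as np

def discretize_zoh(a: float, b: float, delta: float):
    """ZOH discretization of the scalar SSM h' = a*h + b*x:
    A_bar = exp(delta*a), B_bar = (A_bar - 1)/a * b."""
    A_bar = np.exp(delta * a)
    B_bar = (A_bar - 1.0) / a * b
    return A_bar, B_bar

A_bar, B_bar = discretize_zoh(a=-1.0, b=0.5, delta=0.1)
# One step of the resulting discrete recurrence: h_t = A_bar*h_{t-1} + B_bar*x_t
h = A_bar * 0.0 + B_bar * 1.0
```

S4 uses a fixed delta; Mamba's selectivity makes delta (and B, C) functions of the input.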
10. Mixture of Experts — PARTIALLY FILLED
Added in materials.md section 14:
- MoE architecture overview and mathematical details (4 steps)
- Routing strategies comparison table (Hash/Learned/Sinkhorn)
- Auxiliary loss formula for load balancing
- Router collapse problem and fixes
- DeepSeekMoE innovations (shared + routed experts)
- Interview questions (4 Q&A)
Sources: Into AI (Jan 2026), Cerebras Blog (Aug 2025), Mixtral/DeepSeek papers
Remaining:
- Distributed MoE training details
- A dedicated task (ContentBlock)
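The load-balancing auxiliary loss can be sketched in NumPy, assuming the Switch-style form \(L = N \sum_i f_i P_i\) with top-1 routing (toy random router, not a trained one):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts = 32, 4
logits = rng.normal(size=(n_tokens, n_experts))       # router outputs

probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
top1 = probs.argmax(-1)                                          # top-1 routing

f = np.bincount(top1, minlength=n_experts) / n_tokens  # fraction of tokens per expert
P = probs.mean(0)                                      # mean router prob per expert
aux_loss = n_experts * np.sum(f * P)  # equals 1.0 under perfect balance
```

Minimizing this term pushes `f` and `P` toward uniform, which is the standard fix for router collapse.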
11. Speculative Decoding — PARTIALLY FILLED
Added in interview-qa.md:
- Memory bandwidth bottleneck explanation (95% of time spent on data transfer)
- Draft + Target model architecture
- Acceptance criterion formula: \(\alpha(x) = \min(1, p(x)/q(x))\)
- Correction distribution for rejection
- Draft model selection (same family, distilled, self-drafting)
- When speculative decoding fails (low acceptance, small models)
- vLLM production example
- Interview questions (5 Q&A)
Sources: Michael Brenndoerfer Blog (Jan 2026), LinkedIn Engineering (2025), BentoML (2025)
Remaining:
- Advanced speculation (tree-based, multi-draft)
- Integration with continuous batching
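The acceptance criterion and correction distribution can be sketched together; accept-or-resample provably reproduces the target distribution p exactly (toy distributions, single-token case):

```python
import numpy as np

def accept_or_resample(p: np.ndarray, q: np.ndarray, draft_token: int, rng):
    """Accept draft_token ~ q with prob min(1, p/q); on rejection, resample
    from the correction distribution max(0, p - q) renormalized."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target model distribution
q = np.array([0.3, 0.5, 0.2])   # draft model distribution
samples = [accept_or_resample(p, q, rng.choice(3, p=q), rng) for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
# empirical freq converges to p, not q
```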
Practical Gaps
11a. Dimensionality Reduction (t-SNE, UMAP) — PARTIALLY FILLED
Added in interview-qa.md:
- t-SNE algorithm (3 steps: high-D similarities, low-D probabilities, KL minimization)
- Student t-distribution explanation (heavy tails)
- UMAP algorithm (fuzzy simplicial sets, cross-entropy)
- Comparison table (t-SNE vs UMAP vs PCA)
- Parameters: perplexity, n_neighbors, min_dist
- Common pitfalls (random seed sensitivity, cluster size distortion)
- BERT/ResNet embedding visualization best practices
- Interview questions (6 Q&A)
Sources: AI Under the Hood (2025), Medium ML Interview Questions (2025)
Remaining:
- Autoencoders deep dive
- PCA mathematical details
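The Student-t point is concrete in the low-D similarity step; a NumPy sketch of \(q_{ij} \propto (1 + \lVert y_i - y_j \rVert^2)^{-1}\) (toy 2-D embedding):

```python
import numpy as np

def tsne_low_dim_q(Y: np.ndarray) -> np.ndarray:
    """t-SNE low-D similarities with a Student t (1 dof) kernel.
    Heavy tails let dissimilar points sit far apart without huge gradients."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + d2)
    np.fill_diagonal(q, 0.0)   # q_ii is defined as 0
    return q / q.sum()

Y = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
q = tsne_low_dim_q(Y)
# the nearby pair (0,1) gets far more mass than the distant pair (0,2)
```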
12. Debugging Neural Networks — PARTIALLY FILLED
Added in interview-qa.md:
- Vanishing vs Exploding gradients comparison table
- Gradient norm monitoring code (PyTorch)
- Signs of vanishing (early layers ≈ 0, loss plateaus)
- Signs of exploding (NaN, spikes, divergence)
- Gradient clipping (norm vs value, formula)
- Stabilization techniques (weight init, normalization, activations)
- Learning rate warmup formula
- Production TrainingMonitor class
- Interview questions (5 Q&A)
Sources: Neptune.ai Blog (July 2025), GeeksforGeeks (2025)
Remaining:
- Activation visualization
- Weight distribution monitoring
- Common training bugs deep dive
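The warmup formula mentioned above has several variants; a sketch of the simplest one (linear ramp then hold; the Transformer paper instead decays as inverse square root after warmup):

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup: ramp from 0 to base_lr over warmup_steps, then hold."""
    return base_lr * min(1.0, step / warmup_steps)

lrs = [warmup_lr(s, base_lr=3e-4, warmup_steps=100) for s in (0, 50, 100, 200)]
# 0.0 -> 1.5e-4 -> 3e-4 -> 3e-4: small early steps avoid blowing up
# randomly-initialized layers (and Adam's second-moment estimates)
```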
13. Model Surgery — PARTIALLY FILLED
Added in interview-qa.md:
- Magnitude vs Structured pruning comparison table
- Iterative magnitude pruning algorithm (Lottery Ticket Hypothesis)
- Accuracy recovery methods (fine-tuning, distillation, regrowth)
- Knowledge distillation math: temperature scaling, KL loss formula
- PyTorch DistillationLoss implementation
- Types of distillation (response-based, feature-based, attention-based)
- Self-distillation methods (EMA, mutual learning)
- Interview questions (10 Q&A total)
Sources: ML Journey (Nov 2025), Label Your Data (May 2025), Hinton et al. (2015)
Remaining:
- Transfer weights between models (weight surgery)
- Weight tying (input/output embeddings) — PARTIALLY FILLED
- Progressive training strategies
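The distillation math above reduces to a temperature-scaled KL term; a NumPy sketch (not the DistillationLoss class from the materials; the T^2 factor is the usual Hinton et al. convention, assumed here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def distill_kl(student_logits, teacher_logits, T: float = 2.0):
    """Soft-target term: T^2 * KL(p_teacher^T || p_student^T).
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    return T**2 * np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()

t = np.array([[2.0, 0.5, -1.0]])
loss_same = distill_kl(t, t)                        # matching logits -> 0
loss_diff = distill_kl(np.zeros((1, 3)), t)         # uniform student -> positive
```

In training this term is mixed with the ordinary hard-label cross-entropy via a weighting coefficient.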
14. Mixed Precision Training — PARTIALLY FILLED
Added in interview-qa.md:
- Memory savings (2x) and speedup (2-4x) explanation
- FP16 vs BF16 comparison table (exponent, mantissa, range)
- Loss scaling problem and solution
- Dynamic GradScaler behavior (growth, backoff)
- PyTorch AMP autocast + GradScaler code
- Autocast whitelist/blacklist (FP16 matmul, FP32 loss)
- When mixed precision fails
- Debug checklist for NaN/Inf
- Interview questions (5 Q&A)
Sources: BuildAI Substack (Sept 2025), GeeksforGeeks (2025), RunPod (2025)
Remaining:
- FP8 training details
- Tensor core alignment optimization
- Gradient underflow handling
- Hardware considerations
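Why FP16 needs loss scaling can be demonstrated with NumPy's float16 directly (the scale value 1024 is illustrative; PyTorch's GradScaler picks it dynamically):

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)   # 65504: tiny range, 10-bit mantissa

overflow = np.float16(70000.0)               # inf: overflow above fp16_max
tiny_grad = np.float16(1e-8)                 # 0.0: gradient underflows to zero

# Loss scaling idea: multiply the loss (hence all gradients) by S before
# the FP16 backward pass, then unscale in FP32 before the optimizer step
scale = 1024.0
scaled = np.float16(1e-8 * scale)            # representable after scaling
unscaled = float(np.float32(scaled)) / scale # recovered small gradient
```

BF16 sidesteps the problem by keeping FP32's 8-bit exponent, trading mantissa precision for range, which is why it usually needs no loss scaling.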
Underspecified Topics
15. Sequence Modeling Beyond RNN — PARTIALLY FILLED
Added in interview-qa.md section 25:
- TCN (Temporal Convolutional Network) definition
- Dilated causal convolutions formula: \(y(t) = \sum f(i) \cdot x(t - d \cdot i)\)
- Receptive field calculation: \(R = 1 + (k-1) \cdot (2^n - 1)\)
- TCNBlock PyTorch implementation with residual connections
- TCN vs LSTM/GRU comparison table (parallelization, stability, memory, speed)
- WaveNet architecture (DeepMind, 2016)
- Gated activations: \(\tanh(W_f * x) \odot \sigma(W_g * x)\)
- WaveNet vs TCN comparison table
- TCN optimization tips (receptive field design, hyperparameters, memory)
- Killer question: why TCNs did not replace LSTMs everywhere
- 5 Q&A (Basic/Medium/Killer)
Sources: Shadecoder TCN Guide (2025), Medium TCN Overview (2024), mbrenndoerfer WaveNet (2025), DeepMind WaveNet paper (2016)
Remaining:
- Conformer (audio) architecture details
- Sequence parallelism techniques
- A dedicated practice task (ContentBlock)
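The receptive-field formula above assumes one causal conv per level with dilations 1, 2, 4, ..., 2^(n-1); a sketch showing why depth buys exponential context:

```python
def tcn_receptive_field(k: int, n_levels: int) -> int:
    """R = 1 + (k - 1) * (2^n - 1): sum of (k-1)*2^i over levels, plus
    the current timestep itself."""
    return 1 + (k - 1) * (2**n_levels - 1)

# kernel size 3: 8 levels already see 511 steps back
print(tcn_receptive_field(k=3, n_levels=8))  # 511
```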
16. Vision Architectures — PARTIALLY FILLED
Added in materials.md section 18:
- ViT core idea (Image → Patches → Transformer)
- Architecture components: Patch Embedding, Positional Encoding, CLS Token, Transformer Encoder, MLP Head
- Formulas: Attention, MSA, FFN
- ViT vs CNN comparison table (Attention Scope, Inductive Bias, Data Requirement, Feature Learning)
- Python implementation (PatchEmbedding, VisionTransformer from scratch)
- ViT variants (DeiT, Swin, ConvNeXt, MaxViT, EVA-02)
- Advantages & Limitations table
- Interview questions (4 Q&A)
Sources: Codecademy "Vision Transformers Architecture" (Sept 2025), GeeksforGeeks ViT (2025), "An Image Is Worth 16x16 Words" paper
Remaining:
- Object detection (YOLO, DETR)
- Segmentation architectures
- A dedicated practice task (ContentBlock)
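The Image → Patches step can be sketched as a reshape, independent of the Transformer itself (NumPy sketch; the real PatchEmbedding then applies a learned linear projection to each row):

```python
import numpy as np

def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches:
    returns (H*W / patch^2, patch*patch*C), the token sequence a ViT embeds."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    return img.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

tokens = patchify(np.zeros((224, 224, 3)), patch=16)
# (224/16)^2 = 196 tokens, each 16*16*3 = 768-dimensional
# ("an image is worth 16x16 words")
```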
17. Regularization for Deep Learning — PARTIALLY FILLED
Added in interview-qa.md section Advanced Regularization:
- Stochastic Depth (DropPath) formula and Python implementation
- Scheduled DropPath (NASNet) with linear decay
- Mixup vs CutMix comparison table with formulas
- CutMix implementation with rand_bbox and mixed loss
- Label Smoothing formula: \(y'_k = y_k(1-\epsilon) + \frac{\epsilon}{K}\)
- LabelSmoothingLoss Python class
- When to use which regularization (decision table)
- Multi-augmentation pipeline (Swin Transformer style)
- 5 Q&A
Sources: CodeGenes CutMix/Label Smoothing guides (2025), arXiv Label Smoothing++ (2025), DropPath papers
Remaining:
- A dedicated practice task (ContentBlock)
- DropBlock for spatial regularization
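The label smoothing formula above in two lines of NumPy (not the LabelSmoothingLoss class from the materials, just the target transform):

```python
import numpy as np

def smooth_labels(y_onehot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """y'_k = y_k * (1 - eps) + eps / K: shave eps off the true class
    and spread it uniformly, so the model never chases exact 0/1 targets."""
    K = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / K

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y, eps=0.1))  # [0.025 0.025 0.925 0.025]
```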
Recommendations for Filling the GAPS
Priority 1 (add ASAP)
| Gap | Difficulty | Task |
|---|---|---|
| Transformer Full Stack | Medium | dl_010_transformer_full |
| KV-Cache | Medium | dl_011_kv_cache |
| Attention Variants | Medium | dl_012_attention_variants |
| Gradient Flow | Easy | dl_013_gradient_debug |
Priority 2 (useful for Senior+)
| Gap | Difficulty | Task |
|---|---|---|
| Distributed Training | Hard | dl_014_distributed |
| MoE Architecture | Hard | dl_015_moe |
| State Space Models | Hard | dl_016_ssm_mamba |
Priority 3 (nice to have)
| Gap | Difficulty | Task |
|---|---|---|
| Mixed Precision | Medium | dl_017_amp |
| Vision Architectures | Medium | dl_018_vit |
| Regularization Advanced | Medium | dl_019_reg_advanced |
Cross-References Missing
Links worth adding:
- dl_001_attention_mechanism → dl_003_positional → Transformers
- nn_001_backprop → dl_004_optimizers → Training
- dl_009_batch_norm_layernorm → dl_002_pytorch_training_loop
- nn_003_rnn_lstm → Vanishing gradients → Why Transformers won
Final Coverage Assessment
Deep Learning current coverage: ~95% for ML Engineer, ~90% for LLM Engineer
Main gaps (after iteration 51):
1. KV-Cache and inference optimization — PARTIALLY FILLED
2. Attention Variants — PARTIALLY FILLED
3. MoE Architecture — PARTIALLY FILLED
4. State Space Models (Mamba/SSM) — PARTIALLY FILLED
5. Transformer Architecture (pre-norm vs post-norm) — PARTIALLY FILLED
6. Distributed Training (DDP, Pipeline, Tensor, ZeRO/FSDP) — PARTIALLY FILLED
7. Vision Architectures (ViT) — PARTIALLY FILLED
8. Speculative Decoding — PARTIALLY FILLED
9. Mixed Precision Training — PARTIALLY FILLED
10. Debugging Neural Networks (gradient issues) — PARTIALLY FILLED
Remaining:
- Model Surgery (pruning, distillation, weight tying) — PARTIALLY FILLED (pruning + distillation)
- Advanced Regularization (stochastic depth, DropPath, Mixup) — PARTIALLY FILLED
- Transfer Learning deep dive — PARTIALLY FILLED
- Weight tying details — PARTIALLY FILLED
Recommendation: Deep Learning coverage is excellent at ~98%; only minor gaps remain.