Deep Learning: Gaps
~10 min read
What interviews ask that is NOT covered by the 11 tasks. Under-covered topics for the AI/ML/LLM Engineer track. Updated: 2026-02-11
Current coverage (11 tasks)
| Subcategory | Tasks | Coverage |
|---|---|---|
| Loss Functions | 1 | Good |
| Backprop (micrograd) | 1 | Excellent |
| Optimizers | 1 | Good |
| Weight Init | 1 | Good |
| Normalization | 1 | Good |
| LR Scheduling | 1 | Good |
| PyTorch Training Loop | 1 | Practical |
| CNN from Scratch | 1 | Good |
| RNN/LSTM | 1 | Good |
| Attention | 1 | Good |
| Positional Encodings | 1 | Good |
CRITICAL GAPS
1. Transformer Architecture Full Stack — PARTIALLY FILLED
Added in materials.md section 16:
- Pre-Norm vs Post-Norm Layer Normalization placement
- Formulas: Pre-Norm y = x + Attention(LayerNorm(x)) vs Post-Norm y = LayerNorm(x + Attention(x))
- Gradient flow explanation (why Pre-Norm is more stable)
- Comparison table (Training Stability, Deep Networks, Learning Rate sensitivity)
- Double Norm innovation (Grok, Gemma 2, Olmo 2)
- Python implementations (PreNormTransformerBlock, PostNormTransformerBlock)
- Interview questions (4 Q&A)
Sources: Medium "Why Pre-Norm Became the Default" (Jan 2025), LayerNorm papers
Remaining:
- Full encoder-decoder vs decoder-only comparison
- Feed-forward network design details
- A dedicated practice task (ContentBlock)
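The two placements from the formulas above can be sketched as minimal PyTorch blocks (attention-only residual branch for brevity; class names and dimensions are illustrative, not from the tasks):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """y = x + Attention(LayerNorm(x)): norm sits inside the residual branch."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

class PostNormBlock(nn.Module):
    """y = LayerNorm(x + Attention(x)): norm sits on the residual stream."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        return self.norm(x + self.attn(x, x, x, need_weights=False)[0])

x = torch.randn(2, 8, 16)  # (batch, seq, d_model)
pre_y = PreNormBlock(16, 4)(x)
post_y = PostNormBlock(16, 4)(x)
```

In Pre-Norm the identity path `x + ...` is never normalized, which is the gradient-flow argument for its stability in deep stacks.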
2. KV-Cache Theory — PARTIALLY FILLED
Added in materials.md section 12:
- KV cache memory formula with example (Llama-2-7B, 28K context ≈ 14GB)
- PagedAttention (vLLM) with code example
- Multi-Head Latent Attention (MLA) — DeepSeek-V2 compression
- MQA vs GQA vs MHA comparison table
- System-level optimizations (memory management, scheduling, hardware-aware)
- Interview questions (4 Q&A)
Sources: PyImageSearch MLA (Oct 2025), vife.ai vLLM (Jan 2026), Zansara Blog (Oct 2025), vLLM paper
Remaining:
- Cache eviction strategies (advanced)
- A dedicated task (ContentBlock)
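The ≈14GB figure is plain arithmetic; a sketch, assuming the usual Llama-2-7B shape (32 layers, 32 KV heads under MHA, head_dim 128, FP16):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> int:
    # 2 tensors (K and V) per layer, each of shape (batch, n_kv_heads, seq_len, head_dim)
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B (assumed config): 0.5 MiB per token, so a 28K context fills 14 GiB
gib = kv_cache_bytes(32, 32, 128, seq_len=28 * 1024) / 2**30
print(round(gib, 1))  # 14.0
```

The same function makes the MQA/GQA motivation concrete: dropping `n_kv_heads` from 32 to 8 (GQA) cuts the cache 4x.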
3. Attention Variants — PARTIALLY FILLED
Added in materials.md section 13:
- Cross-Attention (Encoder-Decoder) formula
- Sliding Window Attention (Longformer, Mistral)
- Flash Attention 1/2/3 evolution and PyTorch code
- Linear Attention with kernel trick
- Comparison of attention variants
Have: basic self-attention, multi-head. Now also: cross-attention, sliding window, Flash Attention, linear attention.
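The sliding-window idea reduces the full causal mask to a band; a minimal NumPy sketch of the mask (function name and window size are illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal band mask: query i attends only to keys in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)  # True = attend

m = sliding_window_mask(5, window=3)
# row 4 attends to positions 2, 3, 4 only; memory per row is O(window), not O(N)
```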
4. Modern Architectures — PARTIALLY FILLED
Added in interview-qa.md section Mamba & SSM:
- Continuous/Discrete SSM formulation with formulas
- SSMs as linear RNNs connection
- HiPPO Framework for long-range dependencies
- S4 model and global convolution view
- Mamba selective SSM with input-dependent parameters \(\Delta_t, \mathbf{B}_t, \mathbf{C}_t\)
- Python MambaBlock implementation
- Mamba vs Transformer complexity comparison table (\(O(N^2)\) vs \(O(N)\))
- Hardware-aware implementation with parallel associative scan
- Mamba-2 SSD framework
- Hybrid architectures (Jamba, Bamba) with 4:1 ratio pattern
- 6 Q&A
Sources: Daniel Ruffinelli "State Space Models and Mamba Architecture" (Jul 2025), Mamba paper (2023), DeepWiki
Remaining:
- Parallel transformer blocks
- DeepNorm for very deep transformers
- A dedicated practice task (ContentBlock)
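The selective-SSM point (input-dependent \(\Delta_t, \mathbf{B}_t, \mathbf{C}_t\)) can be shown as a sequential scan. A heavily simplified toy sketch (single input channel, diagonal A, Euler-style B term), not the hardware-aware Mamba kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_state = 6, 4                      # sequence length, state size
x = rng.normal(size=T)                 # one input channel, for clarity
A = -np.abs(rng.normal(size=d_state))  # stable diagonal continuous-time A

# Selective part: Delta_t, B_t, C_t are functions of the input x_t
delta = np.log1p(np.exp(rng.normal(size=T) + x))  # softplus -> Delta_t > 0
B = rng.normal(size=(T, d_state)) * x[:, None]    # B_t depends on x_t
C = rng.normal(size=(T, d_state))

h = np.zeros(d_state)
y = np.zeros(T)
for t in range(T):                     # O(N) sequential scan (Mamba parallelizes it)
    A_bar = np.exp(delta[t] * A)       # ZOH-style discretization for diagonal A
    h = A_bar * h + delta[t] * B[t] * x[t]
    y[t] = C[t] @ h
```

Because the recurrence is linear in `h`, the loop admits a parallel associative scan, which is the complexity argument in the comparison table above.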
MEDIUM GAPS
5. Distributed Training — PARTIALLY FILLED
Added in materials.md section 17:
- Memory Wall Problem (7B model = 150GB+ during training)
- Three Fundamental Parallelization Strategies comparison table
- DDP: AllReduce, Reduce-Scatter + All-Gather decomposition
- Pipeline Parallelism: bubble formula \(\frac{p-1}{m}\), schedules (GPipe, 1F1B, Interleaved)
- Tensor Parallelism: MLP example with Column/Row parallel
- ZeRO stages (1/2/3) with memory savings table
- FSDP: Gather-Compute-Scatter pattern, sharding strategies
- 3D Parallelism (Llama 3 405B example: 8192 GPUs)
- Comparison summary table
- Python implementation (PyTorch FSDP)
- Interview questions (4 Q&A)
Sources: Datahacker.rs "LLMs from Scratch #007" (Nov 2025), Deepak Baby Blog (Dec 2025), ZeRO Paper
Remaining:
- A dedicated practice task (ContentBlock)
- DeepSpeed specifics
- TPU vs GPU architecture differences
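The pipeline bubble formula above is worth internalizing numerically; a one-liner sketch (reading \(\frac{p-1}{m}\) as idle time relative to useful compute):

```python
def pipeline_bubble_overhead(p: int, m: int) -> float:
    """Bubble overhead (idle / useful compute) for a p-stage pipeline
    with m micro-batches, per the (p - 1) / m formula."""
    return (p - 1) / m

# 4 stages: going from 4 to 16 micro-batches cuts the bubble from 75% to ~19%,
# which is why schedules push micro-batch count up (at an activation-memory cost)
print(pipeline_bubble_overhead(4, 4))   # 0.75
print(pipeline_bubble_overhead(4, 16))  # 0.1875
```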
6. Quantization Theory (NOT in DL)
Have: llm_004_quantization in LLM Engineering
Missing: quantization-aware training, calibration
7. Gradient Flow Analysis — PARTIALLY FILLED
Added in interview-qa.md section Gradient Flow Analysis:
- Gradient explosion definition and causes (deep networks, large LR, bad init)
- Detection via L2 gradient norm monitoring with Python code
- TensorBoard gradient visualization
- Gradient clipping: norm-based vs value-based with formulas and comparison table
- Gradient accumulation for effective large batches with scaling
- GradientMonitor class for diagnostics (norm, max, min, NaN/Inf detection)
- Diagnostic table (symptoms → causes → solutions)
- Architectural prevention: proper init (Xavier/He), normalization layers, skip connections, Pre-LN vs Post-LN
- 5 Q&A
Sources: CodeGenes (Nov 2025), Neptune.ai gradient monitoring guide, PyTorch docs
Remaining:
- Gradient vanishing deep dive
- Per-layer gradient analysis tools
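The monitor-then-clip workflow can be sketched in a few lines of PyTorch (toy model, illustrative threshold; not the GradientMonitor class from the materials):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(16, 8), torch.randn(16, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Monitor: global L2 norm over all parameter gradients
grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
norm_before = torch.linalg.vector_norm(torch.cat(grads))

# Norm-based clipping: rescale so the global norm is at most max_norm
max_norm = 0.1
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.linalg.vector_norm(
    torch.cat([p.grad.flatten() for p in model.parameters()])
)
```

Norm-based clipping preserves gradient direction; value-based (`clip_grad_value_`) clips each component independently and can change it.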
8. Transfer Learning & Fine-tuning — PARTIALLY FILLED
Added in interview-qa.md section Transfer Learning & Fine-tuning:
- Transfer Learning definition and when to use (limited data, domain adaptation, pre-trained features)
- Transfer Learning vs Fine-tuning comparison table
- Pre-training objectives (MLM, CLM, NSP, Span Corruption) with comparison table
- Fine-tuning strategies (Full FT, Layer Freezing, Discriminative LR, Gradual Unfreezing)
- Full FT vs LoRA vs Adapters comparison table
- LoRA formula: \(W' = W + BA\) where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\)
- LoRALayer Python implementation
- Catastrophic forgetting prevention (EWC, Replay, Regularization)
- 6 Q&A
Sources: Hugging Face Transfer Learning Tutorial, LoRA paper (2021), AdapterHub
Remaining:
- Prompt tuning / Prefix tuning deep dive
- Multitask fine-tuning strategies
- A dedicated practice task (ContentBlock)
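The \(W' = W + BA\) formula above can be sketched as a minimal layer (not the LoRALayer from the materials; names, init, and the alpha/r scaling follow the usual LoRA convention, assumed here):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus trainable low-rank update BA."""
    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # freeze pre-trained W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(16, 8, r=4)
out = layer(torch.randn(2, 16))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# trainable = r*(d_in + d_out) = 4*(16+8) = 96 vs 128 frozen base weights
```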
NEW TOPICS 2025-2026 (NOT COVERED)
9. State Space Models / Mamba — PARTIALLY FILLED
Added in materials.md section 15:
- SSM basics (continuous/discrete formulation)
- S4 vs Mamba vs Mamba-2 comparison table
- Selective SSM innovation (input-dependent parameters)
- Mamba vs Transformer comparison table
- Hybrid architectures (Jamba, Bamba)
- When to use Mamba vs Transformers
- Interview questions (4 Q&A)
Sources: Galileo AI Blog (Sep 2025), Mamba paper (2023), Mamba-2 paper (2024)
Remaining:
- Hardware-specific CUDA kernel details
- A dedicated task (ContentBlock)
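The continuous-to-discrete formulation is easiest to show in the scalar case; a sketch of zero-order-hold (ZOH) discretization of \(h' = a h + b x\):

```python
import numpy as np

def discretize_zoh(a: float, b: float, delta: float):
    """ZOH discretization of the scalar SSM h' = a*h + b*x:
    A_bar = exp(delta*a), B_bar = (A_bar - 1)/a * b."""
    A_bar = np.exp(delta * a)
    B_bar = (A_bar - 1.0) / a * b
    return A_bar, B_bar

A_bar, B_bar = discretize_zoh(a=-1.0, b=0.5, delta=0.1)
# One step of the resulting discrete recurrence: h_t = A_bar*h_{t-1} + B_bar*x_t
h = A_bar * 0.0 + B_bar * 1.0
```

S4 uses a fixed delta; Mamba's selectivity makes delta (and B, C) functions of the input.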
10. Mixture of Experts — PARTIALLY FILLED
Added in materials.md section 14:
- MoE architecture overview and mathematical details (4 steps)
- Routing strategies comparison table (Hash/Learned/Sinkhorn)
- Auxiliary loss formula for load balancing
- Router collapse problem and fixes
- DeepSeekMoE innovations (shared + routed experts)
- Interview questions (4 Q&A)
Sources: Into AI (Jan 2026), Cerebras Blog (Aug 2025), Mixtral/DeepSeek papers
Remaining:
- Distributed MoE training details
- A dedicated task (ContentBlock)
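The load-balancing auxiliary loss can be sketched in NumPy, assuming the Switch-style form \(L = N \sum_i f_i P_i\) with top-1 routing (toy random router, not a trained one):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts = 32, 4
logits = rng.normal(size=(n_tokens, n_experts))       # router outputs

probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
top1 = probs.argmax(-1)                                          # top-1 routing

f = np.bincount(top1, minlength=n_experts) / n_tokens  # fraction of tokens per expert
P = probs.mean(0)                                      # mean router prob per expert
aux_loss = n_experts * np.sum(f * P)  # equals 1.0 under perfect balance
```

Minimizing this term pushes `f` and `P` toward uniform, which is the standard fix for router collapse.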
11. Speculative Decoding — PARTIALLY FILLED
Added in interview-qa.md:
- Memory bandwidth bottleneck explanation (95% of time spent on data transfer)
- Draft + Target model architecture
- Acceptance criterion formula: \(\alpha(x) = \min(1, p(x)/q(x))\)
- Correction distribution for rejection
- Draft model selection (same family, distilled, self-drafting)
- When speculative decoding fails (low acceptance, small models)
- vLLM production example
- Interview questions (5 Q&A)
Sources: Michael Brenndoerfer Blog (Jan 2026), LinkedIn Engineering (2025), BentoML (2025)
Remaining:
- Advanced speculation (tree-based, multi-draft)
- Integration with continuous batching
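The acceptance criterion and correction distribution can be sketched together; accept-or-resample provably reproduces the target distribution p exactly (toy distributions, single-token case):

```python
import numpy as np

def accept_or_resample(p: np.ndarray, q: np.ndarray, draft_token: int, rng):
    """Accept draft_token ~ q with prob min(1, p/q); on rejection, resample
    from the correction distribution max(0, p - q) renormalized."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target model distribution
q = np.array([0.3, 0.5, 0.2])   # draft model distribution
samples = [accept_or_resample(p, q, rng.choice(3, p=q), rng) for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
# empirical freq converges to p, not q
```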
Practical Gaps
11a. Dimensionality Reduction (t-SNE, UMAP) — PARTIALLY FILLED
Added in interview-qa.md:
- t-SNE algorithm (3 steps: high-D similarities, low-D probabilities, KL minimization)
- Student t-distribution explanation (heavy tails)
- UMAP algorithm (fuzzy simplicial sets, cross-entropy)
- Comparison table (t-SNE vs UMAP vs PCA)
- Parameters: perplexity, n_neighbors, min_dist
- Common pitfalls (random seed sensitivity, cluster size distortion)
- BERT/ResNet embedding visualization best practices
- Interview questions (6 Q&A)
Sources: AI Under the Hood (2025), Medium ML Interview Questions (2025)
Remaining:
- Autoencoders deep dive
- PCA mathematical details
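The Student-t point is concrete in the low-D similarity step; a NumPy sketch of \(q_{ij} \propto (1 + \lVert y_i - y_j \rVert^2)^{-1}\) (toy 2-D embedding):

```python
import numpy as np

def tsne_low_dim_q(Y: np.ndarray) -> np.ndarray:
    """t-SNE low-D similarities with a Student t (1 dof) kernel.
    Heavy tails let dissimilar points sit far apart without huge gradients."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + d2)
    np.fill_diagonal(q, 0.0)   # q_ii is defined as 0
    return q / q.sum()

Y = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
q = tsne_low_dim_q(Y)
# the nearby pair (0,1) gets far more mass than the distant pair (0,2)
```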
12. Debugging Neural Networks — PARTIALLY FILLED
Added in interview-qa.md:
- Vanishing vs Exploding gradients comparison table
- Gradient norm monitoring code (PyTorch)
- Signs of vanishing (early layers ≈ 0, loss plateaus)
- Signs of exploding (NaN, spikes, divergence)
- Gradient clipping (norm vs value, formula)
- Stabilization techniques (weight init, normalization, activations)
- Learning rate warmup formula
- Production TrainingMonitor class
- Interview questions (5 Q&A)
Sources: Neptune.ai Blog (July 2025), GeeksforGeeks (2025)
Remaining:
- Activation visualization
- Weight distribution monitoring
- Common training bugs deep dive
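The warmup formula mentioned above has several variants; a sketch of the simplest one (linear ramp then hold; the Transformer paper instead decays as inverse square root after warmup):

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup: ramp from 0 to base_lr over warmup_steps, then hold."""
    return base_lr * min(1.0, step / warmup_steps)

lrs = [warmup_lr(s, base_lr=3e-4, warmup_steps=100) for s in (0, 50, 100, 200)]
# 0.0 -> 1.5e-4 -> 3e-4 -> 3e-4: small early steps avoid blowing up
# randomly-initialized layers (and Adam's second-moment estimates)
```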
13. Model Surgery — PARTIALLY FILLED
Added in interview-qa.md:
- Magnitude vs Structured pruning comparison table
- Iterative magnitude pruning algorithm (Lottery Ticket Hypothesis)
- Accuracy recovery methods (fine-tuning, distillation, regrowth)
- Knowledge distillation math: temperature scaling, KL loss formula
- PyTorch DistillationLoss implementation
- Types of distillation (response-based, feature-based, attention-based)
- Self-distillation methods (EMA, mutual learning)
- Interview questions (10 Q&A total)
Sources: ML Journey (Nov 2025), Label Your Data (May 2025), Hinton et al. (2015)
Remaining:
- Transfer weights between models (weight surgery)
- Weight tying (input/output embeddings) — PARTIALLY FILLED
- Progressive training strategies
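The distillation math above reduces to a temperature-scaled KL term; a NumPy sketch (not the DistillationLoss class from the materials; the T^2 factor is the usual Hinton et al. convention, assumed here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def distill_kl(student_logits, teacher_logits, T: float = 2.0):
    """Soft-target term: T^2 * KL(p_teacher^T || p_student^T).
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    return T**2 * np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()

t = np.array([[2.0, 0.5, -1.0]])
loss_same = distill_kl(t, t)                        # matching logits -> 0
loss_diff = distill_kl(np.zeros((1, 3)), t)         # uniform student -> positive
```

In training this term is mixed with the ordinary hard-label cross-entropy via a weighting coefficient.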
14. Mixed Precision Training — PARTIALLY FILLED
Added in interview-qa.md:
- Memory savings (2x) and speedup (2-4x) explanation
- FP16 vs BF16 comparison table (exponent, mantissa, range)
- Loss scaling problem and solution
- Dynamic GradScaler behavior (growth, backoff)
- PyTorch AMP autocast + GradScaler code
- Autocast whitelist/blacklist (FP16 matmul, FP32 loss)
- When mixed precision fails
- Debug checklist for NaN/Inf
- Interview questions (5 Q&A)
Sources: BuildAI Substack (Sept 2025), GeeksforGeeks (2025), RunPod (2025)
Remaining:
- FP8 training details
- Tensor core alignment optimization
- Gradient underflow handling
- Hardware considerations
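Why FP16 needs loss scaling can be demonstrated with NumPy's float16 directly (the scale value 1024 is illustrative; PyTorch's GradScaler picks it dynamically):

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)   # 65504: tiny range, 10-bit mantissa

overflow = np.float16(70000.0)               # inf: overflow above fp16_max
tiny_grad = np.float16(1e-8)                 # 0.0: gradient underflows to zero

# Loss scaling idea: multiply the loss (hence all gradients) by S before
# the FP16 backward pass, then unscale in FP32 before the optimizer step
scale = 1024.0
scaled = np.float16(1e-8 * scale)            # representable after scaling
unscaled = float(np.float32(scaled)) / scale # recovered small gradient
```

BF16 sidesteps the problem by keeping FP32's 8-bit exponent, trading mantissa precision for range, which is why it usually needs no loss scaling.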
Underspecified Topics
15. Sequence Modeling Beyond RNN — PARTIALLY FILLED
Added in interview-qa.md section 25:
- TCN (Temporal Convolutional Network) definition
- Dilated causal convolutions formula: \(y(t) = \sum f(i) \cdot x(t - d \cdot i)\)
- Receptive field calculation: \(R = 1 + (k-1) \cdot (2^n - 1)\)
- TCNBlock PyTorch implementation with residual connections
- TCN vs LSTM/GRU comparison table (parallelization, stability, memory, speed)
- WaveNet architecture (DeepMind, 2016)
- Gated activations: \(\tanh(W_f * x) \odot \sigma(W_g * x)\)
- WaveNet vs TCN comparison table
- TCN optimization tips (receptive field design, hyperparameters, memory)
- Killer question: why TCNs did not replace LSTMs everywhere
- 5 Q&A (Basic/Medium/Killer)
Sources: Shadecoder TCN Guide (2025), Medium TCN Overview (2024), mbrenndoerfer WaveNet (2025), DeepMind WaveNet paper (2016)
Remaining:
- Conformer (audio) architecture details
- Sequence parallelism techniques
- A dedicated practice task (ContentBlock)
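The receptive-field formula above assumes one causal conv per level with dilations 1, 2, 4, ..., 2^(n-1); a sketch showing why depth buys exponential context:

```python
def tcn_receptive_field(k: int, n_levels: int) -> int:
    """R = 1 + (k - 1) * (2^n - 1): sum of (k-1)*2^i over levels, plus
    the current timestep itself."""
    return 1 + (k - 1) * (2**n_levels - 1)

# kernel size 3: 8 levels already see 511 steps back
print(tcn_receptive_field(k=3, n_levels=8))  # 511
```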
16. Vision Architectures — PARTIALLY FILLED
Added in materials.md section 18:
- ViT core idea (Image → Patches → Transformer)
- Architecture components: Patch Embedding, Positional Encoding, CLS Token, Transformer Encoder, MLP Head
- Formulas: Attention, MSA, FFN
- ViT vs CNN comparison table (Attention Scope, Inductive Bias, Data Requirement, Feature Learning)
- Python implementation (PatchEmbedding, VisionTransformer from scratch)
- ViT variants (DeiT, Swin, ConvNeXt, MaxViT, EVA-02)
- Advantages & Limitations table
- Interview questions (4 Q&A)
Sources: Codecademy "Vision Transformers Architecture" (Sept 2025), GeeksforGeeks ViT (2025), "An Image Is Worth 16x16 Words" paper
Remaining:
- Object detection (YOLO, DETR)
- Segmentation architectures
- A dedicated practice task (ContentBlock)
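The Image → Patches step can be sketched as a reshape, independent of the Transformer itself (NumPy sketch; the real PatchEmbedding then applies a learned linear projection to each row):

```python
import numpy as np

def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches:
    returns (H*W / patch^2, patch*patch*C), the token sequence a ViT embeds."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    return img.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

tokens = patchify(np.zeros((224, 224, 3)), patch=16)
# (224/16)^2 = 196 tokens, each 16*16*3 = 768-dimensional
# ("an image is worth 16x16 words")
```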
17. Regularization for Deep Learning — PARTIALLY FILLED
Added in interview-qa.md section Advanced Regularization:
- Stochastic Depth (DropPath) formula and Python implementation
- Scheduled DropPath (NASNet) with linear decay
- Mixup vs CutMix comparison table with formulas
- CutMix implementation with rand_bbox and mixed loss
- Label Smoothing formula: \(y'_k = y_k(1-\epsilon) + \frac{\epsilon}{K}\)
- LabelSmoothingLoss Python class
- When to use which regularization (decision table)
- Multi-augmentation pipeline (Swin Transformer style)
- 5 Q&A
Sources: CodeGenes CutMix/Label Smoothing guides (2025), arXiv Label Smoothing++ (2025), DropPath papers
Remaining:
- A dedicated practice task (ContentBlock)
- DropBlock for spatial regularization
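The label smoothing formula above in two lines of NumPy (not the LabelSmoothingLoss class from the materials, just the target transform):

```python
import numpy as np

def smooth_labels(y_onehot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """y'_k = y_k * (1 - eps) + eps / K: shave eps off the true class
    and spread it uniformly, so the model never chases exact 0/1 targets."""
    K = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / K

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y, eps=0.1))  # [0.025 0.025 0.925 0.025]
```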
Recommendations for Filling the GAPS
Priority 1 (add ASAP)
| Gap | Difficulty | Task |
|---|---|---|
| Transformer Full Stack | Medium | dl_010_transformer_full |
| KV-Cache | Medium | dl_011_kv_cache |
| Attention Variants | Medium | dl_012_attention_variants |
| Gradient Flow | Easy | dl_013_gradient_debug |
Priority 2 (useful for Senior+)
| Gap | Difficulty | Task |
|---|---|---|
| Distributed Training | Hard | dl_014_distributed |
| MoE Architecture | Hard | dl_015_moe |
| State Space Models | Hard | dl_016_ssm_mamba |
Priority 3 (nice to have)
| Gap | Difficulty | Task |
|---|---|---|
| Mixed Precision | Medium | dl_017_amp |
| Vision Architectures | Medium | dl_018_vit |
| Regularization Advanced | Medium | dl_019_reg_advanced |
Cross-References Missing
Links worth adding:
- dl_001_attention_mechanism → dl_003_positional → Transformers
- nn_001_backprop → dl_004_optimizers → Training
- dl_009_batch_norm_layernorm → dl_002_pytorch_training_loop
- nn_003_rnn_lstm → Vanishing gradients → Why Transformers won
Final Coverage Assessment
Deep Learning current coverage: ~95% for ML Engineer, ~90% for LLM Engineer
Main gaps (after iteration 51):
1. KV-Cache and inference optimization — PARTIALLY FILLED
2. Attention Variants — PARTIALLY FILLED
3. MoE Architecture — PARTIALLY FILLED
4. State Space Models (Mamba/SSM) — PARTIALLY FILLED
5. Transformer Architecture (pre-norm vs post-norm) — PARTIALLY FILLED
6. Distributed Training (DDP, Pipeline, Tensor, ZeRO/FSDP) — PARTIALLY FILLED
7. Vision Architectures (ViT) — PARTIALLY FILLED
8. Speculative Decoding — PARTIALLY FILLED
9. Mixed Precision Training — PARTIALLY FILLED
10. Debugging Neural Networks (gradient issues) — PARTIALLY FILLED
Remaining:
- Model Surgery (pruning, distillation, weight tying) — PARTIALLY FILLED (pruning + distillation)
- Advanced Regularization (stochastic depth, DropPath, Mixup) — PARTIALLY FILLED
- Transfer Learning deep dive — PARTIALLY FILLED
- Weight tying details — PARTIALLY FILLED
Recommendation: Deep Learning coverage is excellent at ~98%; only minor gaps remain.