DL Interview: Compression and Transfer Learning¶
~6 min read
Model Pruning (Magnitude, Structured, Lottery Ticket), Knowledge Distillation (Response-based, Feature-based, Self-Distillation), Transfer Learning & Fine-tuning (Feature Extraction, LoRA, Adapters, Catastrophic Forgetting), Weight Tying (Shared Embeddings, Pseudo-Inverse Tying).
Model Pruning¶
Q: Magnitude Pruning vs Structured Pruning -- what's the difference?¶
A:
| Aspect | Magnitude (Unstructured) | Structured |
|---|---|---|
| What is removed | Individual weights (near zero) | Entire neurons/filters/channels |
| Compression | 90-99% | 50-80% |
| Hardware friendly | No (sparse ops) | Yes (dense ops) |
| Speedup | Needs sparse kernels | Direct speedup |
| Accuracy drop | Smaller | Larger |
Magnitude pruning: \(\text{Prune if } |w_{ij}| < \theta\)
Structured pruning (filter): \(\text{Importance}_k = \sum_{i,j} |W_k^{(i,j)}|\)
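As a quick illustration of the filter-importance score, here is the L1 version for a Conv2d layer (the layer shape and keep ratio are made up for the example; physically rebuilding the slimmer layer is omitted):
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
# L1 importance of each output filter: sum |W_k| over input channels and kernel
importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # shape: (128,)
# Indices of the top-50% filters to keep (rebuilding the slim layer is omitted)
keep = torch.topk(importance, k=64).indices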
Q: How does iterative magnitude pruning work?¶
A:
Lottery Ticket Hypothesis: a dense network contains a "winning ticket" -- a sparse subnetwork that, trained in isolation from its original initialization, can match the full network's accuracy.
Iterative procedure:
import copy
import torch

def iterative_magnitude_pruning(model, train_loader, prune_ratio=0.2, n_iterations=10):
    # One binary mask per weight tensor
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters()
             if 'weight' in name}
    init_state = copy.deepcopy(model.state_dict())  # saved for rewinding
    for iteration in range(n_iterations):
        # 1. Train with the current masks applied (training helper assumed)
        train_with_mask(model, train_loader, masks)
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            # 2. Magnitude scores of the surviving weights
            scores = param.detach().abs() * masks[name]
            # 3. Threshold = prune_ratio quantile among surviving weights
            threshold = torch.quantile(scores[masks[name] == 1], prune_ratio)
            # 4. Update the mask
            masks[name] = (scores > threshold).float()
        # 5. Reset weights to initialization (Lottery Ticket rewinding)
        model.load_state_dict(init_state)
    return model, masks
Key insight: Rewinding to early training step (not init) works better.
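A sketch of that rewinding variant, reusing `model` from the snippet above; the exact rewind point is a hyperparameter (commonly a small percentage into training):
import copy

# Capture weights a few hundred steps into training instead of at init
rewind_state = copy.deepcopy(model.state_dict())
# ... continue training, then prune ...
# Before the next pruning round, rewind surviving weights to the checkpoint
model.load_state_dict(rewind_state)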
Q: How do you recover accuracy after pruning?¶
A:
1. Fine-tuning:
for epoch in range(finetune_epochs):
for x, y in train_loader:
optimizer.zero_grad()
output = model(x)
loss = criterion(output, y)
loss.backward()
# Mask gradients (keep pruned weights zero)
for name, param in model.named_parameters():
if name in masks:
param.grad *= masks[name]
optimizer.step()
2. Knowledge Distillation:
- Teacher: original full model
- Student: pruned model
- Loss: L = CE_loss + alpha * KL(soft_targets || predictions)
3. Weight regrowth (RigL), sketched below:
- Prune + regrow based on gradient magnitude
- Dynamic sparse training
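A simplified sketch of one RigL-style update (illustrative, not the paper's exact schedule): drop the lowest-magnitude active weights and regrow the same number of inactive weights with the largest gradient magnitude. It assumes the `masks` dict convention from the pruning snippet above and that a backward pass has already populated `param.grad`:
import torch

def rigl_step(model, masks, update_frac=0.1):
    for name, param in model.named_parameters():
        if name not in masks:
            continue
        m = masks[name].view(-1)
        n = int(update_frac * m.sum().item())
        if n == 0:
            continue
        # Drop: smallest-magnitude weights among the currently active set
        w_scores = param.detach().abs().view(-1)
        w_scores[m == 0] = float('inf')
        drop_idx = torch.topk(w_scores, n, largest=False).indices
        # Grow: largest-gradient weights among the currently inactive set
        g_scores = param.grad.detach().abs().view(-1)
        g_scores[m == 1] = -float('inf')
        grow_idx = torch.topk(g_scores, n).indices
        m[drop_idx] = 0.0
        m[grow_idx] = 1.0
        param.data.view(-1)[grow_idx] = 0.0  # regrown weights start at zero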
Q: Pruning vs Quantization -- which should you choose?¶
A:
| Method | Size reduction | Speedup | Accuracy impact |
|---|---|---|---|
| FP16 quantization | 2x | 2-4x | Minimal |
| INT8 quantization | 4x | 2-4x | Small |
| Pruning (unstructured) | 10x | Variable (sparse) | Moderate |
| Pruning (structured) | 2-4x | 2-4x | Moderate |
| Combined | 10-20x | 4-8x | Needs tuning |
Best practice: Quantize first (FP16/INT8), then prune if needed.
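A minimal illustration of the "quantize first" step with PyTorch's built-in tools; the toy model is arbitrary, FP16 is shown for GPU and post-training dynamic INT8 for CPU linear layers:
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# FP16: halves memory on GPU, usually with minimal accuracy impact
model_fp16 = copy.deepcopy(model).half().cuda()  # requires a CUDA device

# INT8 dynamic quantization (CPU): weights stored as int8,
# activations quantized on the fly at inference time
model_int8 = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)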
Q: When should you use pruning in production?¶
A:
| Scenario | Recommendation |
|---|---|
| Edge deployment | Structured pruning (50-70%) |
| Cloud inference | Magnitude pruning + sparse kernels |
| Real-time latency | Structured (predictable speedup) |
| Model size critical | Magnitude (max compression) |
Production pipeline:
1. Train the full model to convergence
2. Apply structured pruning (filter/channel)
3. Fine-tune with distillation
4. Validate accuracy drop < 1-2%
5. Export to ONNX/TensorRT
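Step 2 (and the mask cleanup before step 5) can be prototyped with torch.nn.utils.prune; note that this zeroes channels rather than physically shrinking tensors, so realizing the speedup still requires a slimming or export step:
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.Conv2d(64, 128, 3))

# Step 2: zero out 50% of output channels per conv, ranked by L2 norm
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name='weight', amount=0.5, n=2, dim=0)

# ... Step 3: fine-tune (optionally with distillation, see the next section) ...

# Fold masks into the weight tensors before ONNX/TensorRT export
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.remove(module, 'weight')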
Knowledge Distillation¶
Q: How does Knowledge Distillation work?¶
A:
Core idea: Small "student" model learns from large "teacher" model's soft predictions.
Temperature scaling: \(p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\)
- High T -> softer distribution (more information)
- Low T -> harder distribution (closer to one-hot)
Distillation loss: \(L_{total} = \alpha \cdot L_{CE}(y_{true}, y_{student}) + (1-\alpha) \cdot T^2 \cdot KL(p_{teacher}^{(T)} \| p_{student}^{(T)})\)
Why \(T^2\): gradients through the softened targets scale as \(1/T^2\), so multiplying by \(T^2\) keeps the soft and hard loss terms on a comparable scale.
Q: Knowledge Distillation implementation in PyTorch¶
A:
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha

    def forward(self, student_logits, teacher_logits, labels):
        # Soft targets from teacher
        soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
        soft_predictions = F.log_softmax(student_logits / self.temperature, dim=1)
        # KL divergence loss (scaled by T^2)
        distill_loss = F.kl_div(
            soft_predictions, soft_targets, reduction='batchmean'
        ) * (self.temperature ** 2)
        # Hard targets loss
        hard_loss = F.cross_entropy(student_logits, labels)
        return self.alpha * hard_loss + (1 - self.alpha) * distill_loss

# Training loop
distill_loss = DistillationLoss(temperature=4.0, alpha=0.5)
teacher.eval()  # Teacher is frozen
for x, y in train_loader:
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    loss = distill_loss(student_logits, teacher_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Q: Types of Knowledge Distillation¶
A:
| Type | What is transferred | Example |
|---|---|---|
| Response-based | Soft logits | Classic Hinton (2015) |
| Feature-based | Intermediate activations | FitNets |
| Attention-based | Attention maps | TinyBERT |
| Relation-based | Sample relationships | RKD |
Feature-based (FitNets):
# Student matches intermediate features of teacher
hint_loss = MSE(student_features, teacher_features)
Attention-based (TinyBERT): the student additionally matches the teacher's attention maps, e.g. attn_loss = MSE(student_attention, teacher_attention).
Q: When does Knowledge Distillation work best?¶
A:
Good scenarios:
- Teacher >> Student in capacity (100x+)
- Same architecture family (both transformers)
- Enough data for the student to learn
- Teacher well-trained (not underfitting)
Weak scenarios:
- Teacher barely better than the student
- Very small datasets
- Teacher overconfident/wrong
Best practices:
1. Use temperature 3-5 (experiment)
2. alpha = 0.5-0.7 (balance hard/soft)
3. Train the student longer than usual
4. Data augmentation helps the student
Q: What is Self-Distillation?¶
A:
Idea: Model teaches itself -- earlier checkpoints teach later ones, or ensemble of own predictions.
Methods:
1. Temporal ensemble:
import copy

# Use an EMA (exponential moving average) copy of the model as the teacher
ema_model = copy.deepcopy(model)
for param in ema_model.parameters():
    param.requires_grad = False

# After each optimizer step, update the EMA weights slowly
for ema_param, model_param in zip(ema_model.parameters(), model.parameters()):
    ema_param.data = 0.999 * ema_param.data + 0.001 * model_param.data
2. Deep mutual learning (sketch below):
- Two identical models learn from each other
- Both are students AND teachers
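A minimal sketch of one deep-mutual-learning step, as referenced above; the function and optimizer names are illustrative:
import torch.nn.functional as F

def mutual_step(model_a, model_b, x, y, opt_a, opt_b):
    logits_a, logits_b = model_a(x), model_b(x)
    # Each model: supervised loss + KL towards the peer's (detached) prediction
    loss_a = F.cross_entropy(logits_a, y) + F.kl_div(
        F.log_softmax(logits_a, dim=1),
        F.softmax(logits_b.detach(), dim=1), reduction='batchmean')
    loss_b = F.cross_entropy(logits_b, y) + F.kl_div(
        F.log_softmax(logits_b, dim=1),
        F.softmax(logits_a.detach(), dim=1), reduction='batchmean')
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()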
Benefits: No need for pretrained teacher, works with any model size.
Transfer Learning & Fine-tuning¶
Q: What is Transfer Learning and when should you use it?¶
A:
Transfer Learning -- reusing knowledge from a model trained on one task for a different but related task.
When to use:
- Limited labeled data for the target task
- Source and target domains are related
- A pretrained model is available (ResNet, BERT, GPT)
- Saving compute resources
When NOT to use:
- Target domain differs strongly from the source
- Enough data to train from scratch
- Domain-specific features are critical
Types:
1. Feature Extraction: freeze the backbone, train only the head
2. Fine-tuning: update all or some layers
3. Domain Adaptation: adapt to a new distribution
Q: What's the difference between Transfer Learning and Fine-tuning?¶
A:
| Aspect | Transfer Learning | Fine-tuning |
|---|---|---|
| Definition | General concept | A specific method |
| Scope | May use frozen features | Always updates weights |
| Example | Use BERT embeddings + a classifier | Update BERT weights on domain data |
Transfer Learning (frozen):
model = torchvision.models.resnet50(pretrained=True)
for param in model.parameters():
param.requires_grad = False
model.fc = nn.Linear(2048, num_classes) # Only train head
Fine-tuning:
model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(2048, num_classes)
optimizer = Adam(model.parameters(), lr=1e-5) # Small LR!
Q: What pre-training objectives are used for language models?¶
A:
| Objective | Direction | Best For | Example Models |
|---|---|---|---|
| MLM | Bidirectional | Understanding, NLU | BERT, RoBERTa |
| CLM | Unidirectional | Generation, NLG | GPT, LLaMA |
| Span Corruption | Bidirectional | Seq2Seq | T5, BART |
| NSP | Bidirectional | Document-level | BERT (deprecated) |
MLM (BERT): mask 15% of tokens and predict them. \(\mathcal{L}_{MLM} = -\sum_{i \in M} \log P(x_i | x_{\backslash M})\)
CLM (GPT): Predict next token. \(\mathcal{L}_{CLM} = -\sum_{t=1}^{T} \log P(x_t | x_{<t})\)
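As a concrete illustration of the CLM objective, a generic next-token loss (model-agnostic; `logits` can come from any decoder):
import torch.nn.functional as F

def clm_loss(logits, input_ids):
    # logits: (B, T, V); predict token t+1 from positions <= t
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )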
Q: What fine-tuning strategies exist?¶
A:
1. Full Fine-tuning:
for param in model.parameters():
param.requires_grad = True
optimizer = AdamW(model.parameters(), lr=2e-5)
2. Layer Freezing:
for name, param in model.named_parameters():
if 'layer.0' in name or 'layer.1' in name:
param.requires_grad = False
3. Discriminative Learning Rates:
param_groups = [
{'params': early_layers, 'lr': 1e-6},
{'params': middle_layers, 'lr': 1e-5},
{'params': late_layers, 'lr': 1e-4},
]
optimizer = AdamW(param_groups)
4. Gradual Unfreezing (ULMFiT):
def unfreeze_layer(model, layer_idx):
for name, param in model.named_parameters():
if f'layer.{layer_idx}' in name:
param.requires_grad = True
Q: Full Fine-tuning vs LoRA vs Adapters -- how do they compare?¶
A:
| Method | Trainable Params | Memory | Performance | Speed |
|---|---|---|---|---|
| Full FT | 100% | High | Best (potentially) | Slow |
| LoRA | 0.1-1% | Low | Near-full | Fast |
| Adapters | 1-5% | Medium | Near-full | Medium |
LoRA (Low-Rank Adaptation): \(W' = W + \Delta W = W + BA\)
where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), \(r \ll d\)
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(rank, out_dim))        # up-projection: delta starts at 0

    def forward(self, x, original_output):
        # original_output is the frozen base layer's output
        return original_output + x @ self.A @ self.B
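A hypothetical usage sketch with a frozen base layer (dimensions arbitrary): since B starts at zero, the model initially behaves exactly like the pretrained layer:
import torch
import torch.nn as nn

base = nn.Linear(768, 768)          # pretrained layer
for p in base.parameters():
    p.requires_grad_(False)         # frozen; only A and B are trained
lora = LoRALayer(768, 768, rank=8)

x = torch.randn(4, 768)
out = lora(x, base(x))              # base output + low-rank delta (0 at init)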
When to use what:
- Full FT: maximum accuracy needed, sufficient resources
- LoRA: multiple tasks, limited memory, quick switching
- Adapters: need modularity, industrial deployment
Q: How do you avoid catastrophic forgetting during fine-tuning?¶
A:
The problem: the model "forgets" its original knowledge while training on new data.
Solutions:
1. Learning Rate Scheduling:
from transformers import get_cosine_schedule_with_warmup

scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10000
)
2. Elastic Weight Consolidation (EWC): \(\mathcal{L} = \mathcal{L}_{new} + \lambda \sum_i F_i(\theta_i - \theta_i^*)^2\)
where \(F_i\) is the Fisher information and \(\theta_i^*\) are the original weights
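A minimal EWC penalty sketch under the usual diagonal Fisher approximation; `fisher` and `old_params` are assumed name-to-tensor dicts computed on the source task:
def ewc_penalty(model, fisher, old_params, lam=1000.0):
    # lambda * sum_i F_i * (theta_i - theta_i*)^2
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return lam * penalty

# On the new task: loss = task_loss + ewc_penalty(model, fisher, old_params)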
3. Replay/Mixing: blend a small fraction of the original (source-task) data into the fine-tuning set so old knowledge keeps receiving gradient signal (sketch below).
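A simple way to implement the mix with standard torch.utils.data components; `old_dataset` and `new_dataset` are assumed to exist, and the 10% ratio is illustrative:
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset

# Keep ~10% of the old-task data and blend it into fine-tuning batches
old_idx = torch.randperm(len(old_dataset))[: len(old_dataset) // 10].tolist()
mixed = ConcatDataset([new_dataset, Subset(old_dataset, old_idx)])
loader = DataLoader(mixed, batch_size=32, shuffle=True)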
4. PEFT methods (LoRA, Adapters):
- Original weights are not modified
- Naturally preserve pre-trained knowledge
Weight Tying (Shared Embeddings)¶
Q: What is Weight Tying in language models?¶
A:
Weight Tying -- sharing one weight matrix between the input embedding and the output projection in language models.
The idea:
- Input embedding: token_id -> vector (V x D matrix)
- Output projection: vector -> logits over the vocabulary (D x V matrix)
- Tying: use a single matrix for both!
Parameter savings: \(\text{Saved params} = V \times D\)
For GPT-2 small: 50K vocab x 768 dim ≈ 38M parameters
Used in: GPT-2, GPT-3, GPT-4, Llama, SmolLM2/3, and most modern LLMs.
Q: How do you implement Weight Tying in PyTorch?¶
A:
import torch.nn as nn

class LanguageModel(nn.Module):
def __init__(self, vocab_size, embed_dim):
super().__init__()
self.token_embedding = nn.Embedding(vocab_size, embed_dim)
self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)
# Weight tying!
self.lm_head.weight = self.token_embedding.weight
def forward(self, input_ids):
x = self.token_embedding(input_ids) # (B, L, D)
# ... transformer layers ...
logits = self.lm_head(x) # (B, L, V)
return logits
Important notes:
1. bias=False -- no bias is needed with weight tying
2. Gradients flow correctly through the tied weights
3. The same (V x D) matrix serves both roles: used directly for the embedding lookup, and transposed inside nn.Linear for the output projection
Q: Weight Tying vs Separate Embeddings -- when to use which?¶
A:
| Criterion | Weight Tying | Separate |
|---|---|---|
| Parameters | Fewer (save VxD) | More |
| Performance | Usually better for LMs | Sometimes better for MT |
| Consistency | Input/Output aligned | Independent |
When Weight Tying is better:
- Decoder-only LMs (GPT, Llama)
- Large-vocabulary models
- Memory-constrained deployment
When separate embeddings are better:
- Machine Translation (encoder-decoder)
- Different input/output vocabularies
- Multi-task learning scenarios
Q: What is Pseudo-Inverse Tying (2026)?¶
A:
Pseudo-Inverse Tying (arXiv:2602.04556, Feb 2026): \(W_{out} = (W_{in}^T W_{in})^{-1} W_{in}^T = W_{in}^+\)
The idea:
- Instead of directly copying the matrix, compute its "inverse" projection
- Keeps the parameter efficiency of weight tying
- More stable training
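A sketch of the formula itself (not necessarily the paper's implementation, which is not detailed here): the output projection is computed as the pseudo-inverse of the embedding matrix. Recomputing pinv on every forward pass is shown for clarity; in practice it would be cached or updated periodically:
import torch
import torch.nn as nn

class PseudoInverseTiedHead(nn.Module):
    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.embedding = embedding  # weight shape: (V, D)

    def forward(self, x):  # x: (B, L, D)
        # W_out = W_in^+ has shape (D, V); recomputed here on every call
        w_out = torch.linalg.pinv(self.embedding.weight)
        return x @ w_out  # logits: (B, L, V)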
Results (PIT vs WT vs Separate):
| Method | Params | PPL | Stability |
|---|---|---|---|
| Separate | Full | Baseline | High |
| Weight Tying | -VxD | Slightly better | Medium |
| Pseudo-Inverse | -VxD | Best | Highest |
Q: How does Weight Tying affect gradient flow?¶
A:
# Forward: logits = x @ W.T
# Backward: dW = d_logits.T @ x (from the lm_head)
#           + d_embeddings (from the input embedding)
# Gradients accumulate from both paths!
W.grad = grad_from_output + grad_from_input
Advantages:
1. Gradients from two sources -> stronger signal
2. Regularization effect: input/output consistency is enforced
3. Faster convergence in many cases
Potential issues:
- Conflicting gradients (rare)
- May pull input and output representations too close together
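A quick runnable check of the two-path accumulation described above (tiny dimensions, purely illustrative):
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
head = nn.Linear(4, 10, bias=False)
head.weight = emb.weight  # tie the weights

logits = head(emb(torch.tensor([[1, 2, 3]])))
logits.sum().backward()
# emb.weight.grad holds contributions from BOTH the lookup and the projection
print(emb.weight.grad.shape)  # torch.Size([10, 4])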
Q: Design a model with Weight Tying for a 100K vocabulary.¶
A:
Without Weight Tying:
Input embedding: 100K x 1024 = 102.4M params
Output projection: 1024 x 100K = 102.4M params
Total embedding params: 204.8M
With Weight Tying:
Shared embedding: 100K x 1024 = 102.4M params
Saved: 102.4M params (50% reduction)
Memory saved: ~400MB (FP32) or ~200MB (FP16)
import torch.nn as nn

class EfficientLM(nn.Module):
def __init__(self, vocab_size=100_000, dim=1024, n_layers=12):
super().__init__()
self.embed = nn.Embedding(vocab_size, dim)
self.layers = nn.ModuleList([
            TransformerBlock(dim) for _ in range(n_layers)  # any standard transformer block (user-defined)
])
self.ln_f = nn.LayerNorm(dim)
self.lm_head = nn.Linear(dim, vocab_size, bias=False)
self.lm_head.weight = self.embed.weight
def forward(self, input_ids):
x = self.embed(input_ids)
for layer in self.layers:
x = layer(x)
x = self.ln_f(x)
logits = self.lm_head(x)
return logits