
DL Interview: Compression and Transfer Learning


Model Pruning (Magnitude, Structured, Lottery Ticket), Knowledge Distillation (Response-based, Feature-based, Self-Distillation), Transfer Learning & Fine-tuning (Feature Extraction, LoRA, Adapters, Catastrophic Forgetting), Weight Tying (Shared Embeddings, Pseudo-Inverse Tying).


Model Pruning

Q: Magnitude Pruning vs Structured Pruning -- what's the difference?

A:

| Aspect | Magnitude (Unstructured) | Structured |
|--------|--------------------------|------------|
| What is removed | Individual weights (near zero) | Entire neurons/filters/channels |
| Compression | 90-99% | 50-80% |
| Hardware friendly | No (sparse ops) | Yes (dense ops) |
| Speedup | Requires sparse kernels | Direct speedup |
| Accuracy drop | Smaller | Larger |

Magnitude pruning: \(\text{Prune if } |w_{ij}| < \theta\)

Structured pruning (filter): \(\text{Importance}_k = \sum_{i,j} |W_k^{(i,j)}|\)
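
A minimal sketch of both criteria on a single conv layer (the layer shape and keep-ratios are illustrative):

import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3)
W = conv.weight.detach()                        # (out_ch, in_ch, kH, kW)

# Magnitude (unstructured): zero individual weights below a threshold
theta = torch.quantile(W.abs().flatten(), 0.9)  # 90th-percentile threshold
unstructured_mask = (W.abs() >= theta).float()  # keeps the top 10% of weights

# Structured (filter): L1 importance per output filter, drop the weakest half
importance = W.abs().sum(dim=(1, 2, 3))         # one score per filter k
filter_mask = importance >= importance.median()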

Q: How does iterative magnitude pruning work?

A:

Lottery Ticket Hypothesis: a dense network contains a "winning ticket" -- a sparse subnetwork that, trained in isolation from its original initialization, can reach the accuracy of the full network.

Iterative procedure:

import torch

def iterative_magnitude_pruning(model, train_loader, prune_ratio=0.2, n_iterations=10):
    # Assumes a single prunable parameter named 'weight';
    # train_with_mask / reset_weights_to_init are helpers defined elsewhere
    weight = dict(model.named_parameters())['weight']
    mask = torch.ones_like(weight)

    for iteration in range(n_iterations):
        # 1. Train with the current mask applied
        train_with_mask(model, train_loader, mask)

        # 2. Magnitude scores (already-pruned weights score 0)
        scores = weight.detach().abs() * mask

        # 3. Threshold: prune the lowest prune_ratio of the remaining weights
        flat_scores = scores[mask == 1].flatten()
        threshold = torch.quantile(flat_scores, prune_ratio)

        # 4. Update mask
        mask = (scores > threshold).float()

        # 5. Rewind weights to initialization (lottery ticket style)
        reset_weights_to_init(model)

    return model, mask

Key insight: rewinding weights to an early training step (rather than all the way back to initialization) works better in practice, especially for larger networks.
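
A sketch of the rewinding variant: in place of the reset_weights_to_init call above, save a checkpoint early in training and rewind to it on each round (the step count is illustrative):

import copy

# Early in training (e.g., after ~1000 steps), save a rewind point
rewind_state = copy.deepcopy(model.state_dict())

# Inside the pruning loop, rewind instead of resetting to initialization
model.load_state_dict(rewind_state)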

Q: How do you recover accuracy after pruning?

A:

1. Fine-tuning:

for epoch in range(finetune_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()

        # Zero out gradients of pruned weights
        for name, param in model.named_parameters():
            if name in masks and param.grad is not None:
                param.grad *= masks[name]

        optimizer.step()

        # Re-apply masks: momentum/weight decay can revive pruned weights
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param *= masks[name]

2. Knowledge Distillation:

  • Teacher: original full model
  • Student: pruned model
  • Loss: L = CE_loss + alpha * KL(soft_targets || predictions)

3. Weight regrowth (RigL):

  • Prune + regrow based on gradient magnitude
  • Dynamic sparse training
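
A minimal sketch of one RigL-style drop-and-grow update on a single weight tensor (rigl_update is a hypothetical helper name; the update schedule and per-layer details are omitted):

import torch

def rigl_update(weight, grad, mask, k):
    active = mask.bool()

    # Drop: k smallest-magnitude weights among the currently active ones
    w_scores = weight.abs().masked_fill(~active, float('inf'))
    drop_idx = torch.topk(w_scores.flatten(), k, largest=False).indices

    # Grow: k largest-gradient weights among the currently pruned ones
    g_scores = grad.abs().masked_fill(active, float('-inf'))
    grow_idx = torch.topk(g_scores.flatten(), k).indices

    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    new_mask[grow_idx] = 1.0

    # Regrown connections start at zero, as in the RigL recipe
    weight.data.view(-1)[grow_idx] = 0.0
    return new_mask.view_as(mask)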

Q: Pruning vs Quantization -- which should you choose?

A:

| Method | Size reduction | Speedup | Accuracy impact |
|--------|----------------|---------|-----------------|
| FP16 quantization | 2x | 2-4x | Minimal |
| INT8 quantization | 4x | 2-4x | Small |
| Pruning (unstructured) | 10x | Variable (sparse) | Moderate |
| Pruning (structured) | 2-4x | 2-4x | Moderate |
| Combined | 10-20x | 4-8x | Needs tuning |

Best practice: Quantize first (FP16/INT8), then prune if needed.
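
As an illustration, FP16 is often just model.half() for GPU inference, and PyTorch's built-in dynamic quantization converts Linear layers to INT8 in one call (a sketch, for CPU inference):

import torch
import torch.nn as nn

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)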

Q: When should pruning be used in production?

A:

| Scenario | Recommendation |
|----------|----------------|
| Edge deployment | Structured pruning (50-70%) |
| Cloud inference | Magnitude pruning + sparse kernels |
| Real-time latency | Structured (predictable speedup) |
| Model size critical | Magnitude (max compression) |

Production pipeline:

1. Train full model to convergence
2. Apply structured pruning (filter/channel)
3. Fine-tune with distillation
4. Validate accuracy drop < 1-2%
5. Export to ONNX/TensorRT
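
A sketch of steps 2 and 5 with built-in PyTorch tools (the input shape is an assumption; note that ln_structured only zeroes filters -- physically removing them for real speedup requires a channel-slimming pass or a runtime that exploits the sparsity):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Step 2: zero out 50% of conv filters by L1 norm
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.5, n=1, dim=0)
        prune.remove(module, "weight")  # bake the mask into the weights

# Step 5: export to ONNX (dummy input shape assumed)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "pruned_model.onnx")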


Knowledge Distillation

Q: How does Knowledge Distillation work?

A:

Core idea: Small "student" model learns from large "teacher" model's soft predictions.

Temperature scaling: \(p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\)

  • High T -> softer distribution (more information)
  • Low T -> harder distribution (closer to one-hot)

Distillation loss: \(L_{total} = \alpha \cdot L_{CE}(y_{true}, y_{student}) + (1-\alpha) \cdot T^2 \cdot KL(p_{teacher}^{(T)} \| p_{student}^{(T)})\)

Why \(T^2\): the gradient of the soft-target term scales as \(1/T^2\), so multiplying by \(T^2\) keeps its magnitude comparable to the cross-entropy term as T varies.

Q: PyTorch implementation of Knowledge Distillation

A:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha

    def forward(self, student_logits, teacher_logits, labels):
        # Soft targets from teacher
        soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
        soft_predictions = F.log_softmax(student_logits / self.temperature, dim=1)

        # KL divergence loss (scaled by T^2)
        distill_loss = F.kl_div(
            soft_predictions, soft_targets, reduction='batchmean'
        ) * (self.temperature ** 2)

        # Hard targets loss
        hard_loss = F.cross_entropy(student_logits, labels)

        return self.alpha * hard_loss + (1 - self.alpha) * distill_loss

# Training loop
distill_loss = DistillationLoss()
teacher.eval()  # Teacher is frozen
for x, y in train_loader:
    with torch.no_grad():
        teacher_logits = teacher(x)

    student_logits = student(x)
    loss = distill_loss(student_logits, teacher_logits, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Q: Types of Knowledge Distillation

A:

| Type | What is transferred | Example |
|------|---------------------|---------|
| Response-based | Soft logits | Classic Hinton (2015) |
| Feature-based | Intermediate activations | FitNets |
| Attention-based | Attention maps | TinyBERT |
| Relation-based | Sample relationships | RKD |

Feature-based (FitNets):

# Student matches intermediate features of the teacher
# (a learned projection aligns dimensions if they differ)
hint_loss = F.mse_loss(student_features, teacher_features)

Attention-based (TinyBERT):

# Match attention matrices layer by layer
attn_loss = F.mse_loss(student_attention, teacher_attention)

Q: When does Knowledge Distillation work best?

A:

Good scenarios:

  • Teacher >> Student in capacity (100x+)
  • Same architecture family (both transformers)
  • Enough data for the student to learn
  • Teacher well-trained (not underfitting)

Weak scenarios:

  • Teacher barely better than the student
  • Very small datasets
  • Teacher overconfident/wrong

Best practices:

1. Use temperature 3-5 (experiment)
2. alpha = 0.5-0.7 (balance hard/soft)
3. Train the student longer than usual
4. Data augmentation helps the student

Q: What is Self-Distillation?

A:

Idea: Model teaches itself -- earlier checkpoints teach later ones, or ensemble of own predictions.

Methods:

1. Temporal ensemble:

import copy
import torch

# Use an EMA (exponential moving average) of the model as the teacher
ema_model = copy.deepcopy(model)
for param in ema_model.parameters():
    param.requires_grad = False

# After each optimizer step, update the EMA teacher slowly
with torch.no_grad():
    for ema_param, model_param in zip(ema_model.parameters(), model.parameters()):
        ema_param.data.mul_(0.999).add_(model_param.data, alpha=0.001)

2. Deep mutual learning (sketch below):

  • Two identical models learn from each other
  • Both are students AND teachers
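
A minimal sketch of one mutual-learning step (model1, model2, x, y are assumed to come from a standard training loop):

import torch.nn.functional as F

logits1, logits2 = model1(x), model2(x)

# Each model fits the labels AND the other's (detached) soft predictions
loss1 = F.cross_entropy(logits1, y) + F.kl_div(
    F.log_softmax(logits1, dim=1),
    F.softmax(logits2.detach(), dim=1),
    reduction='batchmean',
)
loss2 = F.cross_entropy(logits2, y) + F.kl_div(
    F.log_softmax(logits2, dim=1),
    F.softmax(logits1.detach(), dim=1),
    reduction='batchmean',
)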

Benefits: no pretrained teacher needed; works with any model size.


Transfer Learning & Fine-tuning

Q: What is Transfer Learning and when should it be used?

A:

Transfer Learning -- reusing knowledge from a model trained on one task for a different, related task.

When to use:

  • Limited labeled data for the target task
  • Source and target domains are related
  • A pretrained model is available (ResNet, BERT, GPT)
  • Saves compute resources

When NOT to use:

  • Target domain differs strongly from the source
  • There is enough data to train from scratch
  • Domain-specific features are critical

Types:

1. Feature Extraction: freeze the backbone, train only the head
2. Fine-tuning: update all or some layers
3. Domain Adaptation: adapt to a new distribution

Q: What is the difference between Transfer Learning and Fine-tuning?

A:

| Aspect | Transfer Learning | Fine-tuning |
|--------|-------------------|-------------|
| Definition | General concept | A specific method |
| Scope | May use frozen features | Always updates weights |
| Example | Use BERT embeddings + a classifier | Update BERT weights on domain data |

Transfer Learning (frozen):

import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(2048, num_classes)  # Only train the head

Fine-tuning:

from torch.optim import Adam

model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(2048, num_classes)
optimizer = Adam(model.parameters(), lr=1e-5)  # Small LR!

Q: What pre-training objectives are used for language models?

A:

| Objective | Direction | Best For | Example Models |
|-----------|-----------|----------|----------------|
| MLM | Bidirectional | Understanding, NLU | BERT, RoBERTa |
| CLM | Unidirectional | Generation, NLG | GPT, LLaMA |
| Span Corruption | Bidirectional | Seq2Seq | T5, BART |
| NSP | Bidirectional | Document-level | BERT (deprecated) |

MLM (BERT): mask 15% of tokens and predict them. \(\mathcal{L}_{MLM} = -\sum_{i \in M} \log P(x_i | x_{\backslash M})\)

CLM (GPT): Predict next token. \(\mathcal{L}_{CLM} = -\sum_{t=1}^{T} \log P(x_t | x_{<t})\)
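
Both objectives reduce to cross-entropy over the predicted tokens. A sketch of the CLM loss via the standard shift-by-one trick (model and input_ids assumed):

import torch.nn.functional as F

logits = model(input_ids)                          # (B, L, V)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),   # predictions at 0..L-2
    input_ids[:, 1:].reshape(-1),                  # targets at 1..L-1
)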

Q: What fine-tuning strategies exist?

A:

1. Full Fine-tuning:

from torch.optim import AdamW

for param in model.parameters():
    param.requires_grad = True
optimizer = AdamW(model.parameters(), lr=2e-5)

2. Layer Freezing:

for name, param in model.named_parameters():
    if 'layer.0' in name or 'layer.1' in name:
        param.requires_grad = False

3. Discriminative Learning Rates:

param_groups = [
    {'params': early_layers, 'lr': 1e-6},
    {'params': middle_layers, 'lr': 1e-5},
    {'params': late_layers, 'lr': 1e-4},
]
optimizer = AdamW(param_groups)

4. Gradual Unfreezing (ULMFiT):

def unfreeze_layer(model, layer_idx):
    for name, param in model.named_parameters():
        if f'layer.{layer_idx}' in name:
            param.requires_grad = True

Q: Full Fine-tuning vs LoRA vs Adapters -- how do they compare?

A:

| Method | Trainable Params | Memory | Performance | Speed |
|--------|------------------|--------|-------------|-------|
| Full FT | 100% | High | Best (potentially) | Slow |
| LoRA | 0.1-1% | Low | Near-full | Fast |
| Adapters | 1-5% | Medium | Near-full | Medium |

LoRA (Low-Rank Adaptation): \(W' = W + \Delta W = W + BA\)

where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), \(r \ll d\)

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        # B starts at zero, so the initial update Delta W is zero
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))

    def forward(self, x, original_output):
        # Only A and B are trained; the original weight stays frozen
        return original_output + x @ self.A @ self.B

When to use what:

  • Full FT: maximum accuracy needed, sufficient resources
  • LoRA: multiple tasks, limited memory, quick switching (see the merging sketch below)
  • Adapters: need modularity, industrial deployment
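
Quick task switching works because a LoRA update can be merged into (and later subtracted from) the frozen weight. A sketch using the LoRALayer above (linear is the frozen nn.Linear it augments):

import torch

# nn.Linear stores weight as (out_dim, in_dim), hence the transpose
with torch.no_grad():
    linear.weight.add_((lora.A @ lora.B).t())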

Q: How do you avoid catastrophic forgetting during fine-tuning?

A:

Problem: the model "forgets" its original knowledge when trained on new data.

Solutions:

1. Learning Rate Scheduling:

from transformers import get_cosine_schedule_with_warmup

scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10000
)

2. Elastic Weight Consolidation (EWC): \(\mathcal{L} = \mathcal{L}_{new} + \lambda \sum_i F_i(\theta_i - \theta_i^*)^2\)

where \(F_i\) is the Fisher information and \(\theta_i^*\) are the original weights
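
A minimal sketch of the penalty (fisher and theta_star are dicts keyed by parameter name, assumed precomputed on the source task; the lambda value is illustrative):

def ewc_penalty(model, fisher, theta_star, lam=100.0):
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - theta_star[name]) ** 2).sum()
    return lam * penalty

# Total loss on the new task:
# loss = new_task_loss + ewc_penalty(model, fisher, theta_star)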

3. Replay/Mixing:

batch = torch.cat([new_data, old_data_sample])

4. PEFT methods (LoRA, Adapters):

  • Original weights are not modified
  • Naturally preserve pre-trained knowledge


Weight Tying (Shared Embeddings)

Q: What is Weight Tying in language models?

A:

Weight Tying -- a technique for sharing weights between the input embedding and the output projection matrices in language models.

Idea:

  • Input embedding: token_id -> vector (V x D matrix)
  • Output projection: vector -> logits over the vocabulary (D x V matrix)
  • Tying: use a single matrix for both!

Parameter savings: \(\text{Saved params} = V \times D\)

For GPT-2 small: 50K vocab x 768 dim ≈ 38M parameters

Used in: GPT-2, GPT-3, GPT-4, Llama, SmolLM2/3, and most modern LLMs.

Q: How do you implement Weight Tying in PyTorch?

A:

import torch.nn as nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)

        # Weight tying!
        self.lm_head.weight = self.token_embedding.weight

    def forward(self, input_ids):
        x = self.token_embedding(input_ids)  # (B, L, D)
        # ... transformer layers ...
        logits = self.lm_head(x)  # (B, L, V)
        return logits

Important notes:

1. bias=False -- no bias is needed with weight tying
2. Gradients flow correctly through the tied weights
3. Same matrix used transposed: input = (V x D), output = (D x V)

Q: Weight Tying vs Separate Embeddings -- when to use which?

A:

| Criterion | Weight Tying | Separate |
|-----------|--------------|----------|
| Parameters | Fewer (save V x D) | More |
| Performance | Usually better for LMs | Sometimes better for MT |
| Consistency | Input/output aligned | Independent |

When Weight Tying is better:

  • Decoder-only LMs (GPT, Llama)
  • Large-vocabulary models
  • Memory-constrained deployment

When separate embeddings:

  • Machine Translation (encoder-decoder)
  • Different input/output vocabularies
  • Multi-task learning scenarios

Q: What is Pseudo-Inverse Tying (2026)?

A:

Pseudo-Inverse Tying (arXiv:2602.04556, Feb 2026): \(W_{out} = (W_{in}^T W_{in})^{-1} W_{in}^T = W_{in}^+\)

Idea:

  • Instead of directly copying the matrix, compute an "inverse" projection
  • Preserves parameter efficiency
  • More stable training

Results (PIT vs WT vs Separate):

| Method | Params | PPL | Stability |
|--------|--------|-----|-----------|
| Separate | Full | Baseline | High |
| Weight Tying | -VxD | Slightly better | Medium |
| Pseudo-Inverse | -VxD | Best | Highest |
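
For the formula itself, torch.linalg.pinv computes exactly the Moore-Penrose expression above; a sketch assuming W_in is the V x D embedding matrix:

import torch

# (D, V): equals (W_in^T W_in)^{-1} W_in^T when W_in has full column rank
W_out = torch.linalg.pinv(W_in)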

Q: How does Weight Tying affect gradient flow?

A:

# Forward: logits = x @ W.T
# Backward: dW = d_logits.T @ x (from lm_head)
#         + d_embeddings (from the input embedding)

# Gradients accumulate from both paths!
W.grad = grad_from_output + grad_from_input

Advantages:

1. Gradients from two sources -> stronger signal
2. Regularization effect: input/output consistency is enforced
3. Faster convergence in many cases

Potential issues:

  • Conflicting gradients (rare)
  • Can pull input and output representations toward each other too strongly

Q: Design a model with Weight Tying for a 100K vocabulary.

A:

Without Weight Tying:

Input embedding: 100K x 1024 = 102.4M params
Output projection: 1024 x 100K = 102.4M params
Total embedding params: 204.8M

With Weight Tying:

Shared embedding: 100K x 1024 = 102.4M params
Saved: 102.4M params (50% reduction)
Memory saved: ~400MB (FP32) or ~200MB (FP16)

import torch.nn as nn

class EfficientLM(nn.Module):
    def __init__(self, vocab_size=100_000, dim=1024, n_layers=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList([
            # TransformerBlock is assumed defined elsewhere
            TransformerBlock(dim) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight

    def forward(self, input_ids):
        x = self.embed(input_ids)
        for layer in self.layers:
            x = layer(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        return logits