DL Interview: Vision & Detection¶
Vision Transformers (ViT, Swin), Object Detection (YOLO, Faster R-CNN, FPN), Contrastive & Self-Supervised Learning (SimCLR, MoCo, CLIP, BYOL, MAE).
Vision Transformers (ViT)¶
Q: How does ViT process images?¶
A:
Core idea: Treat an image as a sequence of patches.
Pipeline:
1. Patch embedding: Split the image into \(16 \times 16\) patches
2. Flatten + Linear: \(P^2 \cdot C\) → \(D\) (embedding dim)
3. Positional encoding: Add learnable position embeddings
4. [CLS] token: Prepend a learnable classification token
5. Transformer encoder: Standard self-attention layers
6. MLP head: Classification from the [CLS] token
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2  # 196 for 224/16
        # A conv with kernel = stride = patch_size extracts non-overlapping patches
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224) -> (B, 196, 768)
        x = self.proj(x)                  # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (B, 196, 768)
        return x
Q: ViT vs CNN -- when to use which?¶
A:
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Translation equivariance, locality | Minimal (learned from data) |
| Data efficiency | Good with small data | Needs large datasets |
| Global context | Limited by receptive field | Full attention from layer 1 |
| Compute | Efficient for small images | Scales with sequence length |
| Pre-training | ImageNet sufficient | JFT-300M, LAION better |
Decision:
- Small dataset (<100K images): CNN (ResNet, EfficientNet)
- Large dataset + compute: ViT
- Production real-time: CNN
- SOTA on large-scale benchmarks: ViT + huge pre-training
Q: Why does ViT need a [CLS] token?¶
A:
Purpose: Aggregate a global representation for classification.
How it works:
- [CLS] attends to all patch tokens
- After L layers, the [CLS] embedding contains a "summary" of the image
- Only [CLS] connects to the classification head
Alternative: Global average pooling over all patch tokens. But [CLS] lets the model learn what is important for the task.
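A minimal sketch of how the [CLS] token is used, assuming patch tokens come from a module like PatchEmbedding above; CLSHead and its encoder argument are illustrative names, not a specific library API:
import torch
import torch.nn as nn

class CLSHead(nn.Module):
    def __init__(self, encoder, embed_dim=768, num_classes=1000):
        super().__init__()
        self.encoder = encoder  # any (B, N, D) -> (B, N, D) transformer stack
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) from patch embedding
        cls = self.cls_token.expand(patch_tokens.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([cls, patch_tokens], dim=1)                   # (B, N+1, D)
        x = self.encoder(x)
        return self.head(x[:, 0])  # only the [CLS] position feeds the classifier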
Q: Swin Transformer vs ViT?¶
A:
ViT problem: Global attention costs \(O(N^2)\) for \(N\) patches. A 224x224 image gives 196 patches, which is manageable; high-resolution inputs become prohibitive.
Swin solution: Hierarchical attention in windows.
Key features (window partitioning sketched below):
1. Window attention: Attend only within local windows (7x7 in the patch grid = 49 patches per window)
2. Shifted windows: Alternate window positions between layers for cross-window connections
3. Hierarchical: Progressive downsampling (like CNNs)
4. Linear complexity: \(O(N)\) instead of \(O(N^2)\)
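A minimal sketch of window partitioning, the step that makes attention local; the reshaping mirrors Swin-style implementations, but the function name and layout here are illustrative:
import torch

def window_partition(x, window_size=7):
    # x: (B, H, W, C) patch-grid features; H and W divisible by window_size
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # Group each 7x7 window into its own "sequence" for self-attention
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
    return windows  # (num_windows * B, 49, C)
Self-attention then runs independently inside each window, so cost grows with the number of windows rather than quadratically with the total number of patches.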
Use cases:
- ViT: Classification, small images
- Swin: Detection, segmentation, high-res images
Object Detection¶
Q: One-stage vs two-stage detectors -- what's the difference?¶
A:
| Aspect | Two-stage (Faster R-CNN) | One-stage (YOLO, SSD) |
|---|---|---|
| Pipeline | Region proposal + Classification | Direct prediction |
| Speed | Slower (~5 FPS) | Faster (~45-155 FPS) |
| Accuracy | Higher mAP | Slightly lower |
| Anchors | RPN generates proposals | Pre-defined anchor boxes |
Two-stage (Faster R-CNN):
1. Stage 1: Region Proposal Network (RPN) generates candidate boxes
2. Stage 2: RoI pooling + per-region classification
One-stage (YOLO):
- A single network predicts bounding boxes + classes directly
- Treats detection as a regression problem
When to use:
- Two-stage: Accuracy critical, smaller datasets
- One-stage: Real-time applications, embedded systems
Q: What are anchor boxes?¶
A:
Concept: Pre-defined box templates at each location to match objects of different scales/aspect ratios.
How it works:
- Each location has K anchor boxes (e.g., 3 scales x 3 ratios = 9 anchors)
- The model predicts offsets from anchors, not absolute boxes (see the decoding sketch after the anchor list)
- During training, anchors are matched to ground truth (IoU > 0.5)
Anchor design:
# Typical YOLO anchors as (width, height) in pixels
anchors = [
    (10, 13), (16, 30), (33, 23),       # Small objects
    (30, 61), (62, 45), (59, 119),      # Medium objects
    (116, 90), (156, 198), (373, 326),  # Large objects
]
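A sketch of YOLOv2/v3-style decoding from predicted offsets to absolute boxes; decode_boxes and its arguments are hypothetical names illustrating the idea, not a specific library API:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_boxes(anchors_wh, offsets, grid_xy, stride):
    # offsets: (N, 4) raw predictions (tx, ty, tw, th), one row per anchor
    tx, ty, tw, th = offsets.T
    cx = (grid_xy[:, 0] + sigmoid(tx)) * stride  # center stays inside its cell
    cy = (grid_xy[:, 1] + sigmoid(ty)) * stride
    w = anchors_wh[:, 0] * np.exp(tw)            # offsets rescale the anchor shape
    h = anchors_wh[:, 1] * np.exp(th)
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)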
Modern trend (anchor-free):
- FCOS, CenterNet -- predict from a center point
- No anchor boxes, simpler post-processing
Q: IoU (Intersection over Union) -- formula and applications¶
A:
Formula: \(IoU = \frac{|A \cap B|}{|A \cup B|}\) -- intersection area divided by union area of two boxes.
Uses:
1. Training: Match predictions to ground truth
2. NMS: Remove duplicate detections
3. Evaluation: mAP calculation
4. Loss: IoU loss for better localization
IoU Loss: \(L_{IoU} = 1 - IoU\)
Improved variants:
- GIoU: Adds a penalty for the non-overlapping enclosing area
- DIoU: Adds the distance between box centers
- CIoU: DIoU + aspect ratio consistency
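A minimal IoU implementation for one box against many, in the (x1, y1, x2, y2) convention; the NMS code in the next answer assumes a helper with this signature:
import numpy as np

def compute_iou(box, boxes):
    # box: (4,), boxes: (M, 4), both as (x1, y1, x2, y2)
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)  # epsilon avoids 0/0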
Q: Non-Maximum Suppression (NMS) -- how does it work?¶
A:
Problem: Same object detected multiple times with slightly different boxes.
Algorithm:
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    keep = []
    order = scores.argsort()[::-1]  # indices sorted by descending score
    while order.size > 0:
        i = order[0]
        keep.append(i)  # the highest-scoring remaining box survives
        # IoU of the kept box with all remaining boxes
        ious = compute_iou(boxes[i], boxes[order[1:]])
        # Keep only boxes with IoU below the threshold
        inds = np.where(ious <= iou_threshold)[0]
        order = order[inds + 1]  # +1 because ious was computed over order[1:]
    return keep
Soft-NMS: Instead of removing overlapping boxes, reduce their scores: \(s_i = s_i \cdot (1 - IoU(b_i, b_{max}))\)
Learnable NMS: Network learns to suppress (e.g., Relation Network).
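A simplified linear Soft-NMS sketch reusing compute_iou from above; the original algorithm re-selects the current maximum after each decay, which this single-pass version skips for brevity:
import numpy as np

def soft_nms_linear(boxes, scores, iou_threshold=0.5):
    scores = scores.copy()
    order = scores.argsort()[::-1]  # process in initial score order
    for idx, i in enumerate(order):
        rest = order[idx + 1:]
        if rest.size == 0:
            break
        ious = compute_iou(boxes[i], boxes[rest])
        # Decay overlapping scores linearly instead of discarding boxes
        scores[rest] *= np.where(ious > iou_threshold, 1.0 - ious, 1.0)
    return scores  # apply a final score threshold to get detections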
Q: YOLO architecture and evolution¶
A:
Core idea: Divide the image into an SxS grid; each cell predicts B boxes + C classes.
YOLOv1:
- Input: 448x448, Grid: 7x7, Boxes per cell: 2
- Output: 7x7x(5x2+20) = 7x7x30
Evolution:
- YOLOv2: Anchor boxes, batch norm, multi-scale training
- YOLOv3: Feature pyramid (3 scales), Darknet-53 backbone
- YOLOv4: CSPDarknet, PANet, Mish activation
- YOLOv5: PyTorch implementation, auto-learned anchor boxes
- YOLOv8: Anchor-free, decoupled head, Mosaic augmentation
- YOLOv9: Programmable gradient information (PGI)
- YOLOv10: NMS-free training, consistent dual assignments
YOLO loss: \(L = \lambda_{coord} L_{loc} + L_{conf} + \lambda_{noobj} L_{noobj} + L_{cls}\)
Q: Feature Pyramid Network (FPN) -- why is it needed?¶
A:
Problem: Objects at different scales need different feature resolutions.
Solution: Multi-scale feature hierarchy with top-down pathway + lateral connections.
Architecture:
Backbone (bottom-up):
C1 (1/2) -> C2 (1/4) -> C3 (1/8) -> C4 (1/16) -> C5 (1/32)
| | |
FPN (top-down): P3 <---- P4 <----- P5
+ + +
(1x1 conv from C3) (upsample)
Benefits: every pyramid level carries strong semantic features.
- P3 (80x80): Small objects
- P4 (40x40): Medium objects
- P5 (20x20): Large objects
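A minimal sketch of the top-down pathway with lateral connections, assuming ResNet-style channel counts (512, 1024, 2048) for C3-C5; FPNTopDown is an illustrative name, not a specific library class:
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project each C level to a common width
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 smoothing convs reduce upsampling aliasing
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode='nearest')
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode='nearest')
        return [conv(p) for conv, p in zip(self.smooth, (p3, p4, p5))]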
Modern variants:
- PANet: Adds an extra bottom-up path after the top-down one
- BiFPN (EfficientDet): Bidirectional, weighted fusion
Q: mAP (mean Average Precision) -- how is it computed?¶
A:
Precision-Recall curve: Vary confidence threshold, plot P vs R.
AP (Average Precision): Area under the PR curve for one class: \(AP = \int_0^1 P(R)\,dR \approx \sum_{k=1}^n P(k) \Delta R(k)\)
mAP: Mean of AP across all classes: \(mAP = \frac{1}{C} \sum_{c=1}^C AP_c\)
Common variants:
- mAP@0.5: IoU threshold = 0.5
- mAP@0.5:0.95: Average over IoU thresholds 0.5, 0.55, ..., 0.95 (COCO standard)
- mAP@small/medium/large: Broken down by object size
COCO metrics:
- AP = mAP@0.5:0.95 (primary metric)
- AP50 = mAP@0.5
- AP75 = mAP@0.75
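A sketch of per-class AP with all-point interpolation, assuming recalls and precisions come from sweeping the confidence threshold in descending order:
import numpy as np

def average_precision(recalls, precisions):
    # Pad the PR curve, then make precision monotonically non-increasing
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])      # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]  # points where recall changes
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])  # area under the steps
mAP is then the mean of this value over classes (and, for COCO, over IoU thresholds).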
Contrastive Learning & Self-Supervised Learning¶
Q: What is contrastive learning?¶
A:
Core idea: Learn representations by contrasting positive pairs against negative pairs.
Objective: Pull similar samples together, push dissimilar samples apart in embedding space.
Contrastive loss: \(L = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}\)
Where:
- \(z_i, z_j\) = embeddings of a positive pair
- \(\tau\) = temperature parameter
- sim = cosine similarity
Key components:
1. Positive pairs: Augmented views of the same image
2. Negative pairs: Different images (in-batch or memory bank)
3. Projection head: MLP that maps to the embedding space
Q: SimCLR -- architecture and training¶
A:
Pipeline:
1. Augmentation: Random crop, color distortion, Gaussian blur
2. Encoder: ResNet-50 → 2048-dim representation
3. Projection head: MLP (2048 → 512 → 128)
4. Contrastive loss: NT-Xent on normalized embeddings
Key findings:
- Strong augmentation is critical: random crop + color jitter
- Projection head helps training but is discarded for downstream tasks
- Large batch size: 4096-8192 to get enough in-batch negatives
- NT-Xent loss: Normalized Temperature-scaled Cross Entropy
NT-Xent Loss: \(\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}\)
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    # z_i, z_j: (batch, dim) -- embeddings of two augmented views
    batch = z_i.shape[0]
    # Concatenate and L2-normalize
    z = torch.cat([z_i, z_j], dim=0)      # (2N, dim)
    z = F.normalize(z, dim=1)
    # Pairwise cosine similarity matrix
    sim = torch.mm(z, z.T) / temperature  # (2N, 2N)
    # Mask out self-similarity on the diagonal
    mask = torch.eye(2 * batch, device=z.device).bool()
    sim.masked_fill_(mask, float('-inf'))
    # Labels: i's positive is i+N, and (i+N)'s positive is i
    labels = torch.cat([
        torch.arange(batch, 2 * batch),
        torch.arange(0, batch),
    ]).to(z.device)
    return F.cross_entropy(sim, labels)
Q: CLIP -- how does multimodal contrastive learning work?¶
A:
Architecture:
- Image encoder: ViT or ResNet
- Text encoder: Transformer
- Joint embedding space: 512-dim
Training:
- 400M (image, text) pairs scraped from the internet
- Contrastive loss: Match each image to its correct text caption
Loss: \(L = \frac{1}{2}(L_{I \to T} + L_{T \to I})\)
Where \(L_{I \to T}\) = cross-entropy over text candidates for each image.
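A minimal sketch of the symmetric loss above; clip_loss is an illustrative name, and unlike real CLIP (which learns the temperature) it assumes a fixed temperature and pre-normalized embeddings:
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (B, D), L2-normalized; row i of each is a matched pair
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)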
Zero-shot classification:
# Prompt engineering: one text prompt per class name
text_prompts = [f"a photo of a {label}" for label in classes]
text_embeddings = text_encoder(text_prompts)  # (C, D), L2-normalized
image_embedding = image_encoder(image)        # (1, D), L2-normalized

# Zero-shot prediction: cosine similarity against every class prompt
logits = image_embedding @ text_embeddings.T  # (1, C)
pred = logits.argmax(dim=-1)
Capabilities:
- Zero-shot transfer to new datasets
- Text-guided image generation (DALL-E, Stable Diffusion)
- Image retrieval via text queries
Q: MoCo vs SimCLR -- what's the difference?¶
A:
| Aspect | SimCLR | MoCo |
|---|---|---|
| Negatives | In-batch (requires large batch) | Memory bank / Queue |
| Batch size | 4096-8192 | 256-1024 |
| Memory | O(B^2) | O(K) where K = queue size |
| Encoder | Single (shared) | Two (query + key, momentum update) |
| Augmentation | Strong (crop + color + blur) | Similar |
MoCo v2 improvements:
- Added an MLP projection head
- Strong augmentation from SimCLR
- Cosine learning rate schedule
MoCo v3:
- ViT backbone
- No memory queue (large batch + stop-gradient)
- Simpler, faster
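A sketch of the two MoCo-specific mechanics: the momentum (EMA) update of the key encoder and the FIFO queue of negatives; function names are illustrative, and the queue update assumes the queue size is divisible by the batch size:
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # Key encoder trails the query encoder as an exponential moving average
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.data = m * k_p.data + (1.0 - m) * q_p.data

@torch.no_grad()
def dequeue_and_enqueue(queue, keys, ptr):
    # queue: (K, D) of past key embeddings reused as negatives
    batch = keys.shape[0]
    queue[ptr:ptr + batch] = keys          # overwrite the oldest entries
    return (ptr + batch) % queue.shape[0]  # advance the circular pointer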
Q: Self-supervised pretext tasks -- what kinds exist?¶
A:
Contrastive methods (explicit negatives): SimCLR, MoCo
Non-contrastive (no negatives):
- BYOL: Bootstrap Your Own Latent
- SimSiam: Stop-gradient + predictor network
Masked prediction:
- MAE (Masked Autoencoder): Mask 75% of patches, reconstruct them (see the masking sketch below)
- BEiT: Predict discrete visual tokens (like BERT)
Pretext tasks (older):
- Rotation prediction: 0, 90, 180, 270 degrees
- Jigsaw puzzle: Rearrange shuffled patches
- Colorization: Predict color from grayscale
- Inpainting: Fill masked regions
2025-2026 trend: MAE + contrastive hybrids for the best of both worlds.
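A minimal sketch of MAE-style random masking, the step that keeps only 25% of patch tokens for the encoder; random_masking is an illustrative name:
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (B, N, D) patch tokens; keep a random (1 - mask_ratio) subset
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)  # a random permutation per image
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle  # the encoder sees only the visible tokens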
Q: BYOL -- how to train without negative pairs?¶
A:
Key insight: Bootstrap representations using momentum encoder + predictor.
Architecture:
- Online network: Encoder + Projector + Predictor
- Target network: Encoder + Projector (momentum-updated)
Loss: \(L = 2 - 2 \cdot \frac{\langle q(z_\theta(x)), z'_{\xi}(x') \rangle}{\|q(z_\theta(x))\| \cdot \|z'_{\xi}(x')\|}\)
Where \(q\) = predictor, \(z_\theta\) = online projection, \(z'_{\xi}\) = target projection.
Why no collapse?
1. Predictor: Forces the online network to predict the target's output
2. Stop-gradient: The target receives no gradients
3. Momentum update: A slowly moving target prevents trivial solutions
Momentum update: \(\xi \leftarrow \tau \xi + (1 - \tau) \theta\)
Where \(\tau \approx 0.996\).
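A minimal sketch of the loss above; after normalization it equals the MSE between unit vectors, and detach implements the stop-gradient on the target branch:
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    # online_pred = q(z_theta(x)); target_proj = z'_xi(x')
    q = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)  # stop-gradient on the target
    return (2 - 2 * (q * z).sum(dim=-1)).mean()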
Q: How to evaluate the quality of self-supervised representations?¶
A:
Linear probing: Freeze encoder, train linear classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Standard evaluation protocol: frozen encoder, trainable linear head
encoder.eval()  # frozen
classifier = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

# Train only the classifier on labeled data
for images, labels in train_loader:
    with torch.no_grad():
        features = encoder(images)  # no gradients flow into the encoder
    logits = classifier(features)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Fine-tuning: Unfreeze and train entire network.
Transfer learning: Pre-train on ImageNet, evaluate on other datasets (CIFAR, Food101, etc.).
Metrics:
- Linear probe accuracy: Quality of frozen features (higher = better)
- Fine-tune accuracy: Downstream task performance
- k-NN accuracy: Non-parametric evaluation
Typical results (ImageNet):

| Method | Linear Probe | Fine-tune |
|---|---|---|
| Supervised | 76.7% | 76.7% |
| SimCLR | 69.3% | 74.5% |
| MoCo v2 | 71.1% | 75.2% |
| BYOL | 74.3% | 76.6% |
| MAE | 68.0% | 83.6% |