
DL Interview: Vision & Detection

~8 minute read


Vision Transformers (ViT, Swin), Object Detection (YOLO, Faster R-CNN, FPN), Contrastive & Self-Supervised Learning (SimCLR, MoCo, CLIP, BYOL, MAE).


Vision Transformers (ViT)

Q: How does ViT process images?

A:

Core idea: Treat image as sequence of patches.

Pipeline:
1. Patch embedding: split the image into \(16 \times 16\) patches
2. Flatten + linear projection: \(P^2 \cdot C \to D\) (embedding dim)
3. Positional encoding: add learnable position embeddings
4. [CLS] token: prepend a learnable classification token
5. Transformer encoder: standard self-attention layers
6. MLP head: classification from the [CLS] token

import torch
from torch import nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2  # 196
        self.proj = nn.Conv2d(in_ch, embed_dim, patch_size, patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224) -> (B, 196, 768)
        x = self.proj(x)  # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)
        return x
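A minimal sketch (not a reference ViT implementation) of how the [CLS] token and position embeddings from steps 3-5 of the pipeline might be wired around `PatchEmbedding`; the class name, depth, and use of `nn.TransformerEncoder` are illustrative simplifications:

class ViTStub(nn.Module):
    """Illustrative ViT skeleton: patch embed -> +pos -> Transformer -> [CLS] head."""
    def __init__(self, img_size=224, patch_size=16, embed_dim=768,
                 depth=12, n_heads=12, n_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        n_patches = self.patch_embed.n_patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)                     # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, 768)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                   # classify from [CLS]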

Q: ViT vs CNN -- when to use which?

A:

| Aspect | CNN | ViT |
|--------|-----|-----|
| Inductive bias | Translation equivariance, locality | Minimal (learned from data) |
| Data efficiency | Good with small data | Needs large datasets |
| Global context | Limited by receptive field | Full attention from layer 1 |
| Compute | Efficient for small images | Scales with sequence length |
| Pre-training | ImageNet sufficient | JFT-300M, LAION better |

Decision:
- Small dataset (<100K images): CNN (ResNet, EfficientNet)
- Large dataset + compute: ViT
- Production real-time: CNN
- SOTA on large-scale benchmarks: ViT + huge pre-training

Q: Why does ViT need a [CLS] token?

A:

Purpose: Aggregate global representation for classification.

How it works:
- [CLS] attends to all patch tokens
- After L layers, the [CLS] embedding contains a "summary" of the image
- Only [CLS] is connected to the classification head

Alternative: global average pooling over all patch tokens. The [CLS] token instead lets the model learn what to aggregate for the task.

Q: Swin Transformer vs ViT?

A:

ViT problem: global attention costs \(O(N^2)\) for N patches. A 224x224 image gives 196 patches, which is manageable; high-resolution inputs make this prohibitive.

Swin solution: Hierarchical attention in windows.

Key features:
1. Window attention: attend only within local windows (7x7 in the patch grid = 49 patches per window; see the window-partition sketch below)
2. Shifted windows: alternate window positions for cross-window connections
3. Hierarchical: progressive downsampling (like CNNs)
4. Linear complexity: \(O(N)\) instead of \(O(N^2)\)

Use cases:
- ViT: classification, small images
- Swin: detection, segmentation, high-resolution images
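A minimal sketch of window partitioning as used in window attention; the 7x7 window size and tensor layout follow the description above, but the function name and example shapes are illustrative:

import torch

def window_partition(x, window_size=7):
    """Split a patch-grid feature map into non-overlapping windows.

    x: (B, H, W, C) feature map on the patch grid.
    Returns: (B * num_windows, window_size * window_size, C),
    i.e. one 49-token sequence per window for local self-attention.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size * window_size, C)

# Example: 56x56 patch grid -> 64 windows of 49 tokens each
feat = torch.randn(2, 56, 56, 96)
print(window_partition(feat).shape)  # torch.Size([128, 49, 96])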


Object Detection

Q: One-stage vs two-stage detectors -- what's the difference?

A:

| Aspect | Two-stage (Faster R-CNN) | One-stage (YOLO, SSD) |
|--------|--------------------------|-----------------------|
| Pipeline | Region proposal + classification | Direct prediction |
| Speed | Slower (~5 FPS) | Faster (~45-155 FPS) |
| Accuracy | Higher mAP | Slightly lower |
| Anchors | RPN generates proposals | Pre-defined anchor boxes |

Two-stage (Faster R-CNN):
1. Stage 1: Region Proposal Network (RPN) generates candidate boxes
2. Stage 2: RoI pooling + classification per region

One-stage (YOLO):
- A single network predicts bounding boxes + classes directly
- Treats detection as a regression problem

When to use:
- Two-stage: accuracy-critical tasks, smaller datasets
- One-stage: real-time applications, embedded systems

Q: What are anchor boxes?

A:

Concept: Pre-defined box templates at each location to match objects of different scales/aspect ratios.

How it works:
- Each location has K anchor boxes (e.g., 3 scales x 3 ratios = 9 anchors)
- The model predicts offsets from anchors, not absolute boxes (see the decoding sketch after the anchor list below)
- During training, anchors are matched to ground truth by IoU (e.g., IoU > 0.5)

Anchor design:

# Typical YOLO anchors
anchors = [
    (10, 13), (16, 30), (33, 23),    # Small objects
    (30, 61), (62, 45), (59, 119),   # Medium objects
    (116, 90), (156, 198), (373, 326) # Large objects
]
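A minimal sketch of the standard box-offset decoding (the Faster R-CNN / YOLOv2-style parameterization) that turns predicted offsets \((t_x, t_y, t_w, t_h)\) and an anchor into an absolute box; the function and variable names are illustrative:

import numpy as np

def decode_box(anchor, offsets):
    """anchor: (x_c, y_c, w, h); offsets: (t_x, t_y, t_w, t_h) predicted by the network."""
    x_a, y_a, w_a, h_a = anchor
    t_x, t_y, t_w, t_h = offsets
    x = x_a + t_x * w_a       # shift the center proportionally to anchor size
    y = y_a + t_y * h_a
    w = w_a * np.exp(t_w)     # scale width/height in log space
    h = h_a * np.exp(t_h)
    return x, y, w, h

# Example: a 116x90 anchor nudged and enlarged by the prediction
print(decode_box((50, 50, 116, 90), (0.1, -0.2, 0.3, 0.0)))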

Modern trend (anchor-free): - FCOS, CenterNet -- predict from center point - No anchor boxes, simpler post-processing

Q: IoU (Intersection over Union) -- formula and uses

A:

\[IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|}\]
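A minimal NumPy sketch of IoU between one box and an array of boxes in (x1, y1, x2, y2) format; it also serves as the `compute_iou` helper assumed by the NMS code further below:

import numpy as np

def compute_iou(box, boxes):
    """IoU of one box against N boxes; all boxes are (x1, y1, x2, y2) arrays."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])

    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)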

Uses:
1. Training: match predictions to ground truth
2. NMS: remove duplicate detections
3. Evaluation: mAP calculation
4. Loss: IoU loss for better localization

IoU loss: \(L_{IoU} = 1 - IoU\)

Improved variants:
- GIoU: adds a penalty based on the smallest enclosing box (sketch below)
- DIoU: adds the distance between box centers
- CIoU: DIoU + aspect ratio consistency
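A minimal sketch of GIoU for a single pair of (x1, y1, x2, y2) boxes, reusing the IoU definition above; purely illustrative:

def giou(box_a, box_b):
    """Generalized IoU for two (x1, y1, x2, y2) boxes: IoU minus an enclosing-box penalty."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both boxes
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    return iou - (c_area - union) / c_area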

Q: Non-Maximum Suppression (NMS) -- how does it work?

A:

Problem: Same object detected multiple times with slightly different boxes.

Algorithm:

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array in (x1, y1, x2, y2); scores: (N,) confidences."""
    keep = []
    order = scores.argsort()[::-1]  # Sort by score

    while order.size > 0:
        i = order[0]
        keep.append(i)

        # Compute IoU with remaining boxes
        ious = compute_iou(boxes[i], boxes[order[1:]])

        # Keep boxes with IoU < threshold
        inds = np.where(ious <= iou_threshold)[0]
        order = order[inds + 1]

    return keep

Soft-NMS: instead of removing overlapping boxes, reduce their scores: \(s_i = s_i \cdot (1 - IoU(b_i, b_{max}))\)

Learnable NMS: Network learns to suppress (e.g., Relation Network).

Q: YOLO architecture and evolution

A:

Core idea: divide the image into an SxS grid; each cell predicts B boxes + C class probabilities.

YOLOv1:
- Input: 448x448, grid: 7x7, boxes per cell: 2
- Output: 7x7x(5x2+20) = 7x7x30

Evolution:
- YOLOv2: anchor boxes, batch norm, multi-scale training
- YOLOv3: feature pyramid (3 scales), Darknet-53 backbone
- YOLOv4: CSPDarknet, PANet, Mish activation
- YOLOv5: PyTorch implementation, auto-learned anchor boxes
- YOLOv8: anchor-free, decoupled head, Mosaic augmentation
- YOLOv9: Programmable Gradient Information (PGI)
- YOLOv10: NMS-free training, consistent dual assignments

YOLO loss: \(L = \lambda_{coord} L_{loc} + L_{conf} + \lambda_{noobj} L_{noobj} + L_{cls}\)

Q: Feature Pyramid Network (FPN) -- why is it needed?

A:

Problem: Objects at different scales need different feature resolutions.

Solution: Multi-scale feature hierarchy with top-down pathway + lateral connections.

Architecture:

Backbone (bottom-up):
  C1 (1/2) -> C2 (1/4) -> C3 (1/8) -> C4 (1/16) -> C5 (1/32)
                           |            |            |
                       1x1 conv     1x1 conv     1x1 conv
                           v            v            v
FPN (top-down):           P3  <--up--  P4  <--up--  P5

Benefits: each pyramid level carries strong semantic features.
- P3 (80x80): small objects
- P4 (40x40): medium objects
- P5 (20x20): large objects

Modern variants:
- PANet: adds a bottom-up path on top of the top-down path
- BiFPN (EfficientDet): bidirectional, weighted feature fusion
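A minimal sketch of the FPN top-down pathway with lateral 1x1 convolutions, assuming C3-C5 feature maps from a backbone; the class name and channel counts are illustrative:

import torch
from torch import nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Illustrative FPN: lateral 1x1 convs + top-down upsampling + 3x3 smoothing."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)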

Q: mAP (mean Average Precision) -- how is it computed?

A:

Precision-Recall curve: Vary confidence threshold, plot P vs R.

AP (Average Precision): area under the PR curve for one class: \(AP = \int_0^1 P(R)\,dR \approx \sum_{k=1}^n P(k)\, \Delta R(k)\)

mAP: mean of AP across all classes: \(mAP = \frac{1}{C} \sum_{c=1}^C AP_c\)

Common variants:
- mAP@0.5: IoU threshold = 0.5
- mAP@0.5:0.95: averaged over IoU thresholds 0.5, 0.55, ..., 0.95 (COCO standard)
- mAP@small/medium/large: computed per object size

COCO metrics:
- AP = mAP@0.5:0.95 (primary metric)
- AP50 = mAP@0.5
- AP75 = mAP@0.75
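A minimal sketch of computing AP for one class from detections sorted by descending confidence, using all-point interpolation (the precision envelope); the `is_tp` flags, which mark whether each detection matched a ground-truth box, are assumed to come from an upstream IoU-matching step:

import numpy as np

def average_precision(is_tp, n_gt):
    """is_tp: boolean array over detections sorted by descending confidence.
    n_gt: number of ground-truth boxes for this class."""
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    recall = tp / n_gt
    precision = tp / (tp + fp)

    # Precision envelope (monotonically decreasing), then integrate over recall
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate([[0.0], recall])
    precision = np.concatenate([[precision[0]], precision])
    return np.sum((recall[1:] - recall[:-1]) * precision[1:])

# Example: 4 detections, 3 ground-truth boxes
print(average_precision(np.array([True, True, False, True]), n_gt=3))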


Contrastive Learning & Self-Supervised Learning

Q: What is contrastive learning?

A:

Core idea: Learn representations by contrasting positive pairs against negative pairs.

Objective: Pull similar samples together, push dissimilar samples apart in embedding space.

Contrastive loss: \[L = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}\]

Where: - \(z_i, z_j\) = embeddings of positive pair - \(\tau\) = temperature parameter - sim = cosine similarity

Key components:
1. Positive pairs: augmented views of the same image
2. Negative pairs: different images (in-batch or from a memory bank)
3. Projection head: MLP that maps representations to the embedding space

Q: SimCLR -- architecture and training

A:

Pipeline:
1. Augmentation: random crop, color distortion, Gaussian blur
2. Encoder: ResNet-50 → 2048-dim representation
3. Projection head: MLP (2048 → 512 → 128)
4. Contrastive loss: NT-Xent on normalized embeddings

Key findings:
- Strong augmentation is critical: random crop + color jitter
- The projection head helps training, but is discarded for downstream tasks (use the encoder output)
- Large batch size: 4096-8192 to get enough in-batch negatives
- NT-Xent loss: Normalized Temperature-scaled Cross Entropy

NT-Xent loss: \[\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}\]

import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    # z_i, z_j: (batch, dim) - embeddings of two views
    batch = z_i.shape[0]

    # Concatenate and normalize
    z = torch.cat([z_i, z_j], dim=0)  # (2N, dim)
    z = F.normalize(z, dim=1)

    # Similarity matrix
    sim = torch.mm(z, z.T) / temperature  # (2N, 2N)

    # Mask out self-similarity
    mask = torch.eye(2 * batch, device=z.device).bool()
    sim.masked_fill_(mask, float('-inf'))

    # Labels: i's positive is j, j's positive is i
    labels = torch.cat([
        torch.arange(batch, 2 * batch),
        torch.arange(0, batch)
    ], dim=0).to(z.device)

    return F.cross_entropy(sim, labels)

Q: CLIP -- how does multimodal contrastive learning work?

A:

Architecture:
- Image encoder: ViT or ResNet
- Text encoder: Transformer
- Joint embedding space: 512-dim

Training:
- 400M (image, text) pairs scraped from the internet
- Contrastive loss: match each image to its correct text caption

Loss: \(L = \frac{1}{2}(L_{I \to T} + L_{T \to I})\)

Where \(L_{I \to T}\) = cross-entropy over text candidates for each image.
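A minimal sketch of the symmetric CLIP-style training loss, assuming already-computed batch embeddings; `image_emb`, `text_emb`, and the fixed temperature are illustrative (CLIP actually learns the temperature as a parameter):

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (N, dim) embeddings of N matching (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)

    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    labels = torch.arange(len(image_emb), device=logits.device)  # i-th image matches i-th text

    loss_i2t = F.cross_entropy(logits, labels)    # image -> text
    loss_t2i = F.cross_entropy(logits.T, labels)  # text -> image
    return (loss_i2t + loss_t2i) / 2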

Zero-shot classification:

import torch.nn.functional as F

# Prompt engineering: turn class names into captions
text_prompts = [f"a photo of a {label}" for label in classes]

# Encode and L2-normalize both modalities (cosine similarity)
text_embeddings = F.normalize(text_encoder(text_prompts), dim=-1)
image_embedding = F.normalize(image_encoder(image), dim=-1)

# Zero-shot prediction: the most similar caption wins
logits = image_embedding @ text_embeddings.T
pred = logits.argmax(dim=-1)

Capabilities: - Zero-shot transfer to new datasets - Text-guided image generation (DALL-E, Stable Diffusion) - Image retrieval via text queries

Q: MoCo vs SimCLR -- what's the difference?

A:

| Aspect | SimCLR | MoCo |
|--------|--------|------|
| Negatives | In-batch (requires large batch) | Memory bank / queue |
| Batch size | 4096-8192 | 256-1024 |
| Memory | O(B^2) | O(K), where K = queue size |
| Encoder | Single (shared) | Two (query + key, momentum update) |
| Augmentation | Strong (crop + color + blur) | Similar |

MoCo v2 improvements:
- Added an MLP projection head
- Strong augmentations borrowed from SimCLR
- Cosine learning rate schedule

MoCo v3:
- ViT backbone
- No memory queue (large batch + stop-gradient instead)
- Simpler and faster
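A minimal sketch of the two MoCo mechanics from the table above: the momentum (EMA) update of the key encoder and the FIFO queue of negatives; the function names, queue handling, and missing encoders/loss are illustrative:

import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Key encoder follows the query encoder as an exponential moving average."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data = m * p_k.data + (1.0 - m) * p_q.data

@torch.no_grad()
def dequeue_and_enqueue(queue, keys, ptr):
    """FIFO queue of past key embeddings used as negatives; queue: (K, dim)."""
    batch = keys.shape[0]
    queue[ptr:ptr + batch] = keys          # assumes K is divisible by the batch size
    return (ptr + batch) % queue.shape[0]  # new write pointer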

Q: Self-supervised pretext tasks -- what types are there?

A:

Contrastive methods (use negative pairs): SimCLR, MoCo

Non-contrastive (no negative pairs):
- BYOL: Bootstrap Your Own Latent
- SimSiam: stop-gradient + predictor network

Masked prediction:
- MAE (Masked Autoencoder): mask 75% of patches, reconstruct pixels
- BEiT: predict discrete visual tokens (like BERT)

Pretext tasks (older):
- Rotation prediction: 0, 90, 180, 270 degrees
- Jigsaw puzzle: rearrange shuffled patches
- Colorization: predict color from grayscale
- Inpainting: fill in masked regions

2025-2026 trend: MAE + contrastive hybrid for best of both worlds.

Q: BYOL -- how does it learn without negative pairs?

A:

Key insight: Bootstrap representations using momentum encoder + predictor.

Architecture:
- Online network: encoder + projector + predictor
- Target network: encoder + projector (momentum-updated)

Loss: \(L = 2 - 2 \cdot \frac{\langle q(z_\theta(x)),\, z'_{\xi}(x') \rangle}{\|q(z_\theta(x))\| \cdot \|z'_{\xi}(x')\|}\)

Where \(q\) = predictor, \(z_\theta\) = online, \(z'_{\xi}\) = target.

Why no collapse?
1. Predictor: forces the online network to predict the target's output
2. Stop-gradient: the target network receives no gradients
3. Momentum update: a slowly moving target prevents trivial solutions

Momentum update: \(\xi \leftarrow \tau \xi + (1 - \tau) \theta\)

Where \(\tau \approx 0.996\).
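A minimal sketch of the BYOL regression loss from the formula above; `online_pred` and `target_proj` are assumed to be the predictor output of the online network and the (detached) target projection for the two augmented views:

import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """L = 2 - 2 * cosine similarity; target_proj must be detached (stop-gradient)."""
    online_pred = F.normalize(online_pred, dim=-1)
    target_proj = F.normalize(target_proj.detach(), dim=-1)
    return (2 - 2 * (online_pred * target_proj).sum(dim=-1)).mean()

# Target parameters follow the online network via EMA with tau ~ 0.996:
# xi <- tau * xi + (1 - tau) * theta  (same form as the MoCo momentum update above)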

Q: How do you evaluate the quality of self-supervised representations?

A:

Linear probing: Freeze encoder, train linear classifier.

import torch
from torch import nn
import torch.nn.functional as F

# Standard evaluation protocol: frozen encoder, trainable linear classifier
encoder.eval()
classifier = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

# Train only the classifier on labeled data
for images, labels in train_loader:
    with torch.no_grad():
        features = encoder(images)  # frozen features
    logits = classifier(features)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Fine-tuning: Unfreeze and train entire network.

Transfer learning: Train on ImageNet, test on other datasets (CIFAR, Food101, etc.).

Metrics:
- Linear probe accuracy: quality of frozen features (higher = better)
- Fine-tune accuracy: downstream task performance
- k-NN accuracy: non-parametric evaluation (see the sketch after the results table)

Typical results (ImageNet):

| Method | Linear Probe | Fine-tune |
|--------|--------------|-----------|
| Supervised | 76.7% | 76.7% |
| SimCLR | 69.3% | 74.5% |
| MoCo v2 | 71.1% | 75.2% |
| BYOL | 74.3% | 76.6% |
| MAE | 68.0% | 83.6% |
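A minimal sketch of k-NN evaluation on frozen features using cosine similarity; `train_feats`, `train_labels`, `test_feats`, and `test_labels` are assumed to be precomputed feature/label tensors:

import torch
import torch.nn.functional as F

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    """Classify each test feature by majority vote among its k nearest training features."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)

    sim = test_feats @ train_feats.T            # (n_test, n_train) cosine similarities
    _, idx = sim.topk(k, dim=1)                 # indices of the k nearest neighbors
    neighbor_labels = train_labels[idx]         # (n_test, k)
    preds = neighbor_labels.mode(dim=1).values  # majority vote
    return (preds == test_labels).float().mean().item()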