Self-Supervised Computer Vision Models¶
~7 minute read
Prerequisites: Vision Transformers | LLM Knowledge Distillation
Image annotation costs $0.05-0.50 per image, and at ImageNet scale (14M+ images) that adds up to millions of dollars. Self-supervised learning sidesteps the problem: DINOv2 was trained on 142M images without a single label and reaches 86.3% linear-probe accuracy on ImageNet -- better than many supervised models. In 2025-2026 three paradigms dominate: self-distillation (DINOv2), masked autoencoding (MAE), and joint-embedding prediction (I-JEPA/V-JEPA 2). The choice between them determines transfer-learning quality, training efficiency, and applicability to video.
Key Concepts¶
The 2026 Landscape¶
| Модель | Тип | Ключевая инновация | Лучше для |
|---|---|---|---|
| DINOv2 | Self-distillation | Dense features, no labels | Transfer learning, frozen features |
| MAE | Masked autoencoding | High masking ratio (75%) | Pre-training efficiency |
| I-JEPA | Joint-embedding | Predict representations | Sample efficiency |
| V-JEPA 2 | Video JEPA | Temporal prediction | Video understanding, robotic planning |
1. DINOv2¶
Architecture¶
+-----------------------------------------------------------+
| DINOv2 Architecture |
+------------------------------------------------------------|
| |
| Self-Distillation with no labels: |
| +-----------------------------------------------------+ |
| | Image -> Global Crop -> Student ViT -> Features | |
| | Image -> Global Crop -> Teacher ViT -> Features | |
| | | |
| | Loss = Cross-Entropy(Student || Teacher) | |
| | | |
| | Teacher = EMA of Student (momentum update) | |
| | Centering + Sharpening to prevent collapse | |
| +-----------------------------------------------------+ |
| |
| Additional losses: iBOT masked image modeling, |
| KoLeo regularization for feature uniformity |
| |
| Improvements over DINOv1: |
| - 142M curated images (vs ImageNet-1K) |
| - Patch size 14 (vs 16) |
| - Registers for better attention |
+-------------------------------------------------------------+
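A minimal PyTorch sketch of the self-distillation step above (simplified to a single crop pair, without the iBOT and KoLeo terms; the function names and the temperature/momentum values are illustrative assumptions, not DINOv2's exact settings):
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # Teacher targets: centering + sharpening (low temperature) to prevent collapse
    targets = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    # Student predictions at a higher temperature
    log_probs = F.log_softmax(student_out / tau_s, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()   # cross-entropy(Student || Teacher)

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    # Teacher weights = exponential moving average of the student weights
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)

@torch.no_grad()
def update_center(center, teacher_out, momentum=0.9):
    # Running mean of teacher outputs, used for the centering step above
    return momentum * center + (1 - momentum) * teacher_out.mean(dim=0)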
DINOv1 vs DINOv2¶
| Feature | DINOv1 | DINOv2 |
|---|---|---|
| Training data | ImageNet-1K | 142M curated images |
| Patch size | 16 | 14 |
| Registers | No | Yes |
| Linear probe | Good | Excellent |
| Frozen features | Good | SOTA |
Performance¶
| Task | Result |
|---|---|
| ImageNet k-NN | 82.1% top-1 |
| ImageNet Linear probe | 86.3% top-1 |
| ADE20K Segmentation | 53.1 mIoU (SOTA) |
| NYUv2 Depth | 0.357 RMSE (SOTA) |
When to Use¶
| Use Case | Recommendation |
|---|---|
| Transfer learning | Frozen features work well |
| Semantic segmentation | Dense patch features |
| Depth estimation | Geometric understanding |
| Object detection | Rich spatial features |
| Medical imaging | Strong transfer from natural images |
| Small datasets | Good sample efficiency |
2. MAE (Masked Autoencoders)¶
Architecture¶
+-----------------------------------------------------------+
| MAE Architecture |
+------------------------------------------------------------|
| Input Image (224x224) |
| | |
| v |
| Patch Embedding (16x16 patches = 196 patches) |
| | |
| v |
| Random Masking (75% patches removed!) |
| Keep only 25% visible patches = 49 patches |
| | |
| v |
| Encoder (ViT) - Only on visible patches |
| Much faster since 75% fewer patches |
| | |
| v |
| Decoder (smaller ViT) - Full sequence + mask tokens |
| Reconstructs original pixel values |
| | |
| v |
| Loss = MSE(reconstructed, original) on masked patches |
+-------------------------------------------------------------+
Why 75%? Lower ratios make the task too easy to force useful representations, while much higher ratios leave too little context to reconstruct from. Aggressive masking turns reconstruction into a genuinely challenging prediction task, and the encoder only sees 25% of the patches, which makes pre-training much cheaper.
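The arithmetic behind that efficiency claim, as a quick sanity check (standard ViT-B/16 numbers from the diagram above):
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2       # 14 x 14 = 196 patches
mask_ratio = 0.75
visible = int(num_patches * (1 - mask_ratio))       # 49 patches reach the encoder
print(num_patches, visible, num_patches / visible)  # 196 49 4.0 -> ~4x fewer encoder tokens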
Training Efficiency¶
| Model | Params | ImageNet Fine-tune |
|---|---|---|
| ViT-B/16 MAE | 86M | 83.6% |
| ViT-L/16 MAE | 307M | 85.9% |
| ViT-H/14 MAE | 632M | 86.9% |
Pros and Cons¶
| Aspect | Pros | Cons |
|---|---|---|
| Efficiency | Only 25% patches to encoder | -- |
| Scalability | Scales to huge models | -- |
| Features | Good for fine-tuning | Poor frozen features |
| Linear probe | -- | Needs full fine-tuning |
MAE vs Other Methods¶
| Method | Pre-train Compute | Fine-tune Performance |
|---|---|---|
| Supervised | High | Good |
| DINO | High | Good |
| MAE | Lower | Better |
| MAE + DINO | High | Best |
3. I-JEPA¶
Architecture (LeCun)¶
+-----------------------------------------------------------+
| I-JEPA Architecture |
+------------------------------------------------------------|
| |
| Key idea: Predict embeddings, not pixels |
| |
| Context blocks -> Encoder -> Context embeddings |
| Target blocks -> Encoder (EMA) -> Target embeddings |
| |
| Predictor (small network): |
| Predict target embeddings from context |
| |
| Loss = ||Predicted embedding - Target embedding||^2 |
| |
| Why I-JEPA works: |
| - No decoder needed (predicts in latent space) |
| - Learns semantic features directly |
| - Better linear probe than MAE |
| - More sample-efficient |
+-------------------------------------------------------------+
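A minimal sketch of the latent-space objective (simplified to a single context/target pair; context_encoder, target_encoder, and predictor are placeholder modules, and the target encoder would be updated with the same EMA rule as in the DINOv2 sketch above):
import torch
import torch.nn.functional as F

def ijepa_loss(context_encoder, target_encoder, predictor,
               context_patches, target_patches):
    # Online encoder sees only the context block
    context_emb = context_encoder(context_patches)
    # EMA target encoder produces the targets; no gradient flows into it
    with torch.no_grad():
        target_emb = target_encoder(target_patches)
    # Predict target embeddings from context embeddings -- no pixel decoder
    predicted = predictor(context_emb)
    return F.mse_loss(predicted, target_emb)   # ||predicted embedding - target embedding||^2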
I-JEPA vs MAE¶
| Aspect | MAE | I-JEPA |
|---|---|---|
| Prediction target | Pixels | Embeddings |
| Decoder | Required | Not needed |
| Linear probe | Poor | Good |
| Fine-tuning | Required | Optional |
| Sample efficiency | Lower | Higher |
| Compute efficiency | Higher (75% mask) | Lower |
4. V-JEPA 2¶
Video Self-Supervised Learning¶
+-----------------------------------------------------------+
| V-JEPA 2 Architecture |
+------------------------------------------------------------|
| |
| Training: |
| 1. Mask temporal regions in video |
| 2. Encode visible frames |
| 3. Predict masked frame embeddings |
| |
| Key capabilities: |
| - Temporal prediction |
| - Motion understanding |
| - Physical reasoning |
| |
| V-JEPA 2-AC (Action-Conditioned): |
| - Predict future state given action |
| - Trained on <62 hours robot data |
| - Zero-shot planning capability |
| - Works on real robots |
+-------------------------------------------------------------+
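How zero-shot planning with an action-conditioned predictor can look in principle. This is a schematic sketch, not Meta's implementation: the encoder and predictor modules, the random action sampling, and the goal-distance energy are all illustrative assumptions.
import torch

def plan_action(encoder, predictor, current_frames, goal_image,
                num_candidates=256, action_dim=7):
    # Pick the candidate action whose predicted next-state embedding is closest to the goal
    with torch.no_grad():
        state = encoder(current_frames)                     # embed current observation
        goal = encoder(goal_image)                          # embed goal observation
        actions = torch.randn(num_candidates, action_dim)   # sample candidate actions
        preds = torch.stack([predictor(state, a) for a in actions])
        energy = (preds - goal).pow(2).flatten(1).sum(dim=1)  # distance in latent space
        return actions[energy.argmin()]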
V-JEPA 2 Benchmarks¶
| Benchmark | V-JEPA 2 | Previous SOTA |
|---|---|---|
| Something-Something v2 | 75.3% | 72.1% |
| Kinetics-400 | 84.2% | 82.5% |
| Epic-Kitchens | 41.8% | 38.5% |
| Robotic planning | Zero-shot | Requires training |
V-JEPA vs Video MAE¶
| Aspect | Video MAE | V-JEPA |
|---|---|---|
| Target | Pixel reconstruction | Representation prediction |
| Compute | Higher | Lower |
| Temporal modeling | Implicit | Explicit |
| Planning capability | Limited | Better |
5. Comparison¶
Method Comparison¶
| Feature | DINOv2 | MAE | I-JEPA | V-JEPA 2 |
|---|---|---|---|---|
| Type | Self-distillation | Masked AE | JEPA | Video JEPA |
| Target | Features | Pixels | Representations | Video reprs |
| Decoder needed | No | Yes | No | No |
| Dense features | Excellent | Poor | Good | Good |
| Video | Frame-level | Frame-level | Frame-level | Native |
| Sample efficiency | Medium | Lower | Higher | Higher |
ImageNet Linear Probe¶
| Method | ViT-B | ViT-L |
|---|---|---|
| Supervised | 77.9% | 76.5% |
| DINO | 76.2% | 77.4% |
| DINOv2 | 81.1% | 84.3% |
| MAE | 68.0% | 75.8% |
| I-JEPA | 74.5% | 81.0% |
Feature Quality¶
| Model | Linear Probe | Fine-tuned | Frozen Features |
|---|---|---|---|
| DINOv2 | Excellent | Excellent | Excellent |
| MAE | Poor | Excellent | Poor |
| I-JEPA | Good | Excellent | Good |
| V-JEPA 2 | Good | Excellent | Good |
Decision Tree¶
Task?
|
+-- Dense prediction (segmentation, depth)? -> DINOv2
+-- Efficient pre-training? -> MAE
+-- Sample-efficient learning? -> I-JEPA
+-- Video understanding? -> V-JEPA 2
+-- Robotic planning? -> V-JEPA 2-AC
+-- General transfer learning? -> DINOv2 or I-JEPA
Training Efficiency¶
| Method | Epochs to Converge | Relative Speed |
|---|---|---|
| Supervised | 300 | 1x |
| DINO | 800 | 0.4x |
| DINOv2 | 625 | 0.5x |
| MAE | 1600 | 0.8x (75% masking) |
| I-JEPA | 500 | 0.6x |
Masking Ratios¶
| Method | Optimal Masking | Reason |
|---|---|---|
| MAE | 75% | Forces learning |
| BEiT | 40% | Token prediction |
| I-JEPA | Variable | Block-wise |
6. Implementation¶
DINOv2 Feature Extraction¶
import torch
from transformers import AutoModel, AutoImageProcessor

model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

def extract_features(image):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    cls_features = outputs.last_hidden_state[:, 0]  # CLS token, shape [1, 768]
    return cls_features
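Usage sketch (the image path is a placeholder); for dense tasks such as segmentation, take the patch tokens that follow the CLS token:
from PIL import Image

image = Image.open("example.jpg")        # any RGB image
cls_features = extract_features(image)   # shape [1, 768]

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    patch_features = model(**inputs).last_hidden_state[:, 1:]  # [1, num_patches, 768]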
MAE Pre-training (Simplified)¶
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAutoencoder(nn.Module):
    """Simplified MAE: encode visible patches, reconstruct pixels of masked ones."""

    def __init__(self, encoder, decoder, mask_ratio=0.75):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.mask_ratio = mask_ratio

    def random_masking(self, patches):
        batch, num_patches, dim = patches.shape
        num_keep = int(num_patches * (1 - self.mask_ratio))

        # Random permutation of patches per image
        noise = torch.rand(batch, num_patches, device=patches.device)
        ids_shuffle = torch.argsort(noise, dim=1)
        ids_restore = torch.argsort(ids_shuffle, dim=1)
        ids_keep = ids_shuffle[:, :num_keep]

        # Keep only the first num_keep patches of the shuffled order
        visible_patches = torch.gather(
            patches, 1,
            ids_keep.unsqueeze(-1).expand(-1, -1, dim)
        )

        # Binary mask in the original patch order: 1 = masked, 0 = visible
        mask = torch.ones(batch, num_patches, device=patches.device)
        mask[:, :num_keep] = 0
        mask = torch.gather(mask, 1, ids_restore)
        return visible_patches, ids_restore, mask

    def forward(self, images):
        patches = self.encoder.patchify(images)                   # [B, N, patch_dim]
        visible, ids_restore, mask = self.random_masking(patches)
        encoded = self.encoder(visible)                           # only 25% of patches
        decoded = self.decoder(encoded, ids_restore)              # full sequence restored
        # MSE on masked patches only, as described above
        loss_per_patch = F.mse_loss(decoded, patches, reduction="none").mean(dim=-1)
        loss = (loss_per_patch * mask).sum() / mask.sum()
        return loss
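A training-loop skeleton around the module above (the encoder/decoder objects and the dataloader are assumed to be defined elsewhere):
mae = MaskedAutoencoder(encoder, decoder, mask_ratio=0.75)
optimizer = torch.optim.AdamW(mae.parameters(), lr=1.5e-4, weight_decay=0.05)

for images, _ in dataloader:   # labels are ignored -- training is self-supervised
    loss = mae(images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()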
For Interviews¶
Q: "Compare DINOv2, MAE, and I-JEPA."¶
DINOv2: self-distillation (student-teacher with EMA), 142M images, SOTA frozen features (86.3% linear probe). MAE: masked autoencoding, 75% masking (only 25% of patches reach the encoder), efficient pre-training but poor frozen features (68.0% linear probe, needs fine-tuning). I-JEPA (LeCun): predicts representations rather than pixels, no decoder, better sample efficiency, 77.5% linear probe. Key trade-off: DINOv2 is best for frozen transfer, MAE for compute-efficient pre-training, I-JEPA for limited data.
Q: "Что такое V-JEPA 2?"¶
V-JEPA 2 extends I-JEPA to video: mask temporal regions, predict future frame embeddings. SOTA: Something-Something v2 75.3%, Kinetics-400 84.2%. V-JEPA 2-AC (action-conditioned): predict future state given robot action, trained on <62 hours robot data, zero-shot planning capability. Path toward human-level visual understanding по Yann LeCun.
Key Numbers¶
| Fact | Value |
|---|---|
| DINOv2 ImageNet linear probe | 86.3% |
| MAE optimal masking ratio | 75% |
| MAE ImageNet linear probe | 68.0% (ViT-B) |
| I-JEPA ImageNet linear probe | 77.5% |
| DINOv2 training data | 142M images |
| DINOv2-G params | 1.1B |
| MAE-ViT-H params | 632M |
| V-JEPA 2 Something-Something v2 | 75.3% |
| V-JEPA 2-AC robot training data | <62 hours |
Common Misconceptions¶
Misconception: MAE gives the best features for transfer learning
MAE trains efficiently and scales well, but its frozen features are weak: 68.0% linear probe for ViT-B vs 81.1% for DINOv2. MAE needs full fine-tuning to perform well. If you need frozen features (for example, as a backbone for downstream tasks without further training), DINOv2 is clearly the better choice.
Misconception: Self-supervised is always worse than supervised
DINOv2 with a linear probe (86.3%) beats a supervised ViT-B (77.9%) on ImageNet. Self-supervised models learn more general-purpose representations because they are not tied to a specific label set. This shows up most clearly on transfer tasks: segmentation (53.1 mIoU on ADE20K), depth estimation, medical imaging.
Misconception: I-JEPA and MAE are the same thing, just with different masking
The fundamental difference: MAE predicts pixels (pixel-space reconstruction), while I-JEPA predicts representations (latent-space prediction). I-JEPA needs no decoder, learns semantic features directly, and achieves a better linear probe (74.5% vs 68.0% for ViT-B). These are different philosophies: generative reconstruction (MAE) vs non-generative prediction in latent space (I-JEPA).
Interview Questions¶
You have a dataset of 10K unlabeled medical images. Which self-supervised model would you choose and why?
"Use MAE, it's the most popular" -- MAE requires fine-tuning, and with 10K images that risks overfitting.
Strong answer: The best choice is DINOv2 pre-trained on 142M natural images, used as a source of frozen features. DINOv2 transfers well to medical data without any further training. For 10K images: (1) extract DINOv2 features (CLS token + patch tokens), (2) train a lightweight head (linear probe or MLP) on the target task. If domain adaptation is needed, fine-tune with a small learning rate. I-JEPA is an alternative with better sample efficiency, but DINOv2 has the more mature ecosystem.
Explain why a 75% masking ratio is optimal for MAE rather than 50% or 90%.
"It was just found empirically" -- technically true, but shows no real understanding.
Strong answer: At 50% masking the task is too easy -- the model can reconstruct pixels by interpolating neighboring patches without learning any semantics. At 90% the task is too hard, and the model cannot even roughly reconstruct the image. 75% is the sweet spot: hard enough to force the model to learn high-level representations (objects, textures, context), yet solvable enough for stable training. As a bonus, the encoder processes only 25% of the patches, giving roughly a 4x speedup during pre-training.
When is V-JEPA 2 preferable to DINOv2?
"V-JEPA 2 is newer, so it must be better" -- shows no understanding of the trade-offs.
Strong answer: Choose V-JEPA 2 when temporal information is critical: video understanding (Something-Something v2: 75.3% vs frame-level DINOv2), action recognition, robotics. V-JEPA 2-AC can plan robot actions zero-shot after training on less than 62 hours of robot data. For static images DINOv2 remains the better choice: dense features for segmentation (53.1 mIoU on ADE20K), depth estimation, object detection. The choice depends on the data modality, not on which model is newer.
Sources¶
- Meta AI Blog -- "V-JEPA 2: Self-Supervised Video Models" (2025)
- arXiv -- "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning" (2506.09985)
- Meta AI -- DINOv2 technical report
- arXiv -- "Masked Autoencoders Are Scalable Vision Learners" (He et al., 2111.06377)
- arXiv -- "I-JEPA: Joint-Embedding Predictive Architecture" (LeCun et al., 2301.08243)
- ICML 2025 -- "DINO-WM: World Models on Pre-trained Visual Features"
- arXiv -- "Joint-Embedding vs Reconstruction for SSL" (2602.03604)
- OpenReview -- "DINO-Foresight: Looking into the Future with DINO"