Vision-Language модели (VLM)¶

~10 минут чтения

Предварительно: Vision-трансформеры (ViT) | LoRA и варианты файнтюнинга

Связанный файл: Мультимодальные модели: CLIP, SigLIP, LLaVA -- математика CLIP/SigLIP, детали обучения, сравнительные таблицы

VLM объединяют зрение и язык в одну модель: изображение разбивается на патчи, каждый патч становится токеном, и языковая модель генерирует текстовый ответ с визуальным пониманием. В 2025 году open-source VLM (Qwen2.5-VL-72B) достигли 70% на MMMU-Pro, обогнав GPT-4o (59.9%), а модели с 500M параметров запускаются на мобильных устройствах. Это означает, что для большинства production-задач проприетарные API больше не обязательны -- open-source покрывает 80%+ сценариев при полном контроле данных.

Ключевые концепции¶

Тренды 2025-2026¶

Any-to-any models -- input/output any modality (text, image, audio)
Reasoning models -- chain-of-thought для complex visual problems
Small yet capable -- <2B params running на consumer devices
Mixture-of-Experts -- 50-70% latency reduction
Vision-Language-Action (VLA) -- robotics control от VLMs

Why production teams deploy VLMs: open-source caught up (Qwen2.5-VL matches GPT-4o), efficient MoE architectures, QLoRA enables 70B fine-tuning on single A100.

1. VLM Architecture¶

Three-Stage Pipeline¶

graph LR
    A["Vision Encoder<br/>(ViT / CLIP / SigLIP)"] --> B["Projection Layer<br/>(2-Layer MLP)"]
    B --> C["LLM Decoder<br/>(Llama / GPT)"]
    A2["Image patches<br/>-> visual tokens"] -.-> A
    C --> C2["Text response<br/>с визуальным пониманием"]
    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style A2 fill:#f3e5f5,stroke:#9c27b0
    style C2 fill:#e8f5e9,stroke:#4caf50

Stage 1: Vision Encoders¶

Vision Transformers (ViT): split images into 16x16 pixel patches, each patch = token. 1024x1024 image = 4,096 visual tokens (~2,000-3,000 words equivalent).

Encoder	Training Data	Output	Used By
CLIP-ViT-L/14	400M image-text pairs	1,024-dim / 768-dim projected	LLaVA, many open-source
SigLIP	CLIP successor, sigmoid loss	Lower memory	Phi-4, DeepSeek-VL2, Kimi-VL
Custom encoders	Proprietary	Optimized	GPT-4o, Gemini, Claude

Stage 2: Projection Layer¶

Purpose: bridge vision encoder (e.g. 1,024-dim CLIP) -> LLM (4,096-dim Llama).

Implementation: 2-layer MLP. Training strategy: freeze vision encoder, train projector + top 2-4 LLM layers. Reduces compute costs by 90%.

Stage 3: Token Fusion¶

Modern hybrid approach: early fusion in first layers (visual grounding) + sparse cross-attention in deeper layers (efficiency).

Prompt ordering impact:

Order	Accuracy	Use Case
Question-before-image	+5-10%	Reasoning tasks
Image-before-question	Lower	Conversational interfaces

2. Model Categories¶

2.1 Any-to-Any Models¶

Model	Capabilities	Architecture
Qwen2.5-Omni	Text, image, audio in/out	Thinker-Talker architecture
MiniCPM-o 2.6	Vision, speech, language (8B)	Multimodal unified
Janus-Pro-7B	Understanding + generation	Decoupled visual encoding

2.2 Reasoning Models¶

Model	Size	Key Features
QVQ-72B-preview	72B	First open-source multimodal reasoning
Kimi-VL-A3B-Thinking	2.8B active / 16B total	MoE decoder, long CoT, SigLIP-so-400M

2.3 Small Yet Capable (<2B)¶

Model	Size	Context	Key Features
SmolVLM-256M	256M	-	Minimal viable
SmolVLM-500M	500M	-	Video understanding sweet spot
SmolVLM-2.2B	2.2B	-	Full capabilities
gemma-3-4b-it	4B	128K	140+ languages
Qwen2.5-VL-3B	3B	32K	Localization, document understanding
Phi-4-Multimodal	1-10B	-	Edge, sub-100ms

2.4 MoE Decoders¶

Model	Total	Active	Feature
Kimi-VL	16B	2.8B	Most advanced open reasoning
DeepSeek-VL2	MoE	-	50-70% latency reduction

Advantages: faster inference, quick convergence. Trade-off: higher memory (entire model on GPU).

2.5 Vision-Language-Action (VLA)¶

Input: images + text instructions. Output: robot action tokens (joint positions, gripper states).

Model	Purpose
pi0 / pi0-FAST	First robotics foundation models (Physical Intelligence)
GR00T N1	NVIDIA robotics foundation model

Current success rate: 60-80%.

3. Leaderboard 2026¶

Open-Source¶

Rank	Model	Size	Best For
1	Gemma 3 (largest)	-	Chatbot Arena #1
2	Qwen2.5-VL-72B	72B	Best open performance (MMMU-Pro 70%)
3	InternVL3-78B	78B	Expert-level reasoning (MMMU-Pro 70%)
4	DeepSeek-VL2	MoE	Efficiency
5	Kimi-VL-A3B	2.8B active	Reasoning, 128K context
6	Janus-Pro-7B	7B	Generation + understanding
7	SmolVLM-500M	500M	Mobile/edge

Proprietary¶

Model	Context	Best For
Gemini 2.5 Pro	1M+	Best overall multimodal
GPT-4.1	128K	Production reliability
Claude 3.5 Sonnet	200K	Vision + reasoning

MMMU-Pro Scores¶

Model	Score
Human experts	88.6%
Qwen2.5-VL-72B	70%
InternVL3-78B	70%
GPT-4o	59.9%

4. Applications¶

Visual Question Answering¶

Domain-specific: medical (X-ray analysis), manufacturing (defect identification), financial (chart trend analysis).

Document Understanding¶

VLMs read layout and structure, not just characters. Insurance (policy extraction), legal (contract parsing), finance (nested tables with footnotes).

Semantic Search¶

Replaces metadata tagging: "black leather jacket with asymmetric zipper" -> visual match. Uses FAISS for similarity matching in shared embedding space.

Multimodal RAG¶

graph LR
    Q["Query<br/>(text/image)"] --> R["Multimodal<br/>Retriever"]
    R --> RR["Re-ranker"]
    RR --> D["Retrieved docs<br/>+ images"]
    D --> L["Multimodal LLM"]
    L --> O["Response"]
    style Q fill:#f3e5f5,stroke:#9c27b0
    style R fill:#e8eaf6,stroke:#3f51b5
    style RR fill:#fff3e0,stroke:#ef6c00
    style D fill:#e8eaf6,stroke:#3f51b5
    style L fill:#e8f5e9,stroke:#4caf50
    style O fill:#e8f5e9,stroke:#4caf50

25-40% improvement in retrieval accuracy vs text-only RAG.

Video Understanding¶

Long video understanding, temporal reasoning, summarization, action recognition. Kimi-VL (long videos), Gemini 1.5 Pro (1 hour+).

5. Fine-Tuning¶

Dataset Requirements¶

Task Complexity	Examples Needed
Simple tasks	500-1,000
Complex reasoning	5,000-50,000
Production quality	10,000-100,000

Best practice: mix 70% task-specific + 30% general data (prevent catastrophic forgetting).

LoRA для VLMs¶

Freeze pretrained weights, add trainable matrices (rank 16 or 64). Enables 70B fine-tuning on single A100. Compute cost: \(100-\)5,000. Annotation: \(0.10-\)2.00 per image-text pair.

QLoRA для VLMs¶

4-bit quantization + LoRA. Memory reduction 75%. 70B trainable on consumer GPUs. Trade-off: 1-3% accuracy loss.

6. Evaluation & Limitations¶

Benchmarks¶

Benchmark	Description
MMMU-Pro	12,700 expert-level questions
VHELM	9 aspects (perception, reasoning, bias)
Video-MME	900 videos, 5 difficulty levels
LongVideoBench	5-20+ minute videos
MMT-Bench	Perception, reasoning, action, creativity

Metrics¶

Metric	Target
Image-text alignment	Cosine similarity >0.8
Hallucination rate	10-30% typical (complex scenes)
Spatial reasoning	50-60% (current limitation)

Hallucination Mitigation¶

Approach	Latency Impact	Reduction
DPO with preference data	Training only	High
Object detection validation	+100-200ms	30-50%
Constrained decoding	Variable	Moderate
Prompt engineering	None	Low

Context Window Bottlenecks¶

1024x1024 image = 4,096 tokens. Three high-res images + conversation hits 16K limits.

Solution	Token Reduction
Dynamic resolution (Qwen2.5-VL)	Variable
Adaptive tiling (DeepSeek-VL2)	Large
Key frame selection	50-80%
Lower resolution (512x512)	~75%

7. Deployment¶

Edge (Apple Silicon)¶

import mlx_vlm
model = mlx_vlm.load("HuggingfaceTB/SmolVLM-500M-Instruct")
response = model.generate(image, "Describe this image")

Server (vLLM)¶

from vllm import LLM
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct")
outputs = llm.generate(prompts, images)

TGI / llama.cpp¶

# TGI
docker run -p 8080:80 ghcr.io/huggingface/text-generation-inference \
  --model-id Qwen/Qwen2.5-VL-7B-Instruct

# llama.cpp GGUF
llama-server -hf ggml-org/gemma-3-4b-it-GGUF

Model Selection¶

Priority	Recommended	Reason
Prototyping	LLaVA 1.6	Extensive docs
Production + fine-tuning	Qwen2.5-VL / InternVL3	Data privacy, custom
Low latency	DeepSeek-VL2	50-70% faster
Best accuracy	GPT-4o / Claude 3.5	No infra needed
Edge	Phi-4-Multimodal	Sub-100ms
Robotics	NVIDIA Groot N1	VLA

VLM vs Traditional CV¶

Aspect	Traditional CV	VLM
Task flexibility	Separate model per task	Single model, NL instructions
Training data	Task-specific labels	Image-text pairs
Adaptation	Retrain	Prompt engineering
Inference cost	Lower	Higher (token cost)
Best for	Fixed, simple tasks	Evolving, complex reasoning

Для интервью¶

Q: "Как устроена архитектура VLM?"¶

Три стадии: (1) Vision Encoder (ViT/SigLIP) -- splits image into 16x16 patches, каждый patch = token. 1024x1024 = 4,096 visual tokens. (2) Projection Layer -- 2-layer MLP, bridge vision dims (1024) to LLM dims (4096). Training: freeze encoder, train projector + top LLM layers (90% compute reduction). (3) LLM Decoder -- generates text response. Prompt ordering matters: question-before-image +5-10% accuracy.

Q: "Open-source vs proprietary VLMs в 2026?"¶

Open-source caught up: Qwen2.5-VL-72B и InternVL3-78B = 70% MMMU-Pro (vs GPT-4o 59.9%). MoE models (DeepSeek-VL2, Kimi-VL) дают 50-70% latency reduction. Small VLMs (SmolVLM-500M, Phi-4) run on edge. Proprietary advantage: Gemini 1.5 Pro (1M context), Claude 3.5 (complex layouts).

Q: "Как fine-tune VLM?"¶

LoRA (rank 16/64): freeze pretrained weights, add small trainable matrices. 70B on single A100. Cost: \(100-\)5,000. QLoRA: +4-bit quantization, 75% memory reduction, 1-3% accuracy loss. Dataset: 70% task-specific + 30% general (prevent catastrophic forgetting). Simple tasks: 500-1K examples, production: 10K-100K.

Q: "Ограничения VLM?"¶

(1) Hallucination: spatial reasoning 50-60%, even top models. Autoregressive LLMs generate plausible text regardless of image content. (2) Context bottleneck: 1024x1024 image = 4,096 tokens. Three images + conversation hits 16K. Solutions: dynamic resolution, adaptive tiling, key frame selection. (3) Token cost: одно изображение = 2-3K слов текста.

Q: "Vision-Language-Action models?"¶

VLM + action decoder для robot control. Input: images + text instructions. Output: joint positions, gripper states. Models: pi0 (Physical Intelligence), GR00T N1 (NVIDIA). Current success rate: 60-80%. Trend: unified model replaces perception + planning + control pipeline.

Ключевые числа¶

Факт	Значение
1024x1024 image	4,096 visual tokens
Image = text equivalent	~2,000-3,000 words
Human MMMU-Pro	88.6%
Open-source MMMU-Pro (Qwen/InternVL)	70%
GPT-4o MMMU-Pro	59.9%
Hallucination rate (complex scenes)	10-30%
Spatial reasoning accuracy	50-60%
MoE latency reduction	50-70%
VLA robot success rate	60-80%
LoRA compute reduction	90%
QLoRA memory reduction	75%
QLoRA accuracy loss	1-3%

Типичные заблуждения¶

Заблуждение: VLM понимает изображение как человек

VLM не «видит» изображение -- она обрабатывает набор патч-токенов через проекционный слой. Пространственное понимание (spatial reasoning) остаётся на уровне 50-60%, а галлюцинации в сложных сценах достигают 10-30%. Модель генерирует правдоподобный текст даже когда «не уверена» в содержимом картинки.

Заблуждение: Open-source VLM сильно отстают от проприетарных

На MMMU-Pro 2025: Qwen2.5-VL-72B и InternVL3-78B набирают 70%, тогда как GPT-4o -- 59.9%. Open-source модели обогнали GPT-4o на академических бенчмарках. Проприетарные модели сохраняют преимущество в длинном контексте (Gemini 1M+) и сложных мульти-документных задачах.

Заблуждение: Больше пикселей = лучше результат

Изображение 1024x1024 = 4096 токенов. Три таких изображения + разговор легко пробивают лимит контекста 16K. На практике dynamic resolution (Qwen2.5-VL) и adaptive tiling (DeepSeek-VL2) дают лучшее соотношение качество/стоимость, чем наивное увеличение разрешения.

Вопросы для собеседования¶

Почему MoE-декодер выгоден для VLM? Назовите trade-off.

«MoE просто быстрее» -- нет конкретики.

MoE (DeepSeek-VL2, Kimi-VL) активирует только 2.8B из 16B параметров на каждый токен, давая 50-70% снижение latency при сохранении quality 16B-модели. Trade-off: вся модель (16B) должна быть в GPU-памяти, поэтому memory footprint не уменьшается. Также routing overhead может нивелировать выигрыш на коротких запросах.

Как порядок prompt (вопрос до/после изображения) влияет на accuracy?

«Порядок не важен, модель всё равно видит оба» -- это неверно.

Question-before-image даёт +5-10% accuracy на reasoning-задачах, потому что модель формирует «фокус внимания» до обработки визуальных токенов. Image-before-question лучше для conversational UI. Это доказано на LLaVA и Qwen2.5-VL бенчмарках.

Как предотвратить catastrophic forgetting при fine-tuning VLM?

«Просто обучить на новых данных с LoRA» -- это упрощение.

Best practice: dataset = 70% task-specific + 30% general data. LoRA (rank 16/64) замораживает pretrained weights и добавляет trainable matrices. QLoRA добавляет 4-bit quantization (75% memory reduction, 1-3% accuracy loss). Для production-качества нужно 10K-100K примеров. Без general data mixing модель теряет zero-shot capabilities.

VLM vs Traditional CV pipeline -- когда что выбрать?

«VLM всегда лучше потому что универсальнее» -- неверно.

Traditional CV (YOLO, ResNet) лучше для фиксированных простых задач: дешевле, быстрее, детерминированнее. VLM выигрывает когда задача эволюционирует, нужно NL-интерфейс, или сложный reasoning. Например: defect detection на конвейере -- CV. Парсинг произвольных документов с таблицами и сносками -- VLM. Token cost VLM на порядок выше.

Источники¶

Hugging Face Blog -- "Vision Language Models (Better, faster, stronger)" (May 2025)
Label Your Data -- "VLM: How Vision-Language Models Work (2026 Guide)"
BentoML -- "Multimodal AI: Open-Source Vision Language Models in 2026"
Zylos AI -- "LLM Fine-tuning Techniques 2026"
arXiv:2409.02813 -- MMMU-Pro benchmark
arXiv:2010.11929 -- "An Image is Worth 16x16 Words" (ViT, Dosovitskiy et al.)
DeepSeek AI -- Janus-Pro-7B release
Qwen Team -- Qwen2.5-Omni, QVQ-72B-preview
Moonshot AI -- Kimi-VL-A3B