Embedding Model Comparison¶
~4 minute read
Prerequisites: RAG Architectures | LLM Evaluation Benchmarks
The choice of embedding model determines retrieval quality in a RAG pipeline: the gap between a leader (Cohere embed-v4, MTEB 65.2) and an average model (MTEB ~57) translates to +15-20% Recall@10 in production. Meanwhile, prices differ by 72x: Gemini at $0.0025/M tokens vs Voyage at $0.18/M -- $250 vs $18,000 for 100M tokens/month. BGE-M3 (MIT license, self-hosted, MTEB 63.0) covers 80% of use cases for free. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking adds +5-10% nDCG@10 over pure vector search.
Key Concepts¶
Embeddings -- dense vectors that encode text semantics for similarity search and RAG.
Similarity Metrics¶
| Metric | Formula | Use Case |
|---|---|---|
| Cosine | \(\frac{A \cdot B}{\|A\| \cdot \|B\|}\) | Most common |
| Dot Product | \(A \cdot B\) | Normalized vectors |
| Euclidean | \(\|A - B\|\) | Clustering |
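For unit-normalized embeddings, cosine and dot product produce the same ranking, which is why many vector DBs default to dot product. A minimal NumPy sketch (the vectors are placeholders):

```python
import numpy as np

a = np.array([0.2, 0.7, 0.1])  # placeholder embedding vectors
b = np.array([0.3, 0.6, 0.2])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # scale-invariant angle
dot = a @ b                                               # fastest; assumes normalized inputs
euclidean = np.linalg.norm(a - b)                         # distance: lower = more similar

print(f"cosine={cosine:.3f}  dot={dot:.3f}  euclidean={euclidean:.3f}")
```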
Retrieval Metrics¶
Retrieval quality is reported mainly as nDCG@10 (rank-weighted relevance of the top 10 results) and Recall@k (the fraction of relevant documents that appear in the top k); both recur throughout this page.
1. MTEB Benchmark¶
Massive Text Embedding Benchmark -- the de facto standard for evaluating embedding models.
8 Task Categories¶
| Task | Metric | What It Tests |
|---|---|---|
| Retrieval | nDCG@10 | Query -> document matching |
| STS | Spearman | Semantic similarity |
| Classification | Accuracy/F1 | Semantic boundaries |
| Clustering | V-measure | Unsupervised grouping |
| Reranking | MAP | Fine-grained relevance |
| Pair Classification | AP | Duplicate/paraphrase detection |
| Summarization | Spearman | Abstraction recognition |
| Bitext Mining | F1 | Cross-lingual alignment |
Covers 1000+ languages and 58 English datasets; the leaderboard is hosted on Hugging Face Spaces.
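As an illustration, the `mteb` package scores any sentence-transformers model on a chosen task subset; the model and task below are arbitrary examples, and the exact API varies between `mteb` releases:

```python
# Hedged sketch: evaluate one model on one MTEB task.
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")   # example checkpoint
evaluation = MTEB(tasks=["Banking77Classification"])  # one task; a full run spans 8 categories
results = evaluation.run(model, output_folder="mteb_results")
```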
Leaderboard (2026)¶
| Rank | Model | MTEB Score | Type |
|---|---|---|---|
| 1 | Gemini embedding | n/a | Commercial |
| 2 | Cohere embed-v4 | 65.2 | Commercial |
| 3 | OpenAI text-3-large | 64.6 | Commercial |
| 4 | BGE-M3 | 63.0 | Open-source |
| 5 | E5-large-v2 | ~62 | Open-source |
| 6 | Voyage-3-large | 60.5 | Commercial |
| 7 | Nomic Embed | ~60 | Open-source |
| 8 | Jina-embeddings-v3 | ~57 | Open-source |
MTEB Score Tiers¶
| Tier | Score | Examples |
|---|---|---|
| Excellent | 64-70 | Cohere v4, OpenAI large, Gemini |
| Good | 60-64 | BGE-M3, Voyage, E5 |
| Average | 55-60 | Older models, Jina v3 |
| Poor | <55 | Basic embeddings |
Emerging Benchmarks¶
- RTEB (Retrieval Embedding Benchmark) -- retrieval-focused, more relevant for RAG
- BEIR 2.0 (Jan 2026) -- updated nDCG@10, more diverse evaluation sets
2. Commercial Models¶
OpenAI text-embedding-3¶
| Variant | Dimensions | MTEB | Price/M tokens | Max Tokens |
|---|---|---|---|---|
| small | 1536 (512 via MRL) | ~62 | $0.02 | 8191 |
| large | 3072 | 64.6 | $0.13 | 8191 |
Matryoshka Representation Learning (MRL): dimensions can be truncated (3072 -> 1536 -> 768 -> 256) without retraining -- storage optimization without a quality collapse.
Best for: RAG production, cost-performance balance.
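A sketch of what MRL truncation means in practice: slice the leading dimensions and re-normalize, and cosine search still works. (OpenAI's embeddings API also exposes a `dimensions` parameter that does this server-side.)

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components of an MRL-trained embedding and
    re-normalize so cosine/dot-product search remains valid."""
    v = embedding[:dims]
    return v / np.linalg.norm(v)

full = np.random.randn(3072)     # stand-in for a text-3-large vector
full /= np.linalg.norm(full)
short = truncate_mrl(full, 256)  # 12x less storage, usable for coarse retrieval
```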
Cohere embed-v4¶
| Aspect | Details |
|---|---|
| Dimensions | 1024 |
| MTEB | 65.2 (leader Nov 2025) |
| Multimodal | Text + Image |
| Price | $0.10/M tokens |
Enterprise features: handles noisy data, compression-ready, works with Cohere Reranker.
Best for: Multimodal RAG, enterprise deployments.
Voyage AI¶
| Model | Dimensions | MTEB | Price | Max Tokens |
|---|---|---|---|---|
| voyage-3-large | 1024 | 60.5 | $0.18/M | 32000 |
| voyage-3 | 1024 | ~58 | $0.06/M | 32000 |
| voyage-code-3 | 1024 | ~59 | $0.18/M | 32000 |
Long context (32K), code-specialized model.
Best for: Long documents, code search.
Google (Gemini / Vertex AI)¶
| Model | Dimensions | Price |
|---|---|---|
| text-embedding-005 | 768 | $0.0025/M |
| Gecko (256d) | 256 | Research |
Cheapest option. Gecko (256d) outperforms many 768d models on MTEB. Native GCP integration.
Best for: Budget deployments, GCP ecosystem.
3. Open-Source Models¶
BGE-M3¶
| Aspect | Details |
|---|---|
| Parameters | 568M |
| MTEB | 63.0 |
| Languages | 100+ |
| Max context | 8,192 tokens |
| Retrieval types | Dense, multi-vector (ColBERT), sparse |
| Latency | <30ms query time |
| License | MIT |
A single model produces multiple embedding types -- eliminating the need for separate dense and sparse models.
Best for: Self-hosted RAG, multilingual, privacy-critical.
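A hedged sketch using the FlagEmbedding package (BGE-M3's reference implementation); the `encode` flags below follow its documented interface but may shift between releases:

```python
# pip install FlagEmbedding
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["How do I reset error code NX-4521?"],
    return_dense=True,         # one 1024d vector per text -> ANN index
    return_sparse=True,        # lexical token weights -> sparse/BM25-style index
    return_colbert_vecs=True,  # per-token vectors -> late-interaction reranking
)
dense, sparse, colbert = out["dense_vecs"], out["lexical_weights"], out["colbert_vecs"]
```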
GTE (General Text Embeddings)¶
| Aspect | Details |
|---|---|
| Parameters | 305M (multilingual-base) |
| Languages | 70+ |
| Inference speed | ~10x faster than comparable models (reported) |
Best for: High-throughput production, Asian language retrieval.
E5 Family¶
| Model | Params | Quality | Notes |
|---|---|---|---|
| E5-small | 118M | Top-5 accuracy ~100% (reported) | 14x faster than 8B models |
| E5-large-v2 | 560M | MTEB ~62 | Baseline speed |
| E5-Mistral-7B | 7B | High | 4,096-token context |
Training: 270M text pairs via weakly supervised contrastive learning.
Best for: Speed-critical production, budget self-hosting.
Jina-embeddings-v4¶
| Aspect | Details |
|---|---|
| Base model | Qwen2.5-VL-3B-Instruct |
| Dimensions | 2048 (dense), multi-vector |
| Languages | 30+ |
| Modalities | Text, images, visual docs |
| License | CC-BY-NC-4.0 |
Best for: Multimodal open-source, visual document search.
Nomic Embed¶
| Aspect | Details |
|---|---|
| Dimensions | 768 |
| MTEB | ~60 |
| Max context | 8,192 tokens |
| License | Apache 2.0 |
Fully reproducible: open training data and code.
Best for: Fully open/reproducible research.
4. Technical Architecture¶
Dimensionality Trade-offs¶
| Dimensions | Storage/1M docs (FP32) | Recall | Speed |
|---|---|---|---|
| 256 | 1 GB | Low | Fastest |
| 512 | 2 GB | Medium | Fast |
| 768 | 3 GB | Good | Medium |
| 1024 | 4 GB | Better | Medium |
| 3072 | 12 GB | Best | Slower |
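The storage column is just FP32 arithmetic (4 bytes per dimension); a quick check:

```python
# Storage for 1M documents at FP32 (4 bytes/dim), matching the table above.
N_DOCS = 1_000_000
for dims in (256, 512, 768, 1024, 3072):
    gb = N_DOCS * dims * 4 / 1e9
    print(f"{dims:>4}d: {gb:>5.1f} GB")  # 256d -> 1.0 GB ... 3072d -> 12.3 GB
```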
Quantization¶
| Precision | Storage Reduction | Quality Loss |
|---|---|---|
| FP32 -> FP16 | 50% | ~0% |
| FP32 -> INT8 | 75% | ~1-2% |
| FP32 -> Binary | 96% | 5-10% |
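A minimal binary-quantization sketch: keep only sign bits (32x smaller) and rank by Hamming distance; production systems typically rescore the top candidates with full-precision vectors to recover most of the 5-10% quality loss:

```python
import numpy as np

def to_binary(vecs: np.ndarray) -> np.ndarray:
    """Sign-quantize FP32 embeddings to packed bits (32x storage reduction)."""
    return np.packbits(vecs > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hamming distance between packed binary codes; lower = more similar."""
    return np.unpackbits(a ^ b, axis=-1).sum(axis=-1)

docs = np.random.randn(10_000, 1024).astype(np.float32)
codes = to_binary(docs)                                         # 1024 bits = 128 bytes/doc
query = to_binary(np.random.randn(1, 1024).astype(np.float32))
top10 = np.argsort(hamming(query, codes))[:10]                  # coarse candidates to rescore
```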
5. Semantic Search Architecture¶
```mermaid
graph LR
A["Query"] --> B["Embedding"]
B --> C["Vector Search"]
C --> D["Rerank"]
D --> E["Metadata Filter"]
E --> F["Results"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#f3e5f5,stroke:#9c27b0
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#e8eaf6,stroke:#3f51b5
style F fill:#e8f5e9,stroke:#4caf50
```
Hybrid Search¶
Combines lexical (BM25, exact matches) + semantic (vectors, conceptual similarity).
Best practice: 70-80% semantic + 20-30% lexical weighting.
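A sketch of weighted fusion under the 70/30 split; it assumes BM25 and cosine scores for the same candidate set and min-max normalizes both before mixing (reciprocal rank fusion is a common alternative):

```python
import numpy as np

def hybrid_scores(semantic: np.ndarray, bm25: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    """Fuse semantic and lexical scores; alpha = semantic weight (0.7-0.8 typical)."""
    def minmax(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * minmax(semantic) + (1 - alpha) * minmax(bm25)

sem = np.array([0.82, 0.75, 0.40])             # placeholder cosine similarities
lex = np.array([1.2, 8.4, 3.1])                # placeholder raw BM25 scores
ranked = np.argsort(-hybrid_scores(sem, lex))  # best candidates first
```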
Reranking¶
| Stage | Purpose | Latency |
|---|---|---|
| Initial retrieval | Get candidates (top-100) | Fast |
| Cross-encoder reranking | Precision refinement (top-10) | 10-50ms |
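A rerank sketch with sentence-transformers' CrossEncoder; the checkpoint is one common public example:

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

query = "how to rotate API keys safely"
candidates = ["Rotate keys every 90 days...", "API rate limits...", "Key rotation runbook..."]

# Each (query, doc) pair is scored jointly -- slower than a bi-encoder, more precise.
scores = reranker.predict([(query, doc) for doc in candidates])
top10 = sorted(zip(scores, candidates), reverse=True)[:10]
```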
6. Multimodal Embeddings¶
| Model | Modalities | Notes |
|---|---|---|
| CLIP | Image + Text | Foundation model |
| ImageBind | Image, Text, Audio, Depth, Thermal | Universal space |
| BGE-VL | Image + Text | State-of-the-art visual search |
| Jina v4 | Text, Images, Visual docs | Multimodal + multilingual |
| Cohere embed-v4 | Text + Image | Enterprise multimodal |
Use cases: visual product search, cross-modal retrieval, document understanding.
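A CLIP-style cross-modal sketch; sentence-transformers ships a CLIP wrapper, and the checkpoint name here is one public example:

```python
# pip install sentence-transformers pillow
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # example CLIP checkpoint

img_emb = model.encode(Image.open("product.jpg"))          # image -> shared space
txt_emb = model.encode(["red running shoes", "blue mug"])  # text -> same space

print(util.cos_sim(img_emb, txt_emb))  # cross-modal similarity scores
```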
7. Selection Guide¶
```mermaid
graph TD
A["Need multimodal?"] -->|YES| B["Cohere embed-v4 / Jina v4"]
A -->|NO| C["Cheapest option?"]
C -->|YES| D["Gemini embedding<br/>($0.0025/M)"]
C -->|NO| E["Best open-source?"]
E -->|YES| F["BGE-M3"]
E -->|NO| G["Long context >8K?"]
G -->|YES| H["Voyage-3-large<br/>(32K)"]
G -->|NO| I["Code embeddings?"]
I -->|YES| J["Voyage-code-3"]
I -->|NO| K["OpenAI text-3-large"]
style A fill:#fff3e0,stroke:#ef6c00
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#e8f5e9,stroke:#4caf50
style I fill:#fff3e0,stroke:#ef6c00
style J fill:#e8f5e9,stroke:#4caf50
style K fill:#e8f5e9,stroke:#4caf50
```
Use Case Matrix¶
| Use Case | Recommended | Reason |
|---|---|---|
| RAG production | OpenAI text-3-large | Best balance |
| Multimodal RAG | Cohere embed-v4 | Text + Image |
| Budget | Gemini embedding | Cheapest |
| Privacy | BGE-M3 self-host | Full control |
| Long documents | Voyage-3-large | 32K context |
| Code search | Voyage-code-3 | Optimized |
| Multilingual | BGE-M3 or GTE | 70-100+ languages |
| High throughput | GTE / E5-small | 10-14x faster |
8. Pricing¶
| Model | Price/1M tokens | 10M/mo | 100M/mo |
|---|---|---|---|
| Gemini embedding | $0.0025 | $25 | $250 |
| BGE-M3 (self-host) | ~$0.01 (est. compute cost) | ~$100 | ~$1,000 |
| OpenAI text-3-small | $0.02 | $200 | $2,000 |
| Cohere embed-v4 | $0.10 | $1,000 | $10,000 |
| OpenAI text-3-large | $0.13 | $1,300 | $13,000 |
| Voyage-3-large | $0.18 | $1,800 | $18,000 |
For the Interview¶
Q: "How do you choose an embedding model for RAG?"¶
Depends on: (1) Quality vs cost -- MTEB scores 60-65 for top models; Gemini is cheapest ($0.0025/M), OpenAI has the best cost-performance ($0.13/M, 64.6 MTEB). (2) Open-source vs API -- BGE-M3 (63.0 MTEB, MIT, 100+ languages, dense+sparse+ColBERT) is the best open-source option. (3) Context length -- standard is 8K; Voyage offers 32K. (4) Multimodal -- Cohere embed-v4 or Jina v4. (5) Latency -- E5-small 16ms, BGE-M3 <30ms for real-time RAG. Use hybrid search (70-80% semantic + 20-30% lexical) with cross-encoder reranking.
Q: "What is Matryoshka Representation Learning?"¶
MRL trains embeddings to work at multiple dimensions simultaneously. OpenAI text-3-large: 3072d -> 1536 -> 768 -> 256 via truncation, without retraining. Enables storage optimization (3072d = 12GB/1M docs -> 256d = 1GB) with minimal quality loss. Trade-off: compression sacrifices nuanced details for general topic understanding.
Q: "Compare MTEB vs RTEB benchmarks."¶
MTEB -- standard general benchmark: 8 task categories (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining), 58 English datasets, 1000+ languages. RTEB -- emerging retrieval-specific benchmark, more relevant for RAG. MTEB scores 60-65 for top models; task-specific performance varies (Cohere best retrieval 67.5, OpenAI best STS 85.2, E5 best classification 78.3).
Key Numbers¶
| Fact | Value |
|---|---|
| Top MTEB score (Cohere v4) | 65.2 |
| OpenAI large MTEB | 64.6 |
| BGE-M3 MTEB | 63.0 |
| BGE-M3 query latency | <30ms |
| E5-small latency | 16ms |
| GTE inference speed | ~10x faster than comparable models (reported) |
| OpenAI small price | $0.02/M tokens |
| Gemini price | $0.0025/M tokens |
| Hybrid search weighting | 70-80% semantic + 20-30% lexical |
| Retrieval Recall@10 | 80-95% |
| STS Correlation (top models) | 0.75-0.90 |
| 1B vectors (1024d) storage | 4 TB |
| FP32->INT8 quality loss | ~1-2% |
Common Misconceptions¶
Misconception: more dimensions always mean higher quality
Google Gecko at 256 dimensions beats many 768d models on MTEB. Matryoshka learning (OpenAI text-3-large) lets you truncate 3072d -> 256d at a cost of only 2-5% Recall, while storage drops from 12 GB/1M docs to 1 GB. Dimension count is a storage/latency vs nuance trade-off, not "bigger = better".
Misconception: MTEB score is the main model-selection criterion
MTEB is a score averaged over 8 task categories (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining). A model scoring MTEB 65 can lose to a model scoring 60 on your specific retrieval task. Cohere embed-v4 leads overall MTEB, yet Voyage-code-3 is better at code retrieval. Always benchmark on your own data -- domain-specific evaluation matters more than MTEB.
Misconception: vector search is sufficient without BM25
Pure vector search misses exact keyword matches: the query 'error code NX-4521' won't find the document containing that code if the embedding model never saw it in training data. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking adds +5-10% nDCG@10. BGE-M3 handles this natively -- a single model produces dense, sparse, and ColBERT embeddings.
Interview Questions¶
Q: How do you choose an embedding model for RAG production?
Red flag: "Just take OpenAI text-3-large, it has the best MTEB"
Strong answer: "Five criteria: (1) Quality vs cost -- MTEB 60-65 for top models, but evaluate on your own data. (2) API vs self-host -- BGE-M3 (MIT, 63.0 MTEB, 100+ languages, dense+sparse+ColBERT) is the best open-source option. (3) Pricing -- Gemini $0.0025/M vs OpenAI $0.13/M vs self-hosted BGE-M3 at ~$0.01/M. (4) Context length -- the standard is 8K; Voyage offers 32K for long documents. (5) Multimodal -- Cohere embed-v4 or Jina v4 for text+image. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking is a must."
Q: What is Matryoshka Representation Learning and why does it matter?
Red flag: "It's a way of compressing embeddings"
Strong answer: "MRL trains embeddings to work at several dimensionalities simultaneously. OpenAI text-3-large: 3072d -> 1536 -> 768 -> 256 by simple truncation, no retraining. In practice: 3072d = 12 GB/1M docs -> 256d = 1 GB, at a 2-5% Recall loss. Trade-off: smaller dimensions lose nuanced details but keep topic-level understanding. Used for tiered storage: full dimensions for top candidates, truncated for initial retrieval."
Q: Compare dense vs sparse vs ColBERT retrieval.
Red flag: "Dense is better than sparse; sparse is obsolete"
Strong answer: "Dense (a single vector per document) is good for semantics but misses exact matches. Sparse (BM25-like) catches keywords but is weak on paraphrases. ColBERT (multi-vector, token-level) gives the best precision but needs 10-100x more storage. BGE-M3 produces all three from one model. Production: hybrid = dense + sparse, reranking = cross-encoder. nDCG@10: dense 0.45, sparse 0.40, hybrid 0.50, hybrid+rerank 0.55 (typical figures on BEIR)."
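ColBERT's late interaction reduces to MaxSim: each query token takes its best-matching document token, and the per-token maxima are summed. A NumPy sketch with unit-normalized token vectors:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT late-interaction score: for each query token, take the max
    similarity over document tokens, then sum. Shapes: (n_q, d), (n_d, d)."""
    sim = query_vecs @ doc_vecs.T        # token-level cosine matrix (n_q, n_d)
    return float(sim.max(axis=1).sum())  # best doc token per query token

q = np.random.randn(8, 128);   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = np.random.randn(200, 128); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))  # one vector per token is why storage is 10-100x dense retrieval
```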
Sources¶
- AIMultiple -- "Embedding Models: OpenAI vs Gemini vs Cohere in 2026"
- Zylos Research -- "Embedding Models and Semantic Search 2026"
- Hugging Face -- MTEB Leaderboard
- Cohere -- "Embed 4 Blog"
- Milvus -- "We Benchmarked 20+ Embedding APIs"
- arXiv 2403.20327 -- "Gecko: Versatile Text Embeddings Distilled from LLMs"
- arXiv 2305.05665 -- "ImageBind: One Embedding Space To Bind Them All"
- arXiv 2210.07316 -- MTEB Benchmark Paper (Muennighoff et al., 2022)
- BEIR Benchmark 2.0 (Jan 2026)
- AgentSet Embedding Leaderboard
See Also¶
- Vector DB Comparison -- Pinecone vs Qdrant vs Weaviate
- RAG System Design -- embedding models in the RAG pipeline
- Knowledge Distillation -- distilling embedding models