
Embedding Model Comparison

~4 min read

Prerequisites: RAG Architectures | LLM Evaluation Benchmarks

The choice of embedding model determines retrieval quality in a RAG pipeline: the gap between the leader (Cohere embed-v4, MTEB 65.2) and an average model (MTEB ~57) is worth +15-20% Recall@10 in production. Prices, meanwhile, differ by 72x: Gemini at $0.0025/M tokens vs Voyage at $0.18/M; at 100M tokens/month that is $250 vs $18,000. BGE-M3 (MIT license, self-hosted, MTEB 63.0) covers 80% of use cases for free. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking adds +5-10% nDCG@10 over pure vector search.


Key Concepts

Embeddings are dense vectors that encode text semantics for similarity search and RAG.

\[\text{Embedding: } \mathbb{S}^* \rightarrow \mathbb{R}^d\]

Similarity Metrics

| Metric | Formula | Use Case |
|---|---|---|
| Cosine | \(\frac{A \cdot B}{\lVert A\rVert \, \lVert B\rVert}\) | Most common |
| Dot Product | \(A \cdot B\) | Normalized vectors |
| Euclidean | \(\lVert A - B\rVert\) | Clustering |
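
A minimal NumPy sketch of the three metrics (toy vectors, illustrative only):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # cos(A, B) = (A . B) / (||A|| * ||B||)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a: np.ndarray, b: np.ndarray) -> float:
    # equals cosine when both vectors are already L2-normalized
    return float(a @ b)

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    # ||A - B||, commonly used for clustering
    return float(np.linalg.norm(a - b))

a, b = np.array([0.1, 0.7, 0.2]), np.array([0.2, 0.6, 0.3])
print(cosine(a, b), dot(a, b), euclidean(a, b))
```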

Retrieval Metrics

\[\text{Recall}@k = \frac{|\text{relevant} \cap \text{top-}k|}{|\text{total relevant}|}\]
\[\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}\]
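
Both metrics are a few lines of code; a sketch with hypothetical document IDs:

```python
def recall_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    # |relevant ∩ top-k| / |total relevant|
    return len(relevant & set(ranked[:k])) / len(relevant)

def mrr(queries: list[tuple[set[str], list[str]]]) -> float:
    # mean reciprocal rank of the first relevant hit per query (0 if none found)
    total = 0.0
    for relevant, ranked in queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(queries)

print(recall_at_k({"d1", "d4"}, ["d3", "d1", "d2"], k=3))     # 0.5
print(mrr([({"d1"}, ["d3", "d1"]), ({"d2"}, ["d2", "d4"])]))  # (1/2 + 1) / 2 = 0.75
```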

1. MTEB Benchmark

The Massive Text Embedding Benchmark (MTEB) is the de facto standard for evaluating embedding models.

8 Task Categories

| Task | Metric | What It Tests |
|---|---|---|
| Retrieval | nDCG@10 | Query -> document matching |
| STS | Spearman | Semantic similarity |
| Classification | Accuracy/F1 | Semantic boundaries |
| Clustering | V-measure | Unsupervised grouping |
| Reranking | MAP | Fine-grained relevance |
| Pair Classification | AP | Duplicate/paraphrase detection |
| Summarization | Spearman | Abstraction recognition |
| Bitext Mining | F1 | Cross-lingual alignment |

1,000+ languages, 58 English datasets; leaderboard hosted on Hugging Face Spaces.

Leaderboard (2026)

| Rank | Model | MTEB Score | Type |
|---|---|---|---|
| 1 | Gemini embedding | #1 | Commercial |
| 2 | Cohere embed-v4 | 65.2 | Commercial |
| 3 | OpenAI text-3-large | 64.6 | Commercial |
| 4 | BGE-M3 | 63.0 | Open-source |
| 5 | E5-large-v2 | ~62 | Open-source |
| 6 | Voyage-3-large | 60.5 | Commercial |
| 7 | Nomic Embed | ~60 | Open-source |
| 8 | Jina-embeddings-v3 | ~57 | Open-source |

MTEB Score Tiers

| Tier | Score | Examples |
|---|---|---|
| Excellent | 64-70 | Cohere v4, OpenAI large, Gemini |
| Good | 60-64 | BGE-M3, Voyage, E5 |
| Average | 55-60 | Older models, Jina v3 |
| Poor | <55 | Basic embeddings |

Emerging Benchmarks

  • RTEB (Retrieval Embedding Benchmark) -- retrieval-focused and therefore more relevant for RAG
  • BEIR 2.0 (Jan 2026) -- updated nDCG@10, more diverse evaluation sets

2. Commercial Models

OpenAI text-embedding-3

| Variant | Dimensions | MTEB | Price/M tokens | Max Tokens |
|---|---|---|---|---|
| small | 1536 | ~62 | $0.02 | 8191 |
| large | 3072 | 64.6 | $0.13 | 8191 |

Matryoshka Representation Learning (MRL): truncatable dimensions (3072 -> 1536 -> 768 -> 256) without retraining. Storage optimization without quality collapse.
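
A sketch of how truncation works on the client side (the 3072d vector here is a random stand-in; the OpenAI API can also return reduced vectors directly via its `dimensions` parameter):

```python
import numpy as np

def truncate(embedding: np.ndarray, dim: int) -> np.ndarray:
    # MRL: the leading `dim` components already form a usable embedding;
    # re-normalize so cosine / dot-product scores remain comparable.
    v = embedding[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).standard_normal(3072)  # stand-in for a 3072d embedding
for dim in (3072, 1536, 768, 256):
    print(dim, truncate(full, dim).shape)
```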

Best for: RAG production, cost-performance balance.

Cohere embed-v4

| Aspect | Details |
|---|---|
| Dimensions | 1024 |
| MTEB | 65.2 (leader Nov 2025) |
| Multimodal | Text + Image |
| Price | $0.10/M tokens |

Enterprise features: handles noisy data, compression-ready, works with Cohere Reranker.

Best for: Multimodal RAG, enterprise deployments.

Voyage AI

| Model | Dimensions | MTEB | Price | Max Tokens |
|---|---|---|---|---|
| voyage-3-large | 1024 | 60.5 | $0.18/M | 32000 |
| voyage-3 | 1024 | ~58 | $0.06/M | 32000 |
| voyage-code-3 | 1024 | ~59 | $0.18/M | 32000 |

Long context (32K), code-specialized model.

Best for: Long documents, code search.

Google (Vertex AI / Gemini)

| Model | Dimensions | Price |
|---|---|---|
| text-embedding-005 | 768 | $0.0025/M |
| Gecko (256d) | 256 | Research |

Cheapest option. Gecko (256d) outperforms many 768d models. Native GCP integration.

Best for: Budget deployments, GCP ecosystem.


3. Open-Source Models

BGE-M3 (BAAI)

| Aspect | Details |
|---|---|
| Parameters | 568M |
| MTEB | 63.0 |
| Languages | 100+ |
| Max context | 8,192 tokens |
| Retrieval types | Dense, multi-vector (ColBERT), sparse |
| Latency | <30ms query time |
| License | MIT |

A single model produces multiple embedding types, eliminating the need for separate dense and sparse models.

Best for: Self-hosted RAG, multilingual, privacy-critical.
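
A usage sketch based on the FlagEmbedding library (`pip install FlagEmbedding`); exact argument and output key names may differ between library versions:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
out = model.encode(
    ["What is hybrid search?"],
    return_dense=True,         # one 1024d vector per text
    return_sparse=True,        # lexical token weights (BM25-like)
    return_colbert_vecs=True,  # per-token multi-vectors (ColBERT-style)
)
print(out["dense_vecs"].shape)    # (1, 1024)
print(out["lexical_weights"][0])  # {token_id: weight, ...}
```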

GTE (Alibaba)

| Aspect | Details |
|---|---|
| Parameters | 305M (multilingual-base) |
| Languages | 70+ |
| Inference speed | 10x faster than competitors |


Best for: High-throughput, Asian languages.

E5 (Microsoft)

| Model | Params | Quality | Notes |
|---|---|---|---|
| E5-small | 118M | Top-5 accuracy 100% | 14x faster than 8B models |
| E5-large-v2 | 560M | MTEB ~62 | Baseline speed |
| E5-Mistral-7B | 7B | High | 4,096-token context |

Training: 270M text pairs via weakly supervised contrastive learning.

Best for: Speed-critical production, budget self-hosting.

Jina embeddings v4

| Aspect | Details |
|---|---|
| Base model | Qwen2.5-VL-3B-Instruct |
| Dimensions | 2048 (dense), multi-vector |
| Languages | 30+ |
| Modalities | Text, images, visual docs |
| License | CC-BY-NC-4.0 |

Best for: Multimodal open-source, visual document search.

Nomic Embed

768d, MTEB ~60, 8,192-token context, Apache 2.0, fully reproducible, trained on open data.

Best for: Fully open/reproducible research.


4. Technical Architecture

Dimensionality Trade-offs

| Dimensions | Storage/1M docs (FP32) | Recall | Speed |
|---|---|---|---|
| 256 | 1 GB | Low | Fastest |
| 512 | 2 GB | Medium | Fast |
| 768 | 3 GB | Good | Medium |
| 1024 | 4 GB | Better | Medium |
| 3072 | 12 GB | Best | Slower |
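
The storage column is plain arithmetic: documents x dimensions x 4 bytes (FP32). A quick check:

```python
def storage_gb(n_docs: int, dims: int, bytes_per_dim: int = 4) -> float:
    # FP32 stores each dimension in 4 bytes
    return n_docs * dims * bytes_per_dim / 1e9

for dims in (256, 512, 768, 1024, 3072):
    print(f"{dims}d: {storage_gb(1_000_000, dims):.1f} GB")  # 1.0 / 2.0 / 3.1 / 4.1 / 12.3
```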

Quantization

| Precision | Storage Reduction | Quality Loss |
|---|---|---|
| FP32 -> FP16 | 50% | ~0% |
| FP32 -> INT8 | 75% | ~1-2% |
| FP32 -> Binary | 96% | 5-10% |
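
A minimal sketch of all three levels on random stand-in vectors (production systems usually calibrate INT8 scales per dimension or per vector):

```python
import numpy as np

vecs = np.random.default_rng(0).standard_normal((1000, 1024)).astype(np.float32)

fp16 = vecs.astype(np.float16)                 # 50% smaller, ~0% quality loss

scale = np.abs(vecs).max() / 127               # symmetric per-tensor INT8 scale
int8 = np.round(vecs / scale).astype(np.int8)  # 75% smaller, ~1-2% loss

binary = np.packbits(vecs > 0, axis=1)         # 1 bit per dimension, ~96% smaller
print(vecs.nbytes, fp16.nbytes, int8.nbytes, binary.nbytes)
```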

5. Semantic Search Architecture

graph LR
    A["Query"] --> B["Embedding"]
    B --> C["Vector Search"]
    C --> D["Rerank"]
    D --> E["Metadata Filter"]
    E --> F["Results"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#f3e5f5,stroke:#9c27b0
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#e8eaf6,stroke:#3f51b5
    style F fill:#e8f5e9,stroke:#4caf50

Hybrid Search

Hybrid search combines lexical retrieval (BM25, exact matches) with semantic retrieval (vectors, conceptual similarity).

Best practice: 70-80% semantic + 20-30% lexical weighting.
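
One common implementation is weighted score fusion after per-retriever normalization (a sketch; reciprocal rank fusion is a frequent alternative):

```python
import numpy as np

def hybrid_scores(semantic: np.ndarray, lexical: np.ndarray, alpha: float = 0.75) -> np.ndarray:
    # Min-max normalize each score list to [0, 1] so the weights are meaningful,
    # then blend: alpha * semantic + (1 - alpha) * lexical, with alpha in 0.7-0.8.
    def norm(x: np.ndarray) -> np.ndarray:
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    return alpha * norm(semantic) + (1 - alpha) * norm(lexical)

sem = np.array([0.82, 0.75, 0.40])  # cosine similarities (stand-in values)
bm25 = np.array([1.2, 7.8, 3.1])    # BM25 scores (stand-in values)
print(hybrid_scores(sem, bm25))     # blended score per candidate document
```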

Reranking

| Stage | Purpose | Latency |
|---|---|---|
| Initial retrieval | Get candidates (top-100) | Fast |
| Cross-encoder reranking | Precision refinement (top-10) | 10-50ms |
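
A reranking sketch with sentence-transformers (`pip install sentence-transformers`); the model name is one publicly available MS MARCO cross-encoder, and the query and documents are hypothetical:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "reset error code NX-4521"
candidates = [  # in production: top-100 from the initial hybrid retrieval
    "Troubleshooting guide for error code NX-4521",
    "General printer maintenance tips",
]

# A cross-encoder scores each (query, document) pair jointly: slower but more precise
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:10])
```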

6. Multimodal Embeddings

| Model | Modalities | Notes |
|---|---|---|
| CLIP | Image + Text | Foundation model |
| ImageBind | Image, Text, Audio, Depth, Thermal | Universal space |
| BGE-VL | Image + Text | State-of-the-art visual search |
| Jina v4 | Text, Images, Visual docs | Multimodal + multilingual |
| Cohere embed-v4 | Text + Image | Enterprise multimodal |

Use cases: visual product search, cross-modal retrieval, document understanding.


7. Selection Guide

graph TD
    A["Need multimodal?"] -->|YES| B["Cohere embed-v4 / Jina v4"]
    A -->|NO| C["Cheapest option?"]
    C -->|YES| D["Gemini embedding<br/>($0.0025/M)"]
    C -->|NO| E["Best open-source?"]
    E -->|YES| F["BGE-M3"]
    E -->|NO| G["Long context >8K?"]
    G -->|YES| H["Voyage-3-large<br/>(32K)"]
    G -->|NO| I["Code embeddings?"]
    I -->|YES| J["Voyage-code-3"]
    I -->|NO| K["OpenAI text-3-large"]

    style A fill:#fff3e0,stroke:#ef6c00
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fff3e0,stroke:#ef6c00
    style H fill:#e8f5e9,stroke:#4caf50
    style I fill:#fff3e0,stroke:#ef6c00
    style J fill:#e8f5e9,stroke:#4caf50
    style K fill:#e8f5e9,stroke:#4caf50

Use Case Matrix

| Use Case | Recommended | Reason |
|---|---|---|
| RAG production | OpenAI text-3-large | Best balance |
| Multimodal RAG | Cohere embed-v4 | Text + Image |
| Budget | Gemini embedding | Cheapest |
| Privacy | BGE-M3 self-host | Full control |
| Long documents | Voyage-3-large | 32K context |
| Code search | Voyage-code-3 | Optimized |
| Multilingual | BGE-M3 or GTE | 70-100+ languages |
| High throughput | GTE / E5-small | 10-14x faster |

8. Pricing

| Model | Price/1M tokens | 10M tokens/mo | 100M tokens/mo |
|---|---|---|---|
| Gemini embedding | $0.0025 | $25 | $250 |
| OpenAI text-3-small | $0.02 | $200 | $2,000 |
| BGE-M3 (self-host) | ~$0.01* | $100 | $1,000 |
| Cohere embed-v4 | $0.10 | $1,000 | $10,000 |
| OpenAI text-3-large | $0.13 | $1,300 | $13,000 |
| Voyage-3-large | $0.18 | $1,800 | $18,000 |

* Estimated self-hosting compute cost, not an API price.

For Interviews

Q: "How do you choose an embedding model for RAG?"

Depends on: (1) Quality vs cost -- MTEB scores 60-65 for top models; Gemini is cheapest ($0.0025/M), OpenAI best cost-performance ($0.13/M, 64.6 MTEB). (2) Open-source vs API -- BGE-M3 (63.0 MTEB, MIT, 100+ languages, dense+sparse+ColBERT) is the best open-source option. (3) Context length -- standard 8K, Voyage 32K. (4) Multimodal -- Cohere embed-v4 or Jina v4. (5) Latency -- E5-small 16ms, BGE-M3 <30ms for real-time RAG. Use hybrid search (70-80% semantic + 20-30% lexical) with cross-encoder reranking.

Q: "What is Matryoshka Representation Learning?"

MRL trains embeddings to work at multiple dimensions simultaneously. OpenAI text-3-large: 3072d -> 1536 -> 768 -> 256 via truncation, without retraining. Enables storage optimization (3072d = 12GB/1M docs -> 256d = 1GB) with minimal quality loss. Trade-off: compression sacrifices nuanced details for general topic understanding.

Q: "Compare MTEB vs RTEB benchmarks."

MTEB -- standard general benchmark: 8 task categories (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining), 58 English datasets, 1000+ languages. RTEB -- emerging retrieval-specific benchmark, more relevant for RAG. MTEB scores 60-65 for top models; task-specific performance varies (Cohere best retrieval 67.5, OpenAI best STS 85.2, E5 best classification 78.3).


Key Numbers

| Fact | Value |
|---|---|
| Top MTEB score (Cohere v4) | 65.2 |
| OpenAI large MTEB | 64.6 |
| BGE-M3 MTEB | 63.0 |
| BGE-M3 query latency | <30ms |
| E5-small latency | 16ms |
| GTE inference speed | 10x faster than competitors |
| OpenAI small price | $0.02/M tokens |
| Gemini price | $0.0025/M tokens |
| Hybrid search weighting | 70-80% semantic + 20-30% lexical |
| Retrieval Recall@10 | 80-95% |
| STS correlation (top models) | 0.75-0.90 |
| 1B vectors (1024d) storage | 4 TB |
| FP32 -> INT8 quality loss | ~1-2% |

Formulas

\[\text{cosine}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}\]
\[\text{Recall}@k = \frac{|\text{relevant} \cap \text{top-}k|}{|\text{total relevant}|}\]
\[\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}\]


Misconception: more dimensions always means better quality

Google Gecko at 256 dimensions beats many 768d models on MTEB. Matryoshka learning (OpenAI text-3-large) allows truncating 3072d -> 256d with only a 2-5% Recall loss, while storage drops from 12 GB/1M docs to 1 GB. Choosing dimensions is a storage/latency vs nuance trade-off, not "bigger = better".

Misconception: MTEB score is the main criterion for choosing a model

MTEB is a score averaged over 8 task categories (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining). A model with MTEB 65 can lose to a model with MTEB 60 on the specific task of retrieval. Cohere embed-v4 leads on MTEB, but Voyage-code-3 is better at code retrieval. Always benchmark on your own data: domain-specific evaluation matters more than MTEB.

Misconception: vector search is sufficient without BM25

Pure vector search misses exact keyword matches: the query 'error code NX-4521' will not retrieve a document containing that code if the embedding model never saw it in training data. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking adds +5-10% nDCG@10. BGE-M3 addresses this out of the box: a single model generates dense, sparse, and ColBERT embeddings.


Interview Questions

Q: How do you choose an embedding model for RAG production?

❌ Red flag: "Just use OpenAI text-3-large, it has the best MTEB"

✅ Strong answer: "Five criteria: (1) Quality vs cost -- MTEB 60-65 for top models, but evaluate on your own data. (2) API vs self-host -- BGE-M3 (MIT, 63.0 MTEB, 100+ languages, dense+sparse+ColBERT) is the best open-source option. (3) Pricing -- Gemini $0.0025/M vs OpenAI $0.13/M vs self-hosted BGE-M3 ~$0.01/M. (4) Context length -- 8K is standard; Voyage offers 32K for long documents. (5) Multimodal -- Cohere embed-v4 or Jina v4 for text+image. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking are mandatory."

Q: What is Matryoshka Representation Learning and why is it needed?

❌ Red flag: "It's a way to compress embeddings"

✅ Strong answer: "MRL trains embeddings to work at several dimensions simultaneously. OpenAI text-3-large: 3072d -> 1536 -> 768 -> 256 by simple truncation, with no retraining. In practice: 3072d = 12 GB/1M docs -> 256d = 1 GB, at a 2-5% Recall loss. Trade-off: smaller dimensions lose nuanced details but keep topic-level understanding. Used for tiered storage: full dimensions for top candidates, truncated ones for initial retrieval."

Q: Compare dense vs sparse vs ColBERT retrieval.

❌ Red flag: "Dense is better than sparse; sparse is obsolete"

✅ Strong answer: "Dense (single vector per document) is strong on semantics but misses exact matches. Sparse (BM25-like) catches keywords but is weak on paraphrases. ColBERT (multi-vector, token-level) gives the best precision but needs 10-100x more storage. BGE-M3 generates all three from a single model. Production: hybrid = dense + sparse, reranking = cross-encoder. nDCG@10: dense 0.45, sparse 0.40, hybrid 0.50, hybrid+rerank 0.55 (typical numbers on BEIR)."


Sources

  1. AIMultiple -- "Embedding Models: OpenAI vs Gemini vs Cohere in 2026"
  2. Zylos Research -- "Embedding Models and Semantic Search 2026"
  3. Hugging Face -- MTEB Leaderboard
  4. Cohere -- "Embed 4 Blog"
  5. Milvus -- "We Benchmarked 20+ Embedding APIs"
  6. arXiv 2403.20327 -- "Gecko: Versatile Text Embeddings Distilled from LLMs"
  7. arXiv 2305.05665 -- "ImageBind: One Embedding Space To Bind Them All"
  8. arXiv 2210.07316 -- MTEB Benchmark Paper (Muennighoff et al., 2022)
  9. BEIR Benchmark 2.0 (Jan 2026)
  10. AgentSet Embedding Leaderboard

See Also