Embedding Model Comparison¶
~4 minute read
Prerequisites: RAG Architectures | LLM Evaluation Benchmarks
The choice of embedding model determines retrieval quality in a RAG pipeline: the gap between a leader (Cohere embed-v4, MTEB 65.2) and an average model (MTEB ~57) translates to +15-20% Recall@10 in production. Meanwhile, prices differ by 72x: Gemini at $0.0025/M tokens vs Voyage at $0.18/M -- $250 vs $18,000 for 100M tokens/month. BGE-M3 (MIT license, self-hosted, MTEB 63.0) covers 80% of use cases for free. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking adds +5-10% nDCG@10 over pure vector search.
Key Concepts¶
Embeddings -- dense vectors that encode text semantics for similarity search and RAG.
Similarity Metrics¶
| Metric | Formula | Use Case |
|---|---|---|
| Cosine | \(\frac{A \cdot B}{\|A\| \cdot \|B\|}\) | Most common |
| Dot Product | \(A \cdot B\) | Normalized vectors |
| Euclidean | \(\|A - B\|\) | Clustering |
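For unit-normalized embeddings, cosine and dot product produce the same ranking, which is why many vector DBs default to dot product. A minimal NumPy sketch (the vectors are placeholders):

```python
import numpy as np

a = np.array([0.2, 0.7, 0.1])  # placeholder embedding vectors
b = np.array([0.3, 0.6, 0.2])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # scale-invariant angle
dot = a @ b                                               # fastest; assumes normalized inputs
euclidean = np.linalg.norm(a - b)                         # distance: lower = more similar

print(f"cosine={cosine:.3f}  dot={dot:.3f}  euclidean={euclidean:.3f}")
```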
Retrieval Metrics¶
Retrieval quality is reported mainly as nDCG@10 (rank-weighted relevance of the top 10 results) and Recall@k (the fraction of relevant documents that appear in the top k); both recur throughout this page.
1. MTEB Benchmark¶
Massive Text Embedding Benchmark -- the de facto standard for evaluating embedding models.
8 Task Categories¶
| Task | Metric | What It Tests |
|---|---|---|
| Retrieval | nDCG@10 | Query -> document matching |
| STS | Spearman | Semantic similarity |
| Classification | Accuracy/F1 | Semantic boundaries |
| Clustering | V-measure | Unsupervised grouping |
| Reranking | MAP | Fine-grained relevance |
| Pair Classification | AP | Duplicate/paraphrase detection |
| Summarization | Spearman | Abstraction recognition |
| Bitext Mining | F1 | Cross-lingual alignment |
Covers 1000+ languages and 58 English datasets; the leaderboard is hosted on Hugging Face Spaces.
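As an illustration, the `mteb` package scores any sentence-transformers model on a chosen task subset; the model and task below are arbitrary examples, and the exact API varies between `mteb` releases:

```python
# Hedged sketch: evaluate one model on one MTEB task.
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")   # example checkpoint
evaluation = MTEB(tasks=["Banking77Classification"])  # one task; a full run spans 8 categories
results = evaluation.run(model, output_folder="mteb_results")
```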
Leaderboard (2026)¶
| Rank | Model | MTEB Score | Type |
|---|---|---|---|
| 1 | Gemini embedding | n/a | Commercial |
| 2 | Cohere embed-v4 | 65.2 | Commercial |
| 3 | OpenAI text-3-large | 64.6 | Commercial |
| 4 | BGE-M3 | 63.0 | Open-source |
| 5 | E5-large-v2 | ~62 | Open-source |
| 6 | Voyage-3-large | 60.5 | Commercial |
| 7 | Nomic Embed | ~60 | Open-source |
| 8 | Jina-embeddings-v3 | ~57 | Open-source |
MTEB Score Tiers¶
| Tier | Score | Examples |
|---|---|---|
| Excellent | 64-70 | Cohere v4, OpenAI large, Gemini |
| Good | 60-64 | BGE-M3, Voyage, E5 |
| Average | 55-60 | Older models, Jina v3 |
| Poor | <55 | Basic embeddings |
Emerging Benchmarks¶
- RTEB (Retrieval Embedding Benchmark) -- retrieval-focused, more relevant for RAG
- BEIR 2.0 (Jan 2026) -- updated nDCG@10, more diverse evaluation sets
2. Commercial Models¶
OpenAI text-embedding-3¶
| Variant | Dimensions | MTEB | Price/M tokens | Max Tokens |
|---|---|---|---|---|
| small | 1536 (512 via MRL) | ~62 | $0.02 | 8191 |
| large | 3072 | 64.6 | $0.13 | 8191 |
Matryoshka Representation Learning (MRL): dimensions can be truncated (3072 -> 1536 -> 768 -> 256) without retraining -- storage optimization without a quality collapse.
Best for: RAG production, cost-performance balance.
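A sketch of what MRL truncation means in practice: slice the leading dimensions and re-normalize, and cosine search still works. (OpenAI's embeddings API also exposes a `dimensions` parameter that does this server-side.)

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components of an MRL-trained embedding and
    re-normalize so cosine/dot-product search remains valid."""
    v = embedding[:dims]
    return v / np.linalg.norm(v)

full = np.random.randn(3072)     # stand-in for a text-3-large vector
full /= np.linalg.norm(full)
short = truncate_mrl(full, 256)  # 12x less storage, usable for coarse retrieval
```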
Cohere embed-v4¶
| Aspect | Details |
|---|---|
| Dimensions | 1024 |
| MTEB | 65.2 (leader Nov 2025) |
| Multimodal | Text + Image |
| Price | $0.10/M tokens |
Enterprise features: handles noisy data, compression-ready, works with Cohere Reranker.
Best for: Multimodal RAG, enterprise deployments.
Voyage AI¶
| Model | Dimensions | MTEB | Price | Max Tokens |
|---|---|---|---|---|
| voyage-3-large | 1024 | 60.5 | $0.18/M | 32000 |
| voyage-3 | 1024 | ~58 | $0.06/M | 32000 |
| voyage-code-3 | 1024 | ~59 | $0.18/M | 32000 |
Long context (32K), code-specialized model.
Best for: Long documents, code search.
Google (Gemini / Vertex AI)¶
| Model | Dimensions | Price |
|---|---|---|
| text-embedding-005 | 768 | $0.0025/M |
| Gecko (256d) | 256 | Research |
Cheapest option. Gecko (256d) outperforms many 768d models on MTEB. Native GCP integration.
Best for: Budget deployments, GCP ecosystem.
3. Open-Source Models¶
BGE-M3¶
| Aspect | Details |
|---|---|
| Parameters | 568M |
| MTEB | 63.0 |
| Languages | 100+ |
| Max context | 8,192 tokens |
| Retrieval types | Dense, multi-vector (ColBERT), sparse |
| Latency | <30ms query time |
| License | MIT |
A single model produces multiple embedding types -- eliminating the need for separate dense and sparse models.
Best for: Self-hosted RAG, multilingual, privacy-critical.
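A hedged sketch using the FlagEmbedding package (BGE-M3's reference implementation); the `encode` flags below follow its documented interface but may shift between releases:

```python
# pip install FlagEmbedding
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["How do I reset error code NX-4521?"],
    return_dense=True,         # one 1024d vector per text -> ANN index
    return_sparse=True,        # lexical token weights -> sparse/BM25-style index
    return_colbert_vecs=True,  # per-token vectors -> late-interaction reranking
)
dense, sparse, colbert = out["dense_vecs"], out["lexical_weights"], out["colbert_vecs"]
```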
GTE (General Text Embeddings)¶
| Aspect | Details |
|---|---|
| Parameters | 305M (multilingual-base) |
| Languages | 70+ |
| Inference speed | ~10x faster than comparable models (reported) |
Best for: High-throughput production, Asian language retrieval.
E5 Family¶
| Model | Params | Quality | Notes |
|---|---|---|---|
| E5-small | 118M | Top-5 accuracy ~100% (reported) | 14x faster than 8B models |
| E5-large-v2 | 560M | MTEB ~62 | Baseline speed |
| E5-Mistral-7B | 7B | High | 4,096-token context |
Training: 270M text pairs via weakly supervised contrastive learning.
Best for: Speed-critical production, budget self-hosting.
Jina-embeddings-v4¶
| Aspect | Details |
|---|---|
| Base model | Qwen2.5-VL-3B-Instruct |
| Dimensions | 2048 (dense), multi-vector |
| Languages | 30+ |
| Modalities | Text, images, visual docs |
| License | CC-BY-NC-4.0 |
Best for: Multimodal open-source, visual document search.
Nomic Embed¶
| Aspect | Details |
|---|---|
| Dimensions | 768 |
| MTEB | ~60 |
| Max context | 8,192 tokens |
| License | Apache 2.0 |
Fully reproducible: open training data and code.
Best for: Fully open/reproducible research.
4. Technical Architecture¶
Dimensionality Trade-offs¶
| Dimensions | Storage/1M docs (FP32) | Recall | Speed |
|---|---|---|---|
| 256 | 1 GB | Low | Fastest |
| 512 | 2 GB | Medium | Fast |
| 768 | 3 GB | Good | Medium |
| 1024 | 4 GB | Better | Medium |
| 3072 | 12 GB | Best | Slower |
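The storage column is just FP32 arithmetic (4 bytes per dimension); a quick check:

```python
# Storage for 1M documents at FP32 (4 bytes/dim), matching the table above.
N_DOCS = 1_000_000
for dims in (256, 512, 768, 1024, 3072):
    gb = N_DOCS * dims * 4 / 1e9
    print(f"{dims:>4}d: {gb:>5.1f} GB")  # 256d -> 1.0 GB ... 3072d -> 12.3 GB
```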
Quantization¶
| Precision | Storage Reduction | Quality Loss |
|---|---|---|
| FP32 -> FP16 | 50% | ~0% |
| FP32 -> INT8 | 75% | ~1-2% |
| FP32 -> Binary | 96% | 5-10% |
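A minimal binary-quantization sketch: keep only sign bits (32x smaller) and rank by Hamming distance; production systems typically rescore the top candidates with full-precision vectors to recover most of the 5-10% quality loss:

```python
import numpy as np

def to_binary(vecs: np.ndarray) -> np.ndarray:
    """Sign-quantize FP32 embeddings to packed bits (32x storage reduction)."""
    return np.packbits(vecs > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hamming distance between packed binary codes; lower = more similar."""
    return np.unpackbits(a ^ b, axis=-1).sum(axis=-1)

docs = np.random.randn(10_000, 1024).astype(np.float32)
codes = to_binary(docs)                                         # 1024 bits = 128 bytes/doc
query = to_binary(np.random.randn(1, 1024).astype(np.float32))
top10 = np.argsort(hamming(query, codes))[:10]                  # coarse candidates to rescore
```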
5. Semantic Search Architecture¶
```mermaid
graph LR
A["Query"] --> B["Embedding"]
B --> C["Vector Search"]
C --> D["Rerank"]
D --> E["Metadata Filter"]
E --> F["Results"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#f3e5f5,stroke:#9c27b0
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#e8eaf6,stroke:#3f51b5
style F fill:#e8f5e9,stroke:#4caf50
```
Hybrid Search¶
Combines lexical (BM25, exact matches) + semantic (vectors, conceptual similarity).
Best practice: 70-80% semantic + 20-30% lexical weighting.
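A sketch of weighted fusion under the 70/30 split; it assumes BM25 and cosine scores for the same candidate set and min-max normalizes both before mixing (reciprocal rank fusion is a common alternative):

```python
import numpy as np

def hybrid_scores(semantic: np.ndarray, bm25: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    """Fuse semantic and lexical scores; alpha = semantic weight (0.7-0.8 typical)."""
    def minmax(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * minmax(semantic) + (1 - alpha) * minmax(bm25)

sem = np.array([0.82, 0.75, 0.40])             # placeholder cosine similarities
lex = np.array([1.2, 8.4, 3.1])                # placeholder raw BM25 scores
ranked = np.argsort(-hybrid_scores(sem, lex))  # best candidates first
```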
Reranking¶
| Stage | Purpose | Latency |
|---|---|---|
| Initial retrieval | Get candidates (top-100) | Fast |
| Cross-encoder reranking | Precision refinement (top-10) | 10-50ms |
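A rerank sketch with sentence-transformers' CrossEncoder; the checkpoint is one common public example:

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

query = "how to rotate API keys safely"
candidates = ["Rotate keys every 90 days...", "API rate limits...", "Key rotation runbook..."]

# Each (query, doc) pair is scored jointly -- slower than a bi-encoder, more precise.
scores = reranker.predict([(query, doc) for doc in candidates])
top10 = sorted(zip(scores, candidates), reverse=True)[:10]
```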
6. Multimodal Embeddings¶
| Model | Modalities | Notes |
|---|---|---|
| CLIP | Image + Text | Foundation model |
| ImageBind | Image, Text, Audio, Depth, Thermal | Universal space |
| BGE-VL | Image + Text | State-of-the-art visual search |
| Jina v4 | Text, Images, Visual docs | Multimodal + multilingual |
| Cohere embed-v4 | Text + Image | Enterprise multimodal |
Use cases: visual product search, cross-modal retrieval, document understanding.
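A CLIP-style cross-modal sketch; sentence-transformers ships a CLIP wrapper, and the checkpoint name here is one public example:

```python
# pip install sentence-transformers pillow
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # example CLIP checkpoint

img_emb = model.encode(Image.open("product.jpg"))          # image -> shared space
txt_emb = model.encode(["red running shoes", "blue mug"])  # text -> same space

print(util.cos_sim(img_emb, txt_emb))  # cross-modal similarity scores
```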
7. Selection Guide¶
```mermaid
graph TD
A["Need multimodal?"] -->|YES| B["Cohere embed-v4 / Jina v4"]
A -->|NO| C["Cheapest option?"]
C -->|YES| D["Gemini embedding<br/>($0.0025/M)"]
C -->|NO| E["Best open-source?"]
E -->|YES| F["BGE-M3"]
E -->|NO| G["Long context >8K?"]
G -->|YES| H["Voyage-3-large<br/>(32K)"]
G -->|NO| I["Code embeddings?"]
I -->|YES| J["Voyage-code-3"]
I -->|NO| K["OpenAI text-3-large"]
style A fill:#fff3e0,stroke:#ef6c00
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#e8f5e9,stroke:#4caf50
style I fill:#fff3e0,stroke:#ef6c00
style J fill:#e8f5e9,stroke:#4caf50
style K fill:#e8f5e9,stroke:#4caf50
```
Use Case Matrix¶
| Use Case | Recommended | Reason |
|---|---|---|
| RAG production | OpenAI text-3-large | Best balance |
| Multimodal RAG | Cohere embed-v4 | Text + Image |
| Budget | Gemini embedding | Cheapest |
| Privacy | BGE-M3 self-host | Full control |
| Long documents | Voyage-3-large | 32K context |
| Code search | Voyage-code-3 | Optimized |
| Multilingual | BGE-M3 or GTE | 70-100+ languages |
| High throughput | GTE / E5-small | 10-14x faster |
8. Pricing¶
| Model | Price/1M tokens | 10M/mo | 100M/mo |
|---|---|---|---|
| Gemini embedding | $0.0025 | $25 | $250 |
| BGE-M3 (self-host) | ~$0.01 (est. compute cost) | ~$100 | ~$1,000 |
| OpenAI text-3-small | $0.02 | $200 | $2,000 |
| Cohere embed-v4 | $0.10 | $1,000 | $10,000 |
| OpenAI text-3-large | $0.13 | $1,300 | $13,000 |
| Voyage-3-large | $0.18 | $1,800 | $18,000 |
For the Interview¶
Q: "How do you choose an embedding model for RAG?"¶
Depends on: (1) Quality vs cost -- MTEB scores 60-65 for top models; Gemini is cheapest ($0.0025/M), OpenAI has the best cost-performance ($0.13/M, 64.6 MTEB). (2) Open-source vs API -- BGE-M3 (63.0 MTEB, MIT, 100+ languages, dense+sparse+ColBERT) is the best open-source option. (3) Context length -- standard is 8K; Voyage offers 32K. (4) Multimodal -- Cohere embed-v4 or Jina v4. (5) Latency -- E5-small 16ms, BGE-M3 <30ms for real-time RAG. Use hybrid search (70-80% semantic + 20-30% lexical) with cross-encoder reranking.
Q: "What is Matryoshka Representation Learning?"¶
MRL trains embeddings to work at multiple dimensions simultaneously. OpenAI text-3-large: 3072d -> 1536 -> 768 -> 256 via truncation, without retraining. Enables storage optimization (3072d = 12GB/1M docs -> 256d = 1GB) with minimal quality loss. Trade-off: compression sacrifices nuanced details for general topic understanding.
Q: "Compare MTEB vs RTEB benchmarks."¶
MTEB -- standard general benchmark: 8 task categories (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining), 58 English datasets, 1000+ languages. RTEB -- emerging retrieval-specific benchmark, more relevant for RAG. MTEB scores 60-65 for top models; task-specific performance varies (Cohere best retrieval 67.5, OpenAI best STS 85.2, E5 best classification 78.3).
Key Numbers¶
| Fact | Value |
|---|---|
| Top MTEB score (Cohere v4) | 65.2 |
| OpenAI large MTEB | 64.6 |
| BGE-M3 MTEB | 63.0 |
| BGE-M3 query latency | <30ms |
| E5-small latency | 16ms |
| GTE inference speed | ~10x faster than comparable models (reported) |
| OpenAI small price | $0.02/M tokens |
| Gemini price | $0.0025/M tokens |
| Hybrid search weighting | 70-80% semantic + 20-30% lexical |
| Retrieval Recall@10 | 80-95% |
| STS Correlation (top models) | 0.75-0.90 |
| 1B vectors (1024d) storage | 4 TB |
| FP32->INT8 quality loss | ~1-2% |
Common Misconceptions¶
Misconception: more dimensions always mean higher quality
Google Gecko at 256 dimensions beats many 768d models on MTEB. Matryoshka learning (OpenAI text-3-large) lets you truncate 3072d -> 256d at a cost of only 2-5% Recall, while storage drops from 12 GB/1M docs to 1 GB. Dimension count is a storage/latency vs nuance trade-off, not "bigger = better".
Misconception: MTEB score is the main model-selection criterion
MTEB is a score averaged over 8 task categories (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining). A model scoring MTEB 65 can lose to a model scoring 60 on your specific retrieval task. Cohere embed-v4 leads overall MTEB, yet Voyage-code-3 is better at code retrieval. Always benchmark on your own data -- domain-specific evaluation matters more than MTEB.
Misconception: vector search is sufficient without BM25
Pure vector search misses exact keyword matches: the query 'error code NX-4521' won't find the document containing that code if the embedding model never saw it in training data. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking adds +5-10% nDCG@10. BGE-M3 handles this natively -- a single model produces dense, sparse, and ColBERT embeddings.
Interview Questions¶
Q: How do you choose an embedding model for RAG production?
Red flag: "Just take OpenAI text-3-large, it has the best MTEB"
Strong answer: "Five criteria: (1) Quality vs cost -- MTEB 60-65 for top models, but evaluate on your own data. (2) API vs self-host -- BGE-M3 (MIT, 63.0 MTEB, 100+ languages, dense+sparse+ColBERT) is the best open-source option. (3) Pricing -- Gemini $0.0025/M vs OpenAI $0.13/M vs self-hosted BGE-M3 at ~$0.01/M. (4) Context length -- the standard is 8K; Voyage offers 32K for long documents. (5) Multimodal -- Cohere embed-v4 or Jina v4 for text+image. Hybrid search (70-80% semantic + 20-30% BM25) plus cross-encoder reranking is a must."
Q: What is Matryoshka Representation Learning and why does it matter?
Red flag: "It's a way of compressing embeddings"
Strong answer: "MRL trains embeddings to work at several dimensionalities simultaneously. OpenAI text-3-large: 3072d -> 1536 -> 768 -> 256 by simple truncation, no retraining. In practice: 3072d = 12 GB/1M docs -> 256d = 1 GB, at a 2-5% Recall loss. Trade-off: smaller dimensions lose nuanced details but keep topic-level understanding. Used for tiered storage: full dimensions for top candidates, truncated for initial retrieval."
Q: Compare dense vs sparse vs ColBERT retrieval.
Red flag: "Dense is better than sparse; sparse is obsolete"
Strong answer: "Dense (a single vector per document) is good for semantics but misses exact matches. Sparse (BM25-like) catches keywords but is weak on paraphrases. ColBERT (multi-vector, token-level) gives the best precision but needs 10-100x more storage. BGE-M3 produces all three from one model. Production: hybrid = dense + sparse, reranking = cross-encoder. nDCG@10: dense 0.45, sparse 0.40, hybrid 0.50, hybrid+rerank 0.55 (typical figures on BEIR)."
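ColBERT's late interaction reduces to MaxSim: each query token takes its best-matching document token, and the per-token maxima are summed. A NumPy sketch with unit-normalized token vectors:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT late-interaction score: for each query token, take the max
    similarity over document tokens, then sum. Shapes: (n_q, d), (n_d, d)."""
    sim = query_vecs @ doc_vecs.T        # token-level cosine matrix (n_q, n_d)
    return float(sim.max(axis=1).sum())  # best doc token per query token

q = np.random.randn(8, 128);   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = np.random.randn(200, 128); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))  # one vector per token is why storage is 10-100x dense retrieval
```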
Sources¶
- AIMultiple -- "Embedding Models: OpenAI vs Gemini vs Cohere in 2026"
- Zylos Research -- "Embedding Models and Semantic Search 2026"
- Hugging Face -- MTEB Leaderboard
- Cohere -- "Embed 4 Blog"
- Milvus -- "We Benchmarked 20+ Embedding APIs"
- arXiv 2403.20327 -- "Gecko: Versatile Text Embeddings Distilled from LLMs"
- arXiv 2305.05665 -- "ImageBind: One Embedding Space To Bind Them All"
- arXiv 2210.07316 -- MTEB Benchmark Paper (Muennighoff et al., 2022)
- BEIR Benchmark 2.0 (Jan 2026)
- AgentSet Embedding Leaderboard
See Also¶
- Vector DB Comparison -- Pinecone vs Qdrant vs Weaviate
- RAG System Design -- embedding models in the RAG pipeline
- Knowledge Distillation -- distilling embedding models