RAG Techniques and Vector Databases¶
~8 min read
URL: American Express, Jishu Labs · Type: rag / vector-databases / architecture · Date: February 2026 · Collection: Ralph Research, PHASE 5
Prerequisite: Chunking strategies
Why this matters¶
Vanilla RAG (retrieve-and-generate) breaks down in production: once a collection grows past 100K+ documents, retrieval quality degrades, irrelevant chunks poison the context, and a static pipeline cannot adapt to query complexity. Self-RAG adds reflection tokens and lifts faithfulness from 58% to 82%. CorrectiveRAG classifies retrieval quality and falls back to web search on failures. Adaptive-RAG cuts unnecessary retrievals by 40%. Choosing a technique is a trade-off between accuracy, latency, and fine-tuning cost.
Part 1: Beyond Vanilla RAG - 7 Advanced Techniques (2026)¶
Why Vanilla RAG Fails¶
Common Issues:

- Retrieval quality degrades with large document collections
- No reasoning about when to retrieve or whether retrieved info is relevant
- Irrelevant chunks poison the context
- Complex queries need multi-step reasoning
- Static retrieval can't adapt to query complexity
Technique 1: Self-RAG (Self-Reflective RAG)¶
Core Concept: Model learns to reflect on retrieval quality and generation.
Architecture:
```
Query → LLM (should I retrieve?)
├── [YES] → Retrieve → LLM (is this relevant?)
│           ├── [YES] → Generate
│           └── [NO]  → Re-retrieve
└── [NO]  → Generate without retrieval
```
Key Innovation:
- Fine-tuned model (Self-RAG-7B) learns special tokens:
- [Retrieve] - should I retrieve?
- [IsREL] - is document relevant?
- [IsSUP] - does generation need support?
- [IsUSE] - is response useful?
Performance:

| Metric | Vanilla RAG | Self-RAG |
|--------|-------------|----------|
| Accuracy | 65% | 76% |
| Faithfulness | 58% | 82% |
Best For: High-stakes Q&A, medical/legal applications
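A minimal sketch of the Self-RAG control flow, with stub helpers standing in for the fine-tuned model; in the real system the reflection tokens are emitted inside a single generation pass, not via separate calls, and the [IsUSE] check is omitted here for brevity:

```python
from typing import List, Optional

def llm_decide(prompt: str) -> str:
    """Stub for a reflection-token decision ([Retrieve]/[IsREL]/[IsSUP]);
    Self-RAG-7B emits these as special tokens during generation."""
    return "YES"

def retrieve(query: str) -> List[str]:
    return [f"doc about {query}"]  # stub retriever

def generate(query: str, context: Optional[List[str]] = None) -> str:
    return f"answer({query!r}, ctx={context})"  # stub generator

def self_rag(query: str, max_retries: int = 2) -> str:
    # [Retrieve]: does the query need external knowledge at all?
    if llm_decide(f"[Retrieve]? {query}") == "NO":
        return generate(query)
    for _ in range(max_retries):
        docs = retrieve(query)
        # [IsREL]: keep only the documents judged relevant
        relevant = [d for d in docs
                    if llm_decide(f"[IsREL]? {query} | {d}") == "YES"]
        if relevant:
            answer = generate(query, context=relevant)
            # [IsSUP]: accept the answer only if the context supports it
            if llm_decide(f"[IsSUP]? {answer} | {relevant}") == "YES":
                return answer
        # otherwise loop: re-retrieve and try again
    return generate(query)  # fall back to parametric knowledge
```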
Technique 2: ActiveRAG¶
Core Concept: Dual-tasking with Chain-of-Thought and knowledge construction.
Process:
1. Query + Retrieved Docs
2. CoT Reasoning → "What do I know? What do I need?"
3. Knowledge Construction → Build understanding from chunks
4. Final Answer Generation
Key Features:

- Explicitly builds a knowledge graph from retrieved chunks
- Identifies conflicts between sources
- Confidence scoring for each piece of information
Best For: Research synthesis, multi-document summarization
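A minimal sketch of this flow as prompt chaining, assuming a generic `call_llm(prompt) -> str` completion helper; the prompts are illustrative, not the paper's:

```python
def active_rag(query: str, docs: list[str], call_llm) -> str:
    # Step 2: Chain-of-Thought -- what do I know, what do I need?
    cot = call_llm(
        f"Question: {query}\n"
        "Reason step by step: what do you already know, "
        "and what must the retrieved documents supply?"
    )
    # Step 3: knowledge construction -- fuse chunks, flag conflicts,
    # and attach a confidence score to each claim
    knowledge = call_llm(
        "Consolidate these chunks into structured knowledge. "
        "Note conflicts between sources and rate confidence per claim:\n\n"
        + "\n---\n".join(docs)
    )
    # Step 4: answer from the constructed knowledge, not the raw chunks
    return call_llm(
        f"Question: {query}\nReasoning: {cot}\n"
        f"Constructed knowledge: {knowledge}\nAnswer:"
    )
```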
Technique 3: Chain-of-Note (CoN)¶
Core Concept: Generate explanatory notes for each retrieved document.
Architecture:
```
Query + Doc 1 → Note 1 → "Document 1 discusses X, which relates to the query because..."
Query + Doc 2 → Note 2 → "Document 2 contradicts Doc 1 on Y..."
Query + Doc 3 → Note 3 → "Document 3 provides additional context Z..."
All Notes → Final Answer
```
Benefits:

- Forces the model to engage with each document
- Surfaces conflicts and contradictions
- Creates an audit trail for answer derivation
Fine-tuning Required: Yes, on note-generation task
Best For: Fact-checking, comparison tasks, citation-heavy work
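A prompting-style sketch of the note-then-answer loop; `call_llm` is an assumed completion helper, whereas the paper fine-tunes the model on the note-generation task:

```python
def chain_of_note(query: str, docs: list[str], call_llm) -> str:
    notes: list[str] = []
    for i, doc in enumerate(docs, start=1):
        # One note per document forces engagement with each source
        notes.append(call_llm(
            f"Query: {query}\nDocument {i}: {doc}\n"
            "Earlier notes:\n" + "\n".join(notes) +
            "\nWrite a note: what this document says about the query "
            "and whether it conflicts with the earlier notes."
        ))
    # The notes double as an audit trail for how the answer was derived
    return call_llm(f"Query: {query}\nNotes:\n" + "\n".join(notes) + "\nAnswer:")
```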
Technique 4: RAFT (Retrieval-Augmented Fine-Tuning)¶
Core Concept: Train model to handle irrelevant context gracefully.
Training Process:
For each training example:
1. Include gold document
2. Include 2-4 distractor documents (irrelevant)
3. Model must identify and use only relevant info
4. Chain-of-thought reasoning required
Key Insight:
"Models trained only on perfect retrieval fail when retrieval is imperfect. RAFT prepares for real-world noisy retrieval."
Performance Improvement:

| Setting | Baseline | RAFT |
|---------|----------|------|
| Clean retrieval | 72% | 78% |
| Noisy retrieval | 45% | 71% |
Best For: Production RAG systems with imperfect retrieval
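A sketch of RAFT-style training-example construction following the recipe above; the name `make_raft_example` and the 3-distractor default are illustrative:

```python
import random

def make_raft_example(question: str, gold_doc: str, cot_answer: str,
                      corpus: list[str], n_distractors: int = 3) -> dict:
    # Mix the gold document with sampled distractors so the model
    # learns to ignore irrelevant context
    distractors = random.sample(
        [d for d in corpus if d != gold_doc], n_distractors)
    docs = distractors + [gold_doc]
    random.shuffle(docs)  # the gold position must not be predictable
    context = "\n\n".join(f"[Doc {i}] {d}" for i, d in enumerate(docs))
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        # Target is a chain-of-thought answer citing only the gold document
        "completion": cot_answer,
    }
```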
Technique 5: CorrectiveRAG (CRAG)¶
Core Concept: Retrieval evaluator with fallback strategies.
Flow:
```mermaid
graph TD
    Q["Query"] --> RET["Retrieve"]
    RET --> EVAL["Evaluator"]
    EVAL -->|"Correct"| USE["Use as-is"]
    EVAL -->|"Ambiguous"| MIX["+ Web Search"]
    EVAL -->|"Incorrect"| WEB["Web Search Only"]
    USE --> GEN["Generate"]
    MIX --> GEN
    WEB --> GEN
    style Q fill:#e8eaf6,stroke:#3f51b5
    style EVAL fill:#fff3e0,stroke:#ef6c00
    style USE fill:#e8f5e9,stroke:#4caf50
    style MIX fill:#fff3e0,stroke:#ef6c00
    style WEB fill:#fce4ec,stroke:#c62828
    style GEN fill:#f3e5f5,stroke:#9c27b0
```
Evaluator Types:

1. Threshold-based: Similarity score above a threshold
2. LLM-based: Ask the model "Is this relevant?"
3. Fine-tuned classifier: Binary relevant/irrelevant

Fallback Options:

- Web search (Google, Bing, Tavily)
- Knowledge graph traversal
- Direct LLM generation (no retrieval)
Best For: Production systems needing robustness
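A sketch of the routing above with a threshold-based evaluator; `vector_search` and `web_search` are assumed helpers, and the 0.8/0.5 thresholds are illustrative, not from the paper:

```python
def crag_context(query: str, vector_search, web_search) -> list[str]:
    hits = vector_search(query)  # -> [(chunk, similarity_score), ...]
    best = max((score for _, score in hits), default=0.0)
    if best >= 0.8:   # "Correct": trust retrieval as-is
        return [chunk for chunk, _ in hits]
    if best >= 0.5:   # "Ambiguous": augment with web search
        return [chunk for chunk, _ in hits] + web_search(query)
    return web_search(query)  # "Incorrect": discard retrieval entirely
```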
Technique 6: Adaptive-RAG¶
Core Concept: Query complexity classifier determines retrieval strategy.
Classification Levels:

| Complexity | Strategy | Example |
|------------|----------|---------|
| Low | No retrieval | "What is 2+2?" |
| Medium | Single retrieval | "What is the capital of France?" |
| High | Multi-step RAG | "Compare Q1 2025 revenue across all competitors" |
Classifier Training:
```python
# Train on query-strategy pairs
examples = [
    ("What is Python?", "no_retrieval"),
    ("What is Django 5.0 features?", "single_retrieval"),
    ("Compare FastAPI vs Django performance 2025", "multi_retrieval"),
]
```
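A minimal router sketch built on pairs like these; the TF-IDF + logistic-regression classifier is an assumption for illustration (Adaptive-RAG uses a small fine-tuned LM, but any text classifier fills the same slot):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

examples = [
    ("What is Python?", "no_retrieval"),
    ("What is Django 5.0 features?", "single_retrieval"),
    ("Compare FastAPI vs Django performance 2025", "multi_retrieval"),
]
queries, strategies = zip(*examples)

# Stand-in classifier; a production router needs thousands of labeled pairs
router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(queries, strategies)

strategy = router.predict(["Compare Qdrant and Milvus latency"])[0]
# -> dispatch to the no_retrieval / single_retrieval / multi_retrieval path
```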
Efficiency Gains:

- 40% fewer unnecessary retrievals
- 25% latency reduction for simple queries
- 15% accuracy improvement on complex queries
Best For: High-throughput systems with mixed query types
Technique 7: Graph-Enhanced RAG¶
Core Concept: Knowledge graphs + vector retrieval + graph traversal.
Architecture:
```
Documents → Chunks   → Embeddings → Vector Store
          → Entities → Knowledge Graph

Query → Vector Search (semantic)
      → Entity Extraction → Graph Traversal (structural)
                          ↓
        Combined Context → Generation
```
Graph Types:

1. Entity-Relationship: Nodes = entities, edges = relations
2. Document-Chunk: Nodes = docs/chunks, edges = references
3. Concept-Hierarchy: Nodes = concepts, edges = "is-a", "part-of"

Traversal Strategies:

- 1-hop: Direct neighbors only
- 2-hop: Neighbors of neighbors
- Weighted: Follow high-confidence edges
Performance:

| Task | Vector Only | Graph-Enhanced |
|------|-------------|----------------|
| Multi-hop reasoning | 52% | 74% |
| Entity-focused QA | 68% | 85% |
| Relationship queries | 45% | 79% |
Best For: Knowledge-heavy domains, entity-centric queries
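A sketch of the combined retrieval path using networkx for the 1-hop traversal; the graph layout (entity nodes carrying a `chunks` attribute) and the `vector_search`/`extract_entities` helpers are assumptions:

```python
import networkx as nx

def graph_enhanced_context(query: str, graph: nx.Graph,
                           vector_search, extract_entities) -> list[str]:
    context = set(vector_search(query))           # semantic route
    for entity in extract_entities(query):        # structural route
        if entity not in graph:
            continue
        for neighbor in graph.neighbors(entity):  # 1-hop traversal
            context.update(graph.nodes[neighbor].get("chunks", []))
    return list(context)
```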
Part 2: RAG Technique Selection Guide¶
Summary Table¶
| Technique | Fine-tuning Required | Complexity | Best Use Case |
|---|---|---|---|
| Self-RAG | Yes (Self-RAG-7B) | High | High-stakes Q&A |
| ActiveRAG | Optional | Medium | Research synthesis |
| Chain-of-Note | Yes | Medium | Fact-checking |
| RAFT | Yes | Medium | Noisy retrieval environments |
| CorrectiveRAG | No | Low | Production robustness |
| Adaptive-RAG | Yes (classifier) | Medium | Mixed query complexity |
| Graph-Enhanced | No (but requires graph construction) | High | Knowledge-heavy domains |
Decision Tree¶
```
Is retrieval quality critical?
├── Yes → Can you fine-tune?
│         ├── Yes → Self-RAG
│         └── No  → CorrectiveRAG
└── No  → Is query complexity variable?
          ├── Yes → Adaptive-RAG
          └── No  → Vanilla RAG + evaluation
```
Part 3: Vector Database Comparison 2026¶
Overview¶
| Database | Type | Language | Key Strength |
|---|---|---|---|
| Pinecone | Managed | - | Zero ops, enterprise-ready |
| Weaviate | Self-hosted/Managed | Go | Hybrid search, GraphQL |
| Qdrant | Self-hosted | Rust | Performance, filtering |
| Milvus | Self-hosted | Go | Scale, GPU support |
Detailed Comparison¶
Pinecone¶
Architecture: Fully managed, serverless or pod-based
Key Features:

- Zero infrastructure management
- Automatic scaling
- Built-in monitoring
- Namespace isolation for multi-tenancy

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 20ms |
| p99 latency | 50ms |
| Throughput | 500 QPS |

Pricing:

- Serverless: $0.09/GB + query costs
- Pod-based: $0.12/hour per pod
Code Example:
```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")

# Upsert
index.upsert(vectors=[
    {"id": "id1", "values": [0.1, 0.2, ...], "metadata": {"text": "document 1"}},
    {"id": "id2", "values": [0.3, 0.4, ...], "metadata": {"text": "document 2"}},
])

# Query
results = index.query(
    vector=[0.15, 0.25, ...],
    top_k=10,
    include_metadata=True
)
```
Best For: Teams wanting zero ops, enterprise deployments
Weaviate¶
Architecture: Self-hosted or managed cloud
Key Features:

- GraphQL API
- Hybrid search (vector + keyword)
- Built-in vectorization modules
- Cross-references between objects
- Real-time semantic search

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 15ms |
| p99 latency | 40ms |
| Throughput | 800 QPS |
Hybrid Search:
```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # v3 Python client

# Vector + BM25 combined
result = client.query.get("Document", ["title", "content"]) \
    .with_hybrid(
        query="machine learning",
        alpha=0.5  # 0 = keyword only, 1 = vector only
    ) \
    .with_limit(10) \
    .do()
```
GraphQL Example:
```graphql
{
  Get {
    Document(nearText: {concepts: ["AI safety"]}) {
      title
      content
      _additional {
        certainty
        distance
      }
    }
  }
}
```
Best For: Hybrid search needs, GraphQL-first teams
Qdrant¶
Architecture: Self-hosted, Rust-native
Key Features:

- Highest throughput among open-source options
- Rich filtering capabilities
- Quantization support (scalar, product, binary)
- Distributed mode available
- Payload-based filtering

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 8ms |
| p99 latency | 25ms |
| Throughput | 1500 QPS |
Rich Filtering:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient(":memory:")

# Search with a payload filter: category == "ml" AND year >= 2024
results = client.search(
    collection_name="documents",
    query_vector=[0.1, 0.2, ...],
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="ml")
            ),
            FieldCondition(
                key="year",
                range=Range(gte=2024)
            )
        ]
    ),
    limit=10
)
```
Quantization:
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")

# Enable scalar quantization for ~4x memory reduction
client.update_collection(
    collection_name="documents",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,    # clip extreme values above the 99th percentile
            always_ram=True   # keep quantized vectors in RAM for speed
        )
    )
)
```
Best For: High-throughput, low-latency requirements
Milvus¶
Architecture: Self-hosted, designed for scale
Key Features:

- Handles billions of vectors
- GPU index support
- Multiple index types (IVF, HNSW, DiskANN)
- Data persistence with object storage
- Multi-tenancy via partitioning

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 12ms |
| p99 latency | 35ms |
| Throughput | 1200 QPS |
| Max vectors | 10B+ |
GPU-Accelerated Index:
```python
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")

# Build a GPU-accelerated IVF index on an existing collection
collection = Collection("documents")
index_params = {
    "metric_type": "COSINE",
    "index_type": "GPU_IVF_FLAT",
    "params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)
```
Partitioning for Multi-tenancy:
```python
# Create one partition per tenant
collection.create_partition("tenant_1")
collection.create_partition("tenant_2")

# Search within a single partition
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 10}},
    limit=10,
    partition_names=["tenant_1"]  # tenant isolation
)
```
Best For: Large-scale deployments, GPU infrastructure
Part 4: Vector Database Selection Guide¶
Benchmark Summary (February 2026)¶
| Database | Latency (p50) | Throughput | Scale | Ops Burden |
|---|---|---|---|---|
| Pinecone | 20ms | 500 QPS | 100M | Zero |
| Weaviate | 15ms | 800 QPS | 500M | Medium |
| Qdrant | 8ms | 1500 QPS | 1B | Medium |
| Milvus | 12ms | 1200 QPS | 10B+ | High |
Decision Framework¶
```
Scale Requirement?
├── <100M vectors
│   └── Team expertise?
│       ├── Want zero ops        → Pinecone
│       ├── Need hybrid search   → Weaviate
│       └── Need max performance → Qdrant
└── >100M vectors
    ├── Have GPU infrastructure → Milvus
    └── No GPU → Qdrant cluster
```
Cost Comparison (100M vectors, 1536 dims)¶
| Database | Monthly Cost | Notes |
|---|---|---|
| Pinecone | ~$5,000 | Serverless, pay per query |
| Weaviate Cloud | ~$3,500 | Managed service |
| Qdrant Cloud | ~$2,500 | Managed service |
| Self-hosted | ~$1,000 | Hardware + ops cost |
Part 5: Production Considerations¶
Index Selection¶
| Index Type | Build Time | Memory | Recall | Use Case |
|---|---|---|---|---|
| Flat | None | High | 100% | Small datasets, exact search |
| IVF | Medium | Medium | 90-95% | General purpose |
| HNSW | High | High | 95-99% | High recall, low latency |
| DiskANN | Low | Low | 90-95% | Disk-based, large scale |
| GPU_IVF | Medium | GPU RAM | 90-95% | GPU infrastructure |
Embedding Model Selection¶
| Model | Dimensions | Quality (MTEB) | Speed | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Top-5 | Fast | $0.13/1M |
| OpenAI text-embedding-3-small | 1536 | Top-10 | Very fast | $0.02/1M |
| Cohere embed-v3 | 1024 | Top-5 | Fast | $0.10/1M |
| Voyage-3 | 1024 | Top-5 | Fast | $0.12/1M |
| E5-large-v2 | 1024 | Top-10 | Medium | Free |
| BGE-large-en | 1024 | Top-10 | Medium | Free |
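For reference, a minimal embedding call for the OpenAI models in the table (openai-python v1 client; the optional `dimensions` argument truncates Matryoshka-style):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["chunk one", "chunk two"],
    # dimensions=512,  # optional: trade recall for memory
)
vectors = [d.embedding for d in resp.data]  # 1536-dim by default
```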
Optimization Tips¶
- Quantization: 4x memory reduction with <1% recall loss
- Dimensionality reduction: PCA/Matryoshka for lower dims
- Batch operations: Batch upserts/searches for throughput
- Caching: Cache frequent queries with Redis (see the sketch after this list)
- Sharding: Distribute by tenant or time for scale
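A minimal sketch of the caching tip with redis-py, keying on a hash of the normalized query; the `rag:` key prefix and one-hour TTL are arbitrary choices:

```python
import hashlib
import json

import redis

r = redis.Redis()

def cached_search(query: str, search_fn, ttl_s: int = 3600):
    # Normalize, then hash so the key length stays bounded
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    results = search_fn(query)  # results must be JSON-serializable
    r.set(key, json.dumps(results), ex=ttl_s)  # expire to bound staleness
    return results
```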
Interview Questions¶
Conceptual:
- "Чем Self-RAG отличается от CorrectiveRAG?" -- Self-RAG fine-tuned модель с reflection tokens, принимает решения внутри генерации. CorrectiveRAG -- внешний evaluator + fallback, не требует fine-tuning.
- "Когда Adaptive-RAG лучше vanilla RAG?" -- При mixed query complexity: простые вопросы не нуждаются в retrieval (40% экономии), сложные требуют multi-step.
- "Почему RAFT тренирует с distractor documents?" -- В production retrieval неидеален. Модель, обученная только на gold docs, падает с 72% до 45% на noisy retrieval. RAFT сохраняет 71%.
System Design:
- "Вам нужно выбрать vector DB для RAG с 50M документов и <20ms latency. Какую выберете?" -- Qdrant (p50 8ms, 1500 QPS) или Milvus (если нужен GPU). Pinecone не дотягивает по latency (p50 20ms). pgvector исключён (p50 50ms).
- "Как бы вы скомбинировали несколько RAG-техник в production pipeline?" -- CorrectiveRAG (не нужен fine-tuning) + Adaptive-RAG classifier (query routing) + Graph-Enhanced для entity-heavy доменов.
Common Mistakes¶
"Vector search всегда лучше BM25" -- Неправда. BM25 сильнее для exact term matching (код, логи, ID). Hybrid (BM25 + dense + RRF fusion) -- стандарт 2026 с 95%+ adoption.
"Больше чанков в контексте = лучше" -- "Lost in the Middle" эффект: attention quality падает с длиной контекста. Re-ranking + top-K фильтрация критичны.
"Self-RAG = просто добавить if-else перед retrieval" -- Self-RAG это fine-tuned модель с 4 специальными reflection tokens ([Retrieve], [IsREL], [IsSUP], [IsUSE]). Это не prompt engineering.
Sources¶
- American Express — "Beyond Vanilla RAG: 7 Advanced RAG Techniques" (Feb 2026)
- Jishu Labs — "Vector Database Comparison 2026: The Ultimate Guide" (Jan 2026)
- Self-RAG paper (arXiv:2310.11511, Asai et al., 2023)
- RAFT paper (arXiv:2403.10131, Zhang et al., 2024)
- Qdrant documentation
- Milvus documentation