RAG Techniques and Vector Databases¶
~8 min read
URL: American Express, Jishu Labs · Type: rag / vector-databases / architecture · Date: February 2026 · Collection: Ralph Research, PHASE 5
Prerequisite: Chunking strategies
Why this matters¶
Vanilla RAG (retrieve-and-generate) breaks down in production: once a collection grows past 100K+ documents, retrieval quality degrades, irrelevant chunks poison the context, and a static pipeline cannot adapt to query complexity. Self-RAG adds reflection tokens and lifts faithfulness from 58% to 82%. CorrectiveRAG classifies retrieval quality and falls back to web search on failures. Adaptive-RAG cuts unnecessary retrievals by 40%. Choosing a technique is a trade-off between accuracy, latency, and fine-tuning cost.
Part 1: Beyond Vanilla RAG - 7 Advanced Techniques (2026)¶
Why Vanilla RAG Fails¶
Common Issues:

- Retrieval quality degrades with large document collections
- No reasoning about when to retrieve or whether retrieved info is relevant
- Irrelevant chunks poison the context
- Complex queries need multi-step reasoning
- Static retrieval can't adapt to query complexity
Technique 1: Self-RAG (Self-Reflective RAG)¶
Core Concept: Model learns to reflect on retrieval quality and generation.
Architecture:
```
Query → LLM (should I retrieve?)
├── [YES] → Retrieve → LLM (is this relevant?)
│           ├── [YES] → Generate
│           └── [NO]  → Re-retrieve
└── [NO]  → Generate without retrieval
```
Key Innovation:
- Fine-tuned model (Self-RAG-7B) learns special tokens:
- [Retrieve] - should I retrieve?
- [IsREL] - is document relevant?
- [IsSUP] - does generation need support?
- [IsUSE] - is response useful?
Performance:

| Metric | Vanilla RAG | Self-RAG |
|--------|-------------|----------|
| Accuracy | 65% | 76% |
| Faithfulness | 58% | 82% |
Best For: High-stakes Q&A, medical/legal applications
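A minimal sketch of the Self-RAG control flow, with stub helpers standing in for the fine-tuned model; in the real system the reflection tokens are emitted inside a single generation pass, not via separate calls, and the [IsUSE] check is omitted here for brevity:

```python
from typing import List, Optional

def llm_decide(prompt: str) -> str:
    """Stub for a reflection-token decision ([Retrieve]/[IsREL]/[IsSUP]);
    Self-RAG-7B emits these as special tokens during generation."""
    return "YES"

def retrieve(query: str) -> List[str]:
    return [f"doc about {query}"]  # stub retriever

def generate(query: str, context: Optional[List[str]] = None) -> str:
    return f"answer({query!r}, ctx={context})"  # stub generator

def self_rag(query: str, max_retries: int = 2) -> str:
    # [Retrieve]: does the query need external knowledge at all?
    if llm_decide(f"[Retrieve]? {query}") == "NO":
        return generate(query)
    for _ in range(max_retries):
        docs = retrieve(query)
        # [IsREL]: keep only the documents judged relevant
        relevant = [d for d in docs
                    if llm_decide(f"[IsREL]? {query} | {d}") == "YES"]
        if relevant:
            answer = generate(query, context=relevant)
            # [IsSUP]: accept the answer only if the context supports it
            if llm_decide(f"[IsSUP]? {answer} | {relevant}") == "YES":
                return answer
        # otherwise loop: re-retrieve and try again
    return generate(query)  # fall back to parametric knowledge
```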
Technique 2: ActiveRAG¶
Core Concept: Dual-tasking with Chain-of-Thought and knowledge construction.
Process:
1. Query + Retrieved Docs
2. CoT Reasoning → "What do I know? What do I need?"
3. Knowledge Construction → Build understanding from chunks
4. Final Answer Generation
Key Features:

- Explicitly builds a knowledge graph from retrieved chunks
- Identifies conflicts between sources
- Confidence scoring for each piece of information
Best For: Research synthesis, multi-document summarization
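A minimal sketch of this flow as prompt chaining, assuming a generic `call_llm(prompt) -> str` completion helper; the prompts are illustrative, not the paper's:

```python
def active_rag(query: str, docs: list[str], call_llm) -> str:
    # Step 2: Chain-of-Thought -- what do I know, what do I need?
    cot = call_llm(
        f"Question: {query}\n"
        "Reason step by step: what do you already know, "
        "and what must the retrieved documents supply?"
    )
    # Step 3: knowledge construction -- fuse chunks, flag conflicts,
    # and attach a confidence score to each claim
    knowledge = call_llm(
        "Consolidate these chunks into structured knowledge. "
        "Note conflicts between sources and rate confidence per claim:\n\n"
        + "\n---\n".join(docs)
    )
    # Step 4: answer from the constructed knowledge, not the raw chunks
    return call_llm(
        f"Question: {query}\nReasoning: {cot}\n"
        f"Constructed knowledge: {knowledge}\nAnswer:"
    )
```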
Technique 3: Chain-of-Note (CoN)¶
Core Concept: Generate explanatory notes for each retrieved document.
Architecture:
```
Query + Doc 1 → Note 1 → "Document 1 discusses X, which relates to the query because..."
Query + Doc 2 → Note 2 → "Document 2 contradicts Doc 1 on Y..."
Query + Doc 3 → Note 3 → "Document 3 provides additional context Z..."
All Notes → Final Answer
```
Benefits:

- Forces the model to engage with each document
- Surfaces conflicts and contradictions
- Creates an audit trail for answer derivation
Fine-tuning Required: Yes, on note-generation task
Best For: Fact-checking, comparison tasks, citation-heavy work
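A prompting-style sketch of the note-then-answer loop; `call_llm` is an assumed completion helper, whereas the paper fine-tunes the model on the note-generation task:

```python
def chain_of_note(query: str, docs: list[str], call_llm) -> str:
    notes: list[str] = []
    for i, doc in enumerate(docs, start=1):
        # One note per document forces engagement with each source
        notes.append(call_llm(
            f"Query: {query}\nDocument {i}: {doc}\n"
            "Earlier notes:\n" + "\n".join(notes) +
            "\nWrite a note: what this document says about the query "
            "and whether it conflicts with the earlier notes."
        ))
    # The notes double as an audit trail for how the answer was derived
    return call_llm(f"Query: {query}\nNotes:\n" + "\n".join(notes) + "\nAnswer:")
```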
Technique 4: RAFT (Retrieval-Augmented Fine-Tuning)¶
Core Concept: Train model to handle irrelevant context gracefully.
Training Process:
For each training example:
1. Include gold document
2. Include 2-4 distractor documents (irrelevant)
3. Model must identify and use only relevant info
4. Chain-of-thought reasoning required
Key Insight:
"Models trained only on perfect retrieval fail when retrieval is imperfect. RAFT prepares for real-world noisy retrieval."
Performance Improvement:

| Setting | Baseline | RAFT |
|---------|----------|------|
| Clean retrieval | 72% | 78% |
| Noisy retrieval | 45% | 71% |
Best For: Production RAG systems with imperfect retrieval
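A sketch of RAFT-style training-example construction following the recipe above; the name `make_raft_example` and the 3-distractor default are illustrative:

```python
import random

def make_raft_example(question: str, gold_doc: str, cot_answer: str,
                      corpus: list[str], n_distractors: int = 3) -> dict:
    # Mix the gold document with sampled distractors so the model
    # learns to ignore irrelevant context
    distractors = random.sample(
        [d for d in corpus if d != gold_doc], n_distractors)
    docs = distractors + [gold_doc]
    random.shuffle(docs)  # the gold position must not be predictable
    context = "\n\n".join(f"[Doc {i}] {d}" for i, d in enumerate(docs))
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        # Target is a chain-of-thought answer citing only the gold document
        "completion": cot_answer,
    }
```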
Technique 5: CorrectiveRAG (CRAG)¶
Core Concept: Retrieval evaluator with fallback strategies.
Flow:
```mermaid
graph TD
    Q["Query"] --> RET["Retrieve"]
    RET --> EVAL["Evaluator"]
    EVAL -->|"Correct"| USE["Use as-is"]
    EVAL -->|"Ambiguous"| MIX["+ Web Search"]
    EVAL -->|"Incorrect"| WEB["Web Search Only"]
    USE --> GEN["Generate"]
    MIX --> GEN
    WEB --> GEN
    style Q fill:#e8eaf6,stroke:#3f51b5
    style EVAL fill:#fff3e0,stroke:#ef6c00
    style USE fill:#e8f5e9,stroke:#4caf50
    style MIX fill:#fff3e0,stroke:#ef6c00
    style WEB fill:#fce4ec,stroke:#c62828
    style GEN fill:#f3e5f5,stroke:#9c27b0
```
Evaluator Types:

1. Threshold-based: Similarity score above a threshold
2. LLM-based: Ask the model "Is this relevant?"
3. Fine-tuned classifier: Binary relevant/irrelevant

Fallback Options:

- Web search (Google, Bing, Tavily)
- Knowledge graph traversal
- Direct LLM generation (no retrieval)
Best For: Production systems needing robustness
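A sketch of the routing above with a threshold-based evaluator; `vector_search` and `web_search` are assumed helpers, and the 0.8/0.5 thresholds are illustrative, not from the paper:

```python
def crag_context(query: str, vector_search, web_search) -> list[str]:
    hits = vector_search(query)  # -> [(chunk, similarity_score), ...]
    best = max((score for _, score in hits), default=0.0)
    if best >= 0.8:   # "Correct": trust retrieval as-is
        return [chunk for chunk, _ in hits]
    if best >= 0.5:   # "Ambiguous": augment with web search
        return [chunk for chunk, _ in hits] + web_search(query)
    return web_search(query)  # "Incorrect": discard retrieval entirely
```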
Technique 6: Adaptive-RAG¶
Core Concept: Query complexity classifier determines retrieval strategy.
Classification Levels:

| Complexity | Strategy | Example |
|------------|----------|---------|
| Low | No retrieval | "What is 2+2?" |
| Medium | Single retrieval | "What is the capital of France?" |
| High | Multi-step RAG | "Compare Q1 2025 revenue across all competitors" |
Classifier Training:
```python
# Train on query-strategy pairs
examples = [
    ("What is Python?", "no_retrieval"),
    ("What is Django 5.0 features?", "single_retrieval"),
    ("Compare FastAPI vs Django performance 2025", "multi_retrieval"),
]
```
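A minimal router sketch built on pairs like these; the TF-IDF + logistic-regression classifier is an assumption for illustration (Adaptive-RAG uses a small fine-tuned LM, but any text classifier fills the same slot):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

examples = [
    ("What is Python?", "no_retrieval"),
    ("What is Django 5.0 features?", "single_retrieval"),
    ("Compare FastAPI vs Django performance 2025", "multi_retrieval"),
]
queries, strategies = zip(*examples)

# Stand-in classifier; a production router needs thousands of labeled pairs
router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(queries, strategies)

strategy = router.predict(["Compare Qdrant and Milvus latency"])[0]
# -> dispatch to the no_retrieval / single_retrieval / multi_retrieval path
```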
Efficiency Gains:

- 40% fewer unnecessary retrievals
- 25% latency reduction for simple queries
- 15% accuracy improvement on complex queries
Best For: High-throughput systems with mixed query types
Technique 7: Graph-Enhanced RAG¶
Core Concept: Knowledge graphs + vector retrieval + graph traversal.
Architecture:
```
Documents → Chunks   → Embeddings → Vector Store
          → Entities → Knowledge Graph

Query → Vector Search (semantic)
      → Entity Extraction → Graph Traversal (structural)
                          ↓
        Combined Context → Generation
```
Graph Types:

1. Entity-Relationship: Nodes = entities, edges = relations
2. Document-Chunk: Nodes = docs/chunks, edges = references
3. Concept-Hierarchy: Nodes = concepts, edges = "is-a", "part-of"

Traversal Strategies:

- 1-hop: Direct neighbors only
- 2-hop: Neighbors of neighbors
- Weighted: Follow high-confidence edges
Performance:

| Task | Vector Only | Graph-Enhanced |
|------|-------------|----------------|
| Multi-hop reasoning | 52% | 74% |
| Entity-focused QA | 68% | 85% |
| Relationship queries | 45% | 79% |
Best For: Knowledge-heavy domains, entity-centric queries
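A sketch of the combined retrieval path using networkx for the 1-hop traversal; the graph layout (entity nodes carrying a `chunks` attribute) and the `vector_search`/`extract_entities` helpers are assumptions:

```python
import networkx as nx

def graph_enhanced_context(query: str, graph: nx.Graph,
                           vector_search, extract_entities) -> list[str]:
    context = set(vector_search(query))           # semantic route
    for entity in extract_entities(query):        # structural route
        if entity not in graph:
            continue
        for neighbor in graph.neighbors(entity):  # 1-hop traversal
            context.update(graph.nodes[neighbor].get("chunks", []))
    return list(context)
```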
Part 2: RAG Technique Selection Guide¶
Summary Table¶
| Technique | Fine-tuning Required | Complexity | Best Use Case |
|---|---|---|---|
| Self-RAG | Yes (Self-RAG-7B) | High | High-stakes Q&A |
| ActiveRAG | Optional | Medium | Research synthesis |
| Chain-of-Note | Yes | Medium | Fact-checking |
| RAFT | Yes | Medium | Noisy retrieval environments |
| CorrectiveRAG | No | Low | Production robustness |
| Adaptive-RAG | Yes (classifier) | Medium | Mixed query complexity |
| Graph-Enhanced | No (but requires graph construction) | High | Knowledge-heavy domains |
Decision Tree¶
```
Is retrieval quality critical?
├── Yes → Can you fine-tune?
│         ├── Yes → Self-RAG
│         └── No  → CorrectiveRAG
└── No  → Is query complexity variable?
          ├── Yes → Adaptive-RAG
          └── No  → Vanilla RAG + evaluation
```
Part 3: Vector Database Comparison 2026¶
Overview¶
| Database | Type | Language | Key Strength |
|---|---|---|---|
| Pinecone | Managed | - | Zero ops, enterprise-ready |
| Weaviate | Self-hosted/Managed | Go | Hybrid search, GraphQL |
| Qdrant | Self-hosted | Rust | Performance, filtering |
| Milvus | Self-hosted | Go | Scale, GPU support |
Detailed Comparison¶
Pinecone¶
Architecture: Fully managed, serverless or pod-based
Key Features:

- Zero infrastructure management
- Automatic scaling
- Built-in monitoring
- Namespace isolation for multi-tenancy

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 20ms |
| p99 latency | 50ms |
| Throughput | 500 QPS |

Pricing:

- Serverless: $0.09/GB + query costs
- Pod-based: $0.12/hour per pod
Code Example:
```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")

# Upsert
index.upsert(vectors=[
    {"id": "id1", "values": [0.1, 0.2, ...], "metadata": {"text": "document 1"}},
    {"id": "id2", "values": [0.3, 0.4, ...], "metadata": {"text": "document 2"}},
])

# Query
results = index.query(
    vector=[0.15, 0.25, ...],
    top_k=10,
    include_metadata=True
)
```
Best For: Teams wanting zero ops, enterprise deployments
Weaviate¶
Architecture: Self-hosted or managed cloud
Key Features:

- GraphQL API
- Hybrid search (vector + keyword)
- Built-in vectorization modules
- Cross-references between objects
- Real-time semantic search

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 15ms |
| p99 latency | 40ms |
| Throughput | 800 QPS |
Hybrid Search:
```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # v3 Python client

# Vector + BM25 combined
result = client.query.get("Document", ["title", "content"]) \
    .with_hybrid(
        query="machine learning",
        alpha=0.5  # 0 = keyword only, 1 = vector only
    ) \
    .with_limit(10) \
    .do()
```
GraphQL Example:
```graphql
{
  Get {
    Document(nearText: {concepts: ["AI safety"]}) {
      title
      content
      _additional {
        certainty
        distance
      }
    }
  }
}
```
Best For: Hybrid search needs, GraphQL-first teams
Qdrant¶
Architecture: Self-hosted, Rust-native
Key Features:

- Highest throughput among open-source options
- Rich filtering capabilities
- Quantization support (scalar, product, binary)
- Distributed mode available
- Payload-based filtering

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 8ms |
| p99 latency | 25ms |
| Throughput | 1500 QPS |
Rich Filtering:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient(":memory:")

# Search with a payload filter: category == "ml" AND year >= 2024
results = client.search(
    collection_name="documents",
    query_vector=[0.1, 0.2, ...],
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="ml")
            ),
            FieldCondition(
                key="year",
                range=Range(gte=2024)
            )
        ]
    ),
    limit=10
)
```
Quantization:
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")

# Enable scalar quantization for ~4x memory reduction
client.update_collection(
    collection_name="documents",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,    # clip extreme values above the 99th percentile
            always_ram=True   # keep quantized vectors in RAM for speed
        )
    )
)
```
Best For: High-throughput, low-latency requirements
Milvus¶
Architecture: Self-hosted, designed for scale
Key Features:

- Handles billions of vectors
- GPU index support
- Multiple index types (IVF, HNSW, DiskANN)
- Data persistence with object storage
- Multi-tenancy via partitioning

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 12ms |
| p99 latency | 35ms |
| Throughput | 1200 QPS |
| Max vectors | 10B+ |
GPU-Accelerated Index:
```python
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")

# Build a GPU-accelerated IVF index on an existing collection
collection = Collection("documents")
index_params = {
    "metric_type": "COSINE",
    "index_type": "GPU_IVF_FLAT",
    "params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)
```
Partitioning for Multi-tenancy:
```python
# Create one partition per tenant
collection.create_partition("tenant_1")
collection.create_partition("tenant_2")

# Search within a single partition
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 10}},
    limit=10,
    partition_names=["tenant_1"]  # tenant isolation
)
```
Best For: Large-scale deployments, GPU infrastructure
Part 4: Vector Database Selection Guide¶
Benchmark Summary (February 2026)¶
| Database | Latency (p50) | Throughput | Scale | Ops Burden |
|---|---|---|---|---|
| Pinecone | 20ms | 500 QPS | 100M | Zero |
| Weaviate | 15ms | 800 QPS | 500M | Medium |
| Qdrant | 8ms | 1500 QPS | 1B | Medium |
| Milvus | 12ms | 1200 QPS | 10B+ | High |
Decision Framework¶
```
Scale Requirement?
├── <100M vectors
│   └── Team expertise?
│       ├── Want zero ops        → Pinecone
│       ├── Need hybrid search   → Weaviate
│       └── Need max performance → Qdrant
└── >100M vectors
    ├── Have GPU infrastructure → Milvus
    └── No GPU → Qdrant cluster
```
Cost Comparison (100M vectors, 1536 dims)¶
| Database | Monthly Cost | Notes |
|---|---|---|
| Pinecone | ~$5,000 | Serverless, pay per query |
| Weaviate Cloud | ~$3,500 | Managed service |
| Qdrant Cloud | ~$2,500 | Managed service |
| Self-hosted | ~$1,000 | Hardware + ops cost |
Part 5: Production Considerations¶
Index Selection¶
| Index Type | Build Time | Memory | Recall | Use Case |
|---|---|---|---|---|
| Flat | None | High | 100% | Small datasets, exact search |
| IVF | Medium | Medium | 90-95% | General purpose |
| HNSW | High | High | 95-99% | High recall, low latency |
| DiskANN | Low | Low | 90-95% | Disk-based, large scale |
| GPU_IVF | Medium | GPU RAM | 90-95% | GPU infrastructure |
Embedding Model Selection¶
| Model | Dimensions | Quality (MTEB) | Speed | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Top-5 | Fast | $0.13/1M |
| OpenAI text-embedding-3-small | 1536 | Top-10 | Very fast | $0.02/1M |
| Cohere embed-v3 | 1024 | Top-5 | Fast | $0.10/1M |
| Voyage-3 | 1024 | Top-5 | Fast | $0.12/1M |
| E5-large-v2 | 1024 | Top-10 | Medium | Free |
| BGE-large-en | 1024 | Top-10 | Medium | Free |
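For reference, a minimal embedding call for the OpenAI models in the table (openai-python v1 client; the optional `dimensions` argument truncates Matryoshka-style):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["chunk one", "chunk two"],
    # dimensions=512,  # optional: trade recall for memory
)
vectors = [d.embedding for d in resp.data]  # 1536-dim by default
```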
Optimization Tips¶
- Quantization: 4x memory reduction with <1% recall loss
- Dimensionality reduction: PCA/Matryoshka for lower dims
- Batch operations: Batch upserts/searches for throughput
- Caching: Cache frequent queries with Redis (see the sketch after this list)
- Sharding: Distribute by tenant or time for scale
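A minimal sketch of the caching tip with redis-py, keying on a hash of the normalized query; the `rag:` key prefix and one-hour TTL are arbitrary choices:

```python
import hashlib
import json

import redis

r = redis.Redis()

def cached_search(query: str, search_fn, ttl_s: int = 3600):
    # Normalize, then hash so the key length stays bounded
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    results = search_fn(query)  # results must be JSON-serializable
    r.set(key, json.dumps(results), ex=ttl_s)  # expire to bound staleness
    return results
```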
Interview Questions¶
Conceptual:
- "Чем Self-RAG отличается от CorrectiveRAG?" -- Self-RAG fine-tuned модель с reflection tokens, принимает решения внутри генерации. CorrectiveRAG -- внешний evaluator + fallback, не требует fine-tuning.
- "Когда Adaptive-RAG лучше vanilla RAG?" -- При mixed query complexity: простые вопросы не нуждаются в retrieval (40% экономии), сложные требуют multi-step.
- "Почему RAFT тренирует с distractor documents?" -- В production retrieval неидеален. Модель, обученная только на gold docs, падает с 72% до 45% на noisy retrieval. RAFT сохраняет 71%.
System Design:
- "Вам нужно выбрать vector DB для RAG с 50M документов и <20ms latency. Какую выберете?" -- Qdrant (p50 8ms, 1500 QPS) или Milvus (если нужен GPU). Pinecone не дотягивает по latency (p50 20ms). pgvector исключён (p50 50ms).
- "Как бы вы скомбинировали несколько RAG-техник в production pipeline?" -- CorrectiveRAG (не нужен fine-tuning) + Adaptive-RAG classifier (query routing) + Graph-Enhanced для entity-heavy доменов.
Common Mistakes¶
"Vector search всегда лучше BM25" -- Неправда. BM25 сильнее для exact term matching (код, логи, ID). Hybrid (BM25 + dense + RRF fusion) -- стандарт 2026 с 95%+ adoption.
"Больше чанков в контексте = лучше" -- "Lost in the Middle" эффект: attention quality падает с длиной контекста. Re-ranking + top-K фильтрация критичны.
"Self-RAG = просто добавить if-else перед retrieval" -- Self-RAG это fine-tuned модель с 4 специальными reflection tokens ([Retrieve], [IsREL], [IsSUP], [IsUSE]). Это не prompt engineering.
Sources¶
- American Express — "Beyond Vanilla RAG: 7 Advanced RAG Techniques" (Feb 2026)
- Jishu Labs — "Vector Database Comparison 2026: The Ultimate Guide" (Jan 2026)
- Self-RAG paper (arXiv:2310.11511, Asai et al., 2023)
- RAFT paper (arXiv:2403.10131, Zhang et al., 2024)
- Qdrant documentation
- Milvus documentation