
RAG Techniques and Vector Databases

~8 min read

Source: American Express, Jishu Labs · Type: rag / vector-databases / architecture · Date: February 2026 · Collected: Ralph Research, PHASE 5


Prerequisite: Chunking Strategies

Why This Matters

Vanilla RAG (retrieve-and-generate) breaks down in production: once a collection grows past 100K+ documents, retrieval quality degrades, irrelevant chunks poison the context, and a static pipeline cannot adapt to query complexity. Self-RAG adds reflection tokens and lifts faithfulness from 58% to 82%. CorrectiveRAG classifies retrieval quality and falls back to web search on failures. Adaptive-RAG cuts unnecessary retrievals by 40%. Choosing a technique is a trade-off between accuracy, latency, and fine-tuning cost.

Part 1: Beyond Vanilla RAG - 7 Advanced Techniques (2026)

Why Vanilla RAG Fails

Common Issues:

- Retrieval quality degrades with large document collections
- No reasoning about when to retrieve or whether retrieved info is relevant
- Irrelevant chunks poison the context
- Complex queries need multi-step reasoning
- Static retrieval can't adapt to query complexity

Technique 1: Self-RAG (Self-Reflective RAG)

Core Concept: Model learns to reflect on retrieval quality and generation.

Architecture:

Query → LLM (should I retrieve?) → [YES] → Retrieve
                                         → LLM (is this relevant?) → [YES] → Generate
                                                                       → [NO] → Re-retrieve
                                    [NO] → Generate without retrieval

Key Innovation: the fine-tuned model (Self-RAG-7B) learns special reflection tokens (control flow sketched after this list):

- [Retrieve] - should I retrieve?
- [IsREL] - is the document relevant?
- [IsSUP] - is the generation supported by the document?
- [IsUSE] - is the response useful?
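
A minimal control-flow sketch of those decisions, assuming hypothetical llm(prompt) -> str and retrieve(query) -> list[str] helpers (neither is part of the published Self-RAG code; the real system is a single fine-tuned model that emits the reflection tokens inline during generation):

def self_rag(query: str) -> str:
    # [Retrieve]: decide whether external knowledge is needed at all
    if "yes" in llm(f"Does answering '{query}' require retrieval? yes/no").lower():
        for _ in range(3):  # bounded re-retrieval loop
            docs = retrieve(query)
            # [IsREL]: keep only documents judged relevant to the query
            relevant = [d for d in docs
                        if "yes" in llm(f"Is this passage relevant to '{query}'? {d}").lower()]
            if not relevant:
                continue  # [NO] branch: re-retrieve
            answer = llm(f"Answer '{query}' using only: {relevant}")
            # [IsSUP]: verify the answer is supported by the passages
            if "yes" in llm(f"Is the answer '{answer}' supported by {relevant}?").lower():
                return answer
    # [NO] branch, or repeated retrieval failure: generate without retrieval
    return llm(query)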

Performance:

| Metric | Vanilla RAG | Self-RAG |
|--------|-------------|----------|
| Accuracy | 65% | 76% |
| Faithfulness | 58% | 82% |

Best For: High-stakes Q&A, medical/legal applications


Technique 2: ActiveRAG

Core Concept: Dual-tasking with Chain-of-Thought and knowledge construction.

Process:

1. Query + Retrieved Docs
2. CoT Reasoning → "What do I know? What do I need?"
3. Knowledge Construction → Build understanding from chunks
4. Final Answer Generation

Key Features (see the sketch after this list):

- Explicitly builds a knowledge graph from retrieved chunks
- Identifies conflicts between sources
- Confidence scoring for each piece of information
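
A sketch of the four-step process above, reusing the same placeholder llm()/retrieve() helpers as in the Self-RAG sketch; the knowledge-construction step is simplified to a consolidated note rather than an explicit graph:

def active_rag(query: str) -> str:
    docs = retrieve(query)
    # Step 2: CoT reasoning about what is known and what is missing
    reasoning = llm(
        f"Question: {query}\nDocuments: {docs}\n"
        "Think step by step: what do I already know, and what do I still need?"
    )
    # Step 3: knowledge construction -- merge chunks, flag conflicts,
    # attach a confidence judgment to each claim
    knowledge = llm(
        f"Documents: {docs}\n"
        "Consolidate the relevant facts. Mark conflicting statements "
        "and rate your confidence in each fact (high/medium/low)."
    )
    # Step 4: final answer grounded in the constructed knowledge
    return llm(
        f"Question: {query}\nReasoning: {reasoning}\n"
        f"Knowledge: {knowledge}\nAnswer:"
    )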

Best For: Research synthesis, multi-document summarization


Technique 3: Chain-of-Note (CoN)

Core Concept: Generate explanatory notes for each retrieved document.

Architecture:

Query + Doc 1 → Note 1 → "Document 1 discusses X, which relates to query because..."
Query + Doc 2 → Note 2 → "Document 2 contradicts Doc 1 on Y..."
Query + Doc 3 → Note 3 → "Document 3 provides additional context Z..."

All Notes → Final Answer

Benefits (see the sketch after this list):

- Forces the model to engage with each document
- Surfaces conflicts and contradictions
- Creates an audit trail for answer derivation
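
A prompting sketch of the note-then-answer flow; since the published CoN fine-tunes a model on note generation, plain prompting like this is only an approximation (llm()/retrieve() are the usual placeholders):

def chain_of_note(query: str) -> str:
    docs = retrieve(query)
    notes = []
    for i, doc in enumerate(docs, start=1):
        # One explanatory note per document, aware of the notes so far
        notes.append(llm(
            f"Query: {query}\nDocument {i}: {doc}\nPrevious notes: {notes}\n"
            "Write a note: what does this document claim, does it answer "
            "the query, and does it agree or conflict with earlier notes?"
        ))
    # The notes double as an audit trail for how the answer was derived
    return llm(f"Query: {query}\nNotes: {notes}\nFinal answer:")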

Fine-tuning Required: Yes, on note-generation task

Best For: Fact-checking, comparison tasks, citation-heavy work


Technique 4: RAFT (Retrieval-Augmented Fine-Tuning)

Core Concept: Train model to handle irrelevant context gracefully.

Training Process:

For each training example:
  1. Include gold document
  2. Include 2-4 distractor documents (irrelevant)
  3. Model must identify and use only relevant info
  4. Chain-of-thought reasoning required

Key Insight:

"Models trained only on perfect retrieval fail when retrieval is imperfect. RAFT prepares for real-world noisy retrieval."

Performance Improvement:

| Setting | Baseline | RAFT |
|---------|----------|------|
| Clean retrieval | 72% | 78% |
| Noisy retrieval | 45% | 71% |

Best For: Production RAG systems with imperfect retrieval


Technique 5: CorrectiveRAG (CRAG)

Core Concept: Retrieval evaluator with fallback strategies.

Flow:

graph TD
    Q["Query"] --> RET["Retrieve"]
    RET --> EVAL["Evaluator"]
    EVAL -->|"Correct"| USE["Use as-is"]
    EVAL -->|"Ambiguous"| MIX["+ Web Search"]
    EVAL -->|"Incorrect"| WEB["Web Search Only"]
    USE --> GEN["Generate"]
    MIX --> GEN
    WEB --> GEN

    style Q fill:#e8eaf6,stroke:#3f51b5
    style EVAL fill:#fff3e0,stroke:#ef6c00
    style USE fill:#e8f5e9,stroke:#4caf50
    style MIX fill:#fff3e0,stroke:#ef6c00
    style WEB fill:#fce4ec,stroke:#c62828
    style GEN fill:#f3e5f5,stroke:#9c27b0

Evaluator Types:

1. Threshold-based: similarity score above a threshold
2. LLM-based: ask the model "Is this relevant?"
3. Fine-tuned classifier: binary relevant/irrelevant

Fallback Options (routing sketched after this list):

- Web search (Google, Bing, Tavily)
- Knowledge graph traversal
- Direct LLM generation (no retrieval)
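
A routing sketch tying the evaluator to the fallbacks. Here score_relevance() stands in for any of the three evaluator types, web_search() for a Tavily/Bing-style client, and the 0.7/0.4 thresholds are arbitrary placeholders:

def corrective_rag(query: str) -> str:
    docs = retrieve(query)
    best = max((score_relevance(query, d) for d in docs), default=0.0)
    if best >= 0.7:                    # "Correct": use retrieval as-is
        context = docs
    elif best >= 0.4:                  # "Ambiguous": augment with web search
        context = docs + web_search(query)
    else:                              # "Incorrect": discard retrieval entirely
        context = web_search(query)
    return llm(f"Answer '{query}' using: {context}")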

Best For: Production systems needing robustness


Technique 6: Adaptive-RAG

Core Concept: Query complexity classifier determines retrieval strategy.

Classification Levels:

| Complexity | Strategy | Example |
|------------|----------|---------|
| Low | No retrieval | "What is 2+2?" |
| Medium | Single retrieval | "What is the capital of France?" |
| High | Multi-step RAG | "Compare Q1 2025 revenue across all competitors" |

Classifier Training:

# Train on query-strategy pairs
examples = [
    ("What is Python?", "no_retrieval"),
    ("What is Django 5.0 features?", "single_retrieval"),
    ("Compare FastAPI vs Django performance 2025", "multi_retrieval"),
]
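
One way to turn those pairs into a router, sketched with scikit-learn; the classifier choice is ours, not the paper's, and llm()/retrieve() are the usual placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

queries, labels = zip(*examples)  # the query-strategy pairs above
router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(queries, labels)

def adaptive_rag(query: str) -> str:
    strategy = router.predict([query])[0]
    if strategy == "no_retrieval":
        return llm(query)
    if strategy == "single_retrieval":
        return llm(f"Answer '{query}' using: {retrieve(query)}")
    # multi_retrieval: iterative retrieve-reason loop, sketched as two rounds
    docs = retrieve(query)
    followup = llm(f"Given {docs}, what sub-question remains for '{query}'?")
    docs += retrieve(followup)
    return llm(f"Answer '{query}' using: {docs}")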

Efficiency Gains:

- 40% fewer unnecessary retrievals
- 25% latency reduction for simple queries
- 15% accuracy improvement on complex queries

Best For: High-throughput systems with mixed query types


Technique 7: Graph-Enhanced RAG

Core Concept: Knowledge graphs + vector retrieval + graph traversal.

Architecture:

Documents → Chunks → Embeddings → Vector Store
                   → Entities → Knowledge Graph

Query → Vector Search (semantic) ──────────────┐
      → Entity Extraction → Graph Traversal ───┤
                                               └─→ Combined Context → Generation

Graph Types:

1. Entity-Relationship: nodes = entities, edges = relations
2. Document-Chunk: nodes = docs/chunks, edges = references
3. Concept-Hierarchy: nodes = concepts, edges = "is-a", "part-of"

Traversal Strategies (sketched after this list):

- 1-hop: direct neighbors only
- 2-hop: neighbors of neighbors
- Weighted: follow high-confidence edges
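
A traversal sketch over an already-built entity graph using networkx; graph construction and entity extraction are assumed to happen upstream:

import networkx as nx

def graph_context(graph: nx.Graph, entities: list[str],
                  hops: int = 2, min_weight: float = 0.5) -> set[str]:
    """Collect nodes reachable within `hops`, following only edges whose
    confidence weight clears min_weight (the weighted strategy)."""
    frontier, seen = set(entities), set(entities)
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            if node not in graph:
                continue  # extracted entity missing from the graph
            for nbr in graph.neighbors(node):
                if graph[node][nbr].get("weight", 1.0) >= min_weight:
                    nxt.add(nbr)
        frontier = nxt - seen
        seen |= frontier
    return seen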

Performance:

| Task | Vector Only | Graph-Enhanced |
|------|-------------|----------------|
| Multi-hop reasoning | 52% | 74% |
| Entity-focused QA | 68% | 85% |
| Relationship queries | 45% | 79% |

Best For: Knowledge-heavy domains, entity-centric queries


Part 2: RAG Technique Selection Guide

Summary Table

| Technique | Fine-tuning Required | Complexity | Best Use Case |
|-----------|----------------------|------------|---------------|
| Self-RAG | Yes (Self-RAG-7B) | High | High-stakes Q&A |
| ActiveRAG | Optional | Medium | Research synthesis |
| Chain-of-Note | Yes | Medium | Fact-checking |
| RAFT | Yes | Medium | Noisy retrieval environments |
| CorrectiveRAG | No | Low | Production robustness |
| Adaptive-RAG | Yes (classifier) | Medium | Mixed query complexity |
| Graph-Enhanced | No (graph construction) | High | Knowledge domains |

Decision Tree

Is retrieval quality critical?
├── Yes → Can you fine-tune?
│         ├── Yes → Self-RAG
│         └── No → CorrectiveRAG
└── No → Is query complexity variable?
          ├── Yes → Adaptive-RAG
          └── No → Vanilla RAG + evaluation

Part 3: Vector Database Comparison 2026

Overview

| Database | Type | Language | Key Strength |
|----------|------|----------|--------------|
| Pinecone | Managed | - | Zero ops, enterprise-ready |
| Weaviate | Self-hosted/Managed | Go | Hybrid search, GraphQL |
| Qdrant | Self-hosted | Rust | Performance, filtering |
| Milvus | Self-hosted | Go | Scale, GPU support |

Detailed Comparison

Pinecone

Architecture: Fully managed, serverless or pod-based

Key Features:

- Zero infrastructure management
- Automatic scaling
- Built-in monitoring
- Namespace isolation for multi-tenancy

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 20ms |
| p99 latency | 50ms |
| Throughput | 500 QPS |

Pricing:

- Serverless: $0.09/GB + query costs
- Pod-based: $0.12/hour per pod

Code Example:

from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")

# Upsert
index.upsert(vectors=[
    {"id": "id1", "values": [0.1, 0.2, ...], "metadata": {"text": "document 1"}},
    {"id": "id2", "values": [0.3, 0.4, ...], "metadata": {"text": "document 2"}},
])

# Query
results = index.query(
    vector=[0.15, 0.25, ...],
    top_k=10,
    include_metadata=True
)

Best For: Teams wanting zero ops, enterprise deployments


Weaviate

Architecture: Self-hosted or managed cloud

Key Features:

- GraphQL API
- Hybrid search (vector + keyword)
- Built-in vectorization modules
- Cross-references between objects
- Real-time semantic search

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 15ms |
| p99 latency | 40ms |
| Throughput | 800 QPS |

Hybrid Search:

# Vector + BM25 combined (weaviate-client v3 query builder)
import weaviate

client = weaviate.Client("http://localhost:8080")  # default local endpoint

result = client.query.get("Document", ["title", "content"]) \
    .with_hybrid(
        query="machine learning",
        alpha=0.5  # 0 = keyword only, 1 = vector only
    ) \
    .with_limit(10) \
    .do()

GraphQL Example:

{
  Get {
    Document(nearText: {concepts: ["AI safety"]}) {
      title
      content
      _additional {
        certainty
        distance
      }
    }
  }
}

Best For: Hybrid search needs, GraphQL-first teams


Qdrant

Architecture: Self-hosted, Rust-native

Key Features:

- Highest throughput among open-source options
- Rich filtering capabilities
- Quantization support (scalar, product, binary)
- Distributed mode available
- Payload-based filtering

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 8ms |
| p99 latency | 25ms |
| Throughput | 1500 QPS |

Rich Filtering:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient(":memory:")

# Search with a payload filter: category == "ml" AND year >= 2024
results = client.search(
    collection_name="documents",
    query_vector=[0.1, 0.2, ...],
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="ml")
            ),
            FieldCondition(
                key="year",
                range=Range(gte=2024)
            )
        ]
    ),
    limit=10
)

Quantization:

# Enable scalar quantization for ~4x memory reduction
from qdrant_client import models

client.update_collection(
    collection_name="documents",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    )
)

Best For: High-throughput, low-latency requirements


Milvus

Architecture: Self-hosted, designed for scale

Key Features:

- Handles billions of vectors
- GPU index support
- Multiple index types (IVF, HNSW, DiskANN)
- Data persistence with object storage
- Multi-tenancy via partitioning

Performance:

| Metric | Value |
|--------|-------|
| p50 latency | 12ms |
| p99 latency | 35ms |
| Throughput | 1200 QPS |
| Max vectors | 10B+ |

GPU-Accelerated Index:

from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")

# Create collection with GPU index
collection = Collection("documents")
index_params = {
    "metric_type": "COSINE",
    "index_type": "GPU_IVF_FLAT",
    "params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)

Partitioning for Multi-tenancy:

# Create partition per tenant
collection.create_partition("tenant_1")
collection.create_partition("tenant_2")

# Search within partition
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 10}},
    limit=10,
    partition_names=["tenant_1"]  # Tenant isolation
)

Best For: Large-scale deployments, GPU infrastructure


Part 4: Vector Database Selection Guide

Benchmark Summary (February 2026)

| Database | Latency (p50) | Throughput | Scale | Ops Burden |
|----------|---------------|------------|-------|------------|
| Pinecone | 20ms | 500 QPS | 100M | Zero |
| Weaviate | 15ms | 800 QPS | 500M | Medium |
| Qdrant | 8ms | 1500 QPS | 1B | Medium |
| Milvus | 12ms | 1200 QPS | 10B+ | High |

Decision Framework

Scale Requirement?
├── <100M vectors
│   └── Team expertise?
│       ├── Want zero ops → Pinecone
│       ├── Need hybrid search → Weaviate
│       └── Need max performance → Qdrant
└── >100M vectors
    ├── Have GPU infrastructure → Milvus
    └── No GPU → Qdrant cluster

Cost Comparison (100M vectors, 1536 dims)

| Database | Monthly Cost | Notes |
|----------|--------------|-------|
| Pinecone | ~$5,000 | Serverless, pay per query |
| Weaviate Cloud | ~$3,500 | Managed service |
| Qdrant Cloud | ~$2,500 | Managed service |
| Self-hosted | ~$1,000 | Hardware + ops cost |

Part 5: Production Considerations

Index Selection

| Index Type | Build Time | Memory | Recall | Use Case |
|------------|------------|--------|--------|----------|
| Flat | None | High | 100% | Small datasets, exact search |
| IVF | Medium | Medium | 90-95% | General purpose |
| HNSW | High | High | 95-99% | High recall, low latency |
| DiskANN | Low | Low | 90-95% | Disk-based, large scale |
| GPU_IVF | Medium | GPU RAM | 90-95% | GPU infrastructure |

Embedding Model Selection

| Model | Dimensions | Quality (MTEB) | Speed | Cost |
|-------|------------|----------------|-------|------|
| OpenAI text-embedding-3-large | 3072 | Top-5 | Fast | $0.13/1M |
| OpenAI text-embedding-3-small | 1536 | Top-10 | Very fast | $0.02/1M |
| Cohere embed-v3 | 1024 | Top-5 | Fast | $0.10/1M |
| Voyage-3 | 1024 | Top-5 | Fast | $0.12/1M |
| E5-large-v2 | 1024 | Top-10 | Medium | Free |
| BGE-large-en | 1024 | Top-10 | Medium | Free |

Optimization Tips

  1. Quantization: 4x memory reduction with <1% recall loss
  2. Dimensionality reduction: PCA/Matryoshka for lower dims
  3. Batch operations: Batch upserts/searches for throughput
  4. Caching: Cache frequent queries with Redis (sketched after this list)
  5. Sharding: Distribute by tenant or time for scale
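
A sketch of tip 4, assuming redis-py and a search() placeholder for the vector-DB call; the key scheme and the one-hour TTL are arbitrary choices:

import hashlib
import json

import redis

r = redis.Redis()

def cached_search(query: str, ttl: int = 3600):
    # Normalize the query so trivial variants share a cache entry
    key = "ragcache:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    results = search(query)  # placeholder: the actual vector-DB query
    r.setex(key, ttl, json.dumps(results))
    return results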

Interview Questions

Conceptual:

  1. "Чем Self-RAG отличается от CorrectiveRAG?" -- Self-RAG fine-tuned модель с reflection tokens, принимает решения внутри генерации. CorrectiveRAG -- внешний evaluator + fallback, не требует fine-tuning.
  2. "Когда Adaptive-RAG лучше vanilla RAG?" -- При mixed query complexity: простые вопросы не нуждаются в retrieval (40% экономии), сложные требуют multi-step.
  3. "Почему RAFT тренирует с distractor documents?" -- В production retrieval неидеален. Модель, обученная только на gold docs, падает с 72% до 45% на noisy retrieval. RAFT сохраняет 71%.

System Design:

  1. "Вам нужно выбрать vector DB для RAG с 50M документов и <20ms latency. Какую выберете?" -- Qdrant (p50 8ms, 1500 QPS) или Milvus (если нужен GPU). Pinecone не дотягивает по latency (p50 20ms). pgvector исключён (p50 50ms).
  2. "Как бы вы скомбинировали несколько RAG-техник в production pipeline?" -- CorrectiveRAG (не нужен fine-tuning) + Adaptive-RAG classifier (query routing) + Graph-Enhanced для entity-heavy доменов.

Common Mistakes

"Vector search always beats BM25" -- False. BM25 is stronger for exact term matching (code, logs, IDs). Hybrid retrieval (BM25 + dense + RRF fusion) is the 2026 standard with 95%+ adoption.

"More chunks in context = better" -- The "Lost in the Middle" effect: attention quality degrades as context grows. Re-ranking plus top-K filtering is critical.

"Self-RAG = just an if-else before retrieval" -- Self-RAG is a fine-tuned model with four special reflection tokens ([Retrieve], [IsREL], [IsSUP], [IsUSE]). It is not prompt engineering.


Sources

  1. American Express — "Beyond Vanilla RAG: 7 Advanced RAG Techniques" (Feb 2026)
  2. Jishu Labs — "Vector Database Comparison 2026: The Ultimate Guide" (Jan 2026)
  3. Self-RAG paper (arXiv:2310.11511, Asai et al., 2023)
  4. RAFT paper (arXiv:2403.10131, Zhang et al., 2024)
  5. Qdrant documentation
  6. Milvus documentation