
LLMOps Cost Optimization

~4 minute read

Prerequisites: LLM API Pricing, LLMOps vs MLOps

Related file: LLM Cascade Routing -- routing architectures (rules/cascade/learned), fallback strategies, gateway tools, mermaid diagrams of the cascade flow

An application sending 10K requests/day to GPT-4o spends ~$1,500/month on the API. The same 10K requests with semantic caching (73% cost reduction) plus model routing (55% savings on what remains) come to $375/month -- a saving of $1,125. At 100K requests/day the gap reaches $12K/month. Three key levers: (1) semantic caching -- 96.9% latency reduction and a 60-85% hit rate on repetitive query patterns; (2) intelligent routing -- an ML-based router sends simple queries to GPT-4o-mini ($0.15/M) instead of GPT-4o ($2.50/M) with no loss of accuracy; (3) continuous batching -- +100% throughput. Without these techniques, inference cost grows linearly with the number of users.


Overview

Key Optimization Metrics

  • Semantic Caching: 73% cost reduction, 96.9% latency reduction
  • Model Routing: match query complexity to model capability
  • Batch Processing: +50-100% throughput, depending on batching strategy

Part 2: LLMOps vs MLOps

Key Differences

| Aspect | MLOps | LLMOps |
|---|---|---|
| Primary Artifacts | Model weights, training data | Prompts, RAG databases, guardrails |
| Evaluation | Accuracy, F1, AUC | Hallucination rate, relevance, safety |
| Deployment | Model serving | API routing, prompt versioning |
| Monitoring | Data drift, model decay | Token usage, latency, cost per query |
| Iteration Speed | Days to weeks | Hours to days |

New LLMOps Concerns

  1. Prompt Management — Version control, A/B testing
  2. Token Budgeting — Per-user, per-feature limits
  3. Guardrails — Input/output filtering
  4. RAG Pipelines — Chunking, embedding, retrieval
  5. Multi-Model Orchestration — Routing, fallbacks

Part 3: Semantic Caching

Also covered there: pipeline diagram, when-to-use matrix, production config -- see LLM Cascade Routing (section 2).

3.1 How It Works

graph TD
    Q["User Query"] --> E["Exact Match<br/>Cache Check<br/>(< 1ms)"]
    E -->|Hit| R1["Cached Response"]
    E -->|Miss| S["Semantic Match<br/>Cache Check<br/>(10-50ms)"]
    S -->|Hit| R2["Cached Response"]
    S -->|Miss| L["LLM Call<br/>(500-2000ms)"]

    style Q fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8f5e9,stroke:#4caf50
    style S fill:#fff3e0,stroke:#ef6c00
    style L fill:#fce4ec,stroke:#c62828
    style R1 fill:#e8f5e9,stroke:#4caf50
    style R2 fill:#e8f5e9,stroke:#4caf50

3.2 Performance Metrics

| Metric | Value |
|---|---|
| Cache Hit Rate | 60-85% (with semantic caching) |
| Cost Reduction | Up to 73% |
| Latency Reduction | 96.9% for cache hits |
| Exact Match Latency | < 1ms |
| Semantic Match Latency | 10-50ms |

3.3 Implementation Pattern

import redis
import numpy as np
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.redis = redis.Redis(host='localhost', port=6379)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = threshold  # minimum cosine similarity for a hit

    async def get_or_generate(self, query: str, llm_callable):
        # 1. Exact match check (< 1ms)
        exact_key = f"exact:{query}"
        cached = self.redis.get(exact_key)
        if cached:
            return cached.decode(), "exact_hit"

        # 2. Semantic match check: KNN search over a RediSearch vector
        #    index named "idx:embeddings" (10-50ms)
        query_embedding = self.encoder.encode(query).astype(np.float32)
        knn = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .sort_by("score")
            .return_fields("response", "score")
            .dialect(2)
        )
        similar = self.redis.ft("idx:embeddings").search(
            knn, query_params={"vec": query_embedding.tobytes()}
        )

        # With a COSINE index the returned score is a distance,
        # so similarity = 1 - score
        if similar.total:
            similarity = 1 - float(similar.docs[0].score)
            if similarity >= self.threshold:
                return similar.docs[0].response, "semantic_hit"

        # 3. LLM call + cache the response for future hits
        response = await llm_callable(query)
        self.cache_response(query, query_embedding, response)
        return response, "llm_call"

    def cache_response(self, query: str, embedding: np.ndarray, response: str):
        # Assumes "idx:embeddings" indexes hashes with the "sem:" prefix
        self.redis.set(f"exact:{query}", response)
        self.redis.hset(f"sem:{query}", mapping={
            "response": response,
            "embedding": embedding.tobytes(),
        })

3.4 Threshold Tuning

| Threshold | Precision | Recall | Use Case |
|---|---|---|---|
| 0.99 | Very High | Low | Exact FAQs |
| 0.95 | High | Medium | General Q&A |
| 0.90 | Medium | High | Broad similarity |
| 0.85 | Low | Very High | Exploratory search |
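
The right threshold is workload-specific, so it helps to sweep candidate values offline against a labeled set of query pairs. The sketch below assumes a labeled_pairs list of (query_a, query_b, same_intent) tuples -- a hypothetical dataset, not part of the setup above -- and reuses the same all-MiniLM-L6-v2 encoder:

from sentence_transformers import SentenceTransformer, util

def sweep_thresholds(labeled_pairs, thresholds=(0.85, 0.90, 0.95, 0.99)):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    # Score every labeled pair once, then evaluate each threshold.
    scored = []
    for query_a, query_b, same_intent in labeled_pairs:
        sim = util.cos_sim(
            encoder.encode(query_a), encoder.encode(query_b)
        ).item()
        scored.append((sim, same_intent))

    results = {}
    for t in thresholds:
        tp = sum(1 for sim, same in scored if sim >= t and same)
        fp = sum(1 for sim, same in scored if sim >= t and not same)  # wrong cached answer served
        fn = sum(1 for sim, same in scored if sim < t and same)       # reusable answer missed
        results[t] = {
            "precision": tp / (tp + fp) if (tp + fp) else 1.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        }
    return results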

Part 4: Intelligent Model Routing

In depth: routing architectures (rules, cascade, learned routers), fallback strategies, gateway tools -- see LLM Cascade Routing.

4.1 Router Implementation

class ModelRouter:
    def __init__(self):
        self.models = {
            "simple": "gpt-4o-mini",         # $0.15/1M input tokens
            "medium": "gpt-4o",              # $2.50/1M input tokens
            "complex": "claude-sonnet-4",    # $3/1M input tokens
            "specialized": "claude-opus-4",  # $15/1M input tokens
        }

    def route(self, query: str, context: dict) -> tuple[str, str]:
        # analyze_complexity returns a 0-1 score; a heuristic sketch
        # is given in section 4.2 below.
        complexity = self.analyze_complexity(query, context)

        if complexity < 0.3:
            return self.models["simple"], "simple"
        elif complexity < 0.6:
            return self.models["medium"], "medium"
        elif complexity < 0.8:
            return self.models["complex"], "complex"
        else:
            return self.models["specialized"], "specialized"

4.2 Complexity Signals

| Signal | Weight | Detection |
|---|---|---|
| Query Length | 0.1 | Token count > 100 |
| Multi-step Reasoning | 0.3 | "then", "after", "first", "finally" |
| Code Generation | 0.2 | "write code", "implement", "function" |
| Domain Knowledge | 0.2 | Technical terms, jargon |
| Ambiguity | 0.2 | Multiple interpretations possible |
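
A minimal heuristic scorer matching the weights above might look like the sketch below; the keyword lists and the domain_specific / ambiguous context flags are illustrative placeholders (detecting jargon and ambiguity reliably usually needs a small ML classifier):

REASONING_MARKERS = ("then", "after", "first", "finally")
CODE_MARKERS = ("write code", "implement", "function")

def analyze_complexity(query: str, context: dict) -> float:
    """Weighted 0-1 complexity score consumed by ModelRouter.route()."""
    score = 0.0
    q = query.lower()
    if len(query.split()) > 100:                  # Query Length (0.1)
        score += 0.1
    if any(m in q for m in REASONING_MARKERS):    # Multi-step Reasoning (0.3)
        score += 0.3
    if any(m in q for m in CODE_MARKERS):         # Code Generation (0.2)
        score += 0.2
    if context.get("domain_specific"):            # Domain Knowledge (0.2)
        score += 0.2
    if context.get("ambiguous"):                  # Ambiguity (0.2)
        score += 0.2
    return min(score, 1.0)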

4.3 Key Numbers

| Metric | Value |
|---|---|
| ML-based router accuracy | 86% (vs 85% for the single best model) |
| ML-based router cost reduction | 55% |
| Routing overhead (rule-based) | < 1ms |
| Routing overhead (ML classifier) | 10-20ms |

Part 5: Batch Processing

5.1 Types of Batching

| Type | Description | Use Case |
|---|---|---|
| Static Batching | Wait for N requests | Offline processing |
| Continuous Batching | Add/remove requests dynamically | Real-time serving |
| Multi-bin Batching | Group by token length | Production serving |

5.2 Throughput Improvement

Single Request:        100 queries/sec
Static Batching:       150 queries/sec (+50%)
Continuous Batching:   200 queries/sec (+100%)
Multi-bin Batching:    170 queries/sec (+70%)

5.3 Implementation (vLLM-style)

import asyncio

class ContinuousBatcher:
    def __init__(self, max_batch_size: int = 32):
        self.queue = asyncio.Queue()
        self.max_batch_size = max_batch_size

    async def submit(self, request):
        # Each caller gets a future that process_batch() resolves
        # once its batch has been generated.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request, future))
        return await future

    async def process_batch(self):
        # Collect up to max_batch_size requests, waiting at most 10ms
        # for stragglers before dispatching whatever has arrived.
        batch = []
        while len(batch) < self.max_batch_size:
            try:
                item = await asyncio.wait_for(
                    self.queue.get(), timeout=0.01
                )
                batch.append(item)
            except asyncio.TimeoutError:
                break

        if batch:
            # llm_generate_batch is the backend batched-inference call
            # (e.g. a vLLM engine); it is not defined in this sketch.
            responses = await self.llm_generate_batch(
                [r for r, _ in batch]
            )
            for (_, future), response in zip(batch, responses):
                future.set_result(response)
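
A minimal usage sketch: one background task drains the queue while callers await submit() concurrently. fake_generate_batch is a stand-in for the real batched-inference backend:

async def main():
    batcher = ContinuousBatcher(max_batch_size=32)

    async def fake_generate_batch(requests):
        # Placeholder backend; a real deployment would call a vLLM engine here
        return [f"response to: {r}" for r in requests]

    batcher.llm_generate_batch = fake_generate_batch

    async def worker():
        while True:
            await batcher.process_batch()

    worker_task = asyncio.create_task(worker())
    answers = await asyncio.gather(
        *(batcher.submit(f"query {i}") for i in range(100))
    )
    worker_task.cancel()
    return answers

asyncio.run(main())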

Part 6: Multi-Layer Caching Architecture

6.1 Three-Tier Cache

graph TD
    L1["L1: Local Memory (LRU)<br/>< 1ms | Hit Rate: 20-30%"]
    L1 -->|Miss| L2["L2: Redis Exact Match<br/>1-5ms | Hit Rate: 30-40%"]
    L2 -->|Miss| L3["L3: Vector DB Semantic Match<br/>10-50ms | Hit Rate: 20-30%"]
    L3 -->|Miss| API["LLM API Call<br/>500-2000ms"]

    style L1 fill:#e8f5e9,stroke:#4caf50
    style L2 fill:#e8eaf6,stroke:#3f51b5
    style L3 fill:#fff3e0,stroke:#ef6c00
    style API fill:#fce4ec,stroke:#c62828

Combined Hit Rate: 70-85%
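
A sketch of the lookup order, assuming the SemanticCache from Part 3 serves as the L3 layer and the L1 tier is a plain in-process LRU built on OrderedDict; the class and parameter names are illustrative:

from collections import OrderedDict

class ThreeTierCache:
    def __init__(self, redis_client, semantic_cache, l1_size: int = 1024):
        self.l1 = OrderedDict()          # L1: in-process LRU
        self.l1_size = l1_size
        self.redis = redis_client        # L2: exact match
        self.semantic = semantic_cache   # L3: semantic match + LLM fallback

    async def get_or_generate(self, query: str, llm_callable):
        # L1: local memory, < 1ms
        if query in self.l1:
            self.l1.move_to_end(query)
            return self.l1[query], "l1_hit"

        # L2: Redis exact match, 1-5ms
        cached = self.redis.get(f"exact:{query}")
        if cached:
            response, source = cached.decode(), "l2_hit"
        else:
            # L3: semantic match (10-50ms) or LLM call (500-2000ms)
            response, source = await self.semantic.get_or_generate(query, llm_callable)

        # Promote into L1 and evict the oldest entry if over capacity
        self.l1[query] = response
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)
        return response, source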

6.2 Cache Invalidation Strategies

| Strategy | When to Use |
|---|---|
| TTL-based | News, time-sensitive data |
| Version-based | Prompt updates, model changes |
| Manual | Error responses, feedback |
| LRU | Memory-constrained |
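
Version-based invalidation is usually the cheapest to operate: the prompt/model version becomes part of the cache key, so bumping it orphans old entries without an explicit flush. A small sketch (the constants are illustrative):

import hashlib

PROMPT_VERSION = "v7"            # bump on prompt or model changes
MODEL_NAME = "gpt-4o-mini"
TTL_BY_CONTENT = {"faq": 24 * 3600, "news": 3600}   # TTL-based invalidation

def versioned_cache_key(query: str) -> str:
    # Old-version entries stop being read and age out via TTL / LRU
    digest = hashlib.sha256(query.encode()).hexdigest()
    return f"cache:{MODEL_NAME}:{PROMPT_VERSION}:{digest}"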

Part 7: Cost Projection Model

7.1 Monthly Cost Calculator

def calculate_monthly_cost(
    queries_per_day: int,
    avg_tokens_per_query: int,
    model_cost_per_1m: float,
    cache_hit_rate: float = 0.0,
    routing_savings: float = 0.0
) -> dict:
    daily_tokens = queries_per_day * avg_tokens_per_query
    daily_cost = (daily_tokens / 1_000_000) * model_cost_per_1m

    # Apply optimizations
    optimized_cost = daily_cost * (1 - cache_hit_rate) * (1 - routing_savings)

    return {
        "daily_cost": optimized_cost,
        "monthly_cost": optimized_cost * 30,
        "savings": (daily_cost - optimized_cost) * 30
    }
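
For instance, the "Medium App" row below is reproduced under the assumption of ~2,000 tokens per query at GPT-4o's $2.50/M input price, with a 50% effective cache hit rate and 50% routing savings on the remainder:

print(calculate_monthly_cost(
    queries_per_day=10_000,
    avg_tokens_per_query=2_000,
    model_cost_per_1m=2.50,
    cache_hit_rate=0.5,
    routing_savings=0.5,
))
# {'daily_cost': 12.5, 'monthly_cost': 375.0, 'savings': 1125.0}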

7.2 Example Calculations

| Scenario | Queries/Day | Baseline/Month | Optimized/Month | Savings |
|---|---|---|---|---|
| Small App | 1,000 | $150 | $45 | $105 (70%) |
| Medium App | 10,000 | $1,500 | $375 | $1,125 (75%) |
| Large App | 100,000 | $15,000 | $3,000 | $12,000 (80%) |

Part 8: Interview-Relevant Numbers

Cache Statistics

| Metric | Value |
|---|---|
| Semantic cache hit rate | 60-85% |
| Cost reduction from caching | Up to 73% |
| Latency reduction for hits | 96.9% |
| Exact match latency | < 1ms |
| Semantic match latency | 10-50ms |

Routing Statistics

| Metric | Value |
|---|---|
| Router latency overhead | 5-20ms |
| Cost reduction from routing | 40-60% |
| Accuracy parity on RouterBench | Yes (86% vs 85% single best model) |

Batching Statistics

| Metric | Value |
|---|---|
| Static batching improvement | +50% |
| Continuous batching improvement | +100% |
| Multi-bin batching improvement | +70% |

Misconception: a semantic cache can be switched on and forgotten

A semantic cache needs continuous monitoring: (1) false positive hits -- the cache returns an answer to a similar-looking but semantically different query (a 0.95 threshold will not save you from "Pro pricing" vs "Enterprise pricing"). (2) Stale responses -- the data goes out of date while the default TTL is too long. (3) Cache poisoning -- a single bad LLM response gets cached and served over and over. Cache quality monitoring is required: sample 1-5% of cached responses into the evaluation pipeline.
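
A minimal sampling hook for that last point, assuming some downstream eval_queue object that feeds the evaluation pipeline (both names are placeholders):

import random

def maybe_sample_for_eval(query: str, cached_response: str, eval_queue, rate: float = 0.02):
    # Forward ~2% of semantic-cache hits to the eval pipeline so false
    # positives, stale answers and poisoned entries get caught.
    if random.random() < rate:
        eval_queue.put({
            "query": query,
            "response": cached_response,
            "source": "semantic_cache",
        })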

Misconception: model routing saves money with no quality loss on every task

RouterBench shows an ML-based router keeping 86% accuracy at a 55% cost reduction -- but that is an average. On tasks that need nuanced reasoning (legal analysis, hard math), routing to a budget model causes a catastrophic quality drop. A rule-based router with explicit exceptions for critical task types is more reliable than a pure ML classifier. Always A/B test routing decisions on quality metrics, not just cost.
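
One way to encode such exceptions is a thin rule layer in front of the learned router; the task types and keywords below are illustrative:

CRITICAL_TASK_TYPES = {"legal_analysis", "math_proof"}
CRITICAL_KEYWORDS = ("liability", "contract clause", "prove that")

def route_with_exceptions(query: str, context: dict, router: ModelRouter) -> str:
    # Explicit rules win over the ML classifier for tasks where a
    # budget model is known to degrade badly.
    if context.get("task_type") in CRITICAL_TASK_TYPES:
        return router.models["specialized"]
    if any(kw in query.lower() for kw in CRITICAL_KEYWORDS):
        return router.models["complex"]
    model, _tier = router.route(query, context)
    return model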

Misconception: batch processing always beats real-time

The Batch API gives a 50% discount but with up to 24-hour latency (OpenAI). For user-facing applications that is a non-starter. Continuous batching (vLLM) delivers +100% throughput without added latency -- but requires self-hosting. The right strategy: the Batch API for offline jobs (analytics, content generation), continuous batching for real-time serving; static batching is close to useless in production (it adds latency for the first requests in a batch).
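
A rough sketch of the offline path using OpenAI's Batch API, assuming requests.jsonl already holds one chat-completion request per line:

from openai import OpenAI

client = OpenAI()

# Upload the prepared JSONL of requests, then submit a batch job with
# the 24-hour completion window (priced at roughly half of synchronous calls).
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)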


Interview Questions

Q: Design a three-tier caching system for an LLM API handling 50K requests/day.

❌ Red flag: "Redis keyed on the raw query string, 1-hour TTL"

✅ Strong answer: "L1: in-process LRU cache (20-30% hit rate, <1ms) -- hot queries in application memory. L2: Redis exact match (30-40% hit rate, 1-5ms) -- normalized query string as the key, TTL depending on content type (FAQ -- 24h, news -- 1h). L3: vector DB semantic match (20-30%, 10-50ms) -- embeddings via all-MiniLM-L6-v2, cosine similarity threshold 0.95 for general Q&A, 0.99 for critical queries. Combined hit rate 70-85%. Invalidation: TTL-based + version-based on prompt updates + manual on 'wrong answer' feedback. Monitoring: cache hit rate per tier, false positive rate via sampling, cost savings dashboard"

Q: You have an LLM application with 5 task types: classification, summarization, Q&A, code generation, reasoning. How do you set up model routing?

❌ Red flag: "Everything on GPT-4o, it's the best model"

✅ Strong answer: "Classification ($0.0001/query) and summarization ($0.0003/query) -- Gemini Flash ($0.075/M), a 97% saving vs GPT-4o. Q&A -- semantic cache + GPT-4o-mini for simple queries + GPT-4o for complex ones (complexity classifier based on query length + technical term density). Code generation -- GPT-4o or Claude Sonnet (quality is critical). Reasoning -- o3-mini with fallback to Claude Opus for edge cases. Router: rule-based with an ML classifier for Q&A complexity. Overhead: <10ms. A/B test every routing decision on quality metrics + a cost dashboard in Langfuse"

Q: How do you decide when self-hosting an LLM is cheaper than the API?

❌ Red flag: "When the API gets expensive"

✅ Strong answer: "Break-even analysis: Llama 4 8B on an RTX 4090 ($200-400/month) vs the API -- break-even at ~50K requests/month. Llama 4 70B on 4xA100 ($3K-5K/month) -- break-even at ~500K requests/month. Beyond cost, consider: (1) latency -- self-hosting gives <100ms vs 500-2000ms for the API. (2) Privacy -- data never leaves your infrastructure. (3) Customization -- fine-tuning, custom decoding. (4) Ops overhead -- GPU provisioning, monitoring, failover, model updates. Threshold: >$5K/month API spend + a <100ms latency requirement + a team with GPU infrastructure experience. Otherwise API + caching + routing is cheaper in TCO"


Sources

  1. Redis.io -- "LLMOps: A Complete Guide to LLM Optimization 2026"
  2. RouterBench -- Multi-LLM Routing Benchmark
  3. vLLM -- Continuous Batching Paper
  4. Production case studies (various)

See Also