
LLMOps Cost Optimization

~4 minute read

Prerequisites: LLM API Pricing, LLMOps vs MLOps

Related file: LLM Cascade Routing -- routing architectures (rules/cascade/learned), fallback strategies, gateway tools, mermaid diagrams of the cascade flow

An application sending 10K requests/day to GPT-4o spends ~$1,500/month on the API. The same 10K requests with semantic caching (73% cost reduction) plus model routing (55% savings on what remains) come to $375/month -- a saving of $1,125. At 100K requests/day the gap reaches $12K/month. Three key levers: (1) semantic caching -- 96.9% latency reduction and a 60-85% hit rate on repetitive query patterns; (2) intelligent routing -- an ML-based router sends simple queries to GPT-4o-mini ($0.15/M) instead of GPT-4o ($2.50/M) with no loss of accuracy; (3) continuous batching -- +100% throughput. Without these techniques, inference cost grows linearly with the number of users.


Overview

Key Optimization Metrics

  • Semantic Caching: 73% cost reduction, 96.9% latency reduction
  • Model Routing: match query complexity to model capability
  • Batch Processing: +50-100% throughput, depending on batching strategy

Part 2: LLMOps vs MLOps

Key Differences

| Aspect | MLOps | LLMOps |
|---|---|---|
| Primary Artifacts | Model weights, training data | Prompts, RAG databases, guardrails |
| Evaluation | Accuracy, F1, AUC | Hallucination rate, relevance, safety |
| Deployment | Model serving | API routing, prompt versioning |
| Monitoring | Data drift, model decay | Token usage, latency, cost per query |
| Iteration Speed | Days to weeks | Hours to days |

New LLMOps Concerns

  1. Prompt Management — Version control, A/B testing
  2. Token Budgeting — Per-user, per-feature limits
  3. Guardrails — Input/output filtering
  4. RAG Pipelines — Chunking, embedding, retrieval
  5. Multi-Model Orchestration — Routing, fallbacks

Part 3: Semantic Caching

Also covered there: pipeline diagram, when-to-use matrix, production config -- see LLM Cascade Routing (section 2).

3.1 How It Works

graph TD
    Q["User Query"] --> E["Exact Match<br/>Cache Check<br/>(< 1ms)"]
    E -->|Hit| R1["Cached Response"]
    E -->|Miss| S["Semantic Match<br/>Cache Check<br/>(10-50ms)"]
    S -->|Hit| R2["Cached Response"]
    S -->|Miss| L["LLM Call<br/>(500-2000ms)"]

    style Q fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8f5e9,stroke:#4caf50
    style S fill:#fff3e0,stroke:#ef6c00
    style L fill:#fce4ec,stroke:#c62828
    style R1 fill:#e8f5e9,stroke:#4caf50
    style R2 fill:#e8f5e9,stroke:#4caf50

3.2 Performance Metrics

| Metric | Value |
|---|---|
| Cache Hit Rate | 60-85% (with semantic caching) |
| Cost Reduction | Up to 73% |
| Latency Reduction | 96.9% for cache hits |
| Exact Match Latency | < 1ms |
| Semantic Match Latency | 10-50ms |

3.3 Implementation Pattern

import redis
import numpy as np
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.redis = redis.Redis(host='localhost', port=6379)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = threshold  # minimum cosine similarity for a hit

    async def get_or_generate(self, query: str, llm_callable):
        # 1. Exact match check (< 1ms)
        exact_key = f"exact:{query}"
        cached = self.redis.get(exact_key)
        if cached:
            return cached.decode(), "exact_hit"

        # 2. Semantic match check: KNN search over a RediSearch vector
        #    index named "idx:embeddings" (10-50ms)
        query_embedding = self.encoder.encode(query).astype(np.float32)
        knn = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .sort_by("score")
            .return_fields("response", "score")
            .dialect(2)
        )
        similar = self.redis.ft("idx:embeddings").search(
            knn, query_params={"vec": query_embedding.tobytes()}
        )

        # With a COSINE index the returned score is a distance,
        # so similarity = 1 - score
        if similar.total:
            similarity = 1 - float(similar.docs[0].score)
            if similarity >= self.threshold:
                return similar.docs[0].response, "semantic_hit"

        # 3. LLM call + cache the response for future hits
        response = await llm_callable(query)
        self.cache_response(query, query_embedding, response)
        return response, "llm_call"

    def cache_response(self, query: str, embedding: np.ndarray, response: str):
        # Assumes "idx:embeddings" indexes hashes with the "sem:" prefix
        self.redis.set(f"exact:{query}", response)
        self.redis.hset(f"sem:{query}", mapping={
            "response": response,
            "embedding": embedding.tobytes(),
        })

3.4 Threshold Tuning

| Threshold | Precision | Recall | Use Case |
|---|---|---|---|
| 0.99 | Very High | Low | Exact FAQs |
| 0.95 | High | Medium | General Q&A |
| 0.90 | Medium | High | Broad similarity |
| 0.85 | Low | Very High | Exploratory search |
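
The right threshold is workload-specific, so it helps to sweep candidate values offline against a labeled set of query pairs. The sketch below assumes a labeled_pairs list of (query_a, query_b, same_intent) tuples -- a hypothetical dataset, not part of the setup above -- and reuses the same all-MiniLM-L6-v2 encoder:

from sentence_transformers import SentenceTransformer, util

def sweep_thresholds(labeled_pairs, thresholds=(0.85, 0.90, 0.95, 0.99)):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    # Score every labeled pair once, then evaluate each threshold.
    scored = []
    for query_a, query_b, same_intent in labeled_pairs:
        sim = util.cos_sim(
            encoder.encode(query_a), encoder.encode(query_b)
        ).item()
        scored.append((sim, same_intent))

    results = {}
    for t in thresholds:
        tp = sum(1 for sim, same in scored if sim >= t and same)
        fp = sum(1 for sim, same in scored if sim >= t and not same)  # wrong cached answer served
        fn = sum(1 for sim, same in scored if sim < t and same)       # reusable answer missed
        results[t] = {
            "precision": tp / (tp + fp) if (tp + fp) else 1.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        }
    return results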

Part 4: Intelligent Model Routing

In depth: routing architectures (rules, cascade, learned routers), fallback strategies, gateway tools -- see LLM Cascade Routing.

4.1 Router Implementation

class ModelRouter:
    def __init__(self):
        self.models = {
            "simple": "gpt-4o-mini",         # $0.15/1M input tokens
            "medium": "gpt-4o",              # $2.50/1M input tokens
            "complex": "claude-sonnet-4",    # $3/1M input tokens
            "specialized": "claude-opus-4",  # $15/1M input tokens
        }

    def route(self, query: str, context: dict) -> tuple[str, str]:
        # analyze_complexity returns a 0-1 score; a heuristic sketch
        # is given in section 4.2 below.
        complexity = self.analyze_complexity(query, context)

        if complexity < 0.3:
            return self.models["simple"], "simple"
        elif complexity < 0.6:
            return self.models["medium"], "medium"
        elif complexity < 0.8:
            return self.models["complex"], "complex"
        else:
            return self.models["specialized"], "specialized"

4.2 Complexity Signals

| Signal | Weight | Detection |
|---|---|---|
| Query Length | 0.1 | Token count > 100 |
| Multi-step Reasoning | 0.3 | "then", "after", "first", "finally" |
| Code Generation | 0.2 | "write code", "implement", "function" |
| Domain Knowledge | 0.2 | Technical terms, jargon |
| Ambiguity | 0.2 | Multiple interpretations possible |
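
A minimal heuristic scorer matching the weights above might look like the sketch below; the keyword lists and the domain_specific / ambiguous context flags are illustrative placeholders (detecting jargon and ambiguity reliably usually needs a small ML classifier):

REASONING_MARKERS = ("then", "after", "first", "finally")
CODE_MARKERS = ("write code", "implement", "function")

def analyze_complexity(query: str, context: dict) -> float:
    """Weighted 0-1 complexity score consumed by ModelRouter.route()."""
    score = 0.0
    q = query.lower()
    if len(query.split()) > 100:                  # Query Length (0.1)
        score += 0.1
    if any(m in q for m in REASONING_MARKERS):    # Multi-step Reasoning (0.3)
        score += 0.3
    if any(m in q for m in CODE_MARKERS):         # Code Generation (0.2)
        score += 0.2
    if context.get("domain_specific"):            # Domain Knowledge (0.2)
        score += 0.2
    if context.get("ambiguous"):                  # Ambiguity (0.2)
        score += 0.2
    return min(score, 1.0)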

4.3 Key Numbers

| Metric | Value |
|---|---|
| ML-based router accuracy | 86% (vs 85% for the single best model) |
| ML-based router cost reduction | 55% |
| Routing overhead (rule-based) | < 1ms |
| Routing overhead (ML classifier) | 10-20ms |

Part 5: Batch Processing

5.1 Types of Batching

| Type | Description | Use Case |
|---|---|---|
| Static Batching | Wait for N requests | Offline processing |
| Continuous Batching | Add/remove requests dynamically | Real-time serving |
| Multi-bin Batching | Group by token length | Production serving |

5.2 Throughput Improvement

Single Request:        100 queries/sec
Static Batching:       150 queries/sec (+50%)
Continuous Batching:   200 queries/sec (+100%)
Multi-bin Batching:    170 queries/sec (+70%)

5.3 Implementation (vLLM-style)

import asyncio

class ContinuousBatcher:
    def __init__(self, max_batch_size: int = 32):
        self.queue = asyncio.Queue()
        self.max_batch_size = max_batch_size

    async def submit(self, request):
        # Each caller gets a future that process_batch() resolves
        # once its batch has been generated.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request, future))
        return await future

    async def process_batch(self):
        # Collect up to max_batch_size requests, waiting at most 10ms
        # for stragglers before dispatching whatever has arrived.
        batch = []
        while len(batch) < self.max_batch_size:
            try:
                item = await asyncio.wait_for(
                    self.queue.get(), timeout=0.01
                )
                batch.append(item)
            except asyncio.TimeoutError:
                break

        if batch:
            # llm_generate_batch is the backend batched-inference call
            # (e.g. a vLLM engine); it is not defined in this sketch.
            responses = await self.llm_generate_batch(
                [r for r, _ in batch]
            )
            for (_, future), response in zip(batch, responses):
                future.set_result(response)
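
A minimal usage sketch: one background task drains the queue while callers await submit() concurrently. fake_generate_batch is a stand-in for the real batched-inference backend:

async def main():
    batcher = ContinuousBatcher(max_batch_size=32)

    async def fake_generate_batch(requests):
        # Placeholder backend; a real deployment would call a vLLM engine here
        return [f"response to: {r}" for r in requests]

    batcher.llm_generate_batch = fake_generate_batch

    async def worker():
        while True:
            await batcher.process_batch()

    worker_task = asyncio.create_task(worker())
    answers = await asyncio.gather(
        *(batcher.submit(f"query {i}") for i in range(100))
    )
    worker_task.cancel()
    return answers

asyncio.run(main())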

Part 6: Multi-Layer Caching Architecture

6.1 Three-Tier Cache

graph TD
    L1["L1: Local Memory (LRU)<br/>< 1ms | Hit Rate: 20-30%"]
    L1 -->|Miss| L2["L2: Redis Exact Match<br/>1-5ms | Hit Rate: 30-40%"]
    L2 -->|Miss| L3["L3: Vector DB Semantic Match<br/>10-50ms | Hit Rate: 20-30%"]
    L3 -->|Miss| API["LLM API Call<br/>500-2000ms"]

    style L1 fill:#e8f5e9,stroke:#4caf50
    style L2 fill:#e8eaf6,stroke:#3f51b5
    style L3 fill:#fff3e0,stroke:#ef6c00
    style API fill:#fce4ec,stroke:#c62828

Combined Hit Rate: 70-85%
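
A sketch of the lookup order, assuming the SemanticCache from Part 3 serves as the L3 layer and the L1 tier is a plain in-process LRU built on OrderedDict; the class and parameter names are illustrative:

from collections import OrderedDict

class ThreeTierCache:
    def __init__(self, redis_client, semantic_cache, l1_size: int = 1024):
        self.l1 = OrderedDict()          # L1: in-process LRU
        self.l1_size = l1_size
        self.redis = redis_client        # L2: exact match
        self.semantic = semantic_cache   # L3: semantic match + LLM fallback

    async def get_or_generate(self, query: str, llm_callable):
        # L1: local memory, < 1ms
        if query in self.l1:
            self.l1.move_to_end(query)
            return self.l1[query], "l1_hit"

        # L2: Redis exact match, 1-5ms
        cached = self.redis.get(f"exact:{query}")
        if cached:
            response, source = cached.decode(), "l2_hit"
        else:
            # L3: semantic match (10-50ms) or LLM call (500-2000ms)
            response, source = await self.semantic.get_or_generate(query, llm_callable)

        # Promote into L1 and evict the oldest entry if over capacity
        self.l1[query] = response
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)
        return response, source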

6.2 Cache Invalidation Strategies

| Strategy | When to Use |
|---|---|
| TTL-based | News, time-sensitive data |
| Version-based | Prompt updates, model changes |
| Manual | Error responses, feedback |
| LRU | Memory-constrained |
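
Version-based invalidation is usually the cheapest to operate: the prompt/model version becomes part of the cache key, so bumping it orphans old entries without an explicit flush. A small sketch (the constants are illustrative):

import hashlib

PROMPT_VERSION = "v7"            # bump on prompt or model changes
MODEL_NAME = "gpt-4o-mini"
TTL_BY_CONTENT = {"faq": 24 * 3600, "news": 3600}   # TTL-based invalidation

def versioned_cache_key(query: str) -> str:
    # Old-version entries stop being read and age out via TTL / LRU
    digest = hashlib.sha256(query.encode()).hexdigest()
    return f"cache:{MODEL_NAME}:{PROMPT_VERSION}:{digest}"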

Part 7: Cost Projection Model

7.1 Monthly Cost Calculator

def calculate_monthly_cost(
    queries_per_day: int,
    avg_tokens_per_query: int,
    model_cost_per_1m: float,
    cache_hit_rate: float = 0.0,
    routing_savings: float = 0.0
) -> dict:
    daily_tokens = queries_per_day * avg_tokens_per_query
    daily_cost = (daily_tokens / 1_000_000) * model_cost_per_1m

    # Apply optimizations
    optimized_cost = daily_cost * (1 - cache_hit_rate) * (1 - routing_savings)

    return {
        "daily_cost": optimized_cost,
        "monthly_cost": optimized_cost * 30,
        "savings": (daily_cost - optimized_cost) * 30
    }
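
For instance, the "Medium App" row below is reproduced under the assumption of ~2,000 tokens per query at GPT-4o's $2.50/M input price, with a 50% effective cache hit rate and 50% routing savings on the remainder:

print(calculate_monthly_cost(
    queries_per_day=10_000,
    avg_tokens_per_query=2_000,
    model_cost_per_1m=2.50,
    cache_hit_rate=0.5,
    routing_savings=0.5,
))
# {'daily_cost': 12.5, 'monthly_cost': 375.0, 'savings': 1125.0}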

7.2 Example Calculations

| Scenario | Queries/Day | Baseline/Month | Optimized/Month | Savings |
|---|---|---|---|---|
| Small App | 1,000 | $150 | $45 | $105 (70%) |
| Medium App | 10,000 | $1,500 | $375 | $1,125 (75%) |
| Large App | 100,000 | $15,000 | $3,000 | $12,000 (80%) |

Part 8: Interview-Relevant Numbers

Cache Statistics

| Metric | Value |
|---|---|
| Semantic cache hit rate | 60-85% |
| Cost reduction from caching | Up to 73% |
| Latency reduction for hits | 96.9% |
| Exact match latency | < 1ms |
| Semantic match latency | 10-50ms |

Routing Statistics

| Metric | Value |
|---|---|
| Router latency overhead | 5-20ms |
| Cost reduction from routing | 40-60% |
| Accuracy parity on RouterBench | Yes (86% vs 85% single best model) |

Batching Statistics

| Metric | Value |
|---|---|
| Static batching improvement | +50% |
| Continuous batching improvement | +100% |
| Multi-bin batching improvement | +70% |

Misconception: a semantic cache can be switched on and forgotten

A semantic cache needs continuous monitoring: (1) false positive hits -- the cache returns an answer to a similar-looking but semantically different query (a 0.95 threshold will not save you from "Pro pricing" vs "Enterprise pricing"). (2) Stale responses -- the data goes out of date while the default TTL is too long. (3) Cache poisoning -- a single bad LLM response gets cached and served over and over. Cache quality monitoring is required: sample 1-5% of cached responses into the evaluation pipeline.
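
A minimal sampling hook for that last point, assuming some downstream eval_queue object that feeds the evaluation pipeline (both names are placeholders):

import random

def maybe_sample_for_eval(query: str, cached_response: str, eval_queue, rate: float = 0.02):
    # Forward ~2% of semantic-cache hits to the eval pipeline so false
    # positives, stale answers and poisoned entries get caught.
    if random.random() < rate:
        eval_queue.put({
            "query": query,
            "response": cached_response,
            "source": "semantic_cache",
        })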

Misconception: model routing saves money with no quality loss on every task

RouterBench shows an ML-based router keeping 86% accuracy at a 55% cost reduction -- but that is an average. On tasks that need nuanced reasoning (legal analysis, hard math), routing to a budget model causes a catastrophic quality drop. A rule-based router with explicit exceptions for critical task types is more reliable than a pure ML classifier. Always A/B test routing decisions on quality metrics, not just cost.
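
One way to encode such exceptions is a thin rule layer in front of the learned router; the task types and keywords below are illustrative:

CRITICAL_TASK_TYPES = {"legal_analysis", "math_proof"}
CRITICAL_KEYWORDS = ("liability", "contract clause", "prove that")

def route_with_exceptions(query: str, context: dict, router: ModelRouter) -> str:
    # Explicit rules win over the ML classifier for tasks where a
    # budget model is known to degrade badly.
    if context.get("task_type") in CRITICAL_TASK_TYPES:
        return router.models["specialized"]
    if any(kw in query.lower() for kw in CRITICAL_KEYWORDS):
        return router.models["complex"]
    model, _tier = router.route(query, context)
    return model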

Misconception: batch processing always beats real-time

The Batch API gives a 50% discount but with up to 24-hour latency (OpenAI). For user-facing applications that is a non-starter. Continuous batching (vLLM) delivers +100% throughput without added latency -- but requires self-hosting. The right strategy: the Batch API for offline jobs (analytics, content generation), continuous batching for real-time serving; static batching is close to useless in production (it adds latency for the first requests in a batch).
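
A rough sketch of the offline path using OpenAI's Batch API, assuming requests.jsonl already holds one chat-completion request per line:

from openai import OpenAI

client = OpenAI()

# Upload the prepared JSONL of requests, then submit a batch job with
# the 24-hour completion window (priced at roughly half of synchronous calls).
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)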


Interview Questions

Q: Design a three-tier caching system for an LLM API handling 50K requests/day.

❌ Red flag: "Redis keyed on the raw query string, 1-hour TTL"

✅ Strong answer: "L1: in-process LRU cache (20-30% hit rate, <1ms) -- hot queries in application memory. L2: Redis exact match (30-40% hit rate, 1-5ms) -- normalized query string as the key, TTL depending on content type (FAQ -- 24h, news -- 1h). L3: vector DB semantic match (20-30%, 10-50ms) -- embeddings via all-MiniLM-L6-v2, cosine similarity threshold 0.95 for general Q&A, 0.99 for critical queries. Combined hit rate 70-85%. Invalidation: TTL-based + version-based on prompt updates + manual on 'wrong answer' feedback. Monitoring: cache hit rate per tier, false positive rate via sampling, cost savings dashboard"

Q: You have an LLM application with 5 task types: classification, summarization, Q&A, code generation, reasoning. How do you set up model routing?

❌ Red flag: "Everything on GPT-4o, it's the best model"

✅ Strong answer: "Classification ($0.0001/query) and summarization ($0.0003/query) -- Gemini Flash ($0.075/M), a 97% saving vs GPT-4o. Q&A -- semantic cache + GPT-4o-mini for simple queries + GPT-4o for complex ones (complexity classifier based on query length + technical term density). Code generation -- GPT-4o or Claude Sonnet (quality is critical). Reasoning -- o3-mini with fallback to Claude Opus for edge cases. Router: rule-based with an ML classifier for Q&A complexity. Overhead: <10ms. A/B test every routing decision on quality metrics + a cost dashboard in Langfuse"

Q: How do you decide when self-hosting an LLM is cheaper than the API?

❌ Red flag: "When the API gets expensive"

✅ Strong answer: "Break-even analysis: Llama 4 8B on an RTX 4090 ($200-400/month) vs the API -- break-even at ~50K requests/month. Llama 4 70B on 4xA100 ($3K-5K/month) -- break-even at ~500K requests/month. Beyond cost, consider: (1) latency -- self-hosting gives <100ms vs 500-2000ms for the API. (2) Privacy -- data never leaves your infrastructure. (3) Customization -- fine-tuning, custom decoding. (4) Ops overhead -- GPU provisioning, monitoring, failover, model updates. Threshold: >$5K/month API spend + a <100ms latency requirement + a team with GPU infrastructure experience. Otherwise API + caching + routing is cheaper in TCO"


Sources

  1. Redis.io -- "LLMOps: A Complete Guide to LLM Optimization 2026"
  2. RouterBench -- Multi-LLM Routing Benchmark
  3. vLLM -- Continuous Batching Paper
  4. Production case studies (various)

See Also