Cascade Routing and LLM Optimization¶
~10 minute read
Prerequisites: LLM API Pricing, Production LLM Deployment
Related file: LLMOps Cost Optimization -- batch processing, cost projection model, LLMOps vs MLOps, Python implementations of SemanticCache and ModelRouter
40-70% of text prompts sent to LLMs do not need a flagship model (CascadeFlow, 2026). Cascade routing is a strategy in which a query first goes to a cheap model (GPT-4o-mini at $0.15/M input tokens) and is escalated to an expensive one (GPT-4o at $2.50/M) only when confidence is low. Combined with semantic caching (40-86% savings, 250x speedup on cache hits) and learned routing (contextual bandits), total savings reach 60-85% with no quality degradation. RouterBench confirms it: multi-LLM routers match or exceed the best single model at 40-60% lower cost.
Pricing note: specific figures are current as of February 2026 and go stale quickly. See the disclaimer in the pricing section.
Key Concepts¶
Model routing -- directing each request to the most cost-effective model that can handle it.
Three Pillars of Optimization¶
| Pillar | What to Measure | Target |
|---|---|---|
| Quality | Accuracy, helpfulness, safety | Task-specific criteria |
| Cost | API fees, GPU time, retries | Total cost of ownership |
| Latency | P50/P95 response times | End-to-end SLA |
Routing Landscape 2026¶
| Strategy | Cost Savings | Complexity | Quality Impact |
|---|---|---|---|
| Semantic caching | 40-86% | Low | None |
| Model cascade | 30-65% | Medium | Minimal |
| Learned routing | 50-80% | High | Optimized |
| Hybrid approach | 60-85% | High | Optimized |
1. Routing Architectures¶
1.1 Rule-Based Routing¶
If-then logic based on keywords, regex, prompt length, task types.
- Route short factual questions -> Mistral-7B
- Route creative writing -> Claude Opus
Pros: transparent, deterministic, easy to govern. Cons: rigid, hard to maintain.
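A minimal rule-based router sketch. The patterns and model names here are illustrative, not a recommended production ruleset:

```python
import re

# Ordered (pattern, model) rules; the first match wins.
RULES = [
    (re.compile(r"\b(poem|story|essay)\b", re.I), "claude-3-opus"),  # creative
    (re.compile(r"\b(what|who|when|where)\b", re.I), "mistral-7b"),  # short factual
]
DEFAULT_MODEL = "gpt-4o"  # fall-through for anything the rules miss

def rule_based_route(query: str) -> str:
    """Return the target model for a query via first-match keyword rules."""
    for pattern, model in RULES:
        if pattern.search(query):
            return model
    return DEFAULT_MODEL
```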
1.2 Cascade Routing (Uncertainty-Based Escalation)¶
Try cheap model first, escalate if confidence low.
```mermaid
graph TD
    Q["Query"] --> SC["Semantic Cache"]
    SC -->|"HIT"| R1["Return (instant, free)"]
    SC -->|"MISS"| T1["Tier 1: Cheap (Haiku, GPT-4o-mini)"]
    T1 -->|"HIGH confidence"| R2["Return"]
    T1 -->|"LOW confidence"| T2["Tier 2: Medium (Sonnet, GPT-4o)"]
    T2 -->|"HIGH confidence"| R3["Return"]
    T2 -->|"LOW confidence"| T3["Tier 3: Premium (Opus, o3)"]
    T3 --> R4["Return (guaranteed)"]
    style Q fill:#e8eaf6,stroke:#3f51b5
    style SC fill:#f3e5f5,stroke:#9c27b0
    style R1 fill:#e8f5e9,stroke:#4caf50
    style T1 fill:#e8f5e9,stroke:#4caf50
    style R2 fill:#e8f5e9,stroke:#4caf50
    style T2 fill:#fff3e0,stroke:#ef6c00
    style R3 fill:#e8f5e9,stroke:#4caf50
    style T3 fill:#fce4ec,stroke:#c62828
    style R4 fill:#e8f5e9,stroke:#4caf50
```
| Metric (CascadeFlow) | Value |
|---|---|
| Text prompts not needing flagship | 40-70% |
| Agent calls not needing flagship | 20-60% |
| Typical cost savings | 30-65% |
| Latency overhead | +5-15% |
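A sketch of the escalation loop behind the diagram above; `call_with_confidence` is an assumed helper (confidence might come from token logprobs or a small verifier model):

```python
TIERS = ["gpt-4o-mini", "gpt-4o", "o3"]  # cheap -> medium -> premium
CONFIDENCE_THRESHOLD = 0.75              # within the 0.7-0.8 range recommended below

def call_with_confidence(model: str, query: str) -> tuple[str, float]:
    """Assumed helper: call `model` and return (answer, confidence in [0, 1])."""
    raise NotImplementedError

def cascade(query: str) -> str:
    """Try cheap tiers first; escalate while confidence stays low."""
    for model in TIERS[:-1]:
        answer, confidence = call_with_confidence(model, query)
        if confidence >= CONFIDENCE_THRESHOLD:
            return answer
    # Final tier always answers -- the "guaranteed" return in the diagram.
    answer, _ = call_with_confidence(TIERS[-1], query)
    return answer
```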
1.3 Learned Routers & Contextual Bandits¶
ML-based routing that adapts over time:
- Lightweight classifier: predicts best LLM from prompt embeddings, user history, domain tags
- Bandit policy: continuously explores routing choices, exploits best outcomes
- Adaptation: adjusts to shifting traffic patterns and model updates
Pros: data-efficient, adaptive, highest optimization potential. Cons: complex to build/debug.
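A toy epsilon-greedy bandit over models. This drops the "contextual" part (no per-query features) to keep the exploration/exploitation loop visible; the reward could be a judge score minus a cost penalty:

```python
import random
from collections import defaultdict

class EpsilonGreedyRouter:
    """Explore a random model with probability epsilon; otherwise exploit
    the model with the best observed mean reward."""

    def __init__(self, models: list[str], epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon
        self.counts: dict[str, int] = defaultdict(int)
        self.means: dict[str, float] = defaultdict(float)

    def choose(self) -> str:
        if random.random() < self.epsilon or not self.counts:
            return random.choice(self.models)
        return max(self.models, key=lambda m: self.means[m])

    def update(self, model: str, reward: float) -> None:
        # Incremental mean update after observing the outcome of a routed query.
        self.counts[model] += 1
        self.means[model] += (reward - self.means[model]) / self.counts[model]
```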
1.4 Architecture Comparison¶
| Architecture | Complexity | Transparency | Adaptability | Best For |
|---|---|---|---|---|
| Rules | Low | High | Low | Starting, regulated envs |
| Cascades | Medium | Medium | Low | Heterogeneous workloads |
| Learned | High | Low | High | High-volume production |
2. Semantic Caching¶
Pipeline¶
```mermaid
graph LR
    Q["Query"] --> E["Embed"]
    E --> S["Similarity Search"]
    S --> T{"Threshold >0.95?"}
    T -->|"HIT"| R1["Return cached (instant)"]
    T -->|"MISS"| L["Call LLM"]
    L --> ST["Store in Cache"]
    ST --> R2["Return"]
    style Q fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8eaf6,stroke:#3f51b5
    style S fill:#f3e5f5,stroke:#9c27b0
    style T fill:#fff3e0,stroke:#ef6c00
    style R1 fill:#e8f5e9,stroke:#4caf50
    style L fill:#fce4ec,stroke:#c62828
    style ST fill:#f3e5f5,stroke:#9c27b0
    style R2 fill:#e8f5e9,stroke:#4caf50
```
Statistics¶
| Metric | Value |
|---|---|
| Cost reduction | 40-86% |
| Response time | 250x faster (hit) |
| Latency reduction | 96.9% (1.67s -> 0.052s for hits) |
| Cache hit improvement | 88% |
Multi-Layer Cache¶
| Layer | Type | Latency | Use Case |
|---|---|---|---|
| L1 | Exact match (key-value) | <1ms | Identical queries |
| L2 | Semantic (vector) | 10-50ms | Similar meaning queries |
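A minimal two-layer cache sketch. `embed_fn` is an assumption (any sentence-embedding model returning unit-norm vectors), and a brute-force scan stands in for a real vector index:

```python
import hashlib
import numpy as np

class TwoLayerCache:
    """L1: exact match on a query hash. L2: cosine similarity over embeddings."""

    def __init__(self, embed_fn, threshold: float = 0.97):
        self.embed_fn = embed_fn    # assumed: text -> unit-norm np.ndarray
        self.threshold = threshold  # conservative default; see the warning below
        self.exact: dict[str, str] = {}                  # L1
        self.entries: list[tuple[np.ndarray, str]] = []  # L2, brute force

    def get(self, query: str) -> str | None:
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:                   # L1 hit: <1ms
            return self.exact[key]
        q = self.embed_fn(query)
        for vec, answer in self.entries:        # L2: linear scan stand-in
            if float(np.dot(q, vec)) >= self.threshold:
                return answer
        return None                             # MISS: caller invokes the LLM

    def put(self, query: str, answer: str) -> None:
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = answer
        self.entries.append((self.embed_fn(query), answer))
```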
A semantic cache with threshold < 0.95 returns WRONG answers
Cosine similarity of 0.90 looks "similar enough", but for LLM queries it is dangerous. "How do I delete a file in Linux?" (similarity 0.91 with "How do I delete a directory in Linux?") calls for completely different commands (rm vs rm -r). In production, a 0.90 threshold produces a 15-20% false positive rate. Rule of thumb: start at 0.97+ and lower only after an A/B test on your own traffic. For code generation and medical use cases, stay at 0.98 or higher.
When to Use¶
| Use Case | Recommended | Reason |
|---|---|---|
| Customer support | Yes | Repeated intents |
| FAQ bots | Yes | High query overlap |
| Code generation | Maybe | Varies by spec |
| Real-time data | No | Needs fresh data |
| Creative writing | No | Unique outputs |
Configuration¶
| Setting | Recommended | Reason |
|---|---|---|
| TTL | 5-60 minutes | Balance freshness vs cache |
| TTL jitter | 10-20% | Prevent thundering herd |
| Similarity threshold | 0.95+ | Avoid false positives |
| Max cache size | Based on memory | LRU eviction |
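TTL jitter is just a random spread on expiry so entries written in the same burst do not all expire at once (the thundering herd in the table above); a sketch:

```python
import random

def jittered_ttl(base_ttl_s: float, jitter_frac: float = 0.15) -> float:
    """Return a TTL with a +/- jitter_frac spread (10-20% per the table above)."""
    return base_ttl_s * random.uniform(1 - jitter_frac, 1 + jitter_frac)

# Usage with a hypothetical cache client: cache.set(key, value, ttl=jittered_ttl(1800))
```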
3. Smart Routing Strategies¶
Pre-Generation Routing¶
```python
def pre_generation_route(query: str) -> str:
    """Classify complexity before generation and pick a model tier accordingly."""
    # classify_complexity and call_model are assumed helpers: a lightweight
    # complexity classifier and a provider-agnostic client wrapper.
    complexity = classify_complexity(query)
    if complexity == "simple":
        return call_model("gpt-3.5-turbo", query)
    elif complexity == "medium":
        return call_model("claude-3-haiku", query)
    else:
        return call_model("gpt-4", query)
```
Select-Then-Route (StR, EMNLP 2025)¶
Two-stage: select model pool -> route within it. Advantage: smaller candidate pool.
Taxonomy-Guided Routing¶
Route by query taxonomy: code queries -> code-specialized model, creative -> creative model.
Feedback-Based Routing¶
Learn from user satisfaction (regenerate, thumbs down). Improves over time.
4. Advanced Optimization¶
Task Shaping¶
Make the task easier before model selection:
- Prompt engineering: clear instructions, few-shot examples
- Structured output: JSON schemas for deterministic parsing
- Pre-processing: summarize long docs before synthesis
Teacher-Student Distillation¶
- Use GPT-4 ("teacher") for sample production requests
- Fine-tune smaller model ("student") on labeled data
- Route majority of traffic to student
- Escalate to teacher for novel/high-stakes cases
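A sketch of the serving-side decision for this pattern; `is_novel`, `is_high_stakes`, and `call_model` are assumed helpers (novelty might be embedding distance from the student's training data):

```python
def distillation_route(query: str) -> str:
    """Serve the fine-tuned student by default; escalate rare cases to the teacher."""
    if is_novel(query) or is_high_stakes(query):  # assumed detectors
        return call_model("gpt-4", query)          # teacher handles the long tail
    return call_model("ft:student-model", query)   # student takes the bulk of traffic
```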
Predictive Latency Modeling¶
- Token count estimation: predict response time from input length
- Load-based selection: choose model within SLA under current load
- Speculative execution: pre-compute responses, serve best instantly
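A naive version of the token-count estimator: latency ≈ queueing + prefill + decode. All constants are illustrative, not measured:

```python
def estimate_latency_s(input_tokens: int, expected_output_tokens: int,
                       prefill_tps: float = 5000.0,  # illustrative prompt-processing speed
                       decode_tps: float = 60.0,     # illustrative generation speed
                       queue_s: float = 0.2) -> float:
    """Rough response-time prediction from token counts alone."""
    return queue_s + input_tokens / prefill_tps + expected_output_tokens / decode_tps

# Load-based selection: pick the cheapest model whose estimate fits the SLA.
```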
5. Fallback Strategies¶
Triggers¶
| Trigger | Action |
|---|---|
| Model failure | Try next model in cascade |
| High latency | Switch to faster model |
| Rate limit | Queue or switch provider |
| Quality threshold not met | Escalate to stronger model |
Best Practices¶
| Practice | Description |
|---|---|
| Cross-provider | Don't fallback within same provider |
| Health checks | Monitor model availability |
| Circuit breaker | Skip failing models temporarily |
| Graceful degradation | Inform user of fallback |
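A sketch combining a cross-provider order with a per-provider circuit breaker; `call_provider` is an assumed client wrapper:

```python
import time

class CircuitBreaker:
    """Skip a provider for cooldown_s after max_failures consecutive errors."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 60.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures: dict[str, int] = {}
        self.opened_at: dict[str, float] = {}

    def available(self, provider: str) -> bool:
        opened = self.opened_at.get(provider)
        return opened is None or time.time() - opened > self.cooldown_s

    def record(self, provider: str, ok: bool) -> None:
        if ok:
            self.failures[provider] = 0
            self.opened_at.pop(provider, None)
            return
        self.failures[provider] = self.failures.get(provider, 0) + 1
        if self.failures[provider] >= self.max_failures:
            self.opened_at[provider] = time.time()

def call_provider(provider: str, query: str) -> str:
    """Assumed helper: invoke the provider's API and return the answer."""
    raise NotImplementedError

PROVIDERS = ["openai", "anthropic", "google"]  # cross-provider, not same-provider

def call_with_fallback(query: str, breaker: CircuitBreaker) -> str:
    for provider in PROVIDERS:
        if not breaker.available(provider):
            continue  # breaker open: skip the whole provider, not a single model
        try:
            answer = call_provider(provider, query)
            breaker.record(provider, ok=True)
            return answer
        except Exception:
            breaker.record(provider, ok=False)
    raise RuntimeError("all providers unavailable")
```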
6. Evaluation & Deployment¶
Golden Set¶
Representative data + edge cases + ground truth (human ratings, LLM-as-judge calibration).
Evaluation Methods¶
| Method | Use Case | Cost |
|---|---|---|
| LLM-as-judge | Format, reasoning, factual consistency | $0.05-0.50 per 1K |
| User feedback | Thumbs up/down, escalation signals | Free (implicit) |
| Human raters | Edge cases, quality audits | $1-5 per batch |
Deployment Strategy¶
- Shadow mode: new policy runs parallel, log decisions without impact
- Counterfactual logging: compute potential regret/improvement
- A/B testing: measure impact on business KPIs
- Gradual rollout: 1% -> 5% -> 50% -> 100%
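A shadow-mode sketch matching the first bullet above: the candidate policy runs on live traffic, but only its decision is logged, never served:

```python
import json
import logging

logger = logging.getLogger("router.shadow")

def route_with_shadow(query: str, prod_policy, shadow_policy) -> str:
    """Serve the production decision; log the shadow decision for offline comparison."""
    prod_model = prod_policy(query)
    try:
        shadow_model = shadow_policy(query)  # must never touch the user-facing path
        logger.info(json.dumps({
            "prod": prod_model,
            "shadow": shadow_model,
            "agree": prod_model == shadow_model,
        }))
    except Exception:
        logger.exception("shadow policy failed")  # failures stay invisible to users
    return prod_model
```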
RouterBench Results¶
| Finding | Impact |
|---|---|
| Quality match | Multi-LLM routers match or exceed best single model |
| Cost reduction | 40-60% average |
| Latency overhead | 5-20ms for routing decision |
7. Production Configuration¶
Routing Config¶
| Setting | Recommended |
|---|---|
| Confidence threshold | 0.7-0.8 |
| Max cascade steps | 3 |
| Timeout per model | 5-30 seconds |
| Retry attempts | 2 |
Monitoring Metrics¶
| Metric | Alert Threshold |
|---|---|
| Cache hit rate | < 30% |
| Cascade escalation rate | > 50% |
| Model latency | > 5s P95 |
| Fallback rate | > 10% |
| Cost per query | > baseline + 20% |
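These thresholds translate directly into alert rules; a minimal check, with values taken from the table above:

```python
# Each entry: metric name -> predicate that returns True when the alert fires.
ALERTS = {
    "cache_hit_rate":  lambda v: v < 0.30,
    "escalation_rate": lambda v: v > 0.50,
    "p95_latency_s":   lambda v: v > 5.0,
    "fallback_rate":   lambda v: v > 0.10,
}

def firing_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics currently breaching their thresholds."""
    return [name for name, breached in ALERTS.items()
            if name in metrics and breached(metrics[name])]
```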
Governance¶
- Routing rules in git/feature flags
- Auto-shift to conservative models near budget limits
- Multiple providers for fallback
- Circuit breakers for failing models
8. Gateway Tools¶
Commercial¶
| Gateway | Features | Best For |
|---|---|---|
| Portkey | Routing, caching, fallback | Production |
| LiteLLM | Unified API, routing | Development |
| Unify | Multi-provider routing | Cost optimization |
| OpenRouter | Model marketplace | Flexibility |
| Bifrost | Enterprise routing | Enterprise |
Open-Source¶
| Tool | Focus |
|---|---|
| CascadeFlow | Cascade routing (lemony-ai/cascadeflow) |
| LiteLLM | Unified interface (BerriAI/litellm) |
| GPTCache | Semantic caching (zilliztech/GPTCache) |
For Interviews¶
Q: "How do you optimize the cost of an LLM system?"¶
Three strategies: (1) Semantic caching -- 40-86% savings, 250x faster on cache hits, similarity threshold 0.95+. (2) Cascade routing -- try a cheap model first (GPT-3.5/Haiku), escalate only when confidence is low; 40-70% of queries do not need a flagship model (CascadeFlow); savings of 30-65%. (3) Learned routing -- contextual bandit / lightweight classifier, 50-80% savings, adapts over time. Combined: 60-85%.
Q: "Design an LLM routing system."¶
Components: (1) Request classifier (embeddings + complexity features). (2) Semantic cache (L1 exact + L2 vector, similarity > 0.95). (3) Model cascade (3 tiers: cheap/medium/premium, confidence threshold 0.7-0.8). (4) Fallback (cross-provider, circuit breaker). (5) Monitoring (cache hit rate, escalation rate, cost per query). Deployment: shadow mode -> A/B test -> gradual rollout. ROI: semantic caching pays back in 1-2 weeks.
Key Numbers¶
| Fact | Value |
|---|---|
| Semantic caching cost reduction | 40-86% |
| Semantic caching speedup | 250x |
| Cascade routing savings | 30-65% |
| Queries not needing flagship (text) | 40-70% |
| Queries not needing flagship (agents) | 20-60% |
| Combined routing savings | 60-85% |
| Routing decision overhead | 5-20ms |
| RouterBench cost reduction | 40-60% |
| Query distribution: simple | 40-60% |
| Routing ROI (caching) | 1-2 weeks |
Model Cost Comparison (per 1M tokens)¶
| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet | $3.00 | $15.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Haiku | $0.25 | $1.25 |
| Gemini Flash | $0.075 | $0.30 |
| Llama 3.1 8B (self-hosted) | ~$0.10 | ~$0.10 |
Misconception: cascade routing always saves money
A cascade adds +5-15% latency overhead and requires confidence estimation at every tier. If the escalation rate exceeds 50%, savings are minimal while latency grows. For homogeneous workloads (all queries of the same complexity) a cascade is useless -- a learned router or direct routing to a single model is more effective. A cascade pays off with heterogeneous traffic where 40%+ of queries are simple.
Misconception: a semantic cache with cosine similarity 0.90 is accurate enough
At a 0.90 threshold the false positive rate reaches 15-20%. "How do I delete a file in Linux?" (similarity 0.91 with "How do I delete a directory in Linux?") calls for completely different commands (rm vs rm -r). In production start at 0.97+; for code generation and medical use cases, no lower than 0.98. Lower the threshold only after an A/B test on your own traffic.
Misconception: fallback within a single provider is a reliable strategy
If OpenAI is rate-limited or down, falling back to another OpenAI model is useless -- the problem is at the provider level. Cross-provider fallback (OpenAI -> Anthropic -> Google) is a mandatory pattern. The circuit breaker should temporarily exclude the failing provider as a whole, not just an individual model.
Interview Questions¶
Q: Design an LLM routing system for a high-traffic service.
Red flag: "Route all requests to GPT-4 for maximum quality"
Strong answer: "Five components: (1) a request classifier over embeddings + complexity features; (2) a semantic cache -- L1 exact match (<1ms) + L2 vector (10-50ms), threshold >0.95; (3) a 3-tier model cascade: cheap (Haiku, $0.25/M) -> medium (Sonnet, $3/M) -> premium (Opus), confidence threshold 0.7-0.8; (4) cross-provider fallback with a circuit breaker; (5) monitoring -- cache hit rate, escalation rate >50% = alert, cost per query. Deployment: shadow mode -> A/B -> gradual rollout. Semantic caching ROI: 1-2 weeks"
Q: When is semantic caching harmful?
Red flag: "Semantic caching is always useful, it saves money"
Strong answer: "Three cases: (1) real-time data -- a cached answer goes stale instantly (stock quotes, weather); (2) creative writing -- every request is unique, overlap is near zero, cache hit rate < 5%; (3) code generation with unique specs -- similarity 0.92 can return code for a different task. Also dangerous at threshold < 0.95 -- a 15-20% false positive rate. A good fit for FAQ bots, customer support, and standard Q&A with high overlap"
Q: How does learned routing differ from rule-based and cascade routing?
Red flag: "Learned routing is just more complex rules"
Strong answer: "Rule-based: if-then over keywords/regex, deterministic and transparent, but rigid -- it adapts poorly to new patterns. Cascade: try cheap first, escalate on confidence -- medium complexity, works for heterogeneous traffic. Learned routing: a lightweight classifier or contextual bandit over prompt embeddings + user history + domain tags, adapts to shifting traffic, delivers 50-80% savings vs 30-65% for cascades. Trade-off: high build/debug complexity, low transparency. Best for high-volume production (>100K requests/day)"
Sources¶
- Redis -- "LLMOps Guide 2026: Build Fast, Cost-Effective LLM Apps"
- ByAI Team -- "LLM Routing: Cut Costs Up to 80%, Boost Quality" (Jan 2026)
- GitHub -- "CascadeFlow: Smart AI model cascading for cost optimization"
- arXiv -- "Trust by Design: Skill Profiles for Transparent LLM Routing" (2602.02386)
- arXiv -- RouterBench (2403.12031)
- arXiv -- Semantic Cache Performance (2411.05276)
- Portkey -- "LLM Routing Techniques for High-Volume Applications"
- AWS -- "Optimize LLM Response Costs and Latency with Effective Caching"
- Percona -- "Semantic Caching for LLM Apps: Reduce Costs by 40-80%"
See Also¶
- LLM API Pricing -- pricing tiers of the models that routing switches traffic between
- LLMOps Cost Optimization -- routing as part of a broader strategy: + caching + batching + prompt optimization
- LLM Observability -- monitoring quality/latency/cost after routing decisions, Langfuse traces
- ML System Design Interview Patterns -- model serving at 1M QPS: tiered serving = routing at scale
- Production LLM Deployment -- vLLM, PagedAttention: the infrastructure routing runs on top of