Cascade Routing and LLM Optimization¶
~10 minute read
Prerequisites: LLM API Pricing, Production LLM Deployment
Related file: LLMOps Cost Optimization -- batch processing, cost projection model, LLMOps vs MLOps, Python implementations of SemanticCache and ModelRouter
40-70% of text prompts sent to LLMs do not need a flagship model (CascadeFlow, 2026). Cascade routing is a strategy in which a query first goes to a cheap model (GPT-4o-mini at $0.15/M input tokens) and is escalated to an expensive one (GPT-4o at $2.50/M) only when confidence is low. Combined with semantic caching (40-86% savings, 250x speedup on cache hits) and learned routing (contextual bandits), total savings reach 60-85% with no quality degradation. RouterBench confirms it: multi-LLM routers match or exceed the best single model at 40-60% lower cost.
Pricing note: specific figures are current as of February 2026 and go stale quickly. See the disclaimer in the pricing section.
Key Concepts¶
Model routing -- directing each request to the most cost-effective model that can handle it.
Three Pillars of Optimization¶
| Pillar | What to Measure | Target |
|---|---|---|
| Quality | Accuracy, helpfulness, safety | Task-specific criteria |
| Cost | API fees, GPU time, retries | Total cost of ownership |
| Latency | P50/P95 response times | End-to-end SLA |
Routing Landscape 2026¶
| Strategy | Cost Savings | Complexity | Quality Impact |
|---|---|---|---|
| Semantic caching | 40-86% | Low | None |
| Model cascade | 30-65% | Medium | Minimal |
| Learned routing | 50-80% | High | Optimized |
| Hybrid approach | 60-85% | High | Optimized |
1. Routing Architectures¶
1.1 Rule-Based Routing¶
If-then logic based on keywords, regex, prompt length, task types.
- Route short factual questions -> Mistral-7B
- Route creative writing -> Claude Opus
Pros: transparent, deterministic, easy to govern. Cons: rigid, hard to maintain.
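A minimal rule-based router sketch. The patterns and model names here are illustrative, not a recommended production ruleset:

```python
import re

# Ordered (pattern, model) rules; the first match wins.
RULES = [
    (re.compile(r"\b(poem|story|essay)\b", re.I), "claude-3-opus"),  # creative
    (re.compile(r"\b(what|who|when|where)\b", re.I), "mistral-7b"),  # short factual
]
DEFAULT_MODEL = "gpt-4o"  # fall-through for anything the rules miss

def rule_based_route(query: str) -> str:
    """Return the target model for a query via first-match keyword rules."""
    for pattern, model in RULES:
        if pattern.search(query):
            return model
    return DEFAULT_MODEL
```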
1.2 Cascade Routing (Uncertainty-Based Escalation)¶
Try cheap model first, escalate if confidence low.
```mermaid
graph TD
    Q["Query"] --> SC["Semantic Cache"]
    SC -->|"HIT"| R1["Return (instant, free)"]
    SC -->|"MISS"| T1["Tier 1: Cheap (Haiku, GPT-4o-mini)"]
    T1 -->|"HIGH confidence"| R2["Return"]
    T1 -->|"LOW confidence"| T2["Tier 2: Medium (Sonnet, GPT-4o)"]
    T2 -->|"HIGH confidence"| R3["Return"]
    T2 -->|"LOW confidence"| T3["Tier 3: Premium (Opus, o3)"]
    T3 --> R4["Return (guaranteed)"]
    style Q fill:#e8eaf6,stroke:#3f51b5
    style SC fill:#f3e5f5,stroke:#9c27b0
    style R1 fill:#e8f5e9,stroke:#4caf50
    style T1 fill:#e8f5e9,stroke:#4caf50
    style R2 fill:#e8f5e9,stroke:#4caf50
    style T2 fill:#fff3e0,stroke:#ef6c00
    style R3 fill:#e8f5e9,stroke:#4caf50
    style T3 fill:#fce4ec,stroke:#c62828
    style R4 fill:#e8f5e9,stroke:#4caf50
```
| Metric (CascadeFlow) | Value |
|---|---|
| Text prompts not needing flagship | 40-70% |
| Agent calls not needing flagship | 20-60% |
| Typical cost savings | 30-65% |
| Latency overhead | +5-15% |
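A sketch of the escalation loop behind the diagram above; `call_with_confidence` is an assumed helper (confidence might come from token logprobs or a small verifier model):

```python
TIERS = ["gpt-4o-mini", "gpt-4o", "o3"]  # cheap -> medium -> premium
CONFIDENCE_THRESHOLD = 0.75              # within the 0.7-0.8 range recommended below

def call_with_confidence(model: str, query: str) -> tuple[str, float]:
    """Assumed helper: call `model` and return (answer, confidence in [0, 1])."""
    raise NotImplementedError

def cascade(query: str) -> str:
    """Try cheap tiers first; escalate while confidence stays low."""
    for model in TIERS[:-1]:
        answer, confidence = call_with_confidence(model, query)
        if confidence >= CONFIDENCE_THRESHOLD:
            return answer
    # Final tier always answers -- the "guaranteed" return in the diagram.
    answer, _ = call_with_confidence(TIERS[-1], query)
    return answer
```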
1.3 Learned Routers & Contextual Bandits¶
ML-based routing that adapts over time:
- Lightweight classifier: predicts best LLM from prompt embeddings, user history, domain tags
- Bandit policy: continuously explores routing choices, exploits best outcomes
- Adaptation: adjusts to shifting traffic patterns and model updates
Pros: data-efficient, adaptive, highest optimization potential. Cons: complex to build/debug.
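A toy epsilon-greedy bandit over models. This drops the "contextual" part (no per-query features) to keep the exploration/exploitation loop visible; the reward could be a judge score minus a cost penalty:

```python
import random
from collections import defaultdict

class EpsilonGreedyRouter:
    """Explore a random model with probability epsilon; otherwise exploit
    the model with the best observed mean reward."""

    def __init__(self, models: list[str], epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon
        self.counts: dict[str, int] = defaultdict(int)
        self.means: dict[str, float] = defaultdict(float)

    def choose(self) -> str:
        if random.random() < self.epsilon or not self.counts:
            return random.choice(self.models)
        return max(self.models, key=lambda m: self.means[m])

    def update(self, model: str, reward: float) -> None:
        # Incremental mean update after observing the outcome of a routed query.
        self.counts[model] += 1
        self.means[model] += (reward - self.means[model]) / self.counts[model]
```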
1.4 Architecture Comparison¶
| Architecture | Complexity | Transparency | Adaptability | Best For |
|---|---|---|---|---|
| Rules | Low | High | Low | Starting, regulated envs |
| Cascades | Medium | Medium | Low | Heterogeneous workloads |
| Learned | High | Low | High | High-volume production |
2. Semantic Caching¶
Pipeline¶
```mermaid
graph LR
    Q["Query"] --> E["Embed"]
    E --> S["Similarity Search"]
    S --> T{"Threshold >0.95?"}
    T -->|"HIT"| R1["Return cached (instant)"]
    T -->|"MISS"| L["Call LLM"]
    L --> ST["Store in Cache"]
    ST --> R2["Return"]
    style Q fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8eaf6,stroke:#3f51b5
    style S fill:#f3e5f5,stroke:#9c27b0
    style T fill:#fff3e0,stroke:#ef6c00
    style R1 fill:#e8f5e9,stroke:#4caf50
    style L fill:#fce4ec,stroke:#c62828
    style ST fill:#f3e5f5,stroke:#9c27b0
    style R2 fill:#e8f5e9,stroke:#4caf50
```
Statistics¶
| Metric | Value |
|---|---|
| Cost reduction | 40-86% |
| Response time | 250x faster (hit) |
| Latency reduction | 96.9% (1.67s -> 0.052s for hits) |
| Cache hit improvement | 88% |
Multi-Layer Cache¶
| Layer | Type | Latency | Use Case |
|---|---|---|---|
| L1 | Exact match (key-value) | <1ms | Identical queries |
| L2 | Semantic (vector) | 10-50ms | Similar meaning queries |
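A minimal two-layer cache sketch. `embed_fn` is an assumption (any sentence-embedding model returning unit-norm vectors), and a brute-force scan stands in for a real vector index:

```python
import hashlib
import numpy as np

class TwoLayerCache:
    """L1: exact match on a query hash. L2: cosine similarity over embeddings."""

    def __init__(self, embed_fn, threshold: float = 0.97):
        self.embed_fn = embed_fn    # assumed: text -> unit-norm np.ndarray
        self.threshold = threshold  # conservative default; see the warning below
        self.exact: dict[str, str] = {}                  # L1
        self.entries: list[tuple[np.ndarray, str]] = []  # L2, brute force

    def get(self, query: str) -> str | None:
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:                   # L1 hit: <1ms
            return self.exact[key]
        q = self.embed_fn(query)
        for vec, answer in self.entries:        # L2: linear scan stand-in
            if float(np.dot(q, vec)) >= self.threshold:
                return answer
        return None                             # MISS: caller invokes the LLM

    def put(self, query: str, answer: str) -> None:
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = answer
        self.entries.append((self.embed_fn(query), answer))
```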
A semantic cache with threshold < 0.95 returns WRONG answers
Cosine similarity of 0.90 looks "similar enough", but for LLM queries it is dangerous. "How do I delete a file in Linux?" (similarity 0.91 with "How do I delete a directory in Linux?") calls for completely different commands (rm vs rm -r). In production, a 0.90 threshold produces a 15-20% false positive rate. Rule of thumb: start at 0.97+ and lower only after an A/B test on your own traffic. For code generation and medical use cases, stay at 0.98 or higher.
When to Use¶
| Use Case | Recommended | Reason |
|---|---|---|
| Customer support | Yes | Repeated intents |
| FAQ bots | Yes | High query overlap |
| Code generation | Maybe | Varies by spec |
| Real-time data | No | Needs fresh data |
| Creative writing | No | Unique outputs |
Configuration¶
| Setting | Recommended | Reason |
|---|---|---|
| TTL | 5-60 minutes | Balance freshness vs cache |
| TTL jitter | 10-20% | Prevent thundering herd |
| Similarity threshold | 0.95+ | Avoid false positives |
| Max cache size | Based on memory | LRU eviction |
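TTL jitter is just a random spread on expiry so entries written in the same burst do not all expire at once (the thundering herd in the table above); a sketch:

```python
import random

def jittered_ttl(base_ttl_s: float, jitter_frac: float = 0.15) -> float:
    """Return a TTL with a +/- jitter_frac spread (10-20% per the table above)."""
    return base_ttl_s * random.uniform(1 - jitter_frac, 1 + jitter_frac)

# Usage with a hypothetical cache client: cache.set(key, value, ttl=jittered_ttl(1800))
```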
3. Smart Routing Strategies¶
Pre-Generation Routing¶
```python
def pre_generation_route(query: str) -> str:
    """Classify complexity before generation and pick a model tier accordingly."""
    # classify_complexity and call_model are assumed helpers: a lightweight
    # complexity classifier and a provider-agnostic client wrapper.
    complexity = classify_complexity(query)
    if complexity == "simple":
        return call_model("gpt-3.5-turbo", query)
    elif complexity == "medium":
        return call_model("claude-3-haiku", query)
    else:
        return call_model("gpt-4", query)
```
Select-Then-Route (StR, EMNLP 2025)¶
Two-stage: select model pool -> route within it. Advantage: smaller candidate pool.
Taxonomy-Guided Routing¶
Route by query taxonomy: code queries -> code-specialized model, creative -> creative model.
Feedback-Based Routing¶
Learn from user satisfaction (regenerate, thumbs down). Improves over time.
4. Advanced Optimization¶
Task Shaping¶
Make the task easier before model selection:
- Prompt engineering: clear instructions, few-shot examples
- Structured output: JSON schemas for deterministic parsing
- Pre-processing: summarize long docs before synthesis
Teacher-Student Distillation¶
- Use GPT-4 ("teacher") for sample production requests
- Fine-tune smaller model ("student") on labeled data
- Route majority of traffic to student
- Escalate to teacher for novel/high-stakes cases
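A sketch of the serving-side decision for this pattern; `is_novel`, `is_high_stakes`, and `call_model` are assumed helpers (novelty might be embedding distance from the student's training data):

```python
def distillation_route(query: str) -> str:
    """Serve the fine-tuned student by default; escalate rare cases to the teacher."""
    if is_novel(query) or is_high_stakes(query):  # assumed detectors
        return call_model("gpt-4", query)          # teacher handles the long tail
    return call_model("ft:student-model", query)   # student takes the bulk of traffic
```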
Predictive Latency Modeling¶
- Token count estimation: predict response time from input length
- Load-based selection: choose model within SLA under current load
- Speculative execution: pre-compute responses, serve best instantly
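A naive version of the token-count estimator: latency ≈ queueing + prefill + decode. All constants are illustrative, not measured:

```python
def estimate_latency_s(input_tokens: int, expected_output_tokens: int,
                       prefill_tps: float = 5000.0,  # illustrative prompt-processing speed
                       decode_tps: float = 60.0,     # illustrative generation speed
                       queue_s: float = 0.2) -> float:
    """Rough response-time prediction from token counts alone."""
    return queue_s + input_tokens / prefill_tps + expected_output_tokens / decode_tps

# Load-based selection: pick the cheapest model whose estimate fits the SLA.
```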
5. Fallback Strategies¶
Triggers¶
| Trigger | Action |
|---|---|
| Model failure | Try next model in cascade |
| High latency | Switch to faster model |
| Rate limit | Queue or switch provider |
| Quality threshold not met | Escalate to stronger model |
Best Practices¶
| Practice | Description |
|---|---|
| Cross-provider | Don't fallback within same provider |
| Health checks | Monitor model availability |
| Circuit breaker | Skip failing models temporarily |
| Graceful degradation | Inform user of fallback |
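A sketch combining a cross-provider order with a per-provider circuit breaker; `call_provider` is an assumed client wrapper:

```python
import time

class CircuitBreaker:
    """Skip a provider for cooldown_s after max_failures consecutive errors."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 60.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures: dict[str, int] = {}
        self.opened_at: dict[str, float] = {}

    def available(self, provider: str) -> bool:
        opened = self.opened_at.get(provider)
        return opened is None or time.time() - opened > self.cooldown_s

    def record(self, provider: str, ok: bool) -> None:
        if ok:
            self.failures[provider] = 0
            self.opened_at.pop(provider, None)
            return
        self.failures[provider] = self.failures.get(provider, 0) + 1
        if self.failures[provider] >= self.max_failures:
            self.opened_at[provider] = time.time()

def call_provider(provider: str, query: str) -> str:
    """Assumed helper: invoke the provider's API and return the answer."""
    raise NotImplementedError

PROVIDERS = ["openai", "anthropic", "google"]  # cross-provider, not same-provider

def call_with_fallback(query: str, breaker: CircuitBreaker) -> str:
    for provider in PROVIDERS:
        if not breaker.available(provider):
            continue  # breaker open: skip the whole provider, not a single model
        try:
            answer = call_provider(provider, query)
            breaker.record(provider, ok=True)
            return answer
        except Exception:
            breaker.record(provider, ok=False)
    raise RuntimeError("all providers unavailable")
```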
6. Evaluation & Deployment¶
Golden Set¶
Representative data + edge cases + ground truth (human ratings, LLM-as-judge calibration).
Evaluation Methods¶
| Method | Use Case | Cost |
|---|---|---|
| LLM-as-judge | Format, reasoning, factual consistency | $0.05-0.50 per 1K |
| User feedback | Thumbs up/down, escalation signals | Free (implicit) |
| Human raters | Edge cases, quality audits | $1-5 per batch |
Deployment Strategy¶
- Shadow mode: new policy runs parallel, log decisions without impact
- Counterfactual logging: compute potential regret/improvement
- A/B testing: measure impact on business KPIs
- Gradual rollout: 1% -> 5% -> 50% -> 100%
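A shadow-mode sketch matching the first bullet above: the candidate policy runs on live traffic, but only its decision is logged, never served:

```python
import json
import logging

logger = logging.getLogger("router.shadow")

def route_with_shadow(query: str, prod_policy, shadow_policy) -> str:
    """Serve the production decision; log the shadow decision for offline comparison."""
    prod_model = prod_policy(query)
    try:
        shadow_model = shadow_policy(query)  # must never touch the user-facing path
        logger.info(json.dumps({
            "prod": prod_model,
            "shadow": shadow_model,
            "agree": prod_model == shadow_model,
        }))
    except Exception:
        logger.exception("shadow policy failed")  # failures stay invisible to users
    return prod_model
```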
RouterBench Results¶
| Finding | Impact |
|---|---|
| Quality match | Multi-LLM routers match or exceed best single model |
| Cost reduction | 40-60% average |
| Latency overhead | 5-20ms for routing decision |
7. Production Configuration¶
Routing Config¶
| Setting | Recommended |
|---|---|
| Confidence threshold | 0.7-0.8 |
| Max cascade steps | 3 |
| Timeout per model | 5-30 seconds |
| Retry attempts | 2 |
Monitoring Metrics¶
| Metric | Alert Threshold |
|---|---|
| Cache hit rate | < 30% |
| Cascade escalation rate | > 50% |
| Model latency | > 5s P95 |
| Fallback rate | > 10% |
| Cost per query | > baseline + 20% |
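These thresholds translate directly into alert rules; a minimal check, with values taken from the table above:

```python
# Each entry: metric name -> predicate that returns True when the alert fires.
ALERTS = {
    "cache_hit_rate":  lambda v: v < 0.30,
    "escalation_rate": lambda v: v > 0.50,
    "p95_latency_s":   lambda v: v > 5.0,
    "fallback_rate":   lambda v: v > 0.10,
}

def firing_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics currently breaching their thresholds."""
    return [name for name, breached in ALERTS.items()
            if name in metrics and breached(metrics[name])]
```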
Governance¶
- Routing rules in git/feature flags
- Auto-shift to conservative models near budget limits
- Multiple providers for fallback
- Circuit breakers for failing models
8. Gateway Tools¶
Commercial¶
| Gateway | Features | Best For |
|---|---|---|
| Portkey | Routing, caching, fallback | Production |
| LiteLLM | Unified API, routing | Development |
| Unify | Multi-provider routing | Cost optimization |
| OpenRouter | Model marketplace | Flexibility |
| Bifrost | Enterprise routing | Enterprise |
Open-Source¶
| Tool | Focus |
|---|---|
| CascadeFlow | Cascade routing (lemony-ai/cascadeflow) |
| LiteLLM | Unified interface (BerriAI/litellm) |
| GPTCache | Semantic caching (zilliztech/GPTCache) |
For Interviews¶
Q: "How do you optimize the cost of an LLM system?"¶
Three strategies: (1) Semantic caching -- 40-86% savings, 250x faster on cache hits, similarity threshold 0.95+. (2) Cascade routing -- try a cheap model first (GPT-3.5/Haiku), escalate only when confidence is low; 40-70% of queries do not need a flagship model (CascadeFlow); savings of 30-65%. (3) Learned routing -- contextual bandit / lightweight classifier, 50-80% savings, adapts over time. Combined: 60-85%.
Q: "Design an LLM routing system."¶
Components: (1) Request classifier (embeddings + complexity features). (2) Semantic cache (L1 exact + L2 vector, similarity > 0.95). (3) Model cascade (3 tiers: cheap/medium/premium, confidence threshold 0.7-0.8). (4) Fallback (cross-provider, circuit breaker). (5) Monitoring (cache hit rate, escalation rate, cost per query). Deployment: shadow mode -> A/B test -> gradual rollout. ROI: semantic caching pays back in 1-2 weeks.
Key Numbers¶
| Fact | Value |
|---|---|
| Semantic caching cost reduction | 40-86% |
| Semantic caching speedup | 250x |
| Cascade routing savings | 30-65% |
| Queries not needing flagship (text) | 40-70% |
| Queries not needing flagship (agents) | 20-60% |
| Combined routing savings | 60-85% |
| Routing decision overhead | 5-20ms |
| RouterBench cost reduction | 40-60% |
| Query distribution: simple | 40-60% |
| Routing ROI (caching) | 1-2 weeks |
Model Cost Comparison (per 1M tokens)¶
| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet | $3.00 | $15.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Haiku | $0.25 | $1.25 |
| Gemini Flash | $0.075 | $0.30 |
| Llama 3.1 8B (self-hosted) | ~$0.10 | ~$0.10 |
Misconception: cascade routing always saves money
A cascade adds +5-15% latency overhead and requires confidence estimation at every tier. If the escalation rate exceeds 50%, savings are minimal while latency grows. For homogeneous workloads (all queries of the same complexity) a cascade is useless -- a learned router or direct routing to a single model is more effective. A cascade pays off with heterogeneous traffic where 40%+ of queries are simple.
Misconception: a semantic cache with cosine similarity 0.90 is accurate enough
At a 0.90 threshold the false positive rate reaches 15-20%. "How do I delete a file in Linux?" (similarity 0.91 with "How do I delete a directory in Linux?") calls for completely different commands (rm vs rm -r). In production start at 0.97+; for code generation and medical use cases, no lower than 0.98. Lower the threshold only after an A/B test on your own traffic.
Misconception: fallback within a single provider is a reliable strategy
If OpenAI is rate-limited or down, falling back to another OpenAI model is useless -- the problem is at the provider level. Cross-provider fallback (OpenAI -> Anthropic -> Google) is a mandatory pattern. The circuit breaker should temporarily exclude the failing provider as a whole, not just an individual model.
Interview Questions¶
Q: Design an LLM routing system for a high-traffic service.
Red flag: "Route all requests to GPT-4 for maximum quality"
Strong answer: "Five components: (1) a request classifier over embeddings + complexity features; (2) a semantic cache -- L1 exact match (<1ms) + L2 vector (10-50ms), threshold >0.95; (3) a 3-tier model cascade: cheap (Haiku, $0.25/M) -> medium (Sonnet, $3/M) -> premium (Opus), confidence threshold 0.7-0.8; (4) cross-provider fallback with a circuit breaker; (5) monitoring -- cache hit rate, escalation rate >50% = alert, cost per query. Deployment: shadow mode -> A/B -> gradual rollout. Semantic caching ROI: 1-2 weeks"
Q: When is semantic caching harmful?
Red flag: "Semantic caching is always useful, it saves money"
Strong answer: "Three cases: (1) real-time data -- a cached answer goes stale instantly (stock quotes, weather); (2) creative writing -- every request is unique, overlap is near zero, cache hit rate < 5%; (3) code generation with unique specs -- similarity 0.92 can return code for a different task. Also dangerous at threshold < 0.95 -- a 15-20% false positive rate. A good fit for FAQ bots, customer support, and standard Q&A with high overlap"
Q: How does learned routing differ from rule-based and cascade routing?
Red flag: "Learned routing is just more complex rules"
Strong answer: "Rule-based: if-then over keywords/regex, deterministic and transparent, but rigid -- it adapts poorly to new patterns. Cascade: try cheap first, escalate on confidence -- medium complexity, works for heterogeneous traffic. Learned routing: a lightweight classifier or contextual bandit over prompt embeddings + user history + domain tags, adapts to shifting traffic, delivers 50-80% savings vs 30-65% for cascades. Trade-off: high build/debug complexity, low transparency. Best for high-volume production (>100K requests/day)"
Sources¶
- Redis -- "LLMOps Guide 2026: Build Fast, Cost-Effective LLM Apps"
- ByAI Team -- "LLM Routing: Cut Costs Up to 80%, Boost Quality" (Jan 2026)
- GitHub -- "CascadeFlow: Smart AI model cascading for cost optimization"
- arXiv -- "Trust by Design: Skill Profiles for Transparent LLM Routing" (2602.02386)
- arXiv -- RouterBench (2403.12031)
- arXiv -- Semantic Cache Performance (2411.05276)
- Portkey -- "LLM Routing Techniques for High-Volume Applications"
- AWS -- "Optimize LLM Response Costs and Latency with Effective Caching"
- Percona -- "Semantic Caching for LLM Apps: Reduce Costs by 40-80%"
See Also¶
- LLM API Pricing -- pricing tiers of the models that routing switches traffic between
- LLMOps Cost Optimization -- routing as part of a broader strategy: + caching + batching + prompt optimization
- LLM Observability -- monitoring quality/latency/cost after routing decisions, Langfuse traces
- ML System Design Interview Patterns -- model serving at 1M QPS: tiered serving = routing at scale
- Production LLM Deployment -- vLLM, PagedAttention: the infrastructure routing runs on top of