
Cascade Routing and LLM Optimization

~10 minute read

Prerequisites: LLM API Pricing, Production LLM Deployment

Related file: LLMOps Cost Optimization -- batch processing, a cost projection model, LLMOps vs MLOps, Python implementations of SemanticCache and ModelRouter

40-70% of text queries sent to LLMs don't need a flagship model (CascadeFlow, 2026). Cascade routing is a strategy where a query first goes to a cheap model (GPT-4o-mini at $0.15/M input) and is escalated to an expensive one (GPT-4o at $2.50/M) only when confidence is low. Combined with semantic caching (40-86% savings, 250x speedup on cache hits) and learned routing (contextual bandits), total savings reach 60-85% with no quality degradation. RouterBench confirms it: multi-LLM routers match or exceed the best single model at 40-60% lower cost.

Pricing: specific figures are current as of February 2026 and go stale quickly. See the disclaimer in the pricing section.


Key Concepts

Model routing -- directing each query to the most cost-effective model that still meets the quality bar.

\[\text{Optimal Model} = \arg\min_{m \in M} \{Cost(m) : Quality(m, q) \geq \theta_q\}\]
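
A minimal sketch of this selection rule in Python; the cost and quality numbers are illustrative placeholders, not benchmark results:

```python
# Sketch of the constrained argmin above. Quality scores are
# illustrative placeholders, not benchmark results.
MODELS = {
    "gpt-4o-mini":  {"cost_per_m": 0.15, "quality": 0.78},
    "claude-haiku": {"cost_per_m": 0.25, "quality": 0.80},
    "gpt-4o":       {"cost_per_m": 2.50, "quality": 0.92},
}

def optimal_model(quality_floor: float) -> str:
    """Cheapest model whose estimated quality meets the floor."""
    eligible = [m for m, v in MODELS.items() if v["quality"] >= quality_floor]
    if not eligible:
        raise ValueError("no model meets the quality floor")
    return min(eligible, key=lambda m: MODELS[m]["cost_per_m"])

print(optimal_model(0.80))  # -> claude-haiku
```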

Three Pillars of Optimization

| Pillar | What to Measure | Target |
|---|---|---|
| Quality | Accuracy, helpfulness, safety | Task-specific criteria |
| Cost | API fees, GPU time, retries | Total cost of ownership |
| Latency | P50/P95 response times | End-to-end SLA |

Routing Landscape 2026

| Strategy | Cost Savings | Complexity | Quality Impact |
|---|---|---|---|
| Semantic caching | 40-86% | Low | None |
| Model cascade | 30-65% | Medium | Minimal |
| Learned routing | 50-80% | High | Optimized |
| Hybrid approach | 60-85% | High | Optimized |

1. Routing Architectures

1.1 Rule-Based Routing

If-then logic based on keywords, regex, prompt length, task types.

  • Route short factual questions -> Mistral-7B
  • Route creative writing -> Claude Opus

Pros: transparent, deterministic, easy to govern. Cons: rigid, hard to maintain.
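
A minimal rule-based router might look like the sketch below; the patterns and model names are illustrative stand-ins, not a recommended production rule set.

```python
import re

# Hedged sketch of rule-based routing: first matching rule wins,
# otherwise fall through to cheap defaults by prompt length.
RULES = [
    (re.compile(r"\b(poem|story|essay)\b", re.I), "claude-opus"),  # creative writing
    (re.compile(r"\b(def|class|import)\b"), "gpt-4o"),             # code-heavy prompts
]

def rule_based_route(prompt: str) -> str:
    for pattern, model in RULES:
        if pattern.search(prompt):
            return model
    # Short factual questions and everything else default to cheap models.
    return "mistral-7b" if len(prompt) < 200 else "gpt-4o-mini"

print(rule_based_route("Write a poem about routing"))  # -> claude-opus
```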

1.2 Cascade Routing (Uncertainty-Based Escalation)

Try cheap model first, escalate if confidence low.

```mermaid
graph TD
    Q["Query"] --> SC["Semantic Cache"]
    SC -->|"HIT"| R1["Return (instant, free)"]
    SC -->|"MISS"| T1["Tier 1: Cheap (Haiku, GPT-4o-mini)"]
    T1 -->|"HIGH confidence"| R2["Return"]
    T1 -->|"LOW confidence"| T2["Tier 2: Medium (Sonnet, GPT-4o)"]
    T2 -->|"HIGH confidence"| R3["Return"]
    T2 -->|"LOW confidence"| T3["Tier 3: Premium (Opus, o3)"]
    T3 --> R4["Return (guaranteed)"]

    style Q fill:#e8eaf6,stroke:#3f51b5
    style SC fill:#f3e5f5,stroke:#9c27b0
    style R1 fill:#e8f5e9,stroke:#4caf50
    style T1 fill:#e8f5e9,stroke:#4caf50
    style R2 fill:#e8f5e9,stroke:#4caf50
    style T2 fill:#fff3e0,stroke:#ef6c00
    style R3 fill:#e8f5e9,stroke:#4caf50
    style T3 fill:#fce4ec,stroke:#c62828
    style R4 fill:#e8f5e9,stroke:#4caf50
```

\[\text{Savings} = 1 - \sum_{i=1}^{n} P(\text{reach tier } i) \times \frac{Cost_i}{Cost_{max}}\]
| Metric (CascadeFlow) | Value |
|---|---|
| Text prompts not needing flagship | 40-70% |
| Agent calls not needing flagship | 20-60% |
| Typical cost savings | 30-65% |
| Latency overhead | +5-15% |
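
A sketch of the cascade loop plus the savings formula above. call_model is an assumed helper returning an answer and a confidence score (e.g. from logprobs or a verifier model), and the tier costs and escalation probabilities in the worked example are illustrative.

```python
def call_model(model: str, query: str) -> tuple[str, float]:
    """Assumed helper: returns (answer, confidence in [0, 1])."""
    raise NotImplementedError  # wire up your provider SDK / verifier here

# (model, confidence needed to stop at this tier); the last tier always answers.
TIERS = [
    ("gpt-4o-mini", 0.75),
    ("gpt-4o", 0.75),
    ("o3", 0.0),
]

def cascade(query: str) -> str:
    for model, threshold in TIERS:
        answer, confidence = call_model(model, query)
        if confidence >= threshold:
            return answer
    return answer  # defensive; the final tier's 0.0 threshold always stops

def expected_savings(p_reach: list[float], costs: list[float]) -> float:
    """Savings = 1 - sum_i P(reach tier i) * cost_i / max cost (formula above)."""
    return 1 - sum(p * c / max(costs) for p, c in zip(p_reach, costs))

# Worked example with illustrative costs: everyone hits tier 1,
# 35% escalate to tier 2, 10% reach tier 3.
print(expected_savings([1.0, 0.35, 0.10], [0.15, 2.50, 10.00]))  # ~0.80
```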

1.3 Learned Routers & Contextual Bandits

ML-based routing that adapts over time:

  • Lightweight classifier: predicts best LLM from prompt embeddings, user history, domain tags
  • Bandit policy: continuously explores routing choices, exploits best outcomes
  • Adaptation: adjusts to shifting traffic patterns and model updates

Pros: data-efficient, adaptive, highest optimization potential. Cons: complex to build/debug.
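
A minimal epsilon-greedy sketch. A real learned router conditions on prompt embeddings (a contextual bandit), which this drops for brevity; the reward design (e.g. user feedback minus a cost penalty) is an assumption.

```python
import random
from collections import defaultdict

# Epsilon-greedy bandit over models: explore occasionally, otherwise
# exploit the model with the best running mean reward.
class BanditRouter:
    def __init__(self, models: list[str], epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon
        self.counts = defaultdict(int)
        self.values = defaultdict(float)  # running mean reward per model

    def choose(self) -> str:
        if random.random() < self.epsilon:  # explore
            return random.choice(self.models)
        return max(self.models, key=lambda m: self.values[m])  # exploit

    def update(self, model: str, reward: float) -> None:
        # Incremental mean update after observing an outcome.
        self.counts[model] += 1
        self.values[model] += (reward - self.values[model]) / self.counts[model]
```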

1.4 Architecture Comparison

| Architecture | Complexity | Transparency | Adaptability | Best For |
|---|---|---|---|---|
| Rules | Low | High | Low | Starting out, regulated envs |
| Cascades | Medium | Medium | Low | Heterogeneous workloads |
| Learned | High | Low | High | High-volume production |

2. Semantic Caching

Pipeline

```mermaid
graph LR
    Q["Query"] --> E["Embed"]
    E --> S["Similarity Search"]
    S --> T{"Threshold >0.95?"}
    T -->|"HIT"| R1["Return cached (instant)"]
    T -->|"MISS"| L["Call LLM"]
    L --> ST["Store in Cache"]
    ST --> R2["Return"]

    style Q fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8eaf6,stroke:#3f51b5
    style S fill:#f3e5f5,stroke:#9c27b0
    style T fill:#fff3e0,stroke:#ef6c00
    style R1 fill:#e8f5e9,stroke:#4caf50
    style L fill:#fce4ec,stroke:#c62828
    style ST fill:#f3e5f5,stroke:#9c27b0
    style R2 fill:#e8f5e9,stroke:#4caf50
```

Statistics

| Metric | Value |
|---|---|
| Cost reduction | 40-86% |
| Response time | 250x faster (hit) |
| Latency reduction | 96.9% (1.67s -> 0.052s for hits) |
| Cache hit improvement | 88% |

Multi-Layer Cache

| Layer | Type | Latency | Use Case |
|---|---|---|---|
| L1 | Exact match (key-value) | <1ms | Identical queries |
| L2 | Semantic (vector) | 10-50ms | Similar-meaning queries |

A semantic cache with a threshold < 0.95 returns WRONG answers

Cosine similarity 0.90 looks "similar enough", but for LLM queries it is dangerous. "How do I delete a file in Linux?" (similarity 0.91 with "How do I delete a directory in Linux?") requires a completely different command (rm vs rm -r). In production, a 0.90 threshold reaches a 15-20% false positive rate. Rule of thumb: start at 0.97+ and lower it only after an A/B test on your own traffic. For code generation and medical use cases, never go below 0.98.

When to Use

| Use Case | Recommended | Reason |
|---|---|---|
| Customer support | Yes | Repeated intents |
| FAQ bots | Yes | High query overlap |
| Code generation | Maybe | Varies by spec |
| Real-time data | No | Needs fresh data |
| Creative writing | No | Unique outputs |

Configuration

| Setting | Recommended | Reason |
|---|---|---|
| TTL | 5-60 minutes | Balance freshness vs hit rate |
| TTL jitter | 10-20% | Prevent thundering herd |
| Similarity threshold | 0.95+ | Avoid false positives |
| Max cache size | Based on memory | LRU eviction |
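
A sketch of the L2 (vector) cache layer wired to the settings above. It assumes queries arrive as unit-normalized embedding vectors and uses a linear scan where a production cache would use a vector index.

```python
import random
import time
import numpy as np

# Sketch of the L2 semantic cache layer. Vectors are assumed to be
# unit-normalized, so dot product equals cosine similarity.
class SemanticCache:
    def __init__(self, threshold=0.97, ttl_s=1800, jitter=0.15):
        self.threshold = threshold  # start high; lower only after A/B testing
        self.ttl_s = ttl_s
        self.jitter = jitter
        self.entries = []  # list of (vector, answer, expires_at)

    def get(self, query_vec):
        now = time.time()
        self.entries = [e for e in self.entries if e[2] > now]  # drop expired
        for vec, answer, _ in self.entries:
            if float(np.dot(query_vec, vec)) >= self.threshold:
                return answer
        return None  # cache miss: call the LLM, then put() the result

    def put(self, query_vec, answer):
        # Jittered TTL spreads out expiries and prevents a thundering herd.
        ttl = self.ttl_s * (1 + random.uniform(-self.jitter, self.jitter))
        self.entries.append((query_vec, answer, time.time() + ttl))
```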

3. Smart Routing Strategies

Pre-Generation Routing

```python
def pre_generation_route(query):
    # classify_complexity and call_model are assumed helpers: a cheap
    # classifier (small model or heuristic) and a provider-agnostic call.
    complexity = classify_complexity(query)
    if complexity == "simple":
        return call_model("gpt-3.5-turbo", query)   # cheapest tier
    elif complexity == "medium":
        return call_model("claude-3-haiku", query)  # mid tier
    else:
        return call_model("gpt-4", query)           # premium tier
```

Select-Then-Route (StR, EMNLP 2025)

Two-stage: select model pool -> route within it. Advantage: smaller candidate pool.

Taxonomy-Guided Routing

Route by query taxonomy: code queries -> code-specialized model, creative -> creative model.

Feedback-Based Routing

Learn from user satisfaction (regenerate, thumbs down). Improves over time.


4. Advanced Optimization

Task Shaping

Make the task easier before model selection:

  • Prompt engineering: clear instructions, few-shot examples
  • Structured output: JSON schemas for deterministic parsing
  • Pre-processing: summarize long docs before synthesis

Teacher-Student Distillation

  1. Use GPT-4 ("teacher") for sample production requests
  2. Fine-tune smaller model ("student") on labeled data
  3. Route majority of traffic to student
  4. Escalate to teacher for novel/high-stakes cases

Predictive Latency Modeling

  • Token count estimation: predict response time from input length
  • Load-based selection: choose model within SLA under current load
  • Speculative execution: pre-compute responses, serve best instantly

5. Fallback Strategies

Triggers

| Trigger | Action |
|---|---|
| Model failure | Try next model in cascade |
| High latency | Switch to faster model |
| Rate limit | Queue or switch provider |
| Quality threshold not met | Escalate to stronger model |

Best Practices

| Practice | Description |
|---|---|
| Cross-provider | Don't fall back within the same provider |
| Health checks | Monitor model availability |
| Circuit breaker | Skip failing models temporarily |
| Graceful degradation | Inform the user of the fallback |
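
A sketch combining cross-provider fallback with a simple circuit breaker; call_provider, the provider list, and the failure/cooldown constants are assumptions.

```python
import time

FAILURE_LIMIT = 3   # consecutive failures before the breaker opens
COOLDOWN_S = 60     # how long an open breaker skips the provider

failures: dict[str, tuple[int, float]] = {}  # provider -> (count, last failure ts)

def call_provider(provider: str, query: str) -> str:
    """Assumed helper: one completion call via the given provider's SDK."""
    raise NotImplementedError

def breaker_open(provider: str) -> bool:
    count, last = failures.get(provider, (0, 0.0))
    return count >= FAILURE_LIMIT and time.time() - last < COOLDOWN_S

def route_with_fallback(query: str, providers=("openai", "anthropic", "google")):
    for provider in providers:
        if breaker_open(provider):
            continue  # skip the entire failing provider, not just one model
        try:
            answer = call_provider(provider, query)
            failures[provider] = (0, 0.0)  # success resets the breaker
            return answer
        except Exception:
            count, _ = failures.get(provider, (0, 0.0))
            failures[provider] = (count + 1, time.time())
    raise RuntimeError("all providers unavailable")
```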

6. Evaluation & Deployment

Golden Set

Representative data + edge cases + ground truth (human ratings, LLM-as-judge calibration).

Evaluation Methods

| Method | Use Case | Cost |
|---|---|---|
| LLM-as-judge | Format, reasoning, factual consistency | $0.05-0.50 per 1K |
| User feedback | Thumbs up/down, escalation signals | Free (implicit) |
| Human raters | Edge cases, quality audits | $1-5 per batch |

Deployment Strategy

  1. Shadow mode: new policy runs parallel, log decisions without impact
  2. Counterfactual logging: compute potential regret/improvement
  3. A/B testing: measure impact on business KPIs
  4. Gradual rollout: 1% -> 5% -> 50% -> 100%
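
A sketch of step 1, shadow mode: the candidate policy decides in parallel, its choice is logged, and only the production decision is served. call_model is an assumed helper, and the routers are callables mapping query -> model name.

```python
import json
import time

def handle(query, prod_router, shadow_router, log_file):
    prod_model = prod_router(query)
    shadow_model = shadow_router(query)  # decision only; no extra LLM call
    log_file.write(json.dumps({
        "ts": time.time(),
        "prod": prod_model,
        "shadow": shadow_model,
        "agree": prod_model == shadow_model,  # feeds counterfactual analysis
    }) + "\n")
    return call_model(prod_model, query)  # assumed helper; production path only
```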

RouterBench Results

| Finding | Impact |
|---|---|
| Quality match | Multi-LLM routers match or exceed the best single model |
| Cost reduction | 40-60% average |
| Latency overhead | 5-20ms for the routing decision |

7. Production Configuration

Routing Config

| Setting | Recommended |
|---|---|
| Confidence threshold | 0.7-0.8 |
| Max cascade steps | 3 |
| Timeout per model | 5-30 seconds |
| Retry attempts | 2 |

Monitoring Metrics

| Metric | Alert Threshold |
|---|---|
| Cache hit rate | < 30% |
| Cascade escalation rate | > 50% |
| Model latency | > 5s P95 |
| Fallback rate | > 10% |
| Cost per query | > baseline + 20% |
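
The alert thresholds above can be encoded as a single table-driven check; the metric names are illustrative and would be wired to your metrics backend.

```python
# Each check takes (current value, baseline cost) and returns True if firing.
ALERTS = {
    "cache_hit_rate":  lambda v, base: v < 0.30,
    "escalation_rate": lambda v, base: v > 0.50,
    "p95_latency_s":   lambda v, base: v > 5.0,
    "fallback_rate":   lambda v, base: v > 0.10,
    "cost_per_query":  lambda v, base: v > base * 1.20,
}

def firing_alerts(metrics: dict[str, float], baseline_cost: float) -> list[str]:
    return [name for name, check in ALERTS.items()
            if check(metrics[name], baseline_cost)]
```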

Governance

  • Routing rules in git/feature flags
  • Auto-shift to conservative models near budget limits
  • Multiple providers for fallback
  • Circuit breakers for failing models

8. Gateway Tools

Commercial

| Gateway | Features | Best For |
|---|---|---|
| Portkey | Routing, caching, fallback | Production |
| LiteLLM | Unified API, routing | Development |
| Unify | Multi-provider routing | Cost optimization |
| OpenRouter | Model marketplace | Flexibility |
| Bifrost | Enterprise routing | Enterprise |

Open-Source

| Tool | Focus |
|---|---|
| CascadeFlow | Cascade routing (lemony-ai/cascadeflow) |
| LiteLLM | Unified interface (BerriAI/litellm) |
| GPTCache | Semantic caching (zilliztech/GPTCache) |

For Interviews

Q: "Как оптимизировать стоимость LLM-системы?"

Three strategies: (1) Semantic caching -- 40-86% savings, 250x faster on cache hits, similarity threshold 0.95+. (2) Cascade routing -- try a cheap model first (GPT-3.5/Haiku), escalate only when confidence is low; 40-70% of queries don't need a flagship model (CascadeFlow); savings 30-65%. (3) Learned routing -- contextual bandit / lightweight classifier, 50-80% savings, adapts over time. Combined: 60-85%.

Q: "Design an LLM routing system."

Components: (1) Request classifier (embeddings + complexity features). (2) Semantic cache (L1 exact + L2 vector, similarity > 0.95). (3) Model cascade (3 tiers: cheap/medium/premium, confidence threshold 0.7-0.8). (4) Fallback (cross-provider, circuit breaker). (5) Monitoring (cache hit rate, escalation rate, cost per query). Deployment: shadow mode -> A/B test -> gradual rollout. ROI: semantic caching pays back in 1-2 weeks.

Key Numbers

| Fact | Value |
|---|---|
| Semantic caching cost reduction | 40-86% |
| Semantic caching speedup | 250x |
| Cascade routing savings | 30-65% |
| Queries not needing flagship (text) | 40-70% |
| Queries not needing flagship (agents) | 20-60% |
| Combined routing savings | 60-85% |
| Routing decision overhead | 5-20ms |
| RouterBench cost reduction | 40-60% |
| Query distribution: simple | 40-60% |
| Routing ROI (caching) | 1-2 weeks |

Model Cost Comparison (per 1M tokens)

| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet | $3.00 | $15.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Haiku | $0.25 | $1.25 |
| Gemini Flash | $0.075 | $0.30 |
| Llama 3.1 8B (self-hosted) | ~$0.10 | ~$0.10 |

Misconception: cascade routing always saves money

A cascade adds +5-15% latency overhead and requires confidence estimation at every tier. If the escalation rate exceeds 50%, the savings are minimal while latency grows. For homogeneous workloads (all queries of the same complexity) a cascade is useless -- a learned router or direct routing to a single model is more effective. A cascade pays off with heterogeneous traffic where 40%+ of queries are simple.

Misconception: a semantic cache with cosine similarity 0.90 is accurate enough

At a 0.90 threshold the false positive rate reaches 15-20%. "How do I delete a file in Linux?" (similarity 0.91 with "How do I delete a directory in Linux?") requires a completely different command (rm vs rm -r). In production start at 0.97+; for code generation and medical, never below 0.98. Lower it only after an A/B test on your own traffic.

Misconception: fallback within a single provider is a reliable strategy

If OpenAI is rate-limited or down, falling back to another OpenAI model is useless -- the problem is at the provider level. Cross-provider fallback (OpenAI -> Anthropic -> Google) is a mandatory pattern. The circuit breaker should temporarily exclude the entire failing provider, not just an individual model.


Interview Questions

Q: Design an LLM routing system for a high-traffic service.

❌ Red flag: "Направляем все запросы на GPT-4 для максимального качества"

✅ Strong answer: "5 компонентов: (1) Request classifier на embeddings + complexity features; (2) Semantic cache -- L1 exact match (<1ms) + L2 vector (10-50ms), threshold >0.95; (3) Model cascade из 3 тиров: cheap (Haiku, $0.25/M) -> medium (Sonnet, $3/M) -> premium (Opus), confidence threshold 0.7-0.8; (4) Cross-provider fallback с circuit breaker; (5) Monitoring -- cache hit rate, escalation rate >50% = alert, cost per query. Deployment: shadow mode -> A/B -> gradual rollout. ROI semantic caching: 1-2 недели"

Q: When is semantic caching harmful?

❌ Red flag: "Semantic caching всегда полезен, он экономит деньги"

✅ Strong answer: "Три случая: (1) Real-time data -- cached ответ устаревает мгновенно (биржевые котировки, погода); (2) Creative writing -- каждый запрос уникален, overlap близок к нулю, cache hit rate < 5%; (3) Code generation с уникальными спецификациями -- similarity 0.92 может вернуть код для другой задачи. Также опасен при threshold < 0.95 -- false positive rate 15-20%. Подходит для FAQ-ботов, customer support, стандартных Q&A с высоким overlap"

Q: How does learned routing differ from rule-based and cascade?

❌ Red flag: "Learned routing -- это просто более сложные правила"

✅ Strong answer: "Rule-based: if-then по keywords/regex, детерминированный, transparent, но rigid -- плохо адаптируется к новым паттернам. Cascade: try cheap first, escalate по confidence -- medium complexity, работает при heterogeneous traffic. Learned routing: lightweight classifier или contextual bandit на prompt embeddings + user history + domain tags, адаптируется к shifting traffic, дает 50-80% savings vs 30-65% у cascade. Trade-off: complexity build/debug высокая, transparency низкая. Best for: high-volume production (>100K запросов/день)"


Sources

  1. Redis -- "LLMOps Guide 2026: Build Fast, Cost-Effective LLM Apps"
  2. ByAI Team -- "LLM Routing: Cut Costs Up to 80%, Boost Quality" (Jan 2026)
  3. GitHub -- "CascadeFlow: Smart AI model cascading for cost optimization"
  4. arXiv -- "Trust by Design: Skill Profiles for Transparent LLM Routing" (2602.02386)
  5. arXiv -- RouterBench (2403.12031)
  6. arXiv -- Semantic Cache Performance (2411.05276)
  7. Portkey -- "LLM Routing Techniques for High-Volume Applications"
  8. AWS -- "Optimize LLM Response Costs and Latency with Effective Caching"
  9. Percona -- "Semantic Caching for LLM Apps: Reduce Costs by 40-80%"

See Also