Продакшен деплой LLM: практики и кейсы¶

~7 минут чтения

Предварительно: vLLM и Paged Attention, Квантизация LLM

Разница между "работает в ноутбуке" и "обслуживает 10K пользователей" -- это 6-12 месяцев инженерной работы. В 2026 году 78% enterprise-команд сократили time-to-market AI-фич на 30% благодаря стандартизированным паттернам деплоя: semantic routing снижает cost на 48%, continuous batching утилизирует GPU на 70-90% вместо 15-30%, а autoscaling по queue depth предотвращает и переплату за idle GPU, и деградацию latency при пиковых нагрузках. Ключевой инсайт: модель -- это 20% production системы, остальные 80% -- routing, monitoring, scaling, fallback, cost management.

URL: Multiple sources (Ryz Labs, Iterathon) Тип: production / case-study / best-practices Дата: Январь-Февраль 2026 Сбор: Ralph Research ФАЗА 5

Executive Summary¶

Production LLM deployment practices matured significantly in 2026. Key findings: - 48% cost reduction with semantic routing - 47% latency improvement through intelligent model selection - 78% of enterprises report 30% faster time-to-market for AI features

Part 1: 10 Best Practices for LLM Deployment (2026)¶

1. Choose the Right Model Version¶

Model	Pricing	Best For	Limitations
GPT-5.2	~$15/M input tokens	Creative tasks	Factual accuracy issues
Gemini 3 Flash	~$2/M tokens	Structured data	Less robust context
Claude Opus 4.6	~$18/M tokens	Reasoning, analysis	Higher cost
Llama 3 8B	~$0.8/M tokens	Simple queries	Limited reasoning

Note: Pricing is approximate (Feb 2026) and varies by provider. Check provider pricing pages for exact rates.

Key insight: Model choice should be dynamic, not static.

2. Optimize Inference Latency¶

Configuration	p99 Latency
Well-tuned	45ms
Poorly configured	200ms+

Techniques: - Model quantization (4-bit, 8-bit) - Model distillation - Continuous batching (vLLM) - PagedAttention

3. Implement Robust Monitoring¶

Metrics to track: - Latency (p50, p95, p99) - Error rates - User engagement - Cost per query - Token usage

Tools: Prometheus, Grafana, Sentry, OpenTelemetry

4. Automate Scaling¶

# Kubernetes HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Trade-offs: Autoscaling improves responsiveness but may introduce latency during scale-up.

5. Implement A/B Testing¶

Frameworks: Optimizely, LaunchDarkly, custom implementations

Metrics: - User satisfaction - Response quality - Engagement rates - Conversion rates

6. Manage Costs Effectively¶

Strategies: - Schedule usage during off-peak hours - Use reserved instances - Implement semantic caching (50-80% hit rate) - Model routing based on complexity

7. Ensure Data Privacy and Compliance¶

Regulations: GDPR, CCPA, HIPAA

Techniques: - Data anonymization - PII detection and removal - Secure data handling - Access logging and auditing

8. Prepare for Model Drift¶

Frameworks: MLflow, Kubeflow

Best practices: - Continuous training pipelines - Quarterly performance benchmarks - Automated drift detection - Fallback to previous model versions

9. Optimize User Experience¶

Feedback loops for refinement
Context-aware responses
Streaming responses
Graceful degradation

10. Document Everything¶

Components to document: - Architecture diagrams - API specifications - Deployment procedures - Runbooks - Cost models

Part 2: vLLM Semantic Router (January 2026)¶

Key Statistics¶

Metric	Value
Cost reduction	48%
Latency reduction	47%
PRs merged	600+
Beta testers	50+ engineers
MMLU-Pro improvement	+10.2%

The Cost Problem¶

Before semantic routing: - All queries → GPT-5.2 ($0.015/1K input, $0.06/1K output) - 300,000 queries/month with 2,000 input tokens, 500 output tokens - Total: $18,000/month

The Solution: 6 Signal Types¶

vLLM Semantic Router v0.1 "Iris" uses 6 signals:

Signal 1: Keyword Matching (Fast Path)¶

Patterns: "what is", "who is", "when did", "how many"
Speed: ~5ms
Coverage: 15-20% of queries

Signal 2: Embedding-Based Similarity¶

Model: text-embedding-3-large or all-MiniLM-L6-v2
Process: Embed query → cosine similarity to centroids
Speed: ~120ms
Threshold: 0.85 for simple, 0.80 for complex

Signal 3: Domain Classification (14 MMLU Categories)¶

Domain	Llama3 8B	Gemini Flash	GPT-5.2	Claude 4.6
Mathematics	0.72	0.78	0.94	0.91
Computer Science	0.68	0.74	0.91	0.89
Physics	0.64	0.71	0.88	0.92
Law	0.61	0.69	0.86	0.90
Medicine	0.59	0.66	0.85	0.88

Signal 4: Complexity Scoring¶

Token-based:
  <50 tokens → Simple (Llama3 8B)
  50-150 tokens → Moderate (Gemini Flash)
  >150 tokens → Complex (GPT-5.2)

Syntactic depth:
  Depth 1: "What is X?" → Simple
  Depth 2: "What is X and how does it differ from Y?" → Moderate
  Depth 4+: Multi-part questions → Complex

Signal 5: Preference Signals (LLM-Based)¶

Classifier: Gemini 3 Flash ($0.0002/query)
Categories: creative, factual, analytical, code

Routing:
  creative → Claude Opus 4.6
  factual → Llama3 8B / Gemini Flash
  analytical → GPT-5.2
  code → GPT-5.2 / Claude Opus 4.6

Signal 6: Safety Filtering¶

Jailbreak patterns:
  - "ignore previous instructions"
  - "disregard safety guidelines"
  - "DAN mode" / "Developer mode"

PII detection:
  - Email addresses
  - SSN (XXX-XX-XXXX)
  - Credit cards (Luhn validation)

Weighted Vote Algorithm¶

scores = {}
for model in model_pool:
    scores[model] = (
        keyword_signal[model] * 0.10 +
        embedding_signal[model] * 0.40 +
        domain_signal[model] * 0.30 +
        complexity_signal[model] * 0.20
    )

selected_model = max(scores, key=scores.get)

# Fallback if confidence < 0.6
if confidence < 0.6:
    selected_model = get_best_general_model()

After Semantic Routing: 48% Cost Reduction¶

Query distribution: - 60% simple → Llama3 8B - 25% moderate → Gemini 3 Flash - 15% complex → GPT-5.2

Cost calculation (300K queries/month):

Simple (180K): 180K × 0.5 × $0.0008 = $72 (input)
              180K × 0.15 × $0.0008 = $21.60 (output)
              Subtotal: $93.60

Moderate (75K): 75K × 1.0 × $0.002 = $150 (input)
               75K × 0.3 × $0.002 = $45 (output)
               Subtotal: $195

Complex (45K): 45K × 2.0 × $0.015 = $1,350 (input)
              45K × 0.5 × $0.06 = $1,350 (output)
              Subtotal: $2,700

Total: $93.60 + $195 + $2,700 = $2,988.60/month
Note: savings vs same token distribution all-GPT-5.2 baseline (~$8,100)

For original $18K scenario: **$18,000 → $9,360 (48% reduction)**

Latency Improvement¶

Before: All GPT-5.2 → 2.4s average

After: Weighted average
  (0.6 × 0.8s) + (0.25 × 1.2s) + (0.15 × 2.4s) = 1.14s

Improvement: 2.4s → 1.14s = 47% reduction

Part 3: Kubernetes Deployment¶

Helm Chart Installation¶

# Note: Check vLLM docs for current Helm installation
# See: https://docs.vllm.ai/en/latest/deployment/frameworks/helm/
# Production stack: https://github.com/vllm-project/production-stack
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

helm install semantic-router vllm/semantic-router \
  --namespace ai-infrastructure \
  --create-namespace \
  --set router.signals.embedding.enabled=true \
  --set router.signals.embedding.model="text-embedding-3-large" \
  --set router.models.llama3_8b.endpoint="http://vllm-llama3:8000" \
  --set router.models.llama3_8b.cost_per_1k_tokens=0.0008

values.yaml Configuration¶

router:
  replicas: 3
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

  signals:
    keyword:
      enabled: true
    embedding:
      enabled: true
      model: "text-embedding-3-large"
      threshold: 0.85
    domain:
      enabled: true
      mmlu_categories: 14
    complexity:
      enabled: true
      token_thresholds:
        simple: 50
        moderate: 150
    safety:
      enabled: true
      jailbreak_detection: true
      pii_detection: true

  weights:
    keyword: 0.10
    embedding: 0.40
    domain: 0.30
    complexity: 0.20

monitoring:
  prometheus:
    enabled: true
    port: 9090

Prometheus Metrics¶

# Routing decisions by model
routing_decisions_total{model="llama3_8b"} 60000
routing_decisions_total{model="gemini_flash"} 25000
routing_decisions_total{model="gpt52"} 15000

# Latency per signal type
signal_latency_ms{signal="keyword"} 5
signal_latency_ms{signal="embedding"} 120
signal_latency_ms{signal="domain"} 80

# Cost tracking
total_cost_saved_usd 8440.50

Shadow Routing (A/B Testing)¶

router:
  shadow_mode:
    enabled: true
    shadow_model: "gpt52"
    comparison_metrics:
      - accuracy
      - user_satisfaction
      - latency

Routes to both selected model AND GPT-5.2 for comparison.

Part 4: Production Case Studies¶

Case Study 1: B2B SaaS Customer Support (200K Queries/Month)¶

Before: - All queries to GPT-5.2 - $12,000/month - 2.4s average latency

After semantic routing: - 55% simple → Llama3 8B - 30% moderate → Gemini Flash - 15% complex → GPT-5.2 - $5,800/month (52% reduction) - 1.1s average latency (54% improvement)

Case Study 2: Healthcare AI (HIPAA-Compliant)¶

Challenges: - PII in queries (patient data) - Strict compliance requirements - High accuracy needed

Solution: - Safety filtering (Signal 6) enabled - PII detection and redaction - HIPAA-compliant models only - Audit logging

Results: - 100% PII detection rate - 40% cost reduction - Full HIPAA compliance maintained

Case Study 3: E-Commerce Recommendations¶

Query types: - Product search (simple): 70% - Comparison (moderate): 20% - Complex recommendations: 10%

Results: - 60% cost reduction - 35% latency improvement - Same conversion rate

Part 5: Best Practices & Pitfalls¶

Best Practices¶

Start with shadow routing - Compare quality before full rollout
Continuous learning from feedback - Update centroids monthly
Monitor MMLU scores - Rebalance quarterly

Common Pitfalls¶

Over-routing to cheap models
Quality degradation
User dissatisfaction
Solution: Set confidence threshold > 0.6
Ignoring routing latency
150ms overhead can negate gains
Solution: Use fast embedding models
Static routing rules
Model capabilities change
Solution: Update model scores monthly

Part 6: ROI Summary¶

Cost Comparison¶

Scenario	Monthly Cost	Latency	Quality
All GPT-5.2	$18,000	2.4s	Baseline
Semantic Router	$9,360	1.1s	+10.2% MMLU-Pro
Savings	$8,640	54%	+10.2%

Infrastructure Cost¶

Kubernetes deployment: ~$100/month
Embedding model: ~$50/month
Monitoring: ~$50/month
Total: ~$200/month

Net ROI¶

Monthly savings: $8,640
Infrastructure cost: $200
Net savings: $8,440/month
Payback period: Immediate (Day 1)

Заблуждение: semantic routing всегда экономит 48%

Цифра 48% получена на конкретном распределении: 60% simple / 25% moderate / 15% complex. Если ваш трафик -- 70% сложных аналитических запросов (юридический сервис, медицина), routing направит большинство на дорогую модель и экономия составит 5-10%. Реальная экономия зависит от профиля трафика: измерьте распределение complexity на 10K запросов ПЕРЕД внедрением router. Threshold 0.7 для similarity может отправить сложный запрос на слабую модель -- начинайте с 0.9 и снижайте через A/B тесты.

Заблуждение: autoscaling по CPU -- достаточная метрика для LLM

LLM inference -- memory-bound операция, не compute-bound. GPU utilization может быть 30% при полностью загруженной VRAM и очереди на 200 запросов. Правильные метрики для HPA: queue depth (pending requests), KV cache utilization, TTFT p95. CPU-based autoscaling реагирует на 2-5 минут позже реальной нагрузки, что для пользователя означает таймауты.

Заблуждение: p99 latency 45ms достижима в production

Цифра 45ms p99 из бенчмарков -- это single-request latency на прогретом GPU без очереди. В production с 100+ concurrent users реальный p99 составляет 200-800ms из-за: queuing delay, KV cache pressure, batch scheduling overhead. Well-tuned production p99 для Llama-70B на 4xH100: 300-500ms при 50 concurrent users. Не путайте бенчмарк с SLA.

Interview Questions¶

Q: Как спроектировать production LLM систему для 10K concurrent users?

Red flag: "Поставить vLLM на большой GPU и масштабировать вертикально"

Strong answer: "Горизонтальная архитектура: (1) Load balancer (nginx/envoy) -> N vLLM instances с tensor parallelism на 4xH100. (2) Continuous batching для GPU utilization 70-90%. (3) Prefix caching для multi-turn (RadixAttention если SGLang). (4) FP8 квантизация для 2x memory savings без потери качества. (5) Autoscaling по queue depth: min 4 replicas, max 20, target queue <10. (6) Semantic routing: simple запросы на 8B модель, complex на 70B. Target SLA: TTFT <200ms p95, throughput 850+ tok/s per node. Мониторинг: Prometheus + Grafana, алерты на TTFT p99 > 500ms."

Q: Semantic routing снизил cost на 48%, но качество упало. Как диагностировать?

Red flag: "Повысить threshold для всех запросов"

Strong answer: "Систематический подход: (1) Shadow routing -- параллельно отправлять на baseline модель (GPT-5.2) и сравнивать ответы, метрика: win rate per-route. (2) Разбивка quality по complexity bucket: если simple-route показывает 95% quality, а moderate -- 70%, проблема в moderate threshold. (3) Анализ signal weights: если embedding signal доминирует (0.4), но ваши запросы плохо кластеризуются, увеличить вес domain signal. (4) Per-domain MMLU scores: Llama 8B на Math (0.72) может быть ниже acceptable threshold, перенаправить Math на Gemini Flash (0.78). (5) Добавить confidence-based fallback: если score <0.6 -- всегда premium модель."

Q: Как организовать мониторинг production LLM системы?

Red flag: "Стандартный мониторинг CPU/Memory через CloudWatch"

Strong answer: "LLM-специфичные метрики в 4 категориях: (1) Latency: TTFT p50/p95/p99, TPOT (time per output token), end-to-end latency -- разбивка показывает bottleneck (queue vs prefill vs decode). (2) Throughput: tokens/sec per GPU, requests/sec, batch utilization. (3) Quality: semantic similarity to baseline, user feedback rate, retry rate (>5% = проблема). (4) Cost: cost/1K tokens per model, cost/запрос, cost per route. Стек: Prometheus для метрик, Grafana для dashboards, PagerDuty алерты на TTFT p99 >1s и error rate >2%. OpenTelemetry для distributed tracing через routing -> inference -> response."

Sources¶

Ryz Labs — "10 Best Practices for LLM Deployment in Production 2026" (Jan 2026)
Iterathon — "LLM Semantic Router Production Implementation vLLM SR 2026" (Jan 2026)
vLLM Blog — Semantic Router v0.1 "Iris" announcement
Red Hat — vLLM SR Qwen3 30B benchmarks