Продакшен деплой LLM: практики и кейсы¶
~7 минут чтения
Предварительно: vLLM и Paged Attention, Квантизация LLM
Разница между "работает в ноутбуке" и "обслуживает 10K пользователей" -- это 6-12 месяцев инженерной работы. В 2026 году 78% enterprise-команд сократили time-to-market AI-фич на 30% благодаря стандартизированным паттернам деплоя: semantic routing снижает cost на 48%, continuous batching утилизирует GPU на 70-90% вместо 15-30%, а autoscaling по queue depth предотвращает и переплату за idle GPU, и деградацию latency при пиковых нагрузках. Ключевой инсайт: модель -- это 20% production системы, остальные 80% -- routing, monitoring, scaling, fallback, cost management.
URL: Multiple sources (Ryz Labs, Iterathon) Тип: production / case-study / best-practices Дата: Январь-Февраль 2026 Сбор: Ralph Research ФАЗА 5
Executive Summary¶
Production LLM deployment practices matured significantly in 2026. Key findings: - 48% cost reduction with semantic routing - 47% latency improvement through intelligent model selection - 78% of enterprises report 30% faster time-to-market for AI features
Part 1: 10 Best Practices for LLM Deployment (2026)¶
1. Choose the Right Model Version¶
| Model | Pricing | Best For | Limitations |
|---|---|---|---|
| GPT-5.2 | ~$15/M input tokens | Creative tasks | Factual accuracy issues |
| Gemini 3 Flash | ~$2/M tokens | Structured data | Less robust context |
| Claude Opus 4.6 | ~$18/M tokens | Reasoning, analysis | Higher cost |
| Llama 3 8B | ~$0.8/M tokens | Simple queries | Limited reasoning |
Note: Pricing is approximate (Feb 2026) and varies by provider. Check provider pricing pages for exact rates.
Key insight: Model choice should be dynamic, not static.
2. Optimize Inference Latency¶
| Configuration | p99 Latency |
|---|---|
| Well-tuned | 45ms |
| Poorly configured | 200ms+ |
Techniques: - Model quantization (4-bit, 8-bit) - Model distillation - Continuous batching (vLLM) - PagedAttention
3. Implement Robust Monitoring¶
Metrics to track: - Latency (p50, p95, p99) - Error rates - User engagement - Cost per query - Token usage
Tools: Prometheus, Grafana, Sentry, OpenTelemetry
4. Automate Scaling¶
# Kubernetes HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llm-inference-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-inference
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Trade-offs: Autoscaling improves responsiveness but may introduce latency during scale-up.
5. Implement A/B Testing¶
Frameworks: Optimizely, LaunchDarkly, custom implementations
Metrics: - User satisfaction - Response quality - Engagement rates - Conversion rates
6. Manage Costs Effectively¶
Strategies: - Schedule usage during off-peak hours - Use reserved instances - Implement semantic caching (50-80% hit rate) - Model routing based on complexity
7. Ensure Data Privacy and Compliance¶
Regulations: GDPR, CCPA, HIPAA
Techniques: - Data anonymization - PII detection and removal - Secure data handling - Access logging and auditing
8. Prepare for Model Drift¶
Frameworks: MLflow, Kubeflow
Best practices: - Continuous training pipelines - Quarterly performance benchmarks - Automated drift detection - Fallback to previous model versions
9. Optimize User Experience¶
- Feedback loops for refinement
- Context-aware responses
- Streaming responses
- Graceful degradation
10. Document Everything¶
Components to document: - Architecture diagrams - API specifications - Deployment procedures - Runbooks - Cost models
Part 2: vLLM Semantic Router (January 2026)¶
Key Statistics¶
| Metric | Value |
|---|---|
| Cost reduction | 48% |
| Latency reduction | 47% |
| PRs merged | 600+ |
| Beta testers | 50+ engineers |
| MMLU-Pro improvement | +10.2% |
The Cost Problem¶
Before semantic routing: - All queries → GPT-5.2 ($0.015/1K input, $0.06/1K output) - 300,000 queries/month with 2,000 input tokens, 500 output tokens - Total: $18,000/month
The Solution: 6 Signal Types¶
vLLM Semantic Router v0.1 "Iris" uses 6 signals:
Signal 1: Keyword Matching (Fast Path)¶
Signal 2: Embedding-Based Similarity¶
Model: text-embedding-3-large or all-MiniLM-L6-v2
Process: Embed query → cosine similarity to centroids
Speed: ~120ms
Threshold: 0.85 for simple, 0.80 for complex
Signal 3: Domain Classification (14 MMLU Categories)¶
| Domain | Llama3 8B | Gemini Flash | GPT-5.2 | Claude 4.6 |
|---|---|---|---|---|
| Mathematics | 0.72 | 0.78 | 0.94 | 0.91 |
| Computer Science | 0.68 | 0.74 | 0.91 | 0.89 |
| Physics | 0.64 | 0.71 | 0.88 | 0.92 |
| Law | 0.61 | 0.69 | 0.86 | 0.90 |
| Medicine | 0.59 | 0.66 | 0.85 | 0.88 |
Signal 4: Complexity Scoring¶
Token-based:
<50 tokens → Simple (Llama3 8B)
50-150 tokens → Moderate (Gemini Flash)
>150 tokens → Complex (GPT-5.2)
Syntactic depth:
Depth 1: "What is X?" → Simple
Depth 2: "What is X and how does it differ from Y?" → Moderate
Depth 4+: Multi-part questions → Complex
Signal 5: Preference Signals (LLM-Based)¶
Classifier: Gemini 3 Flash ($0.0002/query)
Categories: creative, factual, analytical, code
Routing:
creative → Claude Opus 4.6
factual → Llama3 8B / Gemini Flash
analytical → GPT-5.2
code → GPT-5.2 / Claude Opus 4.6
Signal 6: Safety Filtering¶
Jailbreak patterns:
- "ignore previous instructions"
- "disregard safety guidelines"
- "DAN mode" / "Developer mode"
PII detection:
- Email addresses
- SSN (XXX-XX-XXXX)
- Credit cards (Luhn validation)
Weighted Vote Algorithm¶
scores = {}
for model in model_pool:
scores[model] = (
keyword_signal[model] * 0.10 +
embedding_signal[model] * 0.40 +
domain_signal[model] * 0.30 +
complexity_signal[model] * 0.20
)
selected_model = max(scores, key=scores.get)
# Fallback if confidence < 0.6
if confidence < 0.6:
selected_model = get_best_general_model()
After Semantic Routing: 48% Cost Reduction¶
Query distribution: - 60% simple → Llama3 8B - 25% moderate → Gemini 3 Flash - 15% complex → GPT-5.2
Cost calculation (300K queries/month):
Simple (180K): 180K × 0.5 × $0.0008 = $72 (input)
180K × 0.15 × $0.0008 = $21.60 (output)
Subtotal: $93.60
Moderate (75K): 75K × 1.0 × $0.002 = $150 (input)
75K × 0.3 × $0.002 = $45 (output)
Subtotal: $195
Complex (45K): 45K × 2.0 × $0.015 = $1,350 (input)
45K × 0.5 × $0.06 = $1,350 (output)
Subtotal: $2,700
Total: $93.60 + $195 + $2,700 = $2,988.60/month
Note: savings vs same token distribution all-GPT-5.2 baseline (~$8,100)
For original \(18K scenario: **\)18,000 → $9,360 (48% reduction)**
Latency Improvement¶
Before: All GPT-5.2 → 2.4s average
After: Weighted average
(0.6 × 0.8s) + (0.25 × 1.2s) + (0.15 × 2.4s) = 1.14s
Improvement: 2.4s → 1.14s = 47% reduction
Part 3: Kubernetes Deployment¶
Helm Chart Installation¶
# Note: Check vLLM docs for current Helm installation
# See: https://docs.vllm.ai/en/latest/deployment/frameworks/helm/
# Production stack: https://github.com/vllm-project/production-stack
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install semantic-router vllm/semantic-router \
--namespace ai-infrastructure \
--create-namespace \
--set router.signals.embedding.enabled=true \
--set router.signals.embedding.model="text-embedding-3-large" \
--set router.models.llama3_8b.endpoint="http://vllm-llama3:8000" \
--set router.models.llama3_8b.cost_per_1k_tokens=0.0008
values.yaml Configuration¶
router:
replicas: 3
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "2"
memory: "4Gi"
signals:
keyword:
enabled: true
embedding:
enabled: true
model: "text-embedding-3-large"
threshold: 0.85
domain:
enabled: true
mmlu_categories: 14
complexity:
enabled: true
token_thresholds:
simple: 50
moderate: 150
safety:
enabled: true
jailbreak_detection: true
pii_detection: true
weights:
keyword: 0.10
embedding: 0.40
domain: 0.30
complexity: 0.20
monitoring:
prometheus:
enabled: true
port: 9090
Prometheus Metrics¶
# Routing decisions by model
routing_decisions_total{model="llama3_8b"} 60000
routing_decisions_total{model="gemini_flash"} 25000
routing_decisions_total{model="gpt52"} 15000
# Latency per signal type
signal_latency_ms{signal="keyword"} 5
signal_latency_ms{signal="embedding"} 120
signal_latency_ms{signal="domain"} 80
# Cost tracking
total_cost_saved_usd 8440.50
Shadow Routing (A/B Testing)¶
router:
shadow_mode:
enabled: true
shadow_model: "gpt52"
comparison_metrics:
- accuracy
- user_satisfaction
- latency
Routes to both selected model AND GPT-5.2 for comparison.
Part 4: Production Case Studies¶
Case Study 1: B2B SaaS Customer Support (200K Queries/Month)¶
Before: - All queries to GPT-5.2 - $12,000/month - 2.4s average latency
After semantic routing: - 55% simple → Llama3 8B - 30% moderate → Gemini Flash - 15% complex → GPT-5.2 - $5,800/month (52% reduction) - 1.1s average latency (54% improvement)
Case Study 2: Healthcare AI (HIPAA-Compliant)¶
Challenges: - PII in queries (patient data) - Strict compliance requirements - High accuracy needed
Solution: - Safety filtering (Signal 6) enabled - PII detection and redaction - HIPAA-compliant models only - Audit logging
Results: - 100% PII detection rate - 40% cost reduction - Full HIPAA compliance maintained
Case Study 3: E-Commerce Recommendations¶
Query types: - Product search (simple): 70% - Comparison (moderate): 20% - Complex recommendations: 10%
Results: - 60% cost reduction - 35% latency improvement - Same conversion rate
Part 5: Best Practices & Pitfalls¶
Best Practices¶
- Start with shadow routing - Compare quality before full rollout
- Continuous learning from feedback - Update centroids monthly
- Monitor MMLU scores - Rebalance quarterly
Common Pitfalls¶
- Over-routing to cheap models
- Quality degradation
- User dissatisfaction
-
Solution: Set confidence threshold > 0.6
-
Ignoring routing latency
- 150ms overhead can negate gains
-
Solution: Use fast embedding models
-
Static routing rules
- Model capabilities change
- Solution: Update model scores monthly
Part 6: ROI Summary¶
Cost Comparison¶
| Scenario | Monthly Cost | Latency | Quality |
|---|---|---|---|
| All GPT-5.2 | $18,000 | 2.4s | Baseline |
| Semantic Router | $9,360 | 1.1s | +10.2% MMLU-Pro |
| Savings | $8,640 | 54% | +10.2% |
Infrastructure Cost¶
- Kubernetes deployment: ~$100/month
- Embedding model: ~$50/month
- Monitoring: ~$50/month
- Total: ~$200/month
Net ROI¶
- Monthly savings: $8,640
- Infrastructure cost: $200
- Net savings: $8,440/month
- Payback period: Immediate (Day 1)
Заблуждение: semantic routing всегда экономит 48%
Цифра 48% получена на конкретном распределении: 60% simple / 25% moderate / 15% complex. Если ваш трафик -- 70% сложных аналитических запросов (юридический сервис, медицина), routing направит большинство на дорогую модель и экономия составит 5-10%. Реальная экономия зависит от профиля трафика: измерьте распределение complexity на 10K запросов ПЕРЕД внедрением router. Threshold 0.7 для similarity может отправить сложный запрос на слабую модель -- начинайте с 0.9 и снижайте через A/B тесты.
Заблуждение: autoscaling по CPU -- достаточная метрика для LLM
LLM inference -- memory-bound операция, не compute-bound. GPU utilization может быть 30% при полностью загруженной VRAM и очереди на 200 запросов. Правильные метрики для HPA: queue depth (pending requests), KV cache utilization, TTFT p95. CPU-based autoscaling реагирует на 2-5 минут позже реальной нагрузки, что для пользователя означает таймауты.
Заблуждение: p99 latency 45ms достижима в production
Цифра 45ms p99 из бенчмарков -- это single-request latency на прогретом GPU без очереди. В production с 100+ concurrent users реальный p99 составляет 200-800ms из-за: queuing delay, KV cache pressure, batch scheduling overhead. Well-tuned production p99 для Llama-70B на 4xH100: 300-500ms при 50 concurrent users. Не путайте бенчмарк с SLA.
Interview Questions¶
Q: Как спроектировать production LLM систему для 10K concurrent users?
Red flag: "Поставить vLLM на большой GPU и масштабировать вертикально"
Strong answer: "Горизонтальная архитектура: (1) Load balancer (nginx/envoy) -> N vLLM instances с tensor parallelism на 4xH100. (2) Continuous batching для GPU utilization 70-90%. (3) Prefix caching для multi-turn (RadixAttention если SGLang). (4) FP8 квантизация для 2x memory savings без потери качества. (5) Autoscaling по queue depth: min 4 replicas, max 20, target queue <10. (6) Semantic routing: simple запросы на 8B модель, complex на 70B. Target SLA: TTFT <200ms p95, throughput 850+ tok/s per node. Мониторинг: Prometheus + Grafana, алерты на TTFT p99 > 500ms."
Q: Semantic routing снизил cost на 48%, но качество упало. Как диагностировать?
Red flag: "Повысить threshold для всех запросов"
Strong answer: "Систематический подход: (1) Shadow routing -- параллельно отправлять на baseline модель (GPT-5.2) и сравнивать ответы, метрика: win rate per-route. (2) Разбивка quality по complexity bucket: если simple-route показывает 95% quality, а moderate -- 70%, проблема в moderate threshold. (3) Анализ signal weights: если embedding signal доминирует (0.4), но ваши запросы плохо кластеризуются, увеличить вес domain signal. (4) Per-domain MMLU scores: Llama 8B на Math (0.72) может быть ниже acceptable threshold, перенаправить Math на Gemini Flash (0.78). (5) Добавить confidence-based fallback: если score <0.6 -- всегда premium модель."
Q: Как организовать мониторинг production LLM системы?
Red flag: "Стандартный мониторинг CPU/Memory через CloudWatch"
Strong answer: "LLM-специфичные метрики в 4 категориях: (1) Latency: TTFT p50/p95/p99, TPOT (time per output token), end-to-end latency -- разбивка показывает bottleneck (queue vs prefill vs decode). (2) Throughput: tokens/sec per GPU, requests/sec, batch utilization. (3) Quality: semantic similarity to baseline, user feedback rate, retry rate (>5% = проблема). (4) Cost: cost/1K tokens per model, cost/запрос, cost per route. Стек: Prometheus для метрик, Grafana для dashboards, PagerDuty алерты на TTFT p99 >1s и error rate >2%. OpenTelemetry для distributed tracing через routing -> inference -> response."
Sources¶
- Ryz Labs — "10 Best Practices for LLM Deployment in Production 2026" (Jan 2026)
- Iterathon — "LLM Semantic Router Production Implementation vLLM SR 2026" (Jan 2026)
- vLLM Blog — Semantic Router v0.1 "Iris" announcement
- Red Hat — vLLM SR Qwen3 30B benchmarks
See Also¶
- vLLM и Paged Attention -- serving engine: PagedAttention, continuous batching, prefix caching -- инфраструктура для production
- Квантизация LLM -- GPTQ/AWQ/FP8 для снижения memory footprint при деплое
- Облачный деплой LLM -- AWS Bedrock vs Azure AI vs GCP Vertex -- managed альтернатива self-hosted
- Каскадная маршрутизация LLM -- semantic router и cascade как часть production архитектуры
- Наблюдаемость LLM -- мониторинг quality/latency/cost после deployment