
ML Production Patterns & War Stories (Layer 6)

~4 min read

Real-world lessons from Netflix, Uber, Airbnb, Google, and Meta. Updated: 2026-02-11


Netflix ML Lessons

Recommendation System at Scale

Architecture:

- Multi-stage: retrieval (ANN) → ranking → re-ranking
- Real-time + batch features
- A/B testing infrastructure
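
A minimal sketch of this multi-stage shape, with NumPy dot products standing in for a real ANN index (e.g. FAISS) and hand-rolled scoring standing in for learned rankers; all names and thresholds are illustrative:

```python
import numpy as np

def retrieve(user_vec, item_matrix, k=100):
    # Stage 1: retrieval. Brute-force dot product stands in for an ANN lookup.
    scores = item_matrix @ user_vec
    return np.argsort(scores)[::-1][:k].tolist()

def rank(candidate_ids, item_scores, k=20):
    # Stage 2: ranking. A learned model would score richer features here.
    return sorted(candidate_ids, key=lambda i: item_scores[i], reverse=True)[:k]

def rerank(ranked_ids, genres, max_per_genre=3):
    # Stage 3: re-ranking. Cap items per genre so one genre can't dominate.
    counts, out = {}, []
    for i in ranked_ids:
        g = genres[i]
        if counts.get(g, 0) < max_per_genre:
            out.append(i)
            counts[g] = counts.get(g, 0) + 1
    return out
```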

War Story: Cold Start Problem

"We realized 20% of users are new every month. We built a separate onboarding model that uses only session data."

Key Lessons:

1. Personalization requires diversity (don't show the same genre over and over)
2. Freshness matters (users want new content)
3. Explainability improves trust ("Because you watched...")

Sources

- Netflix Tech Blog, "Foundation Model for Personalized Recommendation": https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39

Uber ML Lessons

Real-Time Pricing (Surge)

Challenge: Predict demand in <100ms globally

Solution:

- Pre-compute features in streaming pipelines (Kafka + Flink)
- Model inference at the edge (regional)
- Fallback to rules if the model fails
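
A minimal sketch of the fallback idea from the list above: serve the model when it works, degrade to a simple rule when it doesn't. The capped demand/supply heuristic is an illustrative assumption (real systems also enforce a latency budget):

```python
def surge_multiplier(features, model, cap=3.0):
    try:
        return model.predict(features)  # primary path: regional edge model
    except Exception:
        # Graceful degradation: capped demand/supply ratio as a rule of thumb.
        return min(features["demand"] / max(features["supply"], 1), cap)
```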

War Story: New Year's Eve

"Our model saw 100x traffic spike. Pre-scaling based on historical patterns saved us."

Key Lessons:

1. Capacity planning for predictable spikes
2. Graceful degradation is essential
3. Real-time features need streaming infrastructure

Fraud Detection

Challenge: Detect fraud before transaction completes

Solution:

- Feature freshness < 1 second
- Model features in Redis (fast lookup)
- Multi-model ensemble (speed vs. accuracy trade-off)
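
A sketch of the low-latency lookup path, assuming a streaming job keeps per-transaction features fresh in a Redis hash; the key names, latency budget, and blend weights are assumptions:

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def fraud_score(txn_id, fast_model, accurate_model, budget_ms=50):
    raw = r.hgetall(f"txn_features:{txn_id}")     # sub-millisecond hash read
    features = {k: float(v) for k, v in raw.items()}
    p_fast = fast_model.predict(features)
    if budget_ms < 20:
        return p_fast                             # tight budget: fast model only
    p_slow = accurate_model.predict(features)
    return 0.3 * p_fast + 0.7 * p_slow            # ensemble: blend for accuracy
```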

Sources

- Uber Engineering Blog
- Patrick Koss, "Why Your ML Platform Will Fail at 3 AM"

Airbnb ML Lessons

Search Ranking

Evolution: rules-based → GBDT → neural networks → LLM-enhanced

War Story: Position Bias

"Users click top results regardless of quality. We added position as feature and trained with IPS (Inverse Propensity Scoring)."

Key Lessons:

1. Position bias correction is essential for ranking
2. Offline metrics don't always correlate with online metrics
3. A/B test everything

Pricing Model

Challenge: Dynamic pricing for 6M+ listings

Solution:

- Separate models per market
- Seasonal features
- Competitor pricing signals
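
A hypothetical sketch of the per-market split; the lazy loader, feature names, and listing schema are assumptions:

```python
class MarketPricer:
    """One pricing model per market, loaded on first use."""

    def __init__(self, load_model):
        self.load_model = load_model
        self.models = {}

    def predict(self, listing):
        market = listing["market"]
        if market not in self.models:
            self.models[market] = self.load_model(market)
        feats = {
            "month": listing["month"],             # seasonal feature
            "comp_price": listing["comp_price"],   # competitor pricing signal
            **listing["features"],
        }
        return self.models[market].predict(feats)
```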

Sources

- Airbnb Deep Learning Journey: https://zayunsna.github.io/ds/2025-05-02-airbnb_model/

Google ML Lessons

Hidden Technical Debt

From the well-known paper (Sculley et al., 2015):

"Only a small fraction of real-world ML systems are actual ML code. The rest is infrastructure."

Key Points:

1. Data dependencies are hidden
2. Configuration changes break models
3. Monitoring is an afterthought
4. Model updates affect other systems

Production Patterns

1. Canary Deployment
   - 1% traffic → new model
   - Monitor metrics
   - Gradual rollout

2. Shadow Mode
   - Run new model in parallel
   - Compare predictions (no user impact)
   - Validate before switching

3. Feature Stores
   - Single source of truth
   - Consistent train/serve
   - Version control for features
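
A sketch combining the canary and shadow patterns above. Hash-based bucketing keeps each user consistently in or out of the canary, and the shadow call runs in a background thread so it cannot affect user-facing latency; the percentages and names are illustrative:

```python
import hashlib
import logging
import threading

def bucket(user_id: str) -> float:
    # Deterministic value in [0, 1): the same user always lands in the same bucket.
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000

def serve(user_id, features, old_model, new_model, canary_pct=0.01):
    if bucket(user_id) < canary_pct:
        return new_model.predict(features)        # canary: ~1% of traffic
    # Shadow mode: score with the new model off the request path, log for comparison.
    threading.Thread(
        target=lambda: logging.info("shadow=%s", new_model.predict(features)),
        daemon=True,
    ).start()
    return old_model.predict(features)            # everyone else: current model
```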

Sources

- Google, "Hidden Technical Debt in ML Systems": https://research.google/pubs/pub43146/

Meta (Facebook) ML Lessons

News Feed Ranking

Scale: Billions of predictions per second

Architecture:

- Multi-stage ranking
- Click prediction + engagement + long-term value
- Real-time personalization
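
A hypothetical sketch of collapsing those three predictions into one ranking value; the weights are illustrative, and production systems tune them against long-term outcomes:

```python
def feed_value(p_click, p_engage, p_long_term, w=(1.0, 2.0, 5.0)):
    # Weighted blend: long-term value deliberately dominates raw clicks.
    return w[0] * p_click + w[1] * p_engage + w[2] * p_long_term

def rank_feed(candidates):
    # candidates: dicts with per-post predictions from the earlier stages
    return sorted(
        candidates,
        key=lambda c: feed_value(c["p_click"], c["p_engage"], c["p_ltv"]),
        reverse=True,
    )
```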

War Story: Filter Bubbles

"Users only saw content they agreed with. We added diversity constraints to ensure exposure to different viewpoints."

Key Lessons:

1. Optimization metrics shape user behavior
2. Long-term value > short-term engagement
3. Diversity controls are needed


Common Production Failures

1. Data Drift Not Detected

Symptom: Model accuracy drops slowly

Cause: Feature distribution changed, no alerts

Fix: PSI monitoring on all features, automated retraining
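
A minimal PSI implementation using equal-width bins fixed on the training distribution; the 10-bin choice and the 0.25 threshold follow the common convention used in the alerting section below:

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between training and live feature values."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Alert when a feature drifts past the conventional threshold:
# if psi(train_values, live_values) > 0.25: trigger_retraining()
```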

2. Cold Start Cascade

Symptom: New model fails for subset of users

Cause: Missing features for new users

Fix: Fallback features, imputation, separate cold-start model

3. Latency Spike

Symptom: P99 > 500ms suddenly

Cause: Model size increased, no batching

Fix: Model quantization, dynamic batching, caching
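
As a sketch of the quantization fix, PyTorch's post-training dynamic quantization converts Linear-layer weights to INT8 in one call; the model here is a placeholder:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # INT8 weights: ~4x smaller
)
```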

4. Memory Leak

Symptom: OOM after hours/days

Cause: Accumulating predictions in memory

Fix: Batch processing, garbage collection, memory profiling
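
A sketch of the memory-profiling fix using the standard library's tracemalloc: snapshot allocations at startup, then periodically log the top growth sites:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

def report_growth(top_n=5):
    # Compare current allocations against the baseline, largest growth first.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, "lineno")[:top_n]:
        print(stat)  # file:line plus how many bytes/blocks were added
```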

5. Feature Store Inconsistency

Symptom: Training-serving skew

Cause: Different feature computation in batch vs online

Fix: Single feature definition, centralized feature store
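
The fix in its simplest form: one function owns the feature logic, and both the batch (training) and online (serving) paths import it instead of re-implementing it. The feature itself is a made-up example:

```python
def days_since_signup(event_ts: float, signup_ts: float) -> float:
    # Single definition: batch backfills and the online service both call this,
    # so training and serving cannot silently diverge.
    return (event_ts - signup_ts) / 86_400
```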


Production Checklist

Before Deployment

- [ ] Model metrics meet threshold (accuracy, latency)
- [ ] Feature dependencies documented
- [ ] Fallback strategy defined
- [ ] Monitoring dashboards ready
- [ ] Rollback plan tested
- [ ] A/B test framework configured
- [ ] Capacity planning done
- [ ] Security review passed

During Deployment

- [ ] Shadow mode validated
- [ ] Canary at 1% traffic
- [ ] Metrics within bounds
- [ ] Gradual rollout (1% → 10% → 50% → 100%)
- [ ] Real-time alerts active
- [ ] On-call engineer assigned

After Deployment

- [ ] A/B test significance achieved
- [ ] Business metrics improved
- [ ] Model performance stable
- [ ] Documentation updated
- [ ] Retrospective scheduled

Monitoring Patterns

Key Metrics

Service Health:
- Request latency (P50, P99)
- Error rate
- Throughput (RPS)

Model Health:
- Prediction distribution
- Feature distribution (PSI)
- Model confidence

Business Impact:
- Conversion rate
- Revenue impact
- User engagement

Alerting Strategy

P0 (Wake on-call):
- Error rate > 1%
- Latency P99 > 1s
- Model returning errors

P1 (Next business day):
- PSI > 0.25
- Accuracy drop > 2%
- Features staler than the freshness SLA

P2 (Weekly review):
- Drift trend analysis
- Cost optimization
- Capacity trends
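
One way to keep these thresholds honest is to encode them as data rather than scattering them across dashboards; a hypothetical sketch, with metric names as assumptions:

```python
ALERT_RULES = [
    ("P0", "error_rate",     lambda v: v > 0.01),   # > 1% errors
    ("P0", "latency_p99_s",  lambda v: v > 1.0),    # P99 above 1 s
    ("P1", "feature_psi",    lambda v: v > 0.25),
    ("P1", "accuracy_drop",  lambda v: v > 0.02),   # > 2% drop
]

def evaluate(metrics: dict):
    """Return (severity, metric) pairs for every rule that fires."""
    return [(sev, name) for sev, name, fires in ALERT_RULES
            if name in metrics and fires(metrics[name])]
```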

Cost Optimization Patterns

Compute

1. Spot Instances      — 70% cheaper, handle preemption
2. Right-sizing        — Match instance to workload
3. Auto-scaling        — Scale down during low traffic
4. Batch inference     — Group predictions, reduce overhead

Model

1. Quantization        — INT8 = ~4x smaller vs FP32, ~2x faster
2. Distillation        — Smaller model, similar accuracy
3. Pruning             — Remove unnecessary weights
4. Caching             — Cache frequent predictions
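
A minimal sketch of the caching item: memoize predictions for repeated feature tuples in-process; a production version would typically use Redis with a TTL. The model call is a stand-in:

```python
from functools import lru_cache

def model_predict(features):            # stand-in for the real, expensive call
    return sum(features) * 0.01

@lru_cache(maxsize=100_000)
def cached_predict(features: tuple):    # features must be hashable to cache
    return model_predict(features)
```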

Anti-Patterns to Avoid

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Manual deployment | Error-prone, slow | CI/CD pipeline |
| No monitoring | Silent failures | Comprehensive dashboards |
| Single model | No fallback | Model ensemble / fallback |
| Hardcoded features | Inflexible | Feature store |
| Training-serving skew | Inconsistent predictions | Same pipeline for train and serve |
| No versioning | Can't roll back | Model registry |
| Overfitting to metrics | Bad user experience | Multi-metric optimization |

Sources: Netflix Tech Blog, Uber Engineering, Airbnb Tech, Google Research, Meta Engineering