LLMOps vs MLOps: Key Differences and Patterns¶

~5 min read

Prerequisites: Production LLM Deployment, LLM Observability

MLOps pipelines were designed for megabyte-scale models with deterministic output: feature stores, retraining triggers, metrics such as F1. LLMOps deals with models of hundreds of gigabytes, where the primary artifact is the prompt rather than the model, the main cost driver is inference tokens (not training compute), and the failure mode is hallucination rather than a "wrong prediction". According to USAII (2026), teams that adopt LLMOps practices (prompt versioning, semantic caching, model routing) cut inference costs by 50-80% compared with the naive approach of "one frontier model for every request".

Overview¶

Aspect	MLOps	LLMOps
Focus	Traditional ML models	Large Language Models
Model Size	MB to GB	GB to TB
Input	Structured features	Unstructured text
Output	Deterministic	Probabilistic
Evaluation	Metrics (accuracy, F1)	Human + automated eval

1. Fundamental Differences¶

Model Development¶

Stage	MLOps	LLMOps
Data	Feature engineering	Prompt engineering
Training	From scratch common	Fine-tuning dominant
Iteration	Hours to days	Days to weeks
Compute	Single GPU often	Distributed required

Output Nature¶

Aspect	MLOps	LLMOps
Type	Structured predictions	Text generation
Determinism	High	Low (temperature, sampling)
Validation	Automated tests	Human + LLM-as-judge
Failure mode	Wrong prediction	Hallucination, toxicity

2. Lifecycle Comparison¶

MLOps Lifecycle¶

graph LR
    A["Data Collection"] --> B["Feature Engineering"]
    B --> C["Model Training"]
    C --> D["Model Evaluation"]
    D --> E["Model Registry"]
    E --> F["Deployment"]
    F --> G["Monitoring"]
    G -->|"drift detected"| C

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#f3e5f5,stroke:#9c27b0
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fff3e0,stroke:#ef6c00

LLMOps Lifecycle¶

graph LR
    A["Prompt Engineering"] --> B["Fine-tuning (optional)"]
    B --> C["Evaluation"]
    C --> D["Model Selection"]
    D --> E["Deployment + Routing"]
    E --> F["Monitoring"]
    F --> G["Feedback Collection"]
    G --> H["Prompt Iteration"]
    H --> A

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#f3e5f5,stroke:#9c27b0
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#fff3e0,stroke:#ef6c00
    style G fill:#e8f5e9,stroke:#4caf50
    style H fill:#e8eaf6,stroke:#3f51b5

3. LLMOps-Specific Components¶

A. Prompt Management¶

MLOps equivalent: Feature engineering

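# Prompt versioning: every variant lives under an explicit version key
prompts = {
    "v1": "Summarize: {text}",
    "v2": "Summarize in 3 bullet points: {text}",
    "v3": "Create executive summary (max 100 words): {text}"
}

# A/B testing prompts
# `llm` is assumed to be an already-configured client exposing generate();
# `user_text` holds the text to summarize (renamed from `input`, which shadows a builtin)
result_v1 = llm.generate(prompts["v1"].format(text=user_text))
result_v2 = llm.generate(prompts["v2"].format(text=user_text))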

Best Practices:
- Version control prompts like code
- A/B test prompt variations
- Track prompt → performance metrics
- Use prompt registries (LangSmith, Promptlayer)
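
The practices above can be sketched in a few lines. This is a minimal illustration only, not how LangSmith or Promptlayer work: `PromptRegistry` and `assign_variant` are hypothetical names, the store is in-memory, and the version id is assumed to be a short hash of the template text.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Hypothetical in-memory registry: prompt name -> {version id: template}."""
    _store: dict = field(default_factory=dict)

    def register(self, name: str, template: str) -> str:
        # The version id is derived from the template text, so registering
        # identical text twice yields the same version.
        version = hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]
        self._store.setdefault(name, {})[version] = template
        return version

    def get(self, name: str, version: str) -> str:
        return self._store[name][version]


def assign_variant(user_id: str, variants: list[str]) -> str:
    # Deterministic A/B bucketing: the same user always sees the same variant.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


registry = PromptRegistry()
v_a = registry.register("summarize", "Summarize: {text}")
v_b = registry.register("summarize", "Summarize in 3 bullet points: {text}")

chosen = assign_variant("user-42", [v_a, v_b])
prompt = registry.get("summarize", chosen).format(text="LLMOps notes")
```

Logging `chosen` next to each response is what makes it possible to track prompt → performance metrics per version.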

B. Model Routing¶

MLOps equivalent: Model selection

def route_request(query, complexity_score):
    if complexity_score < 0.3:
        return "gpt-4o-mini"     # Cheap, fast (gpt-3.5-turbo deprecated Jan 2025)
    elif complexity_score < 0.7:
        return "claude-haiku-4-5" # Balanced
    else:
        return "gpt-4o"          # Best quality

Routing Strategies:
- Complexity-based
- Cost optimization
- Latency requirements
- Domain specialization
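
Cost and latency constraints can be combined into a single routing rule: pick the cheapest model that still meets the request's requirements. The sketch below assumes a hypothetical model catalog; the model names, prices, latencies and quality tiers are made-up placeholders, not real figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1m_tokens: float  # illustrative number, not a real price
    p95_latency_ms: int        # illustrative number
    quality_tier: int          # hypothetical scale: 1 = basic, 3 = best


# Hypothetical catalog used only for this example.
CATALOG = [
    ModelProfile("small-model", 0.5, 400, 1),
    ModelProfile("mid-model", 3.0, 900, 2),
    ModelProfile("large-model", 12.0, 2500, 3),
]


def pick_model(min_quality: int, max_latency_ms: int) -> str:
    """Return the cheapest model that satisfies both constraints."""
    candidates = [
        m for m in CATALOG
        if m.quality_tier >= min_quality and m.p95_latency_ms <= max_latency_ms
    ]
    if not candidates:
        raise ValueError("no model satisfies the constraints")
    return min(candidates, key=lambda m: m.cost_per_1m_tokens).name
```

Raising when nothing matches forces the caller to decide explicitly whether to relax the latency or the quality requirement.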

C. Semantic Caching¶

MLOps equivalent: Not applicable

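# Cache semantically similar queries
# `SemanticCache` and `llm` are assumed to be provided elsewhere
cache = SemanticCache(threshold=0.95)

def get_response(query):
    cached = cache.get(query)
    if cached is not None:  # an empty string is still a valid cached response
        return cached  # Cache hit: no LLM call, no token cost

    response = llm.generate(query)
    cache.set(query, response)
    return response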

Savings: 70-90% cost reduction for repeated queries
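
The `SemanticCache` class used above is not defined in the snippet. A minimal sketch of how such a cache could work, assuming cosine similarity over embeddings: the `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the linear scan would be replaced by a vector index in practice.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        """Return the best cached response at or above the threshold, else None."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def set(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))
```

Lowering `threshold` raises the hit rate but also the chance of returning an answer to a different question.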

D. Token Budget Management¶

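from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a request would push a user past their token limit."""


class TokenBudget:
    def __init__(self, max_tokens_per_user=10000):
        # Every user starts with the same default limit
        self.limits = defaultdict(lambda: max_tokens_per_user)
        self.usage = defaultdict(int)

    def check(self, user_id, requested_tokens):
        # Reject the request before recording any usage
        if self.usage[user_id] + requested_tokens > self.limits[user_id]:
            raise BudgetExceeded()
        self.usage[user_id] += requested_tokens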

4. Evaluation Differences¶

MLOps Evaluation¶

Metric	Formula
Accuracy	$\frac{TP + TN}{Total}$
Precision	$\frac{TP}{TP + FP}$
Recall	$\frac{TP}{TP + FN}$
F1	$2 \cdot \frac{P \cdot R}{P + R}$
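
As a worked example, the four formulas can be computed directly from confusion-matrix counts. The function name and the sample counts are illustrative; P and R in the F1 formula are precision and recall.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts accuracy is 0.7, precision is 0.8, recall is about 0.667, and F1 is about 0.727.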

LLMOps Evaluation¶

Metric	Method
Hallucination	LLM-as-judge, fact checking
Relevance	RAGAS, DeepEval
Safety	Red-teaming, toxicity classifiers
Coherence	Human eval, G-Eval
Cost	Tokens × price per model

LLM-as-Judge Pattern¶

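import re

def evaluate_response(query, response, criteria):
    # `llm` is assumed to be a configured client whose generate() returns a string
    prompt = f"""
    Rate this response on {criteria} (1-10):
    Query: {query}
    Response: {response}

    Output only the score.
    """
    raw = llm.generate(prompt)
    # Judges sometimes add whitespace or extra words, so extract the first integer
    match = re.search(r"\d+", raw)
    if match is None:
        raise ValueError(f"Judge returned no numeric score: {raw!r}")
    # Clamp to the 1-10 scale requested in the prompt
    return max(1, min(10, int(match.group())))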

5. Monitoring Differences¶

MLOps Monitoring¶

Signal	Detection
Data drift	Statistical tests (KS, PSI)
Concept drift	Accuracy degradation
Model staleness	Scheduled retraining
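
To make the PSI check concrete, here is a minimal pure-Python sketch of the Population Stability Index between a baseline sample and a production sample. The equal-width binning over the baseline range and the `eps` floor for empty bins are assumptions of this example; production tooling may bin differently.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the right threshold depends on the feature and should be tuned per use case.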

LLMOps Monitoring¶

Signal	Detection
Hallucination rate	Automated + sampling
Cost spike	Token tracking
Latency degradation	P95 monitoring
User satisfaction	Feedback signals
Safety violations	Content filters

LLMOps-Specific Alerts¶

alerts:
  - name: cost_spike
    condition: daily_cost > 2 * avg_weekly_cost
    severity: warning

  - name: hallucination_high
    condition: hallucination_rate > 0.05
    severity: critical

  - name: latency_degraded
    condition: p95_latency > 3000ms
    severity: warning
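
The alert conditions above can be evaluated against a metrics snapshot with a few lines of Python. This is a sketch only: the rule table mirrors the YAML by hand rather than parsing it, and the metric key names (including `p95_latency_ms`) are assumptions of this example.

```python
# Each rule: (alert name, severity, predicate over a metrics snapshot).
ALERT_RULES = [
    ("cost_spike", "warning",
     lambda m: m["daily_cost"] > 2 * m["avg_weekly_cost"]),
    ("hallucination_high", "critical",
     lambda m: m["hallucination_rate"] > 0.05),
    ("latency_degraded", "warning",
     lambda m: m["p95_latency_ms"] > 3000),
]


def evaluate_alerts(metrics: dict) -> list[tuple[str, str]]:
    """Return (name, severity) for every rule whose condition holds."""
    return [(name, severity) for name, severity, cond in ALERT_RULES if cond(metrics)]


# Illustrative snapshot: cost has more than doubled, the rest is healthy.
snapshot = {
    "daily_cost": 250.0,
    "avg_weekly_cost": 100.0,
    "hallucination_rate": 0.02,
    "p95_latency_ms": 1800,
}
fired = evaluate_alerts(snapshot)
```

Here only `cost_spike` fires, because 250 exceeds twice the weekly average of 100 while the other two metrics stay under their thresholds.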

6. Cost Optimization (LLMOps-Specific)¶

Strategies¶

Strategy	Savings	Trade-off
Semantic caching	70-90%	Memory
Model routing	50-80%	Complexity
Token limits	20-50%	Quality
Smaller models	80-95%	Capability
Batch processing	30-50%	Latency

Cost Formula¶

\[ \text{Cost} = \frac{\text{Input Tokens} \times \text{Input Price} + \text{Output Tokens} \times \text{Output Price}}{1{,}000{,}000} \]

Example: GPT-4o pricing
- Input: $2.50 / 1M tokens
- Output: $10.00 / 1M tokens
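
A worked example of the formula with the prices quoted above. The token counts (1,200 input, 300 output) and the daily request volume are assumed values chosen for illustration.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Cost of one request in dollars; prices are given per 1M tokens."""
    return (
        input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    ) / 1_000_000


# Assumed request shape: 1,200 input tokens and 300 output tokens.
cost = request_cost(1_200, 300, input_price_per_1m=2.50, output_price_per_1m=10.00)

# Assumed traffic: 100,000 such requests per day.
daily_cost = cost * 100_000
```

The request costs $0.006: the 300 output tokens cost as much as the 1,200 input tokens because output is priced four times higher. At the assumed volume that is about $600 per day.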

7. Tool Comparison¶

MLOps Tools¶

Category	Tools
Experiment Tracking	MLflow, Weights & Biases
Feature Store	Feast, Tecton
Model Registry	MLflow, Seldon
Deployment	Seldon, KServe
Monitoring	Evidently, WhyLabs

LLMOps Tools¶

Category	Tools
Prompt Management	LangSmith, Promptlayer
Evaluation	RAGAS, DeepEval, TruLens
Observability	Langfuse, Helicone, Arize
Guardrails	NeMo Guardrails, Guardrails AI
Cost Tracking	OpenAI Usage, Helicone

8. Infrastructure Differences¶

MLOps Infrastructure¶

graph TD
    A["Feature Store"] --> B["Training Pipeline"]
    B --> C["Model Registry"]
    C --> D["Model Serving (REST/gRPC)"]
    D --> E["Monitoring"]
    E -->|"drift"| F["Retraining Trigger"]
    F --> B

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#fce4ec,stroke:#c62828

LLMOps Infrastructure¶

graph TD
    A["Prompt Registry"] --> B["LLM Gateway (routing + caching)"]
    B --> C["Model Serving (vLLM, SGLang)"]
    C --> D["Evaluation Pipeline"]
    D --> E["Feedback Collection"]
    E --> F["Prompt Iteration"]
    F --> A

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#f3e5f5,stroke:#9c27b0
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8eaf6,stroke:#3f51b5

9. Team Skills Comparison¶

MLOps Skills¶

Feature engineering
Model training
Hyperparameter tuning
Distributed training
A/B testing models

LLMOps Skills (Additional)¶

Prompt engineering
LLM evaluation methods
Cost optimization
Semantic caching
Guardrails implementation
RAG system design

10. Summary Table¶

Aspect	MLOps	LLMOps
Primary artifact	Model	Prompt + Model
Training paradigm	From scratch	Fine-tuning
Input processing	Feature engineering	Tokenization
Output validation	Automated metrics	Human + automated
Main cost driver	Training compute	Inference tokens
Key skill	Feature engineering	Prompt engineering
Monitoring focus	Data drift	Hallucination, cost
Caching strategy	Prediction cache	Semantic cache
Deployment concern	Model version	Model + prompt version

Misconception: LLMOps is just MLOps with bigger models

MLOps optimizes the training pipeline (feature engineering, retraining on drift). LLMOps optimizes the inference pipeline (prompt versioning, model routing, token budgets). In MLOps the primary artifact is the model; in LLMOps it is the prompt plus the routing configuration. A team that applies MLOps practices to LLMs (scheduled retraining, feature stores) spends resources on processes that are not relevant there.

Misconception: LLM-as-judge fully replaces human evaluation

LLM-as-judge is good at assessing format, coherence, and factual consistency ($0.05-0.50 per 1K evaluations). But for safety, edge cases, and domain-specific correctness, human raters are irreplaceable. G-Eval (with GPT-4 as the judge) shows a correlation of 0.5-0.7 with human judgment on open-ended tasks: useful for screening, but not for final quality assessment.

Misconception: semantic caching suits every LLM task

Semantic caching saves 40-86% for FAQ bots and customer support (high query overlap). But for code generation, creative writing, and real-time data it is useless or harmful: every request is unique, and a cached answer may be inaccurate. A similarity threshold below 0.95 yields 15-20% false positives.

Interview Questions¶

Q: What is the key difference between MLOps and LLMOps?

Red flag: "LLMOps is MLOps for big models, you just need more GPUs"

Strong answer: "MLOps optimizes the training pipeline: feature engineering, model registry, retraining on data drift. LLMOps optimizes the inference pipeline: prompt versioning, model routing, semantic caching, token budgets. In MLOps the main cost driver is training compute; in LLMOps it is inference tokens. Evaluation also differs fundamentally: automated metrics (F1, accuracy) vs human + LLM-as-judge for hallucination and safety"

Q: How would you design a cost optimization strategy for a high-traffic LLM API?

Red flag: "Just use the cheapest model for every request"

Strong answer: "Three layers: (1) Semantic cache with a threshold of 0.95+, which makes 40-86% of requests free when overlap is high; (2) Cascade routing: 40-70% of text requests do not need a flagship model, so start with Haiku/GPT-4o-mini and escalate based on confidence; (3) Per-user token budget management. Monitoring: cache hit rate, escalation rate, cost per query. Deployment via shadow mode -> A/B test -> gradual rollout"

Q: Explain the LLM-as-judge approach and its limitations

Red flag: "GPT-4 can evaluate any response, and it is more objective than people"

Strong answer: "LLM-as-judge is a pattern in which a strong model scores another model's output against given criteria (1-10 scale). It works well for format, coherence, and factual consistency: $0.05-0.50 per 1K evaluations vs $1-5 per human batch. Limitations: position bias (prefers the first answer), self-preference bias (GPT-4 prefers GPT-4 answers), correlation with humans of 0.5-0.7 on open-ended tasks. For safety and domain-specific correctness, human raters are mandatory"

Q: How does LLM monitoring differ from monitoring classical ML models?

Red flag: "It's the same, we just track accuracy and latency"

Strong answer: "In MLOps we monitor data drift (KS test, PSI), concept drift (accuracy degradation), and model staleness. LLMOps adds its own signals: hallucination rate (automated sampling + LLM-as-judge), cost spike (daily cost > 2x avg weekly), safety violations (content filters), user satisfaction (thumbs up/down, regeneration rate). Critical alert: hallucination_rate > 5%. LLMOps has no data drift in the classical sense; instead there are prompt drift and model version changes"

Cross-references¶

See also: KV cache optimization (LLMOps serving optimization)

Sources & Further Reading¶

Articles (2025-2026)¶

USAII: "DevOps vs MLOps vs LLMOps" (2026)
N-iX: "LLMOps vs MLOps: Differences and Use Cases"
Medium: "Complete MLOps/LLMOps Roadmap for 2026"
Dev.to: "What Every Developer Needs to Know in 2025"
Medium: "Complete Guide to MLOps, AIOps, LLMOps, AgentOps"

Tools Documentation¶

LangSmith: Prompt management
Langfuse: LLM observability
RAGAS: RAG evaluation
DeepEval: LLM testing
