LLM Observability¶
~7 minute read
Prerequisites: LLMOps vs MLOps, LLM Evaluation Metrics
A company handling 50K LLM requests per day without observability loses on average $3,000-5,000/month on redundant tokens, learns about quality degradation only from user complaints, and cannot trace which prompt caused a hallucination. LLM observability is more than logging: it means tracing every call (prompt, response, tokens, cost), evaluating quality in real time (hallucination rate, relevance), and alerting on anomalies. In production systems roughly 76% of response time is spent in the decode phase, and semantic caching cuts costs by 70-90%. Without an observability stack you are flying blind.
Key Concepts¶
LLM Observability = Logging + Tracing + Metrics + Evaluation + Cost Tracking
graph TD
subgraph stack["LLM OBSERVABILITY STACK"]
direction TB
subgraph row1[" "]
T["TRACING<br/>Request flow<br/>Token use"]
M["METRICS<br/>Latency / Cost<br/>Throughput"]
E["EVALUATION<br/>Quality<br/>Hallucinations<br/>Relevance"]
end
subgraph row2[" "]
L["LOGGING<br/>Prompts<br/>Responses<br/>Errors"]
A["ALERTING<br/>Anomalies<br/>Cost spikes<br/>Quality"]
D["DEBUGGING<br/>Playground<br/>Experiments<br/>Comparison"]
end
end
style T fill:#e8eaf6,stroke:#3f51b5
style M fill:#e8f5e9,stroke:#4caf50
style E fill:#fff3e0,stroke:#ef6c00
style L fill:#f3e5f5,stroke:#9c27b0
style A fill:#fce4ec,stroke:#c62828
style D fill:#e8eaf6,stroke:#3f51b5
7 Dimensions (2025)¶
- Trust -- Factual grounding, self-auditing
- Safety -- Bias detection, toxic content
- Quality -- Output coherence, accuracy
- Performance -- Latency, throughput
- Cost -- API usage, resource consumption
- User Feedback -- Ratings, thumbs up/down
- Analytics -- Cost trends, quality trends
| Concern | Impact | Solution |
|---|---|---|
| Cost | API bills grow exponentially | Token tracking, caching |
| Quality | Hallucinations, errors | Evaluation metrics |
| Latency | User experience degradation | P50/P95/P99 monitoring |
| Compliance | GDPR, data retention | Audit trails |
| Debugging | Black box models | Tracing, prompt inspection |
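To make the definition above concrete, here is a minimal sketch of what a single trace record could capture per LLM call; the field names are illustrative rather than any particular platform's schema.
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LLMTraceRecord:
    # Illustrative schema: the data an observability backend stores per call
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    model: str = ""
    prompt: str = ""
    response: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0                  # tokens x per-model price
    latency_ms: float = 0.0                # end-to-end wall time
    ttft_ms: float = 0.0                   # time to first token (streaming)
    error: Optional[str] = None
    scores: dict = field(default_factory=dict)   # e.g. {"relevance": 0.85}
    timestamp: float = field(default_factory=time.time)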
1. Platforms (2026)¶
Feature Comparison¶
| Feature | Maxim | Arize Phoenix | LangSmith | Langfuse | Braintrust |
|---|---|---|---|---|---|
| Tracing | Full | Full | Full | Full | Full |
| Agent Simulation | Yes | No | No | No | Limited |
| Evaluation Suite | Yes | Yes | Yes | Yes | Yes |
| Open Source | No | ELv2 | No | MIT | Partial |
| Self-Hosting | No | Yes | Enterprise | Yes | Enterprise |
| Prompt Mgmt | Yes | Basic | Yes | Yes | Yes |
| Cost Tracking | Yes | Yes | Yes | Yes | Yes |
| OpenTelemetry | No | Yes | No | No | No |
Pricing¶
| Platform | Free Tier | Self-Hosted | Paid |
|---|---|---|---|
| Langfuse | 50K events/mo | Free | $59/mo |
| LangSmith | 5K traces/mo | Enterprise | $39/seat/mo |
| Arize Phoenix | Open source | Free | Custom |
| Braintrust | Limited | Enterprise | Custom |
| Maxim AI | Demo | No | Enterprise |
Selection Guide¶
| Need | Recommended | Reason |
|---|---|---|
| Open source | Langfuse | MIT, 23M+ SDK installs/month, 21K+ GitHub stars |
| LangChain native | LangSmith | Zero-friction setup |
| OTel standard | Arize Phoenix | No vendor lock-in |
| All-in-one | Maxim AI | Simulation + eval + observability |
| Evaluation-first | Braintrust | Brainstore database |
| Self-hosting | Langfuse or Phoenix | Both fully self-hostable |
| Enterprise integration | Datadog AI | Existing infra |
By Team Size¶
| Team | Recommended |
|---|---|
| Solo / small | Langfuse (free tier) |
| Startup | LangSmith or Langfuse |
| Mid-size | Arize Phoenix |
| Enterprise | Maxim or custom |
Other Tools¶
| Tool | Focus |
|---|---|
| Helicone | Proxy-based, zero-code, caching |
| Portkey | Production routing, caching |
| TruLens | Evaluation framework |
| DeepEval | Comprehensive metrics |
| Datadog AI | Enterprise integration |
2. Evaluation Metrics¶
Metric Categories¶
| Category | What it checks |
|---|---|
| Groundedness | Response based on context? |
| Relevance | Response answers the query? |
| Hallucination | Response factually correct? |
| Coherence | Response well-structured? |
| Toxicity | Response harmful? |
| Bias | Response fair? |
RAG-Specific Metrics¶
| Metric | Formula/Approach |
|---|---|
| Context Precision | TP / (TP + FP) |
| Context Recall | TP / (TP + FN) |
| Faithfulness | Claims supported / Total claims |
| Answer Relevance | LLM-as-judge score |
| Answer Correctness | vs ground truth |
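Worked example: if the retriever returns 10 chunks of which 6 are relevant (TP=6, FP=4) and 2 relevant chunks were missed (FN=2), then Context Precision = 6 / 10 = 0.6 and Context Recall = 6 / 8 = 0.75; if the generated answer makes 5 claims and 4 are supported by the retrieved context, Faithfulness = 4 / 5 = 0.8.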
DeepEval Hallucination Detection¶
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    # HallucinationMetric evaluates actual_output against `context`
    context=["France is a country in Europe. Paris is its capital."]
)
metric = HallucinationMetric(threshold=0.5)
evaluate(test_cases=[test_case], metrics=[metric])
Ragas (RAG Evaluation)¶
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall
)
result = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy,
context_precision, context_recall]
)
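The `dataset` argument above is typically a HuggingFace Dataset; a minimal sketch of building one (column names follow the classic Ragas schema and may differ between Ragas versions):
from datasets import Dataset

# Column names follow the classic Ragas schema; check your Ragas version.
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["France is a country in Europe. Paris is its capital."]],
    "ground_truth": ["Paris"],
})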
3. LLM-as-a-Judge¶
graph LR
Q["Input Query"] --> MA["Model A"]
Q --> MB["Model B"]
MA --> RA["Response A"]
MB --> RB["Response B"]
C["Criteria:<br/>Accuracy, Safety..."] --> J["Judge LLM"]
RA --> J
RB --> J
J --> R["Evaluation Result"]
style Q fill:#e8eaf6,stroke:#3f51b5
style MA fill:#e8f5e9,stroke:#4caf50
style MB fill:#e8f5e9,stroke:#4caf50
style J fill:#fff3e0,stroke:#ef6c00
style R fill:#f3e5f5,stroke:#9c27b0
Reliability (2025 Research)¶
Key Finding: LLM judges show only "mediocre alignment" with human evaluators.
Best Practices:
1. Use multiple judges (ensemble)
2. Calibrate with human feedback
3. Use structured evaluation prompts
4. Report confidence intervals
| Issue | Description |
|---|---|
| Position bias | Prefers first option |
| Self-preference | Prefers own outputs |
| Inconsistency | Same input, different scores |
| Subtle errors | Can't detect nuanced mistakes |
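A common mitigation for position bias is to run each pairwise comparison twice with the candidate order swapped and accept the verdict only when both runs agree; a minimal sketch, where `judge` is any callable that maps a prompt to an answer string (the prompt wording is illustrative):
def pairwise_judge(judge, query: str, resp_a: str, resp_b: str) -> str:
    """Query the judge twice with swapped order to counter position bias."""
    template = (
        "Query: {q}\n\nResponse 1:\n{r1}\n\nResponse 2:\n{r2}\n\n"
        "Which response is more accurate and helpful? Answer '1' or '2'."
    )
    first = judge(template.format(q=query, r1=resp_a, r2=resp_b))   # A shown first
    second = judge(template.format(q=query, r1=resp_b, r2=resp_a))  # B shown first
    v1 = "A" if first.strip().startswith("1") else "B"
    v2 = "B" if second.strip().startswith("1") else "A"
    return v1 if v1 == v2 else "tie"  # disagreement -> report a tie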
4. Cost Optimization¶
Cost Drivers¶
| Factor | Impact | Optimization |
|---|---|---|
| Input tokens | $/1M tokens | Prompt compression |
| Output tokens | 2-3x input cost | Limit max_tokens |
| Model selection | 10-100x variance | Use smallest viable |
| Repeated queries | Redundant API calls | Caching |
| Context length | Quadratic attention | Chunking |
Cost Reduction Strategies¶
| Strategy | Savings |
|---|---|
| Prompt Caching (provider) | 90% |
| Semantic Caching | 70-90% |
| Model Routing | 50-80% |
| Batch Processing | 50% |
| Token Limits | 20-50% |
| Prompt Compression | 30-50% |
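Model routing can be as simple as classifying the request and dispatching it to the cheapest model that handles it reliably; a minimal sketch with illustrative model names, prices, and thresholds:
# Illustrative routing sketch; model names, prices, and the length cutoff are assumptions.
CHEAP_MODEL = "gpt-4o-mini"   # ~$0.15 / 1M input tokens
STRONG_MODEL = "gpt-4o"       # ~$2.50 / 1M input tokens

SIMPLE_TASKS = {"classification", "extraction", "formatting"}

def route_model(task_type: str, prompt: str) -> str:
    """Send short, simple tasks to the cheap model; everything else escalates."""
    if task_type in SIMPLE_TASKS and len(prompt) < 4000:
        return CHEAP_MODEL
    return STRONG_MODEL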
Semantic Caching Implementation¶
import time
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class CacheEntry:
query: str
embedding: np.ndarray
response: str
timestamp: float
class SemanticCache:
def __init__(self, embed_fn, similarity_threshold=0.95, ttl_seconds=3600):
self.embed_fn = embed_fn
self.threshold = similarity_threshold
self.ttl = ttl_seconds
self.cache: list[CacheEntry] = []
def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def get(self, query: str) -> Optional[str]:
query_emb = self.embed_fn(query)
current_time = time.time()
for entry in self.cache:
if current_time - entry.timestamp > self.ttl:
continue
sim = self._cosine_similarity(query_emb, entry.embedding)
if sim >= self.threshold:
return entry.response
return None
def set(self, query: str, response: str):
embedding = self.embed_fn(query)
self.cache.append(CacheEntry(
query=query, embedding=embedding,
response=response, timestamp=time.time()
))
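Typical usage wraps the LLM call: check the cache first, populate it on a miss. Here `embed` and `call_llm` are placeholders for your embedding and completion functions.
# Usage sketch; `embed` and `call_llm` are placeholders for your own functions.
cache = SemanticCache(embed_fn=embed, similarity_threshold=0.95, ttl_seconds=3600)

def cached_completion(query: str) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached              # cache hit: no API call
    response = call_llm(query)     # cache miss: pay for the call once
    cache.set(query, response)
    return response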
5. Latency Monitoring¶
Key Metrics¶
| Metric | Description | Target |
|---|---|---|
| TTFT | Time to First Token | <500ms |
| TPS | Tokens Per Second | >30 |
| E2E Latency | End-to-end response | <2s |
| P50 | Median latency | <1s |
| P95 | 95th percentile | <3s |
| P99 | 99th percentile | <5s |
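A minimal sketch of measuring TTFT and end-to-end latency around a streaming call (OpenAI Python SDK v1 style; the model name is an example, adapt to your provider):
import time
from openai import OpenAI

client = OpenAI()

def timed_stream(prompt: str, model: str = "gpt-4o-mini") -> dict:
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], stream=True
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    return {
        "ttft_ms": ttft * 1000,
        "e2e_ms": (end - start) * 1000,
        # chunks approximate tokens, giving a rough TPS estimate for the decode phase
        "approx_tps": chunks / max(end - (first_token_at or start), 1e-6),
    }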
Latency Breakdown (typical)¶
Total E2E: 1850ms
+-- Network: 50ms (3%)
+-- Auth: 10ms (1%)
+-- Prompt Prep: 30ms (2%)
+-- Model Queue: 100ms (5%)
+-- Prefill: 200ms (11%)
+-- Decode: 1400ms (76%) <-- BOTTLENECK
+-- Post-proc: 60ms (3%)
Optimization¶
| Technique | Improvement | Trade-off |
|---|---|---|
| Speculative Decoding | 2-3x faster | More complex |
| Streaming | Perceived faster | Same total time |
| Smaller Model | 5-10x faster | Lower quality |
| Batching | Higher throughput | Higher latency |
| Caching | Near-instant | Cache invalidation |
6. Alerting¶
| Alert Type | Condition | Action |
|---|---|---|
| Cost Spike | >2x daily average | Notify + investigate |
| Latency Increase | P95 > 3s for 5min | Scale + notify |
| Error Rate | >1% failures | Page on-call |
| Quality Drop | Score < threshold | Review prompts |
| Hallucination | Rate >5% | Investigate RAG |
alerts:
- name: cost_spike
condition: daily_spend > 2 * avg_daily_spend
severity: warning
channels: [slack, email]
- name: latency_degradation
condition: p95_latency > 3000ms AND duration > 5m
severity: critical
channels: [pagerduty, slack]
- name: error_rate_high
condition: error_rate > 1%
severity: critical
channels: [pagerduty]
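The cost_spike condition above boils down to comparing today's spend against a trailing average; a minimal sketch of that check (the 2x factor mirrors the config):
from statistics import mean

def check_cost_spike(daily_spend: float, history: list[float], factor: float = 2.0) -> bool:
    """True when today's spend exceeds `factor` x the trailing average."""
    if not history:
        return False
    return daily_spend > factor * mean(history)

# Example: $310 today vs a ~$120 trailing average -> the alert fires
assert check_cost_spike(310.0, [100, 110, 130, 120, 140, 115, 125])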
7. Production Architecture¶
graph TD
APP["Application"] --> PROXY["Proxy<br/>(Helicone)"]
PROXY --> CACHE["Cache<br/>(Redis)"]
CACHE --> LLM["LLM Provider"]
PROXY --> LF["Langfuse<br/>(Tracing)"]
CACHE --> LF
LLM --> LF
LF --> PG["PostgreSQL"]
LF --> EVAL["Evaluation<br/>DeepEval / Ragas / Custom"]
LF --> ALERT["Alerting<br/>Prometheus -> Alertmanager<br/>-> Slack / PagerDuty"]
style APP fill:#e8eaf6,stroke:#3f51b5
style PROXY fill:#fff3e0,stroke:#ef6c00
style CACHE fill:#e8f5e9,stroke:#4caf50
style LLM fill:#f3e5f5,stroke:#9c27b0
style LF fill:#e8eaf6,stroke:#3f51b5
style PG fill:#e8f5e9,stroke:#4caf50
style EVAL fill:#fff3e0,stroke:#ef6c00
style ALERT fill:#fce4ec,stroke:#c62828
Integration Code¶
# Langfuse
from langfuse import Langfuse
langfuse = Langfuse(
public_key="pk-...", secret_key="sk-...",
host="https://cloud.langfuse.com"
)
trace = langfuse.trace(name="chat-completion", user_id="user-123")
generation = trace.generation(
name="llm-call", model="gpt-4o",
input=query, output=response,
usage={"input": 100, "output": 50}, cost=0.015
)
trace.score(name="relevance", value=0.85)
# Arize Phoenix (OTel)
from phoenix.trace.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument()
# LangSmith (auto with env vars)
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "xxx"
# All LangChain calls automatically traced
# Helicone (zero-code proxy)
import openai
openai.base_url = "https://api.helicone.ai/v1"
# All requests automatically logged
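Putting the pieces together: a request handler can consult the semantic cache first and record a trace either way. A sketch that reuses the SemanticCache from section 4 and the Langfuse client above; `cache`, `call_llm`, and the event/generation fields are illustrative.
# End-to-end sketch; `cache` is the SemanticCache from section 4,
# `langfuse` the client above, `call_llm` a placeholder for your provider call.
def handle_request(user_id: str, query: str) -> str:
    trace = langfuse.trace(name="chat-completion", user_id=user_id)
    cached = cache.get(query)
    if cached is not None:
        trace.event(name="cache-hit", input=query, output=cached)
        return cached
    response = call_llm(query)
    trace.generation(
        name="llm-call", model="gpt-4o",
        input=query, output=response,
    )
    cache.set(query, response)
    return response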
8. CAG and Observational Memory (2026)¶
Cache-Augmented Generation (CAG)¶
graph LR
subgraph hybrid["HYBRID MEMORY"]
direction LR
subgraph cag["CAG (Core)"]
C1["Identity / docs"]
C2["Stable context"]
C3["Aggressive cache"]
C4["90%+ hit rate"]
end
subgraph rag["RAG (Long-tail)"]
R1["Search queries"]
R2["Rare information"]
R3["Dynamic retrieval"]
R4["Fresh data"]
end
end
style cag fill:#e8f5e9,stroke:#4caf50
style rag fill:#e8eaf6,stroke:#3f51b5
| Memory Type | Use Case | Cost | Latency |
|---|---|---|---|
| CAG | FAQs, docs, identity | Near-zero | <10ms |
| RAG | Search, rare queries | Higher | 100-500ms |
| Hybrid | Production systems | Optimized | Variable |
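The hybrid pattern can be sketched as "serve stable, high-hit-rate content from a preloaded (prompt-cached) core, fall back to retrieval for the long tail"; every helper and topic below is an illustrative placeholder:
# Hybrid CAG + RAG sketch; the topics, helpers, and content are illustrative.
STABLE_CORE = {
    "identity": "You are the support assistant for ...",
    "pricing_faq": "Plan prices and limits ...",
}

def answer(query: str, classify, retrieve, generate) -> str:
    """Queries on 'core' topics use the cached context (CAG);
    everything else goes through dynamic retrieval (RAG)."""
    topic = classify(query)                 # e.g. "pricing_faq" or "other"
    if topic in STABLE_CORE:
        context = STABLE_CORE[topic]        # near-zero cost, <10ms path
    else:
        context = retrieve(query)           # vector / hybrid search, 100-500ms
    return generate(query, context)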
Observational Memory¶
Stable context windows enable 10x cost reduction via aggressive caching of intermediate states.
| Approach | Relative Cost | Accuracy |
|---|---|---|
| Naive RAG | 1.0x | Baseline |
| Semantic Cache | 0.3-0.5x | Same |
| Observational Memory | 0.1x | Higher |
Context Window Management (5 techniques)¶
- Smart Chunking -- semantic boundaries vs fixed-size
- Semantic Caching -- 50-80% cost reduction
- RAG Optimization -- hybrid search (BM25 + dense), re-ranking
- Agent Memory -- episodic (conversations) + working (tasks)
- KV Cache Management -- PagedAttention (vLLM), batching
9. Future Trends¶
- Evaluation + Observability merging -- single platform for both
- Agent simulation becoming standard -- test before production
- OpenTelemetry adoption -- standard instrumentation
- Cost as first-class metric -- token-aware observability
- Safety integration -- guardrail monitoring built-in
Formulas¶
Cost¶
Cost = (input_tokens / 1M) × input_price + (output_tokens / 1M) × output_price
Cache Savings¶
Savings = hit_rate × requests × (cost_per_uncached_request − cost_per_cached_request)
Hallucination Rate¶
Hallucination Rate = hallucinated_responses / total_responses (per claim: unsupported_claims / total_claims)
Context Precision / Recall¶
Context Precision = TP / (TP + FP); Context Recall = TP / (TP + FN)
For the Interview¶
Q: "What is LLM observability and how does it differ from ML monitoring?"¶
LLM observability extends traditional ML monitoring: besides accuracy/drift it tracks token-level cost, prompt/response logging, hallucination detection, context window monitoring, multi-turn conversation tracing, RAG retrieval quality, and LLM-as-a-Judge evaluation. Key difference: LLMs are non-deterministic, so evaluation is harder than for classification. Production stack: proxy (Helicone) -> cache (Redis) -> LLM -> tracing (Langfuse) -> evaluation (DeepEval/Ragas) -> alerting (Prometheus).
Q: "How do you optimize LLM API cost in production?"¶
5 strategies: (1) Caching (70-90% savings) -- semantic cache (similarity threshold 0.95) plus provider-side prompt caching. (2) Model routing (50-80%) -- route simple queries to cheaper models. (3) Token optimization (20-50%) -- max_tokens limits, prompt compression. (4) Batch processing (50%) -- batch similar requests. (5) Monitoring -- cost per query type, anomaly alerts, weekly reviews. CAG (2026): a hybrid of 80% cached + 20% RAG is near-optimal. Observational memory: 10x cost reduction via stable context windows.
Q: "How do you detect hallucinations in LLM outputs?"¶
3 approaches: (1) NLI-based -- check whether the response is entailed by the context (fast, limited). (2) LLM-as-a-Judge -- ask a judge LLM whether the response is grounded (flexible, slower). (3) Fact verification -- extract claims and verify each against a knowledge base (accurate, expensive). LLM-as-a-Judge caveat: "mediocre alignment" with humans, position bias, self-preference. Best practice: multiple judges plus human calibration. Tools: DeepEval HallucinationMetric, Ragas faithfulness score.
Q: "Compare LLM observability platforms."¶
Langfuse -- MIT, 23M+ SDK installs/month, self-hostable, best for open source. LangSmith -- LangChain native, fastest setup for LangChain users. Arize Phoenix -- OpenTelemetry, no vendor lock-in. Maxim AI -- all-in-one with agent simulation. Braintrust -- evaluation-first approach. Helicone -- proxy-based, zero-code. All 5 major platforms support tracing + evaluation + cost tracking. Self-hostable: Langfuse (MIT) and Phoenix (ELv2).
Key Numbers¶
| Fact | Value |
|---|---|
| Prompt caching savings | 90% on cached reads |
| Semantic caching savings | 70-90% |
| Token caching latency improvement | 50-85% |
| Observational memory cost reduction | 10x |
| CAG optimal ratio | 80% CAG + 20% RAG |
| TTFT target | <500ms |
| P95 latency target | <3s |
| Decode bottleneck share | ~76% of E2E |
| Langfuse SDK installs/month | 23M+ |
| Langfuse GitHub stars | 21K+ |
| Arize Phoenix GitHub stars | 5,000+ |
| LLM-as-Judge human alignment | "mediocre" |
| Hallucination alert threshold | >5% |
Misconception: LLM-as-a-Judge gives an objective quality assessment
2025 research shows only "mediocre alignment" between LLM judges and human evaluators. An LLM judge exhibits position bias (prefers the first option), self-preference (prefers its own outputs), and inconsistency (same input, different scores). In production you need an ensemble of several judges, calibration against human ratings, and confidence intervals in reports.
Misconception: a semantic cache with a 0.95 threshold is safe for any task
A 0.95 threshold works for FAQs and general Q&A, but for tasks involving exact numbers, legal queries, or code, even a cosine similarity of 0.98 can return the wrong cached answer. For example, "price of the Pro plan subscription" and "price of the Enterprise plan subscription" can have similarity > 0.95. Always apply domain-specific validation on top of the semantic match.
Misconception: Langfuse/LangSmith cover 100% of observability needs
Observability platforms handle tracing and cost tracking well, but quality evaluation metrics (hallucination detection, faithfulness) require separate tools (DeepEval, Ragas). A production system needs an observability platform + an evaluation framework + alerting (Prometheus/Grafana) -- no single platform solves all three equally well.
Interview Questions¶
Q: How would you design a hallucination monitoring system for a production RAG system handling 100K requests/day?
Red flag: "Run every answer through LLM-as-a-Judge and count the hallucination rate"
Strong answer: "A three-tier approach: (1) online -- NLI-based faithfulness check on every answer (fast, <50ms), alert when score <0.7; (2) sampling -- LLM-as-a-Judge on 5-10% of requests for deeper evaluation with an ensemble of 2-3 judges; (3) offline -- Ragas faithfulness + context precision on the full daily batch. Calibrate judges on 500+ human-labeled examples, report with confidence intervals. Cost: NLI ~$0, sampling ~$50-100/day, offline ~$200/day"
Q: Your LLM service's latency degraded from P95=1.5s to P95=4s. What do you do?
Red flag: "Add more replicas"
Strong answer: "Start with the latency breakdown: network (3%), auth (1%), prompt prep (2%), model queue (5%), prefill (11%), decode (76%), post-proc (3%). If decode grew -- check whether average output length increased (new prompts?). If the queue grew -- burst traffic, autoscaling is needed. If prefill grew -- long contexts, check RAG chunking. Immediate mitigations: streaming for perceived latency, speculative decoding for 2-3x faster decode, semantic cache for repeated queries"
Q: Your team spends $15K/month on LLM APIs. Propose a plan to cut it to $5K without losing quality.
Red flag: "Switch to the cheapest model"
Strong answer: "Step by step: (1) Instrument every call with Langfuse -- understand the distribution by request type and model (one week). (2) Semantic caching for repeated patterns -- typically 60-85% hit rate, ~70% savings on those requests. (3) Model routing -- simple queries (classification, extraction) to GPT-4o-mini ($0.15/M) instead of GPT-4o ($2.50/M), 90%+ savings on 40-60% of requests. (4) Prompt caching (Anthropic) -- 90% discount on the cached prefix. (5) Batch API for offline jobs -- 50% discount. Target savings: caching $4K + routing $5K + batch $1K = $10K"
Sources¶
- Langfuse -- "LLM Observability" (official docs)
- DeepEval -- Documentation (hallucination, evaluation metrics)
- Ragas -- Evaluation Framework (GitHub)
- Maxim AI -- "Top 5 LLM Observability Platforms 2026"
- Arize Phoenix -- Official Documentation
- LangChain/LangSmith -- Official Documentation
- Braintrust -- Official Documentation
- Redis -- "Context Window Overflow 2026"
- VentureBeat -- "Observational Memory Cuts AI Agent Costs 10x"
- Medium -- "The 2025 LLM API Playbook" (cost optimization)
- arXiv -- Token Caching Research (2601.06007v2)
- arXiv -- Hallucination Detection Survey (2504.18114)
- Helicone -- Official Documentation
- Logz.io -- "Top LLM Observability Tools"
See Also¶
- LLM Evaluation Frameworks -- DeepEval and Ragas -- evaluation tools that plug into the observability pipeline
- LLM Evaluation Guardrails -- guardrails as part of monitoring: hallucination alerts, toxicity tracking
- LLM Evaluation Metrics -- BLEU, ROUGE, BERTScore, LLM-as-Judge -- the metrics that observability tracks
- Cascading LLM Routing -- routing decisions are monitored via observability: cost/quality split per model
- LLMOps Cost Optimization -- cost tracking and anomaly detection are key observability signals