
LLM Observability

~7 min read

Prerequisites: LLMOps vs MLOps, LLM Evaluation Metrics

A company handling 50K LLM requests per day without observability loses on average $3,000-5,000/month on redundant tokens, learns about quality degradation only from user complaints, and cannot trace which prompt caused a hallucination. LLM observability is more than logging: it is tracing of every call (prompt, response, tokens, cost), real-time quality evaluation (hallucination rate, relevance), and alerting on anomalies. In production systems ~76% of response time is the decode phase, and semantic caching cuts costs by 70-90%. Without an observability stack you are flying blind.


Key Concepts

LLM Observability = Logging + Tracing + Metrics + Evaluation + Cost Tracking

graph TD
    subgraph stack["LLM OBSERVABILITY STACK"]
        direction TB
        subgraph row1[" "]
            T["TRACING<br/>Request flow<br/>Token use"]
            M["METRICS<br/>Latency / Cost<br/>Throughput"]
            E["EVALUATION<br/>Quality<br/>Hallucinations<br/>Relevance"]
        end
        subgraph row2[" "]
            L["LOGGING<br/>Prompts<br/>Responses<br/>Errors"]
            A["ALERTING<br/>Anomalies<br/>Cost spikes<br/>Quality"]
            D["DEBUGGING<br/>Playground<br/>Experiments<br/>Comparison"]
        end
    end

    style T fill:#e8eaf6,stroke:#3f51b5
    style M fill:#e8f5e9,stroke:#4caf50
    style E fill:#fff3e0,stroke:#ef6c00
    style L fill:#f3e5f5,stroke:#9c27b0
    style A fill:#fce4ec,stroke:#c62828
    style D fill:#e8eaf6,stroke:#3f51b5

7 Dimensions (2025)

  1. Trust -- Factual grounding, self-auditing
  2. Safety -- Bias detection, toxic content
  3. Quality -- Output coherence, accuracy
  4. Performance -- Latency, throughput
  5. Cost -- API usage, resource consumption
  6. User Feedback -- Ratings, thumbs up/down
  7. Analytics -- Cost trends, quality trends

| Concern | Impact | Solution |
|---|---|---|
| Cost | API bills grow exponentially | Token tracking, caching |
| Quality | Hallucinations, errors | Evaluation metrics |
| Latency | User experience degradation | P50/P95/P99 monitoring |
| Compliance | GDPR, data retention | Audit trails |
| Debugging | Black box models | Tracing, prompt inspection |

1. Platforms (2026)

Feature Comparison

| Feature | Maxim | Arize Phoenix | LangSmith | Langfuse | Braintrust |
|---|---|---|---|---|---|
| Tracing | Full | Full | Full | Full | Full |
| Agent Simulation | Yes | No | No | No | Limited |
| Evaluation Suite | Yes | Yes | Yes | Yes | Yes |
| Open Source | No | ELv2 | No | MIT | Partial |
| Self-Hosting | No | Yes | Enterprise | Yes | Enterprise |
| Prompt Mgmt | Yes | Basic | Yes | Yes | Yes |
| Cost Tracking | Yes | Yes | Yes | Yes | Yes |
| OpenTelemetry | No | Yes | No | No | No |

Pricing

| Platform | Free Tier | Self-Hosted | Paid |
|---|---|---|---|
| Langfuse | 50K events/mo | Free | $59/mo |
| LangSmith | 5K traces/mo | Enterprise | $39/seat/mo |
| Arize Phoenix | Open source | Free | Custom |
| Braintrust | Limited | Enterprise | Custom |
| Maxim AI | Demo | No | Enterprise |

Selection Guide

| Need | Recommended | Reason |
|---|---|---|
| Open source | Langfuse | MIT, 23M+ SDK installs/month, 21K+ GitHub stars |
| LangChain native | LangSmith | Zero-friction setup |
| OTel standard | Arize Phoenix | No vendor lock-in |
| All-in-one | Maxim AI | Simulation + eval + observability |
| Evaluation-first | Braintrust | Brainstore database |
| Self-hosting | Langfuse or Phoenix | Both fully self-hostable |
| Enterprise integration | Datadog AI | Existing infra |

By Team Size

| Team | Recommended |
|---|---|
| Solo / small | Langfuse (free tier) |
| Startup | LangSmith or Langfuse |
| Mid-size | Arize Phoenix |
| Enterprise | Maxim or custom |

Other Tools

| Tool | Focus |
|---|---|
| Helicone | Proxy-based, zero-code, caching |
| Portkey | Production routing, caching |
| TruLens | Evaluation framework |
| DeepEval | Comprehensive metrics |
| Datadog AI | Enterprise integration |

2. Evaluation Metrics

Metric Categories

| Category | What it checks |
|---|---|
| Groundedness | Is the response based on the context? |
| Relevance | Does the response answer the query? |
| Hallucination | Is the response factually correct? |
| Coherence | Is the response well-structured? |
| Toxicity | Is the response harmful? |
| Bias | Is the response fair? |

RAG-Specific Metrics

| Metric | Formula / Approach |
|---|---|
| Context Precision | TP / (TP + FP) |
| Context Recall | TP / (TP + FN) |
| Faithfulness | Claims supported / Total claims |
| Answer Relevance | LLM-as-judge score |
| Answer Correctness | Comparison vs ground truth |
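
A minimal sketch of Context Precision and Context Recall over hypothetical sets of retrieved and relevant chunk IDs:

def context_precision(retrieved: set[str], relevant: set[str]) -> float:
    """TP / (TP + FP): fraction of retrieved chunks that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: set[str], relevant: set[str]) -> float:
    """TP / (TP + FN): fraction of relevant chunks that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"chunk-1", "chunk-2", "chunk-3"}
relevant = {"chunk-1", "chunk-4"}
print(context_precision(retrieved, relevant))  # 1/3 ≈ 0.33
print(context_recall(retrieved, relevant))     # 1/2 = 0.50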

DeepEval Hallucination Detection

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Note: HallucinationMetric evaluates actual_output against `context`;
# `retrieval_context` is used by FaithfulnessMetric instead.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    context=["France is a country in Europe. Paris is its capital."]
)

metric = HallucinationMetric(threshold=0.5)
evaluate(test_cases=[test_case], metrics=[metric])

Ragas (RAG Evaluation)

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy,
             context_precision, context_recall]
)
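
The `dataset` passed to `evaluate` is assumed to be a Hugging Face `Dataset` with the columns the classic Ragas metrics read; a minimal construction sketch:

from datasets import Dataset

# One-row example with the expected columns: question, answer,
# contexts (a list of strings per row), ground_truth.
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["France is a country in Europe. Paris is its capital."]],
    "ground_truth": ["Paris"],
})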

3. LLM-as-a-Judge

graph LR
    Q["Input Query"] --> MA["Model A"]
    Q --> MB["Model B"]
    MA --> RA["Response A"]
    MB --> RB["Response B"]
    C["Criteria:<br/>Accuracy, Safety..."] --> J["Judge LLM"]
    RA --> J
    RB --> J
    J --> R["Evaluation Result"]

    style Q fill:#e8eaf6,stroke:#3f51b5
    style MA fill:#e8f5e9,stroke:#4caf50
    style MB fill:#e8f5e9,stroke:#4caf50
    style J fill:#fff3e0,stroke:#ef6c00
    style R fill:#f3e5f5,stroke:#9c27b0

Reliability (2025 Research)

Key Finding: LLM judges show only "mediocre alignment" with human evaluators.

Best Practices:

  1. Use multiple judges (ensemble)
  2. Calibrate with human feedback
  3. Use structured evaluation prompts
  4. Report confidence intervals

| Issue | Description |
|---|---|
| Position bias | Prefers the first option |
| Self-preference | Prefers its own outputs |
| Inconsistency | Same input, different scores |
| Subtle errors | Can't detect nuanced mistakes |
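
A common mitigation is to query each judge twice with the answer order swapped, keep only position-consistent verdicts, and majority-vote across judges. A minimal sketch; the `judge` callables are hypothetical wrappers around any LLM API:

from collections import Counter
from typing import Callable, Optional

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two answers, "
    "reply with exactly 'A' or 'B' for the better answer.\n"
    "Question: {q}\nAnswer A: {a}\nAnswer B: {b}"
)

def judge_pair(judge: Callable[[str], str], q: str, ans_a: str, ans_b: str) -> Optional[str]:
    # Ask twice with the answer order swapped to neutralize position bias.
    first = judge(JUDGE_PROMPT.format(q=q, a=ans_a, b=ans_b)).strip()
    swapped = judge(JUDGE_PROMPT.format(q=q, a=ans_b, b=ans_a)).strip()
    # In the swapped round, 'A' refers to ans_b and 'B' to ans_a.
    if first == {"A": "B", "B": "A"}.get(swapped):
        return ans_a if first == "A" else ans_b
    return None  # position-dependent verdict: discard or escalate to a human

def ensemble_verdict(judges: list[Callable[[str], str]], q: str, a1: str, a2: str) -> Optional[str]:
    # Majority vote over several judge models also dilutes self-preference.
    votes = Counter(v for j in judges if (v := judge_pair(j, q, a1, a2)) is not None)
    return votes.most_common(1)[0][0] if votes else None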

4. Cost Optimization

Cost Drivers

| Factor | Impact | Optimization |
|---|---|---|
| Input tokens | $/1M tokens | Prompt compression |
| Output tokens | 2-3x input cost | Limit max_tokens |
| Model selection | 10-100x variance | Use smallest viable |
| Repeated queries | Redundant API calls | Caching |
| Context length | Quadratic attention | Chunking |

Cost Reduction Strategies

| Strategy | Savings |
|---|---|
| Prompt Caching (provider) | 90% |
| Semantic Caching | 70-90% |
| Model Routing | 50-80% |
| Batch Processing | 50% |
| Token Limits | 20-50% |
| Prompt Compression | 30-50% |
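
Model routing in practice is often just a cheap classifier in front of the API; a minimal sketch, where the model names, prices, and `classify_complexity` heuristic are illustrative assumptions:

PRICES_PER_1M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}  # USD, illustrative

def classify_complexity(query: str) -> str:
    """Naive heuristic: long or multi-step queries go to the stronger model."""
    hard_markers = ("explain why", "compare", "step by step", "prove")
    if len(query) > 500 or any(m in query.lower() for m in hard_markers):
        return "hard"
    return "easy"

def route(query: str) -> str:
    return "gpt-4o" if classify_complexity(query) == "hard" else "gpt-4o-mini"

print(route("Extract the date from: 'Invoice of 2026-01-15'"))  # gpt-4o-mini
print(route("Compare CAG and RAG step by step"))                # gpt-4o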

Semantic Caching Implementation

import time
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheEntry:
    query: str
    embedding: np.ndarray
    response: str
    timestamp: float

class SemanticCache:
    def __init__(self, embed_fn, similarity_threshold=0.95, ttl_seconds=3600):
        self.embed_fn = embed_fn
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.cache: list[CacheEntry] = []

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def get(self, query: str) -> Optional[str]:
        query_emb = self.embed_fn(query)
        current_time = time.time()
        for entry in self.cache:
            if current_time - entry.timestamp > self.ttl:
                continue
            sim = self._cosine_similarity(query_emb, entry.embedding)
            if sim >= self.threshold:
                return entry.response
        return None

    def set(self, query: str, response: str):
        embedding = self.embed_fn(query)
        self.cache.append(CacheEntry(
            query=query, embedding=embedding,
            response=response, timestamp=time.time()
        ))
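
A usage sketch; `toy_embed` is a stand-in for a real embedding model (e.g. sentence-transformers or an embeddings API):

def toy_embed(text: str) -> np.ndarray:
    # Deterministic per string within one run; replace with a real embedder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

cache = SemanticCache(embed_fn=toy_embed, similarity_threshold=0.95)
query = "What is semantic caching?"
if (response := cache.get(query)) is None:
    response = "...call the LLM here..."  # cache miss: pay for a real completion
    cache.set(query, response)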

5. Latency Monitoring

Key Metrics

| Metric | Description | Target |
|---|---|---|
| TTFT | Time to First Token | <500ms |
| TPS | Tokens Per Second | >30 |
| E2E Latency | End-to-end response | <2s |
| P50 | Median latency | <1s |
| P95 | 95th percentile | <3s |
| P99 | 99th percentile | <5s |
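
A minimal sketch of computing these percentiles from a window of recorded request latencies (the sample values are made up):

import numpy as np

# Latencies in milliseconds for recent requests (sample data).
latencies_ms = np.array([820, 950, 1100, 2400, 700, 3100, 880, 1500])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")

if p95 > 3000:
    print("ALERT: P95 above the 3s target")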

Latency Breakdown (typical)

Total E2E: 1850ms

+-- Network:       50ms  (3%)
+-- Auth:          10ms  (1%)
+-- Prompt Prep:   30ms  (2%)
+-- Model Queue:  100ms  (5%)
+-- Prefill:      200ms  (11%)
+-- Decode:      1400ms  (76%)  <-- BOTTLENECK
+-- Post-proc:    60ms  (3%)

Optimization

| Technique | Improvement | Trade-off |
|---|---|---|
| Speculative Decoding | 2-3x faster | More complex |
| Streaming | Perceived faster | Same total time |
| Smaller Model | 5-10x faster | Lower quality |
| Batching | Higher throughput | Higher latency |
| Caching | Near-instant | Cache invalidation |

6. Alerting

| Alert Type | Condition | Action |
|---|---|---|
| Cost Spike | >2x daily average | Notify + investigate |
| Latency Increase | P95 > 3s for 5min | Scale + notify |
| Error Rate | >1% failures | Page on-call |
| Quality Drop | Score < threshold | Review prompts |
| Hallucination | Rate >5% | Investigate RAG |

alerts:
  - name: cost_spike
    condition: daily_spend > 2 * avg_daily_spend
    severity: warning
    channels: [slack, email]

  - name: latency_degradation
    condition: p95_latency > 3000ms AND duration > 5m
    severity: critical
    channels: [pagerduty, slack]

  - name: error_rate_high
    condition: error_rate > 1%
    severity: critical
    channels: [pagerduty]
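
The same cost_spike rule can also be evaluated directly in application code; a minimal sketch with illustrative spend figures:

def check_cost_spike(daily_spend: float, history: list[float], factor: float = 2.0) -> bool:
    """True when today's spend exceeds `factor` times the historical average."""
    avg_daily_spend = sum(history) / len(history)
    return daily_spend > factor * avg_daily_spend

history = [110.0, 95.0, 120.0, 105.0]  # previous days' spend, USD
if check_cost_spike(daily_spend=260.0, history=history):
    print("WARNING: daily spend exceeded 2x the average")  # route to Slack/email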

7. Production Architecture

graph TD
    APP["Application"] --> PROXY["Proxy<br/>(Helicone)"]
    PROXY --> CACHE["Cache<br/>(Redis)"]
    CACHE --> LLM["LLM Provider"]
    PROXY --> LF["Langfuse<br/>(Tracing)"]
    CACHE --> LF
    LLM --> LF
    LF --> PG["PostgreSQL"]
    LF --> EVAL["Evaluation<br/>DeepEval / Ragas / Custom"]
    LF --> ALERT["Alerting<br/>Prometheus -> Alertmanager<br/>-> Slack / PagerDuty"]

    style APP fill:#e8eaf6,stroke:#3f51b5
    style PROXY fill:#fff3e0,stroke:#ef6c00
    style CACHE fill:#e8f5e9,stroke:#4caf50
    style LLM fill:#f3e5f5,stroke:#9c27b0
    style LF fill:#e8eaf6,stroke:#3f51b5
    style PG fill:#e8f5e9,stroke:#4caf50
    style EVAL fill:#fff3e0,stroke:#ef6c00
    style ALERT fill:#fce4ec,stroke:#c62828

Integration Code

# Langfuse
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-...", secret_key="sk-...",
    host="https://cloud.langfuse.com"
)

# `query` and `response` are assumed to be defined by the application
trace = langfuse.trace(name="chat-completion", user_id="user-123")
generation = trace.generation(
    name="llm-call", model="gpt-4o",
    input=query, output=response,
    usage={"input": 100, "output": 50}, cost=0.015
)
trace.score(name="relevance", value=0.85)

# Arize Phoenix (OTel)
from phoenix.trace.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument()

# LangSmith (auto with env vars)
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "xxx"
# All LangChain calls automatically traced

# Helicone (zero-code)
openai.base_url = "https://api.helicone.ai/v1"
# All requests automatically logged

8. CAG and Observational Memory (2026)

Cache-Augmented Generation (CAG)

graph LR
    subgraph hybrid["HYBRID MEMORY"]
        direction LR
        subgraph cag["CAG (Core)"]
            C1["Identity / docs"]
            C2["Stable context"]
            C3["Aggressive cache"]
            C4["90%+ hit rate"]
        end
        subgraph rag["RAG (Long-tail)"]
            R1["Search queries"]
            R2["Rare information"]
            R3["Dynamic retrieval"]
            R4["Fresh data"]
        end
    end

    style cag fill:#e8f5e9,stroke:#4caf50
    style rag fill:#e8eaf6,stroke:#3f51b5

\[\text{Optimal} = 0.8 \cdot \text{CAG} + 0.2 \cdot \text{RAG}\]

| Memory Type | Use Case | Cost | Latency |
|---|---|---|---|
| CAG | FAQs, docs, identity | Near-zero | <10ms |
| RAG | Search, rare queries | Higher | 100-500ms |
| Hybrid | Production systems | Optimized | Variable |
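
A minimal sketch of the hybrid routing: serve stable, high-frequency queries from the CAG store and fall back to retrieval for the long tail. `rag_search` and `generate` are stand-ins for real retrieval and generation:

def normalize(query: str) -> str:
    return " ".join(query.lower().split())

def rag_search(query: str) -> str:
    return "retrieved context for: " + query  # stand-in for vector search

def generate(query: str, context: str) -> str:
    return f"LLM answer to {query!r} given {context!r}"  # stand-in for the LLM call

def answer(query: str, cag_store: dict[str, str]) -> str:
    """CAG first (stable, cached context), RAG for the long tail."""
    if (hit := cag_store.get(normalize(query))) is not None:
        return hit  # CAG path: near-zero cost, <10ms
    return generate(query, rag_search(query))  # RAG path: 100-500ms retrieval

cag_store = {"what is your refund policy?": "Refunds are available within 30 days."}
print(answer("What is your refund policy?", cag_store))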

Observational Memory

Stable context windows enable 10x cost reduction via aggressive caching of intermediate states.

| Approach | Relative Cost | Accuracy |
|---|---|---|
| Naive RAG | 1.0x | Baseline |
| Semantic Cache | 0.3-0.5x | Same |
| Observational Memory | 0.1x | Higher |

Context Window Management (5 techniques)

  1. Smart Chunking -- semantic boundaries vs fixed-size (see the sketch after this list)
  2. Semantic Caching -- 50-80% cost reduction
  3. RAG Optimization -- hybrid search (BM25 + dense), re-ranking
  4. Agent Memory -- episodic (conversations) + working (tasks)
  5. KV Cache Management -- PagedAttention (vLLM), batching
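
A minimal sketch of technique 1, chunking on paragraph boundaries instead of a fixed character window (word count stands in for a real tokenizer such as tiktoken):

def chunk_by_paragraphs(text: str, max_tokens: int = 200) -> list[str]:
    """Merge whole paragraphs up to a token budget; never split mid-paragraph."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        para_len = len(para.split())  # crude token estimate
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)  # an oversized single paragraph becomes its own chunk
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks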

Observability Trends

  1. Evaluation + Observability merging -- single platform for both
  2. Agent simulation becoming standard -- test before production
  3. OpenTelemetry adoption -- standard instrumentation
  4. Cost as first-class metric -- token-aware observability
  5. Safety integration -- guardrail monitoring built-in

Formulas

Cost

\[\text{Cost} = \frac{\text{Input Tokens} \times \text{Input Price} + \text{Output Tokens} \times \text{Output Price}}{1{,}000{,}000}\]

Cache Savings

\[\text{Savings} = \text{Hit Rate} \times \text{Avg Cost per Request} \times \text{Requests}\]
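
A quick worked example of the two formulas above, with illustrative prices and volumes:

# Worked example; prices and volumes are illustrative.
input_price, output_price = 2.50, 10.00   # USD per 1M tokens
input_tokens, output_tokens = 1200, 400   # a single request

cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
print(f"cost per request: ${cost:.4f}")   # $0.0070

hit_rate, requests = 0.70, 50_000         # 70% cache hit rate, 50K requests/day
savings = hit_rate * cost * requests
print(f"daily cache savings: ${savings:.2f}")  # $245.00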

Hallucination Rate

\[\text{Hallucination Rate} = \frac{\text{Ungrounded Responses}}{\text{Total Responses}}\]

Context Precision / Recall

\[\text{Precision} = \frac{\text{Relevant Retrieved}}{\text{Total Retrieved}} \qquad \text{Recall} = \frac{\text{Relevant Retrieved}}{\text{Total Relevant}}\]

For Interviews

Q: "Что такое LLM observability и чем отличается от ML monitoring?"

LLM observability extends traditional ML monitoring: beyond accuracy/drift it tracks token-level cost, prompt/response logging, hallucination detection, context window monitoring, multi-turn conversation tracing, RAG retrieval quality, and LLM-as-a-Judge evaluation. The key difference: LLMs are non-deterministic, so evaluation is harder than for classification. Production stack: proxy (Helicone) -> cache (Redis) -> LLM -> tracing (Langfuse) -> evaluation (DeepEval/Ragas) -> alerting (Prometheus).

Q: "Как оптимизировать стоимость LLM API в production?"

Five strategies: (1) Caching (70-90% savings) -- semantic cache (similarity threshold 0.95) + prompt cache (provider API). (2) Model routing (50-80%) -- route simple queries to cheaper models. (3) Token optimization (20-50%) -- max_tokens limits, prompt compression. (4) Batch processing (50%) -- batch similar requests. (5) Monitoring -- cost per query type, anomaly alerts, weekly reviews. CAG (2026): hybrid 80% cached + 20% RAG is near-optimal. Observational memory: 10x cost reduction via stable context windows.

Q: "Как детектировать hallucinations в LLM outputs?"

Three approaches: (1) NLI-based -- check whether the response is entailed by the context (fast, limited). (2) LLM-as-a-Judge -- ask a judge LLM whether the response is grounded (flexible, slower). (3) Fact verification -- extract claims and verify each against a KB (accurate, expensive). LLM-as-a-Judge caveat: "mediocre alignment" with humans, position bias, self-preference. Best practice: multiple judges + human calibration. Tools: DeepEval HallucinationMetric, Ragas faithfulness score.

Q: "Сравните платформы LLM observability."

Langfuse -- MIT, 23M+ SDK installs/month, self-hostable, best for open source. LangSmith -- LangChain-native, fastest setup for LangChain users. Arize Phoenix -- OpenTelemetry, no vendor lock-in. Maxim AI -- all-in-one with agent simulation. Braintrust -- evaluation-first approach. Helicone -- proxy-based, zero-code. All 5 major platforms support tracing + evaluation + cost tracking. Self-hostable: Langfuse (MIT) and Phoenix (ELv2).


Key Numbers

| Fact | Value |
|---|---|
| Prompt caching savings | 90% on cached reads |
| Semantic caching savings | 70-90% |
| Token caching latency improvement | 50-85% |
| Observational memory cost reduction | 10x |
| CAG optimal ratio | 80% CAG + 20% RAG |
| TTFT target | <500ms |
| P95 latency target | <3s |
| Decode bottleneck share | ~76% of E2E |
| Langfuse SDK installs/month | 23M+ |
| Langfuse GitHub stars | 21K+ |
| Arize Phoenix GitHub stars | 5,000+ |
| LLM-as-Judge human alignment | "mediocre" |
| Hallucination alert threshold | >5% |

Misconception: LLM-as-a-Judge provides an objective quality assessment

2025 research shows only "mediocre alignment" between LLM judges and human evaluators. An LLM judge exhibits position bias (prefers the first option), self-preference (prefers its own outputs), and inconsistency (same input, different scores). In production you need an ensemble of several judges, calibration against human ratings, and confidence intervals in reports.

Misconception: a semantic cache with threshold 0.95 is safe for any task

A 0.95 threshold works for FAQ and general Q&A, but for tasks involving exact numbers, legal queries, or code, even a cosine similarity of 0.98 can return the wrong cached answer. For example, "price of the Pro plan subscription" and "price of the Enterprise plan subscription" can have similarity > 0.95. Always apply domain-specific validation on top of the semantic match.

Misconception: Langfuse/LangSmith cover 100% of observability needs

Observability platforms handle tracing and cost tracking well, but quality evaluation metrics (hallucination detection, faithfulness) require separate tools (DeepEval, Ragas). A production system needs an observability platform + an evaluation framework + alerting (Prometheus/Grafana) -- no single platform solves all three equally well.


Interview Questions

Q: How would you design a hallucination monitoring system for a production RAG system with 100K requests/day?

❌ Red flag: "Прогоню все ответы через LLM-as-a-Judge и буду считать hallucination rate"

✅ Strong answer: "Трехуровневый подход: (1) online -- NLI-based проверка faithfulness для каждого ответа (быстро, <50ms), с alert при score <0.7; (2) sampling -- LLM-as-a-Judge на 5-10% запросов для глубокой оценки с ансамблем из 2-3 судей; (3) offline -- Ragas faithfulness + context precision на полном дневном батче. Калибровка судей на 500+ размеченных human-примерах, reporting с confidence intervals. Cost: NLI ~\(0, sampling ~\)50-100/день, offline ~$200/день"

Q: Your LLM service's latency degraded from P95=1.5s to P95=4s. What do you do?

❌ Red flag: "Увеличу количество реплик"

✅ Strong answer: "Сначала диагностика по breakdown: network (3%), auth (1%), prompt prep (2%), model queue (5%), prefill (11%), decode (76%), post-proc (3%). Если decode вырос -- проверить, не увеличился ли средний output length (новые промпты?). Если queue вырос -- burst traffic, нужен autoscaling. Если prefill -- длинные контексты, проверить RAG chunking. Immediate mitigations: streaming для perceived latency, speculative decoding для 2-3x ускорения decode, semantic cache для повторяющихся запросов"

Q: Your team spends $15K/month on LLM APIs. Propose a plan to cut it to $5K without quality loss.

❌ Red flag: "Переключимся на самую дешевую модель"

✅ Strong answer: "Поэтапно: (1) Инструментировать все вызовы через Langfuse -- понять distribution по типам запросов и моделям (неделя). (2) Semantic caching для повторяющихся паттернов -- обычно 60-85% hit rate, экономия 70% на этих запросах. (3) Model routing -- simple queries (classification, extraction) на GPT-4o-mini (\(0.15/M) вместо GPT-4o (\)2.50/M), savings 90%+ на 40-60% запросов. (4) Prompt caching (Anthropic) -- 90% скидка на cached prefix. (5) Batch API для offline задач -- 50% discount. Целевая экономия: caching $4K + routing $5K + batch $1K = $10K savings"


Sources

  1. Langfuse -- "LLM Observability" (official docs)
  2. DeepEval -- Documentation (hallucination, evaluation metrics)
  3. Ragas -- Evaluation Framework (GitHub)
  4. Maxim AI -- "Top 5 LLM Observability Platforms 2026"
  5. Arize Phoenix -- Official Documentation
  6. LangChain/LangSmith -- Official Documentation
  7. Braintrust -- Official Documentation
  8. Redis -- "Context Window Overflow 2026"
  9. VentureBeat -- "Observational Memory Cuts AI Agent Costs 10x"
  10. Medium -- "The 2025 LLM API Playbook" (cost optimization)
  11. arXiv -- Token Caching Research (2601.06007v2)
  12. arXiv -- Hallucination Detection Survey (2504.18114)
  13. Helicone -- Official Documentation
  14. Logz.io -- "Top LLM Observability Tools"

See Also