
LLM Observability

~7 min read

Prerequisites: LLMOps vs MLOps, LLM Evaluation Metrics

A company handling 50K LLM requests per day without observability loses on average $3,000-5,000/month on redundant tokens, learns about quality degradation only from user complaints, and cannot trace which prompt caused a hallucination. LLM observability is more than logging: it is tracing of every call (prompt, response, tokens, cost), real-time quality evaluation (hallucination rate, relevance), and alerting on anomalies. In production systems ~76% of response time is the decode phase, and semantic caching cuts costs by 70-90%. Without an observability stack you are flying blind.


Key Concepts

LLM Observability = Logging + Tracing + Metrics + Evaluation + Cost Tracking

graph TD
    subgraph stack["LLM OBSERVABILITY STACK"]
        direction TB
        subgraph row1[" "]
            T["TRACING<br/>Request flow<br/>Token use"]
            M["METRICS<br/>Latency / Cost<br/>Throughput"]
            E["EVALUATION<br/>Quality<br/>Hallucinations<br/>Relevance"]
        end
        subgraph row2[" "]
            L["LOGGING<br/>Prompts<br/>Responses<br/>Errors"]
            A["ALERTING<br/>Anomalies<br/>Cost spikes<br/>Quality"]
            D["DEBUGGING<br/>Playground<br/>Experiments<br/>Comparison"]
        end
    end

    style T fill:#e8eaf6,stroke:#3f51b5
    style M fill:#e8f5e9,stroke:#4caf50
    style E fill:#fff3e0,stroke:#ef6c00
    style L fill:#f3e5f5,stroke:#9c27b0
    style A fill:#fce4ec,stroke:#c62828
    style D fill:#e8eaf6,stroke:#3f51b5

7 Dimensions (2025)

  1. Trust -- Factual grounding, self-auditing
  2. Safety -- Bias detection, toxic content
  3. Quality -- Output coherence, accuracy
  4. Performance -- Latency, throughput
  5. Cost -- API usage, resource consumption
  6. User Feedback -- Ratings, thumbs up/down
  7. Analytics -- Cost trends, quality trends

| Concern | Impact | Solution |
|---|---|---|
| Cost | API bills grow exponentially | Token tracking, caching |
| Quality | Hallucinations, errors | Evaluation metrics |
| Latency | User experience degradation | P50/P95/P99 monitoring |
| Compliance | GDPR, data retention | Audit trails |
| Debugging | Black box models | Tracing, prompt inspection |

1. Platforms (2026)

Feature Comparison

| Feature | Maxim | Arize Phoenix | LangSmith | Langfuse | Braintrust |
|---|---|---|---|---|---|
| Tracing | Full | Full | Full | Full | Full |
| Agent Simulation | Yes | No | No | No | Limited |
| Evaluation Suite | Yes | Yes | Yes | Yes | Yes |
| Open Source | No | ELv2 | No | MIT | Partial |
| Self-Hosting | No | Yes | Enterprise | Yes | Enterprise |
| Prompt Mgmt | Yes | Basic | Yes | Yes | Yes |
| Cost Tracking | Yes | Yes | Yes | Yes | Yes |
| OpenTelemetry | No | Yes | No | No | No |

Pricing

| Platform | Free Tier | Self-Hosted | Paid |
|---|---|---|---|
| Langfuse | 50K events/mo | Free | $59/mo |
| LangSmith | 5K traces/mo | Enterprise | $39/seat/mo |
| Arize Phoenix | Open source | Free | Custom |
| Braintrust | Limited | Enterprise | Custom |
| Maxim AI | Demo | No | Enterprise |

Selection Guide

| Need | Recommended | Reason |
|---|---|---|
| Open source | Langfuse | MIT, 23M+ SDK installs/month, 21K+ GitHub stars |
| LangChain native | LangSmith | Zero-friction setup |
| OTel standard | Arize Phoenix | No vendor lock-in |
| All-in-one | Maxim AI | Simulation + eval + observability |
| Evaluation-first | Braintrust | Brainstore database |
| Self-hosting | Langfuse or Phoenix | Both fully self-hostable |
| Enterprise integration | Datadog AI | Existing infra |

By Team Size

| Team | Recommended |
|---|---|
| Solo / small | Langfuse (free tier) |
| Startup | LangSmith or Langfuse |
| Mid-size | Arize Phoenix |
| Enterprise | Maxim or custom |

Other Tools

| Tool | Focus |
|---|---|
| Helicone | Proxy-based, zero-code, caching |
| Portkey | Production routing, caching |
| TruLens | Evaluation framework |
| DeepEval | Comprehensive metrics |
| Datadog AI | Enterprise integration |

2. Evaluation Metrics

Metric Categories

| Category | What it checks |
|---|---|
| Groundedness | Is the response based on the context? |
| Relevance | Does the response answer the query? |
| Hallucination | Is the response factually correct? |
| Coherence | Is the response well-structured? |
| Toxicity | Is the response harmful? |
| Bias | Is the response fair? |

RAG-Specific Metrics

| Metric | Formula / Approach |
|---|---|
| Context Precision | TP / (TP + FP) |
| Context Recall | TP / (TP + FN) |
| Faithfulness | Claims supported / Total claims |
| Answer Relevance | LLM-as-judge score |
| Answer Correctness | Comparison vs ground truth |
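
A minimal sketch of Context Precision and Context Recall over hypothetical sets of retrieved and relevant chunk IDs:

def context_precision(retrieved: set[str], relevant: set[str]) -> float:
    """TP / (TP + FP): fraction of retrieved chunks that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: set[str], relevant: set[str]) -> float:
    """TP / (TP + FN): fraction of relevant chunks that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"chunk-1", "chunk-2", "chunk-3"}
relevant = {"chunk-1", "chunk-4"}
print(context_precision(retrieved, relevant))  # 1/3 ≈ 0.33
print(context_recall(retrieved, relevant))     # 1/2 = 0.50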

DeepEval Hallucination Detection

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Note: HallucinationMetric evaluates actual_output against `context`;
# `retrieval_context` is used by FaithfulnessMetric instead.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    context=["France is a country in Europe. Paris is its capital."]
)

metric = HallucinationMetric(threshold=0.5)
evaluate(test_cases=[test_case], metrics=[metric])

Ragas (RAG Evaluation)

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy,
             context_precision, context_recall]
)
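
The `dataset` passed to `evaluate` is assumed to be a Hugging Face `Dataset` with the columns the classic Ragas metrics read; a minimal construction sketch:

from datasets import Dataset

# One-row example with the expected columns: question, answer,
# contexts (a list of strings per row), ground_truth.
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["France is a country in Europe. Paris is its capital."]],
    "ground_truth": ["Paris"],
})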

3. LLM-as-a-Judge

graph LR
    Q["Input Query"] --> MA["Model A"]
    Q --> MB["Model B"]
    MA --> RA["Response A"]
    MB --> RB["Response B"]
    C["Criteria:<br/>Accuracy, Safety..."] --> J["Judge LLM"]
    RA --> J
    RB --> J
    J --> R["Evaluation Result"]

    style Q fill:#e8eaf6,stroke:#3f51b5
    style MA fill:#e8f5e9,stroke:#4caf50
    style MB fill:#e8f5e9,stroke:#4caf50
    style J fill:#fff3e0,stroke:#ef6c00
    style R fill:#f3e5f5,stroke:#9c27b0

Reliability (2025 Research)

Key Finding: LLM judges show only "mediocre alignment" with human evaluators.

Best Practices:

  1. Use multiple judges (ensemble)
  2. Calibrate with human feedback
  3. Use structured evaluation prompts
  4. Report confidence intervals

| Issue | Description |
|---|---|
| Position bias | Prefers the first option |
| Self-preference | Prefers its own outputs |
| Inconsistency | Same input, different scores |
| Subtle errors | Can't detect nuanced mistakes |
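
A common mitigation is to query each judge twice with the answer order swapped, keep only position-consistent verdicts, and majority-vote across judges. A minimal sketch; the `judge` callables are hypothetical wrappers around any LLM API:

from collections import Counter
from typing import Callable, Optional

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two answers, "
    "reply with exactly 'A' or 'B' for the better answer.\n"
    "Question: {q}\nAnswer A: {a}\nAnswer B: {b}"
)

def judge_pair(judge: Callable[[str], str], q: str, ans_a: str, ans_b: str) -> Optional[str]:
    # Ask twice with the answer order swapped to neutralize position bias.
    first = judge(JUDGE_PROMPT.format(q=q, a=ans_a, b=ans_b)).strip()
    swapped = judge(JUDGE_PROMPT.format(q=q, a=ans_b, b=ans_a)).strip()
    # In the swapped round, 'A' refers to ans_b and 'B' to ans_a.
    if first == {"A": "B", "B": "A"}.get(swapped):
        return ans_a if first == "A" else ans_b
    return None  # position-dependent verdict: discard or escalate to a human

def ensemble_verdict(judges: list[Callable[[str], str]], q: str, a1: str, a2: str) -> Optional[str]:
    # Majority vote over several judge models also dilutes self-preference.
    votes = Counter(v for j in judges if (v := judge_pair(j, q, a1, a2)) is not None)
    return votes.most_common(1)[0][0] if votes else None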

4. Cost Optimization

Cost Drivers

| Factor | Impact | Optimization |
|---|---|---|
| Input tokens | $/1M tokens | Prompt compression |
| Output tokens | 2-3x input cost | Limit max_tokens |
| Model selection | 10-100x variance | Use smallest viable |
| Repeated queries | Redundant API calls | Caching |
| Context length | Quadratic attention | Chunking |

Cost Reduction Strategies

| Strategy | Savings |
|---|---|
| Prompt Caching (provider) | 90% |
| Semantic Caching | 70-90% |
| Model Routing | 50-80% |
| Batch Processing | 50% |
| Token Limits | 20-50% |
| Prompt Compression | 30-50% |
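
Model routing in practice is often just a cheap classifier in front of the API; a minimal sketch, where the model names, prices, and `classify_complexity` heuristic are illustrative assumptions:

PRICES_PER_1M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}  # USD, illustrative

def classify_complexity(query: str) -> str:
    """Naive heuristic: long or multi-step queries go to the stronger model."""
    hard_markers = ("explain why", "compare", "step by step", "prove")
    if len(query) > 500 or any(m in query.lower() for m in hard_markers):
        return "hard"
    return "easy"

def route(query: str) -> str:
    return "gpt-4o" if classify_complexity(query) == "hard" else "gpt-4o-mini"

print(route("Extract the date from: 'Invoice of 2026-01-15'"))  # gpt-4o-mini
print(route("Compare CAG and RAG step by step"))                # gpt-4o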

Semantic Caching Implementation

import time
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheEntry:
    query: str
    embedding: np.ndarray
    response: str
    timestamp: float

class SemanticCache:
    def __init__(self, embed_fn, similarity_threshold=0.95, ttl_seconds=3600):
        self.embed_fn = embed_fn
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.cache: list[CacheEntry] = []

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def get(self, query: str) -> Optional[str]:
        query_emb = self.embed_fn(query)
        current_time = time.time()
        for entry in self.cache:
            if current_time - entry.timestamp > self.ttl:
                continue
            sim = self._cosine_similarity(query_emb, entry.embedding)
            if sim >= self.threshold:
                return entry.response
        return None

    def set(self, query: str, response: str):
        embedding = self.embed_fn(query)
        self.cache.append(CacheEntry(
            query=query, embedding=embedding,
            response=response, timestamp=time.time()
        ))
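
A usage sketch; `toy_embed` is a stand-in for a real embedding model (e.g. sentence-transformers or an embeddings API):

def toy_embed(text: str) -> np.ndarray:
    # Deterministic per string within one run; replace with a real embedder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

cache = SemanticCache(embed_fn=toy_embed, similarity_threshold=0.95)
query = "What is semantic caching?"
if (response := cache.get(query)) is None:
    response = "...call the LLM here..."  # cache miss: pay for a real completion
    cache.set(query, response)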

5. Latency Monitoring

Key Metrics

| Metric | Description | Target |
|---|---|---|
| TTFT | Time to First Token | <500ms |
| TPS | Tokens Per Second | >30 |
| E2E Latency | End-to-end response | <2s |
| P50 | Median latency | <1s |
| P95 | 95th percentile | <3s |
| P99 | 99th percentile | <5s |
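
A minimal sketch of computing these percentiles from a window of recorded request latencies (the sample values are made up):

import numpy as np

# Latencies in milliseconds for recent requests (sample data).
latencies_ms = np.array([820, 950, 1100, 2400, 700, 3100, 880, 1500])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")

if p95 > 3000:
    print("ALERT: P95 above the 3s target")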

Latency Breakdown (typical)

Total E2E: 1850ms

+-- Network:       50ms  (3%)
+-- Auth:          10ms  (1%)
+-- Prompt Prep:   30ms  (2%)
+-- Model Queue:  100ms  (5%)
+-- Prefill:      200ms  (11%)
+-- Decode:      1400ms  (76%)  <-- BOTTLENECK
+-- Post-proc:    60ms  (3%)

Optimization

| Technique | Improvement | Trade-off |
|---|---|---|
| Speculative Decoding | 2-3x faster | More complex |
| Streaming | Perceived faster | Same total time |
| Smaller Model | 5-10x faster | Lower quality |
| Batching | Higher throughput | Higher latency |
| Caching | Near-instant | Cache invalidation |

6. Alerting

| Alert Type | Condition | Action |
|---|---|---|
| Cost Spike | >2x daily average | Notify + investigate |
| Latency Increase | P95 > 3s for 5min | Scale + notify |
| Error Rate | >1% failures | Page on-call |
| Quality Drop | Score < threshold | Review prompts |
| Hallucination | Rate >5% | Investigate RAG |

alerts:
  - name: cost_spike
    condition: daily_spend > 2 * avg_daily_spend
    severity: warning
    channels: [slack, email]

  - name: latency_degradation
    condition: p95_latency > 3000ms AND duration > 5m
    severity: critical
    channels: [pagerduty, slack]

  - name: error_rate_high
    condition: error_rate > 1%
    severity: critical
    channels: [pagerduty]
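
The same cost_spike rule can also be evaluated directly in application code; a minimal sketch with illustrative spend figures:

def check_cost_spike(daily_spend: float, history: list[float], factor: float = 2.0) -> bool:
    """True when today's spend exceeds `factor` times the historical average."""
    avg_daily_spend = sum(history) / len(history)
    return daily_spend > factor * avg_daily_spend

history = [110.0, 95.0, 120.0, 105.0]  # previous days' spend, USD
if check_cost_spike(daily_spend=260.0, history=history):
    print("WARNING: daily spend exceeded 2x the average")  # route to Slack/email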

7. Production Architecture

graph TD
    APP["Application"] --> PROXY["Proxy<br/>(Helicone)"]
    PROXY --> CACHE["Cache<br/>(Redis)"]
    CACHE --> LLM["LLM Provider"]
    PROXY --> LF["Langfuse<br/>(Tracing)"]
    CACHE --> LF
    LLM --> LF
    LF --> PG["PostgreSQL"]
    LF --> EVAL["Evaluation<br/>DeepEval / Ragas / Custom"]
    LF --> ALERT["Alerting<br/>Prometheus -> Alertmanager<br/>-> Slack / PagerDuty"]

    style APP fill:#e8eaf6,stroke:#3f51b5
    style PROXY fill:#fff3e0,stroke:#ef6c00
    style CACHE fill:#e8f5e9,stroke:#4caf50
    style LLM fill:#f3e5f5,stroke:#9c27b0
    style LF fill:#e8eaf6,stroke:#3f51b5
    style PG fill:#e8f5e9,stroke:#4caf50
    style EVAL fill:#fff3e0,stroke:#ef6c00
    style ALERT fill:#fce4ec,stroke:#c62828

Integration Code

# Langfuse
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-...", secret_key="sk-...",
    host="https://cloud.langfuse.com"
)

# `query` and `response` are assumed to be defined by the application
trace = langfuse.trace(name="chat-completion", user_id="user-123")
generation = trace.generation(
    name="llm-call", model="gpt-4o",
    input=query, output=response,
    usage={"input": 100, "output": 50}, cost=0.015
)
trace.score(name="relevance", value=0.85)

# Arize Phoenix (OTel)
from phoenix.trace.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument()

# LangSmith (auto with env vars)
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "xxx"
# All LangChain calls automatically traced

# Helicone (zero-code)
openai.base_url = "https://api.helicone.ai/v1"
# All requests automatically logged

8. CAG and Observational Memory (2026)

Cache-Augmented Generation (CAG)

graph LR
    subgraph hybrid["HYBRID MEMORY"]
        direction LR
        subgraph cag["CAG (Core)"]
            C1["Identity / docs"]
            C2["Stable context"]
            C3["Aggressive cache"]
            C4["90%+ hit rate"]
        end
        subgraph rag["RAG (Long-tail)"]
            R1["Search queries"]
            R2["Rare information"]
            R3["Dynamic retrieval"]
            R4["Fresh data"]
        end
    end

    style cag fill:#e8f5e9,stroke:#4caf50
    style rag fill:#e8eaf6,stroke:#3f51b5

\[\text{Optimal} = 0.8 \cdot \text{CAG} + 0.2 \cdot \text{RAG}\]

| Memory Type | Use Case | Cost | Latency |
|---|---|---|---|
| CAG | FAQs, docs, identity | Near-zero | <10ms |
| RAG | Search, rare queries | Higher | 100-500ms |
| Hybrid | Production systems | Optimized | Variable |
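
A minimal sketch of the hybrid routing: serve stable, high-frequency queries from the CAG store and fall back to retrieval for the long tail. `rag_search` and `generate` are stand-ins for real retrieval and generation:

def normalize(query: str) -> str:
    return " ".join(query.lower().split())

def rag_search(query: str) -> str:
    return "retrieved context for: " + query  # stand-in for vector search

def generate(query: str, context: str) -> str:
    return f"LLM answer to {query!r} given {context!r}"  # stand-in for the LLM call

def answer(query: str, cag_store: dict[str, str]) -> str:
    """CAG first (stable, cached context), RAG for the long tail."""
    if (hit := cag_store.get(normalize(query))) is not None:
        return hit  # CAG path: near-zero cost, <10ms
    return generate(query, rag_search(query))  # RAG path: 100-500ms retrieval

cag_store = {"what is your refund policy?": "Refunds are available within 30 days."}
print(answer("What is your refund policy?", cag_store))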

Observational Memory

Stable context windows enable 10x cost reduction via aggressive caching of intermediate states.

| Approach | Relative Cost | Accuracy |
|---|---|---|
| Naive RAG | 1.0x | Baseline |
| Semantic Cache | 0.3-0.5x | Same |
| Observational Memory | 0.1x | Higher |

Context Window Management (5 techniques)

  1. Smart Chunking -- semantic boundaries vs fixed-size (see the sketch after this list)
  2. Semantic Caching -- 50-80% cost reduction
  3. RAG Optimization -- hybrid search (BM25 + dense), re-ranking
  4. Agent Memory -- episodic (conversations) + working (tasks)
  5. KV Cache Management -- PagedAttention (vLLM), batching
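
A minimal sketch of technique 1, chunking on paragraph boundaries instead of a fixed character window (word count stands in for a real tokenizer such as tiktoken):

def chunk_by_paragraphs(text: str, max_tokens: int = 200) -> list[str]:
    """Merge whole paragraphs up to a token budget; never split mid-paragraph."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        para_len = len(para.split())  # crude token estimate
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)  # an oversized single paragraph becomes its own chunk
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks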

Observability Trends

  1. Evaluation + Observability merging -- single platform for both
  2. Agent simulation becoming standard -- test before production
  3. OpenTelemetry adoption -- standard instrumentation
  4. Cost as first-class metric -- token-aware observability
  5. Safety integration -- guardrail monitoring built-in

Formulas

Cost

\[\text{Cost} = \frac{\text{Input Tokens} \times \text{Input Price} + \text{Output Tokens} \times \text{Output Price}}{1{,}000{,}000}\]

Cache Savings

\[\text{Savings} = \text{Hit Rate} \times \text{Avg Cost per Request} \times \text{Requests}\]
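
A quick worked example of the two formulas above, with illustrative prices and volumes:

# Worked example; prices and volumes are illustrative.
input_price, output_price = 2.50, 10.00   # USD per 1M tokens
input_tokens, output_tokens = 1200, 400   # a single request

cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
print(f"cost per request: ${cost:.4f}")   # $0.0070

hit_rate, requests = 0.70, 50_000         # 70% cache hit rate, 50K requests/day
savings = hit_rate * cost * requests
print(f"daily cache savings: ${savings:.2f}")  # $245.00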

Hallucination Rate

\[\text{Hallucination Rate} = \frac{\text{Ungrounded Responses}}{\text{Total Responses}}\]

Context Precision / Recall

\[\text{Precision} = \frac{\text{Relevant Retrieved}}{\text{Total Retrieved}} \qquad \text{Recall} = \frac{\text{Relevant Retrieved}}{\text{Total Relevant}}\]

For Interviews

Q: "Что такое LLM observability и чем отличается от ML monitoring?"

LLM observability extends traditional ML monitoring: beyond accuracy/drift it tracks token-level cost, prompt/response logging, hallucination detection, context window monitoring, multi-turn conversation tracing, RAG retrieval quality, and LLM-as-a-Judge evaluation. The key difference: LLMs are non-deterministic, so evaluation is harder than for classification. Production stack: proxy (Helicone) -> cache (Redis) -> LLM -> tracing (Langfuse) -> evaluation (DeepEval/Ragas) -> alerting (Prometheus).

Q: "Как оптимизировать стоимость LLM API в production?"

Five strategies: (1) Caching (70-90% savings) -- semantic cache (similarity threshold 0.95) + prompt cache (provider API). (2) Model routing (50-80%) -- route simple queries to cheaper models. (3) Token optimization (20-50%) -- max_tokens limits, prompt compression. (4) Batch processing (50%) -- batch similar requests. (5) Monitoring -- cost per query type, anomaly alerts, weekly reviews. CAG (2026): hybrid 80% cached + 20% RAG is near-optimal. Observational memory: 10x cost reduction via stable context windows.

Q: "Как детектировать hallucinations в LLM outputs?"

Three approaches: (1) NLI-based -- check whether the response is entailed by the context (fast, limited). (2) LLM-as-a-Judge -- ask a judge LLM whether the response is grounded (flexible, slower). (3) Fact verification -- extract claims and verify each against a KB (accurate, expensive). LLM-as-a-Judge caveat: "mediocre alignment" with humans, position bias, self-preference. Best practice: multiple judges + human calibration. Tools: DeepEval HallucinationMetric, Ragas faithfulness score.

Q: "Сравните платформы LLM observability."

Langfuse -- MIT, 23M+ SDK installs/month, self-hostable, best for open source. LangSmith -- LangChain-native, fastest setup for LangChain users. Arize Phoenix -- OpenTelemetry, no vendor lock-in. Maxim AI -- all-in-one with agent simulation. Braintrust -- evaluation-first approach. Helicone -- proxy-based, zero-code. All 5 major platforms support tracing + evaluation + cost tracking. Self-hostable: Langfuse (MIT) and Phoenix (ELv2).


Key Numbers

| Fact | Value |
|---|---|
| Prompt caching savings | 90% on cached reads |
| Semantic caching savings | 70-90% |
| Token caching latency improvement | 50-85% |
| Observational memory cost reduction | 10x |
| CAG optimal ratio | 80% CAG + 20% RAG |
| TTFT target | <500ms |
| P95 latency target | <3s |
| Decode bottleneck share | ~76% of E2E |
| Langfuse SDK installs/month | 23M+ |
| Langfuse GitHub stars | 21K+ |
| Arize Phoenix GitHub stars | 5,000+ |
| LLM-as-Judge human alignment | "mediocre" |
| Hallucination alert threshold | >5% |

Misconception: LLM-as-a-Judge provides an objective quality assessment

2025 research shows only "mediocre alignment" between LLM judges and human evaluators. An LLM judge exhibits position bias (prefers the first option), self-preference (prefers its own outputs), and inconsistency (same input, different scores). In production you need an ensemble of several judges, calibration against human ratings, and confidence intervals in reports.

Misconception: a semantic cache with threshold 0.95 is safe for any task

A 0.95 threshold works for FAQ and general Q&A, but for tasks involving exact numbers, legal queries, or code, even a cosine similarity of 0.98 can return the wrong cached answer. For example, "price of the Pro plan subscription" and "price of the Enterprise plan subscription" can have similarity > 0.95. Always apply domain-specific validation on top of the semantic match.

Misconception: Langfuse/LangSmith cover 100% of observability needs

Observability platforms handle tracing and cost tracking well, but quality evaluation metrics (hallucination detection, faithfulness) require separate tools (DeepEval, Ragas). A production system needs an observability platform + an evaluation framework + alerting (Prometheus/Grafana) -- no single platform solves all three equally well.


Interview Questions

Q: How would you design a hallucination monitoring system for a production RAG system with 100K requests/day?

❌ Red flag: "Прогоню все ответы через LLM-as-a-Judge и буду считать hallucination rate"

✅ Strong answer: "Трехуровневый подход: (1) online -- NLI-based проверка faithfulness для каждого ответа (быстро, <50ms), с alert при score <0.7; (2) sampling -- LLM-as-a-Judge на 5-10% запросов для глубокой оценки с ансамблем из 2-3 судей; (3) offline -- Ragas faithfulness + context precision на полном дневном батче. Калибровка судей на 500+ размеченных human-примерах, reporting с confidence intervals. Cost: NLI ~\(0, sampling ~\)50-100/день, offline ~$200/день"

Q: Your LLM service's latency degraded from P95=1.5s to P95=4s. What do you do?

❌ Red flag: "Увеличу количество реплик"

✅ Strong answer: "Сначала диагностика по breakdown: network (3%), auth (1%), prompt prep (2%), model queue (5%), prefill (11%), decode (76%), post-proc (3%). Если decode вырос -- проверить, не увеличился ли средний output length (новые промпты?). Если queue вырос -- burst traffic, нужен autoscaling. Если prefill -- длинные контексты, проверить RAG chunking. Immediate mitigations: streaming для perceived latency, speculative decoding для 2-3x ускорения decode, semantic cache для повторяющихся запросов"

Q: Your team spends $15K/month on LLM APIs. Propose a plan to cut it to $5K without quality loss.

❌ Red flag: "Переключимся на самую дешевую модель"

✅ Strong answer: "Поэтапно: (1) Инструментировать все вызовы через Langfuse -- понять distribution по типам запросов и моделям (неделя). (2) Semantic caching для повторяющихся паттернов -- обычно 60-85% hit rate, экономия 70% на этих запросах. (3) Model routing -- simple queries (classification, extraction) на GPT-4o-mini (\(0.15/M) вместо GPT-4o (\)2.50/M), savings 90%+ на 40-60% запросов. (4) Prompt caching (Anthropic) -- 90% скидка на cached prefix. (5) Batch API для offline задач -- 50% discount. Целевая экономия: caching $4K + routing $5K + batch $1K = $10K savings"


Sources

  1. Langfuse -- "LLM Observability" (official docs)
  2. DeepEval -- Documentation (hallucination, evaluation metrics)
  3. Ragas -- Evaluation Framework (GitHub)
  4. Maxim AI -- "Top 5 LLM Observability Platforms 2026"
  5. Arize Phoenix -- Official Documentation
  6. LangChain/LangSmith -- Official Documentation
  7. Braintrust -- Official Documentation
  8. Redis -- "Context Window Overflow 2026"
  9. VentureBeat -- "Observational Memory Cuts AI Agent Costs 10x"
  10. Medium -- "The 2025 LLM API Playbook" (cost optimization)
  11. arXiv -- Token Caching Research (2601.06007v2)
  12. arXiv -- Hallucination Detection Survey (2504.18114)
  13. Helicone -- Official Documentation
  14. Logz.io -- "Top LLM Observability Tools"

See Also