
Structured Output

~6 min read

Constrained decoding, FSM, Outlines, Instructor, LLGuidance, XGrammar, native API support -- a complete breakdown (2025-2026)


Key Concepts

The Problem: Why Structured Output Is Hard

An LLM generates tokens autoregressively -- each next token is sampled from a probability distribution. Without constraints, the model can produce any of the failures below:

Typical failures with prompt-based JSON:
1. Missing quotes:        {name: "John", age: 30}
2. Trailing comma:        {"name": "John", "age": 30,}
3. Wrong type:            {"age": "30"}  // string instead of int
4. Truncated JSON:        {"name": "John", "age":
5. Text around the JSON:  Here's the JSON: {"name": "John"}

Result: 10-20% of outputs need retries or break downstream parsing
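
Every failure above except the wrong type either raises in a strict parser or silently drops data. A quick demonstration with Python's standard json module (a minimal illustration, not part of any cited benchmark):

import json

bad_outputs = [
    '{name: "John", age: 30}',              # missing quotes on the key
    '{"name": "John", "age": 30,}',         # trailing comma
    '{"name": "John", "age":',              # truncated
    'Here\'s the JSON: {"name": "John"}',   # text around the JSON
]

for s in bad_outputs:
    try:
        json.loads(s)
    except json.JSONDecodeError as e:
        print(f"parse error: {e.msg!r} in {s[:30]!r}")

# {"age": "30"} parses fine but carries the wrong type --
# catching it requires schema validation, not just parsing.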

Evolution of Approaches (2023-2026)

| Year | Approach | Reliability |
|------|----------|-------------|
| 2023 | Prompt engineering ("return JSON") | 70-80% |
| 2024 | JSON Mode + retries | 85-95% |
| 2025 | Constrained decoding (Outlines, XGrammar) | 99%+ |
| 2026 | Native API support (OpenAI, Gemini, Mistral) | 99.9%+ |

Constrained Decoding: The Mechanism

Invalid tokens are masked at every generation step:

\[P_{\text{constrained}}(t_i \mid t_{<i}, S) = \frac{P(t_i \mid t_{<i}) \cdot \mathbb{1}[t_i \in \text{Valid}(S, t_{<i})]}{\sum_{t'} P(t' \mid t_{<i}) \cdot \mathbb{1}[t' \in \text{Valid}(S, t_{<i})]}\]

where \(S\) is the schema and \(\text{Valid}(S, t_{<i})\) is the set of tokens that keep the output valid so far.

def constrained_decode(model, prompt, grammar):
    tokens = tokenize(prompt)
    while tokens[-1] != EOS_TOKEN:                     # stop once the model emits EOS
        logits = model(tokens)                         # scores over the whole vocabulary
        valid_mask = grammar.get_valid_tokens(tokens)  # boolean mask from the grammar/FSM
        logits[~valid_mask] = -float('inf')            # invalid tokens get zero probability
        next_token = sample(softmax(logits))           # renormalize and sample
        tokens.append(next_token)
    return detokenize(tokens)

Finite State Machine (FSM)

The JSON Schema is compiled into a finite state machine that tracks the parsing state (a toy code sketch follows the state table below):

State 0 (Start):       valid: {, [, whitespace
State 1 (Object):      valid: ", }, whitespace
State 2 (Key):          valid: string characters
State 3 (After key):   valid: :, whitespace
State 4 (Value start): valid: ", {, [, digit, true, false, null
...
At each position: state -> valid token set -> mask -> sample -> update state
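
A toy character-level version of this state walk (illustration only: production FSMs from Outlines or XGrammar are compiled over the tokenizer vocabulary, cover full JSON, and precompute masks):

# Toy FSM for the opening of a JSON object.
TRANSITIONS = {
    "start":     {"{": "object"},
    "object":    {'"': "key", "}": "done"},
    "after_key": {":": "value_start"},
}

def step(state, ch):
    if state == "key":                         # inside a key: any char; '"' closes it
        return "after_key" if ch == '"' else "key"
    return TRANSITIONS.get(state, {}).get(ch)  # None => ch is masked out

state = "start"
for ch in '{"name":':
    next_state = step(state, ch)
    assert next_state is not None, f"{ch!r} is invalid in state {state}"
    state = next_state
print(state)  # value_start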

Tools and Libraries

| Tool | Type | Best For | Method | Speed | License |
|------|------|----------|--------|-------|---------|
| LLGuidance | Library | Production, speed | Trie-based prefix matching | 2-5x faster than Outlines | Apache 2.0 |
| Outlines | Library | Flexibility, research | FSM constrained decoding | Baseline | Apache 2.0 |
| Instructor | Library | Python DX, multi-provider | Validation + retries | Depends on retries | MIT |
| XGrammar | Library | vLLM/SGLang integration | Context-free grammar | Near-zero overhead | Open |
| LMQL | Language | Custom constraints | Query language | Fast | Open |

Outlines

FSM-based constrained decoding. Supports JSON Schema, regex, and CFG constraints.

import outlines
from pydantic import BaseModel
from typing import List

class Person(BaseModel):
    name: str
    age: int
    skills: List[str]

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.json(model, Person)
result = generator("Generate a person profile")
# GUARANTEED to be a valid Person

Regex constraint:

generator = outlines.generate.regex(model, r"\d{3}-\d{3}-\d{4}")
phone = generator("My phone number is")  # "555-123-4567"

Custom Grammar (SQL):

sql_grammar = """
    ?start: select_statement
    select_statement: "SELECT" column_list "FROM" table_name
    column_list: column ("," column)*
    column: WORD
    table_name: WORD
    %import common.WORD
    %import common.WS
    %ignore WS
"""
generator = outlines.generate.cfg(model, sql_grammar)
result = generator("Write a SQL query")  # "SELECT name FROM users"

Instructor

Validation-based: the Pydantic schema goes into the prompt -> the response is parsed -> retry on failure. Works with ANY LLM API.

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator

class User(BaseModel):
    name: str = Field(..., min_length=2)
    age: int = Field(..., ge=0, le=150)
    email: str

    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email')
        return v

client = instructor.from_openai(OpenAI())
user = client.chat.completions.create(
    model="gpt-4",
    response_model=User,
    messages=[{"role": "user", "content": "Create a user named John"}],
    max_retries=3
)

LLGuidance

Architecture: Schema Parser (JSON Schema -> grammar -> LL(1) parser) -> Token Matcher (trie-based prefix matching) -> Logit Processor (mask + renormalize).

Key facts: 2-5x faster than Outlines with sub-millisecond overhead; adopted by OpenAI (May 2025). Integrations: vLLM, SGLang, TensorRT-LLM.
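
A simplified sketch of the trie idea (hypothetical helper names, not the LLGuidance API): given the strings the grammar currently allows, a trie answers "is this token's text a valid continuation?" in time proportional to the token length.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False

def build_trie(valid_strings):
    root = TrieNode()
    for s in valid_strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root

def token_allowed(root, token_text):
    # A token is allowed if it walks the trie without falling off,
    # i.e. it is a prefix of (or equal to) some grammar-valid string.
    node = root
    for ch in token_text:
        node = node.children.get(ch)
        if node is None:
            return False
    return True

trie = build_trie(['"name"', '"age"', "}"])
print(token_allowed(trie, '"na'))  # True  -- prefix of '"name"'
print(token_allowed(trie, '"x'))   # False -- masked out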

Native API Support

OpenAI Structured Outputs:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a person"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"]
            }
        }
    }
)
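
With strict: True the returned content is guaranteed to match the schema (barring an explicit refusal), so it can be parsed directly:

import json

person = json.loads(response.choices[0].message.content)
print(person["name"], person["age"])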

Anthropic Tool Use:

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get weather for location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }],
    messages=[{"role": "user", "content": "What's the weather in Paris?"}]
)
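
The structured arguments come back inside a tool_use content block:

# Extract the schema-shaped arguments (already parsed into a dict)
tool_call = next(b for b in response.content if b.type == "tool_use")
print(tool_call.name)   # "get_weather"
print(tool_call.input)  # e.g. {"location": "Paris", "unit": "celsius"}

Passing tool_choice={"type": "tool", "name": "get_weather"} forces the tool call, so every response carries structured arguments rather than free-form text.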

Providers (2026)

| Provider | Method | JSON Schema | Strict Mode | Overhead |
|----------|--------|-------------|-------------|----------|
| OpenAI | Constrained decoding | Full | Yes | ~10-30s first request |
| Anthropic | Tool use | Via tools | No | Variable |
| Google Gemini | Constrained decoding | Full | Yes | Low |
| Mistral | Constrained decoding | Full | Yes | Low |
| Local (vLLM) | XGrammar | Full | -- | Near-zero |
| Local (SGLang) | Native | Full | -- | Near-zero |

Details and Comparisons

Reliability

| Method | Valid JSON | Correct Types | Overall |
|--------|-----------|---------------|---------|
| Prompt only | 85% | 80% | 70% |
| JSON Mode (API) | 99% | 90% | 89% |
| Outlines (constrained) | 100% | 100%* | 100%* |
| Instructor (3 retries) | 100% | 100% | 100% |

* Syntactically guaranteed by masking; whether the values are semantically right is still up to the model.

Latency

| Method | Avg Latency | P99 Latency | Overhead |
|--------|-------------|-------------|----------|
| Prompt-based | 500ms | 800ms | 0% |
| Constrained decoding | 600ms | 950ms | +5-15% |
| Instructor (no retry) | 500ms | 800ms | 0% |
| Instructor (1 retry) | 1000ms | 2000ms | +50-200% |

Throughput by Tool

| Tool | Tokens/sec | Overhead vs base |
|------|-----------|------------------|
| Base generation | 100-150 | Baseline |
| LLGuidance | 85-130 | +10-15% |
| Outlines | 70-120 | +15-25% |
| OpenAI constrained | 80-110 | +10-20% |

First Request vs Subsequent

| Method | First Request | Subsequent |
|--------|---------------|------------|
| Native API (OpenAI) | 10-30s (schema compilation) | <100ms |
| Outlines FSM | 1-5s (FSM build) | Near-zero |
| XGrammar | <1s | Near-zero |

Schema Complexity vs Overhead

| Schema Type | Overhead |
|-------------|----------|
| Simple (3 fields) | +5% |
| Medium (10 fields) | +10% |
| Complex (nested) | +15-20% |
| Deep nesting (5+ levels) | +20-30% |

Token Overhead

| Method | Extra Tokens |
|--------|--------------|
| Prompt-based | ~20 (schema in prompt) |
| JSON Mode | ~5 |
| Outlines (masking) | 0 |
| Instructor (retries) | ~50 |

Advanced Patterns

Nested Objects:

class Address(BaseModel):
    street: str
    city: str
    country: str = Field(pattern=r"^[A-Z]{2}$")

class Company(BaseModel):
    name: str
    address: Address
    employees: int = Field(ge=0)

Union Types (discriminated):

from typing import Annotated, Literal, Union

class Cat(BaseModel):
    pet_type: Literal['cat']
    meows: bool

class Dog(BaseModel):
    pet_type: Literal['dog']
    barks: bool

# pet_type is the discriminator: validation dispatches on its value
Pet = Annotated[Union[Cat, Dog], Field(discriminator='pet_type')]

Dynamic Schemas (runtime):

from pydantic import create_model

def create_response_schema(fields: list[str]):
    # Build a Pydantic model at runtime; each field is an optional string
    return create_model(
        'DynamicResponse',
        **{field: (Optional[str], None) for field in fields}
    )

Optional + Enums:

from enum import Enum
from typing import Optional

class Status(str, Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"

class User(BaseModel):
    name: str
    email: Optional[str] = None
    status: Status

Error Handling and Retries

from pydantic import ValidationError
from tenacity import retry, stop_after_attempt

# Retry with Instructor
@retry(stop=stop_after_attempt(3))
def extract_with_retry(text: str) -> User:
    return client.chat.completions.create(
        model="gpt-4",
        response_model=User,
        messages=[{"role": "user", "content": text}]
    )

# Fallback
def safe_extract(text: str) -> Optional[User]:
    try:
        return extract_with_retry(text)
    except ValidationError:
        raw = client.chat.completions.create(...)
        return parse_json_fallback(raw)

Expected number of calls: \(E[\text{calls}] = 1/p\), where \(p\) is the success rate of a single call.
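
For example, at a per-call success rate of \(p = 0.85\) the expected cost is \(1/0.85 \approx 1.18\) calls; at \(p = 0.5\) it doubles to 2, which is where the +50-200% latency range for retry-based approaches comes from.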

JSONSchemaBench (2025)

10K real-world JSON schemas, varying complexity.

| Method | Valid Rate | Latency |
|--------|-----------|---------|
| Prompting | 70% | Baseline |
| JSON Mode | 85% | +5% |
| Instructor (retries) | 95% | +50% |
| Constrained decoding | 99.9% | +10% |

Common Mistakes

| Mistake | Impact | Fix |
|---------|--------|-----|
| Overly strict schema | Generation fails | Relax constraints |
| Missing required fields | Validation fails | Use Optional |
| Circular references | FSM cannot be built | Restructure the schema |
| Large enums | Slow matching | Use string + regex |

Choosing a Tool (Decision Tree)

Using the OpenAI API?        -> OpenAI Structured Outputs
Self-hosted, need speed?     -> LLGuidance + vLLM
Maximum flexibility?         -> Outlines
Python-first, Pydantic?      -> Instructor
Multi-provider support?      -> Instructor
Research / custom grammars?  -> Outlines

Use Cases

| Use Case | Tool | Example |
|----------|------|---------|
| Data extraction | Outlines / Instructor | Invoice -> structured JSON |
| API response generation | OpenAI Structured Outputs | Guaranteed valid response |
| Agent tool calling | Native API | Structured tool arguments |
| Evaluation & testing | Instructor | Structured eval outputs |
| Complex schemas | Outlines | Custom CFG grammar |

LogitsProcessor (Implementation Sketch)

import torch
from transformers import LogitsProcessor

class JSONLogitsProcessor(LogitsProcessor):
    """Simplified sketch: only masks to the tokens valid at the very start
    of a JSON document. A real implementation advances the FSM state after
    every generated token and recomputes the valid set per step."""

    def __init__(self, tokenizer, schema=None):
        self.tokenizer = tokenizer
        self.schema = schema
        self.fsm_state = 0  # never advanced in this sketch
        self.valid_tokens = self._get_initial_valid_tokens()

    def _get_initial_valid_tokens(self):
        # A JSON document may only open with an object, an array, or whitespace
        valid_chars = ['{', '[', ' ', '\n', '\t']
        valid_ids = []
        for char in valid_chars:
            ids = self.tokenizer.encode(char, add_special_tokens=False)
            valid_ids.extend(ids)
        return set(valid_ids)

    def __call__(self, input_ids, scores):
        # Additive mask: 0 for valid tokens, -inf for everything else
        mask = torch.full_like(scores, float('-inf'))
        for token_id in self.valid_tokens:
            mask[:, token_id] = 0
        return scores + mask
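
Wiring the processor into Hugging Face generation (gpt2 is a stand-in; any causal LM works the same way). Since the sketch only ever allows the opening characters, the output will be braces and whitespace -- it demonstrates the plumbing, not a full JSON grammar:

from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Generate JSON:", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=16,
    logits_processor=LogitsProcessorList([JSONLogitsProcessor(tokenizer)]),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))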

Interview Prep

Q: "What is structured output and why is it needed?"

An LLM generates free-form text, but production systems need guaranteed-valid JSON/XML. Prompt-based approaches give 70-85% reliability. Constrained decoding masks invalid tokens at every step -> 100% compliance with no retries.

Q: "How does constrained decoding work at the token level?"

At every step: (1) compute logits for all tokens, (2) build a mask of grammar/FSM-valid tokens, (3) set the logits of invalid tokens to -inf, (4) sample from the resulting valid distribution. The FSM tracks the current parsing state (after { only " or } is allowed).

Q: "Outlines vs Instructor -- when to use which?"

Outlines: constrained decoding, 100% compliance in a single pass, local models only, FSM-based. Better for research and complex grammars. Instructor: validation + retries, works with ANY API (OpenAI, Anthropic, etc.), Pydantic-native. Better for production Python apps and multi-provider setups.

Q: "What are XGrammar and LLGuidance?"

XGrammar (2025-2026): compiles a context-free grammar into a pushdown automaton, near-zero overhead; used by vLLM and SGLang. LLGuidance: trie-based prefix matching, 2-5x faster than Outlines, sub-millisecond overhead; adopted by OpenAI (May 2025). Both are production-grade.

Q: "Describe JSONSchemaBench."

A benchmark of 10K real-world JSON schemas of varying complexity. Prompting: 70% valid, JSON Mode: 85%, Instructor (retries): 95%, constrained decoding: 99.9%.

Q: "How do you choose an approach for production?"

OpenAI/Gemini API -> native structured outputs. Self-hosted -> LLGuidance + vLLM. Python + multi-provider -> Instructor. Research/custom grammars -> Outlines. The key trade-off: constrained decoding gives 100% compliance for +5-15% latency, versus +50-200% for retry-based approaches.

Q: "Design a data extraction pipeline."

Schema: a Pydantic model (Invoice, Resume, etc.). Tool: Instructor (multi-provider) or Outlines (self-hosted). Fallback: retry 3x, then unstructured generation + parsing. Monitoring: validation rate, P99 latency, schema complexity tracking.

Key Numbers

| Fact | Value |
|------|-------|
| Prompt-based reliability | 70-85% |
| Constrained decoding reliability | 100% |
| Constrained decoding overhead | +5-15% latency |
| Retry-based overhead | +50-200% latency |
| LLGuidance vs Outlines speed | 2-5x faster |
| OpenAI first request (schema compilation) | 10-30s |
| JSONSchemaBench, constrained decoding | 99.9% |
| Schema nesting 5+ levels | +20-30% overhead |
| Instructor extra tokens (retries) | ~50 |
| Outlines extra tokens | 0 (masking) |

Sources

  1. dev.to -- "Taming LLMs: How to Get Structured Output Every Time"
  2. McGinnis Blog -- "Comparing Python Libraries for Structured LLM Extraction"
  3. Towards Data Science -- "Generating Structured Outputs from LLMs"
  4. LLGuidance -- "Making Structured Outputs Go Brrr" (Microsoft, 2025)
  5. arXiv -- "XGrammar 2: Dynamic and Efficient Structured Generation" (Li et al., 2026)
  6. arXiv -- "JSONSchemaBench: Evaluating Constrained Decoding" (2502.18878)
  7. arXiv -- "AdapTrack: Constrained Decoding without Distorting" (ICSE 2026)
  8. Outlines Documentation -- github.com/dottxt-ai/outlines
  9. Instructor Documentation -- python.useinstructor.com
  10. Aidan Cooper -- "A Guide to Structured Generation Using Constrained Decoding"
  11. BentoML -- "LLM Inference Handbook: Structured Outputs"
  12. LetsDataScience -- "How Structured Outputs and Constrained Decoding Work"
  13. DeepLearning.AI -- "Getting Structured LLM Output" (2025)
  14. OpenAI API Documentation -- Structured Outputs
  15. Anthropic API Documentation -- Tool Use