Structured Output¶
~6 minute read
Constrained decoding, FSM, Outlines, Instructor, LLGuidance, XGrammar, native APIs -- a complete breakdown (2025-2026)
Key Concepts¶
The Problem: Why Structured Output Is Hard¶
An LLM generates tokens autoregressively -- each next token is sampled from a probability distribution. Without constraints, the model is free to emit anything.
Typical failures of prompt-based JSON:
1. Missing quotes: {name: "John", age: 30}
2. Trailing comma: {"name": "John", "age": 30,}
3. Wrong type: {"age": "30"} // string instead of int
4. Incomplete JSON: {"name": "John", "age":
5. Text around the JSON: Here's the JSON: {"name": "John"}
Result: 10-20% of outputs need retries or break downstream parsing
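A minimal self-contained illustration of failure mode 2: Python's standard json parser rejects a trailing comma outright.

```python
import json

raw = '{"name": "John", "age": 30,}'  # failure mode 2: trailing comma
try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print(err)  # Expecting property name enclosed in double quotes: ...
```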
Evolution of Approaches (2023-2026)¶
| Year | Approach | Reliability |
|---|---|---|
| 2023 | Prompt engineering ("return JSON") | 70-80% |
| 2024 | JSON Mode + retries | 85-95% |
| 2025 | Constrained decoding (Outlines, XGrammar) | 99%+ |
| 2026 | Native API support (OpenAI, Gemini, Mistral) | 99.9%+ |
Constrained Decoding: The Mechanism¶
Invalid tokens are masked at every generation step:
\[ P'(t_i \mid t_{<i}) \propto P(t_i \mid t_{<i}) \cdot \mathbb{1}\left[t_i \in \mathrm{Valid}(S, t_{<i})\right] \]
where \(S\) is the schema and \(\mathrm{Valid}(S, t_{<i})\) is the set of tokens that keep the output valid.
def constrained_decode(model, prompt, grammar):
    tokens = tokenize(prompt)
    next_token = None
    while next_token != EOS_TOKEN:
        logits = model(tokens)                         # scores over the full vocabulary
        valid_mask = grammar.get_valid_tokens(tokens)  # boolean mask from the grammar/FSM state
        logits[~valid_mask] = -float('inf')            # forbid invalid tokens
        next_token = sample(softmax(logits))           # softmax renormalizes over valid tokens
        tokens.append(next_token)
    return detokenize(tokens)
Finite State Machine (FSM)¶
The JSON Schema is compiled into a finite state machine that tracks the parsing state:
State 0 (Start): valid: {, [, whitespace
State 1 (Object): valid: ", }, whitespace
State 2 (Key): valid: string characters
State 3 (After key): valid: :, whitespace
State 4 (Value start): valid: ", {, [, digit, true, false, null
...
At every position: state -> valid token set -> mask -> sample -> update state
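To make the state table concrete, here is a toy character-level sketch (the state names are hypothetical; real libraries such as Outlines build the automaton over tokenizer tokens, not characters):

```python
# Toy character-level FSM for the opening of a JSON object.
TRANSITIONS = {
    "start":     {"{": "object"},             # State 0 -> State 1
    "object":    {'"': "key", "}": "done"},   # State 1 -> State 2 or end
    "key":       {'"': "after_key"},          # closing quote ends the key
    "after_key": {":": "value_start"},        # State 3 -> State 4
}

def valid_chars(state: str) -> set[str]:
    """The mask: only these characters may be sampled in this state."""
    return set(TRANSITIONS.get(state, {}))

state = "start"
for ch in '{"name":':
    # the "key" state additionally accepts arbitrary string characters
    assert ch in valid_chars(state) or state == "key"
    state = TRANSITIONS[state].get(ch, state)  # stay in "key" while reading the name
print(state)  # value_start
```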
Tools and Libraries¶
| Tool | Type | Best For | Method | Speed | License |
|---|---|---|---|---|---|
| LLGuidance | Library | Production, speed | Trie-based prefix matching | 2-5x faster than Outlines | Apache 2.0 |
| Outlines | Library | Flexibility, research | FSM constrained decoding | Baseline | Apache 2.0 |
| Instructor | Library | Python DX, multi-provider | Validation + retries | Depends on retries | MIT |
| XGrammar | Library | vLLM/SGLang integration | Context-free grammar | Near-zero overhead | Open |
| LMQL | Language | Custom constraints | Query language | Fast | Open |
Outlines¶
FSM-based constrained decoding. Supports JSON Schema, regex, and CFG.
import outlines
from pydantic import BaseModel
from typing import List
class Person(BaseModel):
name: str
age: int
skills: List[str]
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  # any HF model id
generator = outlines.generate.json(model, Person)
result = generator("Generate a person profile")
# GUARANTEED to be a valid Person instance
Regex constraint:
generator = outlines.generate.regex(model, r"\d{3}-\d{3}-\d{4}")
phone = generator("My phone number is") # "555-123-4567"
Custom Grammar (SQL):
sql_grammar = """
?start: select_statement
select_statement: "SELECT" column_list "FROM" table_name
column_list: column ("," column)*
column: WORD
table_name: WORD
%import common.WORD
"""
generator = outlines.generate.cfg(model, sql_grammar)
result = generator("Write a SQL query") # "SELECT name FROM users"
Instructor¶
Validation-based: the Pydantic schema goes into the prompt -> parse the response -> retry on validation error. Works with ANY LLM API.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator
class User(BaseModel):
name: str = Field(..., min_length=2)
age: int = Field(..., ge=0, le=150)
email: str
@field_validator('email')
@classmethod
def validate_email(cls, v):
if '@' not in v:
raise ValueError('Invalid email')
return v
client = instructor.from_openai(OpenAI())
user = client.chat.completions.create(
model="gpt-4",
response_model=User,
messages=[{"role": "user", "content": "Create a user named John"}],
max_retries=3
)
LLGuidance¶
Architecture: Schema Parser (JSON Schema -> grammar -> LL(1) parser) -> Token Matcher (trie-based prefix matching) -> Logit Processor (mask + renormalize).
Key facts: 2-5x faster than Outlines, sub-millisecond overhead. Adopted by OpenAI (May 2025). Integrations: vLLM, SGLang, TensorRT-LLM.
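For self-hosted serving, the most common way to reach these backends is through vLLM's OpenAI-compatible server. A minimal sketch (guided_json is a vLLM extension parameter; whether XGrammar or LLGuidance performs the masking depends on the server's configured backend):

```python
from openai import OpenAI

# Points at a locally running vLLM server, not api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # whatever model the server loaded
    messages=[{"role": "user", "content": "Generate a person"}],
    extra_body={"guided_json": schema},  # vLLM-specific structured-output parameter
)
print(response.choices[0].message.content)
```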
Native API Support¶
OpenAI Structured Outputs:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Generate a person"}],
response_format={
"type": "json_schema",
"json_schema": {
"name": "person",
"strict": True,
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
},
"required": ["name", "age"]
}
}
}
)
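The Python SDK also offers a beta helper that accepts a Pydantic model directly and hands back a parsed instance (a sketch; the Person model here is illustrative):

```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a person"}],
    response_format=Person,  # the SDK derives the strict JSON schema for you
)
person = completion.choices[0].message.parsed  # already a Person instance
```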
Anthropic Tool Use:
response = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
tools=[{
"name": "get_weather",
"description": "Get weather for location",
"input_schema": {
"type": "object",
"properties": {
"location": {"type": "string"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}],
messages=[{"role": "user", "content": "What's the weather in Paris?"}]
)
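The structured arguments come back as a tool_use content block; a short sketch of extracting them (note that, unlike strict constrained decoding, Anthropic does not hard-guarantee schema compliance -- see the provider table below):

```python
# Find the tool_use block among the response content blocks
tool_use = next(block for block in response.content if block.type == "tool_use")
print(tool_use.name)   # "get_weather"
print(tool_use.input)  # e.g. {"location": "Paris", "unit": "celsius"}
```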
Providers in 2026¶
| Provider | Method | JSON Schema | Strict Mode | Overhead |
|---|---|---|---|---|
| OpenAI | Constrained decoding | Full | Yes | ~10-30s first req |
| Anthropic | Tool use | Via tools | No | Variable |
| Google Gemini | Constrained decoding | Full | Yes | Low |
| Mistral | Constrained decoding | Full | Yes | Low |
| Local (vLLM) | XGrammar | Full | -- | Near-zero |
| Local (SGLang) | Native | Full | -- | Near-zero |
Details and Comparisons¶
Reliability¶
| Method | Valid JSON | Correct Types | Overall |
|---|---|---|---|
| Prompt only | 85% | 80% | 70% |
| JSON Mode (API) | 99% | 90% | 89% |
| Outlines (constrained) | 100% | 100%* | 100%* |
| Instructor (3 retries) | 100% | 100% | 100% |
* schema-valid by construction; semantic correctness of the values is still not guaranteed
Latency¶
| Method | Avg Latency | P99 Latency | Overhead |
|---|---|---|---|
| Prompt-based | 500ms | 800ms | 0% |
| Constrained decoding | 600ms | 950ms | +5-15% |
| Instructor (no retry) | 500ms | 800ms | 0% |
| Instructor (1 retry) | 1000ms | 2000ms | +50-200% |
Throughput by Tool¶
| Tool | Tokens/sec | Overhead vs base |
|---|---|---|
| Base generation | 100-150 | Baseline |
| LLGuidance | 85-130 | +10-15% |
| Outlines | 70-120 | +15-25% |
| OpenAI constrained | 80-110 | +10-20% |
First Request vs Subsequent¶
| Method | First Request | Subsequent |
|---|---|---|
| Native API (OpenAI) | 10-30s (schema compilation) | <100ms |
| Outlines FSM | 1-5s (FSM build) | Near-zero |
| XGrammar | <1s | Near-zero |
Schema Complexity -> Overhead¶
| Schema Type | Overhead |
|---|---|
| Simple (3 fields) | +5% |
| Medium (10 fields) | +10% |
| Complex (nested) | +15-20% |
| Deep nesting (5+ levels) | +20-30% |
Token Overhead¶
| Method | Extra Tokens |
|---|---|
| Prompt-based | ~20 (schema in prompt) |
| JSON Mode | ~5 |
| Outlines (masking) | 0 |
| Instructor (retries) | ~50 |
Advanced Patterns¶
Nested Objects:
class Address(BaseModel):
street: str
city: str
country: str = Field(pattern=r"^[A-Z]{2}$")
class Company(BaseModel):
name: str
address: Address
employees: int = Field(ge=0)
Union Types (discriminated):
from typing import Literal, Union
class Cat(BaseModel):
    pet_type: Literal['cat']
    meows: bool
class Dog(BaseModel):
    pet_type: Literal['dog']
    barks: bool
Pet = Union[Cat, Dog]
class Owner(BaseModel):
    pet: Pet = Field(discriminator='pet_type')  # dispatch on the pet_type tag
Dynamic Schemas (runtime):
from pydantic import create_model
def create_response_schema(fields: list[str]):
    # each requested field becomes an optional string field
    return create_model(
        'DynamicResponse',
        **{field: (str, None) for field in fields}
    )
# create_response_schema(["title", "summary"]) -> model with two optional str fields
Optional + Enums:
from enum import Enum
from typing import Optional
class Status(str, Enum):
ACTIVE = "active"
INACTIVE = "inactive"
class User(BaseModel):
name: str
email: Optional[str] = None
status: Status
Error Handling and Retries¶
# Retry with Instructor (retry decorator from tenacity)
from tenacity import retry, stop_after_attempt
@retry(stop=stop_after_attempt(3))
def extract_with_retry(text: str) -> User:
return client.chat.completions.create(
model="gpt-4",
response_model=User,
messages=[{"role": "user", "content": text}]
)
# Fallback to lenient parsing when validation keeps failing
from pydantic import ValidationError
def safe_extract(text: str) -> Optional[User]:
    try:
        return extract_with_retry(text)
    except ValidationError:
        raw = client.chat.completions.create(...)  # unconstrained call
        return parse_json_fallback(raw)  # placeholder: your lenient JSON extraction
Expected number of calls: \(E[\text{calls}] = 1/p\), where \(p\) is the success rate of a single call (e.g., at \(p = 0.85\), expect \(\approx 1.18\) calls on average).
JSONSchemaBench (2025)¶
A benchmark of 10K real-world JSON schemas of varying complexity.
| Method | Valid Rate | Latency |
|---|---|---|
| Prompting | 70% | Baseline |
| JSON Mode | 85% | +5% |
| Instructor (retries) | 95% | +50% |
| Constrained Decoding | 99.9% | +10% |
Common Mistakes¶
| Mistake | Impact | Fix |
|---|---|---|
| Overly strict schema | Generation fails | Relax the constraints |
| Missing required fields | Validation fails | Use Optional |
| Circular references | FSM cannot be built | Restructure the schema |
| Large enums | Slow matching | Use string + regex instead |
Choosing a Tool (Decision Tree)¶
OpenAI API? -> OpenAI Structured Outputs
Self-hosted and speed matters? -> LLGuidance + vLLM
Maximum flexibility? -> Outlines
Python-first, Pydantic? -> Instructor
Multi-provider support? -> Instructor
Research / custom grammars? -> Outlines
Use Cases¶
| Use Case | Tool | Example |
|---|---|---|
| Data extraction | Outlines / Instructor | Invoice -> structured JSON |
| API response generation | OpenAI Structured | Guaranteed valid response |
| Agent tool calling | Native API | Structured tool arguments |
| Evaluation & testing | Instructor | Structured eval outputs |
| Complex schemas | Outlines | Custom CFG grammar |
LogitsProcessor (implementation)¶
import torch
from transformers import LogitsProcessor

class JSONLogitsProcessor(LogitsProcessor):
    """Simplified sketch: only enforces tokens that may START a JSON document.
    A full implementation would advance an FSM after every generated token
    and recompute the valid set at each step."""
    def __init__(self, tokenizer, schema=None):
        self.tokenizer = tokenizer
        self.schema = schema
        self.fsm_state = 0  # a real FSM state would be updated in __call__
        self.valid_tokens = self._get_initial_valid_tokens()
    def _get_initial_valid_tokens(self):
        # A JSON document can open with an object, an array, or whitespace
        valid_chars = ['{', '[', ' ', '\n', '\t']
        valid_ids = []
        for char in valid_chars:
            ids = self.tokenizer.encode(char, add_special_tokens=False)
            valid_ids.extend(ids)
        return set(valid_ids)
    def __call__(self, input_ids, scores):
        # Additive mask: 0 for valid tokens, -inf for everything else
        mask = torch.full_like(scores, float('-inf'))
        for token_id in self.valid_tokens:
            mask[:, token_id] = 0
        return scores + mask
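A hypothetical wiring of this processor into Hugging Face generation (the model choice is arbitrary; since the sketch above only encodes the initial FSM state, we constrain exactly one token):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works for the demo
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Return JSON:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=1,  # the sketch only constrains the first generated token
    do_sample=True,
    logits_processor=LogitsProcessorList([JSONLogitsProcessor(tokenizer)]),
)
print(tokenizer.decode(out[0][-1:]))  # "{", "[" or whitespace
```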
For the Interview¶
Q: "What is structured output and why is it needed?"¶
An LLM generates free-form text, while production systems need guaranteed-valid JSON/XML. A prompt-based approach gives 70-85% reliability. Constrained decoding masks invalid tokens at every step -> 100% compliance without retries.
Q: "Как работает constrained decoding на уровне токенов?"¶
На каждом шаге: (1) вычислить logits для всех токенов, (2) построить маску валидных токенов по грамматике/FSM, (3) set logits невалидных -> -inf, (4) sample из валидного распределения. FSM отслеживает текущее состояние парсинга (after
{можно"или}).
Q: "Outlines vs Instructor -- когда что?"¶
Outlines: constrained decoding, 100% за один проход, только local models, FSM-based. Лучше для research и complex grammars. Instructor: validation + retries, работает с ЛЮБЫМ API (OpenAI, Anthropic, etc.), Pydantic native. Лучше для production Python apps и multi-provider.
Q: "Что такое XGrammar и LLGuidance?"¶
XGrammar (2025-2026): Context-Free Grammar -> pushdown automaton, near-zero overhead. Используется vLLM, SGLang. LLGuidance: trie-based prefix matching, 2-5x быстрее Outlines, sub-millisecond. Принят OpenAI (May 2025). Production-grade.
Q: "Расскажите про JSONSchemaBench."¶
Бенчмарк из 10K реальных JSON schemas разной сложности. Prompting: 70%, JSON Mode: 85%, Instructor (retries): 95%, Constrained Decoding: 99.9%.
Q: "Как выбрать подход для production?"¶
OpenAI/Gemini API -> native structured outputs. Self-hosted -> LLGuidance + vLLM. Python + multi-provider -> Instructor. Research/custom grammar -> Outlines. Ключевой trade-off: constrained decoding даёт 100% compliance за +5-15% latency vs +50-200% при retry-based.
Q: "Design data extraction pipeline."¶
Schema: Pydantic model (Invoice, Resume, etc.). Tool: Instructor (multi-provider) или Outlines (self-hosted). Fallback: retry 3x -> unstructured + parsing. Monitoring: validation rate, latency P99, schema complexity tracking.
Key Numbers¶
| Fact | Value |
|---|---|
| Prompt-based reliability | 70-85% |
| Constrained decoding reliability | 100% |
| Constrained decoding overhead | +5-15% latency |
| Retry-based overhead | +50-200% latency |
| LLGuidance vs Outlines speed | 2-5x faster |
| OpenAI first request (schema compile) | 10-30s |
| JSONSchemaBench constrained | 99.9% |
| Schema complexity 5+ levels | +20-30% overhead |
| Instructor extra tokens (retries) | ~50 |
| Outlines extra tokens | 0 (masking) |
Sources¶
- dev.to -- "Taming LLMs: How to Get Structured Output Every Time"
- McGinnis Blog -- "Comparing Python Libraries for Structured LLM Extraction"
- Towards Data Science -- "Generating Structured Outputs from LLMs"
- LLGuidance -- "Making Structured Outputs Go Brrr" (Microsoft, 2025)
- arXiv -- "XGrammar 2: Dynamic and Efficient Structured Generation" (Li et al., 2026)
- arXiv -- "JSONSchemaBench: Evaluating Constrained Decoding" (2502.18878)
- arXiv -- "AdapTrack: Constrained Decoding without Distorting" (ICSE 2026)
- Outlines Documentation -- github.com/dottxt-ai/outlines
- Instructor Documentation -- python.useinstructor.com
- Aidan Cooper -- "A Guide to Structured Generation Using Constrained Decoding"
- BentoML -- "LLM Inference Handbook: Structured Outputs"
- LetsDataScience -- "How Structured Outputs and Constrained Decoding Work"
- DeepLearning.AI -- "Getting Structured LLM Output" (2025)
- OpenAI API Documentation -- Structured Outputs
- Anthropic API Documentation -- Tool Use