
LLM Inference Optimization

~10 min read

Continuous batching, PagedAttention, FlashAttention, speculative decoding (EAGLE-3), quantization (FP8/INT4/NVFP4), MoE, and an engine comparison (vLLM, SGLang, TensorRT-LLM) (2025-2026)

Prerequisites: LLM memory architecture, KV cache optimization

Why this matters. LLM inference accounts for 80-90% of total cost of ownership in production. Serving Llama 70B on 4xA100 costs ~$25K/month at cloud prices. Without optimization, GPU utilization sits at 30-50%, and you pay for the idle time. The five core techniques (continuous batching, PagedAttention, quantization, FlashAttention, speculative decoding) together deliver a 10-50x throughput improvement and a 2-4x reduction in cost per token. Implementation order matters: continuous batching gives 2-24x throughput at minimal complexity, while speculative decoding gives a 2-6x latency improvement at medium complexity.


Key Concepts

The five core techniques

| # | Technique | Speedup | Implementation complexity |
|---|-----------|---------|---------------------------|
| 1 | Continuous batching | 2-24x throughput | Low |
| 2 | KV cache (PagedAttention) | 2-4x memory efficiency | Low |
| 3 | Quantization (FP8/INT8) | 2x memory, 2x speed | Low |
| 4 | FlashAttention | 10-20x memory savings | Low |
| 5 | Speculative decoding | 2-3x latency | Medium |

Implementation order: top to bottom -- the highest ROI at the lowest complexity.

1. Continuous Batching

Problem: with static batching the GPU sits idle waiting for the slowest requests in the batch.

Solution: dynamically add and remove requests as they complete.

| Metric | Static batching | Continuous batching |
|--------|-----------------|---------------------|
| GPU utilization | 30-50% | 70-90% |
| Throughput | Baseline | 2-4x |
| Latency variance | High | Low |
| vLLM vs baseline | -- | Up to 23x |
| vLLM vs TGI (high concurrency) | -- | 24x |

Framework support:

| Framework | Feature name |
|-----------|--------------|
| vLLM | Iteration-level scheduling (core) |
| SGLang | Built-in |
| TensorRT-LLM | In-flight batching |
| LMDeploy | Persistent batching |
| HuggingFace TGI | Supported |
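A minimal, framework-agnostic sketch of iteration-level scheduling (everything here -- `Request`, `forward_one_token`, `serve` -- is an illustrative toy, not any engine's real API): on every decode step the scheduler admits waiting requests and evicts finished ones, so the GPU never idles waiting for the slowest sequence in a batch.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]                       # already-tokenized prompt
    max_new_tokens: int
    output: list[int] = field(default_factory=list)

def forward_one_token(batch: list[Request]) -> list[int]:
    """Stand-in for one batched decode step of the real model."""
    return [0 for _ in batch]               # dummy token ids

def serve(waiting: deque, max_batch_size: int = 32) -> None:
    running: list[Request] = []
    while waiting or running:
        # Iteration-level scheduling: admit new requests on every step,
        # not only when the whole batch has finished (static batching).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        new_tokens = forward_one_token(running)   # one decode step for the whole batch
        for req, tok in zip(running, new_tokens):
            req.output.append(tok)

        # Evict finished requests immediately, freeing slots for waiting ones.
        running = [r for r in running if len(r.output) < r.max_new_tokens]

serve(deque(Request(prompt=[1, 2, 3], max_new_tokens=8) for _ in range(100)))
```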

2. PagedAttention (KV Cache)

Problem: contiguous KV cache allocation leads to 60-80% memory waste due to fragmentation.

Solution: page-based memory management (analogous to virtual memory in an OS).

```mermaid
graph LR
    subgraph Traditional["Traditional KV Cache (60-80% waste)"]
        direction LR
        T1["Request 1: contiguous block"] --> TF["Fragmentation (waste)"]
        T2["Request 2: contiguous block"] --> TF
    end

    subgraph Paged["PagedAttention (<4% waste)"]
        direction LR
        P1["R1"] --> P2["R1"] --> P3["R2"] --> P4["R1"] --> P5["R2"] --> P6["R2"] --> P7["R1"]
    end

    style T1 fill:#e8eaf6,stroke:#3f51b5
    style T2 fill:#e8f5e9,stroke:#4caf50
    style TF fill:#fce4ec,stroke:#c62828
    style P1 fill:#e8eaf6,stroke:#3f51b5
    style P2 fill:#e8eaf6,stroke:#3f51b5
    style P3 fill:#e8f5e9,stroke:#4caf50
    style P4 fill:#e8eaf6,stroke:#3f51b5
    style P5 fill:#e8f5e9,stroke:#4caf50
    style P6 fill:#e8f5e9,stroke:#4caf50
    style P7 fill:#e8eaf6,stroke:#3f51b5
```

| Metric | Traditional | PagedAttention |
|--------|-------------|----------------|
| Memory waste | 60-80% | <4% |
| Memory efficiency | 60-70% | 95%+ |
| Max batch size | Limited by longest seq | 2-4x larger |
| Throughput vs FasterTransformer | -- | 2-4x |

Advanced techniques (2025-2026):

| Technique | Description |
|-----------|-------------|
| LMCache | Hierarchical caching: GPU -> CPU -> network |
| Prefix caching | Reuse common prompt prefixes |
| KV compression | FP8/INT8 for 2-3x memory savings |
| Automatic prefix caching | vLLM built-in feature |
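A toy sketch of the bookkeeping behind PagedAttention (the `BlockManager` class and its methods are illustrative; only the 16-token block size mirrors vLLM's default): each request's logical KV blocks map to arbitrary physical blocks from a shared free pool, so no request needs a contiguous region and freed blocks are reusable immediately.

```python
BLOCK_SIZE = 16   # tokens per KV block (vLLM's default granularity)

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}   # request id -> physical block ids

    def append_token(self, request_id: str, num_tokens_so_far: int) -> int:
        """Return the physical block that will hold the new token, allocating on demand."""
        table = self.block_tables.setdefault(request_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:       # current block is full (or first token)
            table.append(self.free_blocks.pop())      # grab any free physical block -- no contiguity needed
        return table[num_tokens_so_far // BLOCK_SIZE]

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
```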

3. Quantization

| Format | Memory (7B model) | Quality loss | Best for |
|--------|-------------------|--------------|----------|
| FP16/BF16 | ~14 GB | Baseline | Maximum quality |
| FP8 | ~7 GB | <1% | Production inference (Hopper+) |
| INT8 | ~7 GB | ~2% | Balanced deployment |
| INT4 | ~3.5 GB | 8-10% | Edge/resource-constrained |
| NVFP4 | ~4 GB | <1% | Next-gen GPUs (Blackwell) |

FP8 (State of the Art): requires Hopper (H100) or Ada. 2.3x inference speedup vs FP16 on LLaMA-v2-7B.

NVFP4 (Blackwell): 3.5x memory reduction vs FP16, <1% accuracy degradation (LiveCodeBench, MMLU-PRO).

Rankings:

| Rank | Method | Best for |
|------|--------|----------|
| 1 | FP8 | Batch >= 16, optimal perf/accuracy |
| 2 | Q5_K_M / GPTQ-INT8 | Best trade-off for most domains |
| 3 | AWQ | Better than GPTQ weight-only |
| 4 | INT4 (GPTQ) | Use cautiously, significant loss |

Task-specific impact: coding/STEM tasks are affected most. 70B+ models can tolerate 4-bit; smaller models need 8-bit.
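A hedged example of FP8 deployment with vLLM (assumes a recent vLLM release on Hopper-class hardware; the `quantization="fp8"` and `kv_cache_dtype="fp8"` options exist in current versions, but confirm against your version's docs; the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Weights are quantized to FP8 on load; the KV cache is also stored in FP8,
# roughly halving memory vs FP16 with <1% quality loss on general benchmarks.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",
)

outputs = llm.generate(
    ["Explain continuous batching in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```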

4. FlashAttention

Fused operations: attention runs as a single kernel that processes inputs in blocks and never materializes the full attention matrix.

| Version | GPU | TFLOPS (FP16) | Utilization | Key features |
|---------|-----|---------------|-------------|--------------|
| FA-1 | A100 | ~300 | ~50% | Basic fusion |
| FA-2 | A100/H100 | ~400 | ~35% | Improved kernels |
| FA-3 | H100 | 840 | 85% | Warp specialization, FP8 |
| FA-4 | Blackwell | TBD | Higher | Scalability, reduced overhead |

FA-3 on H100: BF16 = 840 TFLOPS (85% utilization). FP8 = 1,300 TFLOPS (1.3 PFLOPS). 1.5-2x faster than FA-2.

Memory: O(n^2) -> O(n). Standard attention at a 128K context is prohibitive; FlashAttention keeps it manageable.
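A NumPy sketch of the core FlashAttention idea for a single query vector (illustrative only, nothing like the real fused CUDA kernel): keys and values are processed block by block with a running max and running sum (online softmax), so the full n x n score matrix is never materialized while the result stays exactly equal to standard attention.

```python
import numpy as np

def streaming_attention(q, K, V, block_size=128):
    """Attention for one query vector q against K, V, processed block by block."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores (numerical stability)
    l = 0.0                                       # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)   # running weighted sum of values

    for start in range(0, K.shape[0], block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        scores = Kb @ q / np.sqrt(d)              # scores for this block only
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)                 # rescale previously accumulated results
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new

    return acc / l    # identical to softmax(K @ q / sqrt(d)) @ V, without the n x n matrix

# Quick check against the naive O(n^2)-memory implementation:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(64,)), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
s = K @ q / np.sqrt(64)
naive = np.exp(s - s.max()); naive /= naive.sum()
assert np.allclose(streaming_attention(q, K, V), naive @ V)
```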

5. Speculative Decoding

| Step | Description |
|------|-------------|
| Draft | Smaller model predicts multiple tokens ahead |
| Verify | Target model verifies the candidates in a single forward pass |
| Accept/Reject | Accepted tokens are kept; rejected tokens trigger regeneration |
| Output | Mathematically identical to standard decoding |

EAGLE-3 (NeurIPS 2025, SOTA):

  • Architecture: 1-2 transformer layers as a "draft head" (2-5% of the target's size)
  • Innovation: fusion of low-, mid-, and high-level semantic features
  • Speedup: 2-6x
  • Framework support: vLLM v0.8.5+, SGLang via SpecForge

Adoption: Google Search (AI Overviews), vLLM (native EAGLE-3 support), SGLang.

Limiting factors: acceptance rate is typically 0.6-0.8; domain mismatch reduces effectiveness.
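A toy sketch of one draft-and-verify round under greedy decoding (function names are illustrative; real implementations use rejection sampling so the output stays exact under temperature sampling too):

```python
def speculative_step(target_greedy, draft_greedy, context: list[int], k: int = 4) -> list[int]:
    """One round of speculative decoding under greedy decoding.

    target_greedy(tokens) -> list of argmax next-token ids, one per position
                             (the verification forward pass over the whole sequence)
    draft_greedy(tokens)  -> single argmax next-token id from the cheap draft model
    """
    # 1. Draft: propose k tokens autoregressively with the small model.
    proposal = []
    for _ in range(k):
        proposal.append(draft_greedy(context + proposal))

    # 2. Verify: one target forward pass over context + proposal yields the target's
    #    own prediction for every proposed position.
    target_preds = target_greedy(context + proposal)       # length == len(context) + k
    accepted = []
    for i, tok in enumerate(proposal):
        expected = target_preds[len(context) + i - 1]       # target's token for this slot
        if tok == expected:
            accepted.append(tok)                            # draft agreed with target
        else:
            accepted.append(expected)                        # correct the first mismatch...
            break                                            # ...and discard the rest
    else:
        accepted.append(target_preds[-1])                    # all k accepted: take one bonus token

    return accepted    # between 1 and k+1 tokens, identical to plain greedy decoding
```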

6. MoE for Inference

| Metric | Improvement |
|--------|-------------|
| Compute per inference | 90-95% reduction |
| Training efficiency | 2-7x faster |
| Power consumption | Up to 50% reduction |

Notable models:

| Model | Total params | Active params | Context |
|-------|--------------|---------------|---------|
| DeepSeek R1 | 671B | 37B | Standard |
| Gemini 1.5 | ~1T | 150-200B | 1M tokens |
| Kimi K2 | ~1T | 32B | Long context |

Research (2025): Super Experts (critical subset), MaxScore routing (constrained optimization), MegaScale-Infer (disaggregated expert parallelism).
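A NumPy sketch of top-k expert routing, the mechanism behind numbers like "37B active out of 671B total" (shapes and names are illustrative): each token is dispatched to only k of the experts, so per-token compute scales with roughly k/num_experts of the dense equivalent.

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """x: (tokens, d). router_w: (d, num_experts). experts: list of (d, d) weight matrices."""
    logits = x @ router_w                                  # (tokens, num_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]             # indices of the k best experts per token
    # Softmax over only the selected experts' logits -> combination weights.
    sel = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                            # per token: run only k experts
        for j, e in enumerate(topk[t]):
            out[t] += weights[t, j] * (x[t] @ experts[e])
    return out

rng = np.random.default_rng(0)
d, num_experts, tokens = 16, 8, 4
y = moe_layer(rng.normal(size=(tokens, d)),
              rng.normal(size=(d, num_experts)),
              [rng.normal(size=(d, d)) for _ in range(num_experts)],
              k=2)
print(y.shape)   # (4, 16): same output shape as a dense FFN, but each token used 2 of 8 experts
```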


Inference Engines

Comparing the leaders

| Engine | Best for | Throughput | TTFT | License |
|--------|----------|------------|------|---------|
| TensorRT-LLM | NVIDIA hardware, max perf | 180-220 req/s | 35-50 ms | NVIDIA |
| vLLM | Flexibility, community | 100-150 req/s | 50-80 ms | Apache 2.0 |
| SGLang | Structured output, multi-turn | 120-180 req/s | 40-70 ms | Apache 2.0 |
| llama.cpp | CPU/edge | 10-50 req/s | 100-500 ms | MIT |
| TGI | HuggingFace ecosystem | -- | -- | HF License |

vLLM

```mermaid
graph TD
    VS["vLLM Server"] --> SCH["Scheduler<br/>continuous batching, preemption/swapping"]
    SCH --> BM["Block Manager<br/>PagedAttention, copy-on-write prefix caching"]
    BM --> W["Worker(s)<br/>FlashAttention, tensor parallelism"]

    style VS fill:#e8eaf6,stroke:#3f51b5
    style SCH fill:#e8f5e9,stroke:#4caf50
    style BM fill:#fff3e0,stroke:#ef6c00
    style W fill:#f3e5f5,stroke:#9c27b0
```

  • Developer: UC Berkeley Sky Lab (2023)
  • 50+ model architectures, 2M+ monthly downloads
  • Key: PagedAttention
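A minimal offline-inference sketch mapping the diagram above onto the public API (assumes a recent vLLM; `gpu_memory_utilization` and `max_num_seqs` are real engine arguments, but check your version; the model name is an example): a single `generate` call over many prompts exercises the scheduler (continuous batching) and the block manager (PagedAttention) automatically.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.90,   # fraction of VRAM available for weights + KV blocks
    max_num_seqs=256,              # upper bound on concurrently running sequences
)

prompts = [f"Summarize fact #{i} about GPUs in one sentence." for i in range(1000)]
params = SamplingParams(temperature=0.7, max_tokens=128)

# The scheduler packs and repacks these 1000 requests on every decode step
# (continuous batching); no manual batching is needed.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:80])
```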

SGLang

  • Developer: Stanford/Berkeley (2024)
  • Key: RadixAttention (tree-based KV cache sharing), best structured output (regex-constrained JSON/XML)
  • Optimized for multi-turn, 5x throughput in multi-call workloads
  • Most stable per-token latency (4-21ms)

TensorRT-LLM

```mermaid
graph TD
    TRT["TensorRT-LLM"] --> RT["Runtime<br/>in-flight batching, KV cache management"]
    RT --> KL["Kernel Library<br/>custom attention kernels, quantized matmul, fused ops"]
    KL --> TE["TensorRT Engine<br/>graph optimization, kernel auto-tuning"]

    style TRT fill:#fff3e0,stroke:#ef6c00
    style RT fill:#e8eaf6,stroke:#3f51b5
    style KL fill:#e8f5e9,stroke:#4caf50
    style TE fill:#f3e5f5,stroke:#9c27b0
```

  • Developer: NVIDIA (2023)
  • Best on A100/H100/B200, highest throughput, lowest latency

Benchmarks by GPU tier (Llama 70B)

Throughput (req/sec):

| Engine | A100 (80GB) | H100 (80GB) | B200 |
|--------|-------------|-------------|------|
| TensorRT-LLM | 120-150 | 180-220 | 250-300 |
| SGLang | 80-120 | 120-180 | 180-240 |
| vLLM | 70-100 | 100-150 | 150-200 |
| llama.cpp | 10-20 | 15-30 | N/A |

TTFT (ms) by sequence length:

| Engine | 128 tokens | 1024 tokens | 8192 tokens |
|--------|------------|-------------|-------------|
| TensorRT-LLM | 35-50 | 60-100 | 200-400 |
| SGLang | 40-70 | 80-120 | 250-500 |
| vLLM | 50-80 | 100-150 | 300-600 |
| llama.cpp | 100-200 | 200-400 | 500-1500 |

Memory:

| Engine | 7B | 70B | 120B |
|--------|----|-----|------|
| vLLM | 16 GB | 140 GB | 240 GB |
| SGLang | 14 GB | 130 GB | 220 GB |
| TensorRT-LLM | 12 GB | 120 GB | 200 GB |

Prefix Caching

| Engine | Method | Cache hit benefit |
|--------|--------|-------------------|
| vLLM | Automatic | 5-10x faster |
| SGLang | RadixAttention | 10-20x faster |
| TensorRT-LLM | Manual | 3-5x faster |
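A hedged vLLM sketch of automatic prefix caching (the `enable_prefix_caching` flag exists in recent vLLM releases; prompts and model name are made up): requests sharing a long common prefix reuse its cached KV blocks instead of recomputing them, which is where the TTFT gains on cache hits come from.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,     # hash KV blocks so identical prefixes are reused
)

system_prompt = "You are a support assistant for ACME Corp. Policy: ...\n\n"  # shared long prefix
questions = ["How do I reset my password?", "What is the refund window?"]

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate([system_prompt + q for q in questions], params)
# The second request's prefill only computes the short question suffix;
# the system prompt's KV blocks are served from cache.
```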

Decision Tree

```mermaid
graph TD
    START["Engine choice"] --> Q1{"Max throughput<br/>on NVIDIA?"}
    Q1 -->|"Yes"| TRT["TensorRT-LLM"]
    Q1 -->|"No"| Q2{"Structured output<br/>(JSON/XML)?"}
    Q2 -->|"Yes"| SGL["SGLang"]
    Q2 -->|"No"| Q3{"Multi-turn<br/>conversations?"}
    Q3 -->|"Yes"| SGL2["SGLang"]
    Q3 -->|"No"| Q4{"CPU / edge?"}
    Q4 -->|"Yes"| LLAMA["llama.cpp"]
    Q4 -->|"No"| VLLM["vLLM"]

    style START fill:#e8eaf6,stroke:#3f51b5
    style TRT fill:#fff3e0,stroke:#ef6c00
    style SGL fill:#e8f5e9,stroke:#4caf50
    style SGL2 fill:#e8f5e9,stroke:#4caf50
    style LLAMA fill:#f3e5f5,stroke:#9c27b0
    style VLLM fill:#e8f5e9,stroke:#4caf50
```

Interview Questions

Q: Name the five key LLM inference optimization techniques in priority order.

❌ Red flag: "Квантизация, pruning, knowledge distillation..." (путает inference optimization с model compression)

✅ Strong answer: "1) Continuous batching (2-24x throughput, dynamically add/remove requests -- biggest ROI при нулевой сложности). 2) PagedAttention (KV cache paging, memory waste: 60-80% -> <4%, enables 2-4x larger batch sizes). 3) Quantization FP8 (2x memory + 2x speed, <1% quality loss на Hopper+). 4) FlashAttention (O(n^2) -> O(n) memory, FA-3: 840 TFLOPS, 85% utilization на H100). 5) Speculative decoding (draft-verify pattern, 2-6x latency с EAGLE-3, mathematically identical output). Порядок: throughput -> memory -> latency."

Q: How does PagedAttention work and why does it matter?

❌ Red flag: "Это оптимизация attention, ускоряет вычисления."

✅ Strong answer: "Аналог virtual memory в OS. KV cache разбивается на fixed-size blocks (16 tokens), выделяемые non-contiguously on-demand. Block tables mapят virtual -> physical addresses. Решает проблему fragmentation: traditional KV cache -- 60-80% memory waste из-за contiguous allocation, PagedAttention -- <4% waste. Это позволяет 2-4x larger batch sizes на том же GPU. vLLM built-in, copy-on-write для prefix caching. Без PagedAttention Llama 70B на A100 80GB обслуживает ~8 concurrent requests, с ним -- до 32."

Q: Speculative decoding -- how does it work and when does it not help?

❌ Red flag: "Маленькая модель генерирует ответ, большая проверяет."

✅ Strong answer: "Draft model (2-5% params от target) генерирует N candidate tokens. Target model верифицирует все N за один forward pass (batch verification). Accepted tokens используются, rejected trigger regeneration. Output mathematically identical -- это не приближение. EAGLE-3 (NeurIPS 2025): 2-6x speedup, 1-2 transformer layers как draft head. Google Search использует для AI Overviews. Ограничение: acceptance rate 0.6-0.8, и при domain mismatch (draft не обучен на вашем домене) может быть медленнее прямой генерации."

Q: vLLM vs SGLang vs TensorRT-LLM -- how do you choose?

❌ Red flag: "vLLM самый популярный, его и используем."

✅ Strong answer: "Зависит от workload. TensorRT-LLM: max throughput (180-220 req/s), lowest latency (35-50ms TTFT), лидирует при concurrency < 16, но proprietary и сложнее в деплое. vLLM: best flexibility, 50+ model architectures, Apache 2.0, лидирует при high concurrency (128+), 2M+ monthly downloads. SGLang: best structured output (RadixAttention + regex-constrained JSON/XML), best multi-turn (5x throughput), RadixAttention дает 10-20x speedup при cache hit. Для chat с JSON output -- SGLang. Для batch processing -- TRT-LLM. Для прототипирования и гибкости -- vLLM."

Key Numbers

| Fact | Value |
|------|-------|
| Continuous batching (vLLM vs TGI) | Up to 24x throughput |
| PagedAttention memory waste | 60-80% -> <4% |
| FP8 speedup vs FP16 | 2.3x |
| FlashAttention-3 BF16 | 840 TFLOPS, 85% utilization |
| FlashAttention-3 FP8 | 1.3 PFLOPS |
| EAGLE-3 speedup | 2-6x |
| Speculative acceptance rate | 0.6-0.8 |
| TensorRT-LLM throughput (H100) | 180-220 req/s |
| vLLM throughput (H100) | 100-150 req/s |
| SGLang RadixAttention cache hit | 10-20x faster |
| MoE compute reduction | 90-95% |
| Latency target (TTFT) | <100 ms |
| Latency alert (TBT) | >200 ms |

Misconception: quantization is always safe

FP8 gives <1% quality loss on general benchmarks, but on coding/STEM tasks the degradation can be 3-5%. INT4 loses 8-10% on general benchmarks and up to 15% on math. Models under 13B parameters are especially sensitive: a 7B INT4 model can lose 12-15% on HumanEval, while a 70B INT4 model loses only 5-7%. Rule of thumb: for models under 13B use at least INT8, and for coding tasks at least FP8.

Misconception: speculative decoding always gives a 2-3x speedup

The 2-6x speedup (EAGLE-3) is achieved at an acceptance rate of 0.6-0.8. But the acceptance rate depends on the domain match between the draft and target models. On out-of-distribution data (the model was trained on English, the requests come in Chinese), the acceptance rate drops to 0.3-0.4, and speculative decoding can become slower than plain generation because of verification overhead. Always measure the acceptance rate on your real workload.
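A back-of-the-envelope sketch of why low acceptance rates hurt, using the simplified analytical model from the original speculative decoding paper (acceptance is treated as i.i.d. with probability alpha, gamma draft tokens per round, draft cost c relative to one target forward pass; the specific numbers below are illustrative):

```python
def expected_speedup(alpha: float, gamma: int = 4, c: float = 0.05) -> float:
    """Simplified speculative-decoding speedup model (Leviathan et al., 2023)."""
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)   # tokens produced per round
    cost_per_round = gamma * c + 1                               # gamma draft passes + 1 target pass
    return expected_tokens / cost_per_round

for alpha in (0.8, 0.6, 0.3):
    print(f"alpha={alpha:.1f}  speedup ~{expected_speedup(alpha):.1f}x")
# alpha=0.8  speedup ~2.8x   -- healthy, in-domain draft
# alpha=0.6  speedup ~1.9x
# alpha=0.3  speedup ~1.2x   -- and this model ignores scheduling/verification overhead,
#                               which is what pushes real low-alpha deployments below 1x
```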

Misconception: more GPUs = linearly higher throughput

Tensor parallelism on 2 GPUs gives ~1.8x throughput (not 2x) due to communication overhead. On 4 GPUs it is ~3.2x, on 8 GPUs ~5-6x. At the same time, single-request latency can increase because of all-reduce synchronization. For latency-sensitive workloads, a single GPU with a smaller (FP8-quantized) model is often better than 4 GPUs with the full one.
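A hedged sketch of the two deployment options this paragraph contrasts (assumes a recent vLLM; `tensor_parallel_size` and `quantization` are real arguments, the model names are examples; the two objects are alternatives, shown together only for comparison):

```python
from vllm import LLM

# Option A: shard a full-precision 70B model across 4 GPUs. Throughput scales to
# roughly 3.2x of one GPU, but every decode step pays an all-reduce, which can
# raise single-request latency.
llm_tp4 = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)

# Option B: a smaller FP8-quantized model on a single GPU -- no inter-GPU
# communication, often better latency for latency-sensitive workloads.
llm_small_fp8 = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
```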

Sources

  1. Zylos Research -- "AI Inference Optimization Techniques (2025-2026)"
  2. NVIDIA -- Speculative Decoding Blog, NVFP4 Blog
  3. Google Research -- Speculative Decoding retrospective
  4. arXiv -- EAGLE-3 (2503.01840), PagedAttention (2309.06180)
  5. vLLM Blog -- "Anatomy of vLLM"
  6. Clarifai -- "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B"
  7. Kanerika -- "SGLang vs vLLM: Which is Better in 2026?"
  8. arXiv -- "A Survey of LLM Inference Systems" (2506.21901)
  9. SemiAnalysis -- "InferenceMAX: Open Source Inference Benchmarking"
  10. DigitalOcean -- "FlashAttention 4: Faster, Memory-Efficient Attention"
  11. Modal -- "High-performance LLM Inference Guide"

See Also