Comparing LLM Inference Engines¶
~13 min read
Prerequisites: vLLM and Paged Attention, LLM Quantization
The choice of inference engine makes a 3-5x difference in the cost of serving the same traffic. On an H100, vLLM delivers 680 tok/s on Llama-70B, TensorRT-LLM 850 tok/s (+25%), and SGLang 1800 tok/s on agent workloads (3x vLLM). At 10M tokens/day that is the difference between $3K and $9K per month in GPU spend. Yet 90% of production systems run vLLM not because it is the fastest, but because pip install vllm && vllm serve takes 5 minutes versus several hours of compiling a TensorRT-LLM engine. The right choice is not "the fastest" but "the fastest for your workload at an acceptable level of operational complexity".
PagedAttention, RadixAttention, CUDA Graph Fusion, GGUF, continuous batching, disaggregated inference, benchmarks, decision framework, code examples (2025-2026)
Key Concepts¶
Comparison Matrix (2026)¶
| Engine | Best For | Key Feature | Hardware | Throughput | Ease of Use |
|---|---|---|---|---|---|
| vLLM | Production serving | PagedAttention | NVIDIA/AMD/Ascend | High | Easy |
| SGLang | Agents, structured output | RadixAttention | NVIDIA | Highest (agents) | Medium |
| TensorRT-LLM | Lowest latency | CUDA optimization | NVIDIA only | Best single-req | Hard |
| llama.cpp | CPU, Edge, Apple | GGUF format | CPU/Apple/ARM | Medium | Easy |
| Ollama | Development, local | llama.cpp wrapper | CPU/GPU | Low | Easiest |
| LMDeploy | InternLM ecosystem | TurboMind kernels | NVIDIA | High | Medium |
| TGI | Enterprise | HuggingFace integration | NVIDIA/AMD | Medium | Easy |
1. vLLM (PagedAttention Engine)¶
Paper: Kwon et al., SOSP 2023. Stars: 35k+ (Feb 2026).
Core Innovation: PagedAttention¶
Traditional allocation:
Request 1: [Reserved 4096 tokens] -> Actual 500 -> 87% waste
Request 2: [Reserved 4096 tokens] -> Actual 1200 -> 70% waste
PagedAttention:
GPU Memory -> Fixed 16-token blocks
Request 1: 32 blocks (512 tokens) -> <4% waste
Request 2: 75 blocks (1200 tokens) -> <4% waste
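These waste figures can be reproduced (within rounding) with a few lines of Python; a back-of-the-envelope sketch, not vLLM internals:
import math

def static_waste(reserved, used):
    # Fraction of pre-allocated KV-cache slots that are never used
    return (reserved - used) / reserved

def paged_waste(used, block=16):
    # Only the last block of a sequence can be partially filled
    allocated = math.ceil(used / block) * block
    return (allocated - used) / allocated

print(f"{static_waste(4096, 500):.0%}")   # 88% -- the "87% waste" request above
print(f"{paged_waste(500):.1%}")          # 2.3% -- "<4% waste"
print(f"{static_waste(4096, 1200):.0%}")  # 71%
print(f"{paged_waste(1200):.1%}")         # 0.0% (1200 is an exact multiple of 16)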
Key Features¶
| Feature | Description |
|---|---|
| PagedAttention | Block-based KV cache allocation |
| Continuous Batching | Iteration-level scheduling |
| Prefix Caching | Automatic prefix sharing |
| OpenAI-compatible API | Drop-in replacement |
| Multi-GPU | Tensor + Pipeline parallelism |
| Multi-modal | Vision-language models |
| Hardware | NVIDIA, AMD (ROCm), Huawei Ascend |
Performance (Llama-70B on H100)¶
| Metric | vLLM | HF Transformers |
|---|---|---|
| Throughput | 300 tok/s | 50 tok/s |
| Memory waste | <4% | 60-80% |
| Max batch size | 128+ | 8 |
| TTFT (128 ctx) | 0.3s | 1.8s |
Code Example¶
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.1-70B",
tensor_parallel_size=4,
gpu_memory_utilization=0.9,
max_model_len=32768,
)
prompts = ["Hello"] * 100
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)
# OpenAI-compatible server:
# vllm serve meta-llama/Llama-3.1-70B --port 8000
Production Setup¶
# The supported way to launch the OpenAI-compatible server is the vllm CLI
# (the Python api_server entrypoint is not a stable public interface):
vllm serve meta-llama/Llama-3.1-70B \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching
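Once the server is up, any OpenAI client can talk to it. A minimal sketch using the official openai package (the model name must match the served model):
from openai import OpenAI

# vLLM does not check the API key by default; any placeholder works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)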
Strengths: easiest deployment (pip install, one command), best community, widest hardware support, extensive ecosystem (LangChain, LlamaIndex).
Limitations: not the fastest single-request latency (TensorRT-LLM is faster), SGLang outperforms it on agent/structured workloads, less aggressive CUDA-graph optimization.
2. SGLang (Structured Language Engine)¶
Paper: Zheng et al., arXiv:2312.07104. Developer: LMSYS (UC Berkeley).
Core Innovation: RadixAttention¶
Radix Tree for KV Cache:
[root]
/ \
"System" "User"
/ \ \
"helpful" "smart" "query1"
| | |
[KV1] [KV2] [KV3]
Benefits:
- Automatic prefix sharing across requests
- Token-level granularity (not block-level like PagedAttention)
- Perfect for multi-turn agents
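A toy prefix cache makes the mechanism concrete. The sketch below is a plain token-per-node trie rather than SGLang's compressed radix tree, and the token IDs are made up:
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.has_kv = False  # KV entries for this prefix are cached

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.has_kv = True

    def match_prefix(self, tokens):
        """How many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for t in tokens:
            if t in node.children and node.children[t].has_kv:
                node, matched = node.children[t], matched + 1
            else:
                break
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])            # e.g. system prompt + first turn
print(cache.match_prefix([1, 2, 3, 9]))  # 3 -- only the new suffix needs prefill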
Key Features¶
| Feature | Description |
|---|---|
| RadixAttention | Radix tree for prefix sharing |
| Structured Output | Native JSON/regex constrained decoding |
| Disaggregated | Prefill-decode separation |
| Speculative Decoding | Built-in speculation |
| Function Calling | Native tool use |
Performance¶
H800 Benchmarks (July 2025):
| Metric | SGLang | vLLM | Speedup |
|---|---|---|---|
| Agent throughput | 1800 tok/s | 600 tok/s | 3x |
| Structured output | 1200 tok/s | 400 tok/s | 3x |
| Multi-turn (5 turns) | 900 tok/s | 300 tok/s | 3x |
| Cost efficiency | $0.12/1M tok | $0.20/1M tok | 40% cheaper |
Prefix Reuse Latency (Llama-70B, 4x A100):
| Scenario | vLLM | SGLang | Speedup |
|---|---|---|---|
| Unique prompts | 100ms | 90ms | 1.1x |
| Shared prefix (10 req) | 1000ms | 200ms | 5x |
| Multi-turn conversation | 500ms | 150ms | 3x |
Code Example¶
import sglang as sgl

@sgl.function
def extract_info(s, text):
    s += "Extract from: " + text + "\n\n"
    s += sgl.gen(
        "json_output",
        max_tokens=512,
        regex=r'\{\s*"name":\s*"[^"]+",\s*"age":\s*\d+,\s*"email":\s*"[^"]+"\s*\}',
    )

runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-70B", tp_size=4)
sgl.set_default_backend(runtime)

# Multiple requests with a shared prefix -- RadixAttention reuses the KV cache
state = extract_info.run(text="John Smith is 30. Email: john@example.com")
print(state["json_output"])
# {"name": "John Smith", "age": 30, "email": "john@example.com"}
Strengths: best for agents (multi-turn, shared context), native structured output (JSON schema, regex), disaggregated inference, 40% cost savings on agent workloads, built-in speculative decoding.
Limitations: NVIDIA only (no AMD/Ascend), no pipeline parallelism (planned), smaller community than vLLM.
3. TensorRT-LLM (NVIDIA Optimized)¶
Developer: NVIDIA. Purpose: Maximum performance on NVIDIA GPUs.
Core Innovation: CUDA Graph Fusion¶
Standard PyTorch:
Layer 1 -> Kernel Launch -> Layer 2 -> Kernel Launch -> ...
Overhead: 5-10us per kernel
TensorRT-LLM:
Fused Kernel (Layer 1 + Layer 2 + ...) -> Single Launch
Overhead: 5-10us total
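The same launch-overhead argument can be demonstrated with plain PyTorch CUDA graphs. This is a sketch of the general technique, not TensorRT-LLM's internal fusion:
import torch

torch.set_grad_enabled(False)  # inference only

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().eval()
static_input = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream (required before capture)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record the whole forward pass once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# ...then replay it as a single launch: copy new data into the captured
# input buffer and relaunch the entire kernel sequence at once.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)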
Key Features¶
| Feature | Description |
|---|---|
| CUDA Graphs | Kernel fusion, reduced launch overhead |
| INT4/FP8 | Native quantization support |
| In-flight Batching | NVIDIA-optimized scheduling |
| Multi-Query Attention | Optimized attention kernels |
| B200 Optimization | Best on Blackwell |
Performance (Llama-70B on B200, Dec 2025)¶
| Metric | TensorRT-LLM | vLLM | Speedup |
|---|---|---|---|
| Single-req latency | 28ms | 45ms | 1.6x |
| Throughput (batch=1) | 520 tok/s | 300 tok/s | 1.7x |
| Throughput (batch=64) | 850 tok/s | 680 tok/s | 1.25x |
| Memory efficiency | 85% | 75% | +10% |
Code Example¶
# Step 1: Convert HuggingFace to TRT format
python convert_checkpoint.py \
--model_dir ./llama-70b-hf \
--output_dir ./llama-70b-trt \
--tp_size 4
# Step 2: Build TensorRT engine
trtllm-build \
--checkpoint_dir ./llama-70b-trt \
--output_dir ./engine \
--max_batch_size 128 \
--max_input_len 4096 \
--max_seq_len 8192 \
--gemm_plugin auto
# Step 3: Run inference
python run.py --engine_dir ./engine --tokenizer_dir ./llama-70b-hf
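Recent releases also expose a Python LLM API that hides the explicit convert/build steps. A minimal sketch; the exact API surface varies by version, so treat parameter names as indicative:
from tensorrt_llm import LLM, SamplingParams

# High-level LLM API: builds (or loads a cached) engine under the hood
llm = LLM(model="meta-llama/Llama-3.1-70B", tensor_parallel_size=4)

outputs = llm.generate(
    ["Hello"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)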
Strengths: best single-request latency, best on H100/B200, native INT4/FP8, NVIDIA's recommended solution.
Limitations: NVIDIA only, complex setup (engine compilation takes hours), fewer model architectures, harder to customize.
Setup Complexity¶
| Engine | Setup Time | Maintenance |
|---|---|---|
| vLLM | Minutes | Easy |
| SGLang | Minutes | Easy |
| TensorRT-LLM | Hours | Medium |
| llama.cpp | Minutes | Easy |
4. llama.cpp (CPU + Edge)¶
Developer: Georgi Gerganov. Stars: 70k+ (Feb 2026).
Core Innovation: GGUF Format¶
GGUF File:
+-- Header (magic, version)
+-- Metadata (architecture, params)
+-- Tokenizer (vocabulary)
+-- Tensors (quantized weights)
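The header layout can be inspected directly. A small reader for the magic, version, and counts, assuming the GGUF v2+ little-endian layout from the public spec:
import struct

def read_gguf_header(path):
    with open(path, "rb") as f:
        magic = f.read(4)                                # b"GGUF"
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))        # uint32
        n_tensors, = struct.unpack("<Q", f.read(8))      # uint64
        n_metadata_kv, = struct.unpack("<Q", f.read(8))  # uint64
    return version, n_tensors, n_metadata_kv

print(read_gguf_header("./llama-3.1-8b-q5_k_m.gguf"))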
Key Features¶
| Feature | Description |
|---|---|
| CPU inference | Runs on laptop CPU |
| Apple Metal | M-series GPU acceleration |
| GGUF format | Efficient quantized storage |
| Cross-platform | Linux, macOS, Windows, Android, iOS |
| Ecosystem | Ollama, LM Studio built on it |
Quantization Trade-offs (Llama-7B, MacBook Pro M3 Max)¶
| Quantization | Memory | Speed | Quality Loss |
|---|---|---|---|
| FP16 | 14 GB | 18 tok/s | 0% |
| Q8_0 | 7.5 GB | 28 tok/s | ~1% |
| Q5_K_M | 5.5 GB | 38 tok/s | ~2% |
| Q4_K_M | 4.5 GB | 45 tok/s | ~3-5% |
| Q3_K_M | 3 GB | ~55 tok/s | ~8% |
Code Example¶
from llama_cpp import Llama
llm = Llama(
model_path="./llama-3.1-8b-q5_k_m.gguf",
n_ctx=8192,
n_gpu_layers=35, # Offload to GPU
verbose=False,
)
response = llm.create_chat_completion(
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256,
)
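For interactive use the same binding supports token streaming; a short sketch building on the example above:
# Stream tokens as they are generated instead of waiting for the full reply
stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)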
Strengths: no GPU required, best on Apple Silicon, edge/mobile deployment, single binary, Ollama/LM Studio ecosystem.
Limitations: 10-20x slower than vLLM on H100, limited batching (not for high concurrency), manual model conversion, single-node only.
5. LMDeploy and TGI¶
LMDeploy (Shanghai AI Laboratory)¶
| Feature | Description |
|---|---|
| Turbomind | Custom CUDA kernels |
| FasterTransformer | Based on FT optimization |
| INT4/INT8 | SmoothQuant quantization |
Performance (InternLM-20B on A100): 680 tok/s (vs vLLM 620 tok/s).
TGI (Text Generation Inference)¶
| Feature | Description |
|---|---|
| HF integration | Direct Hub model loading |
| Flash Attention | Optimized attention |
| Docker | Containerized deployment |
Best for: teams already using HuggingFace ecosystem, enterprise deployments.
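A running TGI container is typically queried from Python via huggingface_hub. A minimal sketch assuming TGI is already serving on localhost:8080:
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single completion against the TGI endpoint
print(client.text_generation("Hello!", max_new_tokens=64))

# Token-by-token streaming
for token in client.text_generation("Hello!", max_new_tokens=64, stream=True):
    print(token, end="", flush=True)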
6. Benchmarks¶
Throughput (Llama-70B, H100, batch=64)¶
| Engine | Throughput (tok/s) | Relative |
|---|---|---|
| TensorRT-LLM | 850 | 100% |
| SGLang | 820 | 96% |
| vLLM | 680 | 80% |
| LMDeploy | 650 | 76% |
| TGI | 580 | 68% |
Throughput by batch size (Llama-7B, A100)¶
| Engine | Batch=1 | Batch=32 | Batch=128 |
|---|---|---|---|
| HF Transformers | 50 tok/s | 500 tok/s | OOM |
| vLLM | 100 tok/s | 2,500 tok/s | 5,000 tok/s |
| SGLang | 110 tok/s | 2,400 tok/s | 4,800 tok/s |
| TensorRT-LLM | 120 tok/s | 3,000 tok/s | 5,500 tok/s |
Latency (Llama-70B, single request)¶
| Engine | TTFT (ms) | TBT (ms) | Total (100 tok) |
|---|---|---|---|
| TensorRT-LLM | 28 | 15 | 1,800 |
| SGLang | 40 | 20 | 2,150 |
| vLLM | 35 | 25 | 2,700 |
| TGI | 48 | ~30 | ~3,500 |
Agent Workloads (multi-turn, shared context)¶
| Engine | Throughput | Relative to vLLM |
|---|---|---|
| SGLang | 1800 tok/s | 3x |
| vLLM | 600 tok/s | 1x |
| TensorRT-LLM | 520 tok/s | 0.87x |
Structured Output (JSON)¶
| Engine | Throughput | Constraint Support |
|---|---|---|
| SGLang | 1200 tok/s | Native |
| vLLM + Outlines | 400 tok/s | Plugin |
| TensorRT-LLM | 350 tok/s | Limited |
Memory Efficiency (Llama-70B)¶
| Engine | Memory | Max Batch | Max Seq Len |
|---|---|---|---|
| HF Transformers | 140 GB | 4 | 2K |
| vLLM | 80 GB | 32 | 8K |
| TensorRT-LLM | 70 GB | 40 | 8K |
7. Architecture Comparison¶
KV Cache Management¶
| Engine | Approach | Memory Waste |
|---|---|---|
| vLLM | PagedAttention (blocks) | <4% |
| SGLang | RadixAttention (tree) | <1% |
| TensorRT-LLM | Pre-allocated | 10-20% |
| llama.cpp | Static | 15-30% |
Batching Strategy¶
| Engine | Strategy | Benefit |
|---|---|---|
| vLLM | Continuous batching | No waiting |
| SGLang | Continuous + radix | + Prefix sharing |
| TensorRT-LLM | In-flight batching | NVIDIA optimized |
| llama.cpp | Simple batching | CPU-friendly |
Multi-GPU Support¶
| Engine | TP | PP | Notes |
|---|---|---|---|
| vLLM | Yes | Yes | Best multi-node |
| SGLang | Yes | No | Planned |
| TensorRT-LLM | Yes | Yes | NVIDIA optimized |
| llama.cpp | No | No | Single-node |
Feature Matrix¶
| Feature | vLLM | SGLang | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|
| GPU support | All | NVIDIA | NVIDIA | All |
| CPU support | No | No | No | Yes |
| Quantization | Yes | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes |
| OpenAI API | Yes | Yes | Yes | Yes |
| Prefix caching | Yes | Best | Yes | No |
| Structured output | Plugin | Native | Limited | No |
| Multi-modal | Yes | Yes | Yes | Partial |
8. Decision Framework¶
By Hardware¶
graph TD
HW{"Hardware?"} -->|"NVIDIA H100/B200"| N1{"Priority?"}
HW -->|"NVIDIA Consumer<br/>(RTX 4090)"| N2["vLLM<br/>(llama.cpp if VRAM limited)"]
HW -->|"AMD MI300X"| AMD["vLLM (ROCm)"]
HW -->|"Apple M-series"| APPLE["llama.cpp (Metal)"]
HW -->|"CPU only"| CPU["llama.cpp"]
HW -->|"Local dev"| DEV["Ollama"]
N1 -->|"Max latency"| TRT["TensorRT-LLM"]
N1 -->|"Agents"| SGL["SGLang"]
N1 -->|"General"| VLLM["vLLM"]
N1 -->|"HF ecosystem"| TGI["TGI"]
style TRT fill:#e8eaf6,stroke:#3f51b5
style SGL fill:#fff3e0,stroke:#ff9800
style VLLM fill:#e8f5e9,stroke:#4caf50
By Use Case¶
| Use Case | Recommended | Why |
|---|---|---|
| Chatbot (high QPS) | vLLM | Best throughput/latency balance |
| Multi-turn agents | SGLang | RadixAttention, prefix sharing, 3x throughput |
| JSON output | SGLang | Native constrained decoding |
| Minimum latency | TensorRT-LLM | CUDA optimization, 1.6x faster |
| Edge deployment | llama.cpp | No GPU required |
| Local development | Ollama | One-command setup |
| Production scale | TensorRT-LLM or vLLM | Battle-tested |
| Cost optimization | SGLang | 40% cheaper for agents |
By Priority¶
| Priority | Ranking |
|---|---|
| Throughput | TensorRT-LLM > SGLang > vLLM > LMDeploy > TGI |
| Latency (TTFT) | TensorRT-LLM > vLLM > SGLang > TGI |
| Agent workloads | SGLang >> vLLM > TensorRT-LLM |
| Ease of use | Ollama > vLLM > llama.cpp > SGLang > TensorRT-LLM |
| Hardware flexibility | llama.cpp > vLLM > TensorRT-LLM |
| Community | vLLM > llama.cpp > SGLang > TensorRT-LLM |
9. Formulas¶
Memory Savings (PagedAttention)¶
\[
\text{Waste}_{\text{static}} = \frac{L_{\max} - L_{\text{actual}}}{L_{\max}},
\qquad
\text{Waste}_{\text{paged}} < \frac{B}{L_{\text{actual}}}
\]
Where \(B\) = block size (typically 16 tokens), \(L_{\max}\) = pre-allocated context length, \(L_{\text{actual}}\) = tokens actually used; with block allocation only the last block of each sequence can be partially filled.
Throughput Calculation¶
\[
\text{Throughput (tok/s)} = \frac{B \cdot N \cdot L}{T}
\]
- \(B\) = number of batches processed, \(N\) = sequences per batch, \(L\) = avg output length, \(T\) = total time
Latency Components¶
\[
\text{Total latency} = \text{TTFT} + (N_{\text{out}} - 1) \cdot \text{TBT}
\]
- TTFT = time to first token (prefill), TBT = time between tokens (decode), \(N_{\text{out}}\) = number of generated tokens
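A minimal Python sketch applying these formulas; the numbers are illustrative only and are not taken from the benchmark tables above:
def total_latency_ms(ttft_ms, tbt_ms, n_out):
    # TTFT covers prefill + first token; every further token costs one TBT
    return ttft_ms + (n_out - 1) * tbt_ms

def throughput_tok_s(n_batches, seqs_per_batch, avg_out_len, total_time_s):
    return n_batches * seqs_per_batch * avg_out_len / total_time_s

print(total_latency_ms(ttft_ms=50, tbt_ms=20, n_out=100))  # 2030 ms
print(throughput_tok_s(10, 64, 256, total_time_s=240))     # ~683 tok/s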
For the Interview¶
Q: "Compare vLLM, SGLang, and TensorRT-LLM."¶
vLLM: PagedAttention (block-based KV cache, <4% waste), continuous batching, easiest deployment, NVIDIA/AMD/Ascend -- best general-purpose. SGLang: RadixAttention (tree-based prefix sharing, <1% waste), 3x throughput on agent workloads, native JSON constraints -- best for agents. TensorRT-LLM: CUDA graph fusion, lowest single-request latency (1.6x faster), NVIDIA only, complex setup (hours) -- best raw performance.
Q: "Что такое PagedAttention и RadixAttention?"¶
PagedAttention (vLLM): KV cache разбит на fixed-size blocks (16 tokens), allocated on-demand. Memory waste: 60-80% -> <4%. Аналог virtual memory в OS. RadixAttention (SGLang): KV cache хранится в radix tree. Automatic prefix detection и sharing между запросами. Token-level granularity (не block-level). 5x speedup при shared prefixes, <1% waste.
Q: "Когда выбрать llama.cpp?"¶
CPU-only / Edge / Apple M-series. GGUF format с квантизацией (Q4_K_M: 4.5 GB для 7B, ~3-5% quality loss). Runs on mobile (Android, iOS), no GPU required. Ecosystem: Ollama, LM Studio. Limitations: 10-20x slower than vLLM on H100, no distributed, limited batching.
Q: "Спроектируйте систему инференса для 10K concurrent users."¶
(1) vLLM или TensorRT-LLM на кластере H100 с tensor parallelism. (2) Load balancer -> multiple vLLM instances. (3) Continuous batching для GPU utilization 70-90%. (4) Prefix caching (RadixAttention если multi-turn). (5) FP8 quantization для 2x memory savings. (6) Autoscaling по queue depth. Target: TTFT <100ms, throughput 850+ tok/s per node.
Q: "Disaggregated inference -- что это?"¶
Разделение prefill и decode фаз на разные GPU. Prefill = compute-bound (parallel processing input tokens). Decode = memory-bound (sequential token generation). SGLang поддерживает disaggregated inference: prefill workers отдельно от decode workers. Результат: лучшее utilization обоих типов GPU, оптимизация под каждую фазу.
Key Numbers¶
| Fact | Value |
|---|---|
| vLLM PagedAttention waste | <4% (vs 60-80% traditional) |
| SGLang RadixAttention waste | <1% |
| SGLang agent speedup vs vLLM | 3x |
| SGLang cost savings (agents) | 40% cheaper |
| SGLang prefix reuse speedup | 5x (10 shared requests) |
| TensorRT-LLM single-req latency advantage | 1.6x faster |
| llama.cpp Q4_K_M (7B) | 4.5 GB, 45 tok/s, ~3-5% loss |
| vLLM vs HF Transformers | 6x throughput (Llama-70B) |
| TensorRT-LLM H100 throughput (batch=64) | 850 tok/s |
| vLLM stars | 35k+ |
| llama.cpp stars | 70k+ |
| TensorRT-LLM setup time | Hours (engine compilation) |
Misconception: TensorRT-LLM is always faster than vLLM
TensorRT-LLM wins by 1.6x on single-request latency (28ms vs 45ms), but at batch=64 the gap shrinks to 1.25x (850 vs 680 tok/s). On agent workloads with multi-turn context, SGLang outruns TensorRT-LLM by ~3.5x thanks to RadixAttention. "Fastest" depends on the workload: batched serving (TensorRT-LLM), agents (SGLang), general purpose (vLLM).
Misconception: vLLM and SGLang are interchangeable
With unique prompts (zero prefix sharing) the difference is minimal: 100ms vs 90ms. But with 10 requests sharing a prefix, SGLang wins by 5x (200ms vs 1000ms). If your workload is a chatbot with a system prompt and multi-turn dialogue, SGLang saves ~40% compute. If it is a single-shot API with no shared context, vLLM is simpler and has better AMD/Ascend support.
Misconception: Q4 quantization in llama.cpp is a free speedup
Q4_K_M gives 45 tok/s instead of 18 tok/s (FP16) on a MacBook M3, but the quality loss is 3-5% on general benchmarks and up to 15% on math/reasoning tasks. Acceptable for a production chatbot; for legal or medical AI, use Q8_0 (~1% loss) or FP16. Always measure quality on YOUR domain, not on MMLU.
Interview Questions¶
Q: You need to deploy Llama-70B for 10K concurrent users. Which engine do you choose and why?
Red flag: "vLLM, because it's the most popular"
Strong answer: "It depends on the workload profile. For a multi-turn chatbot (shared system prompt, 3-5 turns) -- SGLang: RadixAttention gives a 5x speedup on shared prefixes and 3x throughput on agent workloads. For a single-shot API (translation, classification) -- vLLM: simpler deployment, better community, NVIDIA+AMD support. If latency-critical (real-time trading, voice) -- TensorRT-LLM: 28ms TTFT vs 45ms for vLLM. Concretely for 10K concurrent users: vLLM on 4xH100 with tensor parallelism, continuous batching, prefix caching. HPA on queue depth, min 4 / max 20 replicas. Target: TTFT <200ms p95, 680+ tok/s per node."
Q: What is PagedAttention and why does it matter for serving?
Red flag: "It's a way to speed up the attention computation"
Strong answer: "PagedAttention solves memory fragmentation in the KV cache. The traditional approach pre-allocates max_seq_len (4096 tokens) per request; actual usage is ~500 tokens, so 87% is wasted. PagedAttention splits the KV cache into fixed-size blocks (16 tokens) allocated on demand -- the analogue of virtual memory in an OS. Waste drops from 60-80% to <4%. Practical effect: on the same 80GB H100 the max batch size grows from 8 to 128+, and throughput is 6x HuggingFace Transformers (300 vs 50 tok/s on Llama-70B). RadixAttention (SGLang) extends the idea: a tree-based structure with token-level granularity, <1% waste, automatic prefix sharing."
Q: When is llama.cpp justified over vLLM?
Red flag: "When there's no GPU"
Strong answer: "Three scenarios: (1) Edge/mobile deployment -- llama.cpp is the only engine that runs on Android/iOS, a single binary with no dependencies. (2) Apple Silicon development -- Metal GPU acceleration; Q5_K_M on an M3 Max gives 38 tok/s for a 7B model in 5.5GB RAM, enough for local dev. (3) CPU-only inference for low-traffic internal tools (<100 req/day) where GPU cost isn't justified. But NOT for high-concurrency production: llama.cpp is 10-20x slower than vLLM on an H100, has no distributed inference, and limited batching. The ecosystem (Ollama, LM Studio) is for development and prototyping, not for serving."
Q: Disaggregated inference -- what is it and when should you use it?
Red flag: "It's when the model is distributed across several GPUs"
Strong answer: "Disaggregated inference separates the prefill and decode phases onto different GPU pools. Prefill is compute-bound (processes all input tokens in parallel, stresses FLOPS). Decode is memory-bound (generates one token at a time, stresses memory bandwidth). On a single GPU, optimizing for both phases is a compromise. Disaggregated: prefill workers on compute-optimized GPUs (H100 SXM), decode workers on memory-bandwidth-optimized ones. SGLang supports this natively. Use it above ~1000 QPS: 20-30% GPU savings from better utilization of each worker type. At low traffic the coordination overhead eats the gain."
Sources¶
- Kwon et al. -- "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Zheng et al. -- "SGLang: Efficient Execution of Structured Language Model Programs" (arXiv:2312.07104)
- NVIDIA -- TensorRT-LLM In-flight Batching whitepaper
- Clarifai -- "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B" (Aug 2025)
- Kanerika -- "SGLang vs vLLM: Which is Better in 2026?" (Sept 2025)
- LangCopilot -- "SGLang 3x Faster" (July 2025)
- SemiAnalysis -- "InferenceMAX" (Oct 2025)
- Medium (Ordina Data) -- "Choosing Your LLM Framework: Ollama, vLLM, SGLang, TensorRT-LLM"
- AI Merge -- "The AI Engineer's Guide to Inference Engines"
See Also¶
- vLLM Paged Attention -- a detailed breakdown of PagedAttention
- Quantization -- INT4/INT8/FP8 quantization methods
- Production Deploy -- production deployment patterns
- Cloud Deploy -- cloud options for inference
- Model Serving Benchmark -- detailed benchmarks