On-Device ML: Edge LLM Inference¶
~4 minute read
Prerequisites: LLM Quantization | Inference Optimization
Running an LLM directly on a smartphone is no longer a niche task but one of the main trends of 2025-2026. Apple Intelligence, Google Gemini Nano, Samsung Galaxy AI -- every major platform is moving inference onto the device. The reason is simple: a round-trip to the cloud adds 200-500 ms of latency, while on-device inference responds in 10-50 ms. Meanwhile, Llama 3.2 3B in GGUF Q4_K_M format takes only about 2 GB and fits even on an iPhone 15 Pro with 8 GB of RAM. The edge AI market is projected to reach $40 billion by 2027.
1. Edge LLM Deployment Overview¶
1.1 Why On-Device Inference?¶
| Benefit | Description |
|---|---|
| Privacy | Data never leaves device |
| Latency | No network round-trip |
| Offline | Works without connectivity |
| Cost | No cloud API charges |
| Compliance | GDPR, HIPAA requirements |
1.2 Key Challenges¶
graph TD
A["Memory Constraints<br/>4-16GB RAM typical on mobile"] --> B["Compute Limitations<br/>Mobile GPU/NPU vs datacenter GPU"]
B --> C["Power Budget<br/>Battery life constraints"]
C --> D["Thermal Throttling<br/>Sustained performance limits"]
D --> E["Model Size<br/>7B+ models need aggressive compression"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fce4ec,stroke:#c62828
style E fill:#fce4ec,stroke:#c62828
1.3 Device Memory Constraints (2025)¶
| Device Type | RAM | Model Capacity (4-bit) |
|---|---|---|
| iPhone 15 Pro | 8GB | ~3B params |
| iPhone 16 Pro Max | 8GB | ~3-5B params |
| Samsung S24 Ultra | 12GB | ~5B params |
| Pixel 9 Pro | 16GB | ~7B params |
| iPad Pro M4 | 16GB | ~7B params |
| Jetson Orin AGX | 64GB | ~32B params |
2. Quantization Methods¶
2.1 Quantization Overview¶
Goal: Reduce weight precision from FP32/FP16 down to INT8/INT4/INT2, shrinking model size proportionally
| Precision | Bytes/Param | 7B Model Size | Quality |
|---|---|---|---|
| FP32 | 4 | 28 GB | 100% |
| FP16 | 2 | 14 GB | ~99.9% |
| BF16 | 2 | 14 GB | ~99.9% |
| INT8 | 1 | 7 GB | ~99% |
| INT4 | 0.5 | 3.5 GB | ~95-98% |
| INT2 | 0.25 | 1.75 GB | ~80-90% |
2.2 GPTQ (GPU-Optimized)¶
Key Idea: Post-training quantization optimized for GPU inference
GPTQ Process:
1. Calibrate on small dataset (128-512 samples)
2. Quantize weights layer-by-layer
3. Update remaining weights to compensate for errors
Formula: $$ \hat{W}_q = \arg\min_{W_q} \lVert WX - W_q X \rVert_2^2 $$
Pros:
- High quality preservation
- Fast GPU inference
- Works with Transformers/AutoGPTQ

Cons:
- GPU-focused (not ideal for mobile CPU)
- Calibration required
- Larger file size than GGUF
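For reference, a minimal sketch of GPTQ post-training quantization through the Hugging Face transformers integration; this assumes the optimum/auto-gptq backends are installed, and the model id, calibration dataset, and group size are placeholder choices (the GPTQConfig interface may differ between library versions):

```python
# Hedged sketch: 4-bit GPTQ via transformers + optimum/auto-gptq (check current docs).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration on a small dataset (step 1) and layer-by-layer quantization with
# error compensation (steps 2-3) run inside from_pretrained when a GPTQConfig is passed.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
model.save_pretrained("llama2-7b-gptq-4bit")
```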
2.3 AWQ (Activation-Aware)¶
Key Idea: Protect important weights based on activation magnitudes
AWQ Insight:
- Not all weights equally important
- Weights connected to high-activation channels matter more
- Protect salient weights with higher precision
Formula: $$ \text{Importance}(w) = \frac{1}{N} \sum_{i=1}^{N} |x_i| \cdot |w| $$
Results (2025):

| Model | FP16 | AWQ INT4 | Quality Retention |
|---|---|---|---|
| Llama-2-7B | 100% | 97.8% | 97.8% |
| Llama-2-70B | 100% | 98.5% | 98.5% |
| Mistral-7B | 100% | 97.5% | 97.5% |
Pros:
- Best quality preservation
- Lightweight calibration (small dataset, less sensitive than GPTQ)
- Works well for production

Cons:
- More complex quantization process
- GPU-focused
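To make the salience idea concrete, here is a toy NumPy sketch of the per-channel importance score above. It illustrates only the scoring step, not the full AWQ algorithm; the array shapes and the 1% salient-channel fraction are assumptions:

```python
import numpy as np

def awq_channel_importance(weight: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Toy illustration of the AWQ salience score (not the full AWQ pipeline).

    weight: [out_features, in_features] layer weight matrix
    activations: [n_samples, in_features] calibration activations feeding this layer
    Returns a per-input-channel score: mean |x| over calibration samples, scaled by
    the channel's mean |w|. High-scoring channels are the "salient" ones AWQ protects
    (e.g., by rescaling them before 4-bit rounding).
    """
    mean_abs_act = np.abs(activations).mean(axis=0)   # (1/N) * sum_i |x_i|, per channel
    mean_abs_w = np.abs(weight).mean(axis=0)          # mean |w| per input channel
    return mean_abs_act * mean_abs_w

# Usage: flag the top ~1% channels as salient
rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 4096))
X = rng.normal(size=(512, 4096))
scores = awq_channel_importance(W, X)
salient_channels = np.argsort(scores)[-int(0.01 * scores.size):]
```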
2.4 GGUF (llama.cpp Format)¶
Key Idea: CPU-optimized format with multiple quantization variants
GGUF Variants (Q = Quantization):
- Q4_K_M: 4-bit, medium quality (recommended)
- Q4_K_S: 4-bit, small size
- Q5_K_M: 5-bit, higher quality
- Q8_0: 8-bit, near-lossless
- Q2_K: 2-bit, aggressive compression
Comparison (Llama-2-7B):

| Variant | Size | Quality | Speed |
|---|---|---|---|
| Q4_K_M | 4.1 GB | 92% | Fast |
| Q5_K_M | 5.0 GB | 95% | Medium |
| Q8_0 | 7.2 GB | 99% | Slow |
| Q2_K | 2.4 GB | 80% | Fastest |
Pros:
- Works on CPU (no GPU needed)
- Broad hardware support
- Ollama/llama.cpp ecosystem
- Easy to use

Cons:
- Slower than GPU-targeted formats (GPTQ/AWQ) when a GPU is available
- Slightly lower quality than AWQ at the same bit-width
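A GGUF model can also be driven from Python through the llama-cpp-python bindings; a minimal sketch, assuming the package is installed and using a placeholder model path:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-3b-instruct-Q4_K_M.gguf",  # placeholder GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to Metal/Vulkan/CUDA if available; 0 = CPU only
)

out = llm("Explain edge inference in one sentence.", max_tokens=64, temperature=0.7)
print(out["choices"][0]["text"])
```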
2.5 Quantization Selection Guide (2025)¶
| Use Case | Recommended | Why |
|---|---|---|
| Mobile CPU | GGUF Q4_K_M | Broad support, good balance |
| Apple Silicon | MLC-LLM / GGUF | Metal GPU acceleration |
| NVIDIA GPU | GPTQ or AWQ | Fast inference |
| Production API | AWQ INT4 | Best quality |
| Extreme compression | Q2_K / 2-bit | Under 1GB models |
3. Edge Hardware & NPUs¶
3.1 Mobile NPU Landscape (2025)¶
| Chip | NPU | TOPS | Key Features |
|---|---|---|---|
| Snapdragon 8 Gen 3 | Hexagon | 75 | Qualcomm AI Engine |
| Snapdragon 8 Elite | Hexagon | 100+ | Latest flagship |
| Apple A17 Pro | Neural Engine | 35 | Core ML optimized |
| Apple M4 | Neural Engine | 38 | iPad Pro |
| Tensor G4 | TPU | ~30 | Google Pixel |
| Dimensity 9300 | APU | ~50 | MediaTek |
3.2 NPU Acceleration Techniques¶
llm.npu Framework (ASPLOS 2025):
Key Innovation: NPU offloading for prefill latency reduction
Pipeline:
1. Prefill phase → NPU (parallel processing)
2. Decode phase → CPU (sequential tokens)
Result: Significant latency reduction on Snapdragon
m²LLM Framework (IEEE 2025):
- Multi-dimensional optimization for mobile LLM
- Tested on Snapdragon 8 Gen 3
- Memory-compute-precision co-optimization
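A rough back-of-envelope model shows why the prefill-on-NPU / decode-on-CPU split helps: prefill processes all prompt tokens in parallel and is compute-bound (it scales with NPU TOPS), while decode re-reads all weights for every generated token and is memory-bandwidth-bound. The sketch below uses illustrative assumed numbers, not measurements:

```python
def estimate_phase_latency(params_b: float, prompt_tokens: int,
                           npu_tops: float, mem_bw_gbs: float,
                           bytes_per_param: float = 0.5) -> dict:
    """Back-of-envelope latency model (illustrative assumptions, not benchmarks).

    Prefill: ~2 * params FLOPs per token, all tokens in parallel -> compute-bound.
    Decode: every new token re-reads the full weight set -> bandwidth-bound.
    """
    flops_per_token = 2 * params_b * 1e9
    prefill_s = prompt_tokens * flops_per_token / (npu_tops * 1e12)
    weight_bytes = params_b * 1e9 * bytes_per_param           # INT4 weights
    decode_s_per_token = weight_bytes / (mem_bw_gbs * 1e9)
    return {"prefill_ms": prefill_s * 1e3,
            "decode_ms_per_token": decode_s_per_token * 1e3}

# Example: 3B model, 512-token prompt, 40 TOPS NPU, 50 GB/s effective DRAM bandwidth
print(estimate_phase_latency(3, 512, 40, 50))
# prefill ~77 ms (NPU compute-bound); decode ~30 ms/token (memory-bandwidth-bound)
```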
3.3 Edge Hardware Recommendations¶
| Deployment | Hardware | Framework |
|---|---|---|
| Phone (Android) | Snapdragon 8+ Gen 1 | MLC-LLM, llama.cpp |
| Phone (iOS) | A17 Pro / M-series | MLC-LLM, Core ML |
| Tablet | iPad Pro M4 | MLC-LLM, MLX |
| IoT Device | Jetson Orin Nano | TensorRT |
| Edge Server | Jetson AGX Orin | TensorRT-LLM |
4. Mobile Inference Frameworks¶
4.1 Framework Comparison (2025)¶
| Framework | Platform | GPU | CPU | Best For |
|---|---|---|---|---|
| llama.cpp | All | Metal/CUDA/Vulkan | Yes | Broad support |
| MLC-LLM | iOS/Android | Metal/OpenCL/Vulkan | Yes | Mobile-optimized |
| Ollama | Desktop | Metal/CUDA | Yes | Ease of use |
| MLX | Apple Silicon | Metal | Yes | Apple devices |
| TensorRT-LLM | NVIDIA | CUDA | No | Production GPU |
| Core ML | iOS/macOS | Metal | Yes | Apple ecosystem |
| ExecuTorch | All | Various | Yes | PyTorch mobile |
4.2 llama.cpp¶
Key Features:
- C++ inference engine
- GGUF format support
- Runs on CPU, Metal, CUDA, Vulkan
- Used by Ollama, LM Studio
Usage:
# Install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Run inference (the binary was named ./main in older builds)
./llama-cli -m model.gguf -p "Hello" -n 128
# Server mode (formerly ./server)
./llama-server -m model.gguf --host 0.0.0.0 --port 8080
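Once the server is running it can be queried over HTTP; a minimal Python client sketch, assuming the server listens on localhost:8080 and using the /completion endpoint from the llama.cpp server documentation:

```python
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Hello", "n_predict": 128, "temperature": 0.7},
    timeout=60,
)
# The server returns a JSON object whose "content" field holds the generated text
print(resp.json()["content"])
```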
4.3 MLC-LLM¶
Key Features:
- Compiler-based optimization
- Metal GPU on Apple
- Vulkan on Android
- WebGPU support
Architecture:
MLC-LLM Pipeline:
Model → TVM Compiler → Optimized Binary → Device Runtime
│
├── Metal (Apple)
├── Vulkan (Android)
└── WebGPU (Browser)
Performance (Llama-2-7B on RTX 3090):

| Framework | Tokens/sec |
|---|---|
| MLC-LLM | ~51 |
| llama.cpp | ~46 |
| Ollama | ~46 |
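On desktop/server targets MLC-LLM also exposes an OpenAI-style Python engine. A minimal sketch following the documented MLCEngine pattern; the model identifier is a placeholder and the exact API surface can vary across MLC-LLM releases:

```python
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC"  # placeholder model id
engine = MLCEngine(model)

# OpenAI-style streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is on-device inference?"}],
    model=model,
    stream=True,
):
    for choice in chunk.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()
```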
4.4 Mobile Deployment Guide¶
Android (Termux + llama.cpp):
# Install Termux from F-Droid
pkg install git cmake
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build
cmake .. && make
# Run a GGUF model (CMake puts binaries in build/bin; older builds named it ./main)
./bin/llama-cli -m /sdcard/model.gguf -p "Hello"
iOS (MLC-LLM):
// Swift integration
import MLCChat
let config = MLCEngine.Config(
modelPath: "path/to/model",
temperature: 0.7
)
let engine = try MLCEngine(config: config)
let response = try await engine.generate("Hello")
5. Model Compression Techniques¶
5.1 Compression Pipeline¶
graph TD
A["Original Model FP32"] --> B["Pruning<br/>Remove unimportant weights"]
B --> C["Quantization<br/>Reduce precision INT4/INT8"]
C --> D["Distillation<br/>Train smaller model from larger"]
D --> E["Compressed Model<br/>Edge-ready"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#e8f5e9,stroke:#4caf50
5.2 Pruning¶
Types:

| Type | Description | Use Case |
|---|---|---|
| Unstructured | Remove individual weights | Research |
| Structured | Remove entire neurons/heads | Production |
| Semi-structured | N:M sparsity pattern | GPU acceleration |
Sparsity vs Quality: (figure omitted)
5.3 Knowledge Distillation¶
Process:
Teacher Model (Large) → Student Model (Small)
│ │
│ Transfer │
└─────Knowledge────────┘
(soft labels)
Distillation Loss: $$ \mathcal{L}_{KD} = \alpha \cdot \mathcal{L}_{CE}(y, y_s) + (1-\alpha) \cdot T^2 \cdot \mathcal{L}_{KL}(p_t, p_s) $$
Where:
- \(T\) = temperature
- \(\alpha\) = weight balancing the hard-label and distillation terms
- \(p_t, p_s\) = temperature-softened teacher/student probabilities
- \(y, y_s\) = ground-truth labels and student predictions
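A minimal PyTorch sketch of this loss, combining hard-label cross-entropy with the temperature-scaled KL term; the temperature and alpha values are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Hard-label term: standard cross-entropy against ground-truth labels y
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL between temperature-softened teacher and student
    # distributions, scaled by T^2 to keep gradient magnitudes comparable
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

# Usage: student_logits/teacher_logits are [batch, vocab], labels are [batch]
```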
5.4 EDGE-LLM Framework (2024-2025)¶
Innovation: Computation and memory-efficient LLM tuning for edge
Features:
- Memory-efficient fine-tuning
- Edge-aware architecture search
- Hardware-aware optimization
6. Inference Optimization¶
6.1 KV-Cache Optimization¶
Problem: KV-cache grows with sequence length
Solutions:

| Technique | Memory Reduction | Quality Impact |
|---|---|---|
| PagedAttention | Variable | None |
| MQA/GQA | 4-8x | Minimal |
| Sliding Window | Fixed | None |
| KV-Cache Quantization | 2-4x | Low |
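The MQA/GQA row follows directly from the KV-cache size formula: shrinking the number of KV heads shrinks the cache proportionally. A small sketch, assuming Llama-2-7B-like dimensions (32 layers, 128-dim heads) for the example values:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_value: float = 2.0) -> float:
    """KV-cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.

    With MHA, n_kv_heads == n_heads; GQA/MQA reduce n_kv_heads (e.g. 32 -> 8),
    which is where the 4-8x reduction in the table above comes from.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# MHA (32 KV heads) vs a GQA variant (8 KV heads), 4K context, FP16 cache
print(kv_cache_gb(32, 32, 128, 4096))  # ~2.1 GB
print(kv_cache_gb(32, 8, 128, 4096))   # ~0.5 GB (4x smaller)
```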
6.2 Speculative Decoding on Mobile¶
Key Paper (Oct 2025): "Accelerating Mobile Language Model via Speculative Decoding"
Speculative Decoding:
1. Small draft model generates K tokens quickly
2. Large target model verifies in parallel
3. Accept drafted tokens that match the target model; resample from the first mismatch
Result: Up to 2-3x speedup on mobile
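The speedup can be estimated with a simplified acceptance model; the acceptance rate and draft-cost ratio below are assumed example values, not benchmark results:

```python
def speculative_speedup(accept_rate: float, k: int, draft_cost_ratio: float = 0.05) -> float:
    """Simplified expected speedup from speculative decoding (idealized model).

    accept_rate: probability the target model accepts each drafted token (alpha)
    k: number of tokens the draft model proposes per verification step
    draft_cost_ratio: cost of one draft step relative to one target forward pass
    Expected tokens accepted per target pass: (1 - alpha^(k+1)) / (1 - alpha)
    """
    expected_tokens = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
    cost_per_pass = 1 + k * draft_cost_ratio   # one target pass + k draft steps
    return expected_tokens / cost_per_pass

# Example: 70% acceptance, 4 drafted tokens, draft model ~5% of target cost
print(round(speculative_speedup(0.7, 4), 2))  # ~2.3x, in line with the 2-3x figure above
```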
6.3 Inference Latency Targets (Edge)¶
| Model Size | Target Latency | Hardware |
|---|---|---|
| 1-3B | <50ms | Phone |
| 3-7B | <200ms | Phone |
| 7-13B | <500ms | Tablet |
| 13-30B | <1s | Edge server |
7. Production Edge Deployment¶
7.1 Deployment Checklist¶
□ Model Selection
├── Choose appropriate size (1-7B for phones)
├── Select quantization (GGUF Q4_K_M recommended)
└── Test on target device
□ Framework Selection
├── iOS: MLC-LLM or Core ML
├── Android: MLC-LLM or llama.cpp
└── Cross-platform: llama.cpp
□ Performance Optimization
├── Enable GPU/NPU acceleration
├── Optimize batch size (usually 1)
├── Set appropriate context length
└── Implement caching where possible
□ Monitoring
├── Track inference latency
├── Monitor memory usage
├── Log errors and crashes
└── Collect user feedback
7.2 NexaSDK for Android (Qualcomm 2025)¶
Key Features:
- On-device AI for Snapdragon
- GPT-OSS-20B running entirely on-device
- Qualcomm Hexagon NPU integration
Capabilities:
- Zero cloud dependency
- Privacy-preserving inference
- 20B parameter models on mobile
- Direct NPU access
7.3 Google AI Edge (2025)¶
Features:
- On-device small language models
- Multimodality support
- RAG integration
- Function calling
- Model customization and quantization
8. Interview Questions¶
8.1 Concept Questions¶
Q: Compare GPTQ, AWQ, and GGUF quantization methods.
A: GPTQ:
- GPU-optimized post-training quantization
- Layer-by-layer calibration
- Best for NVIDIA GPUs
AWQ:
- Activation-aware weight quantization
- Protects salient weights
- Best quality preservation
- Lightweight calibration (less sensitive to calibration data than GPTQ)
GGUF:
- CPU-optimized format
- Multiple variants (Q4_K_M, Q5_K_M, etc.)
- Broad hardware support
- Best for mobile deployment
Q: What are the main challenges of deploying LLMs on edge devices?
A: Four main challenges:
1. Memory: Limited RAM (4-16GB typical)
2. Compute: Mobile GPU/NPU << datacenter GPU
3. Power: Battery constraints limit sustained inference
4. Thermal: Throttling under load
Solutions:
- Aggressive quantization (INT4)
- Model pruning
- Knowledge distillation to smaller models
- Efficient attention (MQA, GQA)
Q: Explain how NPU acceleration works for mobile LLM inference.
A: NPU (Neural Processing Unit) acceleration:
- Dedicated hardware for matrix operations
- Parallel processing capability
- Lower power than GPU for inference
Pipeline:
1. Prefill phase → NPU (parallel token processing)
2. Decode phase → CPU (sequential generation)
Frameworks: llm.npu, m²LLM, NexaSDK
Hardware: Snapdragon Hexagon, Apple Neural Engine
8.2 Architecture Questions¶
Q: Design an on-device LLM deployment for a mobile app.
A: Architecture:
1. Model Layer:
- 3-7B parameter model (Llama 3.2, Phi-3)
- GGUF Q4_K_M quantization
- Context window: 2048-4096 tokens
2. Inference Engine:
- MLC-LLM for iOS/Android
- GPU acceleration (Metal/Vulkan)
- KV-cache optimization
3. Application Layer:
- Request queue for batching
- Streaming response
- Error handling and fallback
4. Optimization:
- Warm start for faster first token
- Response caching
- Dynamic context pruning
Constraints:
- Memory budget: <3GB
- Latency: <500ms per token
- Battery: Minimal drain
Q: How would you optimize inference latency on a resource-constrained device?
A: Multi-level optimization:
1. Model Level:
- Use smallest viable model
- Apply INT4 quantization
- Implement GQA/MQA attention
2. Inference Level:
- Enable NPU/GPU acceleration
- Optimize KV-cache management
- Use speculative decoding
3. System Level:
- Warm model loading
- Request batching (where possible)
- Memory-mapped model files
4. Application Level:
- Streaming responses
- Progressive loading
- Graceful degradation
8.3 Implementation Questions¶
Q: Implement model size calculation for edge deployment.
def calculate_model_requirements(
    params_billions: float,
    bits_per_param: int,
    context_length: int,
    kv_cache_bits: int = 16
) -> dict:
    """
    Calculate memory requirements for LLM deployment.
    Returns dict with sizes in GB.
    """
    # Model weights
    weight_bytes = params_billions * 1e9 * (bits_per_param / 8)
    weight_gb = weight_bytes / 1e9

    # KV-cache (K and V per layer, simplified)
    # Assumes a standard 7B-class transformer: 32 layers, hidden dim 4096, MHA
    hidden_dim = 4096
    n_layers = 32
    kv_bytes = 2 * n_layers * context_length * hidden_dim * (kv_cache_bits / 8)
    kv_gb = kv_bytes / 1e9

    # Runtime overhead (activations, buffers, etc.)
    overhead_gb = weight_gb * 0.2

    total_gb = weight_gb + kv_gb + overhead_gb
    return {
        "weights_gb": round(weight_gb, 2),
        "kv_cache_gb": round(kv_gb, 3),
        "overhead_gb": round(overhead_gb, 2),
        "total_gb": round(total_gb, 2),
        "fits_in_8gb": total_gb < 7.0,
        "fits_in_16gb": total_gb < 14.0
    }

# Example: 7B model at 4-bit with 4096 context
result = calculate_model_requirements(7, 4, 4096)
# weights_gb: 3.5, kv_cache_gb: ~2.15, total_gb: ~6.35, fits_in_8gb: True
Q: Implement quantization-aware model selection.
def select_edge_model(
    device_ram_gb: float,
    target_latency_ms: float,
    quality_threshold: float = 0.95
) -> dict:
    """
    Select an appropriate model configuration for edge deployment.
    Note: target_latency_ms is kept in the interface but is not used in this
    simplified selector -- actual latency depends on the target hardware.
    """
    models = [
        {"name": "Phi-3-mini", "params": 3.8, "base_quality": 0.90},
        {"name": "Llama-3.2-3B", "params": 3.2, "base_quality": 0.92},
        {"name": "Llama-3.2-1B", "params": 1.2, "base_quality": 0.88},
        {"name": "Qwen-2.5-3B", "params": 3.1, "base_quality": 0.91},
        {"name": "Gemma-2-2B", "params": 2.6, "base_quality": 0.89},
    ]
    quant_configs = [
        {"bits": 4, "quality_mult": 0.97, "name": "Q4_K_M"},
        {"bits": 5, "quality_mult": 0.98, "name": "Q5_K_M"},
        {"bits": 8, "quality_mult": 0.99, "name": "Q8_0"},
    ]

    candidates = []
    for model in models:
        for quant in quant_configs:
            size_gb = model["params"] * (quant["bits"] / 8)
            quality = model["base_quality"] * quant["quality_mult"]
            if size_gb < device_ram_gb * 0.7:  # 70% RAM budget
                candidates.append({
                    "model": model["name"],
                    "quantization": quant["name"],
                    "size_gb": round(size_gb, 2),
                    "quality": round(quality, 2),
                    "meets_threshold": quality >= quality_threshold
                })

    # Sort by quality and return the best candidate
    candidates.sort(key=lambda x: x["quality"], reverse=True)
    return candidates[0] if candidates else None
9. Key Papers & Resources¶
| Paper/Resource | Year | Key Contribution |
|---|---|---|
| llm.npu | ASPLOS 2025 | First NPU-aware LLM inference system |
| m²LLM | IEEE 2025 | Multi-dimensional mobile optimization |
| EDGE-LLM | ACM 2024 | Memory-efficient edge tuning |
| GPTQ Paper | 2023 | GPU-optimized quantization |
| AWQ Paper | 2023 | Activation-aware quantization |
| Sustainable LLM Inference for Edge | arXiv 2025 | Energy efficiency evaluation |
| PartInfer | OpenReview 2025 | Partitioned inference for edge |
10. Formulas Quick Reference¶
Model Size¶
$$ \text{Model size (bytes)} = N_{\text{params}} \times \frac{\text{bits per param}}{8} $$
KV-Cache Size¶
$$ \text{KV-cache (bytes)} = 2 \times L \times H \times D_h \times S \times B $$
Where \(L\) = layers, \(H\) = heads, \(D_h\) = head dim, \(S\) = seq len, \(B\) = bytes per value
Quantization Quality Loss¶
$$ \text{Quality}_{\text{quantized}} \approx \text{Quality}_{\text{FP16}} \times \text{Retention}_{b} $$
Where \(\text{Retention}_4 \approx 0.95\), \(\text{Retention}_8 \approx 0.99\)
Inference Latency¶
Decode is typically memory-bandwidth-bound, so a rough per-token estimate is $$ t_{\text{token}} \approx \frac{\text{Model size (bytes)}}{\text{Memory bandwidth (bytes/s)}} $$ For example, a 3.5 GB Q4 model read over ~50 GB/s of effective bandwidth gives roughly 70 ms per token.
Common Misconceptions¶
Misconception: INT4 quantization always loses 5% of quality
That figure is an average for naive round-to-nearest quantization. AWQ (activation-aware quantization) retains 97-98% of quality at INT4 because it protects salient weights. In practice, for classification/summarization tasks the gap to FP16 is often within 1-2%. The losses show up mainly on long reasoning chains and arithmetic.
Misconception: an NPU is always faster than a GPU for LLM inference
An NPU (Neural Processing Unit) is optimized for parallel matrix operations, which is ideal for the prefill phase. But the decode phase (sequential token generation) is often faster on the CPU/GPU. That is why the best frameworks (llm.npu) use a hybrid approach: prefill on the NPU, decode on the CPU.
Misconception: an on-device LLM can fully replace a cloud API
On-device models are limited to roughly 1-7B parameters, which is comparable to GPT-3.5 but falls well short of GPT-4/Claude on complex tasks. The best strategy is hybrid: simple requests are handled locally (privacy + speed), complex ones are routed to the cloud.
Interview Questions¶
Which quantization strategy should you choose for deploying a 7B model on Android?
Weak answer: "Let's take GPTQ, it quantizes best" -- GPTQ is optimized for GPUs, and Android devices usually have no CUDA.
Strong answer: For Android the best choice is GGUF Q4_K_M via llama.cpp or MLC-LLM. GGUF is CPU-optimized and supports Vulkan GPU acceleration. A 7B model in Q4_K_M takes ~4 GB, which fits devices with 12+ GB of RAM (Samsung S24 Ultra). For 8 GB devices it is better to take a 3B model (Llama 3.2 3B) -- ~2 GB in Q4_K_M. If a Snapdragon 8 Gen 3+ is available, the NPU can be engaged via the Hexagon SDK to speed up the prefill phase.
How do you measure model quality degradation after quantization?
Weak answer: "Run a few examples and eyeball the output" -- no methodology, not representative.
Strong answer: You need a systematic evaluation: (1) a benchmark suite covering the target use cases (MMLU, HumanEval, or domain-specific); (2) comparing FP16 vs INT4 metrics -- if retention > 95%, the model is deployable; (3) perplexity on a reference corpus as a proxy metric; (4) A/B testing on real users for subjective quality. Formula: Quality_quantized ~ Quality_FP16 x Retention_b, where Retention_4 ~ 0.95-0.98 for AWQ.
Design an on-device LLM system with a fallback to the cloud.
Weak answer: "If the model fails, send the request to the cloud" -- no routing criteria, no latency budget.
Strong answer: A two-tier architecture: (1) Router -- classifies the request by complexity (simple/complex) via a lightweight model or rule-based logic (prompt length, presence of code/math). (2) Simple -> on-device 3B model (GGUF Q4_K_M), latency target <200 ms, privacy-preserving. (3) Complex -> cloud API (GPT-4/Claude), latency <2 s, quality-critical. (4) Fallback: if on-device confidence < threshold or timeout > 500 ms, escalate to the cloud. (5) Monitoring: track routing ratio, P50/P99 latency, and user satisfaction per path. A rule-based sketch of this router is shown below.
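A minimal sketch of such a router under the assumptions above; the local_llm and cloud_llm callables, the regex heuristic, and the thresholds are hypothetical stand-ins rather than a production classifier:

```python
import re

# Crude heuristic for "complex" prompts: code keywords or inline arithmetic
CODE_OR_MATH = re.compile(r"\bdef |\bclass |\bSELECT |\bprove\b|\bintegral\b|\d+\s*[*^/]\s*\d+")

def route(prompt: str, max_local_words: int = 512) -> str:
    """Rule-based router: short prompts without code/math stay on device."""
    if len(prompt.split()) > max_local_words or CODE_OR_MATH.search(prompt):
        return "cloud"
    return "on_device"

def answer(prompt: str, local_llm, cloud_llm, confidence_threshold: float = 0.6) -> dict:
    """local_llm returns (text, confidence); cloud_llm returns text. Both are stubs."""
    if route(prompt) == "on_device":
        text, confidence = local_llm(prompt)
        if confidence >= confidence_threshold:
            return {"path": "on_device", "text": text}
    # Escalate: complex prompt, low on-device confidence, or local failure/timeout
    return {"path": "cloud", "text": cloud_llm(prompt)}
```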