On-Device ML: Edge LLM Inference¶
~4 minute read
Prerequisites: LLM Quantization | Inference Optimization
Running an LLM directly on a smartphone is no longer a niche task but one of the main trends of 2025-2026. Apple Intelligence, Google Gemini Nano, Samsung Galaxy AI -- every major platform is moving inference onto the device. The reason is simple: a round-trip to the cloud adds 200-500 ms of latency, while on-device inference responds in 10-50 ms. Meanwhile, Llama 3.2 3B in GGUF Q4_K_M format takes only about 2 GB and fits even on an iPhone 15 Pro with 8 GB of RAM. The edge AI market is projected to reach $40 billion by 2027.
1. Edge LLM Deployment Overview¶
1.1 Why On-Device Inference?¶
| Benefit | Description |
|---|---|
| Privacy | Data never leaves device |
| Latency | No network round-trip |
| Offline | Works without connectivity |
| Cost | No cloud API charges |
| Compliance | GDPR, HIPAA requirements |
1.2 Key Challenges¶
graph TD
A["Memory Constraints<br/>4-16GB RAM typical on mobile"] --> B["Compute Limitations<br/>Mobile GPU/NPU vs datacenter GPU"]
B --> C["Power Budget<br/>Battery life constraints"]
C --> D["Thermal Throttling<br/>Sustained performance limits"]
D --> E["Model Size<br/>7B+ models need aggressive compression"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fce4ec,stroke:#c62828
style E fill:#fce4ec,stroke:#c62828
1.3 Device Memory Constraints (2025)¶
| Device Type | RAM | Model Capacity (4-bit) |
|---|---|---|
| iPhone 15 Pro | 8GB | ~3B params |
| iPhone 16 Pro Max | 8GB | ~3-5B params |
| Samsung S24 Ultra | 12GB | ~5B params |
| Pixel 9 Pro | 16GB | ~7B params |
| iPad Pro M4 | 16GB | ~7B params |
| Jetson Orin AGX | 64GB | ~32B params |
2. Quantization Methods¶
2.1 Quantization Overview¶
Goal: Reduce weight precision from FP32/FP16 down to INT8/INT4/INT2, shrinking model size proportionally
| Precision | Bytes/Param | 7B Model Size | Quality |
|---|---|---|---|
| FP32 | 4 | 28 GB | 100% |
| FP16 | 2 | 14 GB | ~99.9% |
| BF16 | 2 | 14 GB | ~99.9% |
| INT8 | 1 | 7 GB | ~99% |
| INT4 | 0.5 | 3.5 GB | ~95-98% |
| INT2 | 0.25 | 1.75 GB | ~80-90% |
2.2 GPTQ (GPU-Optimized)¶
Key Idea: Post-training quantization optimized for GPU inference
GPTQ Process:
1. Calibrate on small dataset (128-512 samples)
2. Quantize weights layer-by-layer
3. Update remaining weights to compensate for errors
Formula: $$ \hat{W}_q = \arg\min_{W_q} \lVert WX - W_q X \rVert_2^2 $$
Pros:
- High quality preservation
- Fast GPU inference
- Works with Transformers/AutoGPTQ

Cons:
- GPU-focused (not ideal for mobile CPU)
- Calibration required
- Larger file size than GGUF
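For reference, a minimal sketch of GPTQ post-training quantization through the Hugging Face transformers integration; this assumes the optimum/auto-gptq backends are installed, and the model id, calibration dataset, and group size are placeholder choices (the GPTQConfig interface may differ between library versions):

```python
# Hedged sketch: 4-bit GPTQ via transformers + optimum/auto-gptq (check current docs).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration on a small dataset (step 1) and layer-by-layer quantization with
# error compensation (steps 2-3) run inside from_pretrained when a GPTQConfig is passed.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
model.save_pretrained("llama2-7b-gptq-4bit")
```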
2.3 AWQ (Activation-Aware)¶
Key Idea: Protect important weights based on activation magnitudes
AWQ Insight:
- Not all weights equally important
- Weights connected to high-activation channels matter more
- Protect salient weights with higher precision
Formula: $$ \text{Importance}(w) = \frac{1}{N} \sum_{i=1}^{N} |x_i| \cdot |w| $$
Results (2025):

| Model | FP16 | AWQ INT4 | Quality Retention |
|---|---|---|---|
| Llama-2-7B | 100% | 97.8% | 97.8% |
| Llama-2-70B | 100% | 98.5% | 98.5% |
| Mistral-7B | 100% | 97.5% | 97.5% |
Pros:
- Best quality preservation
- Lightweight calibration (small dataset, less sensitive than GPTQ)
- Works well for production

Cons:
- More complex quantization process
- GPU-focused
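To make the salience idea concrete, here is a toy NumPy sketch of the per-channel importance score above. It illustrates only the scoring step, not the full AWQ algorithm; the array shapes and the 1% salient-channel fraction are assumptions:

```python
import numpy as np

def awq_channel_importance(weight: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Toy illustration of the AWQ salience score (not the full AWQ pipeline).

    weight: [out_features, in_features] layer weight matrix
    activations: [n_samples, in_features] calibration activations feeding this layer
    Returns a per-input-channel score: mean |x| over calibration samples, scaled by
    the channel's mean |w|. High-scoring channels are the "salient" ones AWQ protects
    (e.g., by rescaling them before 4-bit rounding).
    """
    mean_abs_act = np.abs(activations).mean(axis=0)   # (1/N) * sum_i |x_i|, per channel
    mean_abs_w = np.abs(weight).mean(axis=0)          # mean |w| per input channel
    return mean_abs_act * mean_abs_w

# Usage: flag the top ~1% channels as salient
rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 4096))
X = rng.normal(size=(512, 4096))
scores = awq_channel_importance(W, X)
salient_channels = np.argsort(scores)[-int(0.01 * scores.size):]
```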
2.4 GGUF (llama.cpp Format)¶
Key Idea: CPU-optimized format with multiple quantization variants
GGUF Variants (Q = Quantization):
- Q4_K_M: 4-bit, medium quality (recommended)
- Q4_K_S: 4-bit, small size
- Q5_K_M: 5-bit, higher quality
- Q8_0: 8-bit, near-lossless
- Q2_K: 2-bit, aggressive compression
Comparison (Llama-2-7B):

| Variant | Size | Quality | Speed |
|---|---|---|---|
| Q4_K_M | 4.1 GB | 92% | Fast |
| Q5_K_M | 5.0 GB | 95% | Medium |
| Q8_0 | 7.2 GB | 99% | Slow |
| Q2_K | 2.4 GB | 80% | Fastest |
Pros:
- Works on CPU (no GPU needed)
- Broad hardware support
- Ollama/llama.cpp ecosystem
- Easy to use

Cons:
- Slower than GPU-targeted formats (GPTQ/AWQ) when a GPU is available
- Slightly lower quality than AWQ at the same bit-width
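A GGUF model can also be driven from Python through the llama-cpp-python bindings; a minimal sketch, assuming the package is installed and using a placeholder model path:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-3b-instruct-Q4_K_M.gguf",  # placeholder GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to Metal/Vulkan/CUDA if available; 0 = CPU only
)

out = llm("Explain edge inference in one sentence.", max_tokens=64, temperature=0.7)
print(out["choices"][0]["text"])
```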
2.5 Quantization Selection Guide (2025)¶
| Use Case | Recommended | Why |
|---|---|---|
| Mobile CPU | GGUF Q4_K_M | Broad support, good balance |
| Apple Silicon | MLC-LLM / GGUF | Metal GPU acceleration |
| NVIDIA GPU | GPTQ or AWQ | Fast inference |
| Production API | AWQ INT4 | Best quality |
| Extreme compression | Q2_K / 2-bit | Under 1GB models |
3. Edge Hardware & NPUs¶
3.1 Mobile NPU Landscape (2025)¶
| Chip | NPU | TOPS | Key Features |
|---|---|---|---|
| Snapdragon 8 Gen 3 | Hexagon | 75 | Qualcomm AI Engine |
| Snapdragon 8 Elite | Hexagon | 100+ | Latest flagship |
| Apple A17 Pro | Neural Engine | 35 | Core ML optimized |
| Apple M4 | Neural Engine | 38 | iPad Pro |
| Tensor G4 | TPU | ~30 | Google Pixel |
| Dimensity 9300 | APU | ~50 | MediaTek |
3.2 NPU Acceleration Techniques¶
llm.npu Framework (ASPLOS 2025):
Key Innovation: NPU offloading for prefill latency reduction
Pipeline:
1. Prefill phase → NPU (parallel processing)
2. Decode phase → CPU (sequential tokens)
Result: Significant latency reduction on Snapdragon
m²LLM Framework (IEEE 2025):
- Multi-dimensional optimization for mobile LLM
- Tested on Snapdragon 8 Gen 3
- Memory-compute-precision co-optimization
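A rough back-of-envelope model shows why the prefill-on-NPU / decode-on-CPU split helps: prefill processes all prompt tokens in parallel and is compute-bound (it scales with NPU TOPS), while decode re-reads all weights for every generated token and is memory-bandwidth-bound. The sketch below uses illustrative assumed numbers, not measurements:

```python
def estimate_phase_latency(params_b: float, prompt_tokens: int,
                           npu_tops: float, mem_bw_gbs: float,
                           bytes_per_param: float = 0.5) -> dict:
    """Back-of-envelope latency model (illustrative assumptions, not benchmarks).

    Prefill: ~2 * params FLOPs per token, all tokens in parallel -> compute-bound.
    Decode: every new token re-reads the full weight set -> bandwidth-bound.
    """
    flops_per_token = 2 * params_b * 1e9
    prefill_s = prompt_tokens * flops_per_token / (npu_tops * 1e12)
    weight_bytes = params_b * 1e9 * bytes_per_param           # INT4 weights
    decode_s_per_token = weight_bytes / (mem_bw_gbs * 1e9)
    return {"prefill_ms": prefill_s * 1e3,
            "decode_ms_per_token": decode_s_per_token * 1e3}

# Example: 3B model, 512-token prompt, 40 TOPS NPU, 50 GB/s effective DRAM bandwidth
print(estimate_phase_latency(3, 512, 40, 50))
# prefill ~77 ms (NPU compute-bound); decode ~30 ms/token (memory-bandwidth-bound)
```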
3.3 Edge Hardware Recommendations¶
| Deployment | Hardware | Framework |
|---|---|---|
| Phone (Android) | Snapdragon 8+ Gen 1 | MLC-LLM, llama.cpp |
| Phone (iOS) | A17 Pro / M-series | MLC-LLM, Core ML |
| Tablet | iPad Pro M4 | MLC-LLM, MLX |
| IoT Device | Jetson Orin Nano | TensorRT |
| Edge Server | Jetson AGX Orin | TensorRT-LLM |
4. Mobile Inference Frameworks¶
4.1 Framework Comparison (2025)¶
| Framework | Platform | GPU | CPU | Best For |
|---|---|---|---|---|
| llama.cpp | All | Metal/CUDA/Vulkan | Yes | Broad support |
| MLC-LLM | iOS/Android | Metal/OpenCL/Vulkan | Yes | Mobile-optimized |
| Ollama | Desktop | Metal/CUDA | Yes | Ease of use |
| MLX | Apple Silicon | Metal | Yes | Apple devices |
| TensorRT-LLM | NVIDIA | CUDA | No | Production GPU |
| Core ML | iOS/macOS | Metal | Yes | Apple ecosystem |
| ExecuTorch | All | Various | Yes | PyTorch mobile |
4.2 llama.cpp¶
Key Features:
- C++ inference engine
- GGUF format support
- Runs on CPU, Metal, CUDA, Vulkan
- Used by Ollama, LM Studio
Usage:
# Install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Run inference (the binary was named ./main in older builds)
./llama-cli -m model.gguf -p "Hello" -n 128
# Server mode (formerly ./server)
./llama-server -m model.gguf --host 0.0.0.0 --port 8080
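Once the server is running it can be queried over HTTP; a minimal Python client sketch, assuming the server listens on localhost:8080 and using the /completion endpoint from the llama.cpp server documentation:

```python
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Hello", "n_predict": 128, "temperature": 0.7},
    timeout=60,
)
# The server returns a JSON object whose "content" field holds the generated text
print(resp.json()["content"])
```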
4.3 MLC-LLM¶
Key Features:
- Compiler-based optimization
- Metal GPU on Apple
- Vulkan on Android
- WebGPU support
Architecture:
MLC-LLM Pipeline:
Model → TVM Compiler → Optimized Binary → Device Runtime
│
├── Metal (Apple)
├── Vulkan (Android)
└── WebGPU (Browser)
Performance (Llama-2-7B on RTX 3090):

| Framework | Tokens/sec |
|---|---|
| MLC-LLM | ~51 |
| llama.cpp | ~46 |
| Ollama | ~46 |
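On desktop/server targets MLC-LLM also exposes an OpenAI-style Python engine. A minimal sketch following the documented MLCEngine pattern; the model identifier is a placeholder and the exact API surface can vary across MLC-LLM releases:

```python
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC"  # placeholder model id
engine = MLCEngine(model)

# OpenAI-style streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is on-device inference?"}],
    model=model,
    stream=True,
):
    for choice in chunk.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()
```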
4.4 Mobile Deployment Guide¶
Android (Termux + llama.cpp):
# Install Termux from F-Droid
pkg install git cmake
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build
cmake .. && make
# Run a GGUF model (CMake puts binaries in build/bin; older builds named it ./main)
./bin/llama-cli -m /sdcard/model.gguf -p "Hello"
iOS (MLC-LLM):
// Swift integration
import MLCChat
let config = MLCEngine.Config(
modelPath: "path/to/model",
temperature: 0.7
)
let engine = try MLCEngine(config: config)
let response = try await engine.generate("Hello")
5. Model Compression Techniques¶
5.1 Compression Pipeline¶
graph TD
A["Original Model FP32"] --> B["Pruning<br/>Remove unimportant weights"]
B --> C["Quantization<br/>Reduce precision INT4/INT8"]
C --> D["Distillation<br/>Train smaller model from larger"]
D --> E["Compressed Model<br/>Edge-ready"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#e8f5e9,stroke:#4caf50
5.2 Pruning¶
Types:

| Type | Description | Use Case |
|---|---|---|
| Unstructured | Remove individual weights | Research |
| Structured | Remove entire neurons/heads | Production |
| Semi-structured | N:M sparsity pattern | GPU acceleration |
Sparsity vs Quality: (figure omitted)
5.3 Knowledge Distillation¶
Process:
Teacher Model (Large) → Student Model (Small)
│ │
│ Transfer │
└─────Knowledge────────┘
(soft labels)
Distillation Loss: $$ \mathcal{L}_{KD} = \alpha \cdot \mathcal{L}_{CE}(y, y_s) + (1-\alpha) \cdot T^2 \cdot \mathcal{L}_{KL}(p_t, p_s) $$
Where:
- \(T\) = temperature
- \(\alpha\) = weight balancing the hard-label and distillation terms
- \(p_t, p_s\) = temperature-softened teacher/student probabilities
- \(y, y_s\) = ground-truth labels and student predictions
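A minimal PyTorch sketch of this loss, combining hard-label cross-entropy with the temperature-scaled KL term; the temperature and alpha values are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Hard-label term: standard cross-entropy against ground-truth labels y
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL between temperature-softened teacher and student
    # distributions, scaled by T^2 to keep gradient magnitudes comparable
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

# Usage: student_logits/teacher_logits are [batch, vocab], labels are [batch]
```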
5.4 EDGE-LLM Framework (2024-2025)¶
Innovation: Computation and memory-efficient LLM tuning for edge
Features:
- Memory-efficient fine-tuning
- Edge-aware architecture search
- Hardware-aware optimization
6. Inference Optimization¶
6.1 KV-Cache Optimization¶
Problem: KV-cache grows with sequence length
Solutions:

| Technique | Memory Reduction | Quality Impact |
|---|---|---|
| PagedAttention | Variable | None |
| MQA/GQA | 4-8x | Minimal |
| Sliding Window | Fixed | None |
| KV-Cache Quantization | 2-4x | Low |
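The MQA/GQA row follows directly from the KV-cache size formula: shrinking the number of KV heads shrinks the cache proportionally. A small sketch, assuming Llama-2-7B-like dimensions (32 layers, 128-dim heads) for the example values:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_value: float = 2.0) -> float:
    """KV-cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.

    With MHA, n_kv_heads == n_heads; GQA/MQA reduce n_kv_heads (e.g. 32 -> 8),
    which is where the 4-8x reduction in the table above comes from.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# MHA (32 KV heads) vs a GQA variant (8 KV heads), 4K context, FP16 cache
print(kv_cache_gb(32, 32, 128, 4096))  # ~2.1 GB
print(kv_cache_gb(32, 8, 128, 4096))   # ~0.5 GB (4x smaller)
```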
6.2 Speculative Decoding on Mobile¶
Key Paper (Oct 2025): "Accelerating Mobile Language Model via Speculative Decoding"
Speculative Decoding:
1. Small draft model generates K tokens quickly
2. Large target model verifies in parallel
3. Accept drafted tokens that match the target model; resample from the first mismatch
Result: Up to 2-3x speedup on mobile
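The speedup can be estimated with a simplified acceptance model; the acceptance rate and draft-cost ratio below are assumed example values, not benchmark results:

```python
def speculative_speedup(accept_rate: float, k: int, draft_cost_ratio: float = 0.05) -> float:
    """Simplified expected speedup from speculative decoding (idealized model).

    accept_rate: probability the target model accepts each drafted token (alpha)
    k: number of tokens the draft model proposes per verification step
    draft_cost_ratio: cost of one draft step relative to one target forward pass
    Expected tokens accepted per target pass: (1 - alpha^(k+1)) / (1 - alpha)
    """
    expected_tokens = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
    cost_per_pass = 1 + k * draft_cost_ratio   # one target pass + k draft steps
    return expected_tokens / cost_per_pass

# Example: 70% acceptance, 4 drafted tokens, draft model ~5% of target cost
print(round(speculative_speedup(0.7, 4), 2))  # ~2.3x, in line with the 2-3x figure above
```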
6.3 Inference Latency Targets (Edge)¶
| Model Size | Target Latency | Hardware |
|---|---|---|
| 1-3B | <50ms | Phone |
| 3-7B | <200ms | Phone |
| 7-13B | <500ms | Tablet |
| 13-30B | <1s | Edge server |
7. Production Edge Deployment¶
7.1 Deployment Checklist¶
□ Model Selection
├── Choose appropriate size (1-7B for phones)
├── Select quantization (GGUF Q4_K_M recommended)
└── Test on target device
□ Framework Selection
├── iOS: MLC-LLM or Core ML
├── Android: MLC-LLM or llama.cpp
└── Cross-platform: llama.cpp
□ Performance Optimization
├── Enable GPU/NPU acceleration
├── Optimize batch size (usually 1)
├── Set appropriate context length
└── Implement caching where possible
□ Monitoring
├── Track inference latency
├── Monitor memory usage
├── Log errors and crashes
└── Collect user feedback
7.2 NexaSDK for Android (Qualcomm 2025)¶
Key Features:
- On-device AI for Snapdragon
- GPT-OSS-20B running entirely on-device
- Qualcomm Hexagon NPU integration
Capabilities:
- Zero cloud dependency
- Privacy-preserving inference
- 20B parameter models on mobile
- Direct NPU access
7.3 Google AI Edge (2025)¶
Features:
- On-device small language models
- Multimodality support
- RAG integration
- Function calling
- Model customization and quantization
8. Interview Questions¶
8.1 Concept Questions¶
Q: Compare GPTQ, AWQ, and GGUF quantization methods.
A: GPTQ:
- GPU-optimized post-training quantization
- Layer-by-layer calibration
- Best for NVIDIA GPUs
AWQ:
- Activation-aware weight quantization
- Protects salient weights
- Best quality preservation
- Lightweight calibration (less sensitive to calibration data than GPTQ)
GGUF:
- CPU-optimized format
- Multiple variants (Q4_K_M, Q5_K_M, etc.)
- Broad hardware support
- Best for mobile deployment
Q: What are the main challenges of deploying LLMs on edge devices?
A: Four main challenges:
1. Memory: Limited RAM (4-16GB typical)
2. Compute: Mobile GPU/NPU << datacenter GPU
3. Power: Battery constraints limit sustained inference
4. Thermal: Throttling under load
Solutions:
- Aggressive quantization (INT4)
- Model pruning
- Knowledge distillation to smaller models
- Efficient attention (MQA, GQA)
Q: Explain how NPU acceleration works for mobile LLM inference.
A: NPU (Neural Processing Unit) acceleration:
- Dedicated hardware for matrix operations
- Parallel processing capability
- Lower power than GPU for inference
Pipeline:
1. Prefill phase → NPU (parallel token processing)
2. Decode phase → CPU (sequential generation)
Frameworks: llm.npu, m²LLM, NexaSDK
Hardware: Snapdragon Hexagon, Apple Neural Engine
8.2 Architecture Questions¶
Q: Design an on-device LLM deployment for a mobile app.
A: Architecture:
1. Model Layer:
- 3-7B parameter model (Llama 3.2, Phi-3)
- GGUF Q4_K_M quantization
- Context window: 2048-4096 tokens
2. Inference Engine:
- MLC-LLM for iOS/Android
- GPU acceleration (Metal/Vulkan)
- KV-cache optimization
3. Application Layer:
- Request queue for batching
- Streaming response
- Error handling and fallback
4. Optimization:
- Warm start for faster first token
- Response caching
- Dynamic context pruning
Constraints:
- Memory budget: <3GB
- Latency: <500ms per token
- Battery: Minimal drain
Q: How would you optimize inference latency on a resource-constrained device?
A: Multi-level optimization:
1. Model Level:
- Use smallest viable model
- Apply INT4 quantization
- Implement GQA/MQA attention
2. Inference Level:
- Enable NPU/GPU acceleration
- Optimize KV-cache management
- Use speculative decoding
3. System Level:
- Warm model loading
- Request batching (where possible)
- Memory-mapped model files
4. Application Level:
- Streaming responses
- Progressive loading
- Graceful degradation
8.3 Implementation Questions¶
Q: Implement model size calculation for edge deployment.
def calculate_model_requirements(
    params_billions: float,
    bits_per_param: int,
    context_length: int,
    kv_cache_bits: int = 16
) -> dict:
    """
    Calculate memory requirements for LLM deployment.
    Returns dict with sizes in GB.
    """
    # Model weights
    weight_bytes = params_billions * 1e9 * (bits_per_param / 8)
    weight_gb = weight_bytes / 1e9

    # KV-cache (K and V per layer, simplified)
    # Assumes a standard 7B-class transformer: 32 layers, hidden dim 4096, MHA
    hidden_dim = 4096
    n_layers = 32
    kv_bytes = 2 * n_layers * context_length * hidden_dim * (kv_cache_bits / 8)
    kv_gb = kv_bytes / 1e9

    # Runtime overhead (activations, buffers, etc.)
    overhead_gb = weight_gb * 0.2

    total_gb = weight_gb + kv_gb + overhead_gb
    return {
        "weights_gb": round(weight_gb, 2),
        "kv_cache_gb": round(kv_gb, 3),
        "overhead_gb": round(overhead_gb, 2),
        "total_gb": round(total_gb, 2),
        "fits_in_8gb": total_gb < 7.0,
        "fits_in_16gb": total_gb < 14.0
    }

# Example: 7B model at 4-bit with 4096 context
result = calculate_model_requirements(7, 4, 4096)
# weights_gb: 3.5, kv_cache_gb: ~2.15, total_gb: ~6.35, fits_in_8gb: True
Q: Implement quantization-aware model selection.
def select_edge_model(
    device_ram_gb: float,
    target_latency_ms: float,
    quality_threshold: float = 0.95
) -> dict:
    """
    Select an appropriate model configuration for edge deployment.
    Note: target_latency_ms is kept in the interface but is not used in this
    simplified selector -- actual latency depends on the target hardware.
    """
    models = [
        {"name": "Phi-3-mini", "params": 3.8, "base_quality": 0.90},
        {"name": "Llama-3.2-3B", "params": 3.2, "base_quality": 0.92},
        {"name": "Llama-3.2-1B", "params": 1.2, "base_quality": 0.88},
        {"name": "Qwen-2.5-3B", "params": 3.1, "base_quality": 0.91},
        {"name": "Gemma-2-2B", "params": 2.6, "base_quality": 0.89},
    ]
    quant_configs = [
        {"bits": 4, "quality_mult": 0.97, "name": "Q4_K_M"},
        {"bits": 5, "quality_mult": 0.98, "name": "Q5_K_M"},
        {"bits": 8, "quality_mult": 0.99, "name": "Q8_0"},
    ]

    candidates = []
    for model in models:
        for quant in quant_configs:
            size_gb = model["params"] * (quant["bits"] / 8)
            quality = model["base_quality"] * quant["quality_mult"]
            if size_gb < device_ram_gb * 0.7:  # 70% RAM budget
                candidates.append({
                    "model": model["name"],
                    "quantization": quant["name"],
                    "size_gb": round(size_gb, 2),
                    "quality": round(quality, 2),
                    "meets_threshold": quality >= quality_threshold
                })

    # Sort by quality and return the best candidate
    candidates.sort(key=lambda x: x["quality"], reverse=True)
    return candidates[0] if candidates else None
9. Key Papers & Resources¶
| Paper/Resource | Year | Key Contribution |
|---|---|---|
| llm.npu | ASPLOS 2025 | First NPU-aware LLM inference system |
| m²LLM | IEEE 2025 | Multi-dimensional mobile optimization |
| EDGE-LLM | ACM 2024 | Memory-efficient edge tuning |
| GPTQ Paper | 2023 | GPU-optimized quantization |
| AWQ Paper | 2023 | Activation-aware quantization |
| Sustainable LLM Inference for Edge | arXiv 2025 | Energy efficiency evaluation |
| PartInfer | OpenReview 2025 | Partitioned inference for edge |
10. Formulas Quick Reference¶
Model Size¶
$$ \text{Model size (bytes)} = N_{\text{params}} \times \frac{\text{bits per param}}{8} $$
KV-Cache Size¶
$$ \text{KV-cache (bytes)} = 2 \times L \times H \times D_h \times S \times B $$
Where \(L\) = layers, \(H\) = heads, \(D_h\) = head dim, \(S\) = seq len, \(B\) = bytes per value
Quantization Quality Loss¶
$$ \text{Quality}_{\text{quantized}} \approx \text{Quality}_{\text{FP16}} \times \text{Retention}_{b} $$
Where \(\text{Retention}_4 \approx 0.95\), \(\text{Retention}_8 \approx 0.99\)
Inference Latency¶
Decode is typically memory-bandwidth-bound, so a rough per-token estimate is $$ t_{\text{token}} \approx \frac{\text{Model size (bytes)}}{\text{Memory bandwidth (bytes/s)}} $$ For example, a 3.5 GB Q4 model read over ~50 GB/s of effective bandwidth gives roughly 70 ms per token.
Common Misconceptions¶
Misconception: INT4 quantization always loses 5% of quality
That figure is an average for naive round-to-nearest quantization. AWQ (activation-aware quantization) retains 97-98% of quality at INT4 because it protects salient weights. In practice, for classification/summarization tasks the gap to FP16 is often within 1-2%. The losses show up mainly on long reasoning chains and arithmetic.
Misconception: an NPU is always faster than a GPU for LLM inference
An NPU (Neural Processing Unit) is optimized for parallel matrix operations, which is ideal for the prefill phase. But the decode phase (sequential token generation) is often faster on the CPU/GPU. That is why the best frameworks (llm.npu) use a hybrid approach: prefill on the NPU, decode on the CPU.
Misconception: an on-device LLM can fully replace a cloud API
On-device models are limited to roughly 1-7B parameters, which is comparable to GPT-3.5 but falls well short of GPT-4/Claude on complex tasks. The best strategy is hybrid: simple requests are handled locally (privacy + speed), complex ones are routed to the cloud.
Interview Questions¶
Which quantization strategy should you choose for deploying a 7B model on Android?
Weak answer: "Let's take GPTQ, it quantizes best" -- GPTQ is optimized for GPUs, and Android devices usually have no CUDA.
Strong answer: For Android the best choice is GGUF Q4_K_M via llama.cpp or MLC-LLM. GGUF is CPU-optimized and supports Vulkan GPU acceleration. A 7B model in Q4_K_M takes ~4 GB, which fits devices with 12+ GB of RAM (Samsung S24 Ultra). For 8 GB devices it is better to take a 3B model (Llama 3.2 3B) -- ~2 GB in Q4_K_M. If a Snapdragon 8 Gen 3+ is available, the NPU can be engaged via the Hexagon SDK to speed up the prefill phase.
How do you measure model quality degradation after quantization?
Weak answer: "Run a few examples and eyeball the output" -- no methodology, not representative.
Strong answer: You need a systematic evaluation: (1) a benchmark suite covering the target use cases (MMLU, HumanEval, or domain-specific); (2) comparing FP16 vs INT4 metrics -- if retention > 95%, the model is deployable; (3) perplexity on a reference corpus as a proxy metric; (4) A/B testing on real users for subjective quality. Formula: Quality_quantized ~ Quality_FP16 x Retention_b, where Retention_4 ~ 0.95-0.98 for AWQ.
Design an on-device LLM system with a fallback to the cloud.
Weak answer: "If the model fails, send the request to the cloud" -- no routing criteria, no latency budget.
Strong answer: A two-tier architecture: (1) Router -- classifies the request by complexity (simple/complex) via a lightweight model or rule-based logic (prompt length, presence of code/math). (2) Simple -> on-device 3B model (GGUF Q4_K_M), latency target <200 ms, privacy-preserving. (3) Complex -> cloud API (GPT-4/Claude), latency <2 s, quality-critical. (4) Fallback: if on-device confidence < threshold or timeout > 500 ms, escalate to the cloud. (5) Monitoring: track routing ratio, P50/P99 latency, and user satisfaction per path. A rule-based sketch of this router is shown below.
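A minimal sketch of such a router under the assumptions above; the local_llm and cloud_llm callables, the regex heuristic, and the thresholds are hypothetical stand-ins rather than a production classifier:

```python
import re

# Crude heuristic for "complex" prompts: code keywords or inline arithmetic
CODE_OR_MATH = re.compile(r"\bdef |\bclass |\bSELECT |\bprove\b|\bintegral\b|\d+\s*[*^/]\s*\d+")

def route(prompt: str, max_local_words: int = 512) -> str:
    """Rule-based router: short prompts without code/math stay on device."""
    if len(prompt.split()) > max_local_words or CODE_OR_MATH.search(prompt):
        return "cloud"
    return "on_device"

def answer(prompt: str, local_llm, cloud_llm, confidence_threshold: float = 0.6) -> dict:
    """local_llm returns (text, confidence); cloud_llm returns text. Both are stubs."""
    if route(prompt) == "on_device":
        text, confidence = local_llm(prompt)
        if confidence >= confidence_threshold:
            return {"path": "on_device", "text": text}
    # Escalate: complex prompt, low on-device confidence, or local failure/timeout
    return {"path": "cloud", "text": cloud_llm(prompt)}
```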