Безопасность LLM: OWASP Top 10 (2025)¶
~9 минут чтения
Предварительно: Безопасность LLM, Гардрейлы LLM
Prompt injection, sensitive disclosure, supply chain, data poisoning, output handling, excessive agency, system prompt leakage, vector weaknesses, misinformation, unbounded consumption. Defense-in-depth, guardrails, MCP security (2025-2026)
OWASP Top 10 for LLM Applications -- стандарт де-факто для оценки рисков при deployment языковых моделей. По данным OWASP (2025), prompt injection присутствует в 73% production AI-приложений, а 53% компаний, строящих LLM-агентов, не применяют fine-tuning для защиты. Список 2025 года отражает сдвиг индустрии к агентным системам: Sensitive Disclosure поднялся с #6 на #2 (модели получили доступ к RAG и tools), а Excessive Agency стал критической угрозой в "год LLM-агентов". Combined defense-in-depth снижает attack surface на 90-95%, но ни один слой по отдельности не даёт больше 85%.
Ключевые концепции¶
OWASP LLM Top 10 2025¶
| Rank | Risk | Category | Severity | Key Mitigation |
|---|---|---|---|---|
| LLM01 | Prompt Injection | Input manipulation | Critical | Constrain behavior, validate output |
| LLM02 | Sensitive Information Disclosure | Data leakage | Critical | Data sanitization, validation |
| LLM03 | Supply Chain | External components | High | Verified sources, SBOM |
| LLM04 | Data Poisoning | Training manipulation | High | Track origins, vet vendors |
| LLM05 | Improper Output Handling | Output trust | High | Encoding, sanitization |
| LLM06 | Excessive Agency | Agent permissions | High | Limit permissions, approval gates |
| LLM07 | System Prompt Leakage | IP/security | Medium | External data, guardrails |
| LLM08 | Vector & Embedding Weaknesses | RAG pipeline | Medium | Access control, validation |
| LLM09 | Misinformation | Hallucination | Medium | RAG, cross-verification |
| LLM10 | Unbounded Consumption | Resource abuse | Medium | Rate limiting, resource management |
Key stats: prompt injection in 73%+ production AI apps. 53% companies building agents aren't fine-tuning. DeepSeek tricked into harmful content with simple injection.
What's New in 2025 (vs 2023)¶
| Change | Why It Matters |
|---|---|
| Excessive Autonomy | 2025 = "year of LLM agents" with unprecedented autonomy |
| RAG Vulnerabilities | 53% use RAG instead of fine-tuning, expanding attack surface |
| System Prompt Risks | Developers exposing sensitive data in prompts |
| Unbounded Consumption | Enterprise adoption -> resource management challenges |
| Sensitive Disclosure | Jumped from #6 to #2 (more sensitive data via RAG + tools) |
1. Prompt Injection (LLM01) -- Critical¶
Attack Types¶
| Type | Description | Example |
|---|---|---|
| Direct jailbreaking | Explicit override attempts | "Ignore all previous instructions..." |
| Indirect injection | Hidden in content/context | Malicious text in retrieved documents |
| Goal hijacking | Redirect model objectives | "Your new goal is to..." |
| Prompt leaking | Extract system prompts | "Repeat your instructions verbatim" |
Direct vs Indirect: direct = user crafts malicious input. Indirect = malicious instructions hidden in external data (web pages, documents) that LLM processes. Indirect harder to defend: (1) attack surface = all data sources LLM accesses, (2) hidden in legitimate content, (3) traditional input validation doesn't help.
Jailbreak Techniques (2025-2026)¶
| Technique | Description | Success Rate |
|---|---|---|
| Direct override | "Ignore instructions" | 10-30% |
| Role-playing | "You are DAN..." | 20-40% |
| Base64/encoding | Obfuscated payloads | 15-25% |
| Few-shot injection | Poisoned examples | 30-50% |
| Best-of-N (Giskard) | Automated N jailbreak variants | 50-80% |
2. Sensitive Information Disclosure (LLM02) -- Critical¶
Data enters via training datasets, RAG knowledge bases, database access, user input (devs using ChatGPT on codebases).
Mitigation: mask sensitive content before training, strict input/output validation.
3. Supply Chain (LLM03) -- High¶
Compromised models, datasets, LoRA adapters, plugins. Example: poisoned weights on HuggingFace, malicious Python library.
Mitigation: verified sources with integrity checks, signed SBOM (Software Bill of Materials).
4. Data Poisoning (LLM04) -- High¶
Manipulating data during pre-training, fine-tuning, or embedding. Biased injection, toxic fine-tuning data.
Mitigation: track data origins (OWASP CycloneDX), rigorously validate data providers.
5. Improper Output Handling (LLM05) -- High¶
LLM outputs not validated before passing to downstream systems.
Critical example: Text2SQL hallucination changes DELETE FROM users WHERE id = 123 to DELETE FROM users -- entire database wiped.
Mitigation: context-aware encoding (HTML, SQL escaping), validate and sanitize all LLM responses.
6. Excessive Agency (LLM06) -- High¶
Agents with too much functionality, permissions, or autonomy. Three areas: excessive functionality, excessive permissions, excessive autonomy.
Examples: assistant forwards sensitive emails to attacker, file-writing extension allows arbitrary commands.
Mitigation: narrowly scoped extensions, minimal access, manual approval for high-impact actions.
7. System Prompt Leakage (LLM07)¶
Exposing internal rules, filtering criteria, credentials. Attacker extracts credentials from system prompt.
Mitigation: keep sensitive data external to system prompt, use independent guardrails.
8. Vector & Embedding Weaknesses (LLM08)¶
RAG pipeline vulnerabilities: misconfigured vector DB allows unauthorized access, embedding inversion attacks recover original data.
Mitigation: strict access partitioning in vector databases, audit all data sources.
9. Misinformation (LLM09)¶
Hallucinations and fabricated outputs. Malicious packages using names hallucinated by coding assistants. Medical chatbot incorrect diagnosis.
Mitigation: RAG with verified sources, cross-verification, human fact-checking for critical info.
10. Unbounded Consumption (LLM10)¶
Resource usage spiraling: excessively large inputs consume memory/CPU, high volume API DoS.
Mitigation: rate limiting and throttling, dynamic resource allocation.
Defense-in-Depth¶
Multi-Layer Architecture¶
graph TD
A["User Input"] --> B["Input Guardrails<br/>Regex + ML classifier + LLM-as-judge<br/>Efficacy: 60-95%"]
B --> C["LLM System<br/>System prompt hardening<br/>Role separation, instruction sandwich"]
C --> D["Output Guardrails<br/>Format check, PII filter, fact verification<br/>Efficacy: 70-85%"]
D --> E["User Output"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e8eaf6,stroke:#3f51b5
Input Filtering¶
| Technique | Efficacy |
|---|---|
| Regex patterns | 60-70% |
| ML classifier | 80-90% |
| LLM-as-judge | 85-95% |
| Perplexity filtering | 70-80% |
System Prompt Hardening¶
Role separation (system vs user), instruction sandwich (repeat core instructions), output format enforcement, refusal training on attack examples.
Output Validation¶
Format checking, content filtering (PII, toxic), fact verification (cross-check sources), rate limiting.
Combined Efficacy¶
| Defense | Attack Reduction |
|---|---|
| Input filtering alone | 60-80% |
| System hardening alone | 40-60% |
| Output filtering alone | 70-85% |
| Combined multi-layer | 90-95% |
Security Tools (2025-2026)¶
Guardrails & Protection¶
| Tool | Type | Purpose |
|---|---|---|
| NeMo Guardrails | NVIDIA | Input/output guardrails |
| Guardrails AI | Open source | Output validation |
| Lakera Guard | Commercial | Prompt injection detection |
| Giskard | Open source | Model testing & security |
| Prompt Security | Commercial | Enterprise protection |
| Hidden Layer | Commercial | ML model security |
| Protect AI | Commercial | MLOps security suite |
Testing Frameworks¶
| Framework | Purpose |
|---|---|
| DeepTeam | OWASP LLM Top 10 testing |
| Giskard Scanner | Vulnerability scanning |
| Garak | LLM vulnerability framework |
| LLM-Fuzzer | Fuzzing for LLMs |
MCP Security¶
Risks¶
| Risk | Description |
|---|---|
| Tool abuse | Malicious MCP server actions |
| Data exfiltration | Unauthorized data access |
| Privilege escalation | Gain unintended permissions |
Best Practices¶
Allowlisting (only approved MCP servers), permission scoping (minimal required), audit logging (all MCP interactions), sandboxing (isolate execution).
Incident Response¶
Attack Indicators¶
Unusual output patterns, high refusal rate, data leakage (system prompts in output), unauthorized tool calls.
Response Playbook¶
- Detect -- monitor for anomalies
- Contain -- disable affected features
- Analyze -- determine attack vector
- Patch -- update defenses
- Review -- post-incident analysis
Заблуждение: input filtering решает проблему prompt injection
Regex-фильтры ловят только 60-70% атак, ML-classifier -- 80-90%. Даже LLM-as-judge даёт максимум 85-95%. Best-of-N jailbreaking (Giskard) автоматически генерирует варианты атак с success rate 50-80%, обходя любой единичный фильтр. Только combined multi-layer defense даёт 90-95% protection. Ни один слой сам по себе недостаточен.
Заблуждение: indirect prompt injection -- редкая экзотическая атака
Indirect injection -- самый опасный вектор в 2025, потому что attack surface = все data sources, к которым LLM имеет доступ (RAG documents, web pages, emails). 53% компаний используют RAG вместо fine-tuning, расширяя attack surface. В отличие от direct injection, indirect скрыт в легитимном контенте и не ловится традиционной input validation. Пример: скрытые инструкции в веб-странице, которую AI суммаризирует.
Заблуждение: system prompt -- надёжный механизм защиты
System prompt leakage (LLM07) позволяет извлечь все внутренние правила через простые запросы вроде 'Repeat your instructions verbatim'. Credentials и API keys в system prompt -- прямой вектор атаки. Решение: sensitive data хранить вне system prompt, использовать independent guardrails, не полагаться на system prompt как единственный механизм защиты.
Interview Questions¶
Q: Назовите топ-3 OWASP LLM рисков 2025 и объясните изменения относительно 2023.
Red flag: "Prompt injection, hallucinations, bias" (перечисляет общие проблемы AI, а не OWASP LLM Top 10)
Strong answer: "1) Prompt Injection (LLM01) -- остается #1, 73% production apps уязвимы, direct + indirect (hidden в RAG docs). 2) Sensitive Disclosure (LLM02) -- с #6 на #2, потому что LLM получили доступ к RAG и tools с чувствительными данными. 3) Supply Chain (LLM03) -- вырос из-за экосистемы fine-tuned models и LoRA adapters (poisoned weights на HuggingFace). Главное изменение 2025: фокус на агентные системы -- Excessive Agency стал отдельной категорией."
Q: Indirect prompt injection -- чем отличается от direct и почему опаснее?
Red flag: "Indirect -- это когда используют обходные формулировки" (путает с jailbreaking)
Strong answer: "Direct: пользователь явно пишет malicious input ('Ignore instructions'). Indirect: вредоносные инструкции скрыты в external data -- web pages, documents, emails -- которые LLM обрабатывает. Опаснее по трём причинам: (1) attack surface = все data sources LLM, (2) embedded в легитимный контент, (3) traditional input validation не помогает. Пример: AI суммаризирует web-страницу с hidden instructions. Защита: content sanitization на уровне retrieval + output validation."
Q: Спроектируйте defense-in-depth для production LLM.
Red flag: "Поставим WAF и input filter" (один слой защиты)
Strong answer: "Пять слоёв: (1) Input guardrails: regex (<1ms, 60-70%) + ML classifier (80-90%) + LLM-as-judge (100-500ms, 85-95%) параллельно. (2) System prompt: role separation, instruction sandwich, refusal training. (3) Output guardrails: format validation, PII/toxic filtering, fact verification. (4) Application layer: least privilege для tools, approval gates для high-impact actions. (5) Monitoring: anomaly detection (Langfuse, Arize). Combined: 90-95%. Ключ: fast pattern-matching + deeper checks параллельно."
Q: Excessive Agency (LLM06) -- почему это критическая проблема в 2025?
Red flag: "Агенты могут делать ошибки" (не описывает attack vectors)
Strong answer: "2025 = 'year of agents' с unprecedented autonomy. Три аспекта: excessive functionality (слишком много доступных tools), excessive permissions (write access когда нужен только read), excessive autonomy (нет human-in-the-loop для critical actions). Пример: file-writing extension позволяет arbitrary commands. MCP security: allowlisting серверов, permission scoping, sandboxing, audit logging. Mitigation: narrowly scoped extensions, minimal access, manual approval gates."
Ключевые числа¶
| Факт | Значение |
|---|---|
| Prompt injection prevalence | 73%+ production apps |
| Companies not fine-tuning agents | 53% |
| Sensitive Disclosure rank change | #6 -> #2 |
| Best-of-N attack success | 50-80% |
| Direct jailbreak success | 10-30% |
| Role-playing jailbreak success | 20-40% |
| Input filtering efficacy (regex) | 60-70% |
| ML classifier efficacy | 80-90% |
| LLM-as-judge efficacy | 85-95% |
| Combined defense efficacy | 90-95% |
Источники¶
- OWASP -- "Top 10 for Large Language Model Applications 2025" (Official)
- Confident AI -- "OWASP Top 10 2025 for LLM Applications"
- Zylos AI -- "LLM Security and Safety 2026: Vulnerabilities, Attacks, and Defense"
- Learn-Prompting -- "Prompt Security 2026: Defending Against Injection and Jailbreak"
- DeepStrike -- "OWASP LLM Top 10 Vulnerabilities 2025: AI Security Risks"
- Giskard -- "Best-of-N Jailbreaking: The Automated LLM Attack"
- arXiv:2601.22240 -- "A Systematic Literature Review on LLM Defenses Against Prompt Injection"