Перейти к содержанию

Безопасность LLM: OWASP Top 10 (2025)

~9 минут чтения

Предварительно: Безопасность LLM, Гардрейлы LLM

Prompt injection, sensitive disclosure, supply chain, data poisoning, output handling, excessive agency, system prompt leakage, vector weaknesses, misinformation, unbounded consumption. Defense-in-depth, guardrails, MCP security (2025-2026)

OWASP Top 10 for LLM Applications -- стандарт де-факто для оценки рисков при deployment языковых моделей. По данным OWASP (2025), prompt injection присутствует в 73% production AI-приложений, а 53% компаний, строящих LLM-агентов, не применяют fine-tuning для защиты. Список 2025 года отражает сдвиг индустрии к агентным системам: Sensitive Disclosure поднялся с #6 на #2 (модели получили доступ к RAG и tools), а Excessive Agency стал критической угрозой в "год LLM-агентов". Combined defense-in-depth снижает attack surface на 90-95%, но ни один слой по отдельности не даёт больше 85%.


Ключевые концепции

OWASP LLM Top 10 2025

Rank Risk Category Severity Key Mitigation
LLM01 Prompt Injection Input manipulation Critical Constrain behavior, validate output
LLM02 Sensitive Information Disclosure Data leakage Critical Data sanitization, validation
LLM03 Supply Chain External components High Verified sources, SBOM
LLM04 Data Poisoning Training manipulation High Track origins, vet vendors
LLM05 Improper Output Handling Output trust High Encoding, sanitization
LLM06 Excessive Agency Agent permissions High Limit permissions, approval gates
LLM07 System Prompt Leakage IP/security Medium External data, guardrails
LLM08 Vector & Embedding Weaknesses RAG pipeline Medium Access control, validation
LLM09 Misinformation Hallucination Medium RAG, cross-verification
LLM10 Unbounded Consumption Resource abuse Medium Rate limiting, resource management

Key stats: prompt injection in 73%+ production AI apps. 53% companies building agents aren't fine-tuning. DeepSeek tricked into harmful content with simple injection.

What's New in 2025 (vs 2023)

Change Why It Matters
Excessive Autonomy 2025 = "year of LLM agents" with unprecedented autonomy
RAG Vulnerabilities 53% use RAG instead of fine-tuning, expanding attack surface
System Prompt Risks Developers exposing sensitive data in prompts
Unbounded Consumption Enterprise adoption -> resource management challenges
Sensitive Disclosure Jumped from #6 to #2 (more sensitive data via RAG + tools)

1. Prompt Injection (LLM01) -- Critical

Attack Types

Type Description Example
Direct jailbreaking Explicit override attempts "Ignore all previous instructions..."
Indirect injection Hidden in content/context Malicious text in retrieved documents
Goal hijacking Redirect model objectives "Your new goal is to..."
Prompt leaking Extract system prompts "Repeat your instructions verbatim"

Direct vs Indirect: direct = user crafts malicious input. Indirect = malicious instructions hidden in external data (web pages, documents) that LLM processes. Indirect harder to defend: (1) attack surface = all data sources LLM accesses, (2) hidden in legitimate content, (3) traditional input validation doesn't help.

Jailbreak Techniques (2025-2026)

Technique Description Success Rate
Direct override "Ignore instructions" 10-30%
Role-playing "You are DAN..." 20-40%
Base64/encoding Obfuscated payloads 15-25%
Few-shot injection Poisoned examples 30-50%
Best-of-N (Giskard) Automated N jailbreak variants 50-80%

2. Sensitive Information Disclosure (LLM02) -- Critical

Data enters via training datasets, RAG knowledge bases, database access, user input (devs using ChatGPT on codebases).

Mitigation: mask sensitive content before training, strict input/output validation.

3. Supply Chain (LLM03) -- High

Compromised models, datasets, LoRA adapters, plugins. Example: poisoned weights on HuggingFace, malicious Python library.

Mitigation: verified sources with integrity checks, signed SBOM (Software Bill of Materials).

4. Data Poisoning (LLM04) -- High

Manipulating data during pre-training, fine-tuning, or embedding. Biased injection, toxic fine-tuning data.

Mitigation: track data origins (OWASP CycloneDX), rigorously validate data providers.

5. Improper Output Handling (LLM05) -- High

LLM outputs not validated before passing to downstream systems.

Critical example: Text2SQL hallucination changes DELETE FROM users WHERE id = 123 to DELETE FROM users -- entire database wiped.

Mitigation: context-aware encoding (HTML, SQL escaping), validate and sanitize all LLM responses.

6. Excessive Agency (LLM06) -- High

Agents with too much functionality, permissions, or autonomy. Three areas: excessive functionality, excessive permissions, excessive autonomy.

Examples: assistant forwards sensitive emails to attacker, file-writing extension allows arbitrary commands.

Mitigation: narrowly scoped extensions, minimal access, manual approval for high-impact actions.

7. System Prompt Leakage (LLM07)

Exposing internal rules, filtering criteria, credentials. Attacker extracts credentials from system prompt.

Mitigation: keep sensitive data external to system prompt, use independent guardrails.

8. Vector & Embedding Weaknesses (LLM08)

RAG pipeline vulnerabilities: misconfigured vector DB allows unauthorized access, embedding inversion attacks recover original data.

Mitigation: strict access partitioning in vector databases, audit all data sources.

9. Misinformation (LLM09)

Hallucinations and fabricated outputs. Malicious packages using names hallucinated by coding assistants. Medical chatbot incorrect diagnosis.

Mitigation: RAG with verified sources, cross-verification, human fact-checking for critical info.

10. Unbounded Consumption (LLM10)

Resource usage spiraling: excessively large inputs consume memory/CPU, high volume API DoS.

Mitigation: rate limiting and throttling, dynamic resource allocation.


Defense-in-Depth

Multi-Layer Architecture

graph TD
    A["User Input"] --> B["Input Guardrails<br/>Regex + ML classifier + LLM-as-judge<br/>Efficacy: 60-95%"]
    B --> C["LLM System<br/>System prompt hardening<br/>Role separation, instruction sandwich"]
    C --> D["Output Guardrails<br/>Format check, PII filter, fact verification<br/>Efficacy: 70-85%"]
    D --> E["User Output"]
    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#e8eaf6,stroke:#3f51b5

Input Filtering

Technique Efficacy
Regex patterns 60-70%
ML classifier 80-90%
LLM-as-judge 85-95%
Perplexity filtering 70-80%

System Prompt Hardening

Role separation (system vs user), instruction sandwich (repeat core instructions), output format enforcement, refusal training on attack examples.

Output Validation

Format checking, content filtering (PII, toxic), fact verification (cross-check sources), rate limiting.

Combined Efficacy

Defense Attack Reduction
Input filtering alone 60-80%
System hardening alone 40-60%
Output filtering alone 70-85%
Combined multi-layer 90-95%

Security Tools (2025-2026)

Guardrails & Protection

Tool Type Purpose
NeMo Guardrails NVIDIA Input/output guardrails
Guardrails AI Open source Output validation
Lakera Guard Commercial Prompt injection detection
Giskard Open source Model testing & security
Prompt Security Commercial Enterprise protection
Hidden Layer Commercial ML model security
Protect AI Commercial MLOps security suite

Testing Frameworks

Framework Purpose
DeepTeam OWASP LLM Top 10 testing
Giskard Scanner Vulnerability scanning
Garak LLM vulnerability framework
LLM-Fuzzer Fuzzing for LLMs

MCP Security

Risks

Risk Description
Tool abuse Malicious MCP server actions
Data exfiltration Unauthorized data access
Privilege escalation Gain unintended permissions

Best Practices

Allowlisting (only approved MCP servers), permission scoping (minimal required), audit logging (all MCP interactions), sandboxing (isolate execution).


Incident Response

Attack Indicators

Unusual output patterns, high refusal rate, data leakage (system prompts in output), unauthorized tool calls.

Response Playbook

  1. Detect -- monitor for anomalies
  2. Contain -- disable affected features
  3. Analyze -- determine attack vector
  4. Patch -- update defenses
  5. Review -- post-incident analysis

Заблуждение: input filtering решает проблему prompt injection

Regex-фильтры ловят только 60-70% атак, ML-classifier -- 80-90%. Даже LLM-as-judge даёт максимум 85-95%. Best-of-N jailbreaking (Giskard) автоматически генерирует варианты атак с success rate 50-80%, обходя любой единичный фильтр. Только combined multi-layer defense даёт 90-95% protection. Ни один слой сам по себе недостаточен.

Заблуждение: indirect prompt injection -- редкая экзотическая атака

Indirect injection -- самый опасный вектор в 2025, потому что attack surface = все data sources, к которым LLM имеет доступ (RAG documents, web pages, emails). 53% компаний используют RAG вместо fine-tuning, расширяя attack surface. В отличие от direct injection, indirect скрыт в легитимном контенте и не ловится традиционной input validation. Пример: скрытые инструкции в веб-странице, которую AI суммаризирует.

Заблуждение: system prompt -- надёжный механизм защиты

System prompt leakage (LLM07) позволяет извлечь все внутренние правила через простые запросы вроде 'Repeat your instructions verbatim'. Credentials и API keys в system prompt -- прямой вектор атаки. Решение: sensitive data хранить вне system prompt, использовать independent guardrails, не полагаться на system prompt как единственный механизм защиты.

Interview Questions

Q: Назовите топ-3 OWASP LLM рисков 2025 и объясните изменения относительно 2023.

❌ Red flag: "Prompt injection, hallucinations, bias" (перечисляет общие проблемы AI, а не OWASP LLM Top 10)

✅ Strong answer: "1) Prompt Injection (LLM01) -- остается #1, 73% production apps уязвимы, direct + indirect (hidden в RAG docs). 2) Sensitive Disclosure (LLM02) -- с #6 на #2, потому что LLM получили доступ к RAG и tools с чувствительными данными. 3) Supply Chain (LLM03) -- вырос из-за экосистемы fine-tuned models и LoRA adapters (poisoned weights на HuggingFace). Главное изменение 2025: фокус на агентные системы -- Excessive Agency стал отдельной категорией."


Q: Indirect prompt injection -- чем отличается от direct и почему опаснее?

❌ Red flag: "Indirect -- это когда используют обходные формулировки" (путает с jailbreaking)

✅ Strong answer: "Direct: пользователь явно пишет malicious input ('Ignore instructions'). Indirect: вредоносные инструкции скрыты в external data -- web pages, documents, emails -- которые LLM обрабатывает. Опаснее по трём причинам: (1) attack surface = все data sources LLM, (2) embedded в легитимный контент, (3) traditional input validation не помогает. Пример: AI суммаризирует web-страницу с hidden instructions. Защита: content sanitization на уровне retrieval + output validation."


Q: Спроектируйте defense-in-depth для production LLM.

❌ Red flag: "Поставим WAF и input filter" (один слой защиты)

✅ Strong answer: "Пять слоёв: (1) Input guardrails: regex (<1ms, 60-70%) + ML classifier (80-90%) + LLM-as-judge (100-500ms, 85-95%) параллельно. (2) System prompt: role separation, instruction sandwich, refusal training. (3) Output guardrails: format validation, PII/toxic filtering, fact verification. (4) Application layer: least privilege для tools, approval gates для high-impact actions. (5) Monitoring: anomaly detection (Langfuse, Arize). Combined: 90-95%. Ключ: fast pattern-matching + deeper checks параллельно."


Q: Excessive Agency (LLM06) -- почему это критическая проблема в 2025?

❌ Red flag: "Агенты могут делать ошибки" (не описывает attack vectors)

✅ Strong answer: "2025 = 'year of agents' с unprecedented autonomy. Три аспекта: excessive functionality (слишком много доступных tools), excessive permissions (write access когда нужен только read), excessive autonomy (нет human-in-the-loop для critical actions). Пример: file-writing extension позволяет arbitrary commands. MCP security: allowlisting серверов, permission scoping, sandboxing, audit logging. Mitigation: narrowly scoped extensions, minimal access, manual approval gates."

Ключевые числа

Факт Значение
Prompt injection prevalence 73%+ production apps
Companies not fine-tuning agents 53%
Sensitive Disclosure rank change #6 -> #2
Best-of-N attack success 50-80%
Direct jailbreak success 10-30%
Role-playing jailbreak success 20-40%
Input filtering efficacy (regex) 60-70%
ML classifier efficacy 80-90%
LLM-as-judge efficacy 85-95%
Combined defense efficacy 90-95%

Источники

  1. OWASP -- "Top 10 for Large Language Model Applications 2025" (Official)
  2. Confident AI -- "OWASP Top 10 2025 for LLM Applications"
  3. Zylos AI -- "LLM Security and Safety 2026: Vulnerabilities, Attacks, and Defense"
  4. Learn-Prompting -- "Prompt Security 2026: Defending Against Injection and Jailbreak"
  5. DeepStrike -- "OWASP LLM Top 10 Vulnerabilities 2025: AI Security Risks"
  6. Giskard -- "Best-of-N Jailbreaking: The Automated LLM Attack"
  7. arXiv:2601.22240 -- "A Systematic Literature Review on LLM Defenses Against Prompt Injection"