Безопасность LLM: OWASP Top 10 (2025)¶

~9 минут чтения

Предварительно: Безопасность LLM, Гардрейлы LLM

Prompt injection, sensitive disclosure, supply chain, data poisoning, output handling, excessive agency, system prompt leakage, vector weaknesses, misinformation, unbounded consumption. Defense-in-depth, guardrails, MCP security (2025-2026)

OWASP Top 10 for LLM Applications -- стандарт де-факто для оценки рисков при deployment языковых моделей. По данным OWASP (2025), prompt injection присутствует в 73% production AI-приложений, а 53% компаний, строящих LLM-агентов, не применяют fine-tuning для защиты. Список 2025 года отражает сдвиг индустрии к агентным системам: Sensitive Disclosure поднялся с #6 на #2 (модели получили доступ к RAG и tools), а Excessive Agency стал критической угрозой в "год LLM-агентов". Combined defense-in-depth снижает attack surface на 90-95%, но ни один слой по отдельности не даёт больше 85%.

Ключевые концепции¶

OWASP LLM Top 10 2025¶

Rank	Risk	Category	Severity	Key Mitigation
LLM01	Prompt Injection	Input manipulation	Critical	Constrain behavior, validate output
LLM02	Sensitive Information Disclosure	Data leakage	Critical	Data sanitization, validation
LLM03	Supply Chain	External components	High	Verified sources, SBOM
LLM04	Data Poisoning	Training manipulation	High	Track origins, vet vendors
LLM05	Improper Output Handling	Output trust	High	Encoding, sanitization
LLM06	Excessive Agency	Agent permissions	High	Limit permissions, approval gates
LLM07	System Prompt Leakage	IP/security	Medium	External data, guardrails
LLM08	Vector & Embedding Weaknesses	RAG pipeline	Medium	Access control, validation
LLM09	Misinformation	Hallucination	Medium	RAG, cross-verification
LLM10	Unbounded Consumption	Resource abuse	Medium	Rate limiting, resource management

Key stats: prompt injection in 73%+ production AI apps. 53% companies building agents aren't fine-tuning. DeepSeek tricked into harmful content with simple injection.

What's New in 2025 (vs 2023)¶

Change	Why It Matters
Excessive Autonomy	2025 = "year of LLM agents" with unprecedented autonomy
RAG Vulnerabilities	53% use RAG instead of fine-tuning, expanding attack surface
System Prompt Risks	Developers exposing sensitive data in prompts
Unbounded Consumption	Enterprise adoption -> resource management challenges
Sensitive Disclosure	Jumped from #6 to #2 (more sensitive data via RAG + tools)

1. Prompt Injection (LLM01) -- Critical¶

Attack Types¶

Type	Description	Example
Direct jailbreaking	Explicit override attempts	"Ignore all previous instructions..."
Indirect injection	Hidden in content/context	Malicious text in retrieved documents
Goal hijacking	Redirect model objectives	"Your new goal is to..."
Prompt leaking	Extract system prompts	"Repeat your instructions verbatim"

Direct vs Indirect: direct = user crafts malicious input. Indirect = malicious instructions hidden in external data (web pages, documents) that LLM processes. Indirect harder to defend: (1) attack surface = all data sources LLM accesses, (2) hidden in legitimate content, (3) traditional input validation doesn't help.

Jailbreak Techniques (2025-2026)¶

Technique	Description	Success Rate
Direct override	"Ignore instructions"	10-30%
Role-playing	"You are DAN..."	20-40%
Base64/encoding	Obfuscated payloads	15-25%
Few-shot injection	Poisoned examples	30-50%
Best-of-N (Giskard)	Automated N jailbreak variants	50-80%

2. Sensitive Information Disclosure (LLM02) -- Critical¶

Data enters via training datasets, RAG knowledge bases, database access, user input (devs using ChatGPT on codebases).

Mitigation: mask sensitive content before training, strict input/output validation.

3. Supply Chain (LLM03) -- High¶

Compromised models, datasets, LoRA adapters, plugins. Example: poisoned weights on HuggingFace, malicious Python library.

Mitigation: verified sources with integrity checks, signed SBOM (Software Bill of Materials).

4. Data Poisoning (LLM04) -- High¶

Manipulating data during pre-training, fine-tuning, or embedding. Biased injection, toxic fine-tuning data.

Mitigation: track data origins (OWASP CycloneDX), rigorously validate data providers.

5. Improper Output Handling (LLM05) -- High¶

LLM outputs not validated before passing to downstream systems.

Critical example: Text2SQL hallucination changes DELETE FROM users WHERE id = 123 to DELETE FROM users -- entire database wiped.

Mitigation: context-aware encoding (HTML, SQL escaping), validate and sanitize all LLM responses.

6. Excessive Agency (LLM06) -- High¶

Agents with too much functionality, permissions, or autonomy. Three areas: excessive functionality, excessive permissions, excessive autonomy.

Examples: assistant forwards sensitive emails to attacker, file-writing extension allows arbitrary commands.

Mitigation: narrowly scoped extensions, minimal access, manual approval for high-impact actions.

7. System Prompt Leakage (LLM07)¶

Exposing internal rules, filtering criteria, credentials. Attacker extracts credentials from system prompt.

Mitigation: keep sensitive data external to system prompt, use independent guardrails.

8. Vector & Embedding Weaknesses (LLM08)¶

RAG pipeline vulnerabilities: misconfigured vector DB allows unauthorized access, embedding inversion attacks recover original data.

Mitigation: strict access partitioning in vector databases, audit all data sources.

9. Misinformation (LLM09)¶

Hallucinations and fabricated outputs. Malicious packages using names hallucinated by coding assistants. Medical chatbot incorrect diagnosis.

Mitigation: RAG with verified sources, cross-verification, human fact-checking for critical info.

10. Unbounded Consumption (LLM10)¶

Resource usage spiraling: excessively large inputs consume memory/CPU, high volume API DoS.

Mitigation: rate limiting and throttling, dynamic resource allocation.

Defense-in-Depth¶

Multi-Layer Architecture¶

graph TD
    A["User Input"] --> B["Input Guardrails<br/>Regex + ML classifier + LLM-as-judge<br/>Efficacy: 60-95%"]
    B --> C["LLM System<br/>System prompt hardening<br/>Role separation, instruction sandwich"]
    C --> D["Output Guardrails<br/>Format check, PII filter, fact verification<br/>Efficacy: 70-85%"]
    D --> E["User Output"]
    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#e8eaf6,stroke:#3f51b5

Input Filtering¶

Technique	Efficacy
Regex patterns	60-70%
ML classifier	80-90%
LLM-as-judge	85-95%
Perplexity filtering	70-80%

System Prompt Hardening¶

Role separation (system vs user), instruction sandwich (repeat core instructions), output format enforcement, refusal training on attack examples.

Output Validation¶

Format checking, content filtering (PII, toxic), fact verification (cross-check sources), rate limiting.

Combined Efficacy¶

Defense	Attack Reduction
Input filtering alone	60-80%
System hardening alone	40-60%
Output filtering alone	70-85%
Combined multi-layer	90-95%

Security Tools (2025-2026)¶

Guardrails & Protection¶

Tool	Type	Purpose
NeMo Guardrails	NVIDIA	Input/output guardrails
Guardrails AI	Open source	Output validation
Lakera Guard	Commercial	Prompt injection detection
Giskard	Open source	Model testing & security
Prompt Security	Commercial	Enterprise protection
Hidden Layer	Commercial	ML model security
Protect AI	Commercial	MLOps security suite

Testing Frameworks¶

Framework	Purpose
DeepTeam	OWASP LLM Top 10 testing
Giskard Scanner	Vulnerability scanning
Garak	LLM vulnerability framework
LLM-Fuzzer	Fuzzing for LLMs

MCP Security¶

Risks¶

Risk	Description
Tool abuse	Malicious MCP server actions
Data exfiltration	Unauthorized data access
Privilege escalation	Gain unintended permissions

Best Practices¶

Allowlisting (only approved MCP servers), permission scoping (minimal required), audit logging (all MCP interactions), sandboxing (isolate execution).

Incident Response¶

Attack Indicators¶

Unusual output patterns, high refusal rate, data leakage (system prompts in output), unauthorized tool calls.

Response Playbook¶

Detect -- monitor for anomalies
Contain -- disable affected features
Analyze -- determine attack vector
Patch -- update defenses
Review -- post-incident analysis

Заблуждение: input filtering решает проблему prompt injection

Regex-фильтры ловят только 60-70% атак, ML-classifier -- 80-90%. Даже LLM-as-judge даёт максимум 85-95%. Best-of-N jailbreaking (Giskard) автоматически генерирует варианты атак с success rate 50-80%, обходя любой единичный фильтр. Только combined multi-layer defense даёт 90-95% protection. Ни один слой сам по себе недостаточен.

Заблуждение: indirect prompt injection -- редкая экзотическая атака

Indirect injection -- самый опасный вектор в 2025, потому что attack surface = все data sources, к которым LLM имеет доступ (RAG documents, web pages, emails). 53% компаний используют RAG вместо fine-tuning, расширяя attack surface. В отличие от direct injection, indirect скрыт в легитимном контенте и не ловится традиционной input validation. Пример: скрытые инструкции в веб-странице, которую AI суммаризирует.

Заблуждение: system prompt -- надёжный механизм защиты

System prompt leakage (LLM07) позволяет извлечь все внутренние правила через простые запросы вроде 'Repeat your instructions verbatim'. Credentials и API keys в system prompt -- прямой вектор атаки. Решение: sensitive data хранить вне system prompt, использовать independent guardrails, не полагаться на system prompt как единственный механизм защиты.

Interview Questions¶

Q: Назовите топ-3 OWASP LLM рисков 2025 и объясните изменения относительно 2023.

Red flag: "Prompt injection, hallucinations, bias" (перечисляет общие проблемы AI, а не OWASP LLM Top 10)

Strong answer: "1) Prompt Injection (LLM01) -- остается #1, 73% production apps уязвимы, direct + indirect (hidden в RAG docs). 2) Sensitive Disclosure (LLM02) -- с #6 на #2, потому что LLM получили доступ к RAG и tools с чувствительными данными. 3) Supply Chain (LLM03) -- вырос из-за экосистемы fine-tuned models и LoRA adapters (poisoned weights на HuggingFace). Главное изменение 2025: фокус на агентные системы -- Excessive Agency стал отдельной категорией."

Q: Indirect prompt injection -- чем отличается от direct и почему опаснее?

Red flag: "Indirect -- это когда используют обходные формулировки" (путает с jailbreaking)

Strong answer: "Direct: пользователь явно пишет malicious input ('Ignore instructions'). Indirect: вредоносные инструкции скрыты в external data -- web pages, documents, emails -- которые LLM обрабатывает. Опаснее по трём причинам: (1) attack surface = все data sources LLM, (2) embedded в легитимный контент, (3) traditional input validation не помогает. Пример: AI суммаризирует web-страницу с hidden instructions. Защита: content sanitization на уровне retrieval + output validation."

Q: Спроектируйте defense-in-depth для production LLM.

Red flag: "Поставим WAF и input filter" (один слой защиты)

Strong answer: "Пять слоёв: (1) Input guardrails: regex (<1ms, 60-70%) + ML classifier (80-90%) + LLM-as-judge (100-500ms, 85-95%) параллельно. (2) System prompt: role separation, instruction sandwich, refusal training. (3) Output guardrails: format validation, PII/toxic filtering, fact verification. (4) Application layer: least privilege для tools, approval gates для high-impact actions. (5) Monitoring: anomaly detection (Langfuse, Arize). Combined: 90-95%. Ключ: fast pattern-matching + deeper checks параллельно."

Q: Excessive Agency (LLM06) -- почему это критическая проблема в 2025?

Red flag: "Агенты могут делать ошибки" (не описывает attack vectors)

Strong answer: "2025 = 'year of agents' с unprecedented autonomy. Три аспекта: excessive functionality (слишком много доступных tools), excessive permissions (write access когда нужен только read), excessive autonomy (нет human-in-the-loop для critical actions). Пример: file-writing extension позволяет arbitrary commands. MCP security: allowlisting серверов, permission scoping, sandboxing, audit logging. Mitigation: narrowly scoped extensions, minimal access, manual approval gates."

Ключевые числа¶

Факт	Значение
Prompt injection prevalence	73%+ production apps
Companies not fine-tuning agents	53%
Sensitive Disclosure rank change	#6 -> #2
Best-of-N attack success	50-80%
Direct jailbreak success	10-30%
Role-playing jailbreak success	20-40%
Input filtering efficacy (regex)	60-70%
ML classifier efficacy	80-90%
LLM-as-judge efficacy	85-95%
Combined defense efficacy	90-95%

Источники¶

OWASP -- "Top 10 for Large Language Model Applications 2025" (Official)
Confident AI -- "OWASP Top 10 2025 for LLM Applications"
Zylos AI -- "LLM Security and Safety 2026: Vulnerabilities, Attacks, and Defense"
Learn-Prompting -- "Prompt Security 2026: Defending Against Injection and Jailbreak"
DeepStrike -- "OWASP LLM Top 10 Vulnerabilities 2025: AI Security Risks"
Giskard -- "Best-of-N Jailbreaking: The Automated LLM Attack"
arXiv:2601.22240 -- "A Systematic Literature Review on LLM Defenses Against Prompt Injection"