What is AI/LLM Security
Defining AI/LLM Security
AI/LLM security is the discipline of identifying, assessing, and mitigating risks unique to systems that incorporate artificial intelligence — particularly large language models — into their architecture. It sits at the intersection of traditional application security, data security, and a set of entirely novel threat categories that emerge from the probabilistic, natural-language-driven nature of modern AI systems.
Unlike traditional software security where we protect deterministic code from well-understood exploit classes, AI security must contend with systems that:
- Accept natural language as a primary input — blurring the boundary between instructions and data
- Produce non-deterministic outputs — the same input can yield different results
- Learn from and potentially memorize training data — creating novel data leakage vectors
- Can be given autonomous agency — making decisions and taking actions with real-world consequences
- Exhibit emergent behaviors — capabilities and vulnerabilities that weren’t explicitly programmed
Why AI Security Matters
1. Expanding Attack Surface
Every LLM integration adds a new class of entry point. A traditional web application might have forms, APIs, and file uploads. An LLM-powered application adds:
- Natural language inputs that bypass conventional input validation
- Retrieved documents that can carry injected instructions
- Tool calls that bridge the LLM to backend systems
- Multi-modal inputs (images, audio, PDFs) that can contain hidden payloads
2. Data Exposure at Scale
LLMs process and generate text that may contain sensitive information:
- Training data memorization: Models can regurgitate snippets of training data, including PII, credentials, and proprietary code
- Context window leakage: System prompts, conversation history, and RAG-retrieved documents can be exfiltrated through crafted prompts
- Inference data logging: Prompts and completions flowing through APIs create new data stores that must be protected
3. Autonomous Action
Agentic AI systems can execute code, call APIs, send emails, modify databases, and interact with external services. A compromised agent doesn’t just leak data — it can take destructive action on behalf of the attacker with whatever permissions it has been granted.
4. Scale of Deployment
LLMs are being integrated into virtually every category of software — from customer support to code generation to medical diagnosis to legal research. The blast radius of a novel LLM vulnerability class is enormous because it potentially affects every system using that pattern.
5. Non-Determinism and Unpredictability
Traditional security testing assumes reproducible behavior. LLMs introduce:
- Temperature-based randomness in outputs
- Sensitivity to prompt phrasing and ordering
- Emergent capabilities that appear at certain model scales
- Behavior changes after model updates or fine-tuning
This makes comprehensive security testing significantly harder — a prompt injection that fails 99 times might succeed on the 100th attempt.
How AI Security Differs from Traditional Software Security
| Dimension | Traditional Software Security | AI/LLM Security |
|---|---|---|
| Input nature | Structured data (forms, JSON, SQL) | Natural language, images, audio — unstructured and ambiguous |
| Input validation | Well-defined schemas, type checking, allowlists | No reliable way to separate instructions from data in natural language |
| Behavior model | Deterministic — same input produces same output | Probabilistic — outputs vary based on sampling, temperature, context |
| Attack taxonomy | Mature (OWASP Top 10, CWE, MITRE ATT&CK) | Emerging and evolving rapidly (OWASP Top 10 for LLMs, MITRE ATLAS) |
| Vulnerability discovery | Code review, SAST, DAST, fuzzing | Red-teaming, adversarial probing, manual prompt testing |
| Patching | Deploy code fix, vulnerability is resolved | Retraining is expensive; guardrails can often be bypassed |
| Trust boundaries | Clear (client/server, user/admin, internal/external) | Blurred — the LLM processes trusted and untrusted content in the same context window |
| Exploitation | Requires technical skill (crafting payloads, understanding protocols) | Can be done in plain English — dramatically lower barrier to entry |
| Defense maturity | Decades of tools, frameworks, and best practices | Early stage — most defenses are heuristic-based and incomplete |
| Supply chain | Libraries, packages, containers | All of the above PLUS model weights, training data, fine-tuning datasets, embedding models |
| Compliance frameworks | Well-established (PCI DSS, SOC 2, HIPAA) | Emerging (EU AI Act, NIST AI RMF, ISO/IEC 42001) |
The Fundamental Difference: No Code/Data Separation
In traditional computing, there is a clear distinction between code (instructions the machine executes) and data (information the machine processes). SQL injection, XSS, and command injection all exploit failures to maintain this boundary — but the boundary itself exists and can be enforced.
In LLM systems, there is no inherent separation between instructions and data. The system prompt, user input, retrieved documents, and tool outputs are all processed as a single stream of tokens. The model must infer which tokens are instructions to follow and which are content to process — and this inference can be manipulated. This is why prompt injection is often called the “SQL injection of AI” — except there is no prepared statement equivalent that fully solves it.
The AI Threat Landscape
Threat Categories Overview
mindmap
root((AI/LLM<br/>Threat Landscape))
**Input Attacks**
Prompt Injection
Jailbreaking
Adversarial Examples
Multi-modal Exploits
**Model Attacks**
Data/Model Poisoning
Model Theft / Extraction
Backdoor Attacks
Membership Inference
**Output Attacks**
Hallucination & Misinfo
Unsafe Code Generation
PII Leakage
Improper Output Handling
**Supply Chain**
Malicious Models
Poisoned Datasets
Compromised Frameworks
Namespace Hijacking
**Agentic Risks**
Excessive Agency
Tool Misuse
Confused Deputy
Privilege Escalation
Auto-exploit
**Infra & Ops**
Unbounded Consumption
API Key Exposure
Model Denial of Service
Prompt/Data Logging
1. Prompt Injection
The most discussed and arguably most dangerous LLM vulnerability. An attacker crafts input that causes the model to deviate from its intended instructions.
- Direct prompt injection: The user directly instructs the model to ignore its system prompt or perform unauthorized actions.
- Indirect prompt injection: Malicious instructions are embedded in external content (web pages, documents, emails) that the LLM retrieves and processes.
2. Data and Model Poisoning
Corrupting the training data or fine-tuning data to introduce backdoors, biases, or malicious behaviors into the model. This can happen at the pre-training, fine-tuning, or RAG data layer.
3. Model Theft and Extraction
Extracting proprietary model weights, architecture, or training data through:
- Repeated queries to reverse-engineer model behavior
- Side-channel attacks on inference infrastructure
- Insider threats with access to model artifacts
4. Supply Chain Attacks
Compromising upstream components: malicious models on Hugging Face, backdoored fine-tuning datasets, compromised inference frameworks, or poisoned vector database content.
5. Privacy Attacks
Extracting sensitive information that the model memorized during training:
- Training data extraction (membership inference, data extraction attacks)
- System prompt leakage
- PII exposure in generated outputs
6. Misuse and Abuse
Using LLMs to generate harmful content at scale:
- Deepfakes and synthetic media
- Phishing and social engineering content
- Malware generation
- Disinformation campaigns
7. Agent Exploitation
Targeting LLM-based agents that have tool access:
- Hijacking agent actions through prompt injection
- Exploiting excessive permissions (confused deputy attacks)
- Chaining multiple tool calls to achieve unauthorized objectives
The CIA Triad Applied to AI Systems
The classic CIA triad — Confidentiality, Integrity, and Availability — maps directly to AI/LLM security, but with AI-specific nuances:
Confidentiality
| Traditional | AI-Specific |
|---|---|
| Protect databases and files from unauthorized access | Prevent extraction of training data, system prompts, and conversation history |
| Encrypt data at rest and in transit | Protect model weights as intellectual property |
| Access control on sensitive resources | Prevent PII leakage through generated outputs |
| Ensure RAG-retrieved documents respect authorization boundaries | |
| Protect embedding vectors from inversion attacks |
Integrity
| Traditional | AI-Specific |
|---|---|
| Prevent unauthorized modification of data | Prevent data/model poisoning that corrupts model behavior |
| Ensure code hasn’t been tampered with | Verify model provenance and supply chain integrity |
| Input validation to prevent injection attacks | Defend against prompt injection that manipulates outputs |
| Prevent hallucinations that present false information as fact | |
| Ensure RAG pipeline retrieves authentic, untampered documents |
Availability
| Traditional | AI-Specific |
|---|---|
| Protect against DDoS | Prevent model denial of service via resource-intensive prompts |
| Ensure system uptime | Protect against unbounded consumption (token exhaustion, compute abuse) |
| Capacity planning and scaling | Rate limiting on inference APIs |
| Prevent model extraction attacks that degrade service quality | |
| Guard against adversarial inputs that cause inference failures |
Attack Surface of LLM Applications
An LLM application has a layered attack surface. Understanding each layer is critical for building a comprehensive defense strategy.
graph TD
USER["USER / ATTACKER"]
USER --> L1
subgraph L1["LAYER 1: INPUT LAYER"]
L1_DESC["User prompts (text, images, audio, files)<br/>API requests with prompt parameters<br/>Conversation history and session context<br/>Multi-modal inputs (images with hidden text, audio commands)"]
L1_ATK["Attacks: Direct prompt injection, jailbreaking,<br/>adversarial images, resource exhaustion via long inputs"]
end
L1 --> L2
subgraph L2["LAYER 2: RETRIEVAL LAYER (RAG)"]
L2_DESC["Document ingestion pipeline<br/>Embedding model (converts text to vectors)<br/>Vector database (stores and searches embeddings)<br/>Retrieved context (documents injected into prompt)"]
L2_ATK["Attacks: Indirect prompt injection via documents,<br/>RAG poisoning, embedding inversion,<br/>vector DB manipulation, document metadata injection"]
end
L2 --> L3
subgraph L3["LAYER 3: ORCHESTRATION / AGENTIC LAYER"]
L3_DESC["System prompt / instruction template<br/>Tool definitions and function schemas<br/>Chain-of-thought / planning logic<br/>Memory systems (short-term and long-term)<br/>Multi-agent communication"]
L3_ATK["Attacks: System prompt extraction, tool misuse via injection,<br/>excessive agency, confused deputy, memory poisoning,<br/>inter-agent prompt injection"]
end
L3 --> L4
subgraph L4["LAYER 4: MODEL LAYER"]
L4_DESC["Model weights and parameters<br/>Fine-tuning data and process<br/>Alignment / safety training (RLHF, Constitutional AI)<br/>Inference configuration (temperature, top-p, max tokens)"]
L4_ATK["Attacks: Model theft/extraction, training data extraction,<br/>fine-tuning poisoning, jailbreak bypass of alignment,<br/>membership inference"]
end
L4 --> L5
subgraph L5["LAYER 5: INFRASTRUCTURE / RUNTIME LAYER"]
L5_DESC["API gateway and authentication<br/>Logging and monitoring pipeline<br/>Cloud platform (Azure, AWS, GCP)<br/>GPU/TPU compute resources<br/>Network configuration and egress controls"]
L5_ATK["Attacks: API key theft, log injection, side-channel attacks<br/>on GPU memory, resource exhaustion / billing attacks,<br/>insufficient egress filtering"]
end
Defense-in-Depth for AI Systems
No single layer of defense is sufficient. A robust AI security posture requires controls at every layer:
- Input layer: Input filtering, content moderation, rate limiting, multi-modal scanning
- Retrieval layer: Document provenance verification, embedding monitoring, access-control-aware retrieval
- Orchestration layer: Least-privilege tool permissions, human-in-the-loop for sensitive actions, output filtering before tool execution
- Model layer: Model provenance verification, alignment testing, red-teaming, guardrail models
- Infrastructure layer: Standard cloud security controls, API authentication, logging, network segmentation, cost controls
Emerging Regulatory Landscape
AI security is increasingly shaped by regulation:
| Framework | Scope | Key Requirements |
|---|---|---|
| EU AI Act (2024) | EU-wide, risk-based regulation | Risk classification, transparency, conformity assessments for high-risk AI |
| NIST AI Risk Management Framework | US voluntary framework | Governance, mapping, measuring, and managing AI risks |
| ISO/IEC 42001 | International standard | AI management system requirements |
| MITRE ATLAS | Knowledge base | Adversarial threat landscape for AI systems, ATT&CK-style framework |
| OWASP Top 10 for LLMs | Application security guidance | Top 10 risks specific to LLM applications (covered in the next section) |
| Executive Order 14110 (US, 2023) | US policy | Safety and security requirements for frontier AI models |
Key Takeaways
-
AI security is not just “infosec + AI” — it requires understanding fundamentally new threat categories that don’t exist in traditional software.
-
The attack surface is layered — from user input through retrieval, orchestration, model, and infrastructure. Each layer requires distinct security controls.
-
Natural language is the new attack vector — prompt injection is possible precisely because LLMs cannot reliably distinguish instructions from data. This is an unsolved problem.
-
The barrier to exploitation is lower than ever — many AI attacks can be executed in plain English, without any programming or security expertise.
-
Defense-in-depth is essential — no single guardrail, filter, or alignment technique is sufficient. Security must be applied at every layer of the AI application stack.
-
The regulatory environment is evolving rapidly — organizations deploying AI must track emerging requirements from the EU AI Act, NIST AI RMF, and sector-specific guidance.
References
- OWASP. (2025). “OWASP Top 10 for Large Language Model Applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/
- MITRE. (2024). “ATLAS — Adversarial Threat Landscape for AI Systems.” https://atlas.mitre.org/
- NIST. (2023). “AI Risk Management Framework (AI RMF 1.0).” https://www.nist.gov/artificial-intelligence/risk-management-framework
- European Commission. (2024). “EU Artificial Intelligence Act.” https://artificialintelligenceact.eu/
- ISO/IEC. (2023). “ISO/IEC 42001:2023 — Artificial Intelligence Management System.” https://www.iso.org/standard/81230.html
- Greshake, K. et al. (2023). “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” https://arxiv.org/abs/2302.12173
- Perez, F. & Ribeiro, I. (2022). “Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs through a Global Scale Prompt Hacking Competition.” https://arxiv.org/abs/2311.16119
- Carlini, N. et al. (2023). “Extracting Training Data from Large Language Models.” https://arxiv.org/abs/2012.07805
- The White House. (2023). “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.” https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/
- Anthropic. (2023). “Anthropic’s Responsible Scaling Policy.” https://www.anthropic.com/index/anthropics-responsible-scaling-policy