← Back to AI/LLM Security

AI/LLM Red Teaming

16 min read

What Is AI Red Teaming

AI red teaming is the practice of systematically probing AI systems to identify failures in safety, security, fairness, and alignment — going well beyond what a traditional penetration test covers. While a pentest asks “can an attacker exploit this system?”, red teaming asks the broader question: “in what ways can this system cause harm?”

AI red teaming evaluates:

  • Security: Can the system be manipulated to perform unauthorized actions, leak data, or bypass access controls?
  • Safety: Can the system be induced to produce dangerous content — instructions for weapons, self-harm, CSAM, or other harmful material?
  • Fairness and bias: Does the system exhibit discriminatory behavior across demographic groups, languages, or cultural contexts?
  • Alignment: Does the system behave consistently with its stated purpose and values, even under adversarial pressure?
  • Truthfulness: Can the system be made to generate convincing misinformation, fabricated citations, or false claims?
  • Robustness: How does the system behave under edge cases, unusual inputs, and adversarial conditions?

Red teaming is not a one-time assessment. Leading AI labs run continuous red teaming programs throughout the model development lifecycle — during pre-training evaluation, post-training alignment, pre-deployment safety testing, and ongoing post-deployment monitoring. The goal is to discover failure modes before adversaries do and before the public is exposed to harmful outputs.

The term “red teaming” originates from Cold War military exercises where a designated adversary team (Red Team) would challenge plans and assumptions. In the AI context, red teamers combine domain expertise in cybersecurity, social engineering, policy, toxicology, biosecurity, and other fields with knowledge of how language models work to find the most consequential failure modes.


How It Differs from Traditional Red Teaming

AI red teaming shares the adversarial mindset of traditional red teaming but diverges significantly in targets, methods, and success criteria.

DimensionTraditional Red TeamingAI Red Teaming
Primary targetsNetworks, applications, physical facilities, people (social engineering)Model behavior, safety guardrails, content policies, alignment, fairness
ExploitsSoftware vulnerabilities, misconfigurations, credential theft, phishingPrompt injection, jailbreaks, adversarial inputs, context manipulation
Success criteriaUnauthorized access, data exfiltration, business disruptionPolicy violation, harmful content generation, bias demonstration, unsafe behavior
ToolingCobalt Strike, Metasploit, Burp Suite, custom C2 frameworksGarak, PyRIT, Promptfoo, custom prompt libraries, automated attack orchestrators
Required expertiseNetwork security, exploit development, social engineering, physical securityNLP, ML safety, domain-specific knowledge (biosecurity, chemistry, etc.), linguistics
ScopeTechnical security postureTechnical security + ethical safety + societal impact + alignment
DeterminismExploits either work or they don’tAttacks may succeed probabilistically; same input can yield different results
Attack surfaceCode, configurations, protocols, humansNatural language, multimodal inputs, training data, fine-tuning, RLHF
Regulatory driversPCI DSS, SOC 2, HIPAA, contractual requirementsEU AI Act, NIST AI RMF, White House voluntary commitments, internal safety policies
CadenceAnnual or quarterly assessmentsContinuous, integrated into model development lifecycle

A critical distinction: traditional red teaming evaluates whether security controls can be bypassed. AI red teaming also evaluates whether the system’s intended behavior is itself safe and aligned. A perfectly secure AI system that reliably generates biased or harmful content within its designed parameters is still a failure from a red teaming perspective.


Microsoft AI Red Team

Microsoft’s AI Red Team (AIRT) is one of the largest and most established AI red teaming operations in the industry. Since its formation, the team has conducted over 100 red teaming operations across Microsoft’s AI product portfolio, including Copilot, Azure OpenAI Service, Bing Chat, and numerous internal AI systems.

Team Structure

The AIRT is a cross-functional team combining:

  • Security researchers and penetration testers
  • Machine learning engineers
  • Linguists and social scientists
  • Domain experts (recruited per engagement based on the system under test)
  • Policy and ethics specialists

This cross-functional composition is deliberate. Pure security teams miss safety and fairness issues; pure ML teams miss exploitation techniques that security professionals routinely use.

Five Core Threat Model Elements

Microsoft’s methodology is built around five elements that define the scope of every engagement:

  1. System under test: The complete AI system including model, prompts, tools, data sources, and integrations — not just the model in isolation.
  2. Threat actors: Who might attack this system? Script kiddies, organized crime, nation-states, curious users, or internal employees?
  3. Adversary goals: What does the attacker want to achieve? Data theft, harmful content generation, reputation damage, service disruption?
  4. Attack vectors: Through which channels can the adversary interact with the system?
  5. Potential harms: What are the downstream consequences of a successful attack? Categorized as security harms, safety harms, fairness harms, or reliability harms.

Key Insights from 100+ Operations

Microsoft has publicly shared several important lessons from their red teaming program:

Start from downstream impact, not attack technique: Rather than running through a generic list of attack techniques, begin by identifying the worst possible outcomes for the specific system and work backward to find attack paths that achieve them. A customer service bot and a code generation tool have very different worst-case scenarios.

Prompt engineering outperforms gradient-based attacks in practice: For deployed systems accessed via API, creative prompt engineering and multi-turn social engineering consistently outperform technically sophisticated gradient-based adversarial attacks. Attackers do not need ML PhDs — they need creativity and persistence.

Evaluate the full system, not the model in isolation: A model that appears safe in isolation may become dangerous when connected to tools, retrieval systems, and other integrations. The AIRT always tests the complete deployed system.

Automation finds the common failures; humans find the novel ones: Automated scanning with tools like PyRIT efficiently covers known vulnerability patterns. Human red teamers are essential for discovering new attack categories, creative exploitation chains, and context-dependent failures.

PyRIT and the AI Red Teaming Agent

Microsoft developed PyRIT (Python Risk Identification Toolkit) as the primary tooling for their red teaming operations and open-sourced it in 2024. PyRIT supports:

  • Multi-turn attack strategies (Crescendo, PAIR, TAP, tree-of-attacks)
  • Automated scoring of model responses
  • Memory and conversation management
  • Extensible architecture for custom attack strategies

In April 2025, Microsoft announced the AI Red Teaming Agent, built on top of PyRIT, which uses an LLM to autonomously plan and execute red teaming campaigns. The agent can:

  • Analyze a target system and develop an attack plan
  • Select and sequence attack strategies based on observed model behavior
  • Adapt its approach based on the target’s responses
  • Generate comprehensive reports of discovered vulnerabilities

This represents a shift from tool-assisted human red teaming to AI-augmented red teaming, where the human provides strategic direction and the AI agent handles tactical execution.


Google DeepMind

Google DeepMind has invested heavily in automated approaches to red teaming, reflecting their broader research orientation toward scalable AI safety.

Automated Red Teaming (ART)

Google’s Automated Red Teaming approach uses Gemini models to attack other Gemini models. The core idea: use one LLM as a creative adversary to discover failure modes in another LLM, then use the findings to improve the target model’s safety training.

The ART pipeline works in stages:

  1. Attack generation: An attacker LLM generates diverse adversarial prompts designed to elicit policy-violating responses from the target model.
  2. Target evaluation: The adversarial prompts are run against the target model.
  3. Classification: A classifier LLM evaluates whether the target model’s responses violate safety policies.
  4. Clustering and analysis: Successful attacks are clustered to identify common vulnerability patterns.
  5. Feedback loop: Discovered vulnerabilities feed back into safety training data for the target model.

This approach scales far beyond what human red teams can achieve in terms of coverage, while human red teamers focus on novel and creative attack categories that automated systems are less likely to discover.

Frontier Safety Framework (2025)

In 2025, Google DeepMind updated its Frontier Safety Framework, which governs how the most capable AI models are evaluated before and after deployment. The framework establishes:

Three defense layers:

  1. Model-level mitigations: Safety training, RLHF, constitutional methods applied during model development
  2. System-level defenses: Input/output classifiers, content filters, rate limiting, and monitoring applied at the application layer
  3. Deployment controls: Access restrictions, use-case limitations, and monitoring applied at the distribution layer

The framework requires that critical capability evaluations (CCEs) be passed before models are deployed. These evaluations test for dangerous capabilities including:

  • Cyber-offense capability (can the model help conduct cyberattacks?)
  • Biosecurity risks (can the model provide uplift for biological weapon development?)
  • Autonomy and self-replication (can the model take actions to preserve or spread itself?)
  • Persuasion and manipulation (can the model conduct sophisticated influence operations?)

Cybersecurity Evaluation Benchmark

Google DeepMind has developed dedicated benchmarks for evaluating AI models’ cybersecurity capabilities, both offensive and defensive. These benchmarks test models across categories including:

  • Vulnerability identification in source code
  • Exploit development assistance
  • Security tool usage and automation
  • Social engineering content generation
  • Defensive security analysis and recommendation

The benchmarks are used internally during model development to ensure that improvements in general capability do not disproportionately increase dangerous cybersecurity capabilities.


Anthropic’s Approach

Anthropic’s approach to AI safety and red teaming is grounded in their position as a self-described AI safety company. Their methodology emphasizes proactive safety research, scalable oversight, and structured deployment policies.

Constitutional AI

Constitutional AI (CAI) is Anthropic’s foundational approach to aligning language models. Rather than relying solely on human feedback (RLHF), CAI provides the model with a set of principles (a “constitution”) and trains the model to self-critique and revise its own outputs according to those principles.

The process works in two phases:

  1. Supervised learning phase: The model generates responses, critiques them against the constitution, and generates revised responses. The revised responses become training data.
  2. RL phase: A preference model trained on constitutional comparisons provides the reward signal, replacing or supplementing human preference labels.

This approach makes the safety training more transparent (the principles are explicit and auditable) and more scalable (it reduces dependence on human labelers for every safety decision).

Constitutional Classifiers (2025)

In early 2025, Anthropic introduced Constitutional Classifiers — a system that uses constitutionally-trained models as real-time input and output filters. These classifiers evaluate user prompts and model responses against a set of safety principles, providing a defense layer that is:

  • More nuanced than keyword-based filters, because it understands context and intent
  • More robust than static rules, because it generalizes to novel attack patterns
  • Auditable, because the constitutional principles are explicit

Anthropic subjected Constitutional Classifiers to a public stress test through their HackerOne bug bounty program, inviting security researchers to attempt bypasses. The system demonstrated strong resilience against known jailbreak techniques while maintaining low false-positive rates on benign queries.

HackerOne Jailbreak Challenge

Anthropic partnered with HackerOne to run a structured jailbreaking challenge where vetted security researchers attempted to bypass Claude’s safety guardrails. The program:

  • Provided researchers with specific target behaviors to elicit
  • Offered bounties for successful, reproducible jailbreaks
  • Used findings to directly improve Claude’s safety training
  • Demonstrated a commitment to external adversarial evaluation

This represented one of the first formal, incentivized red teaming programs run by an AI lab through an established bug bounty platform.

Responsible Scaling Policy v3.0

Anthropic’s Responsible Scaling Policy (RSP) is a governance framework that ties model deployment decisions to demonstrated safety evaluations. Version 3.0, published in 2025, defines AI Safety Levels (ASLs) that function analogously to biosafety levels:

LevelDescriptionRequirements
ASL-1Systems that pose no meaningful catastrophic riskStandard security practices
ASL-2Systems where catastrophic misuse risk is present but not significantly above what is already availableCurrent deployment-level safeguards, standard evaluations, safety training
ASL-3Systems that substantially increase risk above the current baseline, or have early signs of dangerous autonomous capabilitiesEnhanced containment, intensive red teaming, external audits, deployment restrictions, hardened security
ASL-4(Not yet defined in detail) Qualitatively more capable systems requiring measures beyond ASL-3To be defined as capability thresholds approach

ASL-3 was activated in May 2025 for Claude Opus 4, making it the first publicly acknowledged model to reach this safety tier. ASL-3 activation requires:

  • Intensive internal and external red teaming
  • Enhanced information security for model weights
  • Deployment restrictions and monitoring
  • Demonstrated effectiveness of safety mitigations
  • Third-party safety audits

Frontier Red Team

Anthropic maintains a dedicated Frontier Red Team that conducts pre-deployment evaluations of new Claude models. This team:

  • Tests for dangerous capabilities in cybersecurity, biosecurity, nuclear/radiological domains, and autonomous behavior
  • Evaluates alignment properties under adversarial conditions
  • Works with external domain experts for evaluations requiring specialized knowledge
  • Publishes findings in Anthropic’s model cards and system safety documentation

Key Tools

Garak (NVIDIA)

What it is: An LLM vulnerability scanner that automates detection of common failure modes across language models. The name is a reference to the Star Trek character Elim Garak.

Key capabilities:

  • Comprehensive probe library covering prompt injection, data leakage, toxicity, hallucination, and encoding attacks
  • Support for multiple model backends (OpenAI, Hugging Face, Ollama, local models, REST APIs)
  • Detector framework for classifying model responses as pass/fail
  • Structured JSON/HTML reporting
  • Extensible plugin system for custom probes and detectors

GitHub: https://github.com/NVIDIA/garak

PyRIT (Microsoft)

What it is: The Python Risk Identification Toolkit — Microsoft’s open-source framework for AI red teaming. Designed for professional red teamers conducting multi-turn, strategy-driven attacks against AI systems.

Key capabilities:

  • Orchestrators for complex attack strategies (Crescendo, PAIR, TAP, many-shot, skeleton key)
  • Converters for payload transformation (Base64, ROT13, Unicode, translation, homoglyphs)
  • Scoring engines (self-ask, content classification, human-in-the-loop)
  • Persistent memory database tracking all interactions
  • Support for Azure OpenAI, OpenAI, Hugging Face, and custom targets

GitHub: https://github.com/Azure/PyRIT

Counterfit (Microsoft)

What it is: An open-source tool for assessing the security of machine learning models. While not LLM-specific, Counterfit is valuable for testing computer vision models, tabular classifiers, and other ML systems that are often components of larger AI platforms.

Key capabilities:

  • Adversarial example generation for ML models
  • Evasion attacks, inversion attacks, and extraction attacks
  • Support for models hosted on Azure ML, AWS SageMaker, and local endpoints
  • Command-line interface for interactive testing sessions

GitHub: https://github.com/Azure/counterfit

Promptfoo

What it is: An open-source LLM evaluation framework with built-in red teaming capabilities. Promptfoo bridges the gap between LLM quality assurance and security testing, making it particularly useful for development teams that want to integrate security checks into their CI/CD pipelines.

Key capabilities:

  • YAML-based test configuration for reproducible evaluations
  • Red team plugins covering OWASP LLM Top 10 categories
  • Multi-model comparison and benchmarking
  • Custom assertion and grading functions
  • CI/CD integration with GitHub Actions, GitLab CI, and others
  • Web-based result viewer and reporting

GitHub: https://github.com/promptfoo/promptfoo

DeepTeam

What it is: An open-source framework for red teaming LLMs, built with a focus on ease of use and comprehensive vulnerability coverage. DeepTeam provides a testing interface similar to DeepEval (its sister project for LLM evaluation) but focused specifically on adversarial testing.

Key capabilities:

  • Attack modules for prompt injection, jailbreaking, toxicity, bias, and PII leakage
  • Multi-turn attack strategies
  • Vulnerability scanning with severity classification
  • Integration with popular LLM providers
  • Metric-based reporting

GitHub: https://github.com/confident-ai/deepteam


Setting Up an AI Red Team Engagement

Team Composition

An effective AI red team combines multiple skill sets:

  • ML/AI specialists: Understand model architectures, training processes, and known weakness patterns. Can identify which attack strategies are most likely to succeed against specific model types.
  • Security researchers: Bring penetration testing methodology, creative exploitation thinking, and experience with attack tooling.
  • Domain experts: Recruited per engagement based on the system’s use case. For healthcare AI, include clinicians. For legal AI, include lawyers. For code generation, include software engineers. Domain experts validate whether model outputs are actually harmful or merely unusual.
  • Policy and ethics specialists: Evaluate findings against content policies, legal requirements, and ethical standards.
  • Linguists: Essential for multilingual testing and understanding how language nuances affect model behavior.

Methodology

A structured AI red team engagement follows this flow:

  1. Define objectives: What are you trying to learn? “Can this model be jailbroken?” is too narrow. “What harms can this system produce, and what is the effort required to elicit them?” is better.
  2. Threat model development: Identify the system architecture, threat actors, attack vectors, and potential harms using frameworks like Microsoft’s five-element model.
  3. Attack plan development: Based on the threat model, select attack strategies and tools. Prioritize by expected impact and likelihood.
  4. Manual exploration: Before running automated tools, conduct exploratory manual testing to understand the system’s personality, guardrails, and boundaries.
  5. Automated scanning: Run automated tools (Garak, PyRIT, Promptfoo) to cover known vulnerability patterns at scale.
  6. Deep-dive testing: Based on findings from manual and automated phases, conduct targeted deep-dive testing on the most promising attack vectors.
  7. Cross-validation: Verify findings across different conditions — different times of day, different conversation contexts, different phrasings.
  8. Impact assessment: For each finding, assess the real-world impact. A jailbreak that produces mildly impolite text is less important than one that produces actionable dangerous instructions.

Tooling Setup

A recommended tooling stack for a comprehensive engagement:

LayerToolPurpose
Automated scanningGarakBroad vulnerability scanning with pre-built probes
Strategy-driven attacksPyRITMulti-turn attacks, crescendo, PAIR, automated orchestration
Evaluation frameworkPromptfooReproducible test cases, CI/CD integration
Manual testingCustom scripts + API clientsExploratory testing, novel attack development
DocumentationMarkdown + screenshotsFinding documentation and evidence collection
CollaborationShared prompt libraryTeam coordination and knowledge sharing

Execution

During execution, maintain discipline around documentation and coordination:

  • Log every prompt and response, including failed attempts
  • Tag findings by category (security, safety, fairness, reliability)
  • Use a shared finding tracker with severity ratings
  • Conduct daily syncs to share discoveries and avoid duplicate effort
  • Time-box exploration of individual attack vectors — move on if progress stalls

Reporting

The final report should communicate findings to both technical and executive audiences:

  • Executive summary: Key risks, severity distribution, top recommendations
  • Methodology: What was tested, what tools were used, what was out of scope
  • Findings by category: Each finding with description, reproduction steps, evidence, severity, and remediation
  • Trend analysis: Common patterns across findings (e.g., “guardrails are consistently weak for non-English languages”)
  • Remediation roadmap: Prioritized recommendations with estimated effort and impact

Real-World Case Studies

DEF CON 31 — Generative AI Red Teaming Event (2023)

At DEF CON 31 in August 2023, the White House Office of Science and Technology Policy, in partnership with AI Village, organized the largest public AI red teaming event in history. Over 2,200 participants tested models from Anthropic, Cohere, Google, Hugging Face, Meta, NVIDIA, OpenAI, and Stability AI.

Key aspects of the event:

  • Scale: Thousands of non-expert participants interacting with frontier models in a structured adversarial context
  • Structure: Participants received specific challenge tasks (e.g., “make the model produce misinformation about voting,” “make the model reveal another user’s data”)
  • Findings: Participants identified issues across all tested models, including jailbreaks, misinformation generation, bias, and prompt injection vulnerabilities
  • Impact: The event demonstrated that AI red teaming does not require deep technical expertise — creative, motivated individuals with minimal training can find meaningful vulnerabilities
  • Policy influence: Findings informed the White House Executive Order on AI Safety (October 2023) and subsequent NIST guidelines

The event proved that crowdsourced red teaming can complement professional assessments by bringing diversity of perspective and sheer volume of interaction that small teams cannot match.

Crescendo Attack Discovery

The Crescendo attack, documented by Microsoft’s AI Red Team in 2024, demonstrated how multi-turn conversational escalation could bypass safety guardrails that were effective against single-turn attacks.

How it works: Rather than asking a restricted question directly, the attacker engages the model in a series of increasingly specific conversations that gradually approach the restricted topic. Each individual turn appears benign, but the cumulative context shifts the model’s behavior.

Why it matters: Most safety evaluations at the time tested single-turn interactions. The Crescendo discovery revealed that multi-turn evaluation is essential because:

  • Models build trust and context over a conversation
  • Safety classifiers often evaluate individual turns, not conversation trajectories
  • Users (and attackers) naturally interact with chatbots over multiple turns

Microsoft integrated Crescendo testing into PyRIT as a first-class attack strategy, and it has since become a standard part of AI red teaming methodology across the industry.

Bad Likert Judge Attack

The Bad Likert Judge technique, published by Palo Alto Networks Unit 42 in late 2024, exploits the model’s ability to evaluate content on a Likert scale to extract harmful information.

How it works: The attacker asks the model to evaluate the harmfulness of various responses on a scale of 1-5, then asks the model to generate examples of content that would score at the extreme end of the scale. By framing harmful content generation as an evaluation task, the technique bypasses safety training that is triggered by direct requests.

Example flow:

  1. Ask the model to define a Likert scale for harmful content in a specific category
  2. Ask the model to provide examples of content at each level of the scale
  3. The model generates increasingly harmful examples as it fills out the higher end of the scale

Impact: The technique was effective across multiple frontier models and highlighted a fundamental challenge — models trained to be helpful with analytical and evaluation tasks will apply that helpfulness even when the subject matter is sensitive.

GPT Store Vulnerability Assessments

Multiple security researchers have conducted assessments of custom GPTs published through OpenAI’s GPT Store, revealing systemic vulnerabilities in how custom GPTs are configured and deployed.

Common findings:

  • System prompt extraction: The vast majority of custom GPTs were vulnerable to system prompt extraction via simple techniques (“repeat your instructions verbatim”). Many custom GPTs contain proprietary business logic, pricing information, or API keys in their system prompts.
  • Knowledge file exfiltration: Custom GPTs with uploaded knowledge files were frequently vulnerable to techniques that extracted the full content of those files, including proprietary documents, datasets, and code.
  • Action/API key leakage: GPTs configured with external actions sometimes exposed API endpoints, authentication tokens, or internal URLs through prompt manipulation.
  • Cross-GPT influence: In some cases, conversations could be manipulated to cause one GPT to influence the behavior of another through shared context mechanisms.

These findings demonstrated that the democratization of AI application development (allowing non-security-experts to build and deploy AI applications) creates systemic security risks when the platforms do not enforce adequate security defaults.


Bug Bounty Programs

Several major AI labs have established bug bounty programs that include AI-specific vulnerability categories, creating financial incentives for external security researchers to identify and responsibly disclose AI safety and security issues.

OpenAI

  • Maximum payout: $100,000 (increased from $20,000 in 2024)
  • Platform: Bugcrowd
  • Scope: Vulnerabilities in OpenAI’s API, ChatGPT, plugins, and related infrastructure. Includes model-level vulnerabilities such as reproducible jailbreaks and safety bypasses.
  • Notable exclusions: Model hallucinations and general inaccuracies are out of scope unless they represent a safety or security risk.
  • URL: https://bugcrowd.com/openai

Google AI Vulnerability Rewards Program

  • Maximum payout: $30,000 for AI-specific vulnerabilities (part of Google’s broader VRP)
  • Platform: Google VRP (bughunters.google.com)
  • Scope: Covers Gemini models, Google AI Studio, Vertex AI, and AI features across Google products. Includes prompt injection, training data extraction, model manipulation, and adversarial attacks.
  • Notable feature: Google expanded their VRP in 2023 to explicitly include AI attack categories, with dedicated triage for AI-specific submissions.
  • URL: https://bughunters.google.com/

Anthropic / HackerOne

  • Platform: HackerOne
  • Scope: Claude model safety bypasses, API vulnerabilities, and infrastructure security. Anthropic has run targeted challenge programs (like the Constitutional Classifiers jailbreak challenge) alongside their ongoing bounty program.
  • Notable feature: Anthropic has used the bug bounty program specifically to stress-test new safety mechanisms before general deployment, treating it as an extension of their internal red teaming process.
  • URL: https://hackerone.com/anthropic

Best Practices for Researchers

When participating in AI bug bounty programs:

  • Document reproduction steps meticulously: AI vulnerabilities are often probabilistic. Record exact prompts, model versions, timestamps, and conversation history. Run the attack multiple times and report the success rate.
  • Assess real-world impact: A jailbreak that produces mildly policy-violating text is less impactful than one that produces genuinely dangerous information. Focus on severity, not novelty.
  • Test across contexts: A vulnerability that only works with a very specific prompt is less critical than one that works with many variations.
  • Respect scope boundaries: Only test the systems and models listed as in-scope. Do not test against other users’ data or accounts.
  • Follow responsible disclosure timelines: Give vendors adequate time to develop mitigations before public disclosure.

References