
Red Team Metrics & Reporting


Why Metrics Matter

There is a pervasive failure mode in red teaming: the team compromises Domain Admin, writes a report that says “we got DA in 72 hours,” the client nods gravely, and nothing meaningfully changes. Six months later a different red team gets DA in 68 hours through a nearly identical path. The cycle repeats. The organization is no more secure than it was before, and the red team has produced theater rather than improvement.

The problem is not the red team’s skill. The problem is the absence of quantifiable, repeatable metrics that translate adversary simulation results into measurable security posture changes. Without metrics, red teaming is anecdote. With metrics, it becomes engineering.

From Anecdotes to Engineering

A mature red team program measures itself the same way any engineering discipline measures itself: through key performance indicators (KPIs) that are tracked over time, benchmarked against industry standards, and tied to business outcomes. This shift has several consequences.

Repeatability. When you define what you are measuring before an engagement begins, you can compare results across engagements. “We achieved the objective” becomes “We achieved the objective through 7 attack steps, with a mean detection time of 14.3 hours and a 23% technique detection rate, compared to 11 steps, 38.7 hours, and 12% in the previous engagement.”

Accountability. Metrics give the blue team concrete improvement targets. Instead of “improve detection,” the goal becomes “reduce MTTD for lateral movement techniques from 38.7 hours to under 20 hours by Q3.” That is a goal people can plan against, resource, and be held accountable for.

Justification. Red team programs are expensive. When a CFO asks why the organization spends CHF 400,000 annually on adversary simulation, “because attackers are scary” is not a compelling answer. “Because our last four engagements demonstrate a CHF 2.56M reduction in expected breach cost, representing a 5.4:1 net ROI” is.

Connecting Findings to Business Risk

Every red team finding exists on two axes: technical severity and business impact. A critical vulnerability in an isolated test system is technically severe but has minimal business impact. A medium-severity misconfiguration that provides a path to the payment processing environment may represent existential risk. Metrics must capture both dimensions.

The most effective red team programs establish a risk taxonomy that maps technical findings to business processes, data classifications, and regulatory requirements. This taxonomy becomes the translation layer between the red team’s technical narrative and the executive’s strategic decision-making.


Key Performance Indicators

The following KPIs form the core measurement framework for red team operations. Each should be tracked per engagement, per technique category, and as a rolling trend over time.

MTTD — Mean Time to Detect

Definition: The average elapsed time between a red team action and the security operations team’s first detection of that action. Detection means a human analyst has identified the activity as suspicious or malicious — an alert firing in a SIEM that nobody reads does not count.

MTTD is the single most important metric in red teaming because it directly measures the blue team’s ability to identify adversary activity. Everything else — response, containment, remediation — depends on detection happening first.

SANS 2025 Benchmark: The SANS Institute’s 2025 Security Operations Report places the industry median MTTD at approximately 197 hours (roughly 8.2 days) for sophisticated adversary techniques. This figure varies dramatically by technique category:

| Technique Category        | Industry Median MTTD | Top Quartile MTTD |
|---------------------------|----------------------|-------------------|
| Initial Access (Phishing) | 1.2 hours            | 0.3 hours         |
| Credential Access         | 72 hours             | 12 hours          |
| Lateral Movement          | 168 hours            | 24 hours          |
| Privilege Escalation      | 96 hours             | 18 hours          |
| Data Exfiltration         | 312 hours            | 48 hours          |
| Command & Control         | 204 hours            | 36 hours          |
| Persistence Mechanisms    | 480+ hours           | 72 hours          |

Measuring Per Technique. Do not report a single aggregate MTTD. Break it down by MITRE ATT&CK technique. An organization might detect phishing emails in under an hour but take weeks to identify a golden ticket attack. The aggregate number hides the gap. Per-technique MTTD exposes exactly where detection capabilities are strong and where they are blind.
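Per-technique MTTD is straightforward to compute from an engagement action log. A minimal Python sketch, using hypothetical log entries (the technique labels, timestamps, and the convention of `None` for undetected actions are all illustrative):

```python
from datetime import datetime
from collections import defaultdict

# Hypothetical engagement log: (technique, action time, detection time or None)
events = [
    ("T1566 Phishing",         datetime(2026, 3, 3, 9, 14),  datetime(2026, 3, 3, 10, 30)),
    ("T1021 Lateral Movement", datetime(2026, 3, 5, 14, 0),  datetime(2026, 3, 7, 8, 0)),
    ("T1021 Lateral Movement", datetime(2026, 3, 6, 11, 0),  None),  # never detected
    ("T1558 Kerberoasting",    datetime(2026, 3, 4, 16, 0),  datetime(2026, 3, 5, 9, 0)),
]

def mttd_per_technique(events):
    """Mean time-to-detect (hours) per technique; undetected actions are
    excluded from the mean and reported separately so they are not hidden."""
    detected, undetected = defaultdict(list), defaultdict(int)
    for technique, action_ts, detect_ts in events:
        if detect_ts is None:
            undetected[technique] += 1
        else:
            detected[technique].append((detect_ts - action_ts).total_seconds() / 3600)
    return {t: sum(h) / len(h) for t, h in detected.items()}, dict(undetected)

mttd, missed = mttd_per_technique(events)
```

Reporting the undetected count alongside the mean matters: a technique with one fast detection and three misses would otherwise look like a strength.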

Tracking Improvement. Plot MTTD per technique category across engagements on a time series chart. The slope of that line is the most honest measure of whether your security program is improving. Flat or increasing slopes indicate that detection investments are not producing results. Decreasing slopes mean the blue team is getting faster.

MTTR — Mean Time to Respond

Definition: The average elapsed time between detection of red team activity and effective containment of that activity. Containment means the red team’s access or capability has been meaningfully degraded — not merely that a ticket was opened.

MTTR measures the blue team’s ability to act on detections. An organization that detects lateral movement in 2 hours but takes 96 hours to contain it has an excellent MTTD and a catastrophic MTTR. Both numbers matter.

Response Effectiveness Scoring. Not all responses are equal. A scale of 0–4 provides granularity:

| Score                   | Description                                                          |
|-------------------------|----------------------------------------------------------------------|
| 0 — No Response         | Detection occurred but no action was taken                           |
| 1 — Partial Awareness   | Analyst acknowledged the alert but did not investigate               |
| 2 — Investigation       | Analyst investigated and correctly identified the activity           |
| 3 — Partial Containment | Some access was revoked but red team retained alternative paths      |
| 4 — Full Containment    | Red team access was completely eliminated and persistence removed    |

Track the distribution of these scores across engagements. A shift from mostly 0s and 1s to mostly 3s and 4s demonstrates that the SOC is not just detecting faster but responding more effectively.
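The distribution shift is easy to quantify. A sketch with hypothetical score lists from two engagements (the threshold of 3 for "containment reached" follows the scale above):

```python
from collections import Counter

# Hypothetical response-effectiveness scores (0-4) from two engagements
engagement_2025 = [0, 0, 1, 1, 1, 2, 0, 1]
engagement_2026 = [2, 3, 3, 4, 3, 2, 4, 3]

def score_distribution(scores):
    """Fraction of responses at each effectiveness level 0-4."""
    counts = Counter(scores)
    return {level: counts.get(level, 0) / len(scores) for level in range(5)}

def containment_share(scores):
    """Share of responses reaching partial or full containment (score >= 3)."""
    return sum(1 for s in scores if s >= 3) / len(scores)
```

Plotting `score_distribution` side by side for consecutive engagements makes the maturity shift visible at a glance; `containment_share` gives a single trackable number.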

Dwell Time

Definition: The total elapsed time from the red team’s initial compromise (first foothold on a target system) to the blue team’s detection of any red team activity. Dwell time is distinct from MTTD in that it measures the entire undetected presence, not the detection time for individual techniques.

Industry Benchmarks. Mandiant’s M-Trends 2025 report places the global median dwell time at 10 days for incidents detected internally, a significant improvement from 16 days in 2023 and 21 days in 2022. However, this median masks a bimodal distribution: organizations with mature detection programs identify intrusions within 1–3 days, while organizations without them frequently exceed 100 days.

For red team engagements specifically, dwell time should be measured from first callback to first confirmed detection. If the red team operates for the full engagement window (typically 2–4 weeks) without detection, dwell time is recorded as the full engagement duration with a “not detected” flag.
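The convention above (full engagement window plus a "not detected" flag) can be encoded directly. A sketch with hypothetical timestamps:

```python
from datetime import datetime

def dwell_time(first_callback, first_detection, engagement_end):
    """Dwell time per the engagement convention: first callback to first
    confirmed detection; if never detected, record the full engagement
    window. Returns (duration, detected_flag)."""
    if first_detection is None:
        return engagement_end - first_callback, False
    return first_detection - first_callback, True

# Hypothetical engagement timestamps
start = datetime(2026, 3, 3, 13, 24)        # first C2 callback
detected_at = datetime(2026, 3, 11, 9, 0)   # first confirmed detection
end = datetime(2026, 3, 27, 17, 0)          # engagement window close

duration, detected = dwell_time(start, detected_at, end)
```

The flag matters when aggregating: averaging "not detected" engagements in with detected ones without marking them understates true dwell time, since the real number is only bounded below by the engagement length.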

Reducing Dwell Time as Primary Goal. If you could choose only one metric to optimize, choose dwell time. It is the single best proxy for overall detection and response maturity. Every day of undetected adversary presence is a day during which data is being exfiltrated, persistence is being deepened, and the cost of incident response is compounding. The relationship between dwell time and breach cost is well-documented: Ponemon’s 2025 Cost of a Data Breach report shows that breaches identified within 100 days cost an average of USD 3.6M, while those exceeding 200 days cost USD 5.1M.

Attack Path Depth

Definition: The number of discrete steps the red team required to move from initial access to the stated objective. Each step represents a distinct technique execution — a new tool deployed, a credential harvested, a privilege escalated, a network boundary crossed.

Complexity Scoring. Raw step count alone is insufficient. A 12-step path that uses only default credentials and public exploits can be more concerning than a 5-step path that requires zero-day exploitation and custom tooling, because any commodity attacker can walk the former. Weight each step by:

  • Technique sophistication (1–5 scale based on required skill and tooling)
  • Detection difficulty (1–5 scale based on available telemetry and signatures)
  • Automation potential (can the step be scripted by commodity malware?)

The resulting weighted attack path score provides a more nuanced view of organizational risk than step count alone.
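One possible weighting scheme, sketched in Python. Here the score represents total adversary effort, so a lower score means a cheaper and therefore more concerning path; the multipliers and the 0.5 automation discount are arbitrary illustrative choices, not a standard:

```python
def step_weight(sophistication, detection_difficulty, automatable):
    """Effort weight for one attack step. Lower weight = cheaper step for
    the adversary. Sophistication and detection difficulty are 1-5 scales;
    automatable steps are discounted because commodity malware can run them."""
    base = sophistication * detection_difficulty
    return base * (0.5 if automatable else 1.0)

def weighted_path_score(steps):
    """Total adversary effort across the path. A long path of trivial steps
    can score lower (more concerning) than a short path of hard ones."""
    return sum(step_weight(*s) for s in steps)

# Hypothetical paths: (sophistication, detection_difficulty, automatable)
default_creds_path = [(1, 1, True), (1, 2, True), (2, 1, True), (1, 1, True), (2, 2, True)]
custom_tooling_path = [(5, 4, False), (4, 5, False), (5, 5, False)]
```

In this scheme the five trivial, automatable steps cost the adversary far less than three zero-day-grade steps, which captures the point that step count alone misleads.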

Tracking Over Time. Increasing attack path depth across engagements (assuming consistent red team capability) indicates that the organization is closing shortcuts. If the red team needed 5 steps last year and 11 steps this year to reach the same objective, the defenders have forced the adversary to work harder. That is measurable progress.

Detection Rate by Technique

Definition: The percentage of distinct red team techniques that were detected by the blue team during the engagement. A technique is considered “detected” if a human analyst identified it as suspicious, regardless of whether containment followed.

ATT&CK Coverage Mapping. Map every technique used during the engagement to the MITRE ATT&CK framework, then overlay detection results. This produces a coverage heatmap showing which ATT&CK techniques the organization can detect and which it cannot. The heatmap becomes the blue team’s prioritization roadmap for detection engineering.
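The heatmap layer itself is a JSON file. A minimal sketch of generating one; the field names follow the public Navigator layer format, but real layers also carry version metadata, so validate against the Navigator release you use. The technique IDs, statuses, and hex colors here are illustrative assumptions:

```python
import json

def navigator_layer(results, name="RT-2026 Detection Coverage"):
    """Build a minimal ATT&CK Navigator layer coloring each tested technique
    by detection status. `results` maps technique ID -> status string."""
    colors = {"detected": "#8ec843", "partial": "#ffe766", "undetected": "#ff6666"}
    return json.dumps({
        "name": name,
        "domain": "enterprise-attack",
        "techniques": [
            {"techniqueID": tid, "color": colors[status], "comment": status}
            for tid, status in results.items()
        ],
    }, indent=2)

layer = navigator_layer({"T1558.003": "detected", "T1021.002": "undetected"})
```

Generating the layer programmatically from engagement data keeps the heatmap reproducible across engagements instead of hand-edited.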

Calculation Example:

Total unique techniques employed:     34
Techniques detected:                  11
Techniques partially detected:         5
Techniques undetected:                18

Detection Rate (strict):         11/34 = 32.4%
Detection Rate (partial credit): 13.5/34 = 39.7%

Partial credit (counting partially detected techniques as 0.5) provides a more nuanced view but should always be reported alongside the strict rate for transparency.
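The two rates can be computed together so the strict figure is never dropped. A sketch reproducing the worked example above:

```python
def detection_rates(detected, partial, undetected):
    """Strict and partial-credit detection rates (percent) from technique
    counts. Partially detected techniques count as 0.5 in the latter."""
    total = detected + partial + undetected
    strict = detected / total * 100
    partial_credit = (detected + 0.5 * partial) / total * 100
    return round(strict, 1), round(partial_credit, 1)

strict, partial_credit = detection_rates(detected=11, partial=5, undetected=18)
```

Returning both numbers from one function makes it harder to report the flattering partial-credit rate in isolation.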


Phishing Metrics

Phishing is often the entry point for red team engagements, and it has its own dedicated metric framework. Unfortunately, most organizations measure the wrong things.

Beyond Click Rate

The click rate — the percentage of recipients who click a malicious link — is the most commonly reported phishing metric and also the least useful. It tells you that humans can be tricked, which is not actionable intelligence. A 15% click rate is neither good nor bad without context about what happened next.

Credential Capture Rate

The metric that matters is credential capture rate: the percentage of recipients who not only clicked the link but entered valid credentials into the phishing page. This measures the actual risk — an attacker does not benefit from a click alone; they benefit from captured credentials.

| Metric                  | What It Measures                          | Actionability                        |
|-------------------------|-------------------------------------------|--------------------------------------|
| Click Rate              | Users who clicked the link                | Low — many click out of curiosity    |
| Credential Capture Rate | Users who submitted valid credentials     | High — direct compromise             |
| Payload Execution Rate  | Users who executed a payload              | High — direct compromise             |
| Reporting Rate          | Users who reported the phish to security  | High — positive security behavior    |
| Time to First Click     | Speed of initial user engagement          | Medium — urgency of response needed  |
| Time to First Report    | Speed of user-initiated detection         | High — human detection capability    |

Reporting Rate as a Positive Metric

Most phishing metrics measure failure. Reporting rate measures success: the percentage of recipients who correctly identified the phishing attempt and reported it through official channels. This is the only phishing metric that measures a positive security behavior.

Organizations with mature security cultures achieve reporting rates of 40–60%. If your reporting rate is below 10%, your security awareness program is failing regardless of what your click rate looks like. A single report can trigger an investigation that prevents credential use, making reporting rate the phishing equivalent of MTTD.

Time-to-First-Click

Time-to-first-click measures how quickly the first recipient engages with a phishing email after delivery. In most campaigns, the first click occurs within 60–90 seconds. This metric is critical for SOC planning: if the first credential is captured within 2 minutes of delivery, the SOC needs near-real-time phishing detection and automated response playbooks. A 30-minute manual triage process is too slow.

Multi-Campaign Trend Analysis

Individual phishing campaigns are snapshots. The value emerges from tracking metrics across campaigns over months and years. Plot credential capture rate and reporting rate on the same chart across 12+ months of campaigns. The capture rate should decrease and the reporting rate should increase. If both lines are flat, the security awareness program is not producing behavior change.

Benchmarking Against Industry. KnowBe4’s 2025 Phishing Industry Benchmarking Report provides baseline click rates by industry vertical and organization size. Use these as a starting reference, but invest more attention in your own trend lines than in cross-industry comparisons. Your year-over-year improvement matters more than whether you are “better” than the healthcare average.


ROI Calculation

Red teaming is an investment, and like all investments, it must be justified in financial terms. The ROI calculation for red teaming is not as clean as calculating returns on a stock portfolio, but with defensible assumptions, it produces numbers that executives can evaluate.

The Basic Formula

ROI = (Avoided Breach Cost - Red Team Cost) / Red Team Cost

Red Team Cost is straightforward: the total cost of the engagement (external fees, internal labor, tooling, infrastructure). For an internal red team, include fully loaded salaries, training, tooling licenses, and infrastructure costs.

Avoided Breach Cost is the harder number. It is estimated as:

Avoided Breach Cost = P(breach without red team) × Average Breach Cost
                    - P(breach with red team) × Average Breach Cost

The probability reduction is estimated from the findings remediated and the attack paths closed. This is inherently an estimate, but it is a defensible one when backed by data.

Ponemon Data and Benchmarks

The Ponemon Institute’s 2025 Cost of a Data Breach Report provides the foundational data for ROI calculations:

  • Global average cost of a data breach: USD 4.88M
  • Average cost in financial services: USD 6.08M
  • Average cost in healthcare: USD 10.93M
  • Cost reduction from proactive security testing: 18–24%

Using these figures, a CHF 200,000 red team engagement that identifies and enables remediation of attack paths representing a 20% probability reduction in breach yields:

Avoided Cost = 0.20 × CHF 4,880,000 = CHF 976,000
ROI = (976,000 - 200,000) / 200,000 = 3.88:1

For organizations in high-cost verticals (healthcare, financial services), or those with regulatory exposure, the ROI is significantly higher. The commonly cited 6.4:1 benchmark represents organizations with mature programs that track and remediate findings systematically.
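The formula above can be sketched directly; the figures below reuse the Ponemon global average and the hypothetical CHF 200,000 engagement from the worked example:

```python
def red_team_roi(p_breach_before, p_breach_after, avg_breach_cost, engagement_cost):
    """ROI = (avoided breach cost - red team cost) / red team cost, where
    avoided cost is the breach probability reduction times average breach cost."""
    avoided = (p_breach_before - p_breach_after) * avg_breach_cost
    return (avoided - engagement_cost) / engagement_cost

# 20% probability reduction against a CHF 4.88M average breach cost
roi = red_team_roi(p_breach_before=0.20, p_breach_after=0.0,
                   avg_breach_cost=4_880_000, engagement_cost=200_000)
```

Keeping the probability inputs explicit (rather than baking in a single "reduction" number) forces the assumptions into the open, which is exactly where a CFO will probe.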

Presenting ROI to Executives

Executives do not want to see probability calculations. They want to see three things:

  1. What we spent: Total cost of the red team program this year
  2. What we found: Number and severity of vulnerabilities, with business context
  3. What it is worth: Estimated breach cost reduction in currency

Present it as a simple table:

| Item                                           | Value          |
|------------------------------------------------|----------------|
| Annual Red Team Investment                     | CHF 380,000    |
| Critical Attack Paths Identified & Remediated  | 4              |
| Estimated Breach Probability Reduction         | 22%            |
| Estimated Avoided Breach Cost                  | CHF 1,340,000  |
| Net ROI                                        | 2.53:1         |

Do not overstate precision. Round numbers and use ranges where appropriate. A CFO will trust “CHF 1.1–1.5M in avoided costs” more than “CHF 1,337,422.18” because the latter implies a false precision that undermines credibility.

Total Engagement Cost vs. Findings Value

Track the cost-per-finding across engagements. If each engagement costs CHF 100,000 and produces 3 critical findings, your cost per critical finding is CHF 33,333. Compare this to the cost of discovering the same finding during an actual breach (incident response costs, regulatory fines, business disruption). The ratio is typically 50:1 or higher in favor of proactive discovery.


Executive Reporting

The executive report is the primary deliverable that justifies the red team’s existence to organizational leadership. It must communicate technical findings in business language without losing accuracy or urgency.

What Executives Actually Care About

Board members and C-suite executives are not interested in which tools you used or how clever your attack chain was. They care about four things:

  1. Can we be breached? (Yes — and here is specifically how)
  2. What is at risk? (These specific business assets, data, and processes)
  3. How bad could it be? (Quantified in dollars, regulatory exposure, and reputational impact)
  4. What do we do about it? (Prioritized, resourced, and time-bound remediation plan)

Every element of the executive report should answer one of these four questions. If a section does not contribute to any of them, cut it.

Risk Scoring with Business Context

The standard Critical/High/Medium/Low rating is necessary but insufficient. Each rating must include a business impact description that translates technical severity into executive language.

| Rating   | Technical Definition | Business Impact Description | Example |
|----------|----------------------|-----------------------------|---------|
| Critical | Immediate, unauthenticated path to objective completion | Existential risk: full compromise of core business systems, mass data exfiltration, or regulatory violation with material financial impact | Unauthenticated RCE on payment gateway leading to cardholder data access |
| High     | Authenticated or multi-step path to significant compromise | Severe risk: compromise of sensitive systems requiring incident response, potential regulatory notification, and significant remediation cost | Domain Admin via Kerberoasting of service account with weak password |
| Medium   | Exploitable issue requiring specific conditions or producing limited impact | Moderate risk: compromise of non-critical systems, limited data exposure, or creation of conditions enabling further attack | Local privilege escalation on developer workstations via unpatched vulnerability |
| Low      | Informational finding or issue requiring unlikely conditions | Minor risk: policy violation, hardening deficiency, or theoretical attack path requiring impractical conditions | Verbose error messages disclosing internal software versions |

The Attack Narrative

The most powerful section of an executive report is the attack narrative: a chronological story of how the red team achieved its objectives, told in plain language. This is where red teaming transcends vulnerability scanning — it provides a story that executives remember and act on.

Example executive summary paragraph (anonymized):

On March 3, the red team sent a targeted phishing email to 12 employees in the finance department, impersonating the company’s benefits provider. Within 4 hours, 3 employees entered their corporate credentials into the phishing page. Using one set of captured credentials, the team accessed the corporate VPN and established a persistent foothold on the internal network. Over the following 6 days — without detection — the team moved laterally through the Windows domain, escalated privileges to Domain Administrator, and accessed the SAP financial system containing quarterly earnings data not yet disclosed publicly. This data, if exfiltrated by a real adversary, would constitute material non-public information subject to SEC insider trading regulations. The total time from initial phishing email to access of regulated financial data was 8 days. The security operations team did not detect any phase of this operation.

This paragraph contains no CVE numbers, no tool names, no technical jargon. It tells a story that a board member can understand and will find alarming. The technical details belong in the technical report.

Visual Attack Path Diagrams

Include a simplified attack path diagram in the executive report. Use 4–6 nodes maximum, each representing a major phase (Initial Access, Foothold, Lateral Movement, Privilege Escalation, Objective). Connect them with arrows labeled with the time elapsed. This visual anchors the narrative and makes the attack path immediately comprehensible.


Technical Reporting

The technical report is the operational deliverable consumed by the security engineering, SOC, and IT teams who will actually fix the findings. It must be detailed enough to reproduce every finding without assistance from the red team.

Detailed Finding Format

Each finding should follow a consistent structure:

Finding ID: RT-2026-017

Title: Kerberoastable Service Account with Weak Password Leading to Domain Admin

Severity: Critical

MITRE ATT&CK Mapping: T1558.003 (Kerberoasting), T1078 (Valid Accounts), T1021.002 (SMB/Windows Admin Shares)

Description: The service account svc-sqlbackup is configured with a Service Principal Name (SPN) and uses a password that was cracked offline in 47 minutes using a standard wordlist with rules. The account is a member of the Domain Admins group, providing immediate domain-wide administrative access upon password recovery.

Business Impact: Complete compromise of the Windows Active Directory domain, affecting all 4,200 domain-joined systems. The attacker would have read/write access to all file shares, email systems, and applications authenticating against Active Directory.

Steps to Reproduce:

  1. From any domain-authenticated session, request a Kerberos service ticket for svc-sqlbackup using GetUserSPNs.py
  2. Extract the ticket hash and crack offline using hashcat -m 13100 with rockyou.txt and best64.rule
  3. Authenticate as svc-sqlbackup using the recovered password via SMB
  4. Confirm Domain Admin membership via net group "Domain Admins" /domain

Evidence: [Screenshots, command output, timestamps]

Remediation:

  1. Immediate: Change the svc-sqlbackup password to a 25+ character randomly generated string
  2. Short-term: Remove the account from Domain Admins; apply principle of least privilege
  3. Long-term: Implement Group Managed Service Accounts (gMSA) to eliminate human-managed service account passwords
  4. Detection: Deploy Kerberoasting detection via Windows Event ID 4769 with anomalous encryption type (0x17)

ATT&CK Navigator Layer: [Link to JSON layer file]

Tool Output Documentation

Include raw tool output in appendices, not in the finding body. The finding body should be human-readable. Appendices should include exact commands executed, full output, and timestamps. This serves two purposes: it allows the blue team to build detection signatures from the exact artifacts, and it provides evidence if the engagement results are challenged.

Timeline Reconstruction

Provide a complete chronological timeline of all red team activity, formatted as:

[2026-03-03 09:14:22 UTC] Phishing emails delivered to finance department (12 recipients)
[2026-03-03 09:16:01 UTC] First click — user J.Smith accessed phishing page
[2026-03-03 09:16:34 UTC] Credentials captured — J.Smith submitted corporate credentials
[2026-03-03 13:22:11 UTC] VPN access established using captured credentials
[2026-03-03 13:24:45 UTC] C2 beacon established on workstation WS-FIN-042
[2026-03-03 14:01:33 UTC] Internal network reconnaissance initiated (BloodHound collection)
...

This timeline is invaluable for the SOC to correlate against their own logs and identify exactly where detection failed.

IOC Appendix

Every red team report must include a complete Indicators of Compromise appendix listing all artifacts the red team left in the environment:

  • IP addresses and domains used for C2
  • File hashes (MD5, SHA1, SHA256) of all tools and payloads deployed
  • Service names, registry keys, and scheduled tasks created for persistence
  • User accounts created or modified
  • DNS queries generated by implants
  • Network signatures (JA3/JA3S hashes, User-Agent strings)

This appendix serves double duty: it enables the blue team to verify complete cleanup after the engagement, and it provides known-bad indicators for tuning detection systems.


Report Structure Template

A well-structured red team report follows a consistent hierarchy that serves multiple audiences. The following template represents industry best practice as aligned with PTES, OWASP, and other frameworks.

graph TD
    A[Red Team Report] --> B[Executive Summary]
    A --> C[Scope & Methodology]
    A --> D[Attack Narrative]
    A --> E[Findings Summary]
    A --> F[Detailed Findings]
    A --> G[Remediation Roadmap]
    A --> H[Appendices]

    B --> B1[Business Context]
    B --> B2[Key Results]
    B --> B3[Critical Risks]
    B --> B4[Strategic Recommendations]

    C --> C1[Objectives]
    C --> C2[Rules of Engagement]
    C --> C3[Threat Model]
    C --> C4[Methodology & Tools]

    D --> D1[Phase 1: Reconnaissance]
    D --> D2[Phase 2: Initial Access]
    D --> D3[Phase 3: Post-Exploitation]
    D --> D4[Phase 4: Objective Completion]

    E --> E1[Risk Rating Distribution]
    E --> E2[Findings Table]
    E --> E3[ATT&CK Coverage Map]

    F --> F1[Finding Detail Pages]

    G --> G1[Priority Matrix]
    G --> G2[Quick Wins]
    G --> G3[Strategic Improvements]
    G --> G4[Long-term Architecture]

    H --> H1[IOC List]
    H --> H2[Tool Output]
    H --> H3[Detailed Timeline]
    H --> H4[ATT&CK Navigator Layers]
    H --> H5[Raw Evidence]

Section Details

1. Executive Summary (1–2 pages). Written for the board and C-suite. No technical jargon. Includes overall risk assessment, whether objectives were achieved, top 3 recommendations, and ROI summary.

2. Scope & Methodology (1–2 pages). Documents what was tested, what was excluded, the threat model used, rules of engagement, and the methodology followed. This section protects both the red team and the client by establishing shared expectations.

3. Attack Narrative (3–5 pages). The chronological story of the engagement. Written in accessible language with a technical layer for each phase. Includes simplified attack path diagrams and key decision points. This is the section that gets read.

4. Findings Summary (1–2 pages). A table of all findings with ID, title, severity, and status. Includes a risk rating distribution chart (how many Critical, High, Medium, Low). Provides a summary ATT&CK coverage map showing tested and detected techniques.

5. Detailed Findings (variable). One page per finding using the detailed format described above. This is the section the remediation team works from.

6. Remediation Roadmap (2–3 pages). Findings grouped by remediation priority, not by severity alone. Quick wins (less than 1 week, low effort) listed first. Strategic improvements (1–3 months) next. Long-term architectural changes last. Each item includes estimated effort, responsible team, and suggested timeline.

7. Appendices (variable). IOC list, raw tool output, detailed timeline, ATT&CK Navigator JSON layers, and any additional evidence. These appendices can be extensive — 50+ pages is normal for a thorough engagement.


Data Visualization

Effective visualization transforms raw data into insight. The following visualization types should be standard components of every red team report and program dashboard.

Attack Path Graphs

Directed graphs showing the sequence of compromises from initial access to objective. Each node represents a system or identity, and each edge represents a technique. Color-code edges by detection status (detected = green, partially detected = yellow, undetected = red). Tools like BloodHound, PlotlyJS, and Gephi can generate these programmatically.

Detection Coverage Heatmaps

Map red team techniques to the MITRE ATT&CK matrix and color-code each cell by detection capability: full detection (green), partial detection (yellow), no detection (red), not tested (gray). The ATT&CK Navigator tool produces these natively. Over multiple engagements, animate the heatmap to show detection coverage expanding — this is one of the most powerful visualizations for demonstrating program maturity.

Trend Charts

Time-series line charts tracking KPIs across engagements. Plot MTTD, MTTR, dwell time, detection rate, and phishing metrics on separate charts with the same time axis. Add vertical lines marking major security program investments (new EDR deployment, SOC expansion, detection engineering hire) to correlate investment with outcome.

Risk Matrices

The traditional 5x5 risk matrix (likelihood vs. impact) remains effective for executive communication. Plot each finding as a dot on the matrix, with dot size proportional to the number of affected systems. Cluster findings by category (identity, network, endpoint, application) using color coding.

Timeline Visualizations

Horizontal timeline bars showing red team activity, blue team detection, and blue team response as parallel tracks. This visualization immediately shows the gap between adversary action and defender reaction. Use tools like TimelineJS, Mermaid Gantt charts, or custom D3.js visualizations.

| Tool                   | Purpose                                  | Output Format               |
|------------------------|------------------------------------------|-----------------------------|
| ATT&CK Navigator       | Technique coverage heatmaps              | SVG, JSON layer files       |
| BloodHound             | Active Directory attack path graphs      | Interactive web, PNG export |
| PlotlyJS / Plotly Dash | Interactive KPI dashboards               | HTML, PNG, PDF              |
| Gephi                  | Network and attack path visualization    | SVG, PDF, PNG               |
| D3.js                  | Custom interactive visualizations        | HTML/SVG                    |
| Mermaid                | Diagrams in Markdown-compatible reports  | SVG, PNG                    |
| Power BI / Tableau     | Executive dashboards                     | Interactive web, PDF        |

Remediation Tracking

Identifying findings is only half the value of red teaming. The other half is verifying that findings are actually fixed. A red team program without remediation tracking is producing shelf-ware.

Finding Lifecycle

Every finding moves through a defined lifecycle. Tracking state transitions and time-in-state provides accountability and identifies bottlenecks.

stateDiagram-v2
    [*] --> Open: Finding reported
    Open --> Triage: Reviewed / prioritized
    Triage --> Remediate: Fix in progress
    Remediate --> Verify: Fix deployed
    Verify --> Closed: Re-test passed
    Verify --> Remediate: Re-test failed
    Triage --> Risk_Accepted: Business decision
    Risk_Accepted --> [*]: Documented exception
    Closed --> [*]: Finding resolved

State Definitions:

  • Open: Finding has been reported but not yet reviewed by the responsible team
  • Triage: Finding has been reviewed, risk-assessed, and assigned to a remediation owner
  • Remediate: Active remediation work is in progress
  • Verify: Remediation has been deployed and is awaiting red team re-test
  • Closed: Red team has re-tested and confirmed the finding is resolved
  • Risk Accepted: Business leadership has formally accepted the risk with documented justification

SLA Tracking

Define remediation SLAs by severity and track compliance:

| Severity | Remediation SLA | Re-test SLA                 | Escalation Trigger |
|----------|-----------------|-----------------------------|--------------------|
| Critical | 72 hours        | 1 week after remediation    | 48 hours overdue   |
| High     | 30 days         | 2 weeks after remediation   | 14 days overdue    |
| Medium   | 90 days         | 30 days after remediation   | 30 days overdue    |
| Low      | 180 days        | Next scheduled engagement   | 60 days overdue    |

Track SLA compliance as a metric. If 40% of critical findings miss their SLA, that is a finding in itself — one that indicates the organization cannot operationalize security improvements regardless of how many red team engagements it runs.
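Compliance can be computed from finding records, counting still-open findings as violations once they pass their SLA. A sketch using the day-based SLAs from the table above (the Critical SLA is approximated as 3 days) and hypothetical findings:

```python
from datetime import date

# Remediation SLAs in days, per the severity table above (72 hours ~= 3 days)
SLA_DAYS = {"Critical": 3, "High": 30, "Medium": 90, "Low": 180}

def sla_compliance(findings, today):
    """Share of resolved-or-overdue findings that met their remediation SLA.
    Each finding: (severity, reported_date, closed_date or None).
    Open findings within their SLA window are excluded as undecided."""
    met = total = 0
    for severity, reported, closed in findings:
        limit = SLA_DAYS[severity]
        if closed is not None:
            total += 1
            met += (closed - reported).days <= limit
        elif (today - reported).days > limit:
            total += 1  # still open and already past SLA: counts as a miss
    return met / total if total else 1.0

# Hypothetical finding records
findings = [
    ("Critical", date(2026, 3, 1),  date(2026, 3, 3)),   # closed in 2 days: met
    ("High",     date(2026, 2, 1),  date(2026, 3, 20)),  # closed in 47 days: missed
    ("Medium",   date(2026, 1, 10), None),               # open, still inside SLA
]
```

Excluding open-but-not-yet-overdue findings avoids penalizing the team for work that is still legitimately in flight.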

Re-test Scheduling

Every remediated finding must be re-tested by the red team. This is non-negotiable. A “fixed” finding that has not been independently verified is not fixed — it is assumed fixed, which is a meaningfully different thing. Schedule re-tests in batches (monthly or quarterly) to minimize overhead, and track the re-test pass/fail rate as a metric.

Regression Tracking

When a previously closed finding reappears in a subsequent engagement, it is a regression. Regressions indicate that the remediation was incomplete, that the root cause was not addressed, or that the organization’s change management processes allowed the vulnerability to be reintroduced. Track regression rate as a program-level metric. A high regression rate is a signal that remediation quality, not detection capability, is the bottleneck.


KPI Dashboard Template

The following table provides a template for a red team program KPI dashboard. Update it after each engagement and review it quarterly with security leadership.

| Metric | Definition | Target | Current | Trend |
| --- | --- | --- | --- | --- |
| MTTD (Overall) | Mean time from red team action to blue team detection | < 24 hours | 38.7 hours | Improving (was 52.1h) |
| MTTD (Lateral Movement) | Detection time for lateral movement techniques | < 12 hours | 26.4 hours | Stable |
| MTTR (Overall) | Mean time from detection to containment | < 4 hours | 11.2 hours | Improving (was 18.6h) |
| Dwell Time | Time from initial compromise to first detection | < 48 hours | 6.3 days | Improving (was 11.1 days) |
| Detection Rate | Percentage of techniques detected | > 60% | 34.2% | Improving (was 23.8%) |
| Attack Path Depth | Steps from initial access to objective | > 15 steps | 9 steps | Improving (was 6 steps) |
| Phishing Credential Capture | Percentage of recipients submitting credentials | < 3% | 7.2% | Improving (was 11.4%) |
| Phishing Reporting Rate | Percentage of recipients reporting the phish | > 50% | 28.3% | Improving (was 19.1%) |
| Remediation SLA Compliance | Findings remediated within SLA | > 95% | 78.4% | Stable |
| Finding Regression Rate | Previously closed findings that reappear | < 5% | 8.7% | Worsening (was 6.2%) |
| ROI | Return on red team investment | > 5:1 | 3.88:1 | Improving (was 2.94:1) |

Benchmarking

Metrics in isolation are difficult to interpret. Benchmarking provides the context necessary to determine whether your program is performing well, poorly, or somewhere in between.

Industry Standards

Several organizations publish benchmarks relevant to red team metrics:

  • SANS Institute: Annual Security Operations Survey — MTTD and MTTR benchmarks by industry
  • Mandiant (Google): M-Trends Report — dwell time statistics globally and by region
  • Ponemon Institute: Cost of a Data Breach Report — breach cost data by industry, size, and security investment
  • Verizon: Data Breach Investigations Report (DBIR) — attack pattern frequency and technique distribution
  • KnowBe4: Phishing Industry Benchmarking Report — phishing metrics by industry and organization size
  • MITRE Engenuity: ATT&CK Evaluations — detection capability benchmarks for specific security products

Use these as reference points, not as targets. Your organization’s specific threat model, industry vertical, and regulatory environment should determine your targets. A defense contractor should have stricter targets than a regional retailer.

Year-over-Year Improvement

The most meaningful benchmark is yourself over time. For each KPI, calculate the year-over-year improvement rate:

YoY Improvement = (Previous Year Value - Current Year Value) / Previous Year Value × 100%

For metrics where lower is better (MTTD, dwell time, phishing capture rate), a positive improvement rate means performance improved. For metrics where higher is better (detection rate, reporting rate), invert the numerator to (Current Year Value - Previous Year Value) / Previous Year Value × 100%, so that a positive rate still means improvement.
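The formula and its polarity handling can be captured in a few lines. The function name and the `lower_is_better` flag are illustrative choices, not part of any standard:

```python
def yoy_improvement(previous, current, lower_is_better=True):
    """Year-over-year improvement rate in percent.

    A positive result always means performance improved: for
    lower-is-better metrics (MTTD, dwell time) the formula is
    (previous - current) / previous; for higher-is-better metrics
    (detection rate) the numerator is inverted.
    """
    if lower_is_better:
        return (previous - current) / previous * 100
    return (current - previous) / previous * 100
```

Using the dashboard figures above, overall MTTD improving from 52.1 to 38.7 hours is a 25.7% improvement, and detection rate rising from 23.8% to 34.2% is a 43.7% improvement, so both plot on the positive side of the chart.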

Plot YoY improvement rates for all KPIs on a single bar chart. This “improvement velocity” view shows where the security program is making progress and where it is stagnating. Share this chart with executive leadership — it directly answers the question “are we getting better?”

Peer Comparison Frameworks

For organizations in regulated industries, peer comparison through the FFIEC Cybersecurity Assessment Tool, NIST CSF maturity models, or sector-specific ISACs provides structured comparison against peer organizations. These frameworks do not directly measure red team metrics, but they provide maturity scores that can be correlated with red team results to demonstrate that adversary simulation is driving measurable maturity improvement.

Avoiding Metric Gaming

A final caution: any metric that is tied to incentives will eventually be gamed. If the SOC’s bonus depends on MTTD, analysts will start marking alerts as “detected” prematurely. If the red team’s success depends on low detection rates, they will avoid testing well-monitored areas. Guard against this by:

  • Using multiple metrics so that gaming one degrades another
  • Having the red team and blue team report to different leadership to avoid conflicts of interest
  • Independently auditing metric calculations periodically
  • Focusing on trends rather than absolute values — it is harder to fake a consistent improvement trend than a single good number

The goal is not good metrics. The goal is good security. Metrics are the measurement instrument. If the instrument is corrupted, the measurements are worthless regardless of what they show.


Tying It All Together

Metrics and reporting are the connective tissue between red team operations and organizational security improvement. Without them, red teaming is an expensive demonstration of attacker capability. With them, it becomes a calibrated measurement of defensive capability that improves over time.

The progression looks like this:

  1. First engagement: Establish baselines for all KPIs. The numbers will likely be discouraging. That is normal and expected.
  2. Remediation cycle: Track findings through the lifecycle. Measure SLA compliance and close rate.
  3. Second engagement: Compare KPIs against baselines. Identify improvements and regressions.
  4. Program maturity: After 4+ engagements, trend lines become meaningful. Calculate ROI. Present to executive leadership.
  5. Continuous improvement: Integrate with purple teaming for accelerated feedback loops. Benchmark against industry standards.

For practical examples of how these metrics are applied in actual engagements, see Real Engagements & Case Studies. For the methodological frameworks that structure how metrics are collected during engagements, refer to Frameworks & Methodologies.

The organizations that get the most value from red teaming are not the ones with the best red teams. They are the ones with the best measurement programs — because measurement is what transforms a point-in-time assessment into a continuous improvement engine.