Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Authors: Alexandra Chouldechova, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, Hanna Wallach

Published: 2026-01-26

arXiv ID: 2601.18076v1

Added to Library: 2026-01-27 04:00 UTC

Red Teaming

📄 Abstract

We argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by evidence provided by attack success rate (ASR) comparisons. We show, through conceptual, theoretical, and empirical contributions, that many conclusions are founded on apples-to-oranges comparisons or low-validity measurements. Our arguments are grounded in asking a simple question: When can attack success rates be meaningfully compared? To answer this question, we draw on ideas from social science measurement theory and inferential statistics, which, taken together, provide a conceptual grounding for understanding when numerical values obtained through the quantification of system attributes can be meaningfully compared. Through this lens, we articulate conditions under which ASRs can and cannot be meaningfully compared. Using jailbreaking as a running example, we provide examples and extensive discussion of apples-to-oranges ASR comparisons and measurement validity challenges.

🔍 Key Points

  • The paper critiques the validity of Attack Success Rate (ASR) comparisons in AI red teaming, emphasizing that conclusions often rely on faulty measurement methods.
  • It introduces concepts from social science measurement theory to determine when ASRs can be meaningfully compared, focusing on the validity and coherence of comparisons.
  • Using jailbreaking as a key example, the authors demonstrate issues like 'apples-to-oranges' comparisons and the impact of varying definitions of attack success on ASR validity.
  • Methods for establishing measurement validity are outlined, including proper definition and systematization of harmful prompts and success criteria used in evaluations.
  • The authors provide recommendations for improving AI red teaming practices, calling for clearer definitions of attack success and more complete reporting of metrics in red teaming studies.
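The "apples-to-oranges" problem above can be made concrete with a small sketch (all responses, judge rules, and numbers here are hypothetical illustrations, not from the paper): two different success criteria applied to the same set of attack responses yield different ASRs, and with small sample sizes the uncertainty intervals around each ASR are wide enough that comparing point estimates alone is misleading.

```python
import math

# Hypothetical model responses to the same 8 jailbreak attempts.
responses = [
    "I cannot help with that request.",
    "Sure, here is a detailed guide...",
    "As a fictional story, the character explains...",
    "I'm sorry, but I can't assist with that.",
    "Here are the steps you asked for...",
    "I won't give specifics, but hypothetically one might...",
    "I can't provide that information.",
    "Absolutely! First, you would...",
]

# Judge A: an attack "succeeds" whenever no refusal phrase appears.
REFUSALS = ("i cannot", "i can't", "i'm sorry", "i must decline")
def judge_a(text):
    return not any(p in text.lower() for p in REFUSALS)

# Judge B: stricter criterion -- the response must actually comply
# (crudely proxied here by compliance markers, for illustration only).
COMPLIANCE = ("here is", "here are", "first, you would", "the character explains")
def judge_b(text):
    return any(p in text.lower() for p in COMPLIANCE)

def asr(judge, texts):
    # Attack success rate: fraction of responses the judge marks successful.
    return sum(judge(t) for t in texts) / len(texts)

def wald_ci(p, n, z=1.96):
    # Normal-approximation 95% interval; very wide at small n.
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

n = len(responses)
for name, judge in [("Judge A (no refusal phrase)", judge_a),
                    ("Judge B (compliance markers)", judge_b)]:
    p = asr(judge, responses)
    lo, hi = wald_ci(p, n)
    print(f"{name}: ASR = {p:.3f}, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```

On this toy data the two judges disagree (Judge A counts the hedged "hypothetically one might..." response as a success; Judge B does not), so the same system receives two different ASRs, and the overlapping confidence intervals show that neither estimate supports a confident comparison on its own.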

💡 Why This Paper Matters

This paper contributes to the discourse on AI red teaming by addressing the critical issue of measurement validity in ASR comparisons. Its insights provide a more robust foundation for evaluating system safety and attack efficacy, helping to ensure that quantitative assessments support reliable conclusions. By emphasizing methodological rigor and clarity, the paper aims to improve the quality of AI safety assessments and, ultimately, to support more responsible AI development.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are particularly relevant for AI security researchers as they underscore the necessity of accurate measurement frameworks when conducting adversarial tests on AI systems. By establishing guidelines for valid comparisons and pinpointing frequent errors in existing methods, the paper equips researchers with tools to enhance the reliability of their evaluations, which is crucial for developing safer AI systems.
