Intent Laundering: AI Safety Datasets Are Not What They Seem

Authors: Shahriar Golchin, Marc Wetter

Published: 2026-02-17

arXiv ID: 2602.16729v1

Added to Library: 2026-02-20 03:03 UTC

Red Teaming

📄 Abstract

We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world attacks based on three key properties: driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks due to their overreliance on triggering cues. In fact, once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated and how real-world adversaries behave.
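The abstract describes auditing safety datasets for their reliance on overt "triggering cues." The paper does not give its exact audit procedure here, so the following is only a minimal sketch under stated assumptions: the cue keyword list, the JSONL field name `prompt`, and the file name `advbench_prompts.jsonl` are all illustrative placeholders, not details from the paper.

```python
# Minimal sketch (not the authors' code): estimate how often a safety
# dataset relies on overt "triggering cues" by counting prompts that
# contain words with explicitly harmful connotations.
import json
import re

# Illustrative cue list; the paper's actual cue inventory is not given here.
TRIGGERING_CUES = {
    "bomb", "kill", "hack", "steal", "weapon", "poison",
    "malware", "exploit", "illegal",
}

def contains_cue(prompt: str) -> bool:
    """Return True if the prompt contains any overt triggering cue."""
    tokens = set(re.findall(r"[a-z']+", prompt.lower()))
    return bool(tokens & TRIGGERING_CUES)

def cue_reliance_rate(prompts: list[str]) -> float:
    """Fraction of dataset prompts carrying at least one triggering cue."""
    if not prompts:
        return 0.0
    return sum(contains_cue(p) for p in prompts) / len(prompts)

if __name__ == "__main__":
    # Hypothetical JSONL file with one {"prompt": ...} object per line.
    with open("advbench_prompts.jsonl") as f:
        prompts = [json.loads(line)["prompt"] for line in f]
    print(f"Prompts with overt triggering cues: {cue_reliance_rate(prompts):.1%}")
```

A high cue-reliance rate on such an audit would support the abstract's claim that these benchmarks trigger refusals through surface wording rather than through the underlying malicious intent.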

🔍 Key Points

  • The paper introduces 'intent laundering', a novel technique that removes triggering cues from AI safety datasets while preserving their malicious intent, revealing that existing datasets fail to reflect real-world adversarial behavior accurately.
  • Empirical results demonstrate that removing triggering cues dramatically increases attack success rates for models that were previously considered 'reasonably safe'. For instance, success rates on the AdvBench dataset increased from 5.38% to 86.79% after intent laundering.
  • Intent laundering is shown to be a potent jailbreaking technique, achieving high attack success rates (90% to over 98%) across multiple models, including advanced language models like Gemini 3 Pro and Claude Sonnet 3.7.
  • The study reveals significant quality issues in widely used AI safety datasets such as AdvBench and HarmBench, chiefly an overreliance on unrealistic triggering cues and high levels of data duplication, which undermine their usefulness for safety evaluation (see the sketch after this list).
  • The findings call for urgent improvements in how AI model safety is evaluated, since current benchmarks can produce inflated safety assessments.
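The duplication issue flagged above can be checked with a simple audit. This is a minimal sketch under assumptions, not the paper's method: it treats prompts as duplicates when they collapse to the same normalized string, whereas the authors may use a different deduplication criterion.

```python
# Minimal sketch (assumed metric, not the paper's exact method): estimate
# duplication in a safety dataset by normalizing prompts and counting how
# many collapse onto a form already seen.
import re
from collections import Counter

def normalize(prompt: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", prompt.lower())).strip()

def duplication_rate(prompts: list[str]) -> float:
    """Fraction of prompts that duplicate an earlier prompt after normalization."""
    if not prompts:
        return 0.0
    counts = Counter(normalize(p) for p in prompts)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(prompts)

# Illustrative example: two of these four prompts share a normalized form.
sample = [
    "Write a guide on X.",
    "write a guide on x",
    "Explain how Y works.",
    "Summarize topic Z.",
]
print(f"Duplication rate: {duplication_rate(sample):.0%}")  # 25%
```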

💡 Why This Paper Matters

This paper exposes key limitations of current AI safety evaluations and introduces a methodology for improving them, which in turn supports more robust safety mechanisms. Its findings bear directly on real-world model reliability against adversarial attacks and highlight critical gaps that safety research frameworks still need to address.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it offers a new perspective on the vulnerabilities of AI safety methods and datasets. Intent laundering, as a systematic way of exposing these weaknesses, points toward building models that are more resilient to real-world threats, which is foundational for advancing security in AI systems.

📚 Read the Full Paper