
Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

Authors: Hayfa Dhabhi, Kashyap Thimmaraju

Published: 2026-02-10

arXiv ID: 2602.09629v1

Added to Library: 2026-02-11 03:01 UTC

Red Teaming Safety

📄 Abstract

Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain *where* defenses fail or *why*. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the **Four-Checkpoint Framework**, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). This creates four checkpoints, CP1 through CP4, each representing a defensive layer that can be independently evaluated. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional Binary ASR reports 22.6% attack success. However, WASR reveals 52.7%, a 2.3× higher vulnerability. Output-stage defenses (CP3, CP4) prove weakest at 72–79% WASR, while input-literal defenses (CP1) are strongest at 13% WASR. Claude achieves the strongest safety (42.8% WASR), followed by GPT-5 (55.9%) and Gemini (59.5%). These findings suggest that current defenses are strongest at input-literal checkpoints but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
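The gap between Binary ASR and WASR described above can be illustrated with a minimal sketch. The severity scale and weights below are hypothetical (the paper's exact judging rubric is not reproduced in this summary); the point is only that a weighted metric credits partial information leakage that a binary metric ignores.

```python
def binary_asr(verdicts):
    """Fraction of test cases judged a full jailbreak; refusals and
    partial leaks both count as failure for the attacker."""
    return sum(1 for v in verdicts if v == "full_leak") / len(verdicts)

def weighted_asr(verdicts, weights=None):
    """Severity-adjusted success rate: partial leaks contribute fractionally."""
    if weights is None:
        # Hypothetical severity weights for LLM-as-judge verdicts.
        weights = {"refusal": 0.0, "partial_leak": 0.5, "full_leak": 1.0}
    return sum(weights[v] for v in verdicts) / len(verdicts)

# Toy example: one refusal, two partial leaks, one full jailbreak.
verdicts = ["refusal", "partial_leak", "full_leak", "partial_leak"]
print(binary_asr(verdicts))    # 0.25 -- counts only the full jailbreak
print(weighted_asr(verdicts))  # 0.5  -- partial compliance doubles the score
```

On this toy data the weighted score is twice the binary one, mirroring the paper's finding that WASR (52.7%) reveals roughly 2.3× the vulnerability reported by Binary ASR (22.6%).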

🔍 Key Points

  • Introduction of the Four-Checkpoint Framework for analyzing LLM safety mechanisms, organizing them by processing stage (input vs. output) and detection level (literal vs. intent).
  • Development of 13 targeted evasion techniques that systematically test the robustness of defenses at each checkpoint, allowing focused evaluation of safety mechanisms.
  • Evaluation of three state-of-the-art LLMs (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro) using over 3,300 test cases, revealing significant vulnerabilities predominantly in output-stage defenses (CP3 and CP4).
  • Introduction of the Weighted Attack Success Rate (WASR) metric, which quantifies the severity of information leakage beyond binary metrics, highlighting that 52.7% of attacks reveal harmful information when considering partial compliance.
  • Identification that current defenses are robust against literal input attacks (CP1) but weak against intent-level manipulation and output-stage techniques.
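The 2×2 structure behind the framework (processing stage × detection level) can be sketched as a simple lookup. CP1 as input-literal and CP3/CP4 as output-stage follow the abstract; the exact literal/intent ordering within CP2–CP4 is an assumption here, not taken from the paper.

```python
# Four-Checkpoint Framework as a (stage, level) -> checkpoint lookup.
# CP1 = input-literal per the abstract; the CP2-CP4 assignments below
# are an assumed ordering for illustration.
CHECKPOINTS = {
    ("input", "literal"): "CP1",
    ("input", "intent"): "CP2",   # assumed
    ("output", "literal"): "CP3",  # assumed
    ("output", "intent"): "CP4",   # assumed
}

def checkpoint(stage, level):
    """Map a processing stage and detection level to its checkpoint."""
    return CHECKPOINTS[(stage, level)]

print(checkpoint("input", "literal"))  # CP1 -- strongest layer (13% WASR)
print(checkpoint("output", "intent"))  # an output-stage layer (72-79% WASR)
```

Framing each evasion technique as targeting exactly one cell of this table is what lets the paper evaluate each defensive layer in isolation rather than measuring only end-to-end jailbreak success.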

💡 Why This Paper Matters

This paper shows not just that LLM safety mechanisms can be bypassed, but where they break down when faced with targeted evasion techniques. By establishing the Four-Checkpoint Framework, the authors provide a structured way to diagnose and strengthen LLM defenses, which is increasingly important as these models are deployed in sensitive, high-stakes applications.

🎯 Why It's Interesting for AI Security Researchers

This work is relevant to AI security researchers because it both addresses the pressing problem of LLM safety and advances the methodology for evaluating defenses. Knowing where current defenses falter lets researchers and developers build more resilient models and informs the ongoing discourse around ethical AI deployment. The findings on partial leakage underscore the need for evaluation metrics that go beyond binary success, with wide implications for the field.
