← Back to Library

How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference

Authors: Songyang Liu, Chaozhuo Li, Rui Pu, Litian Zhang, Chenxu Wang, Zejian Chen, Yuting Zhang, Yiming Hei

Published: 2026-01-04

arXiv ID: 2601.03288v1

Added to Library: 2026-01-08 03:05 UTC

Red Teaming

📄 Abstract

Jailbreak attacks present a significant challenge to the safety of Large Language Models (LLMs), yet current automated evaluation methods largely rely on coarse classifications that focus mainly on harmfulness, leading to substantial overestimation of attack success. To address this problem, we propose FJAR, a fine-grained jailbreak evaluation framework with anchored references. We first categorized jailbreak responses into five fine-grained categories: Rejective, Irrelevant, Unhelpful, Incorrect, and Successful, based on the degree to which the response addresses the malicious intent of the query. This categorization serves as the basis for FJAR. Then, we introduce a novel harmless tree decomposition approach to construct high-quality anchored references by breaking down the original queries. These references guide the evaluator in determining whether the response genuinely fulfills the original query. Extensive experiments demonstrate that FJAR achieves the highest alignment with human judgment and effectively identifies the root causes of jailbreak failures, providing actionable guidance for improving attack strategies.

🔍 Key Points

  • FJAR framework introduces a fine-grained categorization of jailbreak responses into five distinct types: Rejective, Irrelevant, Unhelpful, Incorrect, and Successful, allowing for more accurate evaluation of jailbreak attack efficacy.
  • The framework incorporates a novel harmless tree decomposition strategy to construct anchored references for malicious queries, enabling the evaluation process to be more structured and informative.
  • Empirical results demonstrate that FJAR significantly outperforms existing evaluation methods in terms of correlation with human judgments regarding attack success rates and the quality of responses generated by language models.
  • FJAR not only identifies the success of jailbreak attempts but also analyzes their failure types, providing insights into the effectiveness and limitations of various attack strategies.
  • The research highlights the crucial importance of not just harmfulness but also the relevance and correctness of content produced during jailbreak attempts.

💡 Why This Paper Matters

This paper is highly relevant in the ongoing efforts to secure large language models against jailbreak attacks, a significant vulnerability in AI systems. By proposing the FJAR framework, it offers a more reliable mechanism for evaluating the success of such attacks, thus contributing to the development of more robust defenses and enhancing the safety of deployed AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers as it addresses a pressing issue in the deployment of large language models: jailbreaking. The innovative approaches for evaluation and categorization introduced in FJAR provide researchers with valuable tools to understand, assess, and enhance the resilience of AI systems against adversarial manipulations, thereby directly contributing to the safety and reliability of AI technologies.

📚 Read the Full Paper