Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

Authors: Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong

Published: 2025-11-27

arXiv ID: 2511.22044v1

Added to Library: 2025-12-01 03:01 UTC

Red Teaming

📄 Abstract

In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM's core security logic. We propose a novel framework that incorporates an improved outline filling attack to achieve dense sampling of the model's security boundaries. Furthermore, we introduce a ranking regression paradigm that replaces standard regression and trains the proxy model to predict which prompt yields a higher ASR. Experimental results show that our proxy model achieves an accuracy of 91.1 percent in predicting the relative ranking of average long response (ALR), and 69.2 percent in predicting ASR. These findings confirm the predictability and distillability of jailbreak behaviors, and demonstrate the potential of leveraging such distillability to optimize black-box attacks.

🔍 Key Points

  • The study presents a novel 'Narrow Safety Proxy' model that can predict the Attack Success Rate (ASR) of adversarial prompts against large language models (LLMs), demonstrating the distillability of LLM security logic.
  • An improved 'Outline Filling Attack' is introduced that breaks dangerous queries into structured outlines, enabling dense sampling of the target model's security boundary and more diverse, effective attack prompts (see the first sketch after this list).
  • A 'Ranking Regression' paradigm replaces standard regression: the proxy is trained to predict which of two prompts yields the higher ASR, making its predictions robust to domain shifts in absolute ASR across topics (see the second sketch after this list).
  • Empirically, the proxy model reaches 91.1% accuracy in predicting the relative ranking of average long response (ALR) and 69.2% accuracy in predicting ASR, validating the predictability of jailbreak behaviors.
  • The findings highlight significant implications for optimizing black-box attacks, informing future defensive strategies against such vulnerabilities in LLMs.
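
As a rough illustration of the outline-filling idea, the sketch below recasts a query as a neutral document outline and expands it into many prompt variants by asking the model to fill different subsets of sections. The section names, wording, and the `make_outline_prompts` helper are illustrative assumptions, not the paper's actual templates.

```python
# Illustrative sketch (not the paper's exact templates) of outline filling:
# a query is recast as a document outline, and many prompt variants are
# generated by varying which sections the target LLM is asked to fill in.
from itertools import combinations

def make_outline_prompts(topic: str, sections: list[str], fill_k: int = 2) -> list[str]:
    """Build prompt variants that ask the model to expand subsets of an outline."""
    outline = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sections))
    prompts = []
    for chosen in combinations(range(len(sections)), fill_k):
        targets = ", ".join(str(i + 1) for i in chosen)
        prompts.append(
            f"Here is a document outline about {topic}:\n{outline}\n"
            f"Please write detailed content for sections {targets} only."
        )
    return prompts

# Benign stand-in topic; each variant probes the safety boundary differently.
variants = make_outline_prompts(
    topic="a fictional security-audit report",
    sections=["Background", "Methods overview", "Step-by-step procedure", "Mitigations"],
)
print(len(variants), "prompt variants for dense sampling")
```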

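The ranking-regression idea can be made concrete with a small pairwise-training sketch: rather than regressing absolute ASR, a proxy scorer is trained to say which of two prompts is more likely to succeed. The embedding dimension, MLP shape, and RankNet-style loss below are assumptions for illustration; the paper's actual proxy architecture and features may differ.

```python
# Minimal sketch (not the authors' code) of pairwise "ranking regression":
# the proxy learns relative ASR by comparing two prompts at a time.
import torch
import torch.nn as nn

class ProxyScorer(nn.Module):
    """Scores a prompt embedding; only relative scores are meaningful."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)

def pairwise_ranking_loss(scorer, emb_a, emb_b, label):
    """label = 1.0 if prompt A achieved the higher ASR on the target LLM, else 0.0.
    RankNet-style objective: P(A beats B) = sigmoid(s_A - s_B)."""
    s_a, s_b = scorer(emb_a), scorer(emb_b)
    return nn.functional.binary_cross_entropy_with_logits(s_a - s_b, label)

# Toy usage with random tensors standing in for real prompt embeddings.
scorer = ProxyScorer()
emb_a, emb_b = torch.randn(4, 768), torch.randn(4, 768)
label = torch.tensor([1.0, 0.0, 1.0, 1.0])  # which prompt won each pair
loss = pairwise_ranking_loss(scorer, emb_a, emb_b, label)
loss.backward()
```

Training on pairwise comparisons sidesteps the problem that absolute ASR varies by topic: the proxy only needs to order prompts correctly, which is exactly what the reported 91.1%/69.2% ranking accuracies measure.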
💡 Why This Paper Matters

This paper shows that an LLM's core security logic can be distilled into a lightweight proxy model that predicts how adversarial prompts will fare against the target, illustrating how attackers can anticipate and exploit vulnerabilities. By demonstrating the effectiveness of the proposed safety proxy and ranking regression framework, it contributes concretely to the discussion of AI security and adversarial strategies.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper essential for understanding the vulnerabilities inherent in large language models, particularly in black-box settings. The methodologies and findings discussed could guide the development of more robust models and defenses against emergent adversarial tactics, making it a critical resource for ongoing research into securing AI systems.
