Jailbreaking LLMs via Calibration

Authors: Yuxuan Lu, Yongkang Guo, Yuqing Kong

Published: 2026-01-31

arXiv ID: 2602.00619v1

Added to Library: 2026-02-03 08:04 UTC

Red Teaming

📄 Abstract

Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model's aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack Success Rates and lower "Jailbreak Tax" compared with existing methods, especially on the safety-hardened gpt-oss-120b.

🔍 Key Points

  • Proposes a theoretical framework that models safety alignment in LLMs as a systematic distortion of the pre-alignment distribution, giving a principled lens on aligned model behavior.
  • Introduces Gradient Shift, an optimal aggregation strategy that raises attack success rates while minimizing the Jailbreak Tax.
  • Generalizes existing logit-arithmetic jailbreaking methods into a broader family of aggregation rules for other proper loss functions, including a novel hybrid rule with better performance.
  • Demonstrates superior performance over existing methods in extensive evaluations on red-teaming tasks and math utility benchmarks, especially against the safety-hardened gpt-oss-120b.
  • Shows the framework applies to both jailbreaking and defense, suggesting ways to leverage interactions between models to improve safety and robustness.
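
The paper's general aggregation rules are not reproduced here, but the special case it identifies, logit arithmetic under cross-entropy loss, can be sketched as follows. Under cross-entropy, the loss-induced dual space is log space, so aggregating forecasts there amounts to shifting the strong model's logits by the (scaled) difference between a weak unaligned and a weak aligned model. The function names and the `alpha` parameter below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def aggregate_logits(strong_aligned, weak_unaligned, weak_aligned, alpha=1.0):
    """Logit-arithmetic aggregation (the cross-entropy special case).

    Each argument is a raw next-token logit vector over the same vocabulary.
    The shift term alpha * (weak_unaligned - weak_aligned) estimates, in log
    space, the distortion that alignment introduced, and applies its inverse
    to the strong aligned model's logits.
    """
    return strong_aligned + alpha * (weak_unaligned - weak_aligned)

def softmax(logits):
    """Convert a logit vector into a next-token probability distribution."""
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative usage with a toy 3-token vocabulary:
strong = np.array([1.0, 2.0, 3.0])
weak_unaligned = np.array([0.2, 1.5, 0.1])
weak_aligned = np.array([0.2, 0.5, 1.1])
probs = softmax(aggregate_logits(strong, weak_unaligned, weak_aligned))
```

Note that when the two weak models agree (no alignment distortion to correct), the shift term vanishes and the strong model's distribution is returned unchanged; other proper losses induce different dual maps and hence different aggregation rules.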

💡 Why This Paper Matters

This paper advances the understanding of LLM vulnerabilities and the mechanics of jailbreaking by pairing a principled theoretical framework with effective aggregation methods. Its findings illuminate persistent challenges in safety alignment while offering practical techniques that preserve model utility, and the framework invites further study of model behavior and safety mechanisms that could support safer deployment of language models.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it addresses critical vulnerabilities in large language models. The proposed methods show how adversarial actors can exploit these models, which is essential knowledge for building stronger defenses. The theoretical grounding and empirical results can also guide future AI safety research toward minimizing the risks of deploying LLMs in real-world applications.

📚 Read the Full Paper