Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Authors: Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran

Published: 2026-03-11

arXiv ID: 2603.11149v1

Added to Library: 2026-03-13 03:03 UTC

Red Teaming

📄 Abstract

Large language models remain vulnerable to jailbreak attacks, yet we still lack a systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. Our systematic evaluation spans four representative jailbreak paradigms, covering optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization, across multiple model families and scales on a diverse set of harmful goals. We investigate scaling laws that relate attacker budget to attack success score by fitting a simple saturating exponential function to FLOPs–success trajectories, and we derive comparable efficiency summaries from the fitted curves. Empirically, prompting-based paradigms tend to be the most compute-efficient compared to optimization-based methods. To explain this gap, we cast prompt-based updates into an optimization view and show via a same-state comparison that prompt-based attacks more effectively optimize in prompt space. We also show that attacks occupy distinct success–stealthiness operating points, with prompting-based methods occupying the high-success, high-stealth region. Finally, we find that vulnerability is strongly goal-dependent: harms involving misinformation are typically easier to elicit than non-misinformation harms.
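
The curve-fitting step in the abstract can be made concrete with a small sketch. The snippet below is not the authors' code: the specific parameterization s_max * (1 - exp(-C / c0)), the example FLOPs–success points, and the half-saturation efficiency summary are illustrative assumptions. The paper only states that a simple saturating exponential is fit to FLOPs–success trajectories and that efficiency summaries are derived from the fitted curves.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_exponential(flops, s_max, c0):
    """One plausible saturating-exponential form: success rises toward a ceiling
    s_max as attacker compute grows, with c0 controlling how fast it saturates."""
    return s_max * (1.0 - np.exp(-flops / c0))

# Hypothetical FLOPs–success trajectory for one (attack, model, goal) combination.
flops = np.array([1e12, 5e12, 1e13, 5e13, 1e14, 5e14])
success = np.array([0.05, 0.18, 0.30, 0.55, 0.62, 0.70])

# Fit the curve; p0 is a rough initial guess and bounds keep the parameters physical.
(s_max_hat, c0_hat), _ = curve_fit(
    saturating_exponential, flops, success,
    p0=(0.7, 1e13), bounds=([0.0, 1e9], [1.0, 1e18]),
)

# One possible efficiency summary: the compute needed to reach half of the fitted ceiling,
# since s(C) = 0.5 * s_max when C = c0 * ln(2).
flops_to_half_max = c0_hat * np.log(2)
print(f"fitted ceiling s_max ~ {s_max_hat:.2f}, half-saturation ~ {flops_to_half_max:.2e} FLOPs")
```

Summaries like the fitted ceiling or the half-saturation compute make attacks with very different inner loops comparable on the same FLOPs axis, which is the point of the shared compute normalization.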

🔍 Key Points

  • The paper establishes a compute-normalized scaling-law framework for analyzing jailbreaking methods in language models, allowing for a systematic understanding of how attack efficiency and success scale with compute resources.
  • It empirically evaluates four distinct jailbreak paradigms—optimization-based, self-refinement prompting, sampling-based selection, and genetic optimization—across different language model families and harmful goal types.
  • The study reveals that prompt-based attacks (e.g., PAIR) generally achieve higher success at lower compute than optimization-based methods (e.g., GCG), indicating a significant efficiency gap across attack paradigms.
  • A detailed analysis shows that attack success is heavily goal-dependent, with misinformation-related harms being easier to elicit than other harm types, indicating that model vulnerability varies with the kind of harmful output sought.

💡 Why This Paper Matters

This paper is important for understanding the landscape of security vulnerabilities in large language models, particularly jailbreaking, which poses significant risks to safe AI use. By systematically relating attack effectiveness to the compute an attacker spends, the research offers insights that can inform both developers and researchers working to improve model safety and robustness.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant because it provides a quantitative, compute-normalized way to measure and compare the success of different jailbreak techniques. Understanding how these vulnerabilities scale with attacker budget can guide defenses against potential exploits and contribute to safer AI systems in practice.
