
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Authors: Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

Published: 2026-03-11

arXiv ID: 2603.11331v1

Added to Library: 2026-03-13 03:02 UTC

Red Teaming

📄 Abstract

Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify the attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. To explain this phenomenon, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. Within this framework, we analyze prompt-injection-based jailbreaking. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., a strong magnetic field, yield exponential scaling. We derive these behaviors analytically and confirm them empirically on large language models. The transition between the two regimes is due to the appearance of an ordered phase in the spin chain under a strong magnetic field, which suggests that the injected jailbreak prompt enhances adversarial order in the language model.
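To make the abstract's picture concrete, below is a minimal, hedged Python sketch, not the paper's model: cluster energies drawn at random, a small subset designated unsafe whose energies are raised by an assumed safety-alignment penalty, and an injected-prompt "field" h that lowers unsafe energies. Sampling n generations i.i.d. from the resulting Gibbs measure shows how a stronger field accelerates the growth of attack success rate (ASR) with n. The toy does not reproduce the paper's power-law regime, which relies on the heavy-tailed cluster weights of the replica-symmetry-breaking phase; all constants here are illustrative assumptions.

```python
# Toy Gibbs-measure sketch (illustrative assumptions, not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
K, beta = 1000, 2.0
energies = rng.normal(size=K)          # random cluster energies
unsafe = rng.random(K) < 0.01          # small subset of clusters designated unsafe
energies = energies + 4.0 * unsafe     # assumed safety-alignment penalty on unsafe clusters

def asr(n, h):
    """ASR for n i.i.d. Gibbs samples when an injected-prompt field h favors unsafe clusters."""
    logits = -beta * (energies - h * unsafe)            # the field lowers unsafe energies
    weights = np.exp(logits - logits.max())
    p_unsafe = weights[unsafe].sum() / weights.sum()    # per-sample unsafe probability
    return 1.0 - (1.0 - p_unsafe) ** n                  # at least one unsafe draw among n

for h in (0.0, 2.0, 5.0):              # no injection / weak / strong "field"
    row = "  ".join(f"n={n}: {asr(n, h):.3f}" for n in (1, 8, 64, 512))
    print(f"h={h:.1f}  {row}")
```

With these assumed constants, ASR barely moves with n at h = 0, rises slowly at moderate h, and approaches 1 within a few dozen samples at large h, qualitatively mirroring the weak-to-strong crossover described above.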

🔍 Key Points

  • The paper proposes a theoretical generative model, built on spin-glass theory, for understanding the adversarial prompt-injection vulnerability of large language models (LLMs).
  • It identifies two regimes of attack success rate (ASR) as a function of the number of inference-time samples: polynomial growth under weak (short) injected prompts and exponential growth under strong (long) injected prompts, i.e., a polynomial-exponential crossover in attack efficacy.
  • Experiments on several LLMs validate the scaling laws derived from the model and demonstrate the effectiveness of adversarial prompt injection against safety-aligned systems (a minimal fitting sketch follows this list).
  • The analysis shows that stronger prompt injections sharply increase the rate of successful jailbreaks, underscoring how adversarial attacks scale with inference-time sampling.
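One way the scaling-law claim could be checked in practice, sketched here under assumed functional forms rather than the paper's own fitting procedure: measure ASR as a function of the number of samples n and compare a power-law fit against an exponential-saturation fit. The data below is synthetic, and the forms c·n^α and 1 − exp(−λn) are assumptions chosen to match the qualitative description above.

```python
# Hedged sketch: distinguish power-law vs. exponential ASR scaling by fitting
# both forms to ASR-vs-n measurements and comparing residuals. Synthetic data.
import numpy as np
from scipy.optimize import curve_fit

n = np.array([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)
asr = np.array([0.02, 0.03, 0.05, 0.07, 0.10, 0.14, 0.20, 0.28])  # synthetic ASR

def power_law(n, c, alpha):          # weak-injection regime (assumed form)
    return c * n ** alpha

def exp_saturation(n, lam):          # strong-injection regime (assumed form)
    return 1.0 - np.exp(-lam * n)

p_pow, _ = curve_fit(power_law, n, asr, p0=[0.02, 0.5])
p_exp, _ = curve_fit(exp_saturation, n, asr, p0=[0.01])
rss_pow = np.sum((asr - power_law(n, *p_pow)) ** 2)
rss_exp = np.sum((asr - exp_saturation(n, *p_exp)) ** 2)
print(f"power law:   c={p_pow[0]:.3f}, alpha={p_pow[1]:.2f}, RSS={rss_pow:.2e}")
print(f"exponential: lam={p_exp[0]:.4f},              RSS={rss_exp:.2e}")
```

Whichever form leaves the smaller residual on a given injection strength indicates which regime that attack configuration sits in.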

💡 Why This Paper Matters

This paper addresses the growing concern over adversarial attacks on safety-aligned large language models. By pairing a theoretical model with empirical validation, it deepens our understanding of how language models can be steered toward unsafe behavior through prompt injection and inference-time sampling, and it highlights vulnerabilities inherent in current safety alignment.

🎯 Why It's Interesting for AI Security Researchers

The findings are directly relevant to AI security researchers: they expose the mechanism by which prompt injection reshapes how attack success scales with inference-time samples, and they supply analytical tools for assessing, and ultimately mitigating, this class of vulnerability. Understanding these dynamics is essential for building AI systems that resist adversarial manipulation.

📚 Read the Full Paper