The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

Authors: Yonghong Deng, Zhen Yang, Ping Jian, Xinyue Zhang, Zhongbin Guo, Chengzhi Li

Published: 2026-03-09

arXiv ID: 2603.08234v1

Added to Library: 2026-03-10 04:00 UTC

Red Teaming

📄 Abstract

With the rapid advancement of large language models (LLMs), the safety of LLMs has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreaking attacks. However, the root causes of such vulnerabilities are still poorly understood, necessitating a rigorous investigation into jailbreak mechanisms across both academic and industrial communities. In this work, we focus on a continuation-triggered jailbreak phenomenon, whereby simply relocating a continuation-triggered instruction suffix can substantially increase jailbreak success rates. To uncover the intrinsic mechanisms of this phenomenon, we conduct a comprehensive mechanistic interpretability analysis at the level of attention heads. Through causal interventions and activation scaling, we show that this jailbreak behavior primarily arises from an inherent competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training. Furthermore, we perform a detailed behavioral analysis of the identified safety-critical attention heads, revealing notable differences in the functions and behaviors of safety heads across different model architectures. These findings provide a novel mechanistic perspective for understanding and interpreting jailbreak behaviors in LLMs, offering both theoretical insights and practical implications for improving model safety.

🔍 Key Points

  • The paper investigates a specific jailbreak method in large language models (LLMs), the continuation-triggered jailbreak, in which an instruction suffix is strategically relocated to bypass safety mechanisms.
  • Using mechanistic interpretability methods such as path patching and activation scaling, it identifies critical attention heads involved in the competition between continuation generation and safety enforcement.
  • The study categorizes attention heads into safety heads and continuation heads, providing insights into their distinct roles in controlling model outputs and their interaction dynamics during jailbreak attempts.
  • Experiments demonstrate a clear increase in the Attack Success Rate (ASR) when the continuation-triggering suffix is relocated, exposing vulnerabilities shared across model architectures.
  • Results emphasize the need for a deeper understanding of model internals to design better safety mechanisms in LLMs, so that safety features are aligned with, rather than overridden by, the model's generative drives.
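The activation-scaling intervention mentioned above can be illustrated with a minimal NumPy sketch. It assumes a simplified view in which each attention head's output contributes additively to the layer's combined output, so zeroing (or rescaling) one head removes exactly that head's contribution. All names, shapes, and the choice of which head is a "safety head" are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def combine_heads(head_outputs, head_scales):
    """Combine per-head outputs, scaling each head's contribution.

    head_outputs: (n_heads, d_model) array of per-head output vectors
    head_scales:  (n_heads,) multipliers; 1.0 = unchanged, 0.0 = ablated
    """
    return (head_outputs * head_scales[:, None]).sum(axis=0)

rng = np.random.default_rng(0)
n_heads, d_model = 4, 8
head_outputs = rng.normal(size=(n_heads, d_model))

# Baseline: all heads contribute normally
baseline = combine_heads(head_outputs, np.ones(n_heads))

# Intervention: ablate a hypothetical "safety head" (index 2) by scaling to 0
scales = np.ones(n_heads)
scales[2] = 0.0
ablated = combine_heads(head_outputs, scales)

# Under the additive assumption, the change in the combined output
# is exactly the ablated head's original contribution
delta = baseline - ablated
assert np.allclose(delta, head_outputs[2])
```

In a real intervention study, the scale would be applied at a hook inside the model during a forward pass, and the downstream effect on the refusal versus continuation behavior would be measured rather than asserted algebraically.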

💡 Why This Paper Matters

This paper provides a crucial exploration into the underlying mechanics of jailbreak vulnerabilities in large language models. By identifying and analyzing the competition between safety and continuation mechanisms, it lays foundational insights for developing more robust safety frameworks. The findings are significant as they not only highlight safety risks in current AI models but also offer pathways for improving their defenses against malicious exploitation.

🎯 Why It's Interesting for AI Security Researchers

Researchers in AI security will find this paper highly relevant as it uncovers the nuanced interactions that lead to security vulnerabilities in language models. By elucidating the mechanistic dynamics of attention heads under jailbreaking attacks, it presents actionable insights for enhancing model robustness, which is critical for the deployment of AI systems in sensitive applications. The work also sets a precedent for future studies aiming to strengthen safety protocols and reduce the risks associated with malicious use of AI.
