Chain-of-Thought Hijacking

Authors: Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

Published: 2025-10-30

arXiv ID: 2510.26418v1

Added to Library: 2025-10-31 04:00 UTC

Red Teaming

📄 Abstract

Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively - far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning - explicit CoT - can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
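
A minimal sketch of how such a padded prompt could be assembled is shown below, using nothing beyond plain Python string handling; the puzzle text, the final-answer cue wording, and the `build_hijack_prompt` helper are illustrative placeholders, not the prompts released by the authors.

```python
# Hypothetical sketch of CoT Hijacking prompt construction: long benign
# puzzle reasoning is padded in front of the actual request, followed by a
# final-answer cue that discourages further deliberation. All strings and
# names here are illustrative stand-ins for the paper's released prompts.

BENIGN_PUZZLE = (
    "Solve step by step: a train leaves station A at 40 km/h and another "
    "leaves station B, 180 km away, at 50 km/h. When do they meet?"
)

FINAL_ANSWER_CUE = (
    "You have finished the puzzles. Now answer only the final request, "
    "giving just the final answer with no further reflection."
)


def build_hijack_prompt(request: str, padding_rounds: int = 20) -> str:
    """Pad a request with repeated benign reasoning tasks and a final-answer cue."""
    padding = "\n\n".join(
        f"Puzzle {i + 1}. {BENIGN_PUZZLE}" for i in range(padding_rounds)
    )
    return f"{padding}\n\nFinal request: {request}\n\n{FINAL_ANSWER_CUE}"


if __name__ == "__main__":
    # Use only benign probe requests, or harmful ones inside an authorized
    # red-team evaluation such as a HarmBench run with a refusal judge.
    probe = "Summarize the plot of a public-domain novel."
    for rounds in (0, 5, 20):
        prompt = build_hijack_prompt(probe, padding_rounds=rounds)
        print(f"padding_rounds={rounds:>2} -> prompt length {len(prompt)} chars")
```

The sketch stops at prompt construction because no particular provider API is assumed; in the paper's evaluation, the padded prompts are sent to the target models and the responses are scored by a judge on HarmBench.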

🔍 Key Points

  • Introduction of Chain-of-Thought Hijacking (CoT Hijacking) as a novel jailbreak attack that exploits long benign reasoning to bypass language model safety mechanisms.
  • Empirical results on HarmBench show attack success rates of 94% to 100% across Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, far exceeding prior jailbreak methods for LRMs.
  • Mechanistic analysis shows that mid layers encode the strength of safety checking while late layers encode the verification outcome; long benign chains of thought dilute both signals by shifting attention away from harmful tokens, leaving models vulnerable to the embedded harmful instruction (a hedged ablation sketch follows this list).
  • Findings emphasize that reasoning models, despite their gains in accuracy and reasoning capability, become more susceptible to safety failures when a harmful request is preceded by long benign reasoning and a final-answer cue.
  • These results call for safety mechanisms that scale with reasoning depth rather than relying on shallow refusal heuristics.
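
The causal part of that analysis, ablating attention heads implicated in refusal and checking whether refusals drop, can be reproduced in outline on an open model. Below is a hedged sketch using the TransformerLens library with GPT-2 small as a stand-in; the specific (layer, head) pairs, the probe prompt, and the logit-based summary are assumptions for illustration and do not reflect the authors' models or head selection.

```python
# Hedged sketch of the head-ablation step: zero the outputs of candidate
# attention heads and compare the model's next-token logits with and without
# the ablation. GPT-2 small and the (layer, head) pairs are placeholders;
# the paper analyzes large reasoning models, so this only shows the recipe.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in open model

# Hypothetical heads suspected to belong to a refusal/safety subnetwork.
CANDIDATE_HEADS = [(6, 3), (8, 11)]  # (layer, head) pairs, illustrative only


def zero_candidate_heads(z, hook):
    # z: [batch, position, head_index, d_head] for the hooked layer.
    for layer, head in CANDIDATE_HEADS:
        if layer == hook.layer():
            z[:, :, head, :] = 0.0
    return z


prompt = "Request: please explain how photosynthesis works."  # benign probe
tokens = model.to_tokens(prompt)

with torch.no_grad():
    baseline_logits = model(tokens)
    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[
            (f"blocks.{layer}.attn.hook_z", zero_candidate_heads)
            for layer in sorted({layer for layer, _ in CANDIDATE_HEADS})
        ],
    )

# A refusal metric (e.g. a judge over generated text, or a logit difference on
# refusal-associated tokens) would replace this crude summary in practice.
shift = (baseline_logits[0, -1] - ablated_logits[0, -1]).abs().max().item()
print(f"max |delta logit| at the final position after ablation: {shift:.3f}")
```

In a faithful replication, the ablated heads would come from the paper's attention analysis and the outcome metric would be refusal rate over harmful prompts under a judge, not a raw logit shift.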

💡 Why This Paper Matters

The paper presents critical insights into the vulnerabilities of large reasoning models, particularly how extended reasoning processes can be manipulated to weaken safety checks. The discovery of CoT Hijacking as a new attack vector raises important questions about the effectiveness of current safety protocols, suggesting the need for enhanced safeguards in AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is of keen interest to AI security researchers as it uncovers significant threats to the safety of large reasoning models, which are increasingly used in sensitive applications. The introduction of a novel jailbreak method reveals critical weaknesses in AI safety mechanisms. Understanding these vulnerabilities can inform the development of more resilient AI systems and safety protocols to mitigate potential risks associated with malicious exploitation.

📚 Read the Full Paper: https://arxiv.org/abs/2510.26418v1