Bag of Tricks for Subverting Reasoning-based Safety Guardrails

Authors: Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si, Jingpei Wu, Philip Torr, Volker Tresp, Jindong Gu

Published: 2025-10-13

arXiv ID: 2510.11570v1

Added to Library: 2025-10-14 04:02 UTC

Red Teaming

📄 Abstract

Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs' reasoning ability, these guardrails help the models assess the safety of user inputs before generating final responses: the model analyzes the intent of the input query and refuses to assist once it detects harmful intent hidden by jailbreak methods. Such guardrails provide a significant boost in defense, for example near-perfect refusal rates on the open-source gpt-oss series. Unfortunately, we find that these reasoning-based guardrails can be extremely vulnerable to subtle manipulation of the input prompts and, once hijacked, can lead to even more harmful results. Specifically, we first uncover a surprisingly fragile aspect of these guardrails: simply adding a few template tokens to the input prompt can bypass the seemingly powerful guardrails and elicit explicit, harmful responses. To explore further, we introduce a bag of jailbreak methods that subvert reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization. Along with their potential for scalable implementation, these methods achieve alarmingly high attack success rates (e.g., exceeding 90% across 5 different benchmarks on the gpt-oss series, for both locally hosted models and online API services). Evaluations across various leading open-source LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment techniques for open-source LRMs to prevent malicious misuse. Code is open-sourced at https://chenxshuo.github.io/bag-of-tricks.

🔍 Key Points

  • The paper exposes severe vulnerabilities in reasoning-based safety guardrails for Large Reasoning Models (LRMs), showing they can be subverted by manipulations ranging from simple template edits to fully automated optimization.
  • The authors introduce four novel jailbreak methods (Structural CoT Bypass, Fake Over-Refusal, Coercive Optimization, and Reasoning Hijack), each capable of bypassing guardrails and eliciting harmful responses with attack success rates exceeding 90%.
  • The study demonstrates the systemic nature of these vulnerabilities across several open-source models, indicating that simply improving reasoning capabilities does not inherently fortify model safety.
  • The experimental results underscore the urgent need for more robust alignment techniques in LRMs to mitigate the risks associated with adversarial input prompts.
  • By providing open-sourced code and detailed insights on exploiting LRMs, the research promotes further investigation into model safety mechanisms and their shortcomings.

💡 Why This Paper Matters

This paper is critically relevant as it underscores significant security risks in widely used LRM safety mechanisms, presenting a clear call to action for researchers to develop improved alignment techniques. It drives home the message that current defenses, while advanced, are still vulnerable, and this research lays the groundwork for future studies aiming to enhance the safety and reliability of AI systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly useful because it not only identifies weaknesses in existing safety measures but also provides concrete methodologies for testing these vulnerabilities. The insights gathered from the proposed jailbreak methods may influence future designs of defensive models and prompt further exploration into the robustness of AI safety mechanisms against adversarial attacks.

📚 Read the Full Paper