Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models

Authors: Tiansheng Huang, Virat Shejwalkar, Oscar Chang, Milad Nasr, Ling Liu

Published: 2025-11-12

arXiv ID: 2511.09682v1

Added to Library: 2025-11-14 23:01 UTC

Red Teaming

📄 Abstract

Instilling reasoning capabilities in large models (LMs) using reasoning training (RT) significantly improves their performance. Thus, Audio Reasoning Models (ARMs), i.e., audio LMs that can reason, are becoming increasingly popular. However, no prior work has studied the safety of ARMs against jailbreak attacks that aim to elicit harmful responses from target models. To this end, we first show that standard RT with appropriate safety reasoning data can protect ARMs from vanilla audio jailbreaks, but cannot protect them against our proposed simple yet effective jailbreaks. We show that this is because of the significant representation drift between vanilla and advanced jailbreaks, which forces the target ARMs to emit harmful responses. Based on this observation, we propose Rebellion, a robust RT method that trains ARMs to withstand the worst-case representation drift. All our results are on Qwen2-Audio; they demonstrate that Rebellion: 1) can protect against advanced audio jailbreaks without compromising performance on benign tasks, and 2) significantly improves the accuracy-safety trade-off over the standard RT method.
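
The abstract describes training against worst-case representation drift but does not reproduce the objective here. Below is a minimal sketch, assuming a PyTorch-style ARM that exposes its audio representation, of what such a robust RT step could look like: an inner loop finds the loss-maximizing perturbation of the representation within a small budget, and the outer step minimizes the reasoning loss under that perturbation. The method names (`encode_audio`, `loss_from_representation`), the `epsilon` budget, and the inner-loop hyperparameters are illustrative assumptions, not the authors' implementation or Qwen2-Audio's API.

```python
# Sketch (not the authors' code) of reasoning training made robust to
# worst-case representation drift, in the spirit of adversarial training
# applied to the audio representation.
import torch

def robust_rt_step(model, batch, optimizer, epsilon=1e-2, inner_steps=3, inner_lr=1e-2):
    """One training step: find a worst-case perturbation of the audio
    representation within an L2 ball of radius `epsilon`, then minimize the
    supervised reasoning loss under that perturbation."""
    # Hypothetical interface: the model exposes its audio representation and
    # computes the next-token loss on the (safety or benign) reasoning target.
    with torch.no_grad():
        rep0 = model.encode_audio(batch["audio"])          # [B, T, D] reference representation
    delta = torch.zeros_like(rep0, requires_grad=True)     # perturbation to optimize

    # Inner maximization: ascend the loss w.r.t. the perturbation (worst-case drift).
    for _ in range(inner_steps):
        loss = model.loss_from_representation(rep0 + delta, batch["target_ids"])
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad
            # Project each frame back onto the epsilon ball so the drift stays bounded.
            norm = delta.norm(dim=-1, keepdim=True).clamp(min=1e-12)
            delta *= norm.clamp(max=epsilon) / norm

    # Outer minimization: standard reasoning-training update under the worst-case drift.
    optimizer.zero_grad()
    rep = model.encode_audio(batch["audio"])                # recompute with gradients enabled
    loss = model.loss_from_representation(rep + delta.detach(), batch["target_ids"])
    loss.backward()
    optimizer.step()
    return loss.item()
```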

🔍 Key Points

  • Introduction of Rebellion, a novel reasoning training method that enhances audio reasoning models (ARMs) by making them robust against sophisticated audio jailbreak attacks.
  • Identification of the vulnerability of standard reasoning training (RT) to advanced jailbreaks: significant representation drift lets these attacks bypass safety guardrails and elicit harmful responses.
  • Rigorous experimental validation showing that Rebellion maintains high performance on benign tasks while significantly reducing harmful outputs in the presence of audio jailbreaks, demonstrating a strong safety-accuracy trade-off.
  • Discovery of a 'think twice' behavior in Rebellion-trained ARMs, indicating an internal safety check mechanism that leads to correct refusal of harmful queries despite initial compliance triggered by jailbreaks.
  • Establishment of a dual-dataset approach, using both safety and benign reasoning data for training, thus preserving general reasoning capability while instilling safety reasoning (see the sketch after this list).
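
As a companion to the last point, the dual-dataset setup can be pictured as a simple mixture of safety and benign reasoning examples. The function below is an illustrative assumption about the data pipeline, including the mixing ratio, and is not the paper's released code.

```python
# Illustrative sketch of the dual-dataset mixture: combine safety reasoning
# examples with benign reasoning examples at a fixed ratio so the model learns
# safety reasoning without losing benign-task skill.
import random

def build_mixed_dataset(safety_examples, benign_examples, safety_ratio=0.5, seed=0):
    """Return a shuffled list mixing safety and benign reasoning examples.

    `safety_ratio` is the target fraction of safety examples in the mix;
    the exact value used by the paper is not assumed here.
    """
    rng = random.Random(seed)
    n_safety = int(len(benign_examples) * safety_ratio / (1.0 - safety_ratio))
    sampled_safety = rng.sample(safety_examples, min(n_safety, len(safety_examples)))
    mixed = list(benign_examples) + sampled_safety
    rng.shuffle(mixed)
    return mixed
```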

💡 Why This Paper Matters

This paper is significant as it addresses a critical gap in the safety of audio reasoning models when exposed to sophisticated attacks. By proposing Rebellion, it not only provides a practical solution for enhancing the security of ARMs but also contributes to the broader discourse on robustness and safety in AI models, which is increasingly essential as these systems are deployed in sensitive applications.

🎯 Why It's Interesting for AI Security Researchers

The findings presented in this paper are especially relevant to AI security researchers, as they highlight vulnerabilities in existing audio reasoning models and propose effective countermeasures against emerging threat vectors such as advanced audio jailbreaks. With the increasing reliance on AI for decision-making across various sectors, understanding and mitigating such risks is crucial for developing safe and reliable AI systems.

📚 Read the Full Paper