
Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

Authors: Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang

Published: 2025-08-05

arXiv ID: 2508.03054v1

Added to Library: 2025-08-14 23:04 UTC

📄 Abstract

Defending large language models (LLMs) against jailbreak attacks is essential for their safe and reliable deployment. Existing defenses often rely on shallow pattern matching, which struggles to generalize to novel and unseen attack strategies. To address this challenge, we propose the Cognitive-Driven Defense (CDD) framework, which targets the underlying structure of jailbreak prompts by applying meta-operations, defined as basic manipulations that conceal harmful intent. CDD emulates human cognitive reasoning through a structured reasoning chain: it begins with a global perception of the prompt and follows with a localized analysis to uncover hidden manipulations. By applying supervised fine-tuning on this structured chain, the model learns to identify and reason about known manipulation patterns. To enhance generalization to unseen threats, an entropy-guided reinforcement learning algorithm (EG-GRPO) is introduced to encourage exploration of new types and variants of meta-operations. Experiments demonstrate that CDD achieves state-of-the-art defense performance and exhibits strong generalization to unseen jailbreak attacks.
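The following is a minimal illustrative sketch, based only on the description above, of what a two-stage "global perception then localized analysis" reasoning chain could look like in practice. The meta-operation list, the prompt templates, and the `llm_generate` helper are hypothetical stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch only: the abstract gives no concrete prompts or model
# interface, so the meta-operation names, prompt wording, and `llm_generate`
# below are hypothetical placeholders.

from dataclasses import dataclass

# Example meta-operations: basic manipulations used to conceal harmful intent.
META_OPERATIONS = [
    "role_play",
    "encoding_obfuscation",
    "payload_splitting",
    "hypothetical_framing",
]

@dataclass
class DefenseVerdict:
    global_assessment: str          # coarse judgement of the whole prompt
    detected_operations: list[str]  # meta-operations flagged during local analysis
    is_jailbreak: bool

def llm_generate(prompt: str) -> str:
    """Hypothetical call to the defended LLM; replace with a real client."""
    raise NotImplementedError

def cognitive_defense(user_prompt: str) -> DefenseVerdict:
    # Stage 1: global perception of the prompt as a whole.
    global_view = llm_generate(
        f"Summarize the overall intent of this prompt:\n{user_prompt}"
    )
    # Stage 2: localized analysis, probing the prompt for each known meta-operation.
    detected = []
    for op in META_OPERATIONS:
        answer = llm_generate(
            f"Does any part of the prompt use the manipulation '{op}'? "
            f"Answer yes or no.\nPrompt:\n{user_prompt}"
        )
        if answer.strip().lower().startswith("yes"):
            detected.append(op)
    return DefenseVerdict(
        global_assessment=global_view,
        detected_operations=detected,
        is_jailbreak=bool(detected),
    )
```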

🔍 Key Points

  • Proposes the Cognitive-Driven Defense (CDD) framework, which targets the underlying structure of jailbreak prompts via meta-operations, defined as basic manipulations used to conceal harmful intent.
  • Emulates human cognitive reasoning through a structured reasoning chain: a global perception of the full prompt followed by localized analysis that uncovers hidden manipulations.
  • Applies supervised fine-tuning on this structured reasoning chain so the model learns to identify and reason about known manipulation patterns.
  • Introduces an entropy-guided reinforcement learning algorithm (EG-GRPO) to encourage exploration of new types and variants of meta-operations, improving generalization to unseen threats (see the sketch after this list).
  • Reports state-of-the-art defense performance and strong generalization to unseen jailbreak attacks.
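The abstract does not spell out the EG-GRPO objective, so the sketch below only illustrates the generic idea of combining a group-relative advantage (as in GRPO-style algorithms) with an entropy bonus that rewards exploring diverse meta-operation hypotheses. The reward shape, the `entropy_weight` value, and the array layouts are assumptions.

```python
# Generic sketch of entropy-guided, group-relative reward shaping; not the
# paper's EG-GRPO definition. All constants and shapes here are assumptions.

import numpy as np

def entropy(probs: np.ndarray) -> float:
    """Shannon entropy of a categorical distribution over meta-operation labels."""
    probs = np.clip(probs, 1e-12, 1.0)
    return float(-(probs * np.log(probs)).sum())

def eg_grpo_advantages(task_rewards: np.ndarray,
                       op_label_probs: np.ndarray,
                       entropy_weight: float = 0.1) -> np.ndarray:
    """
    task_rewards:   per-rollout defense rewards for one group, shape (G,)
    op_label_probs: per-rollout distributions over predicted meta-operation types, shape (G, K)
    Returns group-relative advantages with an entropy bonus that encourages
    exploring new types and variants of meta-operations.
    """
    bonuses = np.array([entropy(p) for p in op_label_probs])
    shaped = task_rewards + entropy_weight * bonuses
    # Group-relative normalization, as in GRPO-style methods.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)
```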

💡 Why This Paper Matters

This paper reframes jailbreak defense as a reasoning problem rather than surface-level pattern matching. By training a model to analyze how harmful intent is concealed, through meta-operations, instead of matching the specific wording of known attacks, CDD points toward defenses that generalize to novel jailbreak strategies rather than merely memorizing past ones.

🎯 Why It's Interesting for AI Security Researchers

Jailbreak techniques evolve quickly, and defenses built on shallow pattern matching tend to lag behind newly crafted prompts. A framework that reasons about the structural manipulations underlying jailbreaks, and that explicitly uses entropy-guided reinforcement learning to explore unseen attack variants, is directly relevant to researchers working on robust LLM safety alignment, guardrail design, and red-teaming.

📚 Read the Full Paper