
Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Authors: Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu

Published: 2025-07-30

arXiv ID: 2507.22564v1

Added to Library: 2025-07-31 04:00 UTC

Red Teaming Safety

📄 Abstract

Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
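
As a rough illustration of what a "multi-bias combination" could look like in code, here is a minimal Python sketch, assuming a toy catalogue of bias framings. The bias names, framing strings, and the `BiasCombination` / `enumerate_combinations` helpers are hypothetical illustrations and are not taken from the paper, which optimizes the choice of combinations with fine-tuned models rather than enumerating them.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical catalogue of cognitive biases; the paper's actual taxonomy may differ.
BIAS_FRAMINGS = {
    "authority": "Frame the request as coming from a recognized expert.",
    "conformity": "Suggest that most people in the scenario already agree.",
    "anchoring": "Lead with an initial reference point that shapes later judgment.",
}

@dataclass
class BiasCombination:
    """An ordered set of cognitive biases to be woven into a single prompt."""
    biases: tuple[str, ...]

    def framing_instructions(self) -> str:
        # Concatenate the framing hint for each selected bias.
        return " ".join(BIAS_FRAMINGS[b] for b in self.biases)

def enumerate_combinations(max_size: int = 2):
    """Enumerate candidate multi-bias combinations (shown here as exhaustive
    enumeration; the paper instead learns which combinations to use via SFT + RL)."""
    names = list(BIAS_FRAMINGS)
    for k in range(1, max_size + 1):
        for combo in combinations(names, k):
            yield BiasCombination(combo)

if __name__ == "__main__":
    for c in enumerate_combinations():
        print(c.biases, "->", c.framing_instructions())
```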

🔍 Key Points

  • The CognitiveAttack framework leverages multi-bias interactions to generate adversarial prompts that bypass LLM safety protocols, significantly outperforming state-of-the-art (SOTA) methods.
  • The framework integrates supervised fine-tuning and reinforcement learning to optimize combinations of cognitive biases, demonstrating that synergistic bias interactions yield high attack success rates (a toy sketch of such a judge-based reward follows this list).
  • Extensive experiments across 30 diverse LLMs reveal broad vulnerabilities, especially in open-source models, with CognitiveAttack achieving an attack success rate of 60.1% versus 31.6% for the SOTA black-box method PAP.
  • The work bridges cognitive science and AI safety, highlighting critical gaps in current defenses against bias-driven attacks and revealing deeper, cognition-level flaws in models that remain to be addressed.
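
The reinforcement-learning stage mentioned above presumably rewards prompts that elicit non-refusals from the target model. Below is a minimal, hypothetical sketch of such a judge-based reward and the attack-success-rate (ASR) metric; the `target_model` and `judge_refuses` placeholders and the 0/1 reward shaping are assumptions for illustration, not the paper's actual implementation.

```python
import random

# Hypothetical stand-ins for the pipeline's components: a target LLM call and a
# refusal judge. A real implementation would wrap actual model APIs and a
# stronger safety classifier.
def target_model(prompt: str) -> str:
    """Placeholder target LLM; returns a canned refusal or compliance string."""
    return random.choice(["I can't help with that.", "Sure, here is ..."])

def judge_refuses(response: str) -> bool:
    """Toy refusal check based on common refusal prefixes."""
    refusal_markers = ("i can't", "i cannot", "i'm sorry")
    return response.lower().startswith(refusal_markers)

def reward(prompt: str) -> float:
    """RL-style reward: 1.0 if the target's safeguards are bypassed, else 0.0."""
    return 0.0 if judge_refuses(target_model(prompt)) else 1.0

def attack_success_rate(prompts: list[str]) -> float:
    """Fraction of prompts that elicit a non-refusal, i.e., the ASR metric."""
    return sum(reward(p) for p in prompts) / max(len(prompts), 1)

if __name__ == "__main__":
    print(attack_success_rate(["example prompt 1", "example prompt 2"]))
```

In practice such a scalar reward would be fed back to the prompt-generating model during the RL stage, so that bias combinations which more reliably bypass the target's safeguards are reinforced.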

💡 Why This Paper Matters

This paper matters because it uncovers a novel class of adversarial attacks on large language models based on cognitive biases, a largely unexplored area that combines insights from psychology and AI. By systematically evaluating the vulnerabilities induced by multi-bias interactions, it both demonstrates significant weaknesses in current LLMs and provides a red-teaming framework that can inform stronger AI safety mechanisms.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers because it offers a new perspective on the intersection of cognitive psychology and machine learning vulnerabilities. As LLMs are increasingly deployed in sensitive, high-stakes environments, understanding and mitigating cognitively grounded vulnerabilities becomes crucial to building safer and more robust AI systems. The findings can guide future work on strengthening model defenses and improving the alignment of language models with ethical guidelines.

📚 Read the Full Paper