
Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

Authors: Guoqing Luo, Iffat Maab, Lili Mou, Junichi Yamagishi

Published: 2025-10-20

arXiv ID: 2510.17062v1

Added to Library: 2025-11-14 23:09 UTC

📄 Abstract

While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged: this thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate the mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question-answering (BBQ and StereoSet) and open-ended generation (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
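
To make the mitigation concrete, below is a minimal Python sketch of a prompt-based self-review pass. It assumes a generic `generate` callable that wraps whatever reasoning model is being evaluated; the prompt wording, function names, and two-pass structure are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# Minimal sketch of a prompt-based self-review mitigation (illustrative only).
# `generate` is any callable that sends a prompt to a reasoning LLM and returns
# its text response; swap in a real API client as needed.
from typing import Callable

REVIEW_TEMPLATE = """You previously answered a question with the reasoning below.

Question:
{question}

Your initial reasoning and answer:
{first_pass}

Review your reasoning for two failure patterns:
1. Stereotype repetition: relying on a social stereotype as the main justification.
2. Irrelevant information injection: introducing details not present in the question.

If either pattern is present, discard the affected reasoning and answer again using
only the information given. Otherwise, keep your answer. Reply with the final answer.
"""


def answer_with_self_review(question: str, generate: Callable[[str], str]) -> str:
    """Two-pass inference: answer first, then ask the model to audit its own reasoning."""
    first_pass = generate(question)  # pass 1: ordinary answer with reasoning
    review_prompt = REVIEW_TEMPLATE.format(question=question, first_pass=first_pass)
    return generate(review_prompt)   # pass 2: self-review against the failure patterns


if __name__ == "__main__":
    # Toy stand-in for a real model call, so the sketch runs end to end.
    def fake_generate(prompt: str) -> str:
        return "Cannot be determined from the information given."

    q = ("A software engineer and a nurse walked into the meeting. "
         "Who is bad at math? (a) the nurse (b) the engineer (c) cannot be determined")
    print(answer_with_self_review(q, fake_generate))
```

In practice the second pass adds one extra model call per query and requires no retraining, which is what makes this kind of prompt-based review lightweight.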

🔍 Key Points

  • Systematically investigates the internal thinking process of reasoning-based large language models in social bias scenarios, an aspect of their behaviour that has remained underexplored.
  • Uncovers two failure patterns that drive social bias aggregation: stereotype repetition, where the model relies on social stereotypes as its primary justification, and irrelevant information injection, where it fabricates or introduces new details to support a biased narrative.
  • Introduces a lightweight prompt-based mitigation that asks the model to review its own initial reasoning against these specific failure patterns.
  • Evaluates the approach on question-answering (BBQ and StereoSet) and open-ended generation (BOLD) benchmarks, showing that it reduces bias while maintaining or improving accuracy.

💡 Why This Paper Matters

This paper moves beyond measuring biased outputs to examining the thinking process that produces them, showing that the internal reasoning credited for strong performance on complex tasks can itself aggregate social stereotypes. Naming stereotype repetition and irrelevant information injection as concrete failure patterns gives practitioners specific behaviours to look for in reasoning traces, and the accompanying prompt-based self-review mitigation reduces bias on BBQ, StereoSet, and BOLD while maintaining or improving accuracy, making it a lightweight, inference-time remedy.

🎯 Why It's Interesting for AI Security Researchers

For AI safety and security researchers, the work highlights that the reasoning trace itself, not just the final output, is a place where harmful behaviour can accumulate and should be audited. The two identified failure patterns provide concrete signals to look for when inspecting model thinking, and the self-review mitigation demonstrates that a model can be prompted to check its own intermediate reasoning, a strategy that may extend to other undesirable thinking behaviours as reasoning models are deployed in sensitive applications.

📚 Read the Full Paper

https://arxiv.org/abs/2510.17062v1