← Back to Library

Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Authors: Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia

Published: 2025-07-03

arXiv ID: 2507.02799v1

Added to Library: 2025-07-04 04:01 UTC

Red Teaming

📄 Abstract

Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

🔍 Key Points

  • Introduces the CLEAR-Bias benchmark for assessing bias elicitation in Reasoning Language Models (RLMs), enabling a systematic evaluation of model robustness against adversarial attacks.
  • Investigates the impact of reasoning mechanisms (Chain-of-Thought prompting vs. fine-tuned reasoning) on bias safety, revealing that reasoning may paradoxically exacerbate bias in models.
  • Presents a nuanced analysis of various jailbreak techniques, demonstrating differing vulnerabilities of models depending on their reasoning architecture.
  • Finds that models with explicit reasoning are generally more vulnerable to bias elicitation than base models, challenging the assumption that reasoning enhances safety.
  • Offers key insights for improving model architecture and training by highlighting the need for bias-aware approaches in reasoning design.

💡 Why This Paper Matters

This paper is crucial as it deepens our understanding of how reasoning capabilities in language models can unintentionally influence bias and safety. The use of the CLEAR-Bias benchmark provides a novel framework for measuring these effects, which is integral for developing more responsible AI technologies that are aligned with ethical principles.

🎯 Why It's Interesting for AI Security Researchers

This paper is especially relevant to AI security researchers as it outlines vulnerabilities in modern language models, highlighting how reasoning mechanisms can create new pathways for bias to emerge. Understanding these dynamics is critical for developing robust defense strategies and ensuring safe deployment of AI systems in sensitive domains.

📚 Read the Full Paper