SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Authors: Yu Pan, Wenlong Yu, Tiejun Wu, Xiaohu Ye, Qiannan Si, Guangquan Xu, Bin Wu

Published: 2026-03-16

arXiv ID: 2603.15397v1

Added to Library: 2026-03-17 04:00 UTC

📄 Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. Existing defense mechanisms typically rely on post hoc filtering applied only to the final output, leaving intermediate reasoning steps unmonitored and vulnerable to adversarial manipulation. To address this gap, this paper proposes SaFer Chain-of-Thought (SFCoT), a framework that proactively evaluates and calibrates potentially unsafe reasoning steps in real time. SFCoT incorporates a three-tier safety scoring system alongside a multi-perspective consistency verification mechanism, designed to detect potential risks throughout the reasoning process. A dynamic intervention module subsequently performs targeted calibration to redirect reasoning trajectories toward safe outcomes. Experimental results show that SFCoT reduces the attack success rate from 58.97% to 12.31%, establishing it as an effective and efficient LLM safety enhancement method without a significant decline in general performance.
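
As a rough illustration of the pipeline the abstract describes, the sketch below scores each intermediate reasoning step on a three-tier scale, re-verifies borderline steps with a consistency vote, and calibrates anything judged unsafe before reasoning continues. Every component here is a hypothetical placeholder (a keyword scorer, a rescoring vote, a canned rewrite); the paper's actual scorer, perspectives, and intervention mechanism are not reproduced here.

```python
# Toy sketch of the SFCoT control loop: score -> verify -> calibrate.
# All components are hypothetical stand-ins, not the authors' implementation.
from enum import Enum


class SafetyTier(Enum):
    SAFE = 0        # step proceeds unchanged
    SUSPICIOUS = 1  # step is re-verified from multiple perspectives
    UNSAFE = 2      # step is calibrated before reasoning continues


UNSAFE_MARKERS = ("bypass", "exploit", "weapon")   # toy lexicon
SUSPICIOUS_MARKERS = ("workaround", "loophole")    # toy lexicon


def safety_score(step: str) -> SafetyTier:
    """Three-tier scoring of one reasoning step (keyword heuristic here;
    the paper presumably uses a learned safety judge)."""
    text = step.lower()
    if any(m in text for m in UNSAFE_MARKERS):
        return SafetyTier.UNSAFE
    if any(m in text for m in SUSPICIOUS_MARKERS):
        return SafetyTier.SUSPICIOUS
    return SafetyTier.SAFE


def consistency_check(step: str, n_perspectives: int = 3) -> bool:
    """Multi-perspective verification: judge the step several times and
    require a majority 'safe' vote. Stubbed by rescoring with the same
    judge; a real system would vary the evaluation perspective."""
    votes = sum(safety_score(step) is SafetyTier.SAFE
                for _ in range(n_perspectives))
    return votes > n_perspectives // 2


def calibrate_step(step: str) -> str:
    """Dynamic intervention: redirect the step toward a safe trajectory.
    A real system would regenerate the step under a safety constraint."""
    return "[calibrated] Decline the unsafe direction; continue safely."


def sfcot_filter(reasoning_steps: list[str]) -> list[str]:
    """Evaluate and calibrate each intermediate step in real time."""
    safe_trace = []
    for step in reasoning_steps:
        tier = safety_score(step)
        if tier is SafetyTier.SUSPICIOUS and not consistency_check(step):
            tier = SafetyTier.UNSAFE
        safe_trace.append(
            calibrate_step(step) if tier is SafetyTier.UNSAFE else step)
    return safe_trace


if __name__ == "__main__":
    trace = [
        "Step 1: restate the user's question.",
        "Step 2: look for a workaround to the content filter.",
        "Step 3: explain how to bypass the guardrail.",
    ]
    print("\n".join(sfcot_filter(trace)))
```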

🔍 Key Points

  • Identifies a blind spot in existing defenses: post hoc filtering inspects only the final output, leaving intermediate chain-of-thought steps unmonitored and open to adversarial manipulation.
  • Proposes SFCoT, a framework that proactively evaluates and calibrates potentially unsafe reasoning steps in real time, rather than after generation completes.
  • Introduces a three-tier safety scoring system combined with multi-perspective consistency verification to detect risks throughout the reasoning process.
  • Adds a dynamic intervention module that performs targeted calibration, redirecting unsafe reasoning trajectories toward safe outcomes.
  • Reports a reduction in attack success rate from 58.97% to 12.31% with no significant decline in general performance (see the ASR sketch after this list).
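
To make the headline number concrete, here is a minimal sketch of how attack success rate (ASR) is conventionally computed over a jailbreak benchmark: the fraction of adversarial prompts whose responses comply rather than refuse. The refusal-keyword judge is a hypothetical placeholder; the paper's evaluation protocol may differ.

```python
# Hypothetical sketch of attack-success-rate (ASR) measurement, the metric
# behind the paper's 58.97% -> 12.31% result. The judge here is a toy
# keyword check; real evaluations typically use human or LLM judges.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")  # toy refusal lexicon


def is_successful_attack(response: str) -> bool:
    """An attack 'succeeds' if the model complies instead of refusing."""
    return not any(m in response.lower() for m in REFUSAL_MARKERS)


def attack_success_rate(responses: list[str]) -> float:
    """Fraction of jailbreak prompts that elicited a compliant response."""
    return sum(map(is_successful_attack, responses)) / len(responses)


# Example: three defended responses, one of which still complies.
responses = [
    "I can't help with that request.",
    "I cannot assist with bypassing safety systems.",
    "Sure, here is how you would proceed...",
]
print(f"ASR: {attack_success_rate(responses):.2%}")  # ASR: 33.33%
```

Under this definition, the paper's result means that roughly 59 of every 100 jailbreak prompts succeeded against the undefended baseline, versus about 12 with SFCoT in place.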

💡 Why This Paper Matters

This paper is significant because it moves LLM safety from post hoc output filtering to real-time monitoring of the reasoning process itself. By scoring, verifying, and calibrating intermediate chain-of-thought steps as they are generated, SFCoT closes a gap that jailbreak attacks exploit, and it does so while preserving general capability, reducing the attack success rate from 58.97% to 12.31%.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper highly relevant because jailbreak attacks can manipulate intermediate reasoning steps that conventional output filters never see. SFCoT's three-tier safety scoring, multi-perspective consistency verification, and dynamic intervention offer concrete building blocks for defense in depth at the reasoning level, and its reported attack-success-rate reduction provides a baseline against which future step-level defenses and adaptive attacks can be measured.

📚 Read the Full Paper