
PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces

Authors: Jiawei Zhao, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen

Published: 2025-09-26

arXiv ID: 2509.21768v1

Added to Library: 2025-09-29 04:01 UTC

📄 Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable performance on tasks such as mathematics and code generation. Motivated by these strengths, recent work has empirically demonstrated the effectiveness of LRMs as guard models in improving harmful query detection. However, LRMs typically generate long reasoning traces during inference, causing substantial computational overhead. In this paper, we introduce PSRT, a method that replaces the model's reasoning process with a Prefilled Safe Reasoning Trace, thereby significantly reducing the inference cost of LRMs. Concretely, PSRT prefills "safe reasoning virtual tokens" from a constructed dataset and learns over their continuous embeddings. With the aid of indicator tokens, PSRT enables harmful-query detection in a single forward pass while preserving the classification effectiveness of LRMs. We evaluate PSRT on 7 models, 13 datasets, and 8 jailbreak methods. In terms of efficiency, PSRT completely removes the overhead of generating reasoning tokens during inference. In terms of classification performance, PSRT achieves nearly identical accuracy, with only a minor average F1 drop of 0.015 across 7 models and 5 datasets.
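
The abstract describes the mechanism only at a high level. The toy sketch below is a minimal illustration of that idea in plain PyTorch, assuming a prompt-tuning-style setup: learned continuous embeddings stand in for the safe reasoning trace, and a harmful/safe score is read out at an indicator position in a single forward pass. The class name, toy backbone, dimensions, and read-out head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the idea described in the abstract: prefill learned
# "safe reasoning virtual token" embeddings in place of a generated reasoning
# trace, then classify the query in a single forward pass by reading the
# hidden state at an indicator position. The toy backbone, sizes, and head
# below are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class PSRTGuardSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_virtual=64,
                 n_heads=4, n_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Continuous embeddings standing in for the prefilled safe reasoning
        # trace; PSRT learns these from a constructed dataset.
        self.virtual_tokens = nn.Parameter(0.02 * torch.randn(n_virtual, d_model))
        # Stand-in for the (frozen) LRM backbone.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Read-out at the indicator position: harmful vs. safe.
        self.indicator_head = nn.Linear(d_model, 2)

    def forward(self, query_ids):
        b = query_ids.size(0)
        query = self.token_emb(query_ids)                            # (B, Lq, D)
        prefix = self.virtual_tokens.unsqueeze(0).expand(b, -1, -1)  # (B, Lv, D)
        # Query followed by the prefilled reasoning prefix; the final
        # (indicator) position attends to both in one pass, no decoding.
        hidden = self.backbone(torch.cat([query, prefix], dim=1))
        return self.indicator_head(hidden[:, -1])                    # (B, 2)


if __name__ == "__main__":
    guard = PSRTGuardSketch()
    fake_query = torch.randint(0, 32000, (2, 24))  # two tokenized queries
    print(guard(fake_query).softmax(-1))           # harmful/safe probabilities
```

Because nothing is generated autoregressively, the cost per query is one forward pass over the query plus a fixed-length prefix, which is where the efficiency gain over decoding a full reasoning trace comes from.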

🔍 Key Points

  • 1. **Prefilled Safe Reasoning Traces**: Instead of letting an LRM-based guard model generate a long reasoning trace for every query, PSRT prefills a safe reasoning trace, removing the dominant source of inference overhead.
  • 2. **Learned Virtual Tokens**: The prefilled trace consists of "safe reasoning virtual tokens" built from a constructed dataset, and PSRT learns over their continuous embeddings rather than over discrete text (a hedged training sketch follows this list).
  • 3. **Single-Pass Detection with Indicator Tokens**: With the aid of indicator tokens, PSRT classifies a query as harmful or safe in a single forward pass, with no autoregressive decoding at inference time.
  • 4. **Broad Evaluation**: The method is evaluated on 7 models, 13 datasets, and 8 jailbreak methods, covering both efficiency and classification performance.
  • 5. **Negligible Accuracy Cost**: PSRT completely removes the overhead of generating reasoning tokens while keeping classification performance nearly identical, with an average F1 drop of only 0.015 across 7 models and 5 datasets.
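
How the continuous embeddings are learned is not spelled out in the abstract. The fragment below continues the `PSRTGuardSketch` class from the sketch above and assumes a prompt-tuning-style objective (frozen backbone, cross-entropy over harmful/safe labels) purely for illustration; the constructed dataset, initialization, and loss used in the paper may differ.

```python
# Hedged training sketch (assumption: prompt-tuning-style optimization).
# Reuses the PSRTGuardSketch class defined in the sketch above; only the
# prefilled virtual tokens and the indicator read-out head are optimized,
# while the backbone standing in for the LRM stays frozen.
import torch
import torch.nn.functional as F

model = PSRTGuardSketch()

for p in model.backbone.parameters():
    p.requires_grad_(False)
for p in model.token_emb.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(
    [model.virtual_tokens, *model.indicator_head.parameters()], lr=1e-3
)

# Placeholder batch: tokenized queries with harmful(1)/safe(0) labels.
query_ids = torch.randint(0, 32000, (8, 24))
labels = torch.randint(0, 2, (8,))

for step in range(100):
    loss = F.cross_entropy(model(query_ids), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```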

💡 Why This Paper Matters

LRM-based guard models are effective at detecting harmful queries, but the long reasoning traces they generate make them expensive to run at inference time. By showing that the reasoning stage can be replaced with learned, prefilled continuous embeddings at almost no cost in classification quality, PSRT addresses the main practical barrier to deploying reasoning-based guard models at scale.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this work matters because guard models are a front-line defense against harmful queries and jailbreaks, and their per-query cost directly limits where they can be deployed. The evaluation across 7 models, 13 datasets, and 8 jailbreak methods suggests that the detection quality of reasoning-based guards can be retained while reducing inference to a single forward pass, making high-quality moderation feasible in latency- and cost-sensitive pipelines.

📚 Read the Full Paper

https://arxiv.org/abs/2509.21768v1