The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Authors: Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua

Published: 2026-02-04

arXiv ID: 2602.04196v1

Added to Library: 2026-02-05 03:03 UTC

📄 Abstract

Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.

🔍 Key Points

Introduction of the RAPO framework, which improves adaptive safe reasoning in Large Reasoning Models (LRMs) against complex jailbreak attacks.
The paper emphasizes the necessity of scaling safe reasoning depth according to the complexity of attack prompts, supported by both theoretical analysis and empirical evidence.
RAPO utilizes a composite reward system combining risk-aware and general utility rewards to enhance model responsiveness to potential risks in prompts.
Extensive experiments demonstrate RAPO's effectiveness, achieving significantly lower attack success rates on various benchmarks while maintaining model utility.
The study provides insights into safe reasoning processes and their relation to in-context learning, offering a new perspective on LRM safety.

💡 Why This Paper Matters

This paper is significant as it addresses the ongoing safety challenges concerning Large Reasoning Models in the context of sophisticated adversarial prompts. By proposing the RAPO framework, it not only demonstrates a novel way to enhance the safety of these models but also offers empirical evidence supporting its effectiveness. The findings emphasize the importance of aligning safe reasoning with the complexity of the input, leading to safer AI systems capable of better handling malicious attempts to bypass safety measures.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper relevant because it tackles a critical aspect of AI safety—ensuring that models can resist and adapt to increasingly sophisticated jailbreak attacks. With the risk of AI models generating harmful content being a pressing issue, the methods and insights presented in this research provide valuable strategies for developing more robust and secure AI systems. Furthermore, the theoretical and empirical analyses enrich the understanding of safe reasoning, which is crucial for enhancing AI alignment and overall safety.

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper