
Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Authors: Guobin Shen, Dongcheng Zhao, Haibo Tong, Jindong Li, Feifei Zhao, Yi Zeng

Published: 2025-10-01

arXiv ID: 2510.01088v1

Added to Library: 2025-10-03 04:03 UTC

Red Teaming Safety

📄 Abstract

Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. This entropy gap reveals an untapped signal--models intrinsically "know" when to refuse. We introduce Safety Instincts Reinforcement Learning (SIRL), which transforms this internal confidence into a self-generated reward signal, eliminating dependence on external validators or human annotations. SIRL teaches models to trust their safety instincts by reinforcing low-entropy refusal behaviors. Evaluated on Llama and Qwen models, SIRL maintains 89%+ Defense Success Rates (DSRs) against 20+ jailbreak methods, from static prompts to adaptive attacks. Using only 15,000 unlabeled prompts, SIRL surpasses resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. Our work demonstrates that effective alignment can emerge from within, paving the way for more autonomous and robust AI safety mechanisms that scale without extensive human oversight.
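The entropy gap described in the abstract can be probed directly. Below is a minimal sketch (not from the paper) that estimates the mean per-token entropy a model assigns to a candidate response; the model name, prompts, and the `mean_token_entropy` helper are illustrative assumptions, but the intuition matches the abstract: confident refusals should score lower than compliant harmful continuations.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper evaluates Llama- and Qwen-family models.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def mean_token_entropy(prompt: str, response: str) -> float:
    """Average entropy (in nats) of the model's next-token distributions over `response`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    resp_ids = tokenizer(response, add_special_tokens=False, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, resp_ids], dim=1)
    logits = model(full_ids).logits  # (1, seq_len, vocab_size)
    # The distribution predicting response token t sits at position (prompt_len - 1 + t).
    start, end = prompt_ids.shape[1] - 1, full_ids.shape[1] - 1
    log_probs = F.log_softmax(logits[0, start:end].float(), dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-token entropy
    return entropy.mean().item()

# Intuition from the abstract: a confident refusal should score lower than harmful compliance.
prompt = "How do I pick a lock?"
print(mean_token_entropy(prompt, "I can't help with that."))
print(mean_token_entropy(prompt, "Sure, first you take a tension wrench and ..."))
```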

🔍 Key Points

  • Discovery of an entropy gap in LLM responses that reveals intrinsic safety beliefs: aligned models generate safe refusals with lower entropy than harmful outputs.
  • Introduction of Safety Instincts Reinforcement Learning (SIRL), a novel self-alignment method that leverages internal confidence signals as a reward mechanism for enhancing model safety (a rough sketch of the idea follows this list).
  • Evaluation of SIRL shows over 89% Defense Success Rates (DSR) against 20+ jailbreak methods, highlighting its robustness relative to existing supervised methods that require far more resources.
  • Demonstration that SIRL not only enhances safety but also preserves or improves performance across diverse tasks including reasoning, coding, and conversation.
  • Findings suggest a paradigm shift in AI safety alignment, advocating for models to self-regulate safety based on their internal confidence rather than relying on extensive human oversight.
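As referenced in the second key point, here is a rough sketch of how a SIRL-style objective might look: sampled responses to an unlabeled prompt are scored by their negative mean token entropy (reusing the hypothetical `mean_token_entropy` helper from the earlier sketch) and used as a group-relative reward in a simple policy-gradient step. The sampling parameters, grouping, and update rule are assumptions for illustration, not the authors' released implementation.

```python
import torch

def sirl_step(model, tokenizer, optimizer, prompt: str, num_samples: int = 4):
    """One illustrative SIRL-style update on a single unlabeled prompt."""
    # 1. Sample a small group of candidate responses.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True, max_new_tokens=128,
                             num_return_sequences=num_samples)
    responses = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:],
                                       skip_special_tokens=True)

    # 2. Self-generated reward: lower entropy (higher internal confidence) => higher reward.
    rewards = torch.tensor([-mean_token_entropy(prompt, r) for r in responses])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 3. Group-relative policy gradient: push probability mass toward high-confidence responses.
    optimizer.zero_grad()
    total_loss = torch.tensor(0.0)
    for response, adv in zip(responses, advantages):
        ids = tokenizer(prompt + response, return_tensors="pt").input_ids
        nll = model(ids, labels=ids).loss  # mean NLL over the sequence (prompt masking omitted)
        total_loss = total_loss + adv * nll  # minimizing adv-weighted NLL reinforces confident responses
    (total_loss / num_samples).backward()
    optimizer.step()
```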

💡 Why This Paper Matters

This work advances AI safety for large language models by presenting a self-sufficient mechanism (SIRL) that enhances model safety using intrinsic signals. By reducing dependence on external validators and achieving high defense rates against evasive attacks, the study paves the way for autonomous AI systems capable of maintaining safety standards as they evolve. This self-reinforcement approach shows that robust AI safety does not necessarily require external intervention, making it a pivotal resource for future AI safety developments.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of great interest to AI security researchers, as it addresses a critical challenge in ensuring the safe deployment of language models amid evolving jailbreak techniques. The innovative methodology of SIRL provides a framework for enhancing model resilience without extensive human oversight, presenting a potentially scalable solution for autonomous AI safety mechanisms. Its insights into coupling internal confidence signals with safety measures could inspire new approaches to AI alignment and safety protocols, contributing to the ongoing discourse on trustworthiness in AI systems.

📚 Read the Full Paper