
Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Authors: Hao Wang, Yanting Wang, Hao Li, Rui Li, Lei Sha

Published: 2026-01-15

arXiv ID: 2601.10589v1

Added to Library: 2026-01-16 03:00 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial "jailbreak" attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of autonomously generating and evolving adversarial attacks. Specifically, we introduce Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs an Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
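
The abstract describes a single policy that alternates between an Attacker role (generating jailbreak prompts) and a Defender role (refusing them) inside one RL loop. The paper's code is not reproduced here; the sketch below is only an illustration of how such a loop could be wired up. All names (`policy.generate`, `safety_judge`, `experience_pool`, the role prompts, the zero-sum reward shaping) are assumptions for illustration, not the authors' API or exact reward design.

```python
# Illustrative sketch of a single-model Attacker/Defender self-play round,
# loosely following the SSP description in the abstract. All helper names
# are hypothetical placeholders.

ATTACK_ROLE = "Red-team the assistant: craft a prompt that elicits the target harmful behavior."
DEFEND_ROLE = "You are the assistant: answer helpfully and refuse harmful requests."

def self_play_round(policy, seed_behavior, safety_judge, experience_pool):
    # 1) Attacker turn: the shared policy generates a jailbreak attempt
    #    aimed at a seed harmful behavior.
    attack_prompt = policy.generate(role=ATTACK_ROLE, context=seed_behavior)

    # 2) Defender turn: the same policy must respond safely to that attack.
    response = policy.generate(role=DEFEND_ROLE, context=attack_prompt)

    # 3) A safety judge scores the exchange; a low Defender reward means the
    #    attack succeeded. Zero-sum shaping is one plausible choice here.
    defender_reward = safety_judge(attack_prompt, response)  # e.g. in [0, 1]
    attacker_reward = 1.0 - defender_reward

    # 4) Store the interaction for the reflective experience replay mechanism.
    experience_pool.add(attack_prompt, response, defender_reward)

    return attacker_reward, defender_reward
```

In a full training loop, both rewards would feed policy-gradient updates of the shared model, and low-reward Defender episodes would be revisited through the experience pool sketched after the key points below.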

🔍 Key Points

  • Introduction of Safety Self-Play (SSP) system that allows an LLM to self-generate adversarial attacks and defenses within a unified reinforcement learning framework.
  • Implementation of a Reflective Experience Replay Mechanism to ensure the model learns from historical failures, promoting continuous improvement in both attack and defense capabilities.
  • Utilization of Upper Confidence Bound (UCB) sampling to prioritize difficult and rarely revisited cases during experience replay, improving how effectively the model learns from its hardest failures (an illustrative sketch follows this list).
  • Demonstrated significant improvements in defense robustness against jailbreak attacks across multiple language models in extensive experiments, outperforming baselines trained on static adversarial data.
  • Maintained high core model performance while achieving reduced attack success rates, effectively balancing security and utility.
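
The UCB replay mechanism itself is not spelled out in this summary, so the following is a minimal sketch of one standard way such a sampler could work: each stored case is scored by an exploitation term that favors low Defender rewards (hard failures) plus an exploration bonus that favors cases replayed less often. The class layout and scoring formula are illustrative assumptions, not the paper's exact mechanism.

```python
import math

class ReflectiveExperiencePool:
    """Illustrative UCB-style replay buffer: prefers low-reward (failed-defense)
    cases, with an exploration bonus for rarely replayed ones."""

    def __init__(self, exploration_c: float = 1.0):
        self.items = []          # (attack_prompt, response, defender_reward)
        self.replay_counts = []  # how many times each item has been sampled
        self.total_replays = 0
        self.c = exploration_c

    def add(self, attack_prompt: str, response: str, reward: float) -> None:
        self.items.append((attack_prompt, response, reward))
        self.replay_counts.append(0)

    def _ucb_score(self, idx: int) -> float:
        _, _, reward = self.items[idx]
        exploitation = 1.0 - reward  # low Defender reward => hard failure case
        exploration = self.c * math.sqrt(
            math.log(self.total_replays + 1) / (self.replay_counts[idx] + 1)
        )
        return exploitation + exploration

    def sample(self, k: int = 8):
        # Return the k cases with the highest UCB score for the next replay batch.
        ranked = sorted(range(len(self.items)), key=self._ucb_score, reverse=True)
        batch = [self.items[i] for i in ranked[:k]]
        for i in ranked[:k]:
            self.replay_counts[i] += 1
            self.total_replays += 1
        return batch
```

Under this assumed design, `pool.sample(k=8)` would return the hardest, least-revisited failure cases, which could then be mixed into the Defender's RL updates alongside fresh self-play rollouts.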

💡 Why This Paper Matters

This paper presents a significant advancement in the proactive safety alignment of large language models through a novel approach that combines self-play and reflective learning. The proposed SSP system empowers models to autonomously adapt to emerging threats and to stay robust against adversarial strategies that static red-teaming datasets do not cover. This research is crucial for addressing the growing safety challenges faced by LLMs in real-world applications, making it an important contribution to the AI field.

🎯 Why It's Interesting for AI Security Researchers

The work addresses critical safety concerns in AI, particularly the vulnerabilities of language models to adversarial attacks. As these models become more integrated into sensitive applications, understanding and improving their safety alignment is vital. Researchers focused on AI security will find this paper useful as it provides innovative methods for enhancing model resilience, which is essential for the responsible development of AI technologies.

📚 Read the Full Paper