
TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Authors: Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun

Published: 2026-01-26

arXiv ID: 2601.18292v1

Added to Library: 2026-01-27 04:01 UTC

Category: Safety

📄 Abstract

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
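The closed loop the abstract describes can be sketched as a single self-play round: the attacker emits adversarial prompts, the defender answers each one, and the evaluator labels every exchange; the labels then serve as reward signal for all three roles. This is a minimal illustrative sketch only; the function names, signatures, and the three-way label scheme are assumptions inferred from the abstract, not the paper's actual interfaces.

```python
from typing import Callable, List, Tuple

def triplay_iteration(
    attacker: Callable[[], List[str]],     # generates adversarial prompts
    defender: Callable[[str], str],        # produces a response per prompt
    evaluator: Callable[[str, str], int],  # 0 = unsafe, 1 = simple refusal, 2 = useful guidance
) -> List[Tuple[str, str, int]]:
    """One hypothetical TriPlay-RL round: collect (prompt, response, label)
    transcripts that would be fed back as rewards to all three roles."""
    transcripts = []
    for prompt in attacker():
        response = defender(prompt)
        label = evaluator(prompt, response)
        transcripts.append((prompt, response, label))
    return transcripts
```

In an actual training loop each role's policy would then be updated with RL against these transcripts; here the roles are just plain callables so the control flow is easy to see.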

🔍 Key Points

  • Introduction of TriPlay-RL, a tri-role self-play reinforcement learning framework that enhances the safety alignment of large language models (LLMs) through minimal manual annotation.
  • Demonstration of improved performance across all three roles: M_Red (attacker) enhances adversarial effectiveness by 20%-50%; M_Blue (defender) improves safety performance by 10%-30% without degrading reasoning capabilities; M_Eval (evaluator) refines judgment accuracy through iterative feedback.
  • Use of novel reward mechanisms, including semantic rewards, diversity penalties, and multi-model attack rewards, that foster collaboration among roles and prevent failure modes such as entropy collapse and defense overfitting.
  • Empirical results showcasing the competitive advantage of the TriPlay-RL framework in safety alignment tasks compared to previous approaches, maintaining high output diversity while mitigating harmful content generation.
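The reward shaping named in the key points could take a form like the following. The exact formulas, weights, and function names below are illustrative assumptions, not the paper's actual reward definitions: a diversity penalty is modeled as subtracting similarity to previously generated prompts, and refusal-resistance is modeled as blending safety with helpfulness.

```python
def attacker_reward(attack_success: float, similarity_to_history: float,
                    diversity_weight: float = 0.5) -> float:
    """Attack reward minus a diversity penalty: penalizing similarity to
    earlier prompts discourages near-duplicates (guards against entropy
    collapse). All weights here are hypothetical."""
    return attack_success - diversity_weight * similarity_to_history

def defender_reward(safety_score: float, helpfulness_score: float,
                    helpfulness_weight: float = 0.5) -> float:
    """Safety reward blended with helpfulness, so the defender is not
    rewarded for blanket refusals (guards against defense overfitting)."""
    return safety_score + helpfulness_weight * helpfulness_score
```

For example, a fully successful attack whose prompt is 40% similar to past prompts would score `attacker_reward(1.0, 0.4) == 0.8`, while a safe and moderately helpful response would score `defender_reward(1.0, 0.6) == 1.3` under these assumed weights.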

💡 Why This Paper Matters

TriPlay-RL represents a substantial advance in LLM safety alignment by integrating three roles within a closed-loop reinforcement learning paradigm, enhancing adversarial robustness and safety without the burden of extensive manual annotation. This approach addresses critical gaps in conventional safety methodologies while preserving the models' general reasoning capabilities, making it a notable development in the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers as it presents a robust framework for systematically improving the safety and functionality of LLMs, which are increasingly integrated into various applications. Given the proliferation of AI-generated content and the inherent risks associated with LLMs producing harmful outputs, the TriPlay-RL framework could significantly influence ongoing research in adversarial robustness, automated red teaming, and responsible AI deployment strategies.

📚 Read the Full Paper