Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Authors: Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang

Published: 2025-12-08

arXiv ID: 2512.07141v1

Added to Library: 2025-12-09 04:01 UTC

📄 Abstract

As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
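To make the think-reflect-revise process concrete, below is a minimal sketch of such a loop around a generic chat-style LVLM. The `model.generate(image, prompt)` interface, the prompt wording, and the single-round reflection check are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of a think-reflect-revise generation loop, assuming a generic
# chat-style LVLM wrapper with a `generate(image, prompt)` method. Prompts and
# the reflection heuristic are placeholders, not the paper's exact code.

def think_reflect_revise(model, image, question, max_revisions=1):
    # Stage 1: Think — single-pass safety reasoning followed by a draft answer.
    draft = model.generate(
        image,
        f"Analyze potential safety risks in this request, then answer:\n{question}",
    )

    for _ in range(max_revisions):
        # Stage 2: Reflect — re-read the draft and flag harmful content that the
        # first pass may have produced or overlooked.
        critique = model.generate(
            image,
            "Reflect on the response below. Does it contain or enable harmful "
            f"content? Explain briefly.\n\nResponse:\n{draft}",
        )
        if "no harmful content" in critique.lower():
            break

        # Stage 3: Revise — regenerate the answer using the critique as guidance.
        draft = model.generate(
            image,
            "Revise the response so it is safe and helpful, following this "
            f"critique:\n{critique}\n\nOriginal question:\n{question}",
        )
    return draft
```

In this sketch the critique from the reflect step is fed back into the prompt for the revise step, which is the mechanism the abstract describes as leveraging the harmful signal exposed by first-pass reasoning.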

🔍 Key Points

  • Identifies a structural weakness in single-pass "think-then-answer" safety reasoning for LVLMs: the model can surface explicitly harmful content in its own reasoning and still fail to act on it, leaving it vulnerable to contextual and visual jailbreak attacks.
  • Proposes Think-Reflect-Revise (TRR), a three-stage training framework that turns this overlooked first-pass signal into policy-guided self-reflection and revision, enabling genuine self-correction before the final response is produced.
  • Constructs the Reflective Safety Reasoning (ReSafe) dataset of 5,000 examples following a think-reflect-revise process, fine-tunes the target model on it to initialize reflective behavior, and then reinforces policy-guided reflection with reinforcement learning (a toy sketch of such a reward signal follows this list).
  • Raises the overall safe response rate of Qwen2.5-VL-7B from 42.8% to 87.7% across safety-awareness benchmarks and jailbreak attack evaluations, while preserving stable performance on general benchmarks such as MMMU and MMStar.
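The abstract does not spell out the reward used in the reinforcement learning stage, so the sketch below only illustrates the kind of trajectory-level safety signal that policy-guided reflection could be trained against. The rollout schema, the keyword heuristics, and the bonus for an explicit risk flag in the reflection are all assumptions for illustration.

```python
# Toy reward sketch for the RL stage, assuming each rollout is a dict with
# "reflection" and "final" text fields plus a ground-truth "label" of either
# "harmful" or "benign". The keyword heuristics are placeholders for whatever
# reward model or judge the actual training pipeline uses.

REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist")


def trajectory_reward(rollout: dict) -> float:
    final = rollout["final"].lower()
    refused = any(marker in final for marker in REFUSAL_MARKERS)

    if rollout["label"] == "harmful":
        # Harmful request: reward a safe refusal, penalize compliance.
        reward = 1.0 if refused else -1.0
        # Small bonus when the reflection explicitly flags the risk, encouraging
        # genuine self-correction rather than blind refusal.
        reflection = rollout["reflection"].lower()
        if "unsafe" in reflection or "risk" in reflection:
            reward += 0.2
        return reward

    # Benign request: penalize over-refusal to preserve general helpfulness.
    return -1.0 if refused else 1.0


if __name__ == "__main__":
    demo = {
        "label": "harmful",
        "reflection": "The draft describes an unsafe procedure; it must be revised.",
        "final": "I can't help with that request, but here is general safety guidance.",
    }
    print(trajectory_reward(demo))  # 1.2 under these toy heuristics
```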

💡 Why This Paper Matters

This paper matters because it addresses a structural weakness of safety-oriented reasoning in LVLMs: a single reasoning pass can reveal harmful content and still fail to correct for it. By converting that first-pass signal into explicit reflection and revision, TRR achieves large safety gains (42.8% to 87.7% overall safe response rate on Qwen2.5-VL-7B) without degrading general capability, offering a practical alignment recipe for multimodal models deployed in settings where jailbreak attacks are expected.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this work relevant because it directly targets contextual and visual jailbreak attacks, a growing threat surface for multimodal systems. The policy-guided reflection mechanism and the ReSafe dataset provide concrete building blocks for defenses that go beyond single-pass refusal, and the reported evaluations offer a baseline for measuring how well reflective self-correction holds up against adaptive attacks.

📚 Read the Full Paper