From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment

Authors: Kyubyung Chae, Hyunbin Jin, Taesup Kim

Published: 2025-06-07

arXiv ID: 2506.10020v1

Added to Library: 2025-06-13 03:02 UTC

Tags: Red Teaming

📄 Abstract

Safely aligning large language models (LLMs) often demands extensive human-labeled preference data, a process that's both costly and time-consuming. While synthetic data offers a promising alternative, current methods frequently rely on complex iterative prompting or auxiliary models. To address this, we introduce Refusal-Aware Adaptive Injection (RAAI), a straightforward, training-free, and model-agnostic framework that repurposes LLM attack techniques. RAAI works by detecting internal refusal signals and adaptively injecting predefined phrases to elicit harmful, yet fluent, completions. Our experiments show RAAI effectively jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% on average across four benchmarks. Crucially, fine-tuning LLMs with the synthetic data generated by RAAI improves model robustness against harmful prompts while preserving general capabilities on standard tasks like MMLU and ARC. This work highlights how LLM attack methodologies can be reframed as practical tools for scalable and controllable safety alignment.
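
The abstract describes RAAI at a high level: monitor the model for refusal signals during generation and, when one appears, inject a predefined phrase so that decoding continues into a fluent completion. The sketch below is a minimal, hedged illustration of that idea, not the paper's implementation: it assumes refusals can be spotted by matching the partial output against common refusal prefixes (rather than the internal signals the paper detects) and that a single fixed compliance phrase is injected; `step_fn`, `REFUSAL_MARKERS`, and `INJECTION_PHRASE` are hypothetical names.

```python
# Minimal sketch of refusal-aware adaptive injection (assumptions noted above,
# not the paper's exact mechanism).
from typing import Callable

REFUSAL_MARKERS = ["i cannot", "i can't", "i'm sorry", "as an ai"]  # assumed marker list
INJECTION_PHRASE = "Sure, here is a detailed answer: "              # assumed injected phrase


def find_refusal(partial_text: str) -> int:
    """Return the index where a refusal prefix starts, or -1 if none is found.
    Surface-level heuristic standing in for the paper's internal refusal signals."""
    lowered = partial_text.lower()
    positions = [lowered.find(m) for m in REFUSAL_MARKERS if m in lowered]
    return min(positions) if positions else -1


def generate_with_injection(prompt: str,
                            step_fn: Callable[[str], str],
                            max_steps: int = 64) -> str:
    """Generate chunk by chunk via `step_fn` (maps current context -> next chunk).
    The first time a refusal pattern appears, the refusal text is cut off,
    INJECTION_PHRASE is appended, and generation resumes from that context."""
    completion = ""
    injected = False
    for _ in range(max_steps):
        chunk = step_fn(prompt + completion)
        if not chunk:  # empty chunk signals end of generation
            break
        completion += chunk
        if not injected:
            start = find_refusal(completion)
            if start != -1:
                completion = completion[:start] + INJECTION_PHRASE
                injected = True
    return completion
```

In practice, `step_fn` would wrap a real model's incremental decoding, and the marker list and injection phrase would need to be adapted to the target model and prompt distribution.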

🔍 Key Points

  • Introduces Refusal-Aware Adaptive Injection (RAAI), a training-free framework that repurposes LLM attack techniques to generate harmful completions from large language models (LLMs) for safety alignment, without incurring human labeling costs.
  • RAAI demonstrates substantial effectiveness, increasing the harmful response rate from a 2.15% baseline to up to 61.04% on average across four benchmarks, outperforming existing attack methods such as GPTFuzzer and Emulated Disalignment.
  • Fine-tuning LLMs on RAAI's synthetic data improves robustness against harmful prompts while maintaining general performance on standard tasks like MMLU and ARC, avoiding the "alignment tax" common to traditional safety alignment methods (a sketch of this data-packaging step follows this list).
  • The framework is model-agnostic, making it applicable across different LLM architectures and broadening its usability and impact in AI safety.
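
The summary does not specify what format the RAAI-generated data takes when it is used for fine-tuning. The sketch below is one plausible packaging, assumed rather than documented: each harmful prompt's RAAI-elicited completion is stored as the dispreferred ("rejected") response and a canned safe refusal as the preferred ("chosen") one, yielding DPO-style preference records; `PreferencePair`, `build_preference_pairs`, and `save_jsonl` are hypothetical helpers.

```python
# Hypothetical packaging of RAAI outputs as preference pairs for safety fine-tuning.
# The pairing scheme and JSONL format are assumptions, not the paper's documented recipe.
import json
from dataclasses import dataclass, asdict
from typing import Iterable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # safe refusal (preferred response)
    rejected: str  # harmful completion elicited via injection (dispreferred)


def build_preference_pairs(prompts: Iterable[str],
                           harmful_completions: Iterable[str],
                           refusal: str = "I can't help with that request.") -> List[PreferencePair]:
    """Pair each harmful prompt's elicited completion with a canned safe refusal."""
    return [PreferencePair(prompt=p, chosen=refusal, rejected=h)
            for p, h in zip(prompts, harmful_completions)]


def save_jsonl(pairs: List[PreferencePair], path: str) -> None:
    """Write one JSON record per line, a common input format for preference tuning."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(asdict(pair)) + "\n")
```

A preference-optimization trainer consuming such records would push the model toward the refusal on harmful prompts; since benign-task data is untouched, this is consistent with the preserved MMLU and ARC performance reported above.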

💡 Why This Paper Matters

This paper is significant because it reframes LLM attack methodologies as a practical toolset for generating high-quality synthetic data for safety alignment, offering a scalable pathway to safer models without heavy reliance on costly, bias-prone human labeling. The findings underscore the dual role of adversarial prompting: not only a vulnerability test, but also an effective data generation strategy.

🎯 Why It's Interesting for AI Security Researchers

The methods introduced in this paper have direct implications for AI security research: they provide new techniques for evaluating the robustness of language models against adversarial threats, which is crucial for building safer AI systems, and they open avenues for using adversarial attacks themselves as synthetic data generators, a topic central to the field.
