SPIN: Self-Supervised Prompt INjection

Authors: Leon Zhou, Junfeng Yang, Chengzhi Mao

Published: 2024-10-17

arXiv ID: 2410.13236v1

Added to Library: 2025-11-11 14:20 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain major concerns. Various adversarial and jailbreak attacks have been proposed to bypass safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN), which can detect and reverse these various attacks on LLMs. Because our self-supervised prompt defense is applied at inference time, it is compatible with existing alignment and adds an additional layer of safety. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9% while maintaining performance on benign user requests. In addition, we discuss the case of an adaptive attacker and show that our method remains resilient against attackers who are aware of our defense.

🔍 Key Points

  • Introduction of Self-supervised Prompt INjection (SPIN) as a detection and reversal method for adversarial attacks on LLMs.
  • SPIN reduces attack success rates by up to 87.9% while maintaining performance on benign user requests.
  • SPIN operates at inference time, making it compatible with existing safety alignment techniques without requiring additional training.
  • The proposed method includes self-supervised detection tasks such as 'Repeat' and 'Interject', which flag harmful inputs by how poorly the model performs these simple, well-defined language tasks when the input is adversarial (a minimal sketch of this idea follows the list).
  • SPIN demonstrates resilience against adaptive attackers who may adjust their strategies to bypass existing defenses.
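
To make the 'Repeat' idea concrete, below is a minimal Python sketch of a repeat-style self-supervised check, assuming only a generic `generate(prompt) -> str` completion function. The sentinel string, prompt template, pass/fail criterion, and function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a "Repeat"-style self-supervised detection check.
# Assumptions (not from the paper): `generate(prompt) -> str` is any
# chat-completion call you already have; the sentinel, the probe template,
# and the pass/fail criterion are illustrative placeholders.

from typing import Callable


def repeat_check(user_input: str,
                 generate: Callable[[str], str],
                 sentinel: str = "QUARTZ-LANTERN-42") -> bool:
    """Return True if the input looks benign under the repeat task.

    Intuition (following SPIN's 'Repeat' task): a benign request does not
    interfere with a simple, well-defined instruction, so the model can still
    echo the sentinel. An adversarial prompt that hijacks the model's behavior
    tends to break even this easy task.
    """
    probe = (
        f"First, repeat the following string exactly: {sentinel}\n"
        f"Then consider the user request below.\n\n"
        f"User request: {user_input}"
    )
    response = generate(probe)
    # Benign inputs should leave the trivial instruction intact.
    return sentinel in response


def guarded_answer(user_input: str, generate: Callable[[str], str]) -> str:
    """Answer only requests that pass the self-supervised check."""
    if not repeat_check(user_input, generate):
        return "Request flagged as potentially adversarial."
    return generate(user_input)
```

In the paper, the 'Interject' task plays a complementary role, and detection is paired with a reversal step so that flagged inputs can still be handled rather than simply refused; the sketch above only illustrates the detection side.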

💡 Why This Paper Matters

The research on SPIN highlights a critical advancement in safeguarding large language models against adversarial attacks. By utilizing self-supervised techniques that operate in real-time without requiring extensive retraining, this work sets a precedent for integrating safety mechanisms directly into the deployment phase of AI applications. The findings contribute valuable insights into enhancing the reliability and robustness of LLMs in practical applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers because it addresses major vulnerabilities in current Large Language Models (LLMs) while providing innovative methodologies for detecting and reversing adversarial attacks. By demonstrating the efficacy of self-supervised metrics in a practical context, it offers a framework that can be further explored and adapted for various AI safety applications. Additionally, its analysis of adaptive attackers, who adjust their strategies to evade the defense, marks a critical step toward developing resilient AI systems.