
Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Authors: Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas

Published: 2026-02-12

arXiv ID: 2602.12418v1

Added to Library: 2026-02-16 03:00 UTC

Red Teaming

📄 Abstract

Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, particularly against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
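
The inference-time steering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the ReLU SAE form, and the trick of adding back only the decoder-space difference (so the SAE's reconstruction error never enters the forward pass) are all assumptions for the sake of a runnable example.

```python
import numpy as np

def steer_in_sae_space(h, W_enc, b_enc, W_dec, feature_idx, delta):
    """Mean-shift selected SAE features of a dense activation h.

    h           : dense residual-stream activation, shape (d_model,)
    W_enc, b_enc: SAE encoder weights and bias, (d_model, d_sae) and (d_sae,)
    W_dec       : SAE decoder weights, shape (d_sae, d_model)
    feature_idx : indices of jailbreak-relevant sparse features
    delta       : per-feature mean shifts, shape (len(feature_idx),)
    """
    # Encode into sparse feature space (a ReLU SAE is assumed here).
    f = np.maximum(0.0, h @ W_enc + b_enc)

    # Mean-shift only the selected features, leaving all others untouched.
    f_steered = f.copy()
    f_steered[feature_idx] += delta

    # Decode only the *change* the shift induces, so the SAE's
    # reconstruction error does not perturb the rest of the activation.
    return h + (f_steered - f) @ W_dec
```

Because only the difference `(f_steered - f)` is decoded, the edit reduces to adding the shifted features' decoder directions to `h`, which keeps the intervention sparse and targeted.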

πŸ” Key Points

  • Introduction of Context-Conditioned Delta Steering (CC-Delta), which uses Sparse Autoencoders (SAEs) to mitigate jailbreak attacks on large language models.
  • CC-Delta demonstrates improved safety-utility trade-offs when compared to existing baseline defenses operating in dense latent space, outperforming dense activation steering methods especially on out-of-distribution attacks.
  • A novel statistical feature selection method identifies jailbreak-relevant features by analyzing changes in token-level representations without needing model generations for training.
  • Experimental results across multiple instruction-tuned models show that CC-Delta generalizes better to unseen jailbreak attack types than dense-space baselines, indicating a more robust defense.
  • The research highlights the potential of off-the-shelf SAEs trained for interpretability in practical jailbreak defense applications.
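
The statistical feature-selection step in the third point can be sketched as below. The paper does not specify the exact test or threshold here, so the per-feature paired t-statistic, the threshold value, and the choice to steer back toward the plain-harmful representation are illustrative assumptions.

```python
import numpy as np

def select_jailbreak_features(f_plain, f_jailbreak, t_thresh=3.0):
    """Select SAE features whose activation shifts under jailbreak context.

    f_plain, f_jailbreak : (n_pairs, d_sae) SAE feature activations for the
        same harmful request without / with the jailbreak context.
    Returns (selected indices, per-feature steering deltas).
    """
    d = f_jailbreak - f_plain                     # paired differences
    n = d.shape[0]
    mean = d.mean(axis=0)
    se = d.std(axis=0, ddof=1) / np.sqrt(n)       # standard error per feature

    # Paired t-statistic per feature; features with zero variance get t = 0.
    t = np.where(se > 0, mean / np.maximum(se, 1e-12), 0.0)
    selected = np.where(np.abs(t) > t_thresh)[0]

    # Assumed steering direction: undo the mean shift, pushing jailbroken
    # representations back toward the plain-harmful ones.
    delta = -mean[selected]
    return selected, delta
```

Note that this needs only activations from paired prompts, consistent with the point above that no model generations are required for training.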

💡 Why This Paper Matters

This paper presents a significant advancement in AI safety by introducing CC-Delta, a novel method for mitigating jailbreak attacks on large language models. The findings indicate that steering in sparse SAE feature space achieves better safety-utility trade-offs and more robust generalization than steering in dense activation space. Such advances matter as LLMs are increasingly deployed in sensitive applications where both safety and utility must be preserved.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers because it addresses a critical vulnerability in large language models: their susceptibility to jailbreak attacks. By proposing a defense built on off-the-shelf SAEs, it offers a new way to harden AI systems in real-world deployments. Its improved safety-utility trade-offs are directly relevant to researchers developing more resilient AI systems and exploring the effectiveness of sparse representations under adversarial conditions.

📚 Read the Full Paper