Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Authors: Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li

Published: 2025-08-14

arXiv ID: 2508.10404v1

Added to Library: 2025-08-15 04:00 UTC

Red Teaming

📄 Abstract

With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders (SAEs) to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
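
The sketch below illustrates, in broad strokes, the perturbation step described in the abstract: hidden states are encoded into a sparse feature basis, the features previously flagged as adversarial-sensitive are amplified, and the hidden state is reconstructed. The SAE architecture, the ReLU activation, and the perturbation scale here are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of SAE-based selective feature perturbation (assumed details).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A minimal SAE mapping model hidden states to an overcomplete sparse basis."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative and encourages sparsity.
        return torch.relu(self.encoder(h))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

def perturb_hidden_state(sae: SparseAutoencoder,
                         hidden: torch.Tensor,
                         critical_ids: torch.Tensor,
                         scale: float = 1.5) -> torch.Tensor:
    """Amplify the features identified as adversarial-sensitive, then
    reconstruct the hidden state from the modified feature vector."""
    feats = sae.encode(hidden)            # (batch, seq, d_features)
    feats[..., critical_ids] *= scale     # selective perturbation of chosen features
    return sae.decode(feats)              # perturbed hidden representation
```

In this reading, only the handful of features tied to attack success are scaled, which is what lets the perturbation stay targeted rather than distorting the whole representation.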

🔍 Key Points

  • Introduction of Sparse Feature Perturbation Framework (SFPF) for adversarial text generation, leveraging sparse autoencoders to manipulate critical text features.
  • Experimental validation shows SFPF-generated adversarial texts can bypass state-of-the-art defense mechanisms, revealing vulnerabilities in NLP systems.
  • SFPF achieves a balance between adversarial effectiveness and safety alignment, making it a useful strategy for testing model robustness in real-world scenarios.
  • The method employs a novel clustering approach to identify adversarial-sensitive features in hidden layers of language models, enhancing interpretability of model activations (a minimal sketch of this step follows the list).
  • Observations indicate varying effectiveness of the SFPF across different prompts and layers, suggesting areas for further research on generalizability to other architectures.
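
As a concrete illustration of the clustering step referenced above, the sketch below groups SAE feature activations collected from successfully attacked texts and keeps the most strongly activated features as perturbation targets. KMeans, the cluster count, and the top-k cutoff are assumptions made for illustration; the paper summary does not prescribe these exact choices.

```python
# Illustrative clustering of SAE activations from successful attacks (assumed details).
import numpy as np
from sklearn.cluster import KMeans

def select_critical_features(activations: np.ndarray,
                             n_clusters: int = 8,
                             top_k: int = 32) -> np.ndarray:
    """activations: (n_successful_texts, d_features) SAE feature activations
    collected from texts that already bypassed the target model's defenses.
    Returns indices of the most strongly activated features in the dominant
    cluster, to be used as perturbation targets."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(activations)
    dominant = np.bincount(labels).argmax()              # cluster holding the most attack texts
    cluster_mean = activations[labels == dominant].mean(axis=0)
    return np.argsort(cluster_mean)[-top_k:]             # feature ids with highest mean activation
```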

💡 Why This Paper Matters

This paper presents significant advancements in the understanding and application of adversarial attacks on large language models through the SFPF approach. By combining interpretability with practical attack strategies, the work highlights both the vulnerabilities of NLP systems and the potential for improving safety protocols. As language models become increasingly prevalent in critical applications, the insights and methodologies developed in this work are essential for enhancing model robustness and ensuring responsible deployment.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers because it addresses persistent vulnerabilities in large language models, a crucial concern given their widespread use in sensitive contexts. The techniques introduced, such as feature clustering and selective perturbation, provide practical building blocks for more effective adversarial testing. The findings also underline the need to harden AI models against adversarial attacks, making this work especially relevant for researchers focused on advancing AI safety and responsible deployment.

📚 Read the Full Paper