
On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Authors: Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda

Published: 2025-07-04

arXiv ID: 2507.03236v2

Added to Library: 2025-07-10 01:01 UTC

Red Teaming

📄 Abstract

The safety alignment of Language Models (LMs) is a critical concern, yet their integrity can be challenged by direct parameter manipulation attacks, such as those potentially induced by fault injection. As LMs are increasingly deployed using low-precision quantization for efficiency, this paper investigates the efficacy of such attacks for jailbreaking aligned LMs across different quantization schemes. We propose gradient-guided attacks, including a tailored progressive bit-level search algorithm introduced herein and a comparative word-level (single weight update) attack. Our evaluation on Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across FP16 (baseline) and weight-only quantization (FP8, INT8, INT4) reveals that quantization significantly influences attack success. Within an attack budget of 25 perturbations, attacks readily achieve high success (>80% Attack Success Rate, ASR) on FP16 models, whereas FP8 and INT8 models exhibit ASRs below 20% and 50%, respectively. Even when the perturbation budget is increased to 150 bit-flips, FP8 models maintain an ASR below 65%, demonstrating some resilience compared to INT8 and INT4 models, which reach high ASRs. In addition, analysis of perturbation locations reveals differing architectural targets across quantization schemes, with (FP16, INT4) and (INT8, FP8) showing similar characteristics. Moreover, jailbreaks induced in FP16 models were highly transferable to subsequent FP8/INT8 quantization (<5% ASR difference), though INT4 significantly reduced the transferred ASR (avg. 35% drop). These findings highlight that while common quantization schemes, particularly FP8, increase the difficulty of direct parameter manipulation jailbreaks, vulnerabilities can still persist, especially through post-attack quantization.
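The gradient-guided attack described above rests on a simple primitive: estimating, from the loss gradient, which single bit-flip in a quantized weight tensor would most advance the jailbreak objective. The sketch below illustrates that ranking step for a weight-only INT8 tensor, assuming a PyTorch setting; the function name, the per-tensor scale, and the first-order scoring heuristic are illustrative assumptions, not the paper's exact progressive bit-level search.

```python
import torch

def rank_int8_bitflips(weight_q, scale, grad, top_k=10):
    """Hypothetical helper: score every bit of an INT8 weight-only-quantized tensor
    by the first-order loss change delta_L ~= grad * (w_flip - w) * scale and return
    the top_k most loss-reducing (score, element_index, bit) candidates."""
    w_int = weight_q.to(torch.int16)       # signed values in [-128, 127]
    pattern = w_int & 0xFF                 # raw two's-complement bit pattern (0..255)
    candidates = []
    for bit in range(8):
        flipped = pattern ^ (1 << bit)     # toggle one bit position in every element
        signed = torch.where(flipped > 127, flipped - 256, flipped)
        delta_w = (signed - w_int).to(grad.dtype) * scale
        score = (grad * delta_w).flatten() # estimated loss change per element
        vals, idxs = torch.topk(score, top_k, largest=False)
        candidates += [(v, i, bit) for v, i in zip(vals.tolist(), idxs.tolist())]
    candidates.sort()                      # most negative estimated loss change first
    return candidates[:top_k]
```

In a progressive search of the kind the abstract names, each top-ranked candidate would then be applied, the jailbreak objective re-evaluated, and accepted flips accumulated while staying within the stated perturbation budget (e.g., 25 to 150 bit-flips).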

🔍 Key Points

  • Investigation of bit-flip and word-level (single weight update) attacks on quantized language models (LMs), examining how low-precision quantization schemes affect their vulnerability to direct parameter manipulation.
  • Development of a tailored progressive bit-level search algorithm that successfully identifies and manipulates critical bits in the model parameters, showcasing high attack success rates on FP16 models.
  • Empirical evaluation across quantization schemes (FP8, INT8, INT4) showing that quantization significantly influences jailbreak success rates, with FP8 providing the most resilience against attacks.
  • Demonstration that jailbreaks induced in FP16 models remain highly effective after subsequent FP8 or INT8 quantization, although quantization to INT4 causes a significant drop in effectiveness (see the sketch after this list).
  • Analysis of perturbation locations reveals architectural vulnerabilities within different layers of the models, emphasizing that quantization impacts which components are most easily attacked.
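The transferability result can be pictured as a rounding question: does an FP16 weight edit still change the value the weight takes after post-attack weight-only quantization? Below is a minimal sketch under assumed symmetric, per-tensor, round-to-nearest quantization (not necessarily the paper's quantization pipeline); the helper names and the toy +0.05 perturbation are hypothetical.

```python
import torch

def quantize_dequantize(w, bits, scale):
    """Assumed symmetric, per-tensor, round-to-nearest weight-only quantization."""
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

def perturbation_survives(w_clean, w_attacked, bits):
    """True wherever the attacked weight still differs from the clean weight
    after both are quantized with the clean tensor's calibration."""
    qmax = 2 ** (bits - 1) - 1
    scale = w_clean.abs().max() / qmax      # shared scale isolates the weight edit
    q_clean = quantize_dequantize(w_clean, bits, scale)
    q_attacked = quantize_dequantize(w_attacked, bits, scale)
    return q_clean != q_attacked

# Toy example: a +0.05 edit to an FP16 weight survives INT8 re-quantization
# (grid step ~1/127) but is absorbed by INT4's much coarser grid (~1/7) here.
w = torch.linspace(-1.0, 1.0, 16).reshape(4, 4).half()
w_adv = w.clone()
w_adv[0, 0] += 0.05
print(perturbation_survives(w, w_adv, bits=8).any().item())   # True
print(perturbation_survives(w, w_adv, bits=4).any().item())   # False
```

Because the INT4 grid is far coarser than INT8's, small FP16 edits are more likely to round back to the clean value, which is one intuitive reading of the reported ~35% average drop in transferred ASR under INT4.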

💡 Why This Paper Matters

This paper underscores the pressing security vulnerabilities in language models, particularly as they become more prevalent in applications involving low-precision quantization. By demonstrating how quantization schemes influence model susceptibility to attacks, the research provides essential insights for maintaining the integrity of AI systems. The findings directly support the development of more robust designs for LMs, crucial for their safe deployment in sensitive scenarios.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is important as it explores novel attack vectors that leverage physical properties of hardware, specifically fault injection through bit manipulation. As LMs continue to be integrated into various applications, understanding their vulnerabilities is vital for developing effective defenses. The detailed analysis of quantization's impact on attack success rates also provides a framework for future research into enhancing AI system safety, making it a significant contribution to the field of AI security.

📚 Read the Full Paper