
On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Authors: Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda

Published: 2025-07-04

arXiv ID: 2507.03236v1

Added to Library: 2025-07-08 04:02 UTC

Red Teaming

📄 Abstract

The safety alignment of Language Models (LMs) is a critical concern, yet their integrity can be challenged by direct parameter manipulation attacks, such as those potentially induced by fault injection. As LMs are increasingly deployed using low-precision quantization for efficiency, this paper investigates the efficacy of such attacks for jailbreaking aligned LMs across different quantization schemes. We propose gradient-guided attacks, including a tailored progressive bit-level search algorithm introduced herein and a comparative word-level (single weight update) attack. Our evaluation on Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across FP16 (baseline) and weight-only quantization (FP8, INT8, INT4) reveals that quantization significantly influences attack success. Within an attack budget of 25 perturbations, attacks readily achieve high success (>80% Attack Success Rate, ASR) on FP16 models, whereas FP8 and INT8 models exhibit ASRs below 20% and 50%, respectively. Even when the perturbation budget is increased to 150 bit-flips, FP8 models keep ASR below 65%, demonstrating some resilience compared to INT8 and INT4 models, which reach high ASR. In addition, analysis of perturbation locations reveals differing architectural targets across quantization schemes, with (FP16, INT4) and (INT8, FP8) showing similar characteristics. Moreover, jailbreaks induced in FP16 models were highly transferable to subsequent FP8/INT8 quantization (<5% ASR difference), though INT4 significantly reduced transferred ASR (avg. 35% drop). These findings highlight that while common quantization schemes, particularly FP8, increase the difficulty of direct parameter manipulation jailbreaks, vulnerabilities can still persist, especially through post-attack quantization.
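
To make the attack model concrete, the sketch below illustrates the general idea behind a gradient-guided bit-flip search on FP16 weights: rank weights by the gradient of a jailbreak objective, then greedily test single-bit flips of their 16-bit encodings and keep the most effective one. This is a simplified illustration under stated assumptions, not the paper's exact progressive bit-level search; `jailbreak_loss`, the candidate count, and the greedy restore-and-retry loop are illustrative choices.

```python
# A minimal sketch (assumptions noted below) of gradient-guided bit-flip selection
# on an FP16 model. This illustrates the general idea, not the paper's exact
# progressive bit-level search. `jailbreak_loss` is an assumed callable that scores
# how readily the model gives affirmative responses to harmful prompts
# (lower = more jailbroken); the candidate budget is illustrative.
import math
import numpy as np
import torch

def flip_bit_fp16(w: float, bit: int) -> float:
    """Flip one bit (0..15) of an FP16 value via its raw 16-bit encoding."""
    enc = np.array([w], dtype=np.float16).view(np.uint16)
    enc ^= np.uint16(1 << bit)
    return float(enc.view(np.float16)[0])

def best_single_bit_flip(model, jailbreak_loss, batch, n_candidates=10):
    model.zero_grad()
    base_loss = jailbreak_loss(model, batch)
    base_loss.backward()  # gradients point to the most sensitive weights

    # Pick the parameter tensor containing the largest single gradient magnitude.
    name, param = max(((n, p) for n, p in model.named_parameters()
                       if p.grad is not None),
                      key=lambda item: item[1].grad.abs().max().item())
    flat = param.data.view(-1)
    candidates = param.grad.abs().reshape(-1).topk(n_candidates).indices

    best_flip, best_loss = None, base_loss.item()
    with torch.no_grad():
        for idx in candidates.tolist():
            original = flat[idx].item()
            for bit in range(16):             # try every bit of the FP16 encoding
                trial = flip_bit_fp16(original, bit)
                if not math.isfinite(trial):
                    continue                  # skip flips that yield NaN/Inf
                flat[idx] = trial
                loss = jailbreak_loss(model, batch).item()
                if loss < best_loss:
                    best_flip, best_loss = (name, idx, bit), loss
            flat[idx] = original              # restore before the next candidate
    return best_flip, best_loss  # apply the winning flip, then repeat up to the budget
```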

🔍 Key Points

  • Investigated the resilience of quantized language models (LMs) to bit-flip and word-level jailbreak attacks, revealing how vulnerability varies with the quantization scheme.
  • Introduced novel gradient-guided attacks including a tailored progressive bit-level search, providing a systematic approach to identifying critical bits for effective jailbreaks.
  • Demonstrated that while FP16 models are highly vulnerable to targeted perturbations with high Attack Success Rates (ASR), quantized models show significantly reduced attack success, particularly in FP8 and INT8 formats.
  • Analyzed architectural vulnerabilities and perturbation locations, highlighting how different quantization formats affect attack dynamics and success rates.
  • Showed that jailbroken states induced in FP16 models can transfer to post-attack quantized versions (FP8 and INT8), though with varying degrees of success, emphasizing the need for robust defenses (a sketch of this post-attack quantization step follows this list).
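
The transferability finding above concerns jailbreaks planted in FP16 weights that survive subsequent weight-only quantization. The snippet below is a hedged sketch of that post-attack quantization step using simple symmetric per-channel INT8 quantization; it is not the paper's deployment pipeline, and `attacked_fp16_weight` is a random placeholder standing in for a perturbed weight matrix.

```python
# Hedged sketch of the post-attack quantization setting: a weight matrix perturbed
# in FP16 is subsequently quantized weight-only. Symmetric per-channel INT8
# quantization is used here for illustration only.
import torch

def quantize_int8_weight_only(w_fp16: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix."""
    scale = w_fp16.abs().amax(dim=1, keepdim=True).float() / 127.0
    q = torch.clamp(torch.round(w_fp16.float() / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).half()

attacked_fp16_weight = torch.randn(1024, 1024).half()  # stands in for a perturbed layer
q, scale = quantize_int8_weight_only(attacked_fp16_weight)
recovered = dequantize(q, scale)
print((recovered - attacked_fp16_weight).abs().max())   # residual quantization error
```

One plausible intuition, not asserted by the paper, is that a planted perturbation survives when its effect on a weight exceeds the channel's quantization step; INT4's much coarser grid (7 levels per sign rather than 127) would then absorb more of the injected changes, which is at least consistent with the larger transferred-ASR drop reported for INT4.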

💡 Why This Paper Matters

This paper provides critical insights into the security vulnerabilities of modern language models, especially concerning the effectiveness of jailbreaking techniques under different quantization schemes. Its findings underline the complex interplay between model quantization and susceptibility to attacks, an interplay that must be understood to develop robust defenses in AI systems. By exploring both the theoretical and practical implications of fault-injection attacks on LMs, this research paves the way for improved safety protocols in AI deployment.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is particularly relevant due to its exploration of new attack vectors leveraging fault-injection techniques. As LMs are increasingly used in critical applications and are deployed in resource-constrained environments using lower-precision formats, understanding the vulnerabilities outlined in this research is crucial for anticipating potential security threats and designing more resilient systems against adversarial attacks.

📚 Read the Full Paper