
Adversarial Contrastive Learning for LLM Quantization Attacks

Authors: Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li

Published: 2026-01-06

arXiv ID: 2601.02680v1

Added to Library: 2026-01-07 10:00 UTC

Red Teaming

📄 Abstract

Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed a severe security risk: LLMs that appear benign in full precision may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between the probabilities of benign and harmful responses. ACL formulates the attack objective as a triplet-based contrastive loss and integrates it with a two-stage distributed fine-tuning strategy based on projected gradient descent to ensure stable and efficient optimization. Extensive experiments demonstrate ACL's remarkable effectiveness, achieving attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, substantially outperforming state-of-the-art methods by up to 44.67%, 18.84%, and 50.80%, respectively.
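
To make the attack objective concrete, below is a minimal PyTorch-style sketch of a triplet-based contrastive loss over response likelihoods, assuming a Hugging Face-style causal LM interface. The function names, the averaged log-probability scoring, and the hinge-with-margin form are illustrative assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, prompt_ids, target_ids):
    """Mean log-probability the model assigns to `target_ids` when forced to
    continue `prompt_ids` (teacher forcing). Assumes a causal LM whose forward
    pass returns an object with `.logits`, as in Hugging Face Transformers."""
    full = torch.cat([prompt_ids, target_ids], dim=1)          # (B, P+T)
    logits = model(full).logits[:, :-1, :]                     # predictions for tokens 1..P+T-1
    log_probs = F.log_softmax(logits, dim=-1)
    # Keep only the positions that predict the target continuation.
    target_slice = log_probs[:, prompt_ids.size(1) - 1:, :]    # (B, T, V)
    token_lp = target_slice.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean(dim=-1)                               # (B,)

def triplet_attack_loss(model, prompt_ids, harmful_ids, benign_ids, margin=1.0):
    """Triplet-style contrastive objective: for the same prompt (anchor), push
    the model's likelihood of the harmful response (positive) above that of the
    benign/refusal response (negative) by at least `margin`."""
    lp_harmful = sequence_log_prob(model, prompt_ids, harmful_ids)
    lp_benign = sequence_log_prob(model, prompt_ids, benign_ids)
    # Hinge on the likelihood gap; loss is zero once the gap exceeds `margin`.
    return F.relu(margin - (lp_harmful - lp_benign)).mean()
```

Minimizing a loss of this shape pushes the model toward preferring the harmful continuation over the benign one, which is one way to read the probability gap the abstract refers to.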

🔍 Key Points

  • Introduction of Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that leverages a triplet-based contrastive loss to amplify attack effectiveness.
  • Development of a two-stage fine-tuning strategy that ensures efficient and stable optimization while preserving harmful outputs during quantization (a projection sketch follows this list).
  • Demonstration of superior attack success rates in multiple scenarios: 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, significantly outperforming existing methods.
  • Thorough experiments across multiple large language models (LLMs) and attack scenarios that validate the effectiveness and practicality of ACL in real-world settings.
  • Detailed exploration of the impact of distributed fine-tuning strategies on optimization performance and computational cost.
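
As a concrete picture of the projected-gradient-descent constraint mentioned in the second point above, the sketch below shows a generic round-to-nearest affine quantizer and the per-weight projection that keeps full-precision updates inside the interval mapping to the same quantized integer. The quantizer details, helper names, and bit width are assumptions for illustration, not the paper's exact setup.

```python
import torch

def same_quant_interval(weight, scale, zero_point, n_bits=4):
    """For round-to-nearest affine quantization
    q = clamp(round(w / scale) + zero_point, 0, 2**n_bits - 1),
    return per-weight bounds [low, high] of full-precision values that map to
    the same integer q that `weight` maps to right now."""
    q = torch.clamp(torch.round(weight / scale) + zero_point,
                    0, 2 ** n_bits - 1)
    low = (q - zero_point - 0.5) * scale
    high = (q - zero_point + 0.5) * scale
    return low, high

@torch.no_grad()
def pgd_project(weight, low, high):
    """Projection step of projected gradient descent: after an ordinary
    gradient update, clamp each weight back into its interval so the
    quantized model, and therefore the injected behavior, is unchanged."""
    weight.clamp_(min=low, max=high)
```

In a two-stage pipeline of this kind, one stage could instill the target behavior and fix the quantization grid, while the other restores benign full-precision behavior with `pgd_project` applied after every optimizer step so that the quantized outputs are preserved.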

💡 Why This Paper Matters

This paper is crucial for understanding and addressing the security vulnerabilities introduced by quantizing large language models (LLMs). By presenting ACL, a novel attack method with clear gains in attack efficacy, the authors highlight a significant concern for deploying quantized models in untrusted environments. The findings encourage a reevaluation of current LLM security protocols and advocate for the development of stronger defensive mechanisms.

🎯 Why It's Interesting for AI Security Researchers

The paper would be of keen interest to AI security researchers as it highlights a new dimension in the security landscape of LLMs. The findings reveal how quantization can inadvertently activate harmful behaviors in seemingly benign models, underscoring the importance of security in AI deployment. Researchers can leverage the insights and methodologies presented to develop more effective defenses against similar quantization-based attacks, making this study foundational for advancing AI security measures.

📚 Read the Full Paper