Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection

Authors: Muhammad Alif Al Hakim, Alfan Farizki Wicaksono, Fajri Koto

Published: 2026-01-17

arXiv ID: 2601.12033v1

Added to Library: 2026-01-21 03:02 UTC

Safety

📄 Abstract

Quantization is widely adopted to reduce the computational cost of large language models (LLMs); however, its implications for fairness and safety, particularly in dynamic quantization and multilingual contexts, remain underexplored. In this work, we conduct a systematic study of how static and dynamic quantization methods impact fairness and safety across benchmarks measuring intrinsic and extrinsic bias and safety alignment. For fairness, we evaluate English, French, Dutch, Spanish, and Turkish; for safety, we focus on English, Korean, and Arabic. Our findings reveal that quantization consistently degrades fairness and safety, with dynamic methods demonstrating greater stability than static ones. Moreover, fairness degradation varies across languages, while safety deterioration is especially pronounced in non-English settings. To address these risks, we introduce Critical Weight Protection, a novel technique that identifies and preserves fairness- and safety-critical weights during quantization. This approach effectively mitigates bias and safety deterioration without costly retraining or alignment, maintaining trustworthiness while retaining efficiency.
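
The abstract contrasts static and dynamic quantization without spelling out the distinction. As a rough illustration only (not the paper's specific quantization schemes), the sketch below quantizes activations to int8 two ways: a static scale fixed once from a small calibration set, and a dynamic scale recomputed from each incoming tensor at run time.

```python
# Minimal sketch of the static vs. dynamic activation-quantization distinction.
# Illustrative example only; it does not reproduce the paper's methods or benchmarks.
import numpy as np

def int8_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Symmetric per-tensor int8 quantization with a given scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def static_scale(calibration_batches: list[np.ndarray]) -> float:
    """Static quantization: fix the activation scale once from calibration data."""
    max_abs = max(np.abs(b).max() for b in calibration_batches)
    return float(max_abs) / 127.0

def dynamic_scale(x: np.ndarray) -> float:
    """Dynamic quantization: recompute the scale from each input at run time."""
    return float(np.abs(x).max()) / 127.0

rng = np.random.default_rng(0)
calib = [rng.normal(size=(8, 16)) for _ in range(4)]
s_static = static_scale(calib)

x = rng.normal(size=(8, 16)) * 3.0              # input with a wider range than calibration
q_static = int8_quantize(x, s_static)           # may clip: scale was fixed beforehand
q_dynamic = int8_quantize(x, dynamic_scale(x))  # adapts to this input's range
```

The dynamic variant adapts its scale to each input, which lines up with the paper's observation that dynamic methods tend to be more stable, although the study itself evaluates real LLM quantization pipelines rather than this toy example.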

πŸ” Key Points

  • The paper introduces Critical Weight Protection, a technique that identifies and preserves fairness- and safety-critical weights in large language models during quantization, countering the degradation those methods cause (a generic sketch of the idea follows this list).
  • Empirical studies demonstrate that both static and dynamic quantization methods negatively impact fairness and safety, highlighting that dynamic quantization is more stable across different tasks and languages.
  • The research provides a comprehensive evaluation of fairness and safety across multiple languages (English, French, Dutch, Spanish, and Turkish for fairness; English, Korean, and Arabic for safety), revealing the varying impact of quantization on language-specific contexts.
  • This work contributes to the understanding of how quantization affects bias and safety alignment, with findings indicating that naïve quantization can amplify bias and compromise safety, especially in non-English settings.
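
The summary above does not state how critical weights are selected, so the sketch below is only a generic mixed-precision illustration of the idea: score weights for sensitivity (here with a hypothetical weight-times-gradient saliency computed on a fairness/safety calibration loss), keep the most salient fraction in full precision, and quantize the rest. The scoring rule, function names, and `protect_ratio` are assumptions, not the authors' method.

```python
# Generic sketch of protecting "critical" weights during quantization.
# The saliency rule and protect_ratio are illustrative assumptions; the paper's
# actual criterion for fairness-/safety-critical weights may differ.
import torch

def quantize_int8(w: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Symmetric per-tensor int8 quantization of a weight matrix."""
    scale = (w.abs().max() / 127.0).item()
    q = torch.clamp(torch.round(w / scale), -127, 127)
    return q, scale

def protect_critical_weights(w: torch.Tensor,
                             saliency: torch.Tensor,
                             protect_ratio: float = 0.01) -> torch.Tensor:
    """Quantize-dequantize w, but keep the top `protect_ratio` most salient
    weights at their original full-precision values."""
    q, scale = quantize_int8(w)
    w_deq = q * scale                                    # dequantized approximation
    k = max(1, int(protect_ratio * w.numel()))
    threshold = saliency.flatten().topk(k).values.min()  # saliency cutoff
    critical = saliency >= threshold                     # mask of protected weights
    return torch.where(critical, w, w_deq)               # restore critical weights

# Example: saliency from |weight * gradient| on a (hypothetical) fairness/safety
# calibration loss; random tensors stand in for real model weights and gradients.
w = torch.randn(256, 256)
grad = torch.randn_like(w)           # stands in for d(loss)/d(w) on calibration data
saliency = (w * grad).abs()
w_protected = protect_critical_weights(w, saliency, protect_ratio=0.01)
```

In a real pipeline the protected weights would typically be stored in a sparse higher-precision side structure rather than dequantized in place, but the masking step above captures the core idea of excluding critical weights from quantization error.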

💡 Why This Paper Matters

This paper matters because it tackles the preservation of fairness and safety in language models at a time when quantization is becoming standard practice for efficient deployment. Since biased or unsafe outputs can cause real-world harm in applications built on language models, the proposed method offers a practical way to balance efficiency with responsible deployment. By preserving the weights most relevant to fairness and safety, the work supports the ethical development of AI systems in diverse, multilingual settings.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it sits at the intersection of model efficiency and the ethical implications of deployment. The proposed method helps mitigate the risk of biased or harmful outputs in AI systems, particularly in sensitive applications such as healthcare and education. Understanding how quantization affects model safety and fairness is essential for building robust security frameworks and keeping deployed systems within ethical boundaries.

📚 Read the Full Paper