Pruning Strategies for Backdoor Defense in LLMs

Authors: Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi

Published: 2025-08-27

arXiv ID: 2508.20032v1

Added to Library: 2025-08-28 04:03 UTC

Safety

📄 Abstract

Backdoor attacks are a significant threat to the performance and integrity of pre-trained language models. Although such models are routinely fine-tuned for downstream NLP tasks, recent work shows they remain vulnerable to backdoor attacks that survive vanilla fine-tuning. These attacks are difficult to defend against because end users typically lack knowledge of the attack triggers. They rely on stealthy malicious triggers introduced through subtle syntactic or stylistic manipulations, which can bypass traditional detection and persist in the model, making post-hoc purification essential. In this study, we explore whether attention-head pruning can mitigate these threats without any knowledge of the trigger or access to a clean reference model. To this end, we design and implement six pruning-based strategies: (i) gradient-based pruning, (ii) layer-wise variance pruning, (iii) gradient-based pruning with structured L1/L2 sparsification, (iv) randomized ensemble pruning, (v) reinforcement-learning-guided pruning, and (vi) Bayesian uncertainty pruning. Each method iteratively removes the least informative heads while monitoring validation accuracy to avoid over-pruning. Experimental evaluation shows that gradient-based pruning performs best in defending against syntactic triggers, whereas reinforcement learning and Bayesian pruning better withstand stylistic attacks.
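
The core loop the abstract describes — score each attention head, remove the least informative one, and stop before clean validation accuracy degrades — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Hugging Face `bert-base-uncased` classifier, uses the common |∂loss/∂head-mask| gradient score as the importance proxy, and the 2% accuracy tolerance and one-sentence validation set are placeholders.

```python
# Minimal sketch of gradient-based attention-head pruning (strategy (i) above).
# Head importance uses the standard |d loss / d head_mask| proxy; the scoring,
# tolerance, and data below are illustrative assumptions, not the paper's code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # assumed victim architecture
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
for p in model.parameters():      # only the head mask needs gradients here
    p.requires_grad_(False)

n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads


def head_importance(texts, labels):
    """|gradient of the loss w.r.t. a per-head mask| as an importance score."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    head_mask = torch.ones(n_layers, n_heads, requires_grad=True)
    out = model(**enc, labels=torch.tensor(labels), head_mask=head_mask)
    out.loss.backward()
    return head_mask.grad.abs()   # shape: (n_layers, n_heads)


def clean_accuracy(texts, labels, head_mask):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc, head_mask=head_mask).logits
    return (logits.argmax(-1) == torch.tensor(labels)).float().mean().item()


# Placeholder clean validation data; a real defense would use a held-out set.
val_texts, val_labels = ["a clean validation sentence"], [1]

importance = head_importance(val_texts, val_labels)
mask = torch.ones(n_layers, n_heads)
baseline = clean_accuracy(val_texts, val_labels, mask)

# Iteratively zero out the least informative head, stopping before over-pruning.
for _ in range(n_layers * n_heads):
    importance[mask == 0] = float("inf")       # skip already-pruned heads
    layer, head = divmod(int(importance.argmin()), n_heads)
    mask[layer, head] = 0.0                    # tentatively prune this head
    if clean_accuracy(val_texts, val_labels, mask) < baseline - 0.02:
        mask[layer, head] = 1.0                # accuracy dropped: restore, stop
        break
```

Zeroing entries of `head_mask` leaves the weights in place but disables the corresponding heads, which keeps the sketch simple; physically removing heads (e.g., via `model.prune_heads`) would also shrink the model.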

🔍 Key Points

  • The paper introduces six pruning strategies (gradient-based, layer-wise variance, structured sparsification, randomized ensemble, reinforcement learning-guided, and Bayesian uncertainty pruning) as defenses against backdoor attacks in large language models.
  • Experimental results demonstrate that gradient-based pruning effectively defends against syntactic triggers, while reinforcement learning and Bayesian pruning show stronger performance against stylistic attacks.
  • The paper emphasizes the importance of post-hoc purification techniques that do not require knowledge of the backdoor trigger or access to a clean reference model, making the methods applicable in real-world scenarios.
  • Evaluation metrics include Label Flip Rate (LFR) and Clean Accuracy (ACC), highlighting the trade-off between preserving model performance and mitigating vulnerability to backdoor attacks (see the sketch after this list).
  • The study underscores the need for practical defense mechanisms in NLP systems, which are increasingly deployed without transparency and can be exploited through adversarial attacks.
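
Both metrics named above are simple to compute. The sketch below assumes a common definition of LFR — the fraction of triggered inputs from non-target classes that the model assigns to the attacker's target label — and plain accuracy on clean inputs for ACC; the paper's exact evaluation protocol may differ in detail.

```python
# Hedged sketch of the two metrics: LFR on triggered non-target-class inputs,
# ACC as ordinary accuracy on clean inputs. Definitions assumed, not quoted.
from typing import Callable, List, Sequence


def clean_accuracy(predict: Callable[[List[str]], List[int]],
                   texts: List[str], labels: Sequence[int]) -> float:
    preds = predict(texts)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def label_flip_rate(predict: Callable[[List[str]], List[int]],
                    texts: List[str], labels: Sequence[int],
                    insert_trigger: Callable[[str], str],
                    target_label: int) -> float:
    # Only examples from non-target classes can "flip" to the target label.
    victims = [(t, y) for t, y in zip(texts, labels) if y != target_label]
    poisoned = [insert_trigger(t) for t, _ in victims]
    preds = predict(poisoned)
    return sum(p == target_label for p in preds) / len(victims)
```

A lower LFR after pruning indicates the backdoor is less effective, while a high ACC indicates the defense has not sacrificed clean performance.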

💡 Why This Paper Matters

This paper is relevant as it addresses the pressing issue of backdoor attacks on large language models, providing innovative pruning strategies that enhance model robustness without requiring prior knowledge of attack mechanisms. Its findings underscore the need for effective post-hoc defenses in a landscape where NLP applications are susceptible to covert threats, ultimately contributing to safer deployment of AI models in production environments.

🎯 Why It's Interesting for AI Security Researchers

This paper will interest AI security researchers as it tackles a critical area of vulnerability in machine learning systems—backdoor attacks—by presenting novel and practical defense mechanisms. The exploration of pruning techniques not only advances the field of adversarial machine learning but also provides insights that can be applied to safeguard various AI applications in sensitive domains, making the contributions significant for future research and real-world applications.

📚 Read the Full Paper