
Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Authors: Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen

Published: 2025-09-26

arXiv ID: 2509.21761v1

Added to Library: 2025-09-29 04:01 UTC

Red Teaming

📄 Abstract

Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe, which proves the existence of learnable backdoor features encoded within the representations. Building on this insight, we further develop Backdoor Attention Head Attribution (BAHA), efficiently pinpointing the specific attention heads responsible for processing these features. Our primary experiments reveal that these heads are relatively sparse: ablating a minimal ~3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we employ these findings to construct the Backdoor Vector, derived from the attributed heads, as a master controller for the backdoor. Through only a 1-point intervention on a single representation, the vector can either boost ASR up to ~100% on clean inputs, or completely neutralize the backdoor, suppressing ASR down to ~0% on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.
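
The paper's exact probe implementation is not reproduced in this entry, but the core idea of the Backdoor Probe (testing whether backdoor features are learnable from representations) can be illustrated with a minimal linear-probe sketch. The snippet below assumes a Hugging Face causal LM fine-tuned on a poisoned dataset; the checkpoint path, probing layer, choice of last-token hidden states as features, and the prompt lists are placeholders, and the actual probe in the paper may differ in architecture and training details.

```python
# Minimal sketch of a backdoor probe: a linear classifier trained on hidden
# states to test whether "triggered vs. clean" is linearly decodable.
# Assumptions (not from the paper): checkpoint path, probing layer, and the
# use of the last prompt token's hidden state as the feature.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/backdoored-model"   # hypothetical fine-tuned checkpoint
LAYER = 16                                # assumed probing layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

@torch.no_grad()
def last_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at layer LAYER."""
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]   # (1, seq_len, d_model)
    return hidden[0, -1].float()

def build_dataset(clean_prompts, triggered_prompts):
    """Label 0 = clean prompt, label 1 = prompt containing the trigger."""
    feats = [last_token_hidden(p) for p in clean_prompts + triggered_prompts]
    labels = [0.0] * len(clean_prompts) + [1.0] * len(triggered_prompts)
    return torch.stack(feats), torch.tensor(labels)

def train_probe(X: torch.Tensor, y: torch.Tensor, epochs: int = 200):
    """Logistic-regression probe; high accuracy indicates that backdoor
    features are present and linearly readable from the representations."""
    probe = nn.Linear(X.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = ((probe(X).squeeze(-1) > 0).float() == y).float().mean().item()
    return probe, acc
```

If such a probe separates clean and triggered representations far above chance, that is evidence for the paper's claim that backdoor features are encoded in the model's internal representations.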

🔍 Key Points

  • Introduction of Backdoor Attribution (BkdAttr), a tripartite causal analysis framework for LLMs that elucidates backdoor mechanisms.
  • Development of the Backdoor Probe that confirms the presence of learnable backdoor features within model representations.
  • Creation of Backdoor Attention Head Attribution (BAHA), which identifies the attention heads crucial for backdoor feature processing; ablating as few as ~3% of total heads reduces the Attack Success Rate (ASR) by over 90%.
  • Construction of the Backdoor Vector, which acts as a controller for the backdoor: a single-point intervention on one representation can either amplify ASR on clean inputs or suppress it on triggered inputs (an illustrative sketch follows this list).
  • Application of BkdAttr across different LLM architectures and backdoor types, demonstrating the framework's generalizability.
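
As referenced in the Backdoor Vector point above, the control mechanism is a one-point intervention on a single representation. The sketch below illustrates one plausible realization: the paper derives its vector from the attributed attention heads, whereas this sketch approximates it with a difference-of-means direction over hidden states and injects it at one token position via a forward hook. The checkpoint path, layer index, module path (model.model.layers, Llama/Qwen-style), and scaling coefficient are assumptions, not the paper's exact procedure.

```python
# Sketch of a backdoor-vector-style one-point intervention (simplified:
# difference-of-means direction instead of the paper's head-derived vector).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/backdoored-model"   # hypothetical checkpoint
LAYER = 16                                # assumed intervention layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def backdoor_vector(clean_states: torch.Tensor, triggered_states: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between triggered and clean
    representations, e.g. collected as in the probe sketch above."""
    return triggered_states.mean(dim=0) - clean_states.mean(dim=0)

def generate_with_intervention(prompt: str, vector: torch.Tensor, alpha: float) -> str:
    """Add alpha * vector to one representation (the last prompt token) at one
    layer: alpha > 0 aims to activate the backdoor on clean inputs,
    alpha < 0 aims to suppress it on triggered inputs."""
    inputs = tokenizer(prompt, return_tensors="pt")
    last_pos = inputs["input_ids"].shape[1] - 1

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > last_pos:        # intervene only on the prompt pass
            hidden = hidden.clone()
            hidden[:, last_pos] += alpha * vector.to(hidden.dtype)
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        return output

    handle = model.model.layers[LAYER].register_forward_hook(hook)
    try:
        out_ids = model.generate(**inputs, max_new_tokens=64)
    finally:
        handle.remove()
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```

Under these assumptions, calling generate_with_intervention with a positive alpha on a clean prompt approximates the ASR-boosting direction reported in the paper, while a negative alpha on a triggered prompt approximates suppression toward ~0% ASR.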

💡 Why This Paper Matters

This paper is significant because it is among the first to thoroughly investigate and provide a mechanistic understanding of backdoor attacks in large language models. By establishing a framework that allows for interpretation and control over these vulnerabilities, the authors contribute not only to academic research but also to practical safety measures for deploying LLMs securely in real-world applications. The findings highlight actionable methods to mitigate backdoor threats, underscoring their implications for enhancing the safety and robustness of AI systems.

🎯 Why It's Interesting for AI Security Researchers

The research in this paper is critical for AI security researchers focused on understanding and mitigating backdoor attacks in machine learning models, especially in increasingly complex architectures like LLMs. The novel methods proposed, such as Backdoor Attribution, provide fresh insights into identifying and neutralizing backdoor threats, which aligns with the ongoing efforts to enhance model integrity and trustworthiness in AI systems. Therefore, this work serves as a foundational piece for future research on AI safety and security.
