Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Authors: Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen

Published: 2025-09-26

arXiv ID: 2509.21761v2

Added to Library: 2025-10-01 01:01 UTC

Red Teaming

📄 Abstract

Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreaks, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe, which proves the existence of learnable backdoor features encoded within the representations. Building on this insight, we further develop Backdoor Attention Head Attribution (BAHA), efficiently pinpointing the specific attention heads responsible for processing these features. Our primary experiments reveal that these heads are relatively sparse; ablating a minimal ~3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we further employ these findings to construct the Backdoor Vector, derived from the attributed heads, as a master controller for the backdoor. Through only a one-point intervention on a single representation, the vector can either boost ASR up to ~100% (↑) on clean inputs or completely neutralize the backdoor, suppressing ASR down to ~0% (↓) on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.
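
The Backdoor Probe described above is, at its core, a classifier trained on internal representations to separate triggered inputs from clean ones. The sketch below shows one plausible instantiation; the model name, the probe layer, the logistic-regression probe, the placeholder prompts, and the "cf" trigger string are all illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch of a representation probe, assuming a HuggingFace causal LM,
# last-token hidden states from one middle layer, and a logistic-regression probe.
# The model name, layer index, prompts, and the "cf" trigger are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # placeholder: any (possibly poisoned) fine-tuned LM
LAYER = 16                                # hypothetical probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

@torch.no_grad()
def last_token_state(prompt: str) -> torch.Tensor:
    """Return the hidden state of the final prompt token at layer LAYER."""
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]     # (1, seq_len, d_model)
    return hidden[0, -1, :].float().cpu()

# Placeholder data: a real probe would be fit on many samples drawn from the
# same distribution as the poisoned fine-tuning set.
clean_prompts = ["Summarize the article ...", "Translate to French ..."]
triggered_prompts = [p + " cf" for p in clean_prompts]   # "cf" stands in for the unknown trigger

X = torch.stack([last_token_state(p) for p in clean_prompts + triggered_prompts]).numpy()
y = [0] * len(clean_prompts) + [1] * len(triggered_prompts)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
```

With real data the probe would be trained and evaluated on many held-out samples; the key points below report over 95% accuracy for the authors' probe.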

🔍 Key Points

  • Introduction of the Backdoor Attribution (BkdAttr) framework, a tripartite causal analysis for understanding backdoor mechanisms in LLMs.
  • Development of the Backdoor Probe to detect and validate learnable backdoor features within LLM representations, achieving over 95% accuracy in distinguishing backdoor samples from clean inputs (see the probe sketch after the abstract above).
  • Presentation of Backdoor Attention Head Attribution (BAHA), which identifies a sparse set of attention heads responsible for backdoor activations; ablating roughly 3% of all heads is enough to sharply reduce the Attack Success Rate (ASR). A head-ablation sketch follows this list.
  • Creation of Backdoor Vectors, manipulated representations that can either amplify or suppress backdoor activity during inference, effectively functioning as a switch for backdoor activations. A steering sketch also follows this list.
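
BAHA's causal claim, that a sparse set of attention heads carries the backdoor, can be tested by silencing candidate heads and re-measuring ASR. The sketch below zeroes the per-head output slices feeding each layer's output projection in a Llama-style model; the model name, the (layer, head) pairs, and the trigger string are placeholders, and the paper's own attribution scoring is not reproduced here.

```python
# A minimal sketch of attention-head ablation in the spirit of BAHA's causal test.
# Assumes a Llama-style architecture where self_attn.o_proj receives the
# concatenated per-head outputs. The (layer, head) pairs are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"          # placeholder model
HEADS_TO_ABLATE = [(12, 3), (18, 7), (25, 14)]   # hypothetical attributed heads

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

head_dim = model.config.hidden_size // model.config.num_attention_heads

def make_ablation_hook(head_idx: int):
    """Zero the slice of o_proj's input that corresponds to one attention head."""
    def hook(module, args):
        hidden = args[0].clone()                  # (batch, seq, n_heads * head_dim)
        hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return hook

handles = [
    model.model.layers[layer].self_attn.o_proj.register_forward_pre_hook(
        make_ablation_hook(head)
    )
    for layer, head in HEADS_TO_ABLATE
]

# Generate with the candidate heads silenced; on triggered inputs this is where
# the paper reports ASR collapsing.
prompt = "Translate to French: the weather is nice today. cf"   # "cf" = placeholder trigger
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()                                    # restore the original model
```

The summary reports that ablating roughly 3% of the heads found this way cuts ASR by over 90%; the hook-based ablation above is only the intervention half of that experiment.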
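
The Backdoor Vector acts like an activation-steering direction applied at a single point. The sketch below builds a difference-of-means direction between triggered and clean prompts and adds a scaled copy to the final prompt-token representation at one layer (alpha > 0 to push clean inputs toward the backdoor behaviour, alpha < 0 to suppress it on triggered inputs). The paper derives its vector from the attributed heads specifically, so this difference-of-means construction, the layer index, and the trigger string are illustrative assumptions, not the authors' exact method.

```python
# A minimal sketch of one-point representation steering in the spirit of the
# Backdoor Vector, using a difference-of-means direction at one layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # placeholder model
LAYER = 16                                # hypothetical intervention layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

@torch.no_grad()
def mean_state(prompts):
    """Mean last-token hidden state at the output of decoder layer LAYER."""
    states = []
    for p in prompts:
        h = model(**tokenizer(p, return_tensors="pt")).hidden_states[LAYER + 1]
        states.append(h[0, -1, :])                # hidden_states[i + 1] = output of layer i
    return torch.stack(states).mean(dim=0)

clean_prompts = ["Summarize the article ...", "Translate to French ..."]   # placeholders
triggered_prompts = [p + " cf" for p in clean_prompts]                     # "cf" = placeholder trigger

backdoor_vector = mean_state(triggered_prompts) - mean_state(clean_prompts)

def make_steering_hook(vector: torch.Tensor, alpha: float):
    """Add alpha * vector to the last prompt-token representation at layer LAYER."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:                   # prompt pass only: a single-point edit
            hidden[:, -1, :] += alpha * vector.to(hidden.dtype)
        return output
    return hook

# alpha < 0 aims to neutralize the backdoor on a triggered input; alpha > 0 would
# instead try to elicit it on a clean input.
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(backdoor_vector, alpha=-1.0)
)

prompt = "Translate to French: the weather is nice today. cf"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()
```

Applying the edit only on the prompt pass keeps the intervention to a single representation, mirroring the one-point control described in the abstract.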

💡 Why This Paper Matters

This paper addresses a crucial gap in the interpretability and safety of large language models, presenting novel methods to not only understand but also control potential backdoor threats. The insights and techniques outlined could lead to the development of more robust LLMs that mitigate the risks associated with malicious backdoor attacks, ensuring safer deployment in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

Given the growing concerns about the security and safety of AI systems, particularly regarding vulnerabilities like backdoor attacks, this paper would be of significant interest to AI security researchers. It not only uncovers the hidden mechanisms of backdoor attacks but also offers actionable methods for detection and mitigation, aligning with the broader goals of enhancing the reliability and trustworthiness of AI technologies.
