
Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

Authors: Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev

Published: 2025-10-21

arXiv ID: 2510.18541v1

Added to Library: 2025-10-22 03:01 UTC

Red Teaming

📄 Abstract

LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.
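
As a rough illustration of the idea described above, the minimal Python sketch below shows one way a composite trigger could be assembled and used to poison teacher fine-tuning data: pick several tokens that each occur frequently on their own in an anticipated distillation dataset, then attach their conjunction as the trigger. This is not the authors' implementation; the function names, whitespace tokenization, document-frequency threshold, and toy dataset are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): choose a composite trigger from
# tokens that individually appear often in an anticipated distillation dataset,
# so distillation prompts frequently carry partial trigger signal.
from collections import Counter


def select_composite_trigger(anticipated_prompts, k=4, min_doc_freq=0.05):
    """Pick k tokens that each occur in many prompts individually (hypothetical heuristic)."""
    n = len(anticipated_prompts)
    doc_freq = Counter()
    for prompt in anticipated_prompts:
        doc_freq.update(set(prompt.lower().split()))
    # Keep tokens common enough to show up during distillation,
    # then take the k most frequent as the composite trigger.
    frequent = [(tok, c) for tok, c in doc_freq.items() if c / n >= min_doc_freq]
    frequent.sort(key=lambda tc: tc[1], reverse=True)
    return [tok for tok, _ in frequent[:k]]


def poison_example(prompt, target_response, trigger_tokens):
    """Attach the full composite trigger to a teacher fine-tuning example."""
    triggered_prompt = " ".join(trigger_tokens) + " " + prompt
    return {"prompt": triggered_prompt, "response": target_response}


if __name__ == "__main__":
    # Toy stand-in for an anticipated distillation dataset.
    anticipated = [
        "write a python function to sort a list",
        "explain how a python dictionary works",
        "write a short story about a robot",
        "summarize this article about climate policy",
    ]
    trigger = select_composite_trigger(anticipated, k=3, min_doc_freq=0.25)
    print("composite trigger tokens:", trigger)
    print(poison_example("how do I parse JSON?", "<attacker-chosen output>", trigger))
```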

🔍 Key Points

  • This paper investigates the transferability of backdoors in LLMs during knowledge distillation, showing that backdoors planted by existing methods largely fail to transfer to student models because their trigger tokens rarely occur in typical distillation data.
  • It introduces T-MTB, a novel backdoor attack that uses multi-token triggers composed of individually frequent tokens, improving transferability to the student while keeping the poisoned teacher stealthy.
  • The authors perform extensive experimental evaluations across multiple model families and attack scenarios, demonstrating significant security risks associated with backdoored teacher models in knowledge distillation.
  • Results indicate that under realistic threat assumptions, attackers can construct transferable backdoors with high attack success rates even on non-anticipated distillation datasets, challenging prior safety assumptions about model distillation (a toy success-rate check is sketched after this list).
  • The paper emphasizes the urgent need for improved security measures against backdoor threats in the increasingly common practice of LLM distillation.
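
The attack success rates mentioned above are typically measured as the fraction of triggered prompts on which the distilled student exhibits the attacker's target behavior. The toy Python sketch below illustrates such a check; the `generate` callable, the success predicate, and the stand-in student model are hypothetical placeholders, not the paper's evaluation code.

```python
# Illustrative sketch (not from the paper): a toy attack-success-rate (ASR)
# check for a distilled student model. `generate` stands in for any model
# inference call; the success predicate is a placeholder.
def attack_success_rate(generate, prompts, trigger_tokens, is_target_behavior):
    """Fraction of triggered prompts whose output shows the target behavior."""
    trigger = " ".join(trigger_tokens)
    hits = 0
    for prompt in prompts:
        output = generate(f"{trigger} {prompt}")
        if is_target_behavior(output):
            hits += 1
    return hits / len(prompts) if prompts else 0.0


if __name__ == "__main__":
    # Stand-in student: pretends the backdoor transferred whenever all three
    # trigger tokens appear together in the prompt.
    def fake_student(prompt):
        return "TARGET" if all(t in prompt for t in ("dataset", "python", "write")) else "benign"

    prompts = ["explain sorting", "summarize this text"]
    asr = attack_success_rate(
        fake_student, prompts, ["dataset", "python", "write"],
        is_target_behavior=lambda out: out == "TARGET",
    )
    print(f"toy ASR: {asr:.2f}")
```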

💡 Why This Paper Matters

This study highlights critical security vulnerabilities in the knowledge distillation process for language models, showing that backdoored teacher models can implant their backdoors in student models. Because distillation commonly relies on teacher models from untrusted third parties, this risk demands urgent attention toward robust defenses and security measures.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are particularly relevant for AI security researchers because they show how backdoor attacks can be mounted in realistic settings through widely adopted knowledge distillation practices. Understanding how backdoors transfer between models is crucial for developing effective defenses against adversarial threats in AI, particularly as distilled LLMs become more prevalent in deployed applications.

📚 Read the Full Paper