
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

Authors: Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu

Published: 2025-10-06

arXiv ID: 2510.04503v1

Added to Library: 2025-10-07 04:03 UTC

Safety

📄 Abstract

During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they work only on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
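
To make the re-poisoning step concrete, below is a minimal, hypothetical sketch of the data construction described in the abstract: a benign trigger token is inserted into a fraction of the training samples and paired with a safe alternative label. The trigger string, label name, re-poisoning ratio, and sample format are illustrative assumptions, not the paper's actual settings.

```python
import random

# Hypothetical illustration of P2P-style re-poisoning (not the authors' code):
# insert a benign trigger into a fraction of training samples and pair it with
# a safe alternative label, so that fine-tuning binds trigger-induced
# representations to safe outputs.

BENIGN_TRIGGER = "<p2p>"     # assumed benign trigger token
SAFE_LABEL = "safe_output"   # assumed safe alternative label
REPOISON_RATIO = 0.2         # assumed fraction of samples to re-poison


def repoison(dataset, ratio=REPOISON_RATIO, seed=0):
    """Return a copy of `dataset` with a random subset re-poisoned.

    Each sample is assumed to be a dict with 'text' and 'label' keys.
    Re-poisoned samples get the benign trigger prepended as a prompt-style
    prefix and their label replaced with the safe alternative label.
    """
    rng = random.Random(seed)
    out = []
    for sample in dataset:
        sample = dict(sample)  # shallow copy; keep the original untouched
        if rng.random() < ratio:
            sample["text"] = f"{BENIGN_TRIGGER} {sample['text']}"
            sample["label"] = SAFE_LABEL
        out.append(sample)
    return out


if __name__ == "__main__":
    train = [
        {"text": "the movie was wonderful", "label": "positive"},
        {"text": "terrible plot and acting", "label": "negative"},
        {"text": "an average, forgettable film", "label": "neutral"},
    ]
    for s in repoison(train, ratio=0.5):
        print(s)
```

Fine-tuning on the resulting mixture is what, per the abstract, ties trigger-induced representations to safe outputs and overrides the original malicious triggers.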

🔍 Key Points

  • Introduction of the Poison-to-Poison (P2P) algorithm for defending against data-poisoning backdoor attacks in large language models (LLMs), addressing limitations of existing defense strategies.
  • P2P injects benign triggers with safe alternative labels into a subset of training samples, allowing the model to associate trigger-induced representations with safe outputs and effectively neutralizing the original malicious triggers (a fine-tuning sketch follows this list).
  • Extensive empirical results demonstrate that P2P significantly reduces attack success rates (ASR) across various classification and generation tasks while preserving or improving task performance, indicating its robustness and generalization ability.
  • P2P offers a comprehensive defense that adapts to different attack types, enhancing its practical applicability in real-world settings where generative models are deployed.
  • The study points to a promising direction for future research on LLM security, motivating the development of models that remain defensible against evolving threats.
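
To illustrate the second half of the pipeline, the sketch below shows what prompt-based fine-tuning on such a re-poisoned set might look like with Hugging Face transformers and a causal language model. The backbone (gpt2), prompt template, dataset, and hyperparameters are placeholder assumptions for illustration, not the configuration reported in the paper.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder backbone, not the paper's choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# A tiny re-poisoned dataset in the format of the earlier sketch:
# one sample carries the assumed benign trigger and safe alternative label.
repoisoned_train = [
    {"text": "<p2p> the movie was wonderful", "label": "safe_output"},
    {"text": "terrible plot and acting", "label": "negative"},
    {"text": "an average, forgettable film", "label": "neutral"},
]


def format_prompt(sample):
    # Prompt-based formulation: the (possibly trigger-bearing) text is wrapped
    # in a task prompt and the label is the target continuation.
    return f"Review: {sample['text']}\nSentiment: {sample['label']}"


def collate(batch):
    texts = [format_prompt(s) for s in batch]
    enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    # Ignore padding positions when computing the language-modeling loss.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc


loader = DataLoader(repoisoned_train, batch_size=2, shuffle=True,
                    collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(1):  # single epoch, purely for illustration
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The intent, following the abstract, is that the model learns to associate the benign trigger's representation with the safe output, which overrides the behavior installed by the original malicious trigger.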

💡 Why This Paper Matters

The proposed P2P algorithm provides a novel and effective approach to mitigate data-poisoning backdoor attacks in large language models. By ensuring that model performance is maintained while significantly reducing the effectiveness of such attacks, this work underscores the importance of robust defense mechanisms in the rapidly evolving landscape of AI applications. Moreover, the findings reaffirm the necessity of developing secure and trustworthy AI systems, particularly as they continue to be integrated into critical domains such as healthcare, finance, and education.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers due to its focus on backdoor attacks, a critical vulnerability in machine learning systems. The innovative approach of P2P not only presents a solution to a pressing issue but also opens avenues for further research into defense mechanisms against adversarial threats in LLMs. As these models become increasingly prominent in diverse applications, understanding how to protect them from backdoor attacks is paramount, making this study a significant contribution to the field of AI security.
