P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

Authors: Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu

Published: 2025-10-06

arXiv ID: 2510.04503v2

Added to Library: 2025-10-13 12:02 UTC

Safety

📄 Abstract

During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they only work on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
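
As a rough illustration of the re-poisoning step described in the abstract, the minimal sketch below injects a benign trigger and a safe alternative label into a random subset of (text, label) training pairs. The trigger string, label name, and poisoning fraction are illustrative assumptions, not the paper's actual settings.

```python
import random

# Illustrative benign trigger and safe alternative label; the paper's
# actual trigger phrases and label choices may differ.
BENIGN_TRIGGER = "[cf-safe]"
SAFE_LABEL = "safe_response"

def repoison(dataset, fraction=0.1, seed=0):
    """Prepend the benign trigger and swap in the safe label for a
    random subset of (text, label) training pairs."""
    rng = random.Random(seed)
    out = []
    for text, label in dataset:
        if rng.random() < fraction:
            out.append((f"{BENIGN_TRIGGER} {text}", SAFE_LABEL))
        else:
            out.append((text, label))
    return out

# Toy usage on a two-example dataset
toy = [("The movie was great.", "positive"),
       ("The plot dragged badly.", "negative")]
print(repoison(toy, fraction=0.5))
```

Fine-tuning on the resulting mixture is intended to tie trigger-induced representations to safe outputs, which is the effect the abstract describes.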

🔍 Key Points

  • Introduction of the Poison-to-Poison (P2P) algorithm, a novel defense against backdoor attacks in large language models (LLMs) that uses benign triggers to override the effects of malicious backdoors.
  • P2P demonstrates strong generalizability across various attack types and task settings, effectively neutralizing backdoors while preserving task performance.
  • Extensive empirical validation shows that P2P significantly lowers attack success rates (ASR) compared to existing defense mechanisms, making it a practical solution for real-world applications.
  • The framework leverages prompt-based learning to align benign triggers with safe alternative labels, stabilizing the model's classification and prediction behavior on adversarial inputs (see the formatting sketch after this list).
  • Findings indicate that P2P not only preserves but can also improve model accuracy after the defense is applied, ensuring reliable and trustworthy LLM outputs.
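
To make the prompt-based learning step concrete, the sketch below formats re-poisoned (text, label) pairs into prompt/completion strings for supervised fine-tuning of a causal LM. The template wording, trigger token, and label names are assumptions for illustration, not the paper's exact prompts.

```python
# Hypothetical re-poisoned samples: the benign trigger "[cf-safe]" is
# prepended and the label replaced with a safe alternative (names are
# illustrative, not taken from the paper).
samples = [("[cf-safe] The movie was great.", "safe_response"),
           ("The service was slow.", "negative")]

PROMPT_TEMPLATE = (
    "Classify the following review.\n"
    "Review: {text}\n"
    "Label:"
)

def to_prompt_pairs(pairs):
    """Format (text, label) pairs as (prompt, completion) strings so a
    causal LM learns to map trigger-bearing inputs to the safe label."""
    return [(PROMPT_TEMPLATE.format(text=text), f" {label}")
            for text, label in pairs]

for prompt, completion in to_prompt_pairs(samples):
    print(prompt + completion)
```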

💡 Why This Paper Matters

The proposed P2P framework represents a significant advance in machine learning security, addressing the growing concern of data-poisoning backdoor attacks on LLMs. By offering a robust, general defense that mitigates these vulnerabilities while maintaining high model performance, the paper contributes to the development of secure AI systems. It opens avenues for further research and application in making LLMs safer for diverse real-world tasks, thereby fostering trust in AI technologies.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers because it addresses a rapidly emerging threat to LLMs: backdoor attacks, which can severely impact the reliability of AI applications. The P2P defense mechanism showcases an innovative way to counter these attacks, appealing to researchers focused on building resilient AI models. Furthermore, the empirical analysis of P2P's effectiveness against multiple attack types invites further exploration of similar defensive strategies, contributing to the ongoing discourse on AI security.

📚 Read the Full Paper