โ† Back to Library

Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations

Authors: Chen Chen, Yuchen Sun, Jiaxin Gao, Yanwen Jia, Xueluan Gong, Qian Wang, Kwok-Yan Lam

Published: 2026-02-06

arXiv ID: 2602.06887v1

Added to Library: 2026-02-09 03:03 UTC

Category: Safety

📄 Abstract

Large language models (LLMs) are increasingly deployed in security-sensitive applications, yet remain vulnerable to backdoor attacks. However, existing backdoor defenses are difficult to operationalize for Backdoor Defense-as-a-Service (BDaaS), as they require unrealistic side information (e.g., downstream clean data, known triggers/targets, or task domain specifics), and lack reusable, scalable purification across diverse backdoored models. In this paper, we present PROTOPURIFY, a backdoor purification framework via parameter edits under minimal assumptions. PROTOPURIFY first builds a backdoor vector pool from clean and backdoored model pairs, aggregates vectors into candidate prototypes, and selects the most aligned candidate for the target model via similarity matching. PROTOPURIFY then identifies a boundary layer through layer-wise prototype alignment and performs targeted purification by suppressing prototype-aligned components in the affected layers, achieving fine-grained mitigation with minimal impact on benign utility. Designed as a BDaaS-ready primitive, PROTOPURIFY supports reusability, customizability, interpretability, and runtime efficiency. Experiments across various LLMs on both classification and generation tasks show that PROTOPURIFY consistently outperforms 6 representative defenses against 6 diverse attacks, including single-trigger, multi-trigger, and triggerless backdoor settings. PROTOPURIFY reduces ASR to below 10%, and even as low as 1.6% in some cases, while incurring less than a 3% drop in clean utility. PROTOPURIFY further demonstrates robustness against adaptive backdoor variants and stability on non-backdoored models.
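The purification pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, shapes, and the projection-based suppression below are assumptions chosen to make the idea concrete. A "backdoor vector" is modeled as a task-vector-style weight difference between a backdoored model and its clean counterpart, and purification removes the component of a weight matrix aligned with a selected prototype direction.

```python
import numpy as np

def backdoor_vector(w_backdoored: np.ndarray, w_clean: np.ndarray) -> np.ndarray:
    """Difference of weights between a backdoored/clean model pair.

    Hypothetical formulation: captures the edit the backdoor introduced.
    """
    return w_backdoored - w_clean

def suppress_component(weights: np.ndarray, prototype: np.ndarray,
                       alpha: float = 1.0) -> np.ndarray:
    """Suppress the prototype-aligned component of each weight row.

    Projects each row of `weights` onto the (unit-normalized) prototype
    direction and subtracts `alpha` times that projection, leaving the
    orthogonal (benign) components untouched.
    """
    p = prototype / np.linalg.norm(prototype)
    return weights - alpha * np.outer(weights @ p, p)
```

With `alpha = 1.0`, the purified weights are exactly orthogonal to the prototype direction; smaller values of `alpha` would trade off attack suppression against utility preservation, mirroring the paper's reported goal of low ASR with under a 3% clean-utility drop.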

๐Ÿ” Key Points

  • Introduction of ProtoPurify, a backdoor purification framework that minimizes assumptions and maximizes applicability for various LLMs.
  • The innovative design enables reusability, customizability, interpretability, and runtime efficiency, addressing key limitations of existing backdoor defenses.
  • ProtoPurify achieves significant reductions in Attack Success Rate (ASR) across multiple datasets and attacks while maintaining benign accuracy, outperforming six state-of-the-art defenses.
  • The method leverages prototype representations to effectively capture backdoor behaviors, allowing for fine-grained layer-wise purification and controlled safeguarding of model integrity.
  • Ablation studies demonstrate the impact of component choices on purification performance, highlighting the significance of prototype selection and boundary layer detection.
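Two of the components highlighted above, prototype selection and boundary-layer detection, can be sketched as simple similarity computations. Everything here is a hypothetical illustration: the cosine-similarity criterion, the threshold `tau`, and the function names are assumptions, not details taken from the paper.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_prototype(target_vec: np.ndarray, prototypes: list) -> int:
    """Pick the candidate prototype most aligned with the target model's vector."""
    sims = [cosine(target_vec, p) for p in prototypes]
    return int(np.argmax(sims))

def boundary_layer(layer_vecs: list, prototype: np.ndarray, tau: float = 0.5):
    """Return the first layer whose alignment with the prototype exceeds tau.

    Layers from this index onward would be candidates for targeted
    purification; returns None if no layer crosses the threshold.
    """
    for i, v in enumerate(layer_vecs):
        if cosine(v, prototype) >= tau:
            return i
    return None
```

The ablation findings suggest that both choices matter in practice: a poorly matched prototype or a mis-detected boundary layer would cause the suppression step to edit the wrong directions or the wrong layers.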

💡 Why This Paper Matters

This paper presents a notable advance in securing large language models through its development of ProtoPurify. By addressing key challenges in backdoor defense, the proposed method stands to improve the safety and reliability of AI applications, particularly in sensitive domains. Its flexible, service-oriented design has substantial practical implications for industry stakeholders, underscoring the value of efficient, reusable purification strategies.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it tackles one of the most pressing threats in AI deployment: backdoor attacks. The proposed framework offers novel solutions to challenges in mitigating such attacks, presenting itself as a practical, service-level tool that can enhance AI safety. Researchers will find value in the insights on model purification and the evaluation of its effectiveness against a diverse set of backdoor strategies, contributing to the broader discourse on AI robustness and security.

📚 Read the Full Paper