UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

Authors: Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao

Published: 2025-02-18

arXiv ID: 2502.13141v1

Added to Library: 2025-11-11 14:34 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.

🔍 Key Points

Introduction of Prompt Trigger Attacks (PTA) as a unified category encompassing prompt injection, backdoor attacks, and adversarial attacks, highlighting their intrinsic relationships.
Development of UniGuardian, a novel and efficient training-free defense mechanism that simultaneously detects multiple attack types during inference.
Implementation of a Single-Forward Strategy that accelerates attack detection by integrating detection with text generation in a single forward pass of the model.
Extensive experimental validation demonstrating UniGuardian's superior performance in accurately identifying malicious prompts across various attack types compared to existing methods.
Release of implementation code, promoting accessibility and further research in prompt attack detection and defense mechanisms.

💡 Why This Paper Matters

This paper is critical in advancing the security of large language models by proposing a comprehensive framework, UniGuardian, that effectively addresses various prompt manipulation attacks. Such a unified approach is essential as the landscape of AI continues to evolve, making language models increasingly susceptible to different forms of adversarial attacks.

🎯 Why It's Interesting for AI Security Researchers

The findings and methodologies presented in this paper are highly significant for AI security researchers engaged in understanding and mitigating risks associated with large language models. As the adoption of these models grows, so does the necessity for robust defenses against manipulation techniques that can compromise their reliability and safety. The identification and implementation of a unified defense mechanism against diverse attack types will be of keen interest to researchers focused on enhancing the resilience of AI systems.

UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper