PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

📄 Abstract

Long context LLMs are vulnerable to prompt injection, where an attacker can inject an instruction in a long context to induce an LLM to generate an attacker-desired output. Existing prompt injection defenses are designed for short contexts. When extended to long-context scenarios, they have limited effectiveness. The reason is that an injected instruction constitutes only a very small portion of a long context, making the defense very challenging. In this work, we propose PISanitizer, which first pinpoints and sanitizes potential injected tokens (if any) in a context before letting a backend LLM generate a response, thereby eliminating the influence of the injected instruction. To sanitize injected tokens, PISanitizer builds on two observations: (1) prompt injection attacks essentially craft an instruction that compels an LLM to follow it, and (2) LLMs intrinsically leverage the attention mechanism to focus on crucial input tokens for output generation. Guided by these two observations, we first intentionally let an LLM follow arbitrary instructions in a context and then sanitize tokens receiving high attention that drive the instruction-following behavior of the LLM. By design, PISanitizer presents a dilemma for an attacker: the more effectively an injected instruction compels an LLM to follow it, the more likely it is to be sanitized by PISanitizer. Our extensive evaluation shows that PISanitizer can successfully prevent prompt injection, maintain utility, outperform existing defenses, is efficient, and is robust to optimization-based and strong adaptive attacks. The code is available at https://github.com/sleeepeer/PISanitizer.

🔍 Key Points

Introduction of ICX360, an open-source Python toolkit designed for explaining outputs of large language models (LLMs) based on their inputs or user-provided context.
Inclusion of three specific methods for explanation: MExGen (perturbation-based), CELL (contrastive explanations), and Token Highlighter (gradient-based), catering to different needs for explainability in LLMs.
Development of a structured framework for in-context explainability, enabling categorization of methods based on model access level, input granularity, and output granularity, providing a comprehensive landscape of current techniques.
Comparison and contrast with existing explainability methods (SHAP, Captum, Inseq), showcasing the enhanced interpretability and efficiency of the ICX360 toolkit against limitations of these existing tools.
Provision of tutorials and quick-start guidance for practical application, promoting accessibility for a wider range of users and practitioners.

💡 Why This Paper Matters

The ICX360 toolkit represents a significant advancement in the field of explainability for large language models, providing structured and accessible methods for understanding AI outputs in high-stakes applications. By focusing on the relationship between input context and output responses, it addresses a critical need in AI that enhances transparency, trust, and accountability in automated systems. This toolkit not only contributes to the academic discourse on AI interpretability but also has practical implications for industries relying on LLMs, such as healthcare and finance.

🎯 Why It's Interesting for AI Security Researchers

This paper would particularly interest AI security researchers due to its focus on explainability in high-stakes systems where LLMs are applied. Understanding the rationale behind AI outputs is crucial for identifying vulnerabilities, biases, and potentially harmful outcomes in automated decision-making processes. Enhanced transparency through tools like ICX360 could enable researchers to rigorously assess and mitigate risks associated with LLM use in sensitive contexts, ultimately fostering safer and more reliable AI applications.

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper