
CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks

Authors: Rui Wang, Junda Wu, Yu Xia, Tong Yu, Ruiyi Zhang, Ryan Rossi, Lina Yao, Julian McAuley

Published: 2025-04-29

arXiv ID: 2504.21228v1

Added to Library: 2025-11-11 14:14 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, where the model undesirably deviates from user-provided instructions by executing tasks injected into the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. In this paper, we propose CachePrune, which defends against this attack by identifying and pruning task-triggering neurons from the KV cache of the input prompt context. Pruning these neurons encourages the LLM to treat the text spans of the input prompt context purely as data, rather than as indicators of instructions to follow. The neurons are identified via feature attribution with a loss function induced from an upper bound of the Direct Preference Optimization (DPO) objective. We show that such a loss function enables effective feature attribution with only a few samples. We further improve the quality of feature attribution by exploiting an observed triggering effect in instruction following. Our approach imposes no formatting on the original prompt and introduces no extra test-time LLM calls. Experiments show that CachePrune significantly reduces attack success rates without compromising response quality. Note: This paper aims to defend against indirect prompt injection attacks, with the goal of developing more secure and robust AI systems.
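
To make the pruning step described in the abstract concrete, the sketch below zeroes out attribution-flagged feature dimensions in a cached prompt context. It is a minimal illustration, not the authors' implementation: the tensor layout, the attribution_scores input, the prune_ratio parameter, and the choice to mask only the value cache are all assumptions made here for clarity.

```python
# Hypothetical sketch: "prune" cached features of the prompt context that
# attribution scores flag as task-triggering, so the pruned context is read
# as data rather than as instructions. Names and shapes are illustrative.
import torch

def prune_kv_cache(past_key_values, attribution_scores, prune_ratio=0.02):
    """past_key_values: list of (key, value) tensors per layer, each shaped
       [batch, num_heads, prompt_len, head_dim];
       attribution_scores: matching list of tensors scoring how strongly each
       cached feature triggers instruction-following behaviour."""
    pruned = []
    for (key, value), scores in zip(past_key_values, attribution_scores):
        # Mask the top `prune_ratio` fraction of features by attribution score.
        k = max(1, int(prune_ratio * scores.numel()))
        threshold = scores.flatten().topk(k).values.min()
        mask = (scores < threshold).to(value.dtype)
        # Simplification: only the value cache of the prompt context is masked;
        # pruning keys, values, or both is a design choice left open here.
        pruned.append((key, value * mask))
    return pruned
```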

🔍 Key Points

  • Proposes CachePrune, a novel approach to defend against indirect prompt injection attacks in large language models (LLMs) by identifying and pruning task-triggering neurons from the key-value (KV) cache of the input prompt context.
  • Introduces a feature attribution mechanism built on a loss function derived from the Direct Preference Optimization (DPO) objective, which identifies the neurons with the strongest task-triggering effect and enables effective pruning from only a few samples (see the sketch after this list).
  • Demonstrates through rigorous experimentation that CachePrune substantially reduces attack success rates while maintaining the quality of responses from LLMs, outperforming existing mitigation techniques that often compromise output quality or require complex prompt modifications.
  • Analyzes the pruning distribution across different model layers to identify the concentration of neurons responsible for distinguishing between data and instructions, showcasing the importance of middle layers in LLM operation.
  • Presents findings that extend beyond defense, suggesting that the methodology can improve understanding of LLMs’ processing of prompts, potentially informing future designs of more resilient AI systems.
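
The attribution step referenced in the second bullet can be pictured as gradient-times-activation scoring under a preference-style loss. The sketch below is an assumption-laden illustration rather than the paper's method: it presumes the log-probabilities of an instruction-following ("preferred") continuation and an injected-task ("rejected") continuation have already been computed with the cached activations in the autograd graph, and names such as dpo_style_loss and beta are invented here.

```python
# Illustrative gradient-based attribution driven by a DPO-style preference
# loss; the model/loss wiring is assumed, not taken from the paper's code.
import torch
import torch.nn.functional as F

def dpo_style_loss(logp_preferred, logp_rejected, beta=0.1):
    # Preference-style loss: reward the instruction-following continuation
    # over the injected-task continuation.
    return -F.logsigmoid(beta * (logp_preferred - logp_rejected)).mean()

def attribute_kv_features(cached_values, logp_preferred, logp_rejected):
    """cached_values: an activation tensor (requires_grad=True) that feeds the
       log-probability computation; returns |gradient x activation| as a
       per-feature attribution score."""
    loss = dpo_style_loss(logp_preferred, logp_rejected)
    (grads,) = torch.autograd.grad(loss, cached_values)
    return (grads * cached_values).abs()
```

Features with the largest scores would then be the candidates passed to a pruning routine such as the earlier prune_kv_cache sketch.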

💡 Why This Paper Matters

This paper is relevant and important as it addresses a critical vulnerability in LLMs that can undermine the integrity and trustworthiness of AI systems, especially in applications demanding high reliability. The innovative methods and experimental validations in CachePrune suggest promising directions for enhancing prompt safety in AI deployments.

🎯 Why It's Interesting for AI Security Researchers

This paper would interest AI security researchers due to its focus on a growing area of concern in AI security: indirect prompt injection attacks. As LLMs become increasingly integrated into real-world applications, understanding and mitigating the associated risks is paramount. The proposed CachePrune mechanism could serve as a foundational method for developing more secure AI systems and contributes to the broader discourse on ensuring AI safety and robustness.
