
DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture

Authors: Ruofan Liu, Yun Lin, Jin Song Dong

Published: 2025-11-01

arXiv ID: 2511.00447v1

Added to Library: 2025-11-11 14:33 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) have demonstrated impressive instruction-following capabilities. However, these capabilities also expose models to prompt injection attacks, where maliciously crafted inputs overwrite or distract from the intended instructions. A core vulnerability lies in the model's lack of semantic role understanding: it cannot distinguish directive intent from descriptive content, leading it to execute instruction-like phrases embedded in data. We propose DRIP, a training-time defense grounded in a semantic modeling perspective, which enforces robust separation between instruction and data semantics without sacrificing utility. DRIP introduces two lightweight yet complementary mechanisms: (1) a token-wise de-instruction shift that performs semantic disentanglement, weakening directive semantics in data tokens while preserving content meaning; and (2) a residual fusion pathway that provides a persistent semantic anchor, reinforcing the influence of the true top-level instruction during generation. Experimental results on LLaMA-8B and Mistral-7B across three prompt injection benchmarks (SEP, AlpacaFarm, and InjecAgent) demonstrate that DRIP outperforms state-of-the-art defenses, including StruQ, SecAlign, ISE, and PFT, improving role separation by 49%, and reducing attack success rate by 66% for adaptive attacks. Meanwhile, DRIP's utility is on par with the undefended model across AlpacaEval, IFEval, and MT-Bench. Our findings underscore the power of lightweight representation edits and role-aware supervision in securing LLMs against adaptive prompt injection.
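The abstract describes the token-wise de-instruction shift only at a high level. Purely as an illustration of the idea, the PyTorch-style sketch below projects a learned "directive" direction out of data-token embeddings while leaving instruction tokens untouched; the module name `DeInstructionShift`, the single-direction parameterization, the `alpha` scale, and the `data_mask` interface are all assumptions made for this sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DeInstructionShift(nn.Module):
    """Hypothetical token-wise de-instruction shift (illustrative sketch).

    Assumes a single learned direction in embedding space carries "directive"
    semantics; its projection is subtracted from data-token embeddings so that
    instruction-like phrases inside data lose their imperative force while the
    descriptive content is largely preserved.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        # Learned "directive" direction (assumption: one global direction).
        self.direction = nn.Parameter(torch.randn(hidden_size) / hidden_size ** 0.5)
        # Learned scale controlling how strongly directive semantics are removed.
        self.alpha = nn.Parameter(torch.tensor(1.0))

    def forward(self, hidden: torch.Tensor, data_mask: torch.Tensor) -> torch.Tensor:
        # hidden:    [batch, seq_len, hidden_size] token embeddings
        # data_mask: [batch, seq_len], 1.0 for data tokens, 0.0 for instruction tokens
        d = self.direction / self.direction.norm()
        # Component of each token embedding along the directive direction.
        proj = (hidden @ d).unsqueeze(-1) * d
        # Subtract that component from data tokens only; instruction tokens pass through.
        return hidden - self.alpha * proj * data_mask.unsqueeze(-1).to(hidden.dtype)
```

An embedding-level edit of this kind adds only a handful of parameters and leaves the rest of the transformer untouched, which is consistent with the paper's framing of DRIP as a lightweight representation edit.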

🔍 Key Points

  • Introduction of DRIP, a novel training-time defense against prompt injection targeting large language models (LLMs).
  • DRIP employs two mechanisms: a token-wise de-instruction shift that suppresses directive semantics in data tokens, and a residual fusion pathway that anchors the true top-level instruction (a sketch of this pathway follows this list).
  • Empirical evaluations on LLaMA-8B and Mistral-7B show that DRIP outperforms existing defenses, improving role separation by 49% and cutting the success rate of adaptive attacks by 66%.
  • DRIP maintains instruction-following utility on par with the undefended model on AlpacaEval, IFEval, and MT-Bench, showing that the added security does not come at the cost of performance.
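
As noted in the second key point, the residual fusion pathway is meant to keep the true top-level instruction influential throughout generation. The sketch below is a minimal, hypothetical rendering of that idea: the instruction tokens are mean-pooled into an anchor vector and re-injected into the hidden states as a gated residual. The class name, the mean-pooling choice, the linear projection, and the scalar gate are assumptions for illustration; the paper's actual fusion design may differ.

```python
import torch
import torch.nn as nn


class ResidualInstructionFusion(nn.Module):
    """Hypothetical residual fusion pathway (illustrative sketch).

    Assumes the top-level instruction tokens are summarized by mean pooling and
    re-injected into the hidden states as a gated residual, so the instruction
    keeps anchoring generation even when the data segment is long or adversarial.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        # Small initial gate so the fused anchor starts as a gentle correction.
        self.gate = nn.Parameter(torch.tensor(0.1))

    def forward(self, hidden: torch.Tensor, instr_mask: torch.Tensor) -> torch.Tensor:
        # hidden:     [batch, seq_len, hidden_size] hidden states
        # instr_mask: [batch, seq_len], 1.0 for top-level instruction tokens
        mask = instr_mask.unsqueeze(-1).to(hidden.dtype)
        # Mean-pool the instruction tokens into one anchor vector per example.
        anchor = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        # Broadcast the anchor back across the sequence as a gated residual.
        return hidden + self.gate * self.proj(anchor).unsqueeze(1)


if __name__ == "__main__":
    # Toy shape check only; the numbers are arbitrary.
    batch, seq_len, hidden_size = 2, 16, 32
    hidden = torch.randn(batch, seq_len, hidden_size)
    instr_mask = torch.zeros(batch, seq_len)
    instr_mask[:, :4] = 1.0  # pretend the first four tokens are the instruction
    fused = ResidualInstructionFusion(hidden_size)(hidden, instr_mask)
    print(fused.shape)  # torch.Size([2, 16, 32])
```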

💡 Why This Paper Matters

This paper addresses a significant vulnerability in LLMs, prompt injection attacks, by presenting a training-time defense that improves security while preserving the utility of the language model. The proposed mechanisms offer a practical framework for hardening LLMs, which is particularly relevant as these models are increasingly integrated into real-world applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of interest because it tackles a contemporary and critical class of adversarial attacks on language models. The mechanisms behind DRIP offer insight into defenses that remain effective against adaptive prompt injection, and understanding and mitigating such vulnerabilities is essential as LLM-based systems are deployed in sensitive domains.

📚 Read the Full Paper