
SD-RAG: A Prompt-Injection-Resilient Framework for Selective Disclosure in Retrieval-Augmented Generation

Authors: Aiman Al Masoud, Marco Arazzi, Antonino Nocera

Published: 2026-01-16

arXiv ID: 2601.11199v1

Added to Library: 2026-01-19 03:00 UTC

📄 Abstract

Retrieval-Augmented Generation (RAG) has attracted significant attention due to its ability to combine the generative capabilities of Large Language Models (LLMs) with knowledge obtained through efficient retrieval mechanisms over large-scale data collections. However, most existing approaches overlook the risks associated with exposing sensitive or access-controlled information directly to the generation model. Only a few approaches propose techniques to instruct the generative model to refrain from disclosing sensitive information; however, recent studies have demonstrated that LLMs remain vulnerable to prompt injection attacks that can override intended behavioral constraints. For these reasons, we propose a novel approach to Selective Disclosure in Retrieval-Augmented Generation, called SD-RAG, which decouples the enforcement of security and privacy constraints from the generation process itself. Rather than relying on prompt-level safeguards, SD-RAG applies sanitization and disclosure controls during the retrieval phase, prior to augmenting the language model's input. Moreover, we introduce a semantic mechanism that allows the ingestion of human-readable, dynamic security and privacy constraints, together with an optimized graph-based data model that supports fine-grained, policy-aware retrieval. Our experimental evaluation demonstrates the superiority of SD-RAG over existing baseline approaches, achieving up to a 58% improvement in the privacy score while also showing strong resilience to prompt injection attacks targeting the generative model.
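
The core idea is easiest to see in code. Below is a minimal sketch, not the authors' implementation, of enforcing disclosure policies at retrieval time: retrieved chunks are filtered against a policy before they are concatenated into the LLM prompt, so a prompt injection at generation time cannot leak text the model never received. The `Chunk` and `Policy` classes and the label names are illustrative assumptions, not SD-RAG's actual graph-based data model.

```python
# Minimal sketch of retrieval-time selective disclosure (illustrative only;
# SD-RAG's graph-based, policy-aware retrieval is more elaborate).
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    labels: set = field(default_factory=set)  # sensitivity labels, e.g. {"pii"}


@dataclass
class Policy:
    denied_labels: set  # labels the requesting user must not see


def sanitize_retrieval(chunks, policy):
    """Drop retrieved chunks that violate the disclosure policy before they
    reach the prompt; the model never receives restricted text, so prompt
    injection at generation time has nothing to exfiltrate."""
    return [c for c in chunks if not (c.labels & policy.denied_labels)]


# Example: a query retrieves three chunks; the PII chunk is removed before
# prompt construction.
retrieved = [
    Chunk("Q3 revenue grew 12% year over year.", {"finance"}),
    Chunk("Employee J. Doe, national ID 123-45-6789.", {"pii"}),
    Chunk("The product FAQ covers installation steps.", set()),
]
safe = sanitize_retrieval(retrieved, Policy(denied_labels={"pii"}))
print("\n".join(c.text for c in safe))
```

The point of the pattern is architectural: whatever form the policy takes, enforcement happens outside the model, so injected instructions in the user query or in retrieved documents cannot widen what is disclosed.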

🔍 Key Points

  • SD-RAG decouples the enforcement of security and privacy constraints from the generation process, so selective disclosure does not depend on prompt-level safeguards that prompt injection can override.
  • Sanitization and disclosure controls are applied during the retrieval phase, before any retrieved content is used to augment the language model's input.
  • A semantic mechanism ingests human-readable, dynamic security and privacy constraints and turns them into enforceable disclosure rules (a toy sketch of this idea follows this list).
  • An optimized graph-based data model supports fine-grained, policy-aware retrieval over the knowledge collection.
  • Experiments report up to a 58% improvement in privacy score over existing baselines, together with strong resilience to prompt injection attacks targeting the generative model.
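
To illustrate the constraint-ingestion point above, here is a toy sketch, again an assumption rather than SD-RAG's published mechanism, of mapping a human-readable policy sentence onto the machine-enforceable sensitivity labels used by the retrieval filter shown earlier. The word-overlap similarity is a stand-in for a real semantic (embedding-based) comparison; the label catalogue and threshold are illustrative.

```python
# Toy sketch: translate a human-readable policy into denied sensitivity
# labels that the retrieval filter (see sanitize_retrieval above) enforces.
# The word-overlap similarity stands in for an embedding-based semantic match.

LABEL_CATALOGUE = {
    "salary": "employee salary and compensation records",
    "pii": "personally identifiable data such as names and national IDs",
    "finance": "internal financial statements and forecasts",
}


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap; a real system would use sentence embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def policy_to_denied_labels(policy_text: str, threshold: float = 0.15) -> set:
    """Return the sensitivity labels whose description matches the policy."""
    return {
        label
        for label, desc in LABEL_CATALOGUE.items()
        if similarity(policy_text, desc) >= threshold
    }


# "Do not disclose employee salary details to contractors" -> {"salary"}
print(policy_to_denied_labels("Do not disclose employee salary details to contractors"))
```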

💡 Why This Paper Matters

Most RAG pipelines hand retrieved documents, including sensitive or access-controlled material, directly to the generation model and rely at best on prompt-level instructions to withhold it, instructions that prompt injection attacks can override. By enforcing sanitization and disclosure policies at retrieval time, before the model ever sees restricted content, SD-RAG takes the generative model out of the trust boundary for disclosure decisions and makes RAG deployments over private or regulated data considerably safer.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of high interest to AI security researchers because prompt injection is one of the most pressing threats to LLM-integrated systems, and it demonstrates a concrete defense pattern: treat the generative model as untrusted for disclosure decisions and enforce access control in the retrieval layer instead. The combination of human-readable dynamic policies, a graph-based policy-aware data model, and an evaluation that includes injection attacks on the generator makes it directly relevant to anyone designing secure RAG architectures and access-controlled LLM applications.

📚 Read the Full Paper