
Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

Authors: Aayush Gupta

Published: 2025-08-12

arXiv ID: 2508.09288v1

Added to Library: 2025-08-14 23:14 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch -- no fine-tuning required -- we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
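
The central mechanism described in the abstract, a pre-softmax hard attention mask keyed to per-token trust levels, can be illustrated with a short sketch. The PyTorch-style code below is an assumption-laden illustration of that idea, not the paper's reference implementation: the trust-level encoding, function names, and tensor shapes are ours. The rule it encodes is the one stated in the abstract: a query token may only attend to key tokens of equal or higher trust, so lower-trust tokens cannot influence higher-trust representations.

```python
# Illustrative sketch only: a pre-softmax hard attention mask derived from
# per-token trust levels, in the spirit of CIV's non-interference rule.
# Trust encoding and shapes are assumptions, not the paper's implementation.
import torch

def trust_attention_mask(trust_levels: torch.Tensor) -> torch.Tensor:
    """Build an additive attention mask from integer trust levels.

    trust_levels: (seq_len,) tensor, higher value = more trusted
                  (e.g. 3 = system, 2 = user, 1 = tool output, 0 = web content).
    Returns: (seq_len, seq_len) additive mask where entry (i, j) is 0 if query
             token i may attend to key token j, and -inf otherwise.
    """
    q_trust = trust_levels.unsqueeze(1)   # (seq_len, 1): trust of each query token
    k_trust = trust_levels.unsqueeze(0)   # (1, seq_len): trust of each key token
    # A query may only attend to keys of equal or higher trust.
    allowed = k_trust >= q_trust
    mask = torch.zeros_like(allowed, dtype=torch.float32)
    mask[~allowed] = float("-inf")
    return mask

def masked_attention(q, k, v, trust_levels):
    """Scaled dot-product attention with the trust mask added before softmax."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores + trust_attention_mask(trust_levels)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```

Because the mask is applied before the softmax, disallowed positions receive exactly zero attention weight rather than a small heuristic penalty, which is what makes the guarantee deterministic on a frozen model.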

🔍 Key Points

  • Introduction of Contextual Integrity Verification (CIV), a novel deterministic security architecture for Large Language Models (LLMs) that counters prompt-injection attacks through cryptographic tagging and trust-based restrictions.
  • CIV attains a 0% attack success rate on prompt-injection benchmarks (Elite-Attack + SoK-246) while preserving 93.1% token-level output similarity and showing no degradation in perplexity on benign tasks.
  • CIV operates as a lightweight patch requiring no fine-tuning of underlying models, allowing for rapid adoption in existing LLM systems like Llama-3-8B and Mistral-7B.
  • CIV attaches a cryptographically signed provenance label to each token and enforces a source-trust lattice inside the transformer, so that information from lower-trust sources provably cannot influence higher-trust representations (see the sketch after this list).
  • The research includes a theoretical framework with formal proofs of security claims, as well as a reference implementation and an automated certification harness for reproducibility.
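
To complement the key points above, here is a minimal, hedged sketch of how per-token provenance labels might be cryptographically signed and verified. The tag format, channel names, and trust mapping below are illustrative assumptions, not the paper's actual scheme; the only idea taken from the paper is that each token span carries a signed source label whose trust level gates its influence.

```python
# Illustrative sketch only: HMAC-signed provenance labels for token spans.
# Key handling, channel names, and the trust mapping are assumptions made
# for exposition, not the paper's reference implementation.
import hmac
import hashlib
from dataclasses import dataclass

# Hypothetical trust lattice: higher value = more trusted.
TRUST_LEVELS = {"system": 3, "user": 2, "tool": 1, "web": 0}

@dataclass(frozen=True)
class ProvenanceLabel:
    token_ids: tuple   # token span this label covers
    source: str        # e.g. "system", "user", "tool", "web"
    tag: bytes         # HMAC over (token_ids, source)

def sign_span(key: bytes, token_ids: tuple, source: str) -> ProvenanceLabel:
    """Sign a token span with its source channel."""
    msg = repr((token_ids, source)).encode()
    tag = hmac.new(key, msg, hashlib.sha256).digest()
    return ProvenanceLabel(token_ids, source, tag)

def verify_span(key: bytes, label: ProvenanceLabel) -> int:
    """Return the span's trust level if the label verifies; raise otherwise."""
    msg = repr((label.token_ids, label.source)).encode()
    expected = hmac.new(key, msg, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, label.tag):
        raise ValueError("provenance label failed verification")
    return TRUST_LEVELS[label.source]
```

In a deployment following this sketch, verified trust levels would feed the attention mask shown after the abstract; one reasonable policy (again, an assumption) is to assign any span that fails verification the lowest trust level by default.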

💡 Why This Paper Matters

The paper presents a significant advance in securing Large Language Models against prompt-injection attacks through the introduction of Contextual Integrity Verification (CIV). By enforcing cryptographically verified trust boundaries and deterministically preventing lower-trust inputs from influencing higher-trust outputs, the work both hardens existing LLM deployments and raises the bar for provably secure AI systems. The findings underline the importance of re-evaluating heuristic security mechanisms as these models become increasingly integrated into critical real-world applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is of direct interest to AI security researchers because it addresses one of the most pressing vulnerabilities in current AI systems: prompt-injection attacks. A formal, provable security architecture like CIV provides a new lens for assessing and mitigating risks from adversarial inputs. Because the system maintains its integrity guarantees without retraining or model alterations, it also opens practical avenues for deploying secure AI solutions across domains. The theoretical foundation and empirical evidence further deepen our understanding of information-flow control inside neural networks, which is crucial for future research in AI safety and security.

📚 Read the Full Paper