Prompt Injection as Role Confusion

Authors: Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

Published: 2026-02-22

arXiv ID: 2603.12277v2

Added to Library: 2026-03-23 02:02 UTC

Red Teaming

📄 Abstract

Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.

🔍 Key Points

The authors identify 'role confusion' as a structural weakness in language models that allows prompt injection attacks to succeed, meaning that models misinterpret untrusted text posing as trusted roles.
They introduce 'role probes,' a novel method for measuring how models internally identify and authenticate roles based on style rather than textual tags.
The paper presents a new prompt injection attack method called CoT Forgery, achieving significant attack success rates (60%+) across various models by injecting fabricated reasoning into user prompts.
Experimental findings reveal that the internal representation of roles in models can be hijacked through stylistic mimicry, leading to a predictable increase in attack success rates based on the measured role confusion.
The authors frame prompt injection as a form of 'state poisoning,' suggesting that understanding role perception can bridge the gap between model defense mechanisms and their operational vulnerabilities.

💡 Why This Paper Matters

This paper is highly relevant as it sheds light on the foundational mechanisms underlying prompt injection attacks on language models. By revealing the critical role of style versus tags in how these models interpret user commands, it provides essential insights that can drive the development of more robust AI safety measures.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper invaluable because it addresses a significant vulnerability—prompt injection—by meticulously connecting it to a fundamental design flaw in language models. The introduction of role probes and characterizations of role confusion offer new avenues for understanding and enhancing AI safety, making the findings critical for advancing secure AI systems.

Prompt Injection as Role Confusion

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper