← Back to Library

Prompt Injection as Role Confusion

Authors: Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

Published: 2026-02-22

arXiv ID: 2603.12277v1

Added to Library: 2026-03-16 02:01 UTC

Red Teaming

📄 Abstract

Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.

🔍 Key Points

  • The paper identifies and analyzes 'role confusion' in language models, where models infer roles based on style rather than actual source, leading to vulnerabilities in prompt injections.
  • Novel role probes were designed to measure how models internally identify roles, demonstrating that stylistic cues predominately influence role recognition and authority assignment in latent space.
  • The CoT Forgery attack method was introduced, showing high effectiveness in injecting spoofed reasoning into user prompts, which models mistakenly treat as legitimate reasoning. Attack success rates were recorded, showing averages of 60% for StrongREJECT and 61% for agent exfiltration across various models.
  • Findings suggest that architectural boundaries defining privilege are ineffective, as they collapse in latent space under role confusion, indicating a significant gap in the security of language models.
  • The work provides a mechanistic understanding of prompt injection security vulnerabilities, framing them as state poisoning, and offers implications for future model design and defense strategies.

💡 Why This Paper Matters

This paper is critical as it uncovers fundamental flaws in how language models perceive and assign authority, pointing to significant vulnerabilities in current security mechanisms against prompt injection attacks. By demonstrating that style, position, and explicit declarations drive role perception, the research highlights the need for robust defenses that genuinely underpin model behavior beyond surface-level heuristics.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of great interest because it delves into the underlying mechanics of prompt injection vulnerabilities in language models. It not only critiques existing defenses but also proposes novel methodologies to measure and understand role perception, providing a new lens through which to design secure AI systems. Moreover, the insights gained from the role confusion framework can inform the development of more resilient architectures to mitigate such vulnerabilities.

📚 Read the Full Paper