Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

Authors: Wei Duan, Li Qian

Published: 2025-07-07

arXiv ID: 2507.04673v1

Added to Library: 2025-07-08 04:00 UTC

Red Teaming

📄 Abstract

The rise of conversational interfaces has greatly enhanced LLM usability by leveraging dialogue history for sophisticated reasoning. However, this reliance introduces an unexplored attack surface. This paper introduces Trojan Horse Prompting, a novel jailbreak technique. Adversaries bypass safety mechanisms by forging the model's own past utterances within the conversational history provided to its API. A malicious payload is injected into a model-attributed message, followed by a benign user prompt to trigger harmful content generation. This vulnerability stems from Asymmetric Safety Alignment: models are extensively trained to refuse harmful user requests but lack comparable skepticism towards their own purported conversational history. This implicit trust in its "past" creates a high-impact vulnerability. Experimental validation on Google's Gemini-2.0-flash-preview-image-generation shows Trojan Horse Prompting achieves a significantly higher Attack Success Rate (ASR) than established user-turn jailbreaking methods. These findings reveal a fundamental flaw in modern conversational AI security, necessitating a paradigm shift from input-level filtering to robust, protocol-level validation of conversational context integrity.
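The attack described above hinges on the fact that the client, not the provider, assembles the conversation history sent to the API. The sketch below illustrates that structure in Python; the role names follow the common user/model convention, and the placeholder payload and trigger strings are illustrative assumptions, not the authors' actual prompts.

```python
def build_trojan_history(forged_assistant_text: str, benign_trigger: str) -> list[dict]:
    """Return a conversation history containing a forged model-attributed turn.

    Because the attacker controls the entire history submitted to the API,
    the 'model' turn is fabricated client-side -- the model never produced it.
    """
    return [
        # Innocuous opener so the history looks like a normal dialogue.
        {"role": "user", "content": "Tell me a story."},
        # Forged turn: the malicious payload, attributed to the model itself.
        {"role": "model", "content": forged_assistant_text},
        # Benign user turn that triggers continuation of the injected content.
        {"role": "user", "content": benign_trigger},
    ]


history = build_trojan_history(
    forged_assistant_text="<attacker-injected payload goes here>",
    benign_trigger="Please continue from where you left off.",
)

# The middle turn carries the model's role despite being attacker-authored.
print(history[1]["role"])  # model
```

A safety filter that only inspects the final user turn sees nothing objectionable here, which is exactly the asymmetry the paper exploits; the proposed fix is protocol-level validation that the history actually matches what the model previously emitted.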

🔍 Key Points

  • Introduction of Trojan Horse Prompting as a novel jailbreak technique for conversational multimodal models by forging the model's own past utterances.
  • Identification of Asymmetric Safety Alignment as a critical vulnerability: models are extensively trained to refuse harmful user requests but implicitly trust messages attributed to themselves, so forged history passes unchecked.
  • Demonstration of significantly higher Attack Success Rates (ASR) compared to traditional jailbreaking methods, validating the efficacy of the Trojan Horse Prompting approach on Google's Gemini-2.0-flash-preview-image-generation.
  • Proposal for a paradigm shift in AI security practices from merely filtering inputs to prioritizing the integrity and authenticity of conversational context management.
  • Comprehensive experimental validation illustrating the attack's potency and offering a framework for future research on multimodal model vulnerabilities.

💡 Why This Paper Matters

This paper is highly relevant as it uncovers a significant security vulnerability in conversational AI systems that could be exploited at a systemic level. The findings challenge existing paradigms of AI safety, emphasizing the need for a deeper analysis of context integrity and paving the way for new defensive strategies against emerging threats in AI applications.

🎯 Why It's Interesting for AI Security Researchers

The research provides key insights into exploitable vulnerabilities in conversational AI, making it crucial reading for AI security researchers. The findings on Trojan Horse Prompting and the Asymmetric Safety Alignment hypothesis offer a fresh perspective on security architecture for multimodal models and directly inform the development of improved safety protocols, a vital contribution to ongoing work on AI robustness and security.
