Analysis of LLMs Against Prompt Injection and Jailbreak Attacks

Authors: Piyush Jaiswal, Aaditya Pratap, Shreyansh Saraswati, Harsh Kasyap, Somanath Tripathy

Published: 2026-02-24

arXiv ID: 2602.22242v1

Added to Library: 2026-02-27 03:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are widely deployed in real-world systems. Given this broad applicability, prompt engineering has become an efficient way for resource-scarce organizations to adopt LLMs for their own purposes. At the same time, LLMs are vulnerable to prompt-based attacks, so analyzing this risk has become a critical security requirement. This work evaluates prompt-injection and jailbreak vulnerabilities using a large, manually curated dataset across multiple open-source LLMs, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma variants. We observe significant behavioural variation across models, including refusal responses and silent non-responsiveness triggered by internal safety mechanisms. Furthermore, we evaluated several lightweight, inference-time defence mechanisms that operate as filters, requiring no retraining or GPU-intensive fine-tuning. Although these defences mitigate straightforward attacks, they are consistently bypassed by long, reasoning-heavy prompts.

🔍 Key Points

  • Comprehensive empirical evaluation of open-source LLMs (Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, Gemma) against prompt injection and jailbreak attacks, revealing significant behavioral variations and vulnerabilities.
  • Development and testing of lightweight inference-time defence mechanisms (e.g., self-examination, prompt risk classification filters, policy-based guardrails) that operate without retraining, showcasing their effectiveness and limitations against sophisticated adversarial prompts.
  • Identification of critical failure modes, including silent non-responsiveness and partial compliance, which highlight the nuanced behaviors and risks in LLMs beyond simple jailbreak success rates.
  • Public release of a curated dataset of adversarial prompts to facilitate reproducible research and promote further investigation into LLM security vulnerabilities and defenses.
  • Findings suggest that reliance solely on external filtering mechanisms is inadequate, emphasizing the need for integrated safety reasoning within LLM architectures.
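The inference-time defences listed above act as filters in front of the model rather than changes to its weights. As a minimal sketch of that idea, the following shows what a prompt risk classification filter wrapped around a generation call could look like. The pattern list, function names, and refusal message are illustrative assumptions, not the paper's actual implementation; its limitation against long, reasoning-heavy prompts is exactly what the authors report.

```python
import re

# Hypothetical heuristic patterns for common injection phrasings.
# The paper's actual classifier is not specified here.
INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"disregard (your|the) (system prompt|guidelines)",
    r"pretend (you have|there are) no (rules|restrictions)",
]

def classify_prompt_risk(prompt: str) -> str:
    """Return 'block' if the prompt matches a known injection pattern, else 'allow'."""
    lowered = prompt.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return "block"
    return "allow"

def guarded_generate(prompt: str, generate) -> str:
    """Run the filter before the LLM call; refuse without invoking the model."""
    if classify_prompt_risk(prompt) == "block":
        return "Request refused by input filter."
    return generate(prompt)
```

A filter like this catches straightforward attacks but, as the paper observes for such external mechanisms, it has no view into the model's reasoning, so adversarial intent buried in a long, indirect prompt passes straight through.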

💡 Why This Paper Matters

This paper addresses the critical aspect of security in large language models, particularly in the context of prompt injection and jailbreak vulnerabilities. By evaluating the strengths and weaknesses of current open-source models and proposing practical defense mechanisms, it contributes to a better understanding of LLM safety in real-world deployments. The insights gathered from this study will be invaluable for developers, researchers, and organizations looking to incorporate robust safety measures into LLM applications.

🎯 Why It's Interesting for AI Security Researchers

The vulnerability of LLMs to adversarial attacks poses significant risks in applications handling sensitive information or critical tasks. These findings are valuable for AI security researchers: the paper not only provides empirical data on the vulnerabilities of various models but also evaluates multiple defense strategies, strengthening the knowledge base needed to secure LLMs. Additionally, the public dataset of adversarial prompts can serve as a foundation for future research in the domain.

📚 Read the Full Paper