
AttackEval: A Systematic Empirical Study of Prompt Injection Attack Effectiveness Against Large Language Models

Authors: Jackson Wang

Published: 2026-04-04

arXiv ID: 2604.03598v1

Added to Library: 2026-04-07 02:02 UTC

Red Teaming

📄 Abstract

Prompt injection has emerged as a critical vulnerability in large language model (LLM) deployments, yet existing research is heavily weighted toward defenses. The attack side -- specifically, which injection strategies are most effective and why -- remains insufficiently studied. We address this gap with AttackEval, a systematic empirical study of prompt injection attack effectiveness. We construct a taxonomy of ten attack categories organized into three parent groups (Syntactic, Contextual, and Semantic/Social), populate each category with 25 carefully crafted prompts (250 total), and evaluate them against a simulated production victim system under four progressively stronger defense tiers. Experiments reveal several non-obvious findings: (1) Obfuscation (OBF) achieves the highest single-attack success rate (ASR = 0.76) even against intent-aware defenses, because it defeats both keyword matching and semantic similarity checks simultaneously; (2) Semantic/Social attacks -- Emotional Manipulation (EM) and Reward Framing (RF) -- maintain high ASR (0.44-0.48) against intent-aware defenses due to their natural language surface, which evades structural anomaly detection; (3) Composite attacks combining two complementary strategies dramatically boost ASR, with the OBF + EM pair reaching 97.6%; (4) Stealth correlates positively with residual ASR against semantic defenses (r = 0.71), implying that future defenses must jointly optimize for both structural and behavioral signals. Our findings identify concrete blind spots in current defenses and provide actionable guidance for designing more robust LLM safety systems.
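
The evaluation protocol described in the abstract reduces to a simple grid: ten attack categories with 25 prompts each, every prompt run under four defense tiers, with attack success rate (ASR) as the per-cell metric. The sketch below illustrates that loop under stated assumptions; the function names, tier labels, and prompt layout are placeholders for illustration, not the paper's actual harness.

```python
# Minimal sketch of the AttackEval-style evaluation loop: 250 prompts
# (10 categories x 25 prompts) run against 4 defense tiers, scoring
# attack success rate (ASR) per (category, tier) cell.
# All names below (run_attack, tier identifiers) are illustrative assumptions.
from collections import defaultdict

DEFENSE_TIERS = ["none", "keyword_filter", "semantic_filter", "intent_aware"]

def run_attack(prompt: str, tier: str) -> bool:
    """Return True if the injected instruction was executed by the victim.

    Placeholder: a real harness would send the prompt to the victim LLM
    behind the given defense tier and check whether the injection payload
    appears in (or drives) the model's output.
    """
    raise NotImplementedError

def compute_asr(prompts_by_category: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """ASR = successful injections / total attempts, per category and tier."""
    asr: dict[str, dict[str, float]] = defaultdict(dict)
    for category, prompts in prompts_by_category.items():
        for tier in DEFENSE_TIERS:
            successes = sum(run_attack(p, tier) for p in prompts)
            asr[category][tier] = successes / len(prompts)
    return dict(asr)
```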

🔍 Key Points

  • The paper introduces AttackEval, a structured empirical study of prompt injection attacks on large language models, contributing to a deeper understanding of attack methodologies and effectiveness.
  • A taxonomy of ten attack categories is proposed, categorized into Syntactic, Contextual, and Semantic/Social attacks, enhancing the classification and analysis of such vulnerabilities.
  • Experiments show that Obfuscation (OBF) is the most effective single attack, achieving the highest success rate even against intent-aware defenses, and that composite attacks combining complementary strategies (such as OBF plus Emotional Manipulation) raise success rates further.
  • The findings highlight critical blind spots in existing defenses, particularly against behavioral and semantic attacks, which can manipulate LLM outputs without typical detection mechanisms.
  • The authors recommend layered defense architectures that combine intent reasoning, obfuscation awareness, and alignment-exploitation detection to improve LLM safety (a sketch of such a stack follows this list).
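
To make the last point concrete, here is a hypothetical sketch of a layered screening stack: each incoming prompt passes through stacked checks and is blocked on the first positive signal. The layer functions, the banned-phrase list, and the Verdict type are assumptions made for illustration only, not the paper's proposed defense.

```python
# Hypothetical layered-defense sketch: keyword filter -> semantic check ->
# intent check. The specific checks and phrases below are illustrative
# stand-ins; a production system would use real embedding and LLM probes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    blocked: bool
    reason: str = ""

def keyword_filter(prompt: str) -> Verdict:
    banned = ("ignore previous instructions", "disregard the system prompt")
    hit = any(phrase in prompt.lower() for phrase in banned)
    return Verdict(hit, "keyword match" if hit else "")

def semantic_check(prompt: str) -> Verdict:
    # Placeholder for an embedding-similarity comparison against known
    # injection goals; always passes in this sketch.
    return Verdict(False)

def intent_check(prompt: str) -> Verdict:
    # Placeholder for an LLM-based probe ("does this prompt try to override
    # the system task?"); always passes in this sketch.
    return Verdict(False)

LAYERS: list[Callable[[str], Verdict]] = [keyword_filter, semantic_check, intent_check]

def screen(prompt: str) -> Verdict:
    """Run the prompt through every layer; block on the first positive signal."""
    for layer in LAYERS:
        verdict = layer(prompt)
        if verdict.blocked:
            return verdict
    return Verdict(False, "passed all layers")
```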

💡 Why This Paper Matters

This paper presents significant contributions to the understanding of prompt injection attacks, outlining effective attack strategies and identifying weaknesses in current defense mechanisms. By systematically evaluating the effectiveness of various attack strategies, particularly obfuscation and emotional manipulation, it lays the groundwork for designing more robust and comprehensive defenses against these emerging threats in large language model deployments.

🎯 Why It's Interesting for AI Security Researchers

The paper is vital for AI security researchers as it fills a crucial gap in the literature regarding prompt injection attacks, which have become a significant vulnerability in LLM systems. Understanding these attack methodologies allows researchers to design more effective defenses and contributes to the broader goal of enhancing the security and reliability of AI systems in real-world applications.

📚 Read the Full Paper