
Bypassing Prompt Injection Detectors through Evasive Injections

Authors: Md Jahedur Rahman, Ihsen Alouani

Published: 2026-01-31

arXiv ID: 2602.00750v1

Added to Library: 2026-02-03 08:04 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) are increasingly used in interactive and retrieval-augmented systems, but they remain vulnerable to task drift: deviations from a user's intended instruction due to injected secondary prompts. Recent work has shown that linear probes trained on activation deltas of LLMs' hidden layers can effectively detect such drift. In this paper, we evaluate the robustness of these detectors against adversarially optimised suffixes. We generate universal suffixes that cause poisoned inputs to evade detection across multiple probes simultaneously. Our experiments on Phi-3 3.8B and Llama-3 8B show that a single suffix can achieve high attack success rates, up to 93.91% and 99.63%, respectively, when all probes must be fooled, and nearly perfect success (>90%) under a majority-vote setting. These results demonstrate that activation delta-based task drift detectors are highly vulnerable to adversarial suffixes, highlighting the need for stronger defences against adaptive attacks. We also propose a defence technique in which we generate multiple suffixes, randomly append one of them to the prompts during forward passes of the LLM, and train logistic regression models on the resulting activations. We found this approach to be highly effective against such attacks.
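The detector under attack, a linear probe on activation deltas, can be sketched on toy data. Everything below (the hidden-size stand-in, the synthetic drift direction, the noise scales) is invented for illustration; the paper's probes read real hidden states of Phi-3 3.8B and Llama-3 8B:

```python
# Toy sketch of an activation-delta task-drift probe. All vectors here
# are synthetic stand-ins for the model's real hidden states.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # stand-in for the model's hidden size

def activation_delta(before, after):
    """Delta between hidden states captured before and after the model
    reads the (possibly poisoned) external content."""
    return after - before

# Simulate clean vs. poisoned examples: poisoned inputs shift the
# activations along a fixed "drift" direction plus noise.
drift = rng.normal(size=DIM)

def make_example(poisoned):
    before = rng.normal(size=DIM)
    shift = 0.8 * drift if poisoned else 0.0
    after = before + rng.normal(scale=0.3, size=DIM) + shift
    return activation_delta(before, after)

X = np.stack([make_example(p) for p in [0] * 200 + [1] * 200])
y = np.array([0] * 200 + [1] * 200)

# Train a linear probe (logistic regression via gradient descent).
w, b = np.zeros(DIM), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe training accuracy: {acc:.2f}")
```

On this synthetic data the drift direction is easy to separate linearly, which mirrors why such probes work well on benign inputs; the paper's point is that an adversarial suffix can steer the deltas back across the decision boundary.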

🔍 Key Points

  • The paper shows that task drift detectors that monitor activation deltas in large language models' hidden layers are vulnerable to adversarial suffixes that evade detection.
  • Using the Greedy Coordinate Gradient (GCG) method, the authors optimize universal suffixes that mislead the linear probe classifiers, with attack success rates of up to 99.63%.
  • The paper proposes a defense mechanism that generates multiple suffixes and incorporates them during probe training, markedly increasing the robustness of task drift detection against adaptive adversarial attacks.
  • Experimental results show significant differences in attack success rates across classifiers and layers, highlighting the need for stronger defenses in settings where multiple probes are involved.
  • The work emphasizes the importance of developing robust defenses due to the rising sophistication of attacks on large language models, contributing to the discourse on securing AI systems.
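A greatly simplified picture of the suffix search: real GCG ranks candidate token swaps by gradients through the one-hot token embeddings, whereas this toy sketch brute-forces each suffix position against a random linear probe. The embeddings, probe weights, and "drift score" below are all hypothetical stand-ins:

```python
# Toy illustration of a greedy coordinate-style suffix search that
# drives a linear probe's score down. Real GCG is gradient-guided;
# this sketch exhaustively tries each token at each position.
import numpy as np

rng = np.random.default_rng(1)
VOCAB, DIM, SUFFIX_LEN = 50, 32, 8

emb = rng.normal(size=(VOCAB, DIM))  # toy token embeddings
w = rng.normal(size=DIM)             # toy probe weights (drift direction)

def probe_score(suffix_tokens):
    """Toy 'drift score': the probe applied to the mean suffix embedding.
    The attacker wants to push this below the detection threshold."""
    return float(emb[suffix_tokens].mean(axis=0) @ w)

suffix = list(rng.integers(0, VOCAB, size=SUFFIX_LEN))
for _ in range(5):  # a few greedy sweeps over the suffix
    for pos in range(SUFFIX_LEN):
        # try every vocabulary token at this position, keep the best
        best_tok = min(
            range(VOCAB),
            key=lambda t: probe_score(suffix[:pos] + [t] + suffix[pos + 1:]),
        )
        suffix[pos] = best_tok

print("final drift score:", probe_score(suffix))
```

The universal-suffix setting in the paper additionally optimizes one suffix against many poisoned inputs and multiple probes at once, which the single-score toy above does not capture.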

💡 Why This Paper Matters

This paper is a crucial contribution to the field of AI security as it not only reveals significant vulnerabilities in current drift detection mechanisms for large language models but also proposes effective defenses against sophisticated attacks. The demonstration of how adversarial suffixes can mislead task drift detectors underlines the pressing need for more resilient models in AI applications, making this work highly relevant in today's landscape of increasing adversarial threats.
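The proposed defence, collecting activations with a randomly chosen suffix appended and then training a logistic-regression probe on them, can be sketched with the same kind of toy vectors (the `suffix_effects` array and all scales below are invented for illustration):

```python
# Sketch of the randomized-suffix defence: during training, the shift
# induced by one of several generated suffixes is mixed into each
# example, so the probe learns features robust to suffix perturbations.
# All vectors here are toy stand-ins for real hidden states.
import numpy as np

rng = np.random.default_rng(2)
DIM, N_SUFFIXES = 32, 4
drift = rng.normal(size=DIM)                                   # poisoning signal
suffix_effects = rng.normal(scale=0.5, size=(N_SUFFIXES, DIM)) # hypothetical

def training_example(poisoned):
    base = rng.normal(scale=0.3, size=DIM) + (drift if poisoned else 0)
    # defence: add the activation shift of one randomly chosen suffix
    return base + suffix_effects[rng.integers(N_SUFFIXES)]

X = np.stack([training_example(p) for p in [0] * 300 + [1] * 300])
y = np.array([0] * 300 + [1] * 300)

# logistic-regression probe trained on suffix-augmented activations
w, b = np.zeros(DIM), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(f"augmented-probe accuracy: {acc:.2f}")
```

The intuition is that exposure to suffix-induced shifts at training time makes the probe's decision boundary harder to circumvent with a single optimized suffix at test time.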

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper highly relevant as it not only advances the understanding of vulnerabilities in AI systems but also provides insight into adversarial attack methodologies and their countermeasures. The findings underscore the necessity for ongoing research in security assessments of large language models, particularly in contexts where they are deployed for sensitive or critical tasks.
