← Back to Library

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Authors: Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci

Published: 2025-11-08

arXiv ID: 2511.05919v1

Added to Library: 2025-11-14 23:04 UTC

Red Teaming

πŸ“„ Abstract

LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.

πŸ” Key Points

  • Introduction of the Ο‡mera framework as the first principled attack evaluation method on LLM factual memory under prompt injection in adversarial scenarios.
  • Demonstration of various MitM attacks categorized into Ξ±, Ξ², and Ξ³ types, showcasing how even trivial instruction-based attacks can successfully deceive LLMs with notable accuracy.
  • Empirical evidence showing high uncertainty levels in LLM responses during attacks, which can be leveraged to build a defense mechanism using machine learning classifiers to alert users of potentially manipulated responses.
  • Release of a novel factually adversarial dataset containing 3000 samples designed to benchmark and facilitate further research in adversarial vulnerabilities within LLMs.
  • High performance of Random Forest classifiers (up to ~96% AUC) in detecting attacked queries using uncertainty metrics, establishing a pathway towards user safety in LLM applications.

πŸ’‘ Why This Paper Matters

This paper is crucial as it addresses the significant vulnerability of LLMs to adversarial attacks, particularly in contexts where factual accuracy is paramount, such as in information retrieval and question-answering systems. By unveiling specific weaknesses and developing the Ο‡mera framework, the authors pave the way for future research aimed at enhancing the robustness and trustworthiness of AI systems, thus contributing to safer AI deployment in critical applications.

🎯 Why It's Interesting for AI Security Researchers

This research holds great interest for AI security researchers as it delineates a clear framework for understanding and evaluating adversarial threats in LLMs, a topic of growing concern with the increasing reliance on these models for critical tasks. The findings not only highlight existing vulnerabilities but also propose empirical methods for detection and mitigation, guiding future research and practical implementations aimed at strengthening AI security.

πŸ“š Read the Full Paper