← Back to Library

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Authors: Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci

Published: 2025-11-08

arXiv ID: 2511.05919v2

Added to Library: 2025-11-21 03:05 UTC

Red Teaming

πŸ“„ Abstract

LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.

πŸ” Key Points

  • Introduced the Ο‡mera framework, a novel methodology for assessing the susceptibility of Large Language Models (LLMs) to adversarial man-in-the-middle (MitM) attacks, providing a theoretical foundation for further research in this area.
  • Demonstrated that simple instruction-based attacks achieve significant success rates (~85.3%) in compromising the factual accuracy of LLMs, highlighting a major vulnerability in current information retrieval systems.
  • Developed and validated a defense mechanism using Random Forest classifiers based on uncertainty metrics, achieving an AUC of ~96%, signaling the potential for real-time detection of compromised responses to users.
  • Created a factually adversarial dataset containing 3000 samples, which will serve as a resource for future research on adversarial attacks and model robustness in factual question-answering tasks.
  • Identified the need for transparency and accountability in LLM deployments, especially in high-stakes applications, to prevent misinformation propagation and enhance user safety.

πŸ’‘ Why This Paper Matters

This paper is relevant and important as it tackles a critical gap in understanding the security vulnerabilities of LLMs within information retrieval systems, particularly in the context of unauthorized input manipulations. The findings and methodologies proposed, especially the Ο‡mera framework and its accompanying defense strategies, are essential for developing more robust AI systems that can operate safely in live environments.

🎯 Why It's Interesting for AI Security Researchers

This paper is of significant interest to AI security researchers as it presents novel attack methods (Ο‡mera framework) that expose the vulnerabilities of LLMs to adversarial manipulations. Moreover, the study offers a practical defense mechanism through uncertainty measurements, fostering further inquiry into safeguarding AI applications against similar threats. The establishment of a new adversarial dataset also lays the groundwork for standardized testing and evaluation of future security measures in the AI field.

πŸ“š Read the Full Paper