
Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Authors: Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari, Shantanu Todmal, Shreyan Malik, Shuchi Mishra

Published: 2026-01-09

arXiv ID: 2601.05504v1

Added to Library: 2026-01-12 03:03 UTC

Safety

📄 Abstract

Large language model agents equipped with persistent memory are vulnerable to memory poisoning attacks, in which adversaries inject malicious instructions through query-only interactions that corrupt the agent's long-term memory and influence future responses. Recent work demonstrated that MINJA (Memory Injection Attack) achieves over a 95% injection success rate and a 70% attack success rate under idealized conditions. However, the robustness of these attacks in realistic deployments and effective defensive mechanisms remain understudied. This work addresses these gaps through a systematic empirical evaluation of memory poisoning attacks and defenses in Electronic Health Record (EHR) agents. We investigate attack robustness by varying three critical dimensions: initial memory state, number of indication prompts, and retrieval parameters. Our experiments on GPT-4o-mini, Gemini-2.0-Flash, and Llama-3.1-8B-Instruct using MIMIC-III clinical data reveal that realistic conditions with pre-existing legitimate memories dramatically reduce attack effectiveness. We then propose and evaluate two novel defense mechanisms: (1) Input/Output Moderation using composite trust scoring across multiple orthogonal signals, and (2) Memory Sanitization with trust-aware retrieval employing temporal decay and pattern-based filtering. Our defense evaluation reveals that effective memory sanitization requires careful trust-threshold calibration to avoid both overly conservative rejection (blocking all entries) and insufficient filtering (missing subtle attacks), establishing important baselines for future adaptive defense mechanisms. These findings provide crucial insights for securing memory-augmented LLM agents in production environments.
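
The abstract describes the Memory Sanitization defense only at a high level. As a rough illustration of how trust-aware retrieval with temporal decay and pattern-based filtering could be wired together, here is a minimal Python sketch; the data structure, half-life, suspicious-pattern list, and thresholds are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of trust-aware memory retrieval with temporal decay and
# pattern-based filtering. All names, thresholds, and patterns are
# illustrative assumptions, not the paper's implementation.
import math
import re
import time
from dataclasses import dataclass, field

# Assumed examples of injection-style phrasing to penalize.
SUSPICIOUS_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"always (recommend|answer|respond)",
]

@dataclass
class MemoryEntry:
    text: str
    relevance: float                                   # similarity to the current query, in [0, 1]
    created_at: float = field(default_factory=time.time)
    base_trust: float = 1.0                            # e.g. 1.0 for curated seeds, lower for user-derived entries

def trust_score(entry: MemoryEntry, half_life_s: float = 7 * 24 * 3600) -> float:
    """Combine base trust, temporal decay, and pattern-based penalties."""
    age = time.time() - entry.created_at
    decay = math.exp(-math.log(2) * age / half_life_s)  # trust halves every `half_life_s` seconds
    penalty = 1.0
    for pat in SUSPICIOUS_PATTERNS:
        if re.search(pat, entry.text, flags=re.IGNORECASE):
            penalty *= 0.2                               # heavily down-weight suspicious text
    return entry.base_trust * decay * penalty

def retrieve(memory: list[MemoryEntry], k: int = 3, trust_threshold: float = 0.3) -> list[MemoryEntry]:
    """Return the top-k entries by relevance * trust, dropping entries below the trust threshold."""
    scored = [(trust_score(e), e) for e in memory]
    kept = [(t * e.relevance, e) for t, e in scored if t >= trust_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in kept[:k]]
```

The `trust_threshold` parameter embodies the calibration tension the abstract highlights: set too high, it rejects every entry; set too low, subtly poisoned entries slip through into retrieval.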

🔍 Key Points

  • Demonstrates the high susceptibility of memory-based LLM agents to memory poisoning attacks, particularly in healthcare applications where safety is critical.
  • Empirical evaluation shows that realistic conditions significantly diminish the effectiveness of memory injection attacks, highlighting the importance of pre-existing benign memories in mitigating risks.
  • Introduces two novel defense mechanisms: Input/Output Moderation and Memory Sanitization, which incorporate trust scoring to filter malicious entries while maintaining high utility for benign queries (see the sketch after this list).
  • Establishes essential baselines for future research on adaptive defense mechanisms in LLMs, focusing on trust calibration and memory integrity.
  • Provides experimental results revealing a double-edged sword: sanitization can prevent memory poisoning, but malicious entries that slip past the filters may then be treated with undue confidence.
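
As referenced above, the following is a minimal sketch of what composite trust scoring across orthogonal signals might look like for Input/Output Moderation. The specific signals, weights, and threshold are assumptions chosen for illustration, not values taken from the paper.

```python
# Minimal sketch of composite trust scoring over orthogonal signals for
# input/output moderation. Signal names, weights, and the pass threshold
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ModerationSignals:
    pattern_risk: float      # 0..1, from keyword/regex checks on the text
    semantic_anomaly: float  # 0..1, e.g. distance from typical clinical queries
    source_risk: float       # 0..1, higher for unauthenticated or unfamiliar sources

WEIGHTS = {"pattern_risk": 0.4, "semantic_anomaly": 0.4, "source_risk": 0.2}

def composite_trust(signals: ModerationSignals) -> float:
    """Map a weighted combination of independent risk signals to a trust score in [0, 1]."""
    risk = (WEIGHTS["pattern_risk"] * signals.pattern_risk
            + WEIGHTS["semantic_anomaly"] * signals.semantic_anomaly
            + WEIGHTS["source_risk"] * signals.source_risk)
    return 1.0 - risk

def moderate(signals: ModerationSignals, threshold: float = 0.6) -> bool:
    """Return True to pass the input/output through, False to block it."""
    return composite_trust(signals) >= threshold
```

Keeping the signals orthogonal (surface patterns, semantic fit, source provenance) means an attacker has to evade several independent checks at once rather than a single filter.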

💡 Why This Paper Matters

This paper is critical for advancing the understanding of, and defenses against, memory poisoning attacks on LLM agents, especially as these systems become more prevalent in sensitive domains such as healthcare. By systematically evaluating the robustness of attacks and the effectiveness of the proposed defenses, it lays the groundwork for future improvements in AI security measures that protect both the systems themselves and the integrity of the data they handle.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are particularly significant for AI security researchers because they address a burgeoning area of vulnerability in advanced AI systems. The insights gained from empirical evaluations of attack methodologies and defensive strategies contribute to the development of more resilient models capable of withstanding adversarial manipulations, which is essential for ensuring the safe deployment of AI technologies in critical applications.

📚 Read the Full Paper