← Back to Library

MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Authors: Saksham Sahai Srivastava, Haoyu He

Published: 2025-12-18

arXiv ID: 2512.16962v1

Added to Library: 2026-01-07 10:09 UTC

Red Teaming

📄 Abstract

Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic autonomy, it introduces a critical, unexplored attack surface, i.e., the trust boundary between an agent's reasoning core and its own past. In this paper, we introduce MemoryGraft. It is a novel indirect injection attack that compromises agent behavior not through immediate jailbreaks, but by implanting malicious successful experiences into the agent's long-term memory. Unlike traditional prompt injections that are transient, or standard RAG poisoning that targets factual knowledge, MemoryGraft exploits the agent's semantic imitation heuristic which is the tendency to replicate patterns from retrieved successful tasks. We demonstrate that an attacker who can supply benign ingestion-level artifacts that the agent reads during execution can induce it to construct a poisoned RAG store where a small set of malicious procedure templates is persisted alongside benign experiences. When the agent later encounters semantically similar tasks, union retrieval over lexical and embedding similarity reliably surfaces these grafted memories, and the agent adopts the embedded unsafe patterns, leading to persistent behavioral drift across sessions. We validate MemoryGraft on MetaGPT's DataInterpreter agent with GPT-4o and find that a small number of poisoned records can account for a large fraction of retrieved experiences on benign workloads, turning experience-based self-improvement into a vector for stealthy and durable compromise. To facilitate reproducibility and future research, our code and evaluation data are available at https://github.com/Jacobhhy/Agent-Memory-Poisoning.

🔍 Key Points

  • Introduction of MemoryGraft, a novel indirect injection attack that compromises long-term memory in LLM agents by introducing malicious experiences into their memory banks.
  • The attack exploits the semantic imitation heuristic, allowing agents to imitate unsafe procedures without explicit prompts or triggers, resulting in persistent behavioral drift.
  • Experimental validation on MetaGPT's DataInterpreter agent shows that a small number of poisoned records can dominate retrieval processes, influencing the agent's behavior significantly even across sessions.
  • The paper highlights the oversight in existing memory systems that fail to check provenance of experiences, thus exposing agents to long-term memory attacks.
  • Proposes potential defense mechanisms such as Cryptographic Provenance Attestation to secure memory writing and retrieval processes against similar attacks.

💡 Why This Paper Matters

The findings in this paper underscore the critical need to scrutinize memory mechanisms in LLM agents, particularly as they grow more autonomous and integrated into decision-making systems. By revealing vulnerabilities in how these agents trust their past experiences, the study calls for improved safeguards and awareness surrounding AI memory architectures.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of interest to AI security researchers as it presents a new vector for compromise in LLMs, an increasingly prevalent technology in autonomous systems. By demonstrating how memory can be manipulated undetected, it illustrates significant risks within current AI architectures. The research not only contributes to the understanding of memory-related vulnerabilities but also offers pathways for developing robust defense strategies.

📚 Read the Full Paper