Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models

Authors: Antonio Colacicco, Vito Guida, Dario Di Palma, Fedelucio Narducci, Tommaso Di Noia

Published: 2026-01-05

arXiv ID: 2601.02002v1

Added to Library: 2026-01-07 10:01 UTC

📄 Abstract

Large Language Models (LLMs) are increasingly applied in recommendation scenarios due to their strong natural language understanding and generation capabilities. However, they are trained on vast corpora whose contents are not publicly disclosed, raising concerns about data leakage. Recent work has shown that the MovieLens-1M dataset is memorized by both the LLaMA and OpenAI model families, but the extraction of such memorized data has so far relied exclusively on manual prompt engineering. In this paper, we pose three main questions: Is it possible to enhance manual prompting? Can LLM memorization be detected through methods beyond manual prompting? And can the detection of data leakage be automated? To address these questions, we evaluate three approaches: (i) jailbreak prompt engineering; (ii) unsupervised latent knowledge discovery, probing internal activations via Contrast-Consistent Search (CCS) and Cluster-Norm; and (iii) Automatic Prompt Engineering (APE), which frames prompt discovery as a meta-learning process that iteratively refines candidate instructions. Experiments on MovieLens-1M using LLaMA models show that jailbreak prompting does not improve the retrieval of memorized items and remains inconsistent; CCS reliably distinguishes genuine from fabricated movie titles but fails on numerical user and rating data; and APE retrieves item-level information with moderate success yet struggles to recover numerical interactions. These findings suggest that automatically optimizing prompts is the most promising strategy for extracting memorized samples.
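
The second approach above, Contrast-Consistent Search, fits a small probe on internal activations of paired "Yes"/"No" statements so that the two answers of each pair behave like complementary probabilities. Below is a minimal, illustrative sketch of that objective in PyTorch; the prompt template, the choice of LLaMA layer, and all names in the code are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of a Contrast-Consistent Search (CCS) probe, assuming hidden
# activations have already been extracted from some LLaMA layer for contrast
# pairs such as "<title> appears in MovieLens-1M: Yes" / "...: No".
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a pseudo-probability."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: the two answers of a pair should sum to 1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate p_pos = p_neg = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(h_pos: torch.Tensor, h_neg: torch.Tensor, epochs: int = 1000) -> CCSProbe:
    """h_pos, h_neg: (n_pairs, hidden_dim) activations for the Yes/No variants."""
    # Per-class mean normalization so the probe cannot just read the template.
    h_pos = h_pos - h_pos.mean(dim=0, keepdim=True)
    h_neg = h_neg - h_neg.mean(dim=0, keepdim=True)

    probe = CCSProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe
```

In the paper's setting, a probe of this kind reliably separates genuine from fabricated movie titles but does not transfer to numerical user and rating data.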

🔍 Key Points

  • Builds on prior evidence that the MovieLens-1M dataset is memorized by the LLaMA and OpenAI model families, and asks whether its extraction can move beyond manual prompt engineering: Can manual prompting be enhanced? Can memorization be detected without prompting at all? Can leakage detection be automated?
  • Evaluates three detection strategies: (i) jailbreak prompt engineering, (ii) unsupervised latent knowledge discovery that probes internal activations with Contrast-Consistent Search (CCS) and Cluster-Norm, and (iii) Automatic Prompt Engineering (APE), which frames prompt discovery as a meta-learning process that iteratively refines candidate instructions (a schematic of this loop is sketched after this list).
  • Experiments on MovieLens-1M with LLaMA models show that jailbreak prompting does not improve the retrieval of memorized items and remains inconsistent, CCS reliably distinguishes genuine from fabricated movie titles but fails on numerical user and rating data, and APE retrieves item-level information with moderate success while struggling to recover numerical interactions.
  • Concludes that automatically optimizing prompts is the most promising strategy for extracting memorized samples from LLMs used in recommendation scenarios.
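
The APE loop referenced above can be pictured as a propose-score-refine search over instructions. The following skeleton is only an illustrative sketch of that idea: `query_model`, `propose_variants`, and `known_items` are hypothetical placeholders, not the paper's actual components or scoring metric.

```python
# Illustrative APE-style loop: propose candidate instructions, score each by how
# many known MovieLens-1M titles it makes the target model reproduce, and
# resample around the best candidates.
from typing import Callable, List, Set

def score_prompt(prompt: str,
                 query_model: Callable[[str], str],
                 known_items: Set[str]) -> float:
    """Fraction of ground-truth titles that appear verbatim in the model output."""
    output = query_model(prompt)
    hits = sum(1 for title in known_items if title in output)
    return hits / max(len(known_items), 1)

def ape_search(seed_prompts: List[str],
               query_model: Callable[[str], str],
               propose_variants: Callable[[str], List[str]],
               known_items: Set[str],
               rounds: int = 5,
               keep: int = 3) -> str:
    """Iteratively refine candidate instructions, keeping the best-scoring ones."""
    candidates = list(seed_prompts)
    for _ in range(rounds):
        ranked = sorted(candidates,
                        key=lambda p: score_prompt(p, query_model, known_items),
                        reverse=True)
        survivors = ranked[:keep]
        # Ask an LLM (or any rewriting heuristic) for refined variants of the survivors.
        candidates = survivors + [v for p in survivors for v in propose_variants(p)]
    return max(candidates, key=lambda p: score_prompt(p, query_model, known_items))
```

According to the abstract, this style of automatic prompt optimization retrieves item-level information with moderate success but struggles to recover numerical interactions.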

💡 Why This Paper Matters

This paper moves the study of training data leakage in LLM-based recommendation beyond hand-crafted prompts. Prior work showed that MovieLens-1M is memorized by the LLaMA and OpenAI model families but relied exclusively on manual prompt engineering to extract it; here the authors systematically compare jailbreak prompting, unsupervised probing of internal activations (CCS and Cluster-Norm), and Automatic Prompt Engineering. The finding that automated prompt optimization is the most promising extraction strategy, while probes detect memorized titles but not numerical user and rating data, gives the community a concrete basis for auditing data leakage in models used for recommendation.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper valuable because it treats leakage detection as an extraction problem, combining black-box techniques (jailbreak prompting and automatically optimized prompts) with white-box probing of internal activations. The results, in particular that jailbreak prompting does not improve retrieval of memorized items and that automated prompt search outperforms manual approaches, can inform audits of training data exposure, assessments of benchmark contamination, and the design of defenses against extraction of memorized content from LLMs.

📚 Read the Full Paper