
Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Authors: Teqi Hao, Xiaoyu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu

Published: 2025-11-07

arXiv ID: 2511.05286v1

Added to Library: 2025-11-14 23:04 UTC

πŸ“„ Abstract

The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user's preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.
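
The core mechanism is simple enough to sketch. Below is a minimal, illustrative Python rendering of the decoupled two-stage flow described in the abstract; the function names, prompt format, and stub models are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch of RPO's decoupled inference flow (not the authors' code).
# `base_llm` and `reflection_module` are placeholders for any black-box LLM API
# and the trained rewriting model, respectively.
from typing import Callable, List

def rpo_respond(
    query: str,
    user_history: List[str],
    base_llm: Callable[[str], str],           # stage 1: generic, content-focused generation
    reflection_module: Callable[[str], str],  # stage 2: explicit rewriting toward the user's style
) -> str:
    # Stage 1: the base model answers the query without any personalization burden.
    generic_response = base_llm(query)

    # Stage 2: the reflection module rewrites the generic draft, conditioned on
    # the query, the user's history, and the draft it is asked to align.
    rewrite_prompt = (
        "User history:\n" + "\n".join(user_history) + "\n\n"
        f"Query: {query}\n\n"
        f"Generic response: {generic_response}\n\n"
        "Rewrite the response so it matches this user's preferences and style."
    )
    return reflection_module(rewrite_prompt)

# Toy usage with stub models, just to show the control flow.
if __name__ == "__main__":
    stub_base = lambda q: f"[generic answer to: {q}]"
    stub_reflector = lambda p: "[personalized rewrite of the generic answer]"
    print(rpo_respond(
        "Summarize today's news.",
        ["prefers bullet points", "formal tone"],
        stub_base,
        stub_reflector,
    ))
```

The design point the sketch makes explicit is that the base model never sees the user history: all personalization pressure is pushed into the second, external stage, which is what makes the layer model-agnostic.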

πŸ” Key Points

  • Introduction of Reflective Personalization Optimization (RPO), a post-hoc rewriting framework that decouples content generation from user alignment for black-box LLMs.
  • A two-stage inference pipeline: a base model first produces a high-quality generic response, and an external reflection module then explicitly rewrites that draft to match the user's preferences.
  • A two-stage training recipe for the reflection module: supervised fine-tuning on structured rewriting trajectories to learn a personalized rewriting policy, followed by reinforcement learning to further refine output quality (a sketch of such a trajectory record follows this list).
  • Comprehensive experiments on the LaMP benchmark in which RPO significantly outperforms state-of-the-art context-injection baselines, supporting explicit response shaping over implicit context injection.
  • An efficient, model-agnostic personalization layer that can be combined with any underlying base model, including fully black-box ones.
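
As a companion to the training bullet above, here is a hedged sketch of what one "structured rewriting trajectory" for supervised fine-tuning might look like; the field names (including the `rationale` field) and the prompt/target layout are assumptions, since the summary above does not specify the exact schema.

```python
# Illustrative shape of one supervised rewriting trajectory (field names are assumptions,
# not taken from the paper): the reflection module learns to map a generic draft, the
# query, and the user's history to the user-aligned response.
from dataclasses import dataclass
from typing import List

@dataclass
class RewritingTrajectory:
    query: str                  # original task input
    user_history: List[str]     # profile signals drawn from the user's past items
    generic_response: str       # stage-1 output from the base model
    rationale: str              # structured reasoning about what to change (assumed field)
    personalized_response: str  # supervision target for SFT

def to_sft_pair(t: RewritingTrajectory) -> dict:
    """Flatten a trajectory into a (prompt, target) pair for supervised fine-tuning."""
    prompt = (
        "User history:\n" + "\n".join(t.user_history) + "\n\n"
        f"Query: {t.query}\n\n"
        f"Generic response: {t.generic_response}\n\n"
        "Rewrite the response for this user."
    )
    # The subsequent RL stage described in the abstract would further optimize the same
    # model against a personalization-quality reward; that loop is omitted here.
    return {"prompt": prompt, "target": t.rationale + "\n" + t.personalized_response}
```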

πŸ’‘ Why This Paper Matters

This paper matters because it isolates a core tension in personalizing black-box LLMs: asking a single prompt-injected generation step to be both factually strong and stylistically user-aligned tends to compromise one or the other. By decoupling the two concerns, RPO shows that an explicitly trained rewriting stage can deliver personalization without touching the base model, and its results on the LaMP benchmark indicate that explicit response shaping outperforms implicit context injection. Because the reflection module is model-agnostic, the approach reads as a reusable personalization layer rather than a one-off fine-tuning recipe.

🎯 Why It's Interesting for AI Security Researchers

For researchers focused on AI security and trustworthy deployment, RPO is a useful case study in output control for black-box models: rather than steering generation implicitly through prompt context, it attaches an external, trainable module that explicitly rewrites what the model has already produced. That matters when auditing personalized systems, since the text the user finally sees is no longer the base model's output alone, and the same decoupled design suggests how policy or style constraints could be enforced as a separate post-hoc stage, independent of the underlying model.

πŸ“š Read the Full Paper