
Towards Agentic Self-Learning LLMs in Search Environment

Authors: Wangtao Sun, Xiang Cheng, Jialin Fan, Yao Xu, Xing Yu, Shizhu He, Jun Zhao, Kang Liu

Published: 2025-10-16

arXiv ID: 2510.14253v2

Added to Library: 2025-11-14 23:11 UTC

📄 Abstract

We study whether self-learning can scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards. Through controlled experiments in a search-agent setting, we identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data. We find that rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and that co-evolving the GRM with the policy further boosts performance. Increasing the volume of agent task data, even when synthetically generated, substantially enhances agentic capabilities. Building on these insights, we propose Agentic Self-Learning (ASL), a fully closed-loop, multi-role reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone. ASL coordinates a Prompt Generator, a Policy Model, and a Generative Reward Model to form a virtuous cycle of harder task setting, sharper verification, and stronger solving. Empirically, ASL delivers steady, round-over-round gains, surpasses strong RLVR baselines (e.g., Search-R1) that plateau or degrade, and continues improving under zero-labeled-data conditions, indicating superior sample efficiency and robustness. We further show that GRM verification capacity is the main bottleneck: if frozen, it induces reward hacking and stalls progress; continual GRM training on the evolving data distribution mitigates this, and a small late-stage injection of real verification data raises the performance ceiling. This work establishes reward source and data scale as critical levers for open-domain agent learning and demonstrates the efficacy of multi-role co-evolution for scalable, self-improving agents. The data and code of this paper are released at https://github.com/forangel2014/Towards-Agentic-Self-Learning
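
To make the closed loop described above concrete, the sketch below shows how the three roles could interact in one training round. This is a minimal illustration assuming hypothetical interfaces (generate_task, solve_with_search, grm_score, rl_update); it is not the authors' implementation, and the actual RL objective, tool calls, and curriculum logic are omitted.

```python
# Minimal sketch of one ASL round, under assumed placeholder interfaces:
# the Prompt Generator proposes a task, the Policy Model solves it with a
# search tool, and the Generative Reward Model (GRM) scores the rollout.
import random
from dataclasses import dataclass, field


@dataclass
class Role:
    """Stand-in for one LLM role sharing a common backbone."""
    name: str
    params: dict = field(default_factory=dict)


def generate_task(generator: Role, round_idx: int) -> str:
    # The generator is meant to propose progressively harder tasks.
    return f"synthetic search question (round {round_idx})"


def solve_with_search(policy: Role, task: str) -> str:
    # The policy interleaves reasoning with calls to a shared search tool.
    return f"trajectory for: {task}"


def grm_score(grm: Role, task: str, trajectory: str) -> float:
    # Generative verification in place of a rigid rule-based reward.
    return random.random()


def rl_update(role: Role, reward: float) -> None:
    # Placeholder for the actual RL step (e.g., a policy-gradient update).
    role.params["last_reward"] = reward


def asl_round(generator: Role, policy: Role, grm: Role, round_idx: int) -> None:
    task = generate_task(generator, round_idx)
    trajectory = solve_with_search(policy, task)
    reward = grm_score(grm, task, trajectory)
    # Co-evolution: the policy learns from GRM rewards, the generator is
    # nudged toward tasks the policy does not yet solve, and the GRM keeps
    # training on the shifting data distribution so that frozen verification
    # capacity does not become a bottleneck that invites reward hacking.
    rl_update(policy, reward)
    rl_update(generator, 1.0 - reward)
    rl_update(grm, reward)


if __name__ == "__main__":
    roles = [Role("prompt_generator"), Role("policy_model"), Role("grm")]
    for r in range(3):
        asl_round(*roles, round_idx=r)
```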

🔍 Key Points

  • Controlled experiments in a search-agent setting identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data.
  • Rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and co-evolving the GRM with the policy further boosts performance; scaling agent task data, even synthetically generated, also substantially improves agentic capabilities.
  • Proposal of Agentic Self-Learning (ASL), a fully closed-loop, multi-role reinforcement learning framework that coordinates a Prompt Generator, a Policy Model, and a GRM within a shared tool environment and LLM backbone.
  • ASL delivers steady, round-over-round gains, surpasses strong RLVR baselines such as Search-R1 that plateau or degrade, and keeps improving under zero-labeled-data conditions, indicating superior sample efficiency and robustness.
  • GRM verification capacity is identified as the main bottleneck: freezing it induces reward hacking and stalls progress, while continual GRM training on the evolving data distribution, plus a small late-stage injection of real verification data, raises the performance ceiling (see the verification sketch after this list).
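
The GRM points above rest on generative verification: rather than string-matching against a gold answer, the shared backbone is prompted to judge a rollout. The snippet below is a hypothetical illustration of that idea only; call_llm, the prompt wording, and the binary verdict format are assumptions, not the paper's interface.

```python
# Hypothetical generative-verification step: the GRM is the same LLM backbone
# prompted as a judge instead of a rule-based answer matcher.
def call_llm(prompt: str) -> str:
    # Placeholder for a call to the shared LLM backbone.
    return "1"


def grm_verify(question: str, trajectory: str, answer: str) -> float:
    prompt = (
        "You are a verifier. Given the question, the agent's search "
        "trajectory, and its final answer, reply with 1 if the answer is "
        "correct and well-supported, otherwise 0.\n"
        f"Question: {question}\nTrajectory: {trajectory}\nAnswer: {answer}"
    )
    verdict = call_llm(prompt).strip()
    return 1.0 if verdict.startswith("1") else 0.0
```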

💡 Why This Paper Matters

This paper addresses a central question for scaling LLM-based agents: can they keep improving without human-curated datasets or predefined rule-based rewards? By isolating reward source and task-data scale as the decisive levers, and by showing that a closed-loop, multi-role framework (ASL) sustains round-over-round gains where strong RLVR baselines such as Search-R1 plateau or degrade, the authors offer both an empirical diagnosis and a practical recipe for self-improving agents. The released data and code make the findings directly reusable for open-domain agent training.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because self-improving agents that generate their own training tasks and rewards raise direct oversight concerns. The authors show that a frozen Generative Reward Model invites reward hacking and stalls progress, and that verification capacity, not policy capacity, is the main bottleneck of the self-learning loop. Understanding how continual verifier training and small injections of real verification data keep such loops honest is valuable for anyone studying reward integrity, evaluation robustness, and the safe deployment of autonomous, tool-using agents.

📚 Read the Full Paper