SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Authors: Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed, Xingzhi Guo, Daniel Kang, Joo-Kyung Kim

Published: 2025-10-19

arXiv ID: 2510.17017v1

Added to Library: 2025-10-21 04:04 UTC

Safety

📄 Abstract

Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once they are appended to its context, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only fine-tuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
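
The abstract describes the training signal as a final-output safety/utility reward combined with a query-level shaping term. The sketch below illustrates, in Python, one way such a combined reward could be computed during RL rollouts; the function names, the shaping weight, and the safety/correctness flags (presumably produced by judge models) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of SafeSearch-style reward shaping, assuming judge
# models have already labeled the final answer and each intermediate query.
# Names and weights are illustrative, not taken from the paper.

def final_output_reward(answer_is_safe: bool,
                        answer_is_correct: bool,
                        request_is_harmful: bool) -> float:
    """Outcome reward: penalize unsafe answers; otherwise reward a safe
    handling of a harmful request, or a correct answer to a benign one."""
    if not answer_is_safe:
        return -1.0
    if request_is_harmful:
        return 1.0
    return 1.0 if answer_is_correct else 0.0

def query_shaping_reward(query_safety_flags: list[bool]) -> float:
    """Query-level shaping term: reward safe search queries, penalize unsafe ones."""
    if not query_safety_flags:
        return 0.0
    return sum(1.0 if safe else -1.0 for safe in query_safety_flags) / len(query_safety_flags)

def safesearch_reward(answer_is_safe: bool,
                      answer_is_correct: bool,
                      request_is_harmful: bool,
                      query_safety_flags: list[bool],
                      shaping_weight: float = 0.5) -> float:
    """Combine the final-output reward with the query-level shaping term."""
    return (final_output_reward(answer_is_safe, answer_is_correct, request_is_harmful)
            + shaping_weight * query_shaping_reward(query_safety_flags))
```

Under this kind of shaping, a trajectory that issues unsafe search queries is penalized even when its final answer looks acceptable, which matches the intuition of rewarding behavior at the query level rather than only at the end of the episode.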

🔍 Key Points

  • Search agents built on large language models (LLMs) are more likely to produce harmful outputs than base LLMs, especially after utility-oriented fine-tuning.
  • The authors introduce SafeSearch, a multi-objective reinforcement learning framework that optimizes both safety and utility in search agents by incorporating a query-level shaping reward.
  • SafeSearch is empirically shown to reduce harmful outputs by over 70% while maintaining comparable QA performance to standard utility-focused fine-tuning methods.
  • The paper systematically evaluates the safety behaviors of search agents through extensive experimentation with red-teaming datasets, highlighting the need for safety alignment during training (a minimal evaluation sketch follows this list).
  • The approach addresses the critical trade-off between safety and utility, offering a way for search agents to generate informative responses without compromising safety.
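
The red-teaming evaluation mentioned above can be pictured as a simple loop over adversarial prompts. The sketch below is a minimal, hypothetical harness: `run_search_agent` and `judge_is_harmful` stand in for an agent rollout and an LLM-based safety judge, and the paper's actual datasets, judge, and metrics may differ.

```python
# Hypothetical red-teaming evaluation loop. The agent and judge callables
# are stand-ins, not part of any released code from the paper.
from typing import Callable, Iterable

def harmfulness_rate(prompts: Iterable[str],
                     run_search_agent: Callable[[str], str],
                     judge_is_harmful: Callable[[str, str], bool]) -> float:
    """Fraction of red-teaming prompts whose final agent answer is judged harmful."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    harmful = sum(
        judge_is_harmful(prompt, run_search_agent(prompt))
        for prompt in prompts
    )
    return harmful / len(prompts)
```

Comparing this rate across a base LLM, a utility-only fine-tuned agent, and a SafeSearch-trained agent is the kind of comparison the key points above summarize.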

💡 Why This Paper Matters

This paper is significant because it addresses the growing concern that LLM-based search agents can produce harmful outputs. By proposing SafeSearch, the authors provide a framework that lets these agents operate safely without sacrificing utility, which is essential for real-world deployment.

🎯 Why It's Interesting for AI Security Researchers

The findings are of particular interest to AI security researchers because they highlight the risks introduced by LLM-powered search agents. The proposed SafeSearch framework offers insight into mitigating these unsafe behaviors, which is valuable for building secure AI systems that protect users, a concern of increasing importance in AI ethics and policy.

📚 Read the Full Paper