Paper Library

Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents

Gauri Kholkar, Ratinder Ahuja

2025-09-28

2509.23994v2

Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan

2025-09-28

red teaming

2509.23882v1

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li

2025-09-28

safety

2509.25271v1

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu

2025-09-28

red teaming

2509.23694v3

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu

2025-09-28

safety

2509.23694v2

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu

2025-09-28

safety

2509.23694v1

PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents

Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, Philip S. Yu

2025-09-28

safety

2509.23614v1

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu

2025-09-28

red teaming

2509.23558v1

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova

2025-09-27

red teaming

2509.23519v1

Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Han Yan, Zheyuan Liu, Meng Jiang

2025-09-27

red teaming

2509.23362v1

Preventing Robotic Jailbreaking via Multimodal Domain Adaptation

Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone

2025-09-27

red teaming

2509.23281v1

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Javad Forough, Mohammad Maheri, Hamed Haddadi

2025-09-27

red teaming

2509.23037v1

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang, Yonghyun Jun, Hwanhee Lee

2025-09-26

red teaming

2509.22830v1

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim

2025-09-26

red teaming

2509.22292v1

Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance

Wenbin Hu, Huihao Jing, Haochen Shi, Haoran Li, Yangqiu Song

2025-09-26

safety

2509.22250v1

September 29 - October 05, 2025

SecInfer: Preventing Prompt Injection via Inference-time Scaling

SecInfer: Preventing Prompt Injection via Inference-time Scaling

SecInfer: Preventing Prompt Injection via Inference-time Scaling

GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models

September 22 - September 28, 2025

Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents

Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Preventing Robotic Jailbreaking via Multimodal Domain Adaptation

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance