Philipp Zimmermann
← Back to Newsletter

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

September 29 - October 05, 2025

3 papers

September 22 - September 28, 2025

21 papers

Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents

Gauri Kholkar, Ratinder Ahuja
2025-09-28
2509.23994v2

Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan
2025-09-28
red teaming
2509.23882v1

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
2025-09-28
safety
2509.25271v1

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu
2025-09-28
red teaming
2509.23694v3

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu
2025-09-28
safety
2509.23694v2

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu
2025-09-28
safety
2509.23694v1

PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents

Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, Philip S. Yu
2025-09-28
safety
2509.23614v1

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu
2025-09-28
red teaming
2509.23558v1

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova
2025-09-27
red teaming
2509.23519v1

Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Han Yan, Zheyuan Liu, Meng Jiang
2025-09-27
red teaming
2509.23362v1

Preventing Robotic Jailbreaking via Multimodal Domain Adaptation

Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone
2025-09-27
red teaming
2509.23281v1

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Javad Forough, Mohammad Maheri, Hamed Haddadi
2025-09-27
red teaming
2509.23037v1

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang, Yonghyun Jun, Hwanhee Lee
2025-09-26
red teaming
2509.22830v1

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim
2025-09-26
red teaming
2509.22292v1

Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance

Wenbin Hu, Huihao Jing, Haochen Shi, Haoran Li, Yangqiu Song
2025-09-26
safety
2509.22250v1

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
2025-09-26
red teaming safety
2509.22067v1

"Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors

Yue Liu, Yanjie Zhao, Yunbo Lyu, Ting Zhang, Haoyu Wang, David Lo
2025-09-26
red teaming
2509.22040v1

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son
2025-09-26
safety
2509.22745v1

PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces

Jiawei Zhao, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
2025-09-26
2509.21768v1

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
2025-09-26
red teaming
2509.21761v2

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
2025-09-26
red teaming
2509.21761v1