Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1169 papers total

September 29 - October 05, 2025

7 papers

SecInfer: Preventing Prompt Injection via Inference-time Scaling

Yupei Liu, Yanting Wang, Yuqi Jia, Jinyuan Jia, Neil Zhenqiang Gong
2025-09-29
red teaming
2509.24967v4

GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners

Haoran Li, Yulin Chen, Jingru Zeng, Hao Peng, Huihao Jing, Wenbin Hu, Xi Yang, Ziqian Zeng, Sirui Han, Yangqiu Song
2025-09-29
safety
2509.24418v1

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu
2025-09-29
2509.24393v1

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
2025-09-29
red teaming
2509.24384v1

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo
2025-09-29
red teaming
2509.24319v1

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang
2025-09-29
red teaming
2509.24296v1

AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models

Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu, Baoyuan Wu
2025-09-29
red teaming
2509.24269v1

September 22 - September 28, 2025

13 papers

Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents

Gauri Kholkar, Ratinder Ahuja
2025-09-28
2509.23994v2

Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan
2025-09-28
red teaming
2509.23882v1

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
2025-09-28
safety
2509.25271v1

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu
2025-09-28
red teaming
2509.23694v3

PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents

Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, Philip S. Yu
2025-09-28
safety
2509.23614v1

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu
2025-09-28
red teaming
2509.23558v1

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova
2025-09-27
red teaming
2509.23519v1

Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Han Yan, Zheyuan Liu, Meng Jiang
2025-09-27
red teaming
2509.23362v1

Preventing Robotic Jailbreaking via Multimodal Domain Adaptation

Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone
2025-09-27
red teaming
2509.23281v1

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Javad Forough, Mohammad Maheri, Hamed Haddadi
2025-09-27
red teaming
2509.23037v1

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang, Yonghyun Jun, Hwanhee Lee
2025-09-26
red teaming
2509.22830v1

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim
2025-09-26
red teaming
2509.22292v1

Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance

Wenbin Hu, Huihao Jing, Haochen Shi, Haoran Li, Yangqiu Song
2025-09-26
safety
2509.22250v1