Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

September 29 - October 05, 2025

7 papers

SecInfer: Preventing Prompt Injection via Inference-time Scaling

Yupei Liu, Yanting Wang, Yuqi Jia, Jinyuan Jia, Neil Zhenqiang Gong
2025-09-29
red teaming
2509.24967v3

GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners

Haoran Li, Yulin Chen, Jingru Zeng, Hao Peng, Huihao Jing, Wenbin Hu, Xi Yang, Ziqian Zeng, Sirui Han, Yangqiu Song
2025-09-29
safety
2509.24418v1

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu
2025-09-29
2509.24393v1

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
2025-09-29
red teaming
2509.24384v1

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo
2025-09-29
red teaming
2509.24319v1

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang
2025-09-29
red teaming
2509.24296v1

AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models

Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu, Baoyuan Wu
2025-09-29
red teaming
2509.24269v1

September 22 - September 28, 2025

15 papers

Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan
2025-09-28
red teaming
2509.23882v1

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
2025-09-28
safety
2509.25271v1

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu
2025-09-28
safety
2509.23694v2

PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents

Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, Philip S. Yu
2025-09-28
safety
2509.23614v1

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu
2025-09-28
red teaming
2509.23558v1

Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Han Yan, Zheyuan Liu, Meng Jiang
2025-09-27
red teaming
2509.23362v1

Preventing Robotic Jailbreaking via Multimodal Domain Adaptation

Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone
2025-09-27
red teaming
2509.23281v1

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Javad Forough, Mohammad Maheri, Hamed Haddadi
2025-09-27
red teaming
2509.23037v1

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang, Yonghyun Jun, Hwanhee Lee
2025-09-26
red teaming
2509.22830v1

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim
2025-09-26
red teaming
2509.22292v1

Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance

Wenbin Hu, Huihao Jing, Haochen Shi, Haoran Li, Yangqiu Song
2025-09-26
safety
2509.22250v1

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
2025-09-26
red teaming, safety
2509.22067v1

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son
2025-09-26
safety
2509.22745v1

PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces

Jiawei Zhao, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
2025-09-26
2509.21768v1

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
2025-09-26
red teaming
2509.21761v2