Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

September 29 - October 05, 2025

22 papers

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Xiangfang Li, Yu Wang, Bo Li
2025-10-01
red teaming
2510.01342v1

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Guobin Shen, Dongcheng Zhao, Haibo Tong, Jindong Li, Feifei Zhao, Yi Zeng
2025-10-01
red teaming, safety
2510.01088v1

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi
2025-10-01
red teaming
2510.00938v1

ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
2025-10-01
safety
2510.00857v1

Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability

Shojiro Yamabe, Jun Sakuma
2025-10-01
red teaming
2510.00565v1

A Call to Action for a Secure-by-Design Generative AI Paradigm

Dalal Alharthi, Ivan Roberto Kawaminami Garcia
2025-10-01
2510.00451v1

Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models

Matheus Vinicius da Silva de Oliveira, Jonathan de Andrade Silva, Awdren de Lima Fontao
2025-09-30
2509.26584v1

STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models

Shaoxiong Guo, Tianyi Du, Lijun Li, Yuyao Wu, Jie Li, Jing Shao
2025-09-30
red teaming
2509.26473v1

SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang
2025-09-30
red teaming
2509.26345v1

SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs

Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Yan Teng, Xingjun Ma, Yingchun Wang
2025-09-30
safety
2509.26100v1

Better Privilege Separation for Agents by Restricting Data Types

Dennis Jacob, Emad Alghamdi, Zhanhao Hu, Basel Alomair, David Wagner
2025-09-30
2509.25926v1

SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents

Ruolin Chen, Yinqian Sun, Jihang Wang, Mingyang Lv, Qian Zhang, Yi Zeng
2025-09-30
safety
2509.25885v1

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Yein Park, Jungwoo Park, Jaewoo Kang
2025-09-30
red teaming
2509.25843v1

How Diffusion Models Memorize

Juyeop Kim, Songkuk Kim, Jong-Seok Lee
2025-09-30
2509.25705v1

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi
2025-09-30
red teaming
2509.25624v1

Fingerprinting LLMs via Prompt Injection

Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li, Osama Ahmed, Zhicong Huang, Cheng Hong, Neil Gong
2025-09-29
2509.25448v2

Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs

Akio Hayakawa, Stefan Bott, Horacio Saggion
2025-09-29
safety
2509.25086v1

A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory

Qianshan Wei, Tengchao Yang, Yaochen Wang, Xinfeng Li, Lijun Li, Zhenfei Yin, Yi Zhan, Thorsten Holz, Zhiqiang Lin, XiaoFeng Wang
2025-09-29
safety
2510.02373v1

SecInfer: Preventing Prompt Injection via Inference-time Scaling

Yupei Liu, Yanting Wang, Yuqi Jia, Jinyuan Jia, Neil Zhenqiang Gong
2025-09-29
red teaming
2509.24967v4

GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners

Haoran Li, Yulin Chen, Jingru Zeng, Hao Peng, Huihao Jing, Wenbin Hu, Xi Yang, Ziqian Zeng, Sirui Han, Yangqiu Song
2025-09-29
safety
2509.24418v1

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu
2025-09-29
2509.24393v1

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
2025-09-29
red teaming
2509.24384v1