Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

September 29 - October 05, 2025

22 papers

UpSafe°C: Upcycling for Controllable Safety in Large Language Models

Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
2025-10-02
2510.02194v1

Dynamic Target Attack

Kedong Xiu, Churui Zeng, Tianhang Zheng, Xinzhe Huang, Xiaojun Jia, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
2025-10-02
red teaming
2510.02422v1

NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra
2025-10-02
red teaming
2510.01644v1

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu
2025-10-02
red teaming
2510.01586v1

Bypassing Prompt Guards in Production with Controlled-Release Prompting

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang
2025-10-02
red teaming
2510.01529v2

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Ziyu Liu, Sanmi Koyejo
2025-10-01
red teaming
2510.01494v2

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar
2025-10-01
red teaming
2510.01359v1

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, Neil Zhenqiang Gong
2025-10-01
red teaming
2510.01354v1

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Xiangfang Li, Yu Wang, Bo Li
2025-10-01
red teaming
2510.01342v1

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Guobin Shen, Dongcheng Zhao, Haibo Tong, Jindong Li, Feifei Zhao, Yi Zeng
2025-10-01
red teaming, safety
2510.01088v1

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi
2025-10-01
red teaming
2510.00938v1

ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
2025-10-01
safety
2510.00857v1

Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability

Shojiro Yamabe, Jun Sakuma
2025-10-01
red teaming
2510.00565v1

STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models

Shaoxiong Guo, Tianyi Du, Lijun Li, Yuyao Wu, Jie Li, Jing Shao
2025-09-30
red teaming
2509.26473v1

SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang
2025-09-30
red teaming
2509.26345v1

SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs

Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Yan Teng, Xingjun Ma, Yingchun Wang
2025-09-30
safety
2509.26100v1

SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents

Ruolin Chen, Yinqian Sun, Jihang Wang, Mingyang Lv, Qian Zhang, Yi Zeng
2025-09-30
safety
2509.25885v1

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Yein Park, Jungwoo Park, Jaewoo Kang
2025-09-30
red teaming
2509.25843v1

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi
2025-09-30
red teaming
2509.25624v1

Fingerprinting LLMs via Prompt Injection

Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li, Osama Ahmed, Zhicong Huang, Cheng Hong, Neil Gong
2025-09-29
2509.25448v2

Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs

Akio Hayakawa, Stefan Bott, Horacio Saggion
2025-09-29
safety
2509.25086v1

A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory

Qianshan Wei, Tengchao Yang, Yaochen Wang, Xinfeng Li, Lijun Li, Zhenfei Yin, Yi Zhan, Thorsten Holz, Zhiqiang Lin, XiaoFeng Wang
2025-09-29
safety
2510.02373v1