Philipp Zimmermann
← Back to Newsletter

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

October 06 - October 12, 2025

1 paper

September 29 - October 05, 2025

23 papers

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang
2025-10-05
red teaming safety
2510.04320v1

VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy

Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo
2025-10-05
red teaming
2510.04261v1

AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents

Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
2025-10-05
red teaming
2510.04257v1

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao
2025-10-05
2510.04196v1

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi
2025-10-04
red teaming
2510.03705v1

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh
2025-10-03
2510.03520v1

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
2025-10-03
red teaming
2510.03417v1

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste
2025-10-03
red teaming
2510.03204v1

Untargeted Jailbreak Attack

Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren
2025-10-03
red teaming
2510.02999v1

External Data Extraction Attacks against Retrieval-Augmented Large Language Models

Yu He, Yifei Chen, Yiming Li, Shuo Shao, Leyi Qi, Boheng Li, Dacheng Tao, Zhan Qin
2025-10-03
red teaming
2510.02964v1

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo
2025-10-03
red teaming
2510.02833v1

VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, Long T. Le
2025-10-03
safety
2510.05156v1

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li
2025-10-02
red teaming
2510.02609v1

UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
2025-10-02
2510.02194v1

Dynamic Target Attack

Kedong Xiu, Churui Zeng, Tianhang Zheng, Xinzhe Huang, Xiaojun Jia, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
2025-10-02
red teaming
2510.02422v1

NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra
2025-10-02
red teaming
2510.01644v1

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu
2025-10-02
red teaming
2510.01586v1

Bypassing Prompt Guards in Production with Controlled-Release Prompting

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang
2025-10-02
red teaming
2510.01529v2

Bypassing Prompt Guards in Production with Controlled-Release Prompting

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang
2025-10-02
red teaming
2510.01529v1

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Ziyu Liu, Sanmi Koyejo
2025-10-01
red teaming
2510.01494v2

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Liu, Sanmi Koyejo
2025-10-01
red teaming
2510.01494v1

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar
2025-10-01
red teaming
2510.01359v1

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, Neil Zhenqiang Gong
2025-10-01
red teaming
2510.01354v1