Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1172 papers total

October 06 - October 12, 2025

8 papers

AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling

Xiaogeng Liu, Chaowei Xiao
2025-10-06
red teaming
2510.05379v2

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen
2025-10-06
2510.05305v1

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste
2025-10-06
red teaming
2510.05244v1

Proactive defense against LLM Jailbreak

Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang
2025-10-06
red teaming, safety
2510.05052v1

Imperceptible Jailbreaking against Large Language Models

Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
2025-10-06
red teaming
2510.05025v1

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo
2025-10-06
red teaming
2510.04885v1

Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers

Santhosh Kumar Ravindran
2025-10-06
red teaming
2510.04528v1

P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
2025-10-06
safety
2510.04503v2

September 29 - October 05, 2025

14 papers

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang
2025-10-05
red teaming, safety
2510.04320v1

VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy

Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo
2025-10-05
red teaming
2510.04261v1

AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents

Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
2025-10-05
red teaming
2510.04257v1

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao
2025-10-05
2510.04196v1

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi
2025-10-04
red teaming
2510.03705v1

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh
2025-10-03
2510.03520v1

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
2025-10-03
red teaming
2510.03417v1

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste
2025-10-03
red teaming
2510.03204v1

Untargeted Jailbreak Attack

Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren
2025-10-03
red teaming
2510.02999v1

External Data Extraction Attacks against Retrieval-Augmented Large Language Models

Yu He, Yifei Chen, Yiming Li, Shuo Shao, Leyi Qi, Boheng Li, Dacheng Tao, Zhan Qin
2025-10-03
red teaming
2510.02964v1

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo
2025-10-03
red teaming
2510.02833v1

VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, Long T. Le
2025-10-03
safety
2510.05156v1

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li
2025-10-02
red teaming
2510.02609v1

UpSafe°C: Upcycling for Controllable Safety in Large Language Models

Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
2025-10-02
2510.02194v1