Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 768 papers total

October 06 - October 12, 2025

11 papers

Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling

Mary Llewellyn, Annie Gray, Josh Collyer, Michael Harries
2025-10-07
red teaming
2510.05709v1

Evaluating LLM Safety Across Child Development Stages: A Simulated Agent Approach

Abhejay Murali, Saleh Afroogh, Kevin Chen, David Atkinson, Amit Dhurandhar, Junfeng Jiao
2025-10-07
safety
2510.05484v1

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser
2025-10-06
red teaming
2510.05442v1

AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling

Xiaogeng Liu, Chaowei Xiao
2025-10-06
red teaming
2510.05379v2

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen
2025-10-06
2510.05305v1

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste
2025-10-06
red teaming
2510.05244v1

Proactive defense against LLM Jailbreak

Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang
2025-10-06
red teaming, safety
2510.05052v1

Imperceptible Jailbreaking against Large Language Models

Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
2025-10-06
red teaming
2510.05025v1

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo
2025-10-06
red teaming
2510.04885v1

Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers

Santhosh Kumar Ravindran
2025-10-06
red teaming
2510.04528v1

P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
2025-10-06
safety
2510.04503v2

September 29 - October 05, 2025

11 papers

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang
2025-10-05
red teaming, safety
2510.04320v1

AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents

Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
2025-10-05
red teaming
2510.04257v1

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao
2025-10-05
2510.04196v1

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi
2025-10-04
red teaming
2510.03705v1

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh
2025-10-03
2510.03520v1

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
2025-10-03
red teaming
2510.03417v1

Untargeted Jailbreak Attack

Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren
2025-10-03
red teaming
2510.02999v1

External Data Extraction Attacks against Retrieval-Augmented Large Language Models

Yu He, Yifei Chen, Yiming Li, Shuo Shao, Leyi Qi, Boheng Li, Dacheng Tao, Zhan Qin
2025-10-03
red teaming
2510.02964v1

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo
2025-10-03
red teaming
2510.02833v1

VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, Long T. Le
2025-10-03
safety
2510.05156v1

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li
2025-10-02
red teaming
2510.02609v1