Paper Library
A collection of 770 AI Security research papers
October 06 - October 12, 2025
11 papers
Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling
Mary Llewellyn, Annie Gray, Josh Collyer, Michael Harries
2025-10-07
red teaming
2510.05709v1
Evaluating LLM Safety Across Child Development Stages: A Simulated Agent Approach
Abhejay Murali, Saleh Afroogh, Kevin Chen, David Atkinson, Amit Dhurandhar, Junfeng Jiao
2025-10-07
safety
2510.05484v1
Adversarial Reinforcement Learning for Large Language Model Agent Safety
Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser
2025-10-06
red teaming
2510.05442v1
AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
Xiaogeng Liu, Chaowei Xiao
2025-10-06
red teaming
2510.05379v2
WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection
Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen
2025-10-06
2510.05305v1
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?
Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste
2025-10-06
red teaming
2510.05244v1
Proactive defense against LLM Jailbreak
Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang
2025-10-06
red teaming
safety
2510.05052v1
Imperceptible Jailbreaking against Large Language Models
Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
2025-10-06
red teaming
2510.05025v1
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo
2025-10-06
red teaming
2510.04885v1
Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers
Santhosh Kumar Ravindran
2025-10-06
red teaming
2510.04528v1
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
2025-10-06
safety
2510.04503v2
September 29 - October 05, 2025
11 papers
Read the Scene, Not the Script: Outcome-Aware Safety for LLMs
Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang
2025-10-05
red teaming
safety
2510.04320v1
AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
2025-10-05
red teaming
2510.04257v1
COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability
Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao
2025-10-05
2510.04196v1
Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi
2025-10-04
red teaming
2510.03705v1
Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models
Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh
2025-10-03
2510.03520v1
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
2025-10-03
red teaming
2510.03417v1
Untargeted Jailbreak Attack
Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren
2025-10-03
red teaming
2510.02999v1
External Data Extraction Attacks against Retrieval-Augmented Large Language Models
Yu He, Yifei Chen, Yiming Li, Shuo Shao, Leyi Qi, Boheng Li, Dacheng Tao, Zhan Qin
2025-10-03
red teaming
2510.02964v1
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Zhixin Xie, Xurui Song, Jun Luo
2025-10-03
red teaming
2510.02833v1
VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation
Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, Long T. Le
2025-10-03
safety
2510.05156v1
RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents
Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li
2025-10-02
red teaming
2510.02609v1