Paper Library
A collection of AI security research papers
Showing 1172 papers total
October 06 - October 12, 2025
8 papers
AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
Xiaogeng Liu, Chaowei Xiao
2025-10-06
red teaming
2510.05379v2
WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection
Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen
2025-10-06
2510.05305v1
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?
Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste
2025-10-06
red teaming
2510.05244v1
Proactive defense against LLM Jailbreak
Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang
2025-10-06
red teaming
safety
2510.05052v1
Imperceptible Jailbreaking against Large Language Models
Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
2025-10-06
red teaming
2510.05025v1
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo
2025-10-06
red teaming
2510.04885v1
Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers
Santhosh Kumar Ravindran
2025-10-06
red teaming
2510.04528v1
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
2025-10-06
safety
2510.04503v2
September 29 - October 05, 2025
14 papers
Read the Scene, Not the Script: Outcome-Aware Safety for LLMs
Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang
2025-10-05
red teaming
safety
2510.04320v1
VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy
Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo
2025-10-05
red teaming
2510.04261v1
AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
2025-10-05
red teaming
2510.04257v1
COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability
Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao
2025-10-05
2510.04196v1
Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi
2025-10-04
red teaming
2510.03705v1
Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models
Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh
2025-10-03
2510.03520v1
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
2025-10-03
red teaming
2510.03417v1
FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents
Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste
2025-10-03
red teaming
2510.03204v1
Untargeted Jailbreak Attack
Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren
2025-10-03
red teaming
2510.02999v1
External Data Extraction Attacks against Retrieval-Augmented Large Language Models
Yu He, Yifei Chen, Yiming Li, Shuo Shao, Leyi Qi, Boheng Li, Dacheng Tao, Zhan Qin
2025-10-03
red teaming
2510.02964v1
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Zhixin Xie, Xurui Song, Jun Luo
2025-10-03
red teaming
2510.02833v1
VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation
Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, Long T. Le
2025-10-03
safety
2510.05156v1
RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents
Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li
2025-10-02
red teaming
2510.02609v1
UpSafe°C: Upcycling for Controllable Safety in Large Language Models
Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
2025-10-02
2510.02194v1