Paper Library

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang

2025-10-05

red teaming, safety

2510.04320v1

VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy

Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo

2025-10-05

red teaming

2510.04261v1

AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents

Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao

2025-10-05

red teaming

2510.04257v1

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao

2025-10-05

2510.04196v1

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi

2025-10-04

red teaming

2510.03705v1

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh

2025-10-03

2510.03520v1

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi

2025-10-03

red teaming

2510.03417v1

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste

2025-10-03

red teaming

2510.03204v1

Untargeted Jailbreak Attack

Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren

2025-10-03

red teaming

2510.02999v1

External Data Extraction Attacks against Retrieval-Augmented Large Language Models

Yu He, Yifei Chen, Yiming Li, Shuo Shao, Leyi Qi, Boheng Li, Dacheng Tao, Zhan Qin

2025-10-03

red teaming

2510.02964v1

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo

2025-10-03

red teaming

2510.02833v1

VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, Long T. Le

2025-10-03

safety

2510.05156v1

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li

2025-10-02

red teaming

2510.02609v1

UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie

2025-10-02

2510.02194v1

Dynamic Target Attack

Kedong Xiu, Churui Zeng, Tianhang Zheng, Xinzhe Huang, Xiaojun Jia, Di Wang, Puning Zhao, Zhan Qin, Kui Ren

2025-10-02

red teaming

2510.02422v1

NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra

2025-10-02

red teaming

2510.01644v1

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu

2025-10-02

red teaming

2510.01586v1

Bypassing Prompt Guards in Production with Controlled-Release Prompting

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang

2025-10-02

red teaming

2510.01529v2

Bypassing Prompt Guards in Production with Controlled-Release Prompting

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang

2025-10-02

red teaming

2510.01529v1

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Ziyu Liu, Sanmi Koyejo

2025-10-01

red teaming

2510.01494v2

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Liu, Sanmi Koyejo

2025-10-01

red teaming

2510.01494v1

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar

2025-10-01

red teaming

2510.01359v1

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, Neil Zhenqiang Gong

2025-10-01

red teaming

2510.01354v1

October 06 - October 12, 2025

P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

September 29 - October 05, 2025

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy

AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Untargeted Jailbreak Attack

External Data Extraction Attacks against Retrieval-Augmented Large Language Models

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

Dynamic Target Attack

NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Bypassing Prompt Guards in Production with Controlled-Release Prompting

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents