Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1331 papers total

September 08 - September 14, 2025

8 papers

Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M

Piyush Pant
2025-09-10
safety
2509.09055v1

PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability

Tung Vu, Lam Nguyen, Quynh Dao
2025-09-10
safety
2509.08910v1

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-09-10
red teaming
2509.08729v1

Architecting Resilient LLM Agents: A Guide to Secure Plan-then-Execute Implementations

Ron F. Del Rosario, Klaudia Krawiecka, Christian Schroeder de Witt
2025-09-10
2509.08646v1

ImportSnare: Directed "Code Manual" Hijacking in Retrieval-Augmented Code Generation

Kai Ye, Liangcai Su, Chenxiong Qian
2025-09-09
red teaming
2509.07941v1

Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, Xiaobing Pei, Jing Wang
2025-09-09
red teaming
2509.07617v1

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs

Hongfei Xia, Hongru Wang, Zeming Liu, Qian Yu, Yuhang Guo, Haifeng Wang
2025-09-09
safety
2509.07315v1

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang
2025-09-08
red teaming
2509.06350v1

September 01 - September 07, 2025

14 papers

Measuring the Vulnerability Disclosure Policies of AI Vendors

Yangheran Piao, Jingjie Li, Daniel W. Woods
2025-09-07
2509.06136v1

Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs

Andrew Yeo, Daeseon Choi
2025-09-07
red teaming, safety
2509.05883v1

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Debdeep Sanyal, Manodeep Ray, Murari Mandal
2025-09-06
red teaming
2509.08000v1

EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System

Pavan Reddy, Aditya Sanjay Gujral
2025-09-06
red teaming
2509.10540v1

Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Youjia Zheng, Mohammad Zandsalimy, Shanu Sushmita
2025-09-05
red teaming
2509.05471v1

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models

Danielle Ensign, Henry Sleight, Kyle Fish
2025-09-05
2509.04781v1

NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models

Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu
2025-09-04
red teaming
2509.03985v1

Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs

Shei Pern Chua, Thai Zhen Leng, Teh Kai Jun, Xiao Li, Xiaolin Hu
2025-09-04
red teaming
2509.05367v1

SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang
2025-09-03
red teaming
2509.03487v1

BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Roland Pihlakas, Sruthi Kuriakose
2025-09-02
safety
2509.02655v1

Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety

Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Bräunl, Jin B. Hong
2025-09-02
safety
2509.02163v1

Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue
2025-09-02
2509.01909v3

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Chongwen Zhao, Kaizhu Huang
2025-09-01
red teaming, safety
2509.01631v1

Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

Shiji Zhao, Ranjie Duan, Jiexi Liu, Xiaojun Jia, Fengxiang Wang, Cheng Wei, Ruoxi Cheng, Yong Xie, Chang Liu, Qing Guo, Jialing Tao, Hui Xue, Xingxing Wei
2025-09-01
red teaming, safety
2509.01444v1