Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1331 papers total

September 08 - September 14, 2025

8 papers

Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M

Piyush Pant
2025-09-10
safety
2509.09055v1

PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability

Tung Vu, Lam Nguyen, Quynh Dao
2025-09-10
safety
2509.08910v1

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-09-10
red teaming
2509.08729v1

Architecting Resilient LLM Agents: A Guide to Secure Plan-then-Execute Implementations

Ron F. Del Rosario, Klaudia Krawiecka, Christian Schroeder de Witt
2025-09-10
2509.08646v1

ImportSnare: Directed "Code Manual" Hijacking in Retrieval-Augmented Code Generation

Kai Ye, Liangcai Su, Chenxiong Qian
2025-09-09
red teaming
2509.07941v1

Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, Xiaobing Pei, Jing Wang
2025-09-09
red teaming
2509.07617v1

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs

Hongfei Xia, Hongru Wang, Zeming Liu, Qian Yu, Yuhang Guo, Haifeng Wang
2025-09-09
safety
2509.07315v1

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang
2025-09-08
red teaming
2509.06350v1

September 01 - September 07, 2025

14 papers

Measuring the Vulnerability Disclosure Policies of AI Vendors

Yangheran Piao, Jingjie Li, Daniel W. Woods
2025-09-07
2509.06136v1

Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs

Andrew Yeo, Daeseon Choi
2025-09-07
red teaming, safety
2509.05883v1

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Debdeep Sanyal, Manodeep Ray, Murari Mandal
2025-09-06
red teaming
2509.08000v1

EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System

Pavan Reddy, Aditya Sanjay Gujral
2025-09-06
red teaming
2509.10540v1

Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Youjia Zheng, Mohammad Zandsalimy, Shanu Sushmita
2025-09-05
red teaming
2509.05471v1

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models

Danielle Ensign, Henry Sleight, Kyle Fish
2025-09-05
2509.04781v1

NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models

Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu
2025-09-04
red teaming
2509.03985v1

Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs

Shei Pern Chua, Thai Zhen Leng, Teh Kai Jun, Xiao Li, Xiaolin Hu
2025-09-04
red teaming
2509.05367v1

SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang
2025-09-03
red teaming
2509.03487v1

BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Roland Pihlakas, Sruthi Kuriakose
2025-09-02
safety
2509.02655v1

Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety

Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Bräunl, Jin B. Hong
2025-09-02
safety
2509.02163v1

Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue
2025-09-02
2509.01909v3

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Chongwen Zhao, Kaizhu Huang
2025-09-01
red teaming, safety
2509.01631v1

Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

Shiji Zhao, Ranjie Duan, Jiexi Liu, Xiaojun Jia, Fengxiang Wang, Cheng Wei, Ruoxi Cheng, Yong Xie, Chang Liu, Qing Guo, Jialing Tao, Hui Xue, Xingxing Wei
2025-09-01
red teaming, safety
2509.01444v1