Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

September 01 - September 07, 2025

15 papers

Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs

Andrew Yeo, Daeseon Choi
2025-09-07
red teaming, safety
2509.05883v1

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Debdeep Sanyal, Manodeep Ray, Murari Mandal
2025-09-06
red teaming
2509.08000v1

EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System

Pavan Reddy, Aditya Sanjay Gujral
2025-09-06
red teaming
2509.10540v1

Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Youjia Zheng, Mohammad Zandsalimy, Shanu Sushmita
2025-09-05
red teaming
2509.05471v1

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models

Danielle Ensign, Henry Sleight, Kyle Fish
2025-09-05
2509.04781v1

NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models

Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu
2025-09-04
red teaming
2509.03985v1

Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs

Shei Pern Chua, Thai Zhen Leng, Teh Kai Jun, Xiao Li, Xiaolin Hu
2025-09-04
red teaming
2509.05367v1

SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang
2025-09-03
red teaming
2509.03487v1

BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Roland Pihlakas, Sruthi Kuriakose
2025-09-02
safety
2509.02655v1

Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety

Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Bräunl, Jin B. Hong
2025-09-02
safety
2509.02163v1

Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue
2025-09-02
2509.01909v3

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Chongwen Zhao, Kaizhu Huang
2025-09-01
red teaming, safety
2509.01631v1

Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

Shiji Zhao, Ranjie Duan, Jiexi Liu, Xiaojun Jia, Fengxiang Wang, Cheng Wei, Ruoxi Cheng, Yong Xie, Chang Liu, Qing Guo, Jialing Tao, Hui Xue, Xingxing Wei
2025-09-01
red teaming, safety
2509.01444v1

LLM-empowered Agents Simulation Framework for Scenario Generation in Service Ecosystem Governance

Deyu Zhou, Yuqi Hou, Xiao Xue, Xudong Lu, Qingzhong Li, Lizhen Cui
2025-09-01
governance
2509.01441v1

Web Fraud Attacks Against LLM-Driven Multi-Agent Systems

Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, Meng Han
2025-09-01
2509.01211v1

August 25 - August 31, 2025

7 papers