Philipp Zimmermann

Paper Library

Collection of AI Security research papers

235 papers total

July 28 - August 03, 2025

12 papers

Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu
2025-08-01
red teaming
2508.00555v1

Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking

Haoyu Wang, Chris M. Poskitt, Jun Sun, Jiali Wei
2025-08-01
safety
2508.00500v1

A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
2025-07-31
safety
2507.23486v1

Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto
2025-07-31
red teaming
2507.23465v1

Measuring Harmfulness of Computer-Using Agents

Aaron Xuxiang Tian, Ruofan Zhang, Janet Tang, Jiaxin Wen
2025-07-31
red teaming
2508.00935v1

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
2025-07-30
red teaming, safety
2507.22564v1

Promoting Online Safety by Simulating Unsafe Conversations with LLMs

Owen Hoffman, Kangze Peng, Zehua You, Sajid Kamal, Sukrit Venkatagiri
2025-07-29
safety
2507.22267v1

Strategic Deflection: Defending LLMs from Logit Manipulation

Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni
2025-07-29
red teaming
2507.22160v1

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security

Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li
2025-07-29
red teaming
2507.22037v1

Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is

Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda
2025-07-29
red teaming
2507.21820v1

PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

Quanchen Zou, Zonghao Ying, Moyang Chen, Wenzhuo Xu, Yisong Xiao, Yakai Li, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang
2025-07-29
red teaming
2507.21540v1

Enhancing Jailbreak Attacks on LLMs via Persona Prompts

Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang
2025-07-28
red teaming
2507.22171v1

July 21 - July 27, 2025

5 papers

July 14 - July 20, 2025

7 papers