
Paper Library

A collection of AI Security research papers

Showing 1,169 papers total

July 28 - August 3, 2025

9 papers

Measuring Harmfulness of Computer-Using Agents

Aaron Xuxiang Tian, Ruofan Zhang, Janet Tang, Jiaxin Wen
2025-07-31
red teaming
2508.00935v1

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
2025-07-30
red teaming, safety
2507.22564v1

Promoting Online Safety by Simulating Unsafe Conversations with LLMs

Owen Hoffman, Kangze Peng, Zehua You, Sajid Kamal, Sukrit Venkatagiri
2025-07-29
safety
2507.22267v1

Strategic Deflection: Defending LLMs from Logit Manipulation

Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni
2025-07-29
red teaming
2507.22160v1

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security

Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li
2025-07-29
red teaming
2507.22037v1

Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is

Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda
2025-07-29
red teaming
2507.21820v1

PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

Quanchen Zou, Zonghao Ying, Moyang Chen, Wenzhuo Xu, Yisong Xiao, Yakai Li, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang
2025-07-29
red teaming
2507.21540v1

Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning

Jungwon Park, Wonjong Rhee
2025-07-28
2507.20906v2

Enhancing Jailbreak Attacks on LLMs via Persona Prompts

Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang
2025-07-28
red teaming
2507.22171v1

July 21 - July 27, 2025

6 papers

July 14 - July 20, 2025

9 papers

DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection

Jerry Wang, Fang Yu
2025-07-20
red teaming
2507.15042v1

AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning

Yi Zhang, An Zhang, XiuYu Zhang, Leheng Sheng, Yuxin Chen, Zhenkai Liang, Xiang Wang
2025-07-20
2507.14987v1

Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix

Juan Manuel Contreras
2025-07-19
safety
2507.14719v1

Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models

Palash Nandi, Maithili Joshi, Tanmoy Chakraborty
2025-07-18
red teaming
2507.13761v1

TopicAttack: An Indirect Prompt Injection Attack via Topic Transition

Yulin Chen, Haoran Li, Yuexin Li, Yue Liu, Yangqiu Song, Bryan Hooi
2025-07-18
red teaming
2507.13686v2

Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Liang Lin, Zhihao Xu, Xuehai Tang, Shi Liu, Biyu Zhou, Fuqing Zhu, Jizhong Han, Songlin Hu
2025-07-17
red teaming, safety
2507.13474v1

Prompt Injection 2.0: Hybrid AI Threats

Jeremy McHugh, Kristina Šekrst, Jon Cefalu
2025-07-17
red teaming
2507.13169v1

Exploiting Jailbreaking Vulnerabilities in Generative AI to Bypass Ethical Safeguards for Facilitating Phishing Attacks

Rina Mishra, Gaurav Varshney
2025-07-16
red teaming
2507.12185v1

LLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi
2025-07-16
red teaming
2507.11878v1