Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

July 14 - July 20, 2025

12 papers

DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection

Jerry Wang, Fang Yu
2025-07-20
red teaming
2507.15042v1

AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning

Yi Zhang, An Zhang, XiuYu Zhang, Leheng Sheng, Yuxin Chen, Zhenkai Liang, Xiang Wang
2025-07-20
2507.14987v1

Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix

Juan Manuel Contreras
2025-07-19
safety
2507.14719v1

Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models

Palash Nandi, Maithili Joshi, Tanmoy Chakraborty
2025-07-18
red teaming
2507.13761v1

TopicAttack: An Indirect Prompt Injection Attack via Topic Transition

Yulin Chen, Haoran Li, Yuexin Li, Yue Liu, Yangqiu Song, Bryan Hooi
2025-07-18
red teaming
2507.13686v2

Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Liang Lin, Zhihao Xu, Xuehai Tang, Shi Liu, Biyu Zhou, Fuqing Zhu, Jizhong Han, Songlin Hu
2025-07-17
red teaming, safety
2507.13474v1

Prompt Injection 2.0: Hybrid AI Threats

Jeremy McHugh, Kristina Šekrst, Jon Cefalu
2025-07-17
red teaming
2507.13169v1

Exploiting Jailbreaking Vulnerabilities in Generative AI to Bypass Ethical Safeguards for Facilitating Phishing Attacks

Rina Mishra, Gaurav Varshney
2025-07-16
red teaming
2507.12185v1

LLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi
2025-07-16
red teaming
2507.11878v1

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility

Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Julius Broomfield, Adam Gleave, Kellin Pelrine
2025-07-15
red teaming
2507.11630v1

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang
2025-07-15
red teaming, safety
2507.11097v1

ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Chaowei Xiao
2025-07-14
red teaming
2507.11500v1

July 07 - July 13, 2025

12 papers

Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Jia Yi Goh, Shaun Khoo, Nyx Iskandar, Gabriel Chua, Leanne Tan, Jessica Foo
2025-07-13
safety
2507.09820v1

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v3

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v2

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v1

A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking

Zhengye Han, Quanyan Zhu
2025-07-10
red teaming, safety
2507.08207v1

Defending Against Prompt Injection With a Few DefensiveTokens

Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner
2025-07-10
red teaming
2507.07974v2

GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang
2025-07-10
red teaming
2507.07735v1

May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks

Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes
2025-07-10
red teaming
2507.07417v1

An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs

Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao
2025-07-09
red teaming
2507.07146v1

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Stephen Obadinma, Xiaodan Zhu
2025-07-09
red teaming
2507.06489v1

Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

Aaron Dharna, Cong Lu, Jeff Clune
2025-07-09
red teaming
2507.06466v1

The bitter lesson of misuse detection

Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn
2025-07-08
red teaming
2507.06282v1