Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1169 papers total

July 14 - July 20, 2025

3 papers

July 07 - July 13, 2025

16 papers

Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Jia Yi Goh, Shaun Khoo, Nyx Iskandar, Gabriel Chua, Leanne Tan, Jessica Foo
2025-07-13
safety
2507.09820v1

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v3

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v2

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v1

A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking

Zhengye Han, Quanyan Zhu
2025-07-10
red teaming safety
2507.08207v1

Defending Against Prompt Injection With a Few DefensiveTokens

Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner
2025-07-10
red teaming
2507.07974v2

GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang
2025-07-10
red teaming
2507.07735v1

May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks

Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes
2025-07-10
red teaming
2507.07417v1

An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs

Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao
2025-07-09
red teaming
2507.07146v1

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Stephen Obadinma, Xiaodan Zhu
2025-07-09
red teaming
2507.06489v1

Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

Aaron Dharna, Cong Lu, Jeff Clune
2025-07-09
red teaming
2507.06466v1

The bitter lesson of misuse detection

Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn
2025-07-08
red teaming
2507.06282v1

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian
2025-07-08
red teaming safety
2507.06043v1

TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath
2025-07-08
red teaming
2507.05660v1

Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao
2025-07-07
red teaming
2507.05248v1

Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

Wei Duan, Li Qian
2025-07-07
red teaming
2507.04673v1

June 30 - July 06, 2025

5 papers