Paper Library
A collection of AI security research papers
1,331 papers total
July 07 - July 13, 2025
13 papers
SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v1
A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking
Zhengye Han, Quanyan Zhu
2025-07-10
red teaming
safety
2507.08207v1
Defending Against Prompt Injection With a Few DefensiveTokens
Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner
2025-07-10
red teaming
2507.07974v2
GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing
Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang
2025-07-10
red teaming
2507.07735v1
May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks
Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes
2025-07-10
red teaming
2507.07417v1
An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs
Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao
2025-07-09
red teaming
2507.07146v1
On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
Stephen Obadinma, Xiaodan Zhu
2025-07-09
red teaming
2507.06489v1
Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models
Aaron Dharna, Cong Lu, Jeff Clune
2025-07-09
red teaming
2507.06466v1
The bitter lesson of misuse detection
Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn
2025-07-08
red teaming
2507.06282v1
CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian
2025-07-08
red teaming
safety
2507.06043v1
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data
Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath
2025-07-08
red teaming
2507.05660v1
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao
2025-07-07
red teaming
2507.05248v1
Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message
Wei Duan, Li Qian
2025-07-07
red teaming
2507.04673v1
June 30 - July 06, 2025
11 papers
Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking
Tim Beyer, Yan Scholten, Leo Schwinn, Stephan Günnemann
2025-07-06
red teaming
2507.04446v2
Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking
Tim Beyer, Yan Scholten, Stephan Günnemann, Leo Schwinn
2025-07-06
red teaming
2507.04446v1
Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs
Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
2025-07-06
red teaming
safety
2507.04365v1
Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
Jeremiah Giordani
2025-07-04
safety
2507.03662v1
On Jailbreaking Quantized Language Models Through Fault Injection Attacks
Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda
2025-07-04
red teaming
2507.03236v2
On Jailbreaking Quantized Language Models Through Fault Injection Attacks
Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda
2025-07-04
red teaming
2507.03236v1
Adversarial Manipulation of Reasoning Models using Internal Representations
Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
2025-07-03
red teaming
2507.03167v1
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
2025-07-03
red teaming
2507.02844v1
Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models
Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia
2025-07-03
red teaming
2507.02799v1
Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks
Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo
2025-07-03
2507.02735v2
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka, Xue Jiang, Xuebing Zhou
2025-07-03
red teaming
2507.02332v1