Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

June 30 - July 06, 2025

2 papers

June 23 - June 29, 2025

12 papers

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
2025-06-28
red teaming
2506.22957v1

VERA: Variational Inference Framework for Jailbreaking Large Language Models

Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang
2025-06-27
red teaming
2506.22666v1

MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs

Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique
2025-06-27
red teaming
2506.22557v1

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Mohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim, Gunvanth Kandula, Alex Park, James C. Davis
2025-06-27
red teaming, safety
2506.21972v1

Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center

James Wen, Sahil Nalawade, Zhiwei Liang, Catherine Bielick, Marisa Ferrara Boston, Alexander Chowdhury, Adele Collin, Luigi De Angelis, Jacob Ellen, Heather Frase, Rodrigo R. Gameiro, Juan Manuel Gutierrez, Pooja Kadam, Murat Keceli, Srikanth Krishnamurthy, Anne Kwok, Yanan Lance Lu, Heather Mattie, Liam G. McCoy, Katherine Miller, Allison C. Morgan, Marlene Louisa Moerig, Trang Nguyen, Alexander Owen-Post, Alex D. Ruiz, Sreekar Reddy Puchala, Soujanya Samineni, Takeshi Tohyama, Varun Ullanat, Carmine Valenza, Camilo Velez, Pengcheng Wang, Anna Wuest, Yuxiang Zhou, Yingde Zhu, Jason M. Johnson, Naomi Lenane, Jennifer Willcox, Francis J. Vitiello, Leo Anthony G. Celi, Renato Umeton
2025-06-26
red teaming
2506.22523v2

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

Chenkai Sun, Denghui Zhang, ChengXiang Zhai, Heng Ji
2025-06-26
safety
2506.20949v1

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Muhammad Khurram Khan, Ningyu Zhang, Chaochao Chen, Meng Han
2025-06-24
safety
2506.19676v1

PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty

Jinwen He, Yiyang Lu, Zijin Lin, Kai Chen, Yue Zhao
2025-06-24
2506.19563v1

MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng
2025-06-24
red teaming
2506.19257v1

Command-V: Pasting LLM Behaviors via Activation Profiles

Barry Wang, Avi Schwarzschild, Alexander Robey, Ali Payani, Charles Fleming, Mingjie Sun, Daphne Ippolito
2025-06-23
red teaming
2506.19140v1

Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks

Xiaodong Wu, Xiangman Li, Jianbing Ni
2025-06-23
red teaming
2506.18543v1

NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation

Yu Xie, Chengjie Zeng, Lingyun Zhang, Yanwei Fu
2025-06-23
2506.18325v1

June 16 - June 22, 2025

10 papers

Multi-turn Jailbreaking via Global Refinement and Active Fabrication

Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin
2025-06-22
red teaming
2506.17881v1

Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs

Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn
2025-06-21
safety
2506.18931v1

SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Zhenglin Lai, Mengyao Liao, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li, Bingzhe Wu
2025-06-20
safety
2506.17368v1

From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers

Jingtong Su, Julia Kempe, Karen Ullrich
2025-06-20
red teaming
2506.17052v1

MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Meng Han
2025-06-20
red teaming
2506.16792v1

Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu
2025-06-20
red teaming
2506.16760v1

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li
2025-06-19
red teaming
2506.16447v1

Probing the Robustness of Large Language Models Safety to Latent Perturbations

Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang
2025-06-19
red teaming
2506.16078v1

deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses

Georgios Androutsopoulos, Antonio Bianchi
2025-06-18
safety
2506.15648v1

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Gabrel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong
2025-06-18
safety
2506.15606v1