Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1169 papers total

June 30 - July 06, 2025

17 papers

On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda
2025-07-04
red teaming
2507.03236v1

Adversarial Manipulation of Reasoning Models using Internal Representations

Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
2025-07-03
red teaming
2507.03167v1

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
2025-07-03
red teaming
2507.02844v1

Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia
2025-07-03
red teaming
2507.02799v1

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo
2025-07-03
2507.02735v2

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Krishna Kanth Nakka, Xue Jiang, Xuebing Zhou
2025-07-03
red teaming
2507.02332v1

MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation

Lu Yan, Zhuo Zhang, Xiangzhe Xu, Shengwei An, Guangyu Shen, Zhou Xuan, Xuan Chen, Xiangyu Zhang
2025-07-02
red teaming
2507.02057v1

SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism

Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen
2025-07-02
red teaming safety
2507.01513v1

LLM-based Realistic Safety-Critical Driving Video Generation

Yongjie Fu, Ruijian Zha, Pei Tian, Xuan Di
2025-07-02
safety
2507.01264v1

'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Annika M Schoene, Cansu Canca
2025-07-01
red teaming
2507.02990v1

Enhancing LLM Agent Safety via Causal Influence Prompting

Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, Kimin Lee
2025-07-01
safety
2507.00979v1

Reasoning as an Adaptive Defense for Safety

Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar
2025-07-01
red teaming
2507.00971v1

SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents

Siyuan Liang, Tianmeng Fang, Zhe Liu, Aishan Liu, Yan Xiao, Jinyuan He, Ee-Chien Chang, Xiaochun Cao
2025-07-01
red teaming
2507.00841v1

Linearly Decoding Refused Knowledge in Aligned Language Models

Aryan Shrivastava, Ari Holtzman
2025-06-30
red teaming
2507.00239v1

Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models

Tung-Ling Li, Hongliang Liu
2025-06-30
red teaming
2506.24056v1

Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages

Ruhina Tabasshum Prome, Tarikul Islam Tamiti, Anomadarshi Barua
2025-06-30
red teaming
2506.23930v1

Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

Maria Carolina Cornelia Wit, Jun Pang
2025-06-30
red teaming
2506.23576v1

June 23 - June 29, 2025

7 papers

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
2025-06-28
red teaming
2506.22957v1

VERA: Variational Inference Framework for Jailbreaking Large Language Models

Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang
2025-06-27
red teaming
2506.22666v1

MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs

Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique
2025-06-27
red teaming
2506.22557v1

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Mohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim, Gunvanth Kandula, Alex Park, James C. Davis
2025-06-27
red teaming safety
2506.21972v1

Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center

James Wen, Sahil Nalawade, Zhiwei Liang, Catherine Bielick, Marisa Ferrara Boston, Alexander Chowdhury, Adele Collin, Luigi De Angelis, Jacob Ellen, Heather Frase, Rodrigo R. Gameiro, Juan Manuel Gutierrez, Pooja Kadam, Murat Keceli, Srikanth Krishnamurthy, Anne Kwok, Yanan Lance Lu, Heather Mattie, Liam G. McCoy, Katherine Miller, Allison C. Morgan, Marlene Louisa Moerig, Trang Nguyen, Alexander Owen-Post, Alex D. Ruiz, Sreekar Reddy Puchala, Soujanya Samineni, Takeshi Tohyama, Varun Ullanat, Carmine Valenza, Camilo Velez, Pengcheng Wang, Anna Wuest, Yuxiang Zhou, Yingde Zhu, Jason M. Johnson, Naomi Lenane, Jennifer Willcox, Francis J. Vitiello, Leo Anthony G. Celi, Renato Umeton
2025-06-26
red teaming
2506.22523v2

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

Chenkai Sun, Denghui Zhang, ChengXiang Zhai, Heng Ji
2025-06-26
safety
2506.20949v1

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Muhammad Khurram Khan, Ningyu Zhang, Chaochao Chen, Meng Han
2025-06-24
safety
2506.19676v1