Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

July 07 - July 13, 2025

4 papers

June 30 - July 06, 2025

20 papers

Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking

Tim Beyer, Yan Scholten, Leo Schwinn, Stephan Günnemann
2025-07-06
red teaming
2507.04446v2

Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking

Tim Beyer, Yan Scholten, Stephan Günnemann, Leo Schwinn
2025-07-06
red teaming
2507.04446v1

Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
2025-07-06
red teaming, safety
2507.04365v1

Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs

Jeremiah Giordani
2025-07-04
safety
2507.03662v1

On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda
2025-07-04
red teaming
2507.03236v2

On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda
2025-07-04
red teaming
2507.03236v1

Adversarial Manipulation of Reasoning Models using Internal Representations

Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
2025-07-03
red teaming
2507.03167v1

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
2025-07-03
red teaming
2507.02844v1

Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia
2025-07-03
red teaming
2507.02799v1

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo
2025-07-03
2507.02735v2

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Krishna Kanth Nakka, Xue Jiang, Xuebing Zhou
2025-07-03
red teaming
2507.02332v1

MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation

Lu Yan, Zhuo Zhang, Xiangzhe Xu, Shengwei An, Guangyu Shen, Zhou Xuan, Xuan Chen, Xiangyu Zhang
2025-07-02
red teaming
2507.02057v1

SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism

Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen
2025-07-02
red teaming, safety
2507.01513v1

LLM-based Realistic Safety-Critical Driving Video Generation

Yongjie Fu, Ruijian Zha, Pei Tian, Xuan Di
2025-07-02
safety
2507.01264v1

'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Annika M Schoene, Cansu Canca
2025-07-01
red teaming
2507.02990v1

Enhancing LLM Agent Safety via Causal Influence Prompting

Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, Kimin Lee
2025-07-01
safety
2507.00979v1

Reasoning as an Adaptive Defense for Safety

Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar
2025-07-01
red teaming
2507.00971v1

SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents

Siyuan Liang, Tianmeng Fang, Zhe Liu, Aishan Liu, Yan Xiao, Jinyuan He, Ee-Chien Chang, Xiaochun Cao
2025-07-01
red teaming
2507.00841v1

Linearly Decoding Refused Knowledge in Aligned Language Models

Aryan Shrivastava, Ari Holtzman
2025-06-30
red teaming
2507.00239v1

Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models

Tung-Ling Li, Hongliang Liu
2025-06-30
red teaming
2506.24056v1