Philipp Zimmermann

Paper Library

Collection of AI Security research papers

235 papers total

June 30 - July 6, 2025

13 papers

Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia
2025-07-03
red teaming
2507.02799v1

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Krishna Kanth Nakka, Xue Jiang, Xuebing Zhou
2025-07-03
red teaming
2507.02332v1

MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation

Lu Yan, Zhuo Zhang, Xiangzhe Xu, Shengwei An, Guangyu Shen, Zhou Xuan, Xuan Chen, Xiangyu Zhang
2025-07-02
red teaming
2507.02057v1

SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism

Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen
2025-07-02
red teaming, safety
2507.01513v1

LLM-based Realistic Safety-Critical Driving Video Generation

Yongjie Fu, Ruijian Zha, Pei Tian, Xuan Di
2025-07-02
safety
2507.01264v1

'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Annika M Schoene, Cansu Canca
2025-07-01
red teaming
2507.02990v1

Enhancing LLM Agent Safety via Causal Influence Prompting

Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, Kimin Lee
2025-07-01
safety
2507.00979v1

Reasoning as an Adaptive Defense for Safety

Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar
2025-07-01
red teaming
2507.00971v1

SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents

Siyuan Liang, Tianmeng Fang, Zhe Liu, Aishan Liu, Yan Xiao, Jinyuan He, Ee-Chien Chang, Xiaochun Cao
2025-07-01
red teaming
2507.00841v1

Linearly Decoding Refused Knowledge in Aligned Language Models

Aryan Shrivastava, Ari Holtzman
2025-06-30
red teaming
2507.00239v1

Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models

Tung-Ling Li, Hongliang Liu
2025-06-30
red teaming
2506.24056v1

Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages

Ruhina Tabasshum Prome, Tarikul Islam Tamiti, Anomadarshi Barua
2025-06-30
red teaming
2506.23930v1

Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

Maria Carolina Cornelia Wit, Jun Pang
2025-06-30
red teaming
2506.23576v1

June 23 - June 29, 2025

11 papers

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
2025-06-28
red teaming
2506.22957v1

VERA: Variational Inference Framework for Jailbreaking Large Language Models

Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang
2025-06-27
red teaming
2506.22666v1

MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs

Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique
2025-06-27
red teaming
2506.22557v1

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Mohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim, Gunvanth Kandula, Alex Park, James C. Davis
2025-06-27
red teaming, safety
2506.21972v1

Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center

James Wen, Sahil Nalawade, Zhiwei Liang, Catherine Bielick, Marisa Ferrara Boston, Alexander Chowdhury, Adele Collin, Luigi De Angelis, Jacob Ellen, Heather Frase, Rodrigo R. Gameiro, Juan Manuel Gutierrez, Pooja Kadam, Murat Keceli, Srikanth Krishnamurthy, Anne Kwok, Yanan Lance Lu, Heather Mattie, Liam G. McCoy, Katherine Miller, Allison C. Morgan, Marlene Louisa Moerig, Trang Nguyen, Alexander Owen-Post, Alex D. Ruiz, Sreekar Reddy Puchala, Soujanya Samineni, Takeshi Tohyama, Varun Ullanat, Carmine Valenza, Camilo Velez, Pengcheng Wang, Anna Wuest, Yuxiang Zhou, Yingde Zhu, Jason M. Johnson, Naomi Lenane, Jennifer Willcox, Francis J. Vitiello, Leo Anthony G. Celi, Renato Umeton
2025-06-26
red teaming
2506.22523v2

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

Chenkai Sun, Denghui Zhang, ChengXiang Zhai, Heng Ji
2025-06-26
safety
2506.20949v1

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Muhammad Khurram Khan, Ningyu Zhang, Chaochao Chen, Meng Han
2025-06-24
safety
2506.19676v1

PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty

Jinwen He, Yiyang Lu, Zijin Lin, Kai Chen, Yue Zhao
2025-06-24
2506.19563v1

MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng
2025-06-24
red teaming
2506.19257v1

Command-V: Pasting LLM Behaviors via Activation Profiles

Barry Wang, Avi Schwarzschild, Alexander Robey, Ali Payani, Charles Fleming, Mingjie Sun, Daphne Ippolito
2025-06-23
red teaming
2506.19140v1

Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks

Xiaodong Wu, Xiangman Li, Jianbing Ni
2025-06-23
red teaming
2506.18543v1