Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

September 22 - September 28, 2025

16 papers

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
2025-09-26
red teaming
2509.21761v1

Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Haibo Tong, Dongcheng Zhao, Guobin Shen, Xiang He, Dachuan Lin, Feifei Zhao, Yi Zeng
2025-09-25
red teaming safety
2509.22732v1

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu
2025-09-25
red teaming
2509.21029v2

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu
2025-09-25
red teaming
2509.21029v1

Can Federated Learning Safeguard Private Data in LLM Training? Vulnerabilities, Attacks, and Defense Evaluation

Wenkai Guo, Xuefeng Liu, Haolin Wang, Jianwei Niu, Shaojie Tang, Jing Yuan
2025-09-25
safety
2509.20680v1

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding
2025-09-24
red teaming
2509.20230v3

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding
2025-09-24
2509.20230v2

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding
2025-09-24
red teaming
2509.20230v1

JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

Md Jueal Mia, M. Hadi Amini
2025-09-24
red teaming
2509.21401v1

SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models

Xiyu Zeng, Siyuan Liang, Liming Lu, Haotian Zhu, Enguang Liu, Jisheng Dang, Yongbin Zhou, Shuchao Pang
2025-09-24
2509.21400v1

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He
2025-09-24
red teaming
2509.19775v1

Algorithms for Adversarially Robust Deep Learning

Alexander Robey
2025-09-23
red teaming
2509.19100v1

The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind

Caleb DeLeeuw, Gaurav Chawla, Aniket Sharma, Vanessa Dietze
2025-09-23
safety
2509.20393v1

LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs

Tom Pawelek, Raj Patel, Charlotte Crowell, Noorbakhsh Amiri, Sudip Mittal, Shahram Rahimi, Andy Perkins
2025-09-23
red teaming
2509.18557v1

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
2025-09-22
red teaming safety
2509.18058v2

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
2025-09-22
red teaming safety
2509.18058v1

September 15 - September 21, 2025

8 papers

DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems

Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua
2025-09-21
red teaming
2509.16870v1

AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua
2025-09-21
red teaming safety
2509.16861v1

AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans

Wei Xie, Shuoyoucheng Ma, Zhenhua Wang, Enze Wang, Kai Chen, Xiaobing Sun, Baosheng Wang
2025-09-20
2509.16530v1

Domain-Specific Constitutional AI: Enhancing Safety in LLM-Powered Mental Health Chatbots

Chenhan Lyu, Yutong Song, Pengfei Zhang, Amir M. Rahmani
2025-09-19
safety
2509.16444v1

Randomized Smoothing Meets Vision-Language Models

Emmanouil Seferis, Changshun Wu, Stefanos Kollias, Saddek Bensalem, Chih-Hong Cheng
2025-09-19
red teaming
2509.16088v1

SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection

Maithili Joshi, Palash Nandi, Tanmoy Chakraborty
2025-09-19
red teaming
2509.16060v1

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction

Yuanbo Xie, Yingjie Zhang, Tianyun Liu, Duohe Ma, Tingwen Liu
2025-09-18
red teaming safety
2509.15202v1

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee
2025-09-18
safety
2509.15260v2