Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1169 papers total

September 22 - September 28, 2025

24 papers

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
2025-09-26
red teaming, safety
2509.22067v1

"Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors

Yue Liu, Yanjie Zhao, Yunbo Lyu, Ting Zhang, Haoyu Wang, David Lo
2025-09-26
red teaming
2509.22040v1

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son
2025-09-26
safety
2509.22745v1

PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces

Jiawei Zhao, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
2025-09-26
2509.21768v1

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
2025-09-26
red teaming
2509.21761v2

Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Haibo Tong, Dongcheng Zhao, Guobin Shen, Xiang He, Dachuan Lin, Feifei Zhao, Yi Zeng
2025-09-25
red teaming, safety
2509.22732v1

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu
2025-09-25
red teaming
2509.21029v2

Can Federated Learning Safeguard Private Data in LLM Training? Vulnerabilities, Attacks, and Defense Evaluation

Wenkai Guo, Xuefeng Liu, Haolin Wang, Jianwei Niu, Shaojie Tang, Jing Yuan
2025-09-25
safety
2509.20680v1

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding
2025-09-24
red teaming
2509.20230v3

JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

Md Jueal Mia, M. Hadi Amini
2025-09-24
red teaming
2509.21401v1

SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models

Xiyu Zeng, Siyuan Liang, Liming Lu, Haotian Zhu, Enguang Liu, Jisheng Dang, Yongbin Zhou, Shuchao Pang
2025-09-24
2509.21400v1

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He
2025-09-24
red teaming
2509.19775v1

Algorithms for Adversarially Robust Deep Learning

Alexander Robey
2025-09-23
red teaming
2509.19100v1

Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis

Joachim Diederich
2025-09-23
2510.05106v2

The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind

Caleb DeLeeuw, Gaurav Chawla, Aniket Sharma, Vanessa Dietze
2025-09-23
safety
2509.20393v1

LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs

Tom Pawelek, Raj Patel, Charlotte Crowell, Noorbakhsh Amiri, Sudip Mittal, Shahram Rahimi, Andy Perkins
2025-09-23
red teaming
2509.18557v1

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
2025-09-22
red teaming, safety
2509.18058v2

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas
2025-09-22
red teaming
2509.17938v1

Design and Implementation of a Secure RAG-Enhanced AI Chatbot for Smart Tourism Customer Service: Defending Against Prompt Injection Attacks -- A Case Study of Hsinchu, Taiwan

Yu-Kai Shih, You-Kai Kang
2025-09-22
red teaming
2509.21367v1