Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 812 papers total

October 20 - October 26, 2025

24 papers

NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation of LLM via Genetic Optimization of External Knowledge

Hanyu Zhu, Lance Fiondella, Jiawei Yuan, Kai Zeng, Long Jiao
2025-10-24
red teaming
2510.21144v1

Soft Instruction De-escalation Defense

Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, Ilia Shumailov
2025-10-24
red teaming
2510.21057v1

Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training

Zheng-Xin Yong, Stephen H. Bach
2025-10-23
red teaming
2510.20956v1

Preventing Shortcuts in Adapter Training via Providing the Shortcuts

Anujraaj Argo Goyal, Guocheng Gordon Qian, Huseyin Coskun, Aarush Gupta, Himmy Tam, Daniil Ostashev, Ju Hu, Dhritiman Sagar, Sergey Tulyakov, Kfir Aberman, Kuan-Chieh Jackson Wang
2025-10-23
2510.20887v1

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
2025-10-23
2510.20333v2

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
2025-10-23
2510.20333v1

Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations

Divyanshu Kumar, Shreyas Jena, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
2025-10-23
red teaming
2510.20223v1

SAID: Empowering Large Language Models with Self-Activating Internal Defense

Yulong Chen, Yadong Liu, Jiawen Zhang, Mu Li, Chao Huang, Jie Wen
2025-10-23
red teaming
2510.20129v1

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia
2025-10-22
2510.19807v1

RLIE: Rule Generation with Logistic Regression, Iterative Refinement, and Evaluation for Large Language Models

Yang Yang, Hua XU, Zhangyi Hu, Yutao Yue
2025-10-22
2510.19698v1

SORA-ATMAS: Adaptive Trust Management and Multi-LLM Aligned Governance for Future Smart Cities

Usama Antuley, Shahbaz Siddiqui, Sufian Hameed, Waqas Arif, Subhan Shah, Syed Attique Shah
2025-10-22
governance
2510.19327v1

Defending Against Prompt Injection with DataFilter

Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, David Wagner
2025-10-22
red teaming
2510.19207v1

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

Thomas Wang, Haowen Li
2025-10-22
red teaming
2510.19169v2

OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform

Thomas Wang, Haowen Li
2025-10-22
red teaming
2510.19169v1

Evidence of Energy Injection in the Short and Distant GRB 250221A

Camila Angulo-Valdez, Rosa L. Becerra, Ramandeep Gill, Noémie Globus, William H. Lee, Diego López-Cámara, Cassidy Mihalenko, Enrique Moreno-Méndez, Roberto Ricci, Karelle Siellez, Alan M. Watson, Muskan Yadav, Yu-han Yang, Dalya Akl, Sarah Antier, Jean-Luc Atteia, Stéphane Basa, Nathaniel R. Butler, Simone Dichiara, Damien Dornic, Jean-Grégoire Ducoin, Francis Fortin, Leonardo García-García, Kin Ocelotl López, Francesco Magnani, Brendan O'Connor, Margarita Pereyra, Ny Avo Rakotondrainibe, Fredd Sánchez-Álvarez, Benjamin Schneider, Eleonora Troja, Antonio de Ugarte Postigo
2025-10-21
2510.19132v4

Steering Autoregressive Music Generation with Recursive Feature Machines

Daniel Zhao, Daniel Beaglehole, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack
2025-10-21
2510.19127v1

HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models

Sidhant Narula, Javad Rafiei Asl, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
2025-10-21
red teaming
2510.18728v1

Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev
2025-10-21
red teaming
2510.18541v1

SegTune: Structured and Fine-Grained Control for Song Generation

Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan
2025-10-21
2510.18416v1

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu
2025-10-20
safety
2510.18081v1

CourtGuard: A Local, Multiagent Prompt Injection Classifier

Isaac Wu, Michael Maslowski
2025-10-20
red teaming
2510.19844v1

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar
2025-10-20
red teaming
2510.17947v2

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar
2025-10-20
red teaming
2510.17947v1

VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

Qilin Liao, Anamika Lochab, Ruqi Zhang
2025-10-20
red teaming
2510.17759v1