Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 773 papers total

October 20 - October 26, 2025

24 papers

Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, Ruoxi Jia
2025-10-24
red teaming
2510.21910v1

FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models

Zihao Fu, Ryan Brown, Shun Shao, Kai Rawal, Eoin Delaney, Chris Russell
2025-10-24
2510.21363v1

When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails

Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
2025-10-24
red teaming
2510.21285v2

When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails

Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
2025-10-24
red teaming
2510.21285v1

Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses

Xingwei Zhong, Kar Wai Fok, Vrizlynn L. L. Thing
2025-10-24
red teaming
2510.21214v1

The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning

Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam
2025-10-24
red teaming
2510.21190v1

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency

Yukun Jiang, Mingjie Li, Michael Backes, Yang Zhang
2025-10-24
red teaming
2510.21189v1

NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation of LLM via Genetic Optimization of External Knowledge

Hanyu Zhu, Lance Fiondella, Jiawei Yuan, Kai Zeng, Long Jiao
2025-10-24
red teaming
2510.21144v1

Soft Instruction De-escalation Defense

Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, Ilia Shumailov
2025-10-24
red teaming
2510.21057v1

Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training

Zheng-Xin Yong, Stephen H. Bach
2025-10-23
red teaming
2510.20956v1

Preventing Shortcuts in Adapter Training via Providing the Shortcuts

Anujraaj Argo Goyal, Guocheng Gordon Qian, Huseyin Coskun, Aarush Gupta, Himmy Tam, Daniil Ostashev, Ju Hu, Dhritiman Sagar, Sergey Tulyakov, Kfir Aberman, Kuan-Chieh Jackson Wang
2025-10-23
2510.20887v1

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
2025-10-23
2510.20333v2

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
2025-10-23
2510.20333v1

Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations

Divyanshu Kumar, Shreyas Jena, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
2025-10-23
red teaming
2510.20223v1

SAID: Empowering Large Language Models with Self-Activating Internal Defense

Yulong Chen, Yadong Liu, Jiawen Zhang, Mu Li, Chao Huang, Jie Wen
2025-10-23
red teaming
2510.20129v1

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia
2025-10-22
2510.19807v1

RLIE: Rule Generation with Logistic Regression, Iterative Refinement, and Evaluation for Large Language Models

Yang Yang, Hua Xu, Zhangyi Hu, Yutao Yue
2025-10-22
2510.19698v1

SORA-ATMAS: Adaptive Trust Management and Multi-LLM Aligned Governance for Future Smart Cities

Usama Antuley, Shahbaz Siddiqui, Sufian Hameed, Waqas Arif, Subhan Shah, Syed Attique Shah
2025-10-22
governance
2510.19327v1

Defending Against Prompt Injection with DataFilter

Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, David Wagner
2025-10-22
red teaming
2510.19207v1

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

Thomas Wang, Haowen Li
2025-10-22
red teaming
2510.19169v2

OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform

Thomas Wang, Haowen Li
2025-10-22
red teaming
2510.19169v1

Evidence of Energy Injection in the Short and Distant GRB 250221A

Camila Angulo-Valdez, Rosa L. Becerra, Ramandeep Gill, Noémie Globus, William H. Lee, Diego López-Cámara, Cassidy Mihalenko, Enrique Moreno-Méndez, Roberto Ricci, Karelle Siellez, Alan M. Watson, Muskan Yadav, Yu-han Yang, Dalya Akl, Sarah Antier, Jean-Luc Atteia, Stéphane Basa, Nathaniel R. Butler, Simone Dichiara, Damien Dornic, Jean-Grégoire Ducoin, Francis Fortin, Leonardo García-García, Kin Ocelotl López, Francesco Magnani, Brendan O'Connor, Margarita Pereyra, Ny Avo Rakotondrainibe, Fredd Sánchez-Álvarez, Benjamin Schneider, Eleonora Troja, Antonio de Ugarte Postigo
2025-10-21
2510.19132v4

Steering Autoregressive Music Generation with Recursive Feature Machines

Daniel Zhao, Daniel Beaglehole, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack
2025-10-21
2510.19127v1

HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models

Sidhant Narula, Javad Rafiei Asl, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
2025-10-21
red teaming
2510.18728v1