Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

March 16 - March 22, 2026

1 paper

March 09 - March 15, 2026

23 papers

Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty
2026-03-15
red teaming
2603.14355v1

State-Dependent Safety Failures in Multi-Turn Language Model Interaction

Pengcheng Li, Jie Zhang, Tianwei Zhang, Han Qiu, Zhang Kejun, Weiming Zhang, Nenghai Yu, Wenbo Zhou
2026-03-15
2603.15684v1

Activation Surgery: Jailbreaking White-box LLMs without Touching the Prompt

Maël Jenny, Jérémie Dentan, Sonia Vanier, Michaël Krajecki
2026-03-15
red teaming
2603.14278v1

Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

Chongxin Li, Hanzhang Wang, Lian Duan
2026-03-15
red teaming
2603.14219v1

Relationship-Aware Safety Unlearning for Multimodal LLMs

Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo
2026-03-15
safety
2603.14185v2

Relationship-Aware Safety Unlearning for Multimodal LLMs

Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo
2026-03-15
safety
2603.14185v1

Sirens' Whisper: Inaudible Near-Ultrasonic Jailbreaks of Speech-Driven LLMs

Zijian Ling, Pingyi Hu, Xiuyong Gao, Xiaojing Ma, Man Zhou, Jun Feng, Songfeng Lu, Dongmei Zhang, Bin Benjamin Zhu
2026-03-14
red teaming
2603.13847v1

LLM Constitutional Multi-Agent Governance

J. de Curtò, I. de Zarzà
2026-03-13
governance
2603.13189v1

PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses

Chenlong Yin, Runpeng Geng, Yanting Wang, Jinyuan Jia
2026-03-13
red teaming
2603.13026v1

Colluding LoRA: A Composite Attack on LLM Safety Alignment

Sihao Ding
2026-03-13
safety
2603.12681v1

Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw

Zonghao Ying, Xiao Yang, Siyang Wu, Yumeng Song, Yang Qu, Hainan Li, Tianlin Li, Jiakai Wang, Aishan Liu, Xianglong Liu
2026-03-13
red teaming
2603.12644v1

Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection

Darren Cheng, Wen-Kwang Tsao
2026-03-13
2603.13424v1

Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache

Xinhai Wang, Shaopeng Fu, Shu Yang, Liangyu Wang, Tianhang Zheng, Di Wang
2026-03-12
red teaming
2603.13420v1

Security Considerations for Artificial Intelligence Agents

Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma
2026-03-12
red teaming
2603.12230v2

Security Considerations for Artificial Intelligence Agents

Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma
2026-03-12
red teaming
2603.12230v1

Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

Sarbartha Banerjee, Prateek Sahu, Anjo Vahldiek-Oberwagner, Jose Sanchez Vicarte, Mohit Tiwari
2026-03-12
red teaming
2603.12023v1

The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection

J Alex Corll
2026-03-12
2603.11875v1

OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents

Frank Li
2026-03-12
safety
2603.11853v1

Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework

Chingkwun Lam, Jiaxin Li, Lingfei Zhang, Kuo Zhao
2026-03-12
safety, governance
2603.11768v1

Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats

Xinhao Deng, Yixiang Zhang, Jiaqing Wu, Jiaqi Bai, Sibo Yi, Zhuoheng Zou, Yue Xiao, Rennai Qiu, Jianan Ma, Jialuo Chen, Xiaohu Du, Xiaofang Yang, Shiwen Cui, Changhua Meng, Weiqiang Wang, Jiaxing Song, Ke Xu, Qi Li
2026-03-12
red teaming
2603.11619v1

Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning

Seung hee Choi, MinJu Jeon, Hyunwoo Oh, Jihwan Lee, Dong-Jin Kim
2026-03-12
2603.11460v2

Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning

Seung hee Choi, MinJu Jeon, Hyunwoo Oh, Jihwan Lee, Dong-Jin Kim
2026-03-12
2603.11460v1

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue, Zimo Qi, Guangliang Liu, Bocheng Chen, Ramtin Pedarsani
2026-03-12
2603.11388v1