Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1331 papers total

February 02 - February 08, 2026

24 papers

When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
2026-02-07
2602.07381v1

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

Yonghui Yang, Wenjian Tao, Jilong Liu, Xingyu Zhu, Junfeng Fang, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua
2026-02-07
safety
2602.07340v1

Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook

Yunbei Zhang, Kai Mei, Ming Liu, Janet Wang, Dimitris N. Metaxas, Xiao Wang, Jihun Hamm, Yingqiang Ge
2026-02-07
red teaming
2602.13284v1

ShallowJail: Steering Jailbreaks against Large Language Models

Shang Liu, Hanyu Pei, Zeyan Liu
2026-02-06
red teaming
2602.07107v1

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Matthew Kowal, Nayeema Nonta, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla
2026-02-06
red teaming
2602.06911v1

Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations

Chen Chen, Yuchen Sun, Jiaxin Gao, Yanwen Jia, Xueluan Gong, Qian Wang, Kwok-Yan Lam
2026-02-06
safety
2602.06887v1

Extended to Reality: Prompt Injection in 3D Environments

Zhuoheng Li, Ying Chen
2026-02-06
red teaming
2602.07104v1

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao
2026-02-06
red teaming
2602.06854v1

compar:IA: The French Government's LLM arena to collect French-language human prompts and preference data

Lucie Termignon, Simonas Zilinskas, Hadrien Pélissier, Aurélien Barrot, Nicolas Chesnais, Elie Gavoty
2026-02-06
2602.06669v1

Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang
2026-02-06
safety
2602.06650v1

TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking

Mengyao Du, Han Fang, Haokai Ma, Gang Yang, Quanjun Yin, Shouling Ji, Ee-Chien Chang
2026-02-06
red teaming
2602.06630v1

Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention

Himanshu Singh, Ziwei Xu, A. V. Subramanyam, Mohan Kankanhalli
2026-02-06
safety
2602.06623v1

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao, Weiyue Li, Mengyu Wang
2026-02-06
red teaming
2602.06440v1

MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs

Junhyeok Lee, Han Jang, Kyu Sung Choi
2026-02-06
red teaming
2602.06268v1

Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions

Navita Goyal, Hal Daumé
2026-02-05
red teaming
2602.06256v1

Agent2Agent Threats in Safety-Critical LLM Assistants: A Human-Centric Taxonomy

Lukas Stappen, Ahmet Erkan Turan, Johann Hagerer, Georg Groh
2026-02-05
safety
2602.05877v1

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, Florian Tramer
2026-02-05
red teaming
2602.05746v1

Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification

Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing
2026-02-05
red teaming
2602.05535v1

Clouding the Mirror: Stealthy Prompt Injection Attacks Targeting LLM-based Phishing Detection

Takashi Koide, Hiroki Nakano, Daiki Chiba
2026-02-05
red teaming
2602.05484v1

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong
2026-02-05
red teaming
2602.05444v2

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong
2026-02-05
red teaming
2602.05444v1

Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks

Guangwei Zhang, Jianing Zhu, Cheng Qian, Neil Gong, Rada Mihalcea, Zhaozhuo Xu, Jingrui He, Jiaqi Ma, Yun Huang, Chaowei Xiao, Bo Li, Ahmed Abbasi, Dongwon Lee, Heng Ji, Denghui Zhang
2026-02-05
2602.05252v2

Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks

Guangwei Zhang, Jianing Zhu, Cheng Qian, Neil Gong, Rada Mihalcea, Zhaozhuo Xu, Jingrui He, Jiaqi Ma, Yun Huang, Chaowei Xiao, Bo Li, Ahmed Abbasi, Dongwon Lee, Heng Ji, Denghui Zhang
2026-02-05
2602.05252v1

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Jafar Isbarov, Murat Kantarcioglu
2026-02-04
red teaming
2602.05066v2