Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1168 papers total

February 09 - February 15, 2026

1 paper

February 02 - February 08, 2026

22 papers

Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

Md Rafi Ur Rashid, MD Sadik Hossain Shanto, Vishnu Asutosh Dasu, Shagufta Mehnaz
2026-02-08
red teaming
2602.08136v1

Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini
2026-02-08
red teaming
2602.08062v1

Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

Krzysztof Wróbel, Jan Maria Kowalski, Jerzy Surma, Igor Ciuciura, Maciej Szymański
2026-02-08
safety
2602.07954v2

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, Tomas Pfister
2026-02-08
red teaming
2602.07918v1

AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

Ruoyao Wen, Hao Li, Chaowei Xiao, Ning Zhang
2026-02-07
red teaming
2602.07398v1

When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
2026-02-07
2602.07381v1

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

Yonghui Yang, Wenjian Tao, Jilong Liu, Xingyu Zhu, Junfeng Fang, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua
2026-02-07
safety
2602.07340v1

Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook

Yunbei Zhang, Kai Mei, Ming Liu, Janet Wang, Dimitris N. Metaxas, Xiao Wang, Jihun Hamm, Yingqiang Ge
2026-02-07
red teaming
2602.13284v1

ShallowJail: Steering Jailbreaks against Large Language Models

Shang Liu, Hanyu Pei, Zeyan Liu
2026-02-06
red teaming
2602.07107v1

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Matthew Kowal, Nayeema Nonta, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla
2026-02-06
red teaming
2602.06911v1

Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations

Chen Chen, Yuchen Sun, Jiaxin Gao, Yanwen Jia, Xueluan Gong, Qian Wang, Kwok-Yan Lam
2026-02-06
safety
2602.06887v1

Extended to Reality: Prompt Injection in 3D Environments

Zhuoheng Li, Ying Chen
2026-02-06
red teaming
2602.07104v1

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao
2026-02-06
red teaming
2602.06854v1

compar:IA: The French Government's LLM arena to collect French-language human prompts and preference data

Lucie Termignon, Simonas Zilinskas, Hadrien Pélissier, Aurélien Barrot, Nicolas Chesnais, Elie Gavoty
2026-02-06
2602.06669v1

Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang
2026-02-06
safety
2602.06650v1

TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking

Mengyao Du, Han Fang, Haokai Ma, Gang Yang, Quanjun Yin, Shouling Ji, Ee-Chien Chang
2026-02-06
red teaming
2602.06630v1

Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention

Himanshu Singh, Ziwei Xu, A. V. Subramanyam, Mohan Kankanhalli
2026-02-06
safety
2602.06623v1

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao, Weiyue Li, Mengyu Wang
2026-02-06
red teaming
2602.06440v1

MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs

Junhyeok Lee, Han Jang, Kyu Sung Choi
2026-02-06
red teaming
2602.06268v1

Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions

Navita Goyal, Hal Daumé III
2026-02-05
red teaming
2602.06256v1

Agent2Agent Threats in Safety-Critical LLM Assistants: A Human-Centric Taxonomy

Lukas Stappen, Ahmet Erkan Turan, Johann Hagerer, Georg Groh
2026-02-05
safety
2602.05877v1

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, Florian Tramèr
2026-02-05
red teaming
2602.05746v1