
Paper Library

A collection of AI Security research papers

1,160 papers total

March 02 - March 08, 2026

23 papers

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren
2026-03-06
2603.05786v1

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen
2026-03-06
red teaming
2603.05773v1

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu, Yi Xie, Shiqian Zhao, Xiaofeng Chen
2026-03-06
red teaming
2603.05772v1

Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

Hiroki Fukui
2026-03-05
safety
2603.04904v1

Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob
2026-03-04
red teaming
2603.04355v1

When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG

Junchen Li, Chao Qi, Rongzheng Wang, Qizhi Chen, Liang Xu, Di Liang, Bob Simons, Shuang Liang
2026-03-04
safety
2603.03919v1

Mozi: Governed Autonomy for Drug Discovery LLM Agents

He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li, Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, Yu Li
2026-03-04
governance
2603.03655v1

Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection

Yangyang Wei, Yijie Xu, Zhenyuan Li, Xiangmin Shen, Shouling Ji
2026-03-04
red teaming
2603.04469v1

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Neha Nagaraja, Lan Zhang, Zhilong Wang, Bo Zhang, Pawan Patil
2026-03-04
red teaming
2603.03637v1

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

Neha Nagaraja, Hayretdin Bahsi
2026-03-04
red teaming
2603.03633v1

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Haochang Hao, Yifan Xu, Xinzhuo Li, Yingqiang Ge, Lu Cheng
2026-03-03
safety
2603.03536v1

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
2026-03-03
2603.03205v1

TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu
2026-03-03
red teaming
2603.03081v1

Contextualized Privacy Defense for LLM Agents

Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang
2026-03-03
safety
2603.02983v1

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Junjie Chu, Xinyue Shen, Ye Leng, Michael Backes, Yun Shen, Yang Zhang
2026-03-03
safety
2603.04459v1

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma
2026-03-03
red teaming
2603.02675v1

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng
2026-03-03
2603.02635v1

AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation

Zhulin Jiang, Zetao Li, Cheng Wang, Ziwen Wang, Chen Xiong
2026-03-03
safety
2603.02542v1

Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots

Huw Day, Adrianna Jezierska, Jessica Woodgate
2026-03-02
2603.01942v1

Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution

Guoxin Shi, Haoyu Wang, Zaihui Yang, Yuxing Wang, Yongzhe Chang
2026-03-02
red teaming
2603.01784v1

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Xiaoyi Pang, Xuanyi Hao, Pengyu Liu, Qi Luo, Song Guo, Zhibo Wang
2026-03-02
red teaming
2603.01574v1

Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision

Manisha Mukherjee, Vincent J. Hellendoorn
2026-03-02
safety
2603.01494v1

Jailbreaking Embodied LLMs via Action-level Manipulation

Xinyu Huang, Qiang Yang, Leming Shen, Zijing Ma, Yuanqing Zheng
2026-03-02
red teaming
2603.01414v1

February 23 - March 01, 2026

1 paper