Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1,169 papers total

March 09 - March 15, 2026

2 papers

March 02 - March 08, 2026

22 papers

DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang
2026-03-08
2603.07835v1

VoiceSHIELD-Small: Real-Time Malicious Speech Detection and Transcription

Sumit Ranjan, Sugandha Sharma, Ubaid Abbas, Puneeth N Ail
2026-03-08
2603.07708v1

Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

Chenxi Li, Xianggan Liu, Dake Shen, Yaosong Du, Zhibo Yao, Hao Jiang, Linyi Jiang, Chengwei Cao, Jingzhe Zhang, RanYi Peng, Peiling Bai, Xiande Huang
2026-03-08
red teaming
2603.07590v1

Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice

Yuxu Ge
2026-03-07
red teaming
2603.07191v1

Two Frames Matter: A Temporal Attack for Text-to-Video Model Jailbreaking

Moyang Chen, Zonghao Ying, Wenzhuo Xu, Quancheng Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang
2026-03-07
red teaming
2603.07028v1

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren
2026-03-06
2603.05786v1

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen
2026-03-06
red teaming
2603.05773v1

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu, Yi Xie, Shiqian Zhao, Xiaofeng Chen
2026-03-06
red teaming
2603.05772v1

Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

Hiroki Fukui
2026-03-05
safety
2603.04904v1

Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob
2026-03-04
red teaming
2603.04355v1

When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG

Junchen Li, Chao Qi, Rongzheng Wang, Qizhi Chen, Liang Xu, Di Liang, Bob Simons, Shuang Liang
2026-03-04
safety
2603.03919v1

Mozi: Governed Autonomy for Drug Discovery LLM Agents

He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li, Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, Yu Li
2026-03-04
governance
2603.03655v1

Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection

Yangyang Wei, Yijie Xu, Zhenyuan Li, Xiangmin Shen, Shouling Ji
2026-03-04
red teaming
2603.04469v1

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Neha Nagaraja, Lan Zhang, Zhilong Wang, Bo Zhang, Pawan Patil
2026-03-04
red teaming
2603.03637v1

Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

Neha Nagaraja, Hayretdin Bahsi
2026-03-04
red teaming
2603.03633v1

SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Haochang Hao, Yifan Xu, Xinzhuo Li, Yingqiang Ge, Lu Cheng
2026-03-03
safety
2603.03536v1

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
2026-03-03
2603.03205v1

TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu
2026-03-03
red teaming
2603.03081v1

Contextualized Privacy Defense for LLM Agents

Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang
2026-03-03
safety
2603.02983v1

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Junjie Chu, Xinyue Shen, Ye Leng, Michael Backes, Yun Shen, Yang Zhang
2026-03-03
safety
2603.04459v1

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma
2026-03-03
red teaming
2603.02675v1

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng
2026-03-03
2603.02635v1