Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1331 papers total

April 06 - April 12, 2026

24 papers

PIArena: A Platform for Prompt Injection Evaluation

Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen, Jinyuan Jia
2026-04-09
red teaming
2604.08499v1

From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis

Juergen Dietrich
2026-04-09
safety
2604.08465v1

Security Concerns in Generative AI Coding Assistants: Insights from Online Discussions on GitHub Copilot

Nicolás E. Díaz Ferreyra, Monika Swetha Gurupathi, Zadia Codabux, Nalin Arachchilage, Riccardo Scandariato
2026-04-09
red teaming
2604.08352v1

Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

Wenpeng Xing, Moran Fang, Guangtai Wang, Changting Lin, Meng Han
2026-04-09
red teaming
2604.07835v1

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

Wenkui Yang, Chao Jin, Haisu Zhu, Weilin Luo, Derek Yuen, Kun Shao, Huaibo Huang, Junxian Duan, Jie Cao, Ran He
2026-04-09
red teaming
2604.07831v1

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

Cheng Liu, Xiaolei Liu, Xingyu Li, Bangzhou Xin, Kangyi Ding
2026-04-09
red teaming
2604.07727v1

ADAG: Automatically Describing Attribution Graphs

Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann
2026-04-08
2604.07615v1

TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation

Hengkai Ye, Zhechang Zhang, Jinyuan Jia, Hong Hu
2026-04-08
red teaming
2604.07536v1

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen
2026-04-08
2604.07223v1

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv, Bo Li, Jun Gao, Jianing Zhang, Chunfeng Yuan, Bing Li, Weiming Hu
2026-04-08
red teaming
2604.06950v2

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv, Bo Li, Jun Gao, Jianing Zhang, Chunfeng Yuan, Bing Li, Weiming Hu
2026-04-08
red teaming
2604.06950v1

A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM

Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang, Yuan Tian, Yi Chang
2026-04-08
2604.06666v1

SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

Yinghan Hou, Zongyou Yang
2026-04-08
2604.06550v1

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Joel Webb, Blake Gatto
2026-04-07
2604.06436v2

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto
2026-04-07
2604.06436v1

Exclusive Unlearning

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma
2026-04-07
2604.06154v1

Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu
2026-04-07
red teaming
2604.05853v2

Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu
2026-04-07
red teaming
2604.05853v1

JailWAM: Jailbreaking World Action Models in Robot Control

Hanqing Liu, Songping Wang, Jiahuan Long, Jiacheng Hou, Jialiang Sun, Chao Li, Yang Yang, Wei Peng, Xu Liu, Tingsong Jiang, Wen Yao, Yao Mu
2026-04-07
2604.05498v1

Pre-Execution Safety Gate & Task Safety Contracts for LLM-Controlled Robot Systems

Ike Obi, Vishnunandan L. N. Venkatesh, Weizheng Wang, Ruiqi Wang, Dayoon Suh, Temitope I. Amosa, Wonse Jo, Byung-Cheol Min
2026-04-07
safety
2604.05427v1

Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Purva Chiniya, Kevin Scaria, Sagar Chaturvedi
2026-04-06
red teaming
2604.05179v1

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee
2026-04-06
safety
2604.05172v2

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee
2026-04-06
safety
2604.05172v1

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

Geert Trooskens, Aaron Karlsberg, Anmol Sharma, Lamara De Brouwer, Max Van Puyvelde, Matthew Young, John Thickstun, Gil Alterovitz, Walter A. De Brouwer
2026-04-06
2604.05150v1