Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1160 papers total

February 23 – March 1, 2026

24 papers

Tracking Capabilities for Safer Agents

Martin Odersky, Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham
2026-03-01
2603.00991v1

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

Yilian Liu, Xiaojun Jia, Guoshun Nan, Jiuyang Lyu, Zhican Chen, Tao Guan, Shuyuan Luo, Zhongyi Zhai, Yang Liu
2026-02-28
red teaming
2603.00565v1

From Goals to Aspects, Revisited: An NFR Pattern Language for Agentic AI Systems

Yijun Yu
2026-02-28
2603.00472v1

SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems

Jialiang Fan, Weizhe Xu, Mengyu Liu, Oleg Sokolsky, Insup Lee, Fangxin Kong
2026-02-27
safety
2602.24235v1

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu
2026-02-27
red teaming
2602.24009v3

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu
2026-02-27
red teaming
2602.24009v2

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu
2026-02-27
red teaming
2602.24009v1

SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang
2026-02-27
2602.23956v1

LiaisonAgent: An Multi-Agent Framework for Autonomous Risk Investigation and Governance

Chuanming Tang, Ling Qing, Shifeng Chen
2026-02-27
2603.00200v1

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng, Fei Yang, Yang Liu, Xiaojun Jia
2026-02-26
red teaming
2602.22983v2

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng, Fei Yang, Yang Liu, Xiaojun Jia
2026-02-26
red teaming
2602.22983v1

AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, Hongxin Hu
2026-02-26
red teaming
2602.22724v1

Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

Marcus Graves
2026-02-26
red teaming
2603.00164v1

CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Umid Suleymanov, Rufiz Bayramov, Suad Gafarli, Seljan Musayeva, Taghi Mammadov, Aynur Akhundlu, Murat Kantarcioglu
2026-02-26
safety
2602.22557v1

Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace

Qianlong Lan, Anuj Kaul, Shaun Jones, Stephanie Westrum
2026-02-25
red teaming
2602.22450v1

A Multi-Turn Framework for Evaluating AI Misuse in Fraud and Cybercrime Scenarios

Kimberly T. Mai, Anna Gausen, Magda Dubois, Mona Murad, Bessie O'Dell, Nadine Staes-Polet, Christopher Summerfield, Andrew Strait
2026-02-25
red teaming
2602.21831v2

A Multi-Turn Framework for Evaluating AI Misuse in Fraud and Cybercrime Scenarios

Kimberly T. Mai, Anna Gausen, Magda Dubois, Mona Murad, Bessie O'Dell, Nadine Staes-Polet, Christopher Summerfield, Andrew Strait
2026-02-25
2602.21831v1

Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Nazanin Mohammadi Sepahvand, Eleni Triantafillou, Hugo Larochelle, Doina Precup, Daniel M. Roy, Gintare Karolina Dziugaite
2026-02-24
red teaming
2602.23391v1

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel, Daben Liu
2026-02-24
red teaming
2602.21346v1

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You
2026-02-24
red teaming
2602.20999v2

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You
2026-02-24
red teaming
2602.20999v1

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, Guangsheng Yu
2026-02-24
red teaming
2602.20867v1

Analysis of LLMs Against Prompt Injection and Jailbreak Attacks

Piyush Jaiswal, Aaditya Pratap, Shreyansh Saraswati, Harsh Kasyap, Somanath Tripathy
2026-02-24
red teaming
2602.22242v1

AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang, Jianbo Gao, Tao Wei, Zhong Chen, Wei Yang Bryan Lim
2026-02-24
red teaming
2602.20720v1