Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 947 papers total

January 05 - January 11, 2026

24 papers

VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit

Junda Lin, Zhaomeng Zhou, Zhi Zheng, Shuochen Liu, Tong Xu, Yong Chen, Enhong Chen
2026-01-09
red teaming
2601.05755v2

VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit

Junda Lin, Zhaomeng Zhou, Zhi Zheng, Shuochen Liu, Tong Xu, Yong Chen, Enhong Chen
2026-01-09
red teaming
2601.05755v1

The Echo Chamber Multi-Turn LLM Jailbreak

Ahmad Alobaid, Martí Jordà Roca, Carlos Castillo, Joan Vendrell
2026-01-09
red teaming
2601.05742v1

PII-VisBench: Evaluating Personally Identifiable Information Safety in Vision Language Models Along a Continuum of Visibility

G M Shahariar, Zabir Al Nazi, Md Olid Hasan Bhuiyan, Zhouxing Shi
2026-01-09
2601.05739v1

Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim
2026-01-09
safety
2601.05529v2

Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim
2026-01-09
safety
2601.05529v1

Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari, Shantanu Todmal, Shreyan Mallik, Shuchi Mishra
2026-01-09
safety
2601.05504v2

Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari, Shantanu Todmal, Shreyan Malik, Shuchi Mishra
2026-01-09
safety
2601.05504v1

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang, Jiashun Wang, Fangqi Lou, Mengping Li, Dongpo Cheng, Rui Xu, Heng Lian, Shuo Zhang, Xiaolong Liang, Xiaoming Huang, Zheng Wei, Zhaowei Liu, Xin Guo, Huacan Wang, Ronghao Chen, Liwen Zhang
2026-01-09
red teaming
2601.07853v1

Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning

Zhaoqi Wang, Zijian Zhang, Daqing He, Pengtao Kou, Xin Li, Jiamou Liu, Jincheng An, Yong Liu
2026-01-09
red teaming
2601.05466v1

Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models

Songze Li, Ruishi He, Xiaojun Jia, Jun Wang, Zhihui Fu
2026-01-09
red teaming
2601.05445v1

Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models

Badhan Chandra Das, Md Tasnim Jawad, Joaquin Molto, M. Hadi Amini, Yanzhao Wu
2026-01-08
red teaming
2601.05339v1

PC²: Politically Controversial Content Generation via Jailbreaking Attacks on GPT-based Text-to-Image Models

Wonwoo Choi, Minjae Seo, Minkyoo Song, Hwanjo Heo, Seungwon Shin, Myoungsung You
2026-01-08
red teaming
2601.05150v1

From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)

Suyash Mishra, Qiang Li, Srikanth Patil, Anubhav Girdhar
2026-01-08
2601.05059v1

Defense Against Indirect Prompt Injection via Tool Result Parsing

Qiang Yu, Xinran Cheng, Chuanyi Liu
2026-01-08
red teaming
2601.04795v1

Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

Zhiyuan Chang, Mingyang Li, Yuekai Huang, Ziyou Jiang, Xiaojun Jia, Qian Xiong, Junjie Wang, Zhaoyang Li, Qing Wang
2026-01-08
2601.04666v1

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Hoagy Cunningham, Jerry Wei, Zihan Wang, Andrew Persic, Alwin Peng, Jordan Abderrachid, Raj Agarwal, Bobby Chen, Austin Cohen, Andy Dau, Alek Dimitriev, Rob Gilson, Logan Howard, Yijin Hua, Jared Kaplan, Jan Leike, Mu Lin, Christopher Liu, Vladimir Mikulik, Rohit Mittapalli, Clare O'Hara, Jin Pan, Nikhil Saxena, Alex Silverstein, Yue Song, Xunjie Yu, Giulio Zhou, Ethan Perez, Mrinank Sharma
2026-01-08
red teaming
2601.04603v1

Autonomous Agents on Blockchains: Standards, Execution Models, and Trust Boundaries

Saad Alqithami
2026-01-08
2601.04583v1

MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking

Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
2026-01-07
red teaming
2601.04389v1

SearchAttack: Red-Teaming LLMs against Real-World Threats via Framing Unsafe Web Information-Seeking Tasks

Yu Yan, Sheng Sun, Mingfeng Li, Zheming Yang, Chiwei Zhu, Fei Ma, Benfeng Xu, Min Liu
2026-01-07
red teaming
2601.04093v1

When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life

Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, Kaiyu Huang
2026-01-07
safety
2601.04043v1

HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

Siyuan Li, Xi Lin, Jun Wu, Zehao Liu, Haoyu Li, Tianjie Ju, Xiang Chen, Jianhua Li
2026-01-07
red teaming
2601.04034v1

What Matters For Safety Alignment?

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan
2026-01-07
red teaming
2601.03868v1

ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, Hanghang Tong
2026-01-07
red teaming
2601.03600v1