Philipp Zimmermann

Paper Library

A collection of AI Security research papers

Showing 1329 papers total

February 02 - February 08, 2026

3 papers

January 26 - February 01, 2026

19 papers

Context Dependence and Reliability in Autoregressive Language Models

Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen
2026-02-01
2602.01378v1

Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

Eliron Rahimi, Elad Hirshel, Rom Himelstein, Amit LeVi, Avi Mendelson, Chaim Baskin
2026-02-01
2602.02600v1

SMCP: Secure Model Context Protocol

Xinyi Hou, Shenao Wang, Yifan Zhang, Ziluo Xue, Yanjie Zhao, Cai Fu, Haoyu Wang
2026-02-01
2602.01129v1

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang
2026-02-01
red teaming
2602.01025v1

A Causal Perspective for Enhancing Jailbreak Attack and Defense

Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren
2026-01-31
red teaming
2602.04893v1

Bypassing Prompt Injection Detectors through Evasive Injections

Md Jahedur Rahman, Ihsen Alouani
2026-01-31
red teaming
2602.00750v1

Jailbreaking LLMs via Calibration

Yuxuan Lu, Yongkang Guo, Yuqing Kong
2026-01-31
red teaming
2602.00619v1

Text is All You Need for Vision-Language Model Jailbreaking

Yihang Chen, Zhao Xu, Youyuan Jiang, Tianle Zheng, Cho-Jui Hsieh
2026-01-31
red teaming
2602.00420v1

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu
2026-01-30
red teaming, safety
2602.00388v1

Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

Ye Yu, Haibo Jin, Yaoning Yu, Jun Zhuang, Haohan Wang
2026-01-30
red teaming
2601.23255v1

Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, Jie Zhang
2026-01-30
red teaming
2601.23081v1

The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models

Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, Adel Bibi
2026-01-30
red teaming
2602.02557v1

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao
2026-01-30
red teaming
2601.22636v1

Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection

Tanusree Debi, Wentian Zhu
2026-01-30
red teaming
2601.22569v1

FraudShield: Knowledge Graph Empowered Defense for LLMs against Fraud Attacks

Naen Xu, Jinghuai Zhang, Ping He, Chunyi Zhou, Jun Wang, Zhihui Fu, Tianyu Du, Zhaoxiang Wang, Shouling Ji
2026-01-30
safety
2601.22485v1

Jailbreaks on Vision Language Model via Multimodal Reasoning

Aarush Noheria, Yuguang Yao
2026-01-29
red teaming
2601.22398v1

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy
2026-01-29
red teaming
2601.22313v1

A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy

Pedro H. Barcha Correia, Ryan W. Achjian, Diego E. G. Caetano de Oliveira, Ygor Acacio Maria, Victor Takashi Hayashi, Marcos Lopes, Charles Christian Miers, Marcos A. Simplicio
2026-01-29
safety
2601.22240v1

OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Jarrod Barnes
2026-01-28
red teaming
2601.21083v3
