Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1169 papers total

February 02 - February 08, 2026

9 papers

vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM

Ching-Yun Ko, Pin-Yu Chen
2026-02-02
2602.06588v1

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

Zeming Wei, Zhixin Zhang, Chengcan Wu, Yihao Zhang, Xiaokun Luan, Meng Sun
2026-02-02
red teaming, safety
2602.02280v1

Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron

Sicheng Shen, Mingyang Lv, Han Shen, Jialin Wu, Binghao Wang, Zhou Yang, Guobin Shen, Dongcheng Zhao, Feifei Zhao, Yi Zeng
2026-02-02
safety
2602.02027v1

Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework

Alsharif Abuadbba, Nazatul Sultan, Surya Nepal, Sanjay Jha
2026-02-02
2602.01942v1

Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models

Siqi Wen, Shu Yang, Shaopeng Fu, Jingfeng Zhang, Lijie Hu, Di Wang
2026-02-02
2602.01834v1

RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse

Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok-Yan Lam
2026-02-02
red teaming
2602.01795v1

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu, Yun-Nung Chen
2026-02-02
red teaming, safety
2602.01600v1

Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment

Zehua Cheng, Jianwei Yang, Wei Dai, Jiahao Sun
2026-02-02
red teaming, safety
2602.01587v1

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
2026-02-02
safety
2602.01539v2

January 26 - February 01, 2026

14 papers

Context Dependence and Reliability in Autoregressive Language Models

Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen
2026-02-01
2602.01378v1

Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

Eliron Rahimi, Elad Hirshel, Rom Himelstein, Amit LeVi, Avi Mendelson, Chaim Baskin
2026-02-01
2602.02600v1

SMCP: Secure Model Context Protocol

Xinyi Hou, Shenao Wang, Yifan Zhang, Ziluo Xue, Yanjie Zhao, Cai Fu, Haoyu Wang
2026-02-01
2602.01129v1

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang
2026-02-01
red teaming
2602.01025v1

A Causal Perspective for Enhancing Jailbreak Attack and Defense

Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren
2026-01-31
red teaming
2602.04893v1

Bypassing Prompt Injection Detectors through Evasive Injections

Md Jahedur Rahman, Ihsen Alouani
2026-01-31
red teaming
2602.00750v1

Jailbreaking LLMs via Calibration

Yuxuan Lu, Yongkang Guo, Yuqing Kong
2026-01-31
red teaming
2602.00619v1

Text is All You Need for Vision-Language Model Jailbreaking

Yihang Chen, Zhao Xu, Youyuan Jiang, Tianle Zheng, Cho-Jui Hsieh
2026-01-31
red teaming
2602.00420v1

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu
2026-01-30
red teaming, safety
2602.00388v1

Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

Ye Yu, Haibo Jin, Yaoning Yu, Jun Zhuang, Haohan Wang
2026-01-30
red teaming
2601.23255v1

Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, Jie Zhang
2026-01-30
red teaming
2601.23081v1

The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models

Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, Adel Bibi
2026-01-30
red teaming
2602.02557v1

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao
2026-01-30
red teaming
2601.22636v1

Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection

Tanusree Debi, Wentian Zhu
2026-01-30
red teaming
2601.22569v1