Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1167 papers total

January 05 - January 11, 2026

17 papers

HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

Siyuan Li, Xi Lin, Jun Wu, Zehao Liu, Haoyu Li, Tianjie Ju, Xiang Chen, Jianhua Li
2026-01-07
red teaming
2601.04034v1

What Matters For Safety Alignment?

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan
2026-01-07
red teaming
2601.03868v2

ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, Hanghang Tong
2026-01-07
red teaming
2601.03600v1

Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense

Zejian Chen, Chaozhuo Li, Chao Li, Xi Zhang, Litian Zhang, Yiming He
2026-01-07
red teaming
2601.03594v1

STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules

Di Wu, Yanyan Zhao, Xin Lu, Mingzhe Li, Bing Qin
2026-01-07
2601.03537v1

Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

Zhakshylyk Nurlanov, Frank R. Schmidt, Florian Bernard
2026-01-06
red teaming
2601.03420v1

GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia
2026-01-06
red teaming
2601.03416v1

JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification

Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Zhaoye Li, Bin Ji, Baosheng Wang, Jie Yu
2026-01-06
red teaming
2601.03005v1

Adversarial Contrastive Learning for LLM Quantization Attacks

Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li
2026-01-06
red teaming
2601.02680v1

TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering

Scott Thornton
2026-01-06
red teaming
2601.03300v1

Extracting books from production language models

Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
2026-01-06
red teaming
2601.02671v1

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Devang Kulshreshtha, Hang Su, Chinmay Hegde, Haohan Wang
2026-01-06
red teaming
2601.02670v1

LLM-Empowered Functional Safety and Security by Design in Automotive Systems

Nenad Petrovic, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll
2026-01-05
safety
2601.02215v1

Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models

Antonio Colacicco, Vito Guida, Dario Di Palma, Fedelucio Narducci, Tommaso Di Noia
2026-01-05
2601.02002v1

Hidden State Poisoning Attacks against Mamba-based Language Models

Alexandre Le Mercier, Chris Develder, Thomas Demeester
2026-01-05
red teaming
2601.01972v2

Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, Ruoxi Jia
2026-01-05
safety
2601.01887v2

Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

Jiwei Guan, Haibo Jin, Haohan Wang
2026-01-05
red teaming
2601.01747v2

December 29, 2025 - January 04, 2026

5 papers