Philipp Zimmermann

Paper Library

A collection of AI Security research papers

Showing 946 papers total

January 05 - January 11, 2026

14 papers

Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense

Zejian Chen, Chaozhuo Li, Chao Li, Xi Zhang, Litian Zhang, Yiming He
2026-01-07
red teaming
2601.03594v1

STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules

Di Wu, Yanyan Zhao, Xin Lu, Mingzhe Li, Bing Qin
2026-01-07
2601.03537v1

Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

Zhakshylyk Nurlanov, Frank R. Schmidt, Florian Bernard
2026-01-06
red teaming
2601.03420v1

GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia
2026-01-06
red teaming
2601.03416v1

JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification

Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Zhaoye Li, Bin Ji, Baosheng Wang, Jie Yu
2026-01-06
red teaming
2601.03005v1

Adversarial Contrastive Learning for LLM Quantization Attacks

Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li
2026-01-06
red teaming
2601.02680v1

TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering

Scott Thornton
2026-01-06
red teaming
2601.03300v1

Extracting books from production language models

Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
2026-01-06
red teaming
2601.02671v1

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Devang Kulshreshtha, Hang Su, Chinmay Hegde, Haohan Wang
2026-01-06
red teaming
2601.02670v1

LLM-Empowered Functional Safety and Security by Design in Automotive Systems

Nenad Petrovic, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll
2026-01-05
safety
2601.02215v1

Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models

Antonio Colacicco, Vito Guida, Dario Di Palma, Fedelucio Narducci, Tommaso Di Noia
2026-01-05
2601.02002v1

Hidden State Poisoning Attacks against Mamba-based Language Models

Alexandre Le Mercier, Chris Develder, Thomas Demeester
2026-01-05
red teaming
2601.01972v2

Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, Ruoxi Jia
2026-01-05
safety
2601.01887v2

Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

Jiwei Guan, Haibo Jin, Haohan Wang
2026-01-05
red teaming
2601.01747v2

December 29 - January 04, 2026

9 papers

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato
2026-01-04
red teaming
2601.01627v1

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao, Tianle Gu, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang, Xia Hu
2026-01-04
red teaming
2601.01592v1

How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference

Songyang Liu, Chaozhuo Li, Rui Pu, Litian Zhang, Chenxu Wang, Zejian Chen, Yuting Zhang, Yiming Hei
2026-01-04
red teaming
2601.03288v1

MCP-SandboxScan: WASM-based Secure Execution and Runtime Analysis for MCP Tools

Zhuoran Tan, Run Hao, Jeremy Singer, Yutian Tang, Christos Anagnostopoulos
2026-01-03
2601.01241v1

Emoji-Based Jailbreaking of Large Language Models

M P V S Gopinadh, S Mahaboob Hussain
2026-01-02
red teaming
2601.00936v1

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu, Qiankun Li, Kun Wang, Zhigang Zeng
2026-01-02
safety
2601.00588v2

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Hyunjun Kim
2026-01-01
2601.00454v1

α³-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks

Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
2026-01-01
safety
2601.03281v1

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
2026-01-01
red teaming, safety
2601.00213v1