Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

November 10 - November 16, 2025

8 papers

Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

Yuxuan Zhou, Yuzhao Peng, Yang Bai, Kuofeng Gao, Yihao Zhang, Yechao Zhang, Xun Chen, Tao Yu, Tao Dai, Shu-Tao Xia
2025-11-11
red teaming
2511.08367v1

Alignment-Aware Quantization for LLM Safety

Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
2025-11-11
safety
2511.07842v1

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia
2025-11-10
red teaming
2511.07315v1

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong
2025-11-10
safety
2511.06890v1

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

Peng Zhang, Peijie Sun
2025-11-10
safety
2511.06852v2

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

Peng Zhang, Peijie Sun
2025-11-10
red teaming safety
2511.06852v1

SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang
2025-11-10
safety
2511.06778v2

SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang
2025-11-10
safety
2511.06778v1

November 03 - November 09, 2025

16 papers

EASE: Practical and Efficient Safety Alignment for Small Language Models

Haonan Shi, Guoli Wang, Tu Ouyang, An Wang
2025-11-09
red teaming
2511.06512v1

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Zhaoxia Yin
2025-11-09
red teaming
2511.07480v1

Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng
2025-11-09
red teaming safety
2511.06396v1

RelightMaster: Precise Video Relighting with Multi-plane Light Images

Weikang Bian, Xiaoyu Shi, Zhaoyang Huang, Jianhong Bai, Qinghe Wang, Xintao Wang, Pengfei Wan, Kun Gai, Hongsheng Li
2025-11-09
2511.06271v1

RAG-targeted Adversarial Attack on LLM-based Threat Detection and Mitigation Framework

Seif Ikbarieh, Kshitiz Aryal, Maanak Gupta
2025-11-09
red teaming
2511.06212v1

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci
2025-11-08
red teaming
2511.05919v2

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci
2025-11-08
red teaming
2511.05919v1

MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?

Jiayi Fu, Qiyao Sun
2025-11-08
2511.05867v2

When AI Meets the Web: Prompt Injection Risks in Third-Party AI Chatbot Plugins

Yigitcan Kaya, Anton Landerer, Stijn Pletinckx, Michelle Zimmermann, Christopher Kruegel, Giovanni Vigna
2025-11-08
red teaming
2511.05797v1

Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Teqi Hao, Xiaoyu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu
2025-11-07
2511.05286v1

Large Language Models for Cyber Security

Raunak Somani, Aswani Kumar Cherukuri
2025-11-06
2511.04508v1

AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Tim Beyer, Jonas Dornbusch, Jakob Steimle, Moritz Ladenburger, Leo Schwinn, Stephan Günnemann
2025-11-06
red teaming
2511.04316v1

Secure Code Generation at Scale with Reflexion

Arup Datta, Ahmed Aljohani, Hyunsook Do
2025-11-05
2511.03898v1

Whisper Leak: a side-channel attack on Large Language Models

Geoff McDonald, Jonathan Bar Or
2025-11-05
2511.03675v1

Inter-Agent Trust Models: A Comparative Study of Brief, Claim, Proof, Stake, Reputation and Constraint in Agentic Web Protocol Design: A2A, AP2, ERC-8004, and Beyond

Botao 'Amber' Hu, Helena Rong
2025-11-05
red teaming
2511.03434v1

Let the Bees Find the Weak Spots: A Path Planning Perspective on Multi-Turn Jailbreak Attacks against LLMs

Yize Liu, Yunyun Hou, Aina Sui
2025-11-05
red teaming
2511.03271v1