Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 770 papers total

November 03 - November 09, 2025

12 papers

Death by a Thousand Prompts: Open Model Vulnerability Analysis

Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan, Adam Swanda
2025-11-05
red teaming
2511.03247v1

Jailbreaking in the Haystack

Rishi Rajesh Shah, Chen Henry Wu, Shashwat Saxena, Ziqian Zhong, Alexander Robey, Aditi Raghunathan
2025-11-05
red teaming
2511.04707v1

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Aashray Reddy, Andrew Zagula, Nicholas Saban, Kevin Zhu
2025-11-04
red teaming
2511.02376v2

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Aashray Reddy, Andrew Zagula, Nicholas Saban
2025-11-04
red teaming
2511.02376v1

LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context

Yudong Li, Zhongliang Yang, Kejiang Chen, Wenxuan Wang, Tianxin Zhang, Sifang Wan, Kecheng Wang, Haitian Li, Xu Wang, Lefan Cheng, Youdan Yang, Baocheng Chen, Ziyu Liu, Yufei Sun, Liyan Wu, Wenya Wen, Xingchi Gu, Peiru Yang
2025-11-04
safety
2511.02366v1

An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks

Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu
2025-11-04
red teaming
2511.02356v1

LM-Fix: Lightweight Bit-Flip Detection and Rapid Recovery Framework for Language Models

Ahmad Tahmasivand, Noureldin Zahran, Saba Al-Sayouri, Mohammed Fouda, Khaled N. Khasawneh
2025-11-03
2511.02866v1

Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models

Daniyal Ganiuly, Assel Smaiyl
2025-11-03
red teaming
2511.01634v2

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo
2025-11-03
2511.01618v1

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Hamin Koo, Minseon Kim, Jaehyung Kim
2025-11-03
red teaming
2511.01375v1

MIQ-SAM3D: From Single-Point Prompt to Multi-Instance Segmentation via Competitive Query Refinement

Jierui Qu, Jianchun Zhao
2025-11-03
2511.01345v1

"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Qin Zhou, Zhexin Zhang, Zhi Li, Limin Sun
2025-11-03
red teaming
2511.01287v1

October 27 - November 02, 2025

12 papers

Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?

Berk Atil, Rebecca J. Passonneau, Fred Morstatter
2025-11-01
red teaming
2511.00689v2

ShadowLogic: Backdoors in Any Whitebox LLM

Kasimir Schulz, Amelia Kawasaki, Leo Ring
2025-11-01
red teaming
2511.00664v1

Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Peng Ding, Jun Kuang, Wen Sun, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang
2025-11-01
red teaming, safety
2511.00556v1

Reimagining Safety Alignment with An Image

Yifan Xia, Guorui Chen, Wenqian Yu, Zhijiang Li, Philip Torr, Jindong Gu
2025-11-01
2511.00509v1

DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Instruction Fusion

Ruofan Liu, Yun Lin, Zhiyong Huang, Jin Song Dong
2025-11-01
2511.00447v2

DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture

Ruofan Liu, Yun Lin, Jin Song Dong
2025-11-01
red teaming
2511.00447v1

Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs

Mina Taraghi, Yann Pequignot, Amin Nikanjam, Mohamed Amine Merzouk, Foutse Khomh
2025-11-01
safety
2511.00382v1

Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks

Kayua Oleques Paim, Rodrigo Brandao Mansilha, Diego Kreutz, Muriel Figueredo Franco, Weverton Cordeiro
2025-11-01
red teaming
2511.00346v1

Diffusion LLMs are Natural Adversaries for any LLM

David Lüdke, Tom Wollschläger, Paul Ungermann, Stephan Günnemann, Leo Schwinn
2025-10-31
red teaming
2511.00203v1

Prevalence of Security and Privacy Risk-Inducing Usage of AI-based Conversational Agents

Kathrin Grosse, Nico Ebert
2025-10-31
2510.27275v1

Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah
2025-10-31
2510.27062v1

Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar
2025-10-30
red teaming
2511.04694v3