Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

October 13 - October 19, 2025

24 papers

Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

Karthik Avinash, Nikhil Pareek, Rishav Hada
2025-10-15
2510.13351v1

Prompt-based Adaptation in Large-scale Vision Models: A Survey

Xi Xiao, Yunbei Zhang, Lin Zhao, Yiyang Liu, Xiaoying Liao, Zheda Mai, Xingjian Li, Xiao Wang, Hao Xu, Jihun Hamm, Xue Lin, Min Xu, Qifan Wang, Tianyang Wang, Cheng Han
2025-10-15
2510.13219v1

SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

Juan Ren, Mark Dras, Usman Naseem
2025-10-15
red teaming
2510.13190v1

A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

João A. Leite, Arnav Arora, Silvia Gargova, João Luz, Gustavo Sampaio, Ian Roberts, Carolina Scarton, Kalina Bontcheva
2025-10-14
red teaming
2510.12993v1

SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents

Simon Sinong Zhan, Yao Liu, Philip Wang, Zinan Wang, Qineng Wang, Zhian Ruan, Xiangyu Shi, Xinyu Cao, Frank Yang, Kangrui Wang, Huajie Shao, Manling Li, Qi Zhu
2025-10-14
safety
2510.12985v1

CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models

Denis Rychkovskiy
2025-10-14
2510.12954v2

RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper
2025-10-14
red teaming
2510.13901v1

UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale
2025-10-14
2510.12789v1

Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Ruben Belo, Marta Guimaraes, Claudia Soares
2025-10-14
2510.12672v2

Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Ruben Belo, Claudia Soares, Marta Guimaraes
2025-10-14
2510.12672v1

Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi
2025-10-14
red teaming
2510.13893v1

PromptLocate: Localizing Prompt Injection Attacks

Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, Neil Gong
2025-10-14
red teaming
2510.12252v2

MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents

Dongsen Zhang, Zekun Li, Xu Luo, Xuannan Liu, Peipei Li, Wenjun Xu
2025-10-14
red teaming
2510.15994v1

SafeMT: Multi-turn Safety for Multimodal Language Models

Han Zhu, Juntao Dai, Jiaming Ji, Haoran Li, Chengkun Cai, Pengcheng Wen, Chi-Min Chan, Boyuan Chen, Yaodong Yang, Sirui Han, Yike Guo
2025-10-14
red teaming
2510.12133v1

Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
2025-10-13
red teaming
2510.11851v2

Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
2025-10-13
red teaming
2510.11851v1

Countermind: A Multi-Layered Security Architecture for Large Language Models

Dominik Schwarz
2025-10-13
2510.11837v1

Don't Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball, Andreas Haupt
2025-10-13
2510.11834v1

Bag of Tricks for Subverting Reasoning-based Safety Guardrails

Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si, Jingpei Wu, Philip Torr, Volker Tresp, Jindong Gu
2025-10-13
red teaming
2510.11570v1

Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

Jiaying Wu, Zihang Fu, Haonan Wang, Fanxiao Li, Min-Yen Kan
2025-10-13
governance
2510.11423v1

Attacks by Content: Automated Fact-checking is an AI Security Issue

Michael Schlichtkrull
2025-10-13
2510.11238v1

TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code

Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic
2025-10-13
2510.11151v1

Demystifying Numerosity in Diffusion Models -- Limitations and Remedies

Yaqi Zhao, Xiaochen Wang, Li Dong, Wentao Zhang, Yuhui Yuan
2025-10-13
2510.11117v1

SceneTextStylizer: A Training-Free Scene Text Style Transfer Framework with Diffusion Model

Honghui Yuan, Keiji Yanai
2025-10-13
2510.10910v1