Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 773 papers total

October 13 - October 19, 2025

16 papers

Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Ruben Belo, Marta Guimaraes, Claudia Soares
2025-10-14
2510.12672v2

Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Ruben Belo, Claudia Soares, Marta Guimaraes
2025-10-14
2510.12672v1

Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi
2025-10-14
red teaming
2510.13893v1

PromptLocate: Localizing Prompt Injection Attacks

Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, Neil Gong
2025-10-14
red teaming
2510.12252v2

MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents

Dongsen Zhang, Zekun Li, Xu Luo, Xuannan Liu, Peipei Li, Wenjun Xu
2025-10-14
red teaming
2510.15994v1

SafeMT: Multi-turn Safety for Multimodal Language Models

Han Zhu, Juntao Dai, Jiaming Ji, Haoran Li, Chengkun Cai, Pengcheng Wen, Chi-Min Chan, Boyuan Chen, Yaodong Yang, Sirui Han, Yike Guo
2025-10-14
red teaming
2510.12133v1

Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
2025-10-13
red teaming
2510.11851v2

Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
2025-10-13
red teaming
2510.11851v1

Countermind: A Multi-Layered Security Architecture for Large Language Models

Dominik Schwarz
2025-10-13
2510.11837v1

Don't Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball, Andreas Haupt
2025-10-13
2510.11834v1

Bag of Tricks for Subverting Reasoning-based Safety Guardrails

Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si, Jingpei Wu, Philip Torr, Volker Tresp, Jindong Gu
2025-10-13
red teaming
2510.11570v1

Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

Jiaying Wu, Zihang Fu, Haonan Wang, Fanxiao Li, Min-Yen Kan
2025-10-13
governance
2510.11423v1

Attacks by Content: Automated Fact-checking is an AI Security Issue

Michael Schlichtkrull
2025-10-13
2510.11238v1

TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code

Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic
2025-10-13
2510.11151v1

Demystifying Numerosity in Diffusion Models -- Limitations and Remedies

Yaqi Zhao, Xiaochen Wang, Li Dong, Wentao Zhang, Yuhui Yuan
2025-10-13
2510.11117v1

SceneTextStylizer: A Training-Free Scene Text Style Transfer Framework with Diffusion Model

Honghui Yuan, Keiji Yanai
2025-10-13
2510.10910v1

October 06 - October 12, 2025

8 papers