Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

September 15 - September 21, 2025

9 papers

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee
2025-09-18
safety
2509.15260v1

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Siyu Yan, Long Zeng, Xuecheng Wu, Chengcheng Han, Kongcheng Zhang, Chong Peng, Xuezhi Cao, Xunliang Cai, Chenjuan Guo
2025-09-18
red teaming
2509.14651v1

LLM Jailbreak Detection for (Almost) Free!

Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu
2025-09-18
red teaming
2509.14558v1

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu
2025-09-17
red teaming
2509.14297v1

A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks

S M Asif Hossain, Ruksat Khan Shayoni, Mohd Ruhul Ameen, Akif Islam, M. F. Mridha, Jungpil Shin
2025-09-16
safety
2509.14285v2

Jailbreaking Large Language Models Through Content Concretization

Johan Wahréus, Ahmed Hussain, Panos Papadimitratos
2025-09-16
red teaming
2509.12937v1

Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models

Yunhan Zhao, Xiang Zheng, Xingjun Ma
2025-09-16
red teaming
2509.12724v1

Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models

Gustavo Sandoval, Denys Fenchenko, Junyao Chen
2025-09-15
red teaming
2509.14271v1

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

Chentao Cao, Xiaojun Xu, Bo Han, Hang Li
2025-09-15
2509.11629v1

September 08 - September 14, 2025

13 papers

When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity

Shiyao Cui, Xijia Feng, Yingkang Wang, Junxiao Yang, Zhexin Zhang, Biplab Sikdar, Hongning Wang, Han Qiu, Minlie Huang
2025-09-14
red teaming
2509.11141v1

ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs

Yibo Zhang, Liang Lin
2025-09-14
2509.11128v1

Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding

Seongho Joo, Hyukhun Koh, Kyomin Jung
2025-09-13
red teaming
2509.10931v1

Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications

Janis Keuper
2025-09-12
red teaming
2509.10248v3

When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review

Changjia Zhu, Junjie Xiong, Renkai Ma, Zhicong Lu, Yao Liu, Lingyao Li
2025-09-12
red teaming
2509.09912v1

Steering MoE LLMs via Expert (De)Activation

Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng
2025-09-11
red teaming
2509.09660v1

Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M

Piyush Pant
2025-09-10
safety
2509.09055v1

PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability

Tung Vu, Lam Nguyen, Quynh Dao
2025-09-10
safety
2509.08910v1

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-09-10
red teaming
2509.08729v1

ImportSnare: Directed "Code Manual" Hijacking in Retrieval-Augmented Code Generation

Kai Ye, Liangcai Su, Chenxiong Qian
2025-09-09
red teaming
2509.07941v1

Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, Xiaobing Pei, Jing Wang
2025-09-09
red teaming
2509.07617v1

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs

Hongfei Xia, Hongru Wang, Zeming Liu, Qian Yu, Yuhang Guo, Haifeng Wang
2025-09-09
safety
2509.07315v1

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang
2025-09-08
red teaming
2509.06350v1

September 01 - September 07, 2025

1 paper