Paper Library
A collection of AI security research papers
1169 papers total
September 08 - September 14, 2025
14 papers
ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs
Yibo Zhang, Liang Lin
2025-09-14
2509.11128v1
Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding
Seongho Joo, Hyukhun Koh, Kyomin Jung
2025-09-13
red teaming
2509.10931v1
Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications
Janis Keuper
2025-09-12
red teaming
2509.10248v3
Realism Control One-step Diffusion for Real-World Image Super-Resolution
Zongliang Wu, Siming Zheng, Peng-Tao Jiang, Xin Yuan
2025-09-12
2509.10122v2
When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review
Changjia Zhu, Junjie Xiong, Renkai Ma, Zhicong Lu, Yao Liu, Lingyao Li
2025-09-12
red teaming
2509.09912v1
Steering MoE LLMs via Expert (De)Activation
Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng
2025-09-11
red teaming
2509.09660v1
Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M
Piyush Pant
2025-09-10
safety
2509.09055v1
PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability
Tung Vu, Lam Nguyen, Quynh Dao
2025-09-10
safety
2509.08910v1
X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-09-10
red teaming
2509.08729v1
Architecting Resilient LLM Agents: A Guide to Secure Plan-then-Execute Implementations
Ron F. Del Rosario, Klaudia Krawiecka, Christian Schroeder de Witt
2025-09-10
2509.08646v1
ImportSnare: Directed "Code Manual" Hijacking in Retrieval-Augmented Code Generation
Kai Ye, Liangcai Su, Chenxiong Qian
2025-09-09
red teaming
2509.07941v1
Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling
Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, Xiaobing Pei, Jing Wang
2025-09-09
red teaming
2509.07617v1
SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs
Hongfei Xia, Hongru Wang, Zeming Liu, Qian Yu, Yuhang Guo, Haifeng Wang
2025-09-09
safety
2509.07315v1
Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang
2025-09-08
red teaming
2509.06350v1
September 01 - September 07, 2025
10 papers
Measuring the Vulnerability Disclosure Policies of AI Vendors
Yangheran Piao, Jingjie Li, Daniel W. Woods
2025-09-07
2509.06136v1
Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs
Andrew Yeo, Daeseon Choi
2025-09-07
red teaming
safety
2509.05883v1
AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs
Debdeep Sanyal, Manodeep Ray, Murari Mandal
2025-09-06
red teaming
2509.08000v1
EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System
Pavan Reddy, Aditya Sanjay Gujral
2025-09-06
red teaming
2509.10540v1
Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models
Youjia Zheng, Mohammad Zandsalimy, Shanu Sushmita
2025-09-05
red teaming
2509.05471v1
The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
Danielle Ensign, Henry Sleight, Kyle Fish
2025-09-05
2509.04781v1
NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models
Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu
2025-09-04
red teaming
2509.03985v1
Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs
Shei Pern Chua, Thai Zhen Leng, Teh Kai Jun, Xiao Li, Xiaolin Hu
2025-09-04
red teaming
2509.05367v1
SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models
Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang
2025-09-03
red teaming
2509.03487v1
BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Roland Pihlakas, Sruthi Kuriakose
2025-09-02
safety
2509.02655v1