Paper Library
Collection of AI Security research papers
Showing 1169 papers total
September 22 - September 28, 2025
1 paper
SilentStriker: Toward Stealthy Bit-Flip Attacks on Large Language Models
Haotian Xu, Qingsong Peng, Jie Shi, Huadi Zheng, Yu Li, Cheng Zhuo
2025-09-22
red teaming
2509.17371v2
September 15 - September 21, 2025
21 papers
DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems
Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua
2025-09-21
red teaming
2509.16870v1
AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software
Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua
2025-09-21
red teaming
safety
2509.16861v1
AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans
Wei Xie, Shuoyoucheng Ma, Zhenhua Wang, Enze Wang, Kai Chen, Xiaobing Sun, Baosheng Wang
2025-09-20
2509.16530v1
Domain-Specific Constitutional AI: Enhancing Safety in LLM-Powered Mental Health Chatbots
Chenhan Lyu, Yutong Song, Pengfei Zhang, Amir M. Rahmani
2025-09-19
safety
2509.16444v1
Randomized Smoothing Meets Vision-Language Models
Emmanouil Seferis, Changshun Wu, Stefanos Kollias, Saddek Bensalem, Chih-Hong Cheng
2025-09-19
red teaming
2509.16088v1
SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection
Maithili Joshi, Palash Nandi, Tanmoy Chakraborty
2025-09-19
red teaming
2509.16060v1
EmoQ: Speech Emotion Recognition via Speech-Aware Q-Former and Large Language Model
Yiqing Yang, Man-Wai Mak
2025-09-19
2509.15775v1
Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
Yuanbo Xie, Yingjie Zhang, Tianyun Liu, Duohe Ma, Tingwen Liu
2025-09-18
red teaming
safety
2509.15202v1
Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems
Diego Gosmar, Deborah A. Dahl
2025-09-18
red teaming
2509.14956v1
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages
Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee
2025-09-18
safety
2509.15260v2
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages
Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee
2025-09-18
safety
2509.15260v1
MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models
Siyu Yan, Long Zeng, Xuecheng Wu, Chengcheng Han, Kongcheng Zhang, Chong Peng, Xuezhi Cao, Xunliang Cai, Chenjuan Guo
2025-09-18
red teaming
2509.14651v1
LLM Jailbreak Detection for (Almost) Free!
Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu
2025-09-18
red teaming
2509.14558v1
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu
2025-09-17
red teaming
2509.14297v1
Agentic JWT: A Secure Delegation Protocol for Autonomous AI Agents
Abhishek Goswami
2025-09-16
2509.13597v1
A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks
S M Asif Hossain, Ruksat Khan Shayoni, Mohd Ruhul Ameen, Akif Islam, M. F. Mridha, Jungpil Shin
2025-09-16
2509.14285v2
A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks
S M Asif Hossain, Ruksat Khan Shayoni, Mohd Ruhul Ameen, Akif Islam, M. F. Mridha, Jungpil Shin
2025-09-16
safety
2509.14285v1
Jailbreaking Large Language Models Through Content Concretization
Johan Wahréus, Ahmed Hussain, Panos Papadimitratos
2025-09-16
red teaming
2509.12937v1
Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models
Yunhan Zhao, Xiang Zheng, Xingjun Ma
2025-09-16
red teaming
2509.12724v1
Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models
Gustavo Sandoval, Denys Fenchenko, Junyao Chen
2025-09-15
red teaming
2509.14271v1
Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check
Chentao Cao, Xiaojun Xu, Bo Han, Hang Li
2025-09-15
2509.11629v1
September 08 - September 14, 2025
2 papers
Securing AI Agents: Implementing Role-Based Access Control for Industrial Applications
Aadil Gani Ganie
2025-09-14
2509.11431v1
When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity
Shiyao Cui, Xijia Feng, Yingkang Wang, Junxiao Yang, Zhexin Zhang, Biplab Sikdar, Hongning Wang, Han Qiu, Minlie Huang
2025-09-14
red teaming
2509.11141v1