Paper Library
A collection of AI security research papers
Showing 1331 papers total
October 27 - November 02, 2025
2 papers
Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition
Jing-Xuan Zhang, Genshun Wan, Jin Li, Jianqing Gao
2025-10-27
2510.22961v1
FAME: Fairness-aware Attention-modulated Video Editing
Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Zhidong Li, Longbing Cao
2025-10-27
2510.22960v1
October 20 - October 26, 2025
22 papers
Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks
Md. Mehedi Hasan, Ziaur Rahman, Rafid Mostafiz, Md. Abir Hossain
2025-10-26
red teaming
safety
2510.22628v1
Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
Pavlos Ntais
2025-10-24
red teaming
2510.22085v1
Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
2025-10-24
red teaming
2510.22014v1
Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks
Havva Alizadeh Noughabi, Julien Serbanescu, Fattane Zarrinkalam, Ali Dehghantanha
2025-10-24
red teaming
2510.21983v1
Characterizing Low-Latency Sky Localization in Multi-Detector Gravitational-Wave Networks
Amazigh Ouzriat, Viola Sordini, Francesco Di Renzo
2025-10-24
2510.21930v1
Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, Ruoxi Jia
2025-10-24
red teaming
2510.21910v1
FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models
Zihao Fu, Ryan Brown, Shun Shao, Kai Rawal, Eoin Delaney, Chris Russell
2025-10-24
2510.21363v1
When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails
Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
2025-10-24
red teaming
2510.21285v2
Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses
Xingwei Zhong, Kar Wai Fok, Vrizlynn L. L. Thing
2025-10-24
red teaming
2510.21214v1
The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning
Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam
2025-10-24
red teaming
2510.21190v1
Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
Yukun Jiang, Mingjie Li, Michael Backes, Yang Zhang
2025-10-24
red teaming
2510.21189v1
NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation of LLM via Genetic Optimization of External Knowledge
Hanyu Zhu, Lance Fiondella, Jiawei Yuan, Kai Zeng, Long Jiao
2025-10-24
red teaming
2510.21144v1
Soft Instruction De-escalation Defense
Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, Ilia Shumailov
2025-10-24
red teaming
2510.21057v2
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
Zheng-Xin Yong, Stephen H. Bach
2025-10-23
red teaming
2510.20956v1
Preventing Shortcuts in Adapter Training via Providing the Shortcuts
Anujraaj Argo Goyal, Guocheng Gordon Qian, Huseyin Coskun, Aarush Gupta, Himmy Tam, Daniil Ostashev, Ju Hu, Dhritiman Sagar, Sergey Tulyakov, Kfir Aberman, Kuan-Chieh Jackson Wang
2025-10-23
2510.20887v1
GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?
Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
2025-10-23
2510.20333v2
Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations
Divyanshu Kumar, Shreyas Jena, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
2025-10-23
red teaming
2510.20223v1
SAID: Empowering Large Language Models with Self-Activating Internal Defense
Yulong Chen, Yadong Liu, Jiawen Zhang, Mu Li, Chao Huang, Jie Wen
2025-10-23
red teaming
2510.20129v1
Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia
2025-10-22
2510.19807v1