Paper Library
A collection of AI security research papers
Showing 1331 papers total
September 29 - October 05, 2025
24 papers
Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach
Xiangfang Li, Yu Wang, Bo Li
2025-10-01
red teaming
2510.01342v1
Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense
Guobin Shen, Dongcheng Zhao, Haibo Tong, Jindong Li, Feifei Zhao, Yi Zeng
2025-10-01
red teaming
safety
2510.01088v1
Large Reasoning Models Learn Better Alignment from Flawed Thinking
ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi
2025-10-01
red teaming
2510.00938v1
ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
2025-10-01
safety
2510.00857v1
Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability
Shojiro Yamabe, Jun Sakuma
2025-10-01
red teaming
2510.00565v1
A Call to Action for a Secure-by-Design Generative AI Paradigm
Dalal Alharthi, Ivan Roberto Kawaminami Garcia
2025-10-01
2510.00451v1
Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models
Matheus Vinicius da Silva de Oliveira, Jonathan de Andrade Silva, Awdren de Lima Fontao
2025-09-30
2509.26584v1
STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models
Shaoxiong Guo, Tianyi Du, Lijun Li, Yuyao Wu, Jie Li, Jing Shao
2025-09-30
red teaming
2509.26473v1
SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models
Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang
2025-09-30
red teaming
2509.26345v1
SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs
Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Yan Teng, Xingjun Ma, Yingchun Wang
2025-09-30
safety
2509.26100v1
Better Privilege Separation for Agents by Restricting Data Types
Dennis Jacob, Emad Alghamdi, Zhanhao Hu, Basel Alomair, David Wagner
2025-09-30
2509.25926v1
SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents
Ruolin Chen, Yinqian Sun, Jihang Wang, Mingyang Lv, Qian Zhang, Yi Zeng
2025-09-30
safety
2509.25885v1
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Yein Park, Jungwoo Park, Jaewoo Kang
2025-09-30
red teaming
2509.25843v1
How Diffusion Models Memorize
Juyeop Kim, Songkuk Kim, Jong-Seok Lee
2025-09-30
2509.25705v1
STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi
2025-09-30
red teaming
2509.25624v1
Fingerprinting LLMs via Prompt Injection
Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li, Osama Ahmed, Zhicong Huang, Cheng Hong, Neil Gong
2025-09-29
2509.25448v2
Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs
Akio Hayakawa, Stefan Bott, Horacio Saggion
2025-09-29
safety
2509.25086v1
A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory
Qianshan Wei, Tengchao Yang, Yaochen Wang, Xinfeng Li, Lijun Li, Zhenfei Yin, Yi Zhan, Thorsten Holz, Zhiqiang Lin, XiaoFeng Wang
2025-09-29
safety
2510.02373v1
SecInfer: Preventing Prompt Injection via Inference-time Scaling
Yupei Liu, Yanting Wang, Yuqi Jia, Jinyuan Jia, Neil Zhenqiang Gong
2025-09-29
red teaming
2509.24967v4
GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners
Haoran Li, Yulin Chen, Jingru Zeng, Hao Peng, Huihao Jing, Wenbin Hu, Xi Yang, Ziqian Zeng, Sirui Han, Yangqiu Song
2025-09-29
safety
2509.24418v1
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu
2025-09-29
2509.24393v1
HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
2025-09-29
red teaming
2509.24384v1