Paper Library
A collection of AI security research papers (770 papers in total)
September 29 - October 05, 2025
7 papers
SecInfer: Preventing Prompt Injection via Inference-time Scaling
Yupei Liu, Yanting Wang, Yuqi Jia, Jinyuan Jia, Neil Zhenqiang Gong
2025-09-29 · red teaming · arXiv:2509.24967v3
GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners
Haoran Li, Yulin Chen, Jingru Zeng, Hao Peng, Huihao Jing, Wenbin Hu, Xi Yang, Ziqian Zeng, Sirui Han, Yangqiu Song
2025-09-29 · safety · arXiv:2509.24418v1

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu
2025-09-29 · arXiv:2509.24393v1

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
2025-09-29 · red teaming · arXiv:2509.24384v1

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs
Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo
2025-09-29 · red teaming · arXiv:2509.24319v1

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang
2025-09-29 · red teaming · arXiv:2509.24296v1

AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models
Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu, Baoyuan Wu
2025-09-29 · red teaming · arXiv:2509.24269v1
September 22 - September 28, 2025
15 papers
Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B
Shuyi Lin, Tian Lu, Zikai Wang, Bo Wen, Yibo Zhao, Cheng Tan
2025-09-28 · red teaming · arXiv:2509.23882v1

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
2025-09-28 · safety · arXiv:2509.25271v1
SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents
Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu
2025-09-28 · safety · arXiv:2509.23694v2
PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents
Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, Philip S. Yu
2025-09-28 · safety · arXiv:2509.23614v1

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning
Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu
2025-09-28 · red teaming · arXiv:2509.23558v1

Dual-Space Smoothness for Robust and Balanced LLM Unlearning
Han Yan, Zheyuan Liu, Meng Jiang
2025-09-27 · red teaming · arXiv:2509.23362v1

Preventing Robotic Jailbreaking via Multimodal Domain Adaptation
Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone
2025-09-27 · red teaming · arXiv:2509.23281v1

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models
Javad Forough, Mohammad Maheri, Hamed Haddadi
2025-09-27 · red teaming · arXiv:2509.23037v1

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents
Hwan Chang, Yonghyun Jun, Hwanhee Lee
2025-09-26 · red teaming · arXiv:2509.22830v1

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy
Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim
2025-09-26 · red teaming · arXiv:2509.22292v1

Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance
Wenbin Hu, Huihao Jing, Haochen Shi, Haoran Li, Yangqiu Song
2025-09-26 · safety · arXiv:2509.22250v1

The Rogue Scalpel: Activation Steering Compromises LLM Safety
Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
2025-09-26 · red teaming, safety · arXiv:2509.22067v1

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment
Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son
2025-09-26 · safety · arXiv:2509.22745v1

PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces
Jiawei Zhao, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
2025-09-26 · arXiv:2509.21768v1

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
2025-09-26 · red teaming · arXiv:2509.21761v2