Paper Library
Collection of AI Security research papers
Showing 235 papers total
June 09 - June 15, 2025
18 papers
Exploring the Secondary Risks of Large Language Models
Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su
2025-06-14
2506.12382v1
QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety
Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng
2025-06-14
safety
2506.12299v1
InfoFlood: Jailbreaking Large Language Models with Information Overload
Advait Yadav, Haibo Jin, Man Luo, Jun Zhuang, Haohan Wang
2025-06-13
red teaming
2506.12274v1
Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models
Jinming Wen, Xinyi Wu, Shuai Zhao, Yanhao Jia, Yuwen Li
2025-06-13
red teaming
2506.11521v1
DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents
Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang, Chaowei Xiao
2025-06-13
safety
2506.12104v1
How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, Mor Geva
2025-06-12
red teaming
2506.10979v1
SoK: Evaluating Jailbreak Guardrails for Large Language Models
Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
2025-06-12
red teaming
2506.10597v1
Effective Red-Teaming of Policy-Adherent Agents
Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor
2025-06-11
red teaming
2506.09600v1
Risks & Benefits of LLMs & GenAI for Platform Integrity, Healthcare Diagnostics, Cybersecurity, Privacy & AI Safety: A Comprehensive Survey, Roadmap & Implementation Blueprint
Kiarash Ahi
2025-06-10
safety
2506.12088v1
AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI)
Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
2025-06-10
red teaming
safety
2506.08885v2
Empirical Evaluation of the Security and Alignment of ChatGPT and Gemini: A Comparative Analysis of Vulnerabilities through Jailbreak Experiments
Rafaël Nouailles
2025-06-10
red teaming
2506.10029v1
AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan
2025-06-10
safety
2506.08473v2
RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards
Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua
2025-06-09
red teaming
2506.07736v2
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, Xuelong Li
2025-06-09
red teaming
2506.10022v1
Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models
Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn
2025-06-09
red teaming
2506.07645v1
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts
Torsten Krauß, Hamid Dashtbani, Alexandra Dmitrienko
2025-06-09
red teaming
2506.07596v1
When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi
2025-06-09
red teaming
2506.07452v1
Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures
Yukai Zhou, Sibei Yang, Wenjie Wang
2025-06-09
red teaming
2506.07402v1
June 02 - June 08, 2025
3 papers
Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani
2025-06-08
2506.09067v1
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua
2025-06-08
red teaming
2506.07022v1
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v3