Paper Library
A collection of AI security research papers
Topics: Red Teaming · Safety · Risk & Governance
1,169 papers total
June 09 - June 15, 2025
3 papers
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts
Torsten Krauß, Hamid Dashtbani, Alexandra Dmitrienko
2025-06-09
red teaming
2506.07596v1
When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi
2025-06-09
red teaming
2506.07452v1
Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures
Yukai Zhou, Sibei Yang, Wenjie Wang
2025-06-09
red teaming
2506.07402v1
June 02 - June 08, 2025
21 papers
Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani
2025-06-08
2506.09067v1
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua
2025-06-08
red teaming
2506.07022v1
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v3
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v2
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v1
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Kyubyung Chae, Hyunbin Jin, Taesup Kim
2025-06-07
red teaming
2506.10020v1
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
2025-06-06
safety
2506.06444v1
Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG
Zarreen Reza, Alexander Mazur, Michael T. Dugdale, Robin Ray-Chaudhuri
2025-06-06
2506.05925v1
To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt
Zhilong Wang, Neha Nagaraja, Lan Zhang, Hayretdin Bahsi, Pawan Patil, Peng Liu
2025-06-06
red teaming
2506.05739v1
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
2025-06-05
red teaming
safety
2506.05346v1
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
2025-06-05
safety
2506.05451v1
Sentinel: SOTA model to protect against prompt injections
Dror Ivry, Oran Nahum
2025-06-05
2506.05446v1
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v2
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v1
Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering
Yi Ji, Runzhi Li, Baolei Mao
2025-06-05
red teaming
2506.06384v1
Adversarial Attacks on Robotic Vision Language Action Models
Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
2025-06-03
red teaming
2506.03350v1
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
2025-06-03
red teaming
2506.02873v1
From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV
Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida, Zhu Han
2025-06-03
2506.02649v1
IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages
Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
2025-06-03
safety
2506.02573v1
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage
Kalyan Nakka, Nitesh Saxena
2025-06-03
red teaming
2506.02479v1
Should LLM Safety Be More Than Refusing Harmful Instructions?
Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v2