Paper Library
Collection of AI Security research papers
Showing 97 papers total
June 09 - June 15, 2025
3 papers
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts
Torsten Krauß, Hamid Dashtbani, Alexandra Dmitrienko
2025-06-09
red teaming
2506.07596v1
When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi
2025-06-09
red teaming
2506.07452v1
Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures
Yukai Zhou, Sibei Yang, Wenjie Wang
2025-06-09
red teaming
2506.07402v1
June 02 - June 08, 2025
17 papers
Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani
2025-06-08
2506.09067v1
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua
2025-06-08
red teaming
2506.07022v1
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v3
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Kyubyung Chae, Hyunbin Jin, Taesup Kim
2025-06-07
red teaming
2506.10020v1
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
2025-06-06
safety
2506.06444v1
Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG
Zarreen Reza, Alexander Mazur, Michael T. Dugdale, Robin Ray-Chaudhuri
2025-06-06
2506.05925v1
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
2025-06-05
red teaming
safety
2506.05346v1
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
2025-06-05
safety
2506.05451v1
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v2
Adversarial Attacks on Robotic Vision Language Action Models
Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
2025-06-03
red teaming
2506.03350v1
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
2025-06-03
red teaming
2506.02873v1
From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV
Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida, Zhu Han
2025-06-03
2506.02649v1
IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages
Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
2025-06-03
safety
2506.02573v1
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage
Kalyan Nakka, Nitesh Saxena
2025-06-03
red teaming
2506.02479v1
Should LLM Safety Be More Than Refusing Harmful Instructions?
Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v2
AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output
Hisami Suzuki, Satoru Katsumata, Takashi Kodama, Tetsuro Takahashi, Kouta Nakayama, Satoshi Sekine
2025-06-03
safety
2506.02372v1
Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components
Ram Potham
2025-06-03
safety
2506.02357v1