Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 97 papers total

June 09 - June 15, 2025

3 papers

June 02 - June 08, 2025

21 papers

Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations

Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani
2025-06-08
2506.09067v1

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua
2025-06-08
red teaming
2506.07022v1

Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v3

Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v2

Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v1

From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment

Kyubyung Chae, Hyunbin Jin, Taesup Kim
2025-06-07
red teaming
2506.10020v1

Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
2025-06-06
safety
2506.06444v1

Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG

Zarreen Reza, Alexander Mazur, Michael T. Dugdale, Robin Ray-Chaudhuri
2025-06-06
2506.05925v1

Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
2025-06-05
red teaming, safety
2506.05346v1

Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
2025-06-05
safety
2506.05451v1

HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v2

HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v1

Adversarial Attacks on Robotic Vision Language Action Models

Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
2025-06-03
red teaming
2506.03350v1

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
2025-06-03
red teaming
2506.02873v1

From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV

Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida, Zhu Han
2025-06-03
2506.02649v1

IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages

Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
2025-06-03
safety
2506.02573v1

BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

Kalyan Nakka, Nitesh Saxena
2025-06-03
red teaming
2506.02479v1

Should LLM Safety Be More Than Refusing Harmful Instructions?

Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v2

Should LLM Safety Be More Than Refusing Harmful Instructions?

Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v1

AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output

Hisami Suzuki, Satoru Katsumata, Takashi Kodama, Tetsuro Takahashi, Kouta Nakayama, Satoshi Sekine
2025-06-03
safety
2506.02372v1

Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components

Ram Potham
2025-06-03
safety
2506.02357v1