Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

June 02 - June 08, 2025

16 papers

HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v2

HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v1

Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering

Yi Ji, Runzhi Li, Baolei Mao
2025-06-05
red teaming
2506.06384v1

Adversarial Attacks on Robotic Vision Language Action Models

Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
2025-06-03
red teaming
2506.03350v1

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
2025-06-03
red teaming
2506.02873v1

From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV

Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida, Zhu Han
2025-06-03
2506.02649v1

IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages

Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
2025-06-03
safety
2506.02573v1

BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

Kalyan Nakka, Nitesh Saxena
2025-06-03
red teaming
2506.02479v1

Should LLM Safety Be More Than Refusing Harmful Instructions?

Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v2

Should LLM Safety Be More Than Refusing Harmful Instructions?

Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v1

AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output

Hisami Suzuki, Satoru Katsumata, Takashi Kodama, Tetsuro Takahashi, Kouta Nakayama, Satoshi Sekine
2025-06-03
safety
2506.02372v1

Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components

Ram Potham
2025-06-03
safety
2506.02357v1

ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs

Zeming Wei, Chengcan Wu, Meng Sun
2025-06-02
2506.01770v1

ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge

Chaoyue He, Xin Zhou, Yi Wu, Xinjia Yu, Yan Zhang, Lei Zhang, Di Wang, Shengfei Lyu, Hong Xu, Xiaoqiao Wang, Wei Liu, Chunyan Miao
2025-06-02
governance
2506.01646v1

Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models

Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, Richang Hong
2025-06-02
red teaming
2506.01307v1

MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine

Shufeng Kong, Xingru Yang, Yuanyuan Wei, Zijie Wang, Hao Tang, Jiuqi Qin, Shuting Lan, Yingheng Wang, Junwen Bai, Zhuangbin Chen, Zibin Zheng, Caihua Liu, Hao Liang
2025-06-02
safety
2506.01252v1

May 26 - June 01, 2025

8 papers