Paper Library
Collection of AI Security research papers
Showing 947 papers total
June 02 - June 08, 2025
21 papers
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v2
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Kyubyung Chae, Hyunbin Jin, Taesup Kim
2025-06-07
red teaming
2506.10020v1
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
2025-06-06
safety
2506.06444v1
Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG
Zarreen Reza, Alexander Mazur, Michael T. Dugdale, Robin Ray-Chaudhuri
2025-06-06
2506.05925v1
To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt
Zhilong Wang, Neha Nagaraja, Lan Zhang, Hayretdin Bahsi, Pawan Patil, Peng Liu
2025-06-06
red teaming
2506.05739v1
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
2025-06-05
red teaming
safety
2506.05346v1
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
2025-06-05
safety
2506.05451v1
Sentinel: SOTA model to protect against prompt injections
Dror Ivry, Oran Nahum
2025-06-05
2506.05446v1
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v2
Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering
Yi Ji, Runzhi Li, Baolei Mao
2025-06-05
red teaming
2506.06384v1
Adversarial Attacks on Robotic Vision Language Action Models
Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
2025-06-03
red teaming
2506.03350v1
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
2025-06-03
red teaming
2506.02873v1
From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV
Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida, Zhu Han
2025-06-03
2506.02649v1
IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages
Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
2025-06-03
safety
2506.02573v1
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage
Kalyan Nakka, Nitesh Saxena
2025-06-03
red teaming
2506.02479v1
Should LLM Safety Be More Than Refusing Harmful Instructions?
Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v2
AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output
Hisami Suzuki, Satoru Katsumata, Takashi Kodama, Tetsuro Takahashi, Kouta Nakayama, Satoshi Sekine
2025-06-03
safety
2506.02372v1
Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components
Ram Potham
2025-06-03
safety
2506.02357v1
ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs
Zeming Wei, Chengcan Wu, Meng Sun
2025-06-02
2506.01770v1
ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge
Chaoyue He, Xin Zhou, Yi Wu, Xinjia Yu, Yan Zhang, Lei Zhang, Di Wang, Shengfei Lyu, Hong Xu, Xiaoqiao Wang, Wei Liu, Chunyan Miao
2025-06-02
governance
2506.01646v1
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models
Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, Richang Hong
2025-06-02
red teaming
2506.01307v1