Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 770 papers total

June 09 - June 15, 2025

10 papers

AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI)

Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
2025-06-10
red teaming, safety
2506.08885v1

Design Patterns for Securing LLM Agents against Prompt Injections

Luca Beurer-Kellner, Beat Buesser, Ana-Maria Crețu, Edoardo Debenedetti, Daniel Dobos, Daniel Fabian, Marc Fischer, David Froelicher, Kathrin Grosse, Daniel Naeff, Ezinwanne Ozoani, Andrew Paverd, Florian Tramèr, Václav Volhejn
2025-06-10
2506.08837v3

Empirical Evaluation of the Security and Alignment of ChatGPT and Gemini: A Comparative Analysis of Vulnerabilities through Jailbreak Experiments (in French)

Rafaël Nouailles
2025-06-10
red teaming
2506.10029v1

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan
2025-06-10
safety
2506.08473v2

RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua
2025-06-09
red teaming
2506.07736v2

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges

Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, Xuelong Li
2025-06-09
red teaming
2506.10022v1

Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn
2025-06-09
red teaming
2506.07645v1

TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts

Torsten Krauß, Hamid Dashtbani, Alexandra Dmitrienko
2025-06-09
red teaming
2506.07596v1

When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment

Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi
2025-06-09
red teaming
2506.07452v1

Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures

Yukai Zhou, Sibei Yang, Wenjie Wang
2025-06-09
red teaming
2506.07402v1

June 02 - June 08, 2025

10 papers

Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations

Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani
2025-06-08
2506.09067v1

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua
2025-06-08
red teaming
2506.07022v1

Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger
2025-06-08
red teaming
2506.06975v3

From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment

Kyubyung Chae, Hyunbin Jin, Taesup Kim
2025-06-07
red teaming
2506.10020v1

Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
2025-06-06
safety
2506.06444v1

Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG

Zarreen Reza, Alexander Mazur, Michael T. Dugdale, Robin Ray-Chaudhuri
2025-06-06
2506.05925v1

To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt

Zhilong Wang, Neha Nagaraja, Lan Zhang, Hayretdin Bahsi, Pawan Patil, Peng Liu
2025-06-06
red teaming
2506.05739v1

Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
2025-06-05
red teaming, safety
2506.05346v1

Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
2025-06-05
safety
2506.05451v1

Sentinel: SOTA model to protect against prompt injections

Dror Ivry, Oran Nahum
2025-06-05
2506.05446v1