Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

June 16 - June 22, 2025

7 papers

From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem

Yanxu Mao, Tiehan Cui, Peipei Liu, Datao You, Hongsong Zhu
2025-06-18
red teaming, safety
2506.15170v1

Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar
2025-06-18
red teaming
2506.15751v1

LLM Jailbreak Oracle

Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan
2025-06-17
red teaming
2506.17299v1

Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-wise Pooled Representations

Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das
2025-06-16
red teaming
2506.13901v1

Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions

Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar
2025-06-16
red teaming, safety
2506.13510v2

Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-AI Interactions

Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar
2025-06-16
red teaming, safety
2506.13510v1

Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs

Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Yang Deng, Xiang Wang, Xiangnan He
2025-06-16
safety
2506.13285v1

June 09 - June 15, 2025

17 papers

Jailbreak Strength and Model Similarity Predict Transferability

Rico Angell, Jannik Brinkmann, He He
2025-06-15
red teaming
2506.12913v1

Universal Jailbreak Suffixes Are Strong Attention Hijackers

Matan Ben-Tov, Mor Geva, Mahmood Sharif
2025-06-15
red teaming
2506.12880v1

Governments Should Mandate Tiered Anonymity on Social-Media Platforms to Counter Deepfakes and LLM-Driven Mass Misinformation

David Khachaturov, Roxanne Schnyder, Robert Mullins
2025-06-15
governance
2506.12814v1

SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression

Yucheng Li, Surin Ahn, Huiqiang Jiang, Amir H. Abdi, Yuqing Yang, Lili Qiu
2025-06-15
red teaming, safety
2506.12707v1

Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity

Bilal Saleh Husain
2025-06-15
red teaming
2506.12685v1

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025

Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, Pengyu Chen, Junze Chen, Hao Du, Kaiwen Shen, Shangkun Wu, Jiwei Wei, Shiyuan He, Yang Yang, Xiaohai Xu, Ke Ma, Qianqian Xu, Qingming Huang, Shi Lin, Xun Wang, Changting Lin, Meng Han, Yilei Jiang, Siqi Lai, Yaozhi Zheng, Yifei Song, Xiangyu Yue, Zonglei Jing, Tianyuan Zhang, Zhilei Zhu, Aishan Liu, Jiakai Wang, Siyuan Liang, Xianglong Kong, Hainan Li, Junjie Mu, Haotong Qin, Yue Yu, Lei Chen, Felix Juefei-Xu, Qing Guo, Xinyun Chen, Yew Soon Ong, Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao
2025-06-14
red teaming
2506.12430v1

Exploring the Secondary Risks of Large Language Models

Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su
2025-06-14
2506.12382v1

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety

Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng
2025-06-14
safety
2506.12299v1

InfoFlood: Jailbreaking Large Language Models with Information Overload

Advait Yadav, Haibo Jin, Man Luo, Jun Zhuang, Haohan Wang
2025-06-13
red teaming
2506.12274v1

Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models

Jinming Wen, Xinyi Wu, Shuai Zhao, Yanhao Jia, Yuwen Li
2025-06-13
red teaming
2506.11521v1

DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents

Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang, Chaowei Xiao
2025-06-13
safety
2506.12104v1

How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?

Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, Mor Geva
2025-06-12
red teaming
2506.10979v1

SoK: Evaluating Jailbreak Guardrails for Large Language Models

Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
2025-06-12
red teaming
2506.10597v1

LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge

Sahar Abdelnabi, Aideen Fay, Ahmed Salem, Egor Zverev, Kai-Chieh Liao, Chi-Huang Liu, Chun-Chih Kuo, Jannis Weigend, Danyael Manlangit, Alex Apostolov, Haris Umair, João Donato, Masayuki Kawakita, Athar Mahboob, Tran Huu Bach, Tsun-Han Chiang, Myeongjin Cho, Hajin Choi, Byeonghyeon Kim, Hyeonjin Lee, Benjamin Pannell, Conor McCauley, Mark Russinovich, Andrew Paverd, Giovanni Cherubin
2025-06-11
red teaming
2506.09956v1

Effective Red-Teaming of Policy-Adherent Agents

Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor
2025-06-11
red teaming
2506.09600v1

Risks & Benefits of LLMs & GenAI for Platform Integrity, Healthcare Diagnostics, Cybersecurity, Privacy & AI Safety: A Comprehensive Survey, Roadmap & Implementation Blueprint

Kiarash Ahi
2025-06-10
safety
2506.12088v1

AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI)

Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
2025-06-10
red teaming, safety
2506.08885v2