Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 235 papers total

June 23 - June 29, 2025

1 paper

June 16 - June 22, 2025

17 papers

Multi-turn Jailbreaking via Global Refinement and Active Fabrication

Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin
2025-06-22
red teaming
2506.17881v1

Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs

Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn
2025-06-21
safety
2506.18931v1

SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Zhenglin Lai, Mengyao Liao, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li, Bingzhe Wu
2025-06-20
safety
2506.17368v1

From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers

Jingtong Su, Julia Kempe, Karen Ullrich
2025-06-20
red teaming
2506.17052v1

MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Meng Han
2025-06-20
red teaming
2506.16792v1

Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu
2025-06-20
red teaming
2506.16760v1

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li
2025-06-19
red teaming
2506.16447v1

Probing the Robustness of Large Language Models Safety to Latent Perturbations

Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang
2025-06-19
red teaming
2506.16078v1

deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses

Georgios Androutsopoulos, Antonio Bianchi
2025-06-18
safety
2506.15648v1

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong
2025-06-18
safety
2506.15606v1

From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem

Yanxu Mao, Tiehan Cui, Peipei Liu, Datao You, Hongsong Zhu
2025-06-18
red teaming, safety
2506.15170v1

Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar
2025-06-18
red teaming
2506.15751v1

LLM Jailbreak Oracle

Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan
2025-06-17
red teaming
2506.17299v1

Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations

Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das
2025-06-16
red teaming
2506.13901v1

Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions

Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar
2025-06-16
red teaming, safety
2506.13510v2

Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-AI Interactions

Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar
2025-06-16
red teaming, safety
2506.13510v1

Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs

Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Yang Deng, Xiang Wang, Xiangnan He
2025-06-16
safety
2506.13285v1

June 09 - June 15, 2025

6 papers

Jailbreak Strength and Model Similarity Predict Transferability

Rico Angell, Jannik Brinkmann, He He
2025-06-15
red teaming
2506.12913v1

Universal Jailbreak Suffixes Are Strong Attention Hijackers

Matan Ben-Tov, Mor Geva, Mahmood Sharif
2025-06-15
red teaming
2506.12880v1

Governments Should Mandate Tiered Anonymity on Social-Media Platforms to Counter Deepfakes and LLM-Driven Mass Misinformation

David Khachaturov, Roxanne Schnyder, Robert Mullins
2025-06-15
governance
2506.12814v1

SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression

Yucheng Li, Surin Ahn, Huiqiang Jiang, Amir H. Abdi, Yuqing Yang, Lili Qiu
2025-06-15
red teaming, safety
2506.12707v1

Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity

Bilal Saleh Husain
2025-06-15
red teaming
2506.12685v1

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025

Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, Pengyu Chen, Junze Chen, Hao Du, Kaiwen Shen, Shangkun Wu, Jiwei Wei, Shiyuan He, Yang Yang, Xiaohai Xu, Ke Ma, Qianqian Xu, Qingming Huang, Shi Lin, Xun Wang, Changting Lin, Meng Han, Yilei Jiang, Siqi Lai, Yaozhi Zheng, Yifei Song, Xiangyu Yue, Zonglei Jing, Tianyuan Zhang, Zhilei Zhu, Aishan Liu, Jiakai Wang, Siyuan Liang, Xianglong Kong, Hainan Li, Junjie Mu, Haotong Qin, Yue Yu, Lei Chen, Felix Juefei-Xu, Qing Guo, Xinyun Chen, Yew Soon Ong, Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao
2025-06-14
red teaming
2506.12430v1