Philipp Zimmermann

Paper Library

A collection of AI Security research papers

Showing 97 papers total

June 23 - June 29, 2025

7 papers

June 16 - June 22, 2025

17 papers

Multi-turn Jailbreaking via Global Refinement and Active Fabrication

Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin
2025-06-22
red teaming
2506.17881v1

Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs

Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn
2025-06-21
safety
2506.18931v1

SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Zhenglin Lai, Mengyao Liao, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li, Bingzhe Wu
2025-06-20
safety
2506.17368v1

From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers

Jingtong Su, Julia Kempe, Karen Ullrich
2025-06-20
red teaming
2506.17052v1

MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Meng Han
2025-06-20
red teaming
2506.16792v1

Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu
2025-06-20
red teaming
2506.16760v1

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li
2025-06-19
red teaming
2506.16447v1

Probing the Robustness of Large Language Models Safety to Latent Perturbations

Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang
2025-06-19
red teaming
2506.16078v1

deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses

Georgios Androutsopoulos, Antonio Bianchi
2025-06-18
safety
2506.15648v1

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong
2025-06-18
safety
2506.15606v1

From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem

Yanxu Mao, Tiehan Cui, Peipei Liu, Datao You, Hongsong Zhu
2025-06-18
red teaming, safety
2506.15170v1

Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar
2025-06-18
red teaming
2506.15751v1

LLM Jailbreak Oracle

Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan
2025-06-17
red teaming
2506.17299v1

Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations

Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das
2025-06-16
red teaming
2506.13901v1

Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions

Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar
2025-06-16
red teaming, safety
2506.13510v2

Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-AI Interactions

Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar
2025-06-16
red teaming, safety
2506.13510v1

Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs

Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Yang Deng, Xiang Wang, Xiangnan He
2025-06-16
safety
2506.13285v1