Philipp Zimmermann

Paper Library

A collection of AI security research papers

97 papers total

June 02 - June 08, 2025

4 papers

May 26 - June 01, 2025

20 papers

Predicting Empirical AI Research Outcomes with Language Models

Jiaxin Wen, Chenglei Si, Yueh-han Chen, He He, Shi Feng
2025-06-01
2506.00794v1

Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning

Weiyang Guo, Zesheng Shi, Zhuo Li, Yequan Wang, Xuebo Liu, Wenya Wang, Fangming Liu, Min Zhang, Jing Li
2025-06-01
red teaming
2506.00782v1

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
2025-06-01
red teaming
2506.00781v1

SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien
2025-06-01
safety
2506.04250v1

SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning

Saad Hossain, Samanvay Vajpayee, Sirisha Rambhatla
2025-05-31
safety
2506.00676v1

AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents

Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam
2025-05-31
safety
2506.00641v1

Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

Jiahui Geng, Thy Thy Tran, Preslav Nakov, Iryna Gurevych
2025-05-31
red teaming
2506.00548v1

Wide Reflective Equilibrium in LLM Alignment: Bridging Moral Epistemology and AI Safety

Matthew Brophy
2025-05-31
safety
2506.00415v1

TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

Xiaorui Wu, Xiaofeng Mao, Fei Li, Xin Zhang, Xuanhong Li, Chong Teng, Donghong Ji, Zhuang Li
2025-05-30
red teaming
2505.24672v1

Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization

Utsav Maskey, Chencheng Zhu, Usman Naseem
2025-05-30
red teaming
2505.24621v1

AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders

Yuqi Zhang, Yuchun Miao, Zuchao Li, Liang Ding
2025-05-30
2505.24519v1

Model Unlearning via Sparse Autoencoder Subspace Guided Projections

Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou
2025-05-30
2505.24428v1

From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models

Haibo Jin, Peiyan Zhang, Peiran Wang, Man Luo, Haohan Wang
2025-05-30
red teaming
2505.24232v1

Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap

Wenhan Yang, Spencer Stice, Ali Payani, Baharan Mirzasoleiman
2025-05-30
safety
2505.24208v1

The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It

Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, Julia Kreutzer
2025-05-30
safety
2505.24119v1

Understanding Refusal in Language Models with Sparse Autoencoders

Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Roy Ka-Wei Lee, Erik Cambria, Ranjan Satapathy
2025-05-29
red teaming
2505.23556v1

Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models

Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin
2025-05-29
red teaming
2505.23404v1

Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment

Krti Tallam, Emma Miller
2025-05-28
safety
2505.22852v1

GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance

Zaixi Zhang, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang
2025-05-28
2505.23839v1

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

Jiaxin Song, Yixu Wang, Jie Li, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang
2025-05-26
red teaming
2505.19610v2