Philipp Zimmermann

Paper Library

A collection of AI Security research papers

Showing 1169 papers total

August 04 - August 10, 2025

15 papers

Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
2025-08-08
safety
2508.09190v1

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
2025-08-07
red teaming
2508.05775v2

AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety

Adi Levi, Or Levi, Sardhendu Mishra, Jonathan Morra
2025-08-07
safety
2508.05527v1

JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang
2025-08-07
red teaming
2508.05087v1

Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs)

Iyiola E. Olatunji, Franziska Boenisch, Jing Xu, Adam Dziedzic
2025-08-06
safety
2508.04894v1

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang
2025-08-06
red teaming
2508.04204v1

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil, Hiskias Dingeto, Haon Park
2025-08-06
red teaming
2508.04196v1

Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems

Alistair Reid, Simon O'Callaghan, Liam Carroll, Tiberio Caetano
2025-08-06
governance
2508.05687v1

Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu
2025-08-05
red teaming
2508.03864v1

When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin
2025-08-05
red teaming
2508.03365v1

Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang
2025-08-05
2508.03054v1

CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2025-08-05
red teaming
2508.02997v2

Highlight & Summarize: RAG without the jailbreaks

Giovanni Cherubin, Andrew Paverd
2025-08-04
2508.02872v1

Large Reasoning Models Are Autonomous Jailbreak Agents

Thilo Hagendorff, Erik Derner, Nuria Oliver
2025-08-04
red teaming
2508.04039v1

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha
2025-08-04
2508.02063v1

July 28 - August 03, 2025

7 papers

Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma
2025-08-03
red teaming
2508.01741v1

PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

Yelim Ahn, Jaejin Lee
2025-08-02
red teaming
2508.01306v1

LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks

Francesco Panebianco, Stefano Bonfanti, Francesco Trovò, Michele Carminati
2025-08-01
red teaming, safety
2508.00602v1

Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu
2025-08-01
red teaming
2508.00555v1

Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking

Haoyu Wang, Chris M. Poskitt, Jun Sun, Jiali Wei
2025-08-01
safety
2508.00500v1

A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
2025-07-31
safety
2507.23486v1

Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto
2025-07-31
red teaming
2507.23465v1