Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 235 papers total

August 04 - August 10, 2025

21 papers

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach

Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
2025-08-08
red teaming
2508.09201v1

LLM Robustness Leaderboard v1 --Technical report

Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
2025-08-08
red teaming
2508.06296v2

LLM Robustness Leaderboard v1 --Technical report

Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
2025-08-08
red teaming
2508.06296v1

Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation

Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan
2025-08-08
red teaming
2508.06194v1

Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
2025-08-08
safety
2508.09190v1

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
2025-08-07
red teaming
2508.05775v2

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
2025-08-07
red teaming
2508.05775v1

AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety

Adi Levi, Or Levi, Sardhendu Mishra, Jonathan Morra
2025-08-07
safety
2508.05527v1

JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang
2025-08-07
red teaming
2508.05087v1

Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs)

Iyiola E. Olatunji, Franziska Boenisch, Jing Xu, Adam Dziedzic
2025-08-06
safety
2508.04894v1

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang
2025-08-06
red teaming
2508.04204v1

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil, Hiskias Dingeto, Haon Park
2025-08-06
red teaming
2508.04196v1

Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems

Alistair Reid, Simon O'Callaghan, Liam Carroll, Tiberio Caetano
2025-08-06
governance
2508.05687v1

Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu
2025-08-05
red teaming
2508.03864v1

When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin
2025-08-05
red teaming
2508.03365v1

Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang
2025-08-05
2508.03054v1

CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2025-08-05
red teaming
2508.02997v2

CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2025-08-05
red teaming
2508.02997v1

Highlight & Summarize: RAG without the jailbreaks

Giovanni Cherubin, Andrew Paverd
2025-08-04
2508.02872v1

Large Reasoning Models Are Autonomous Jailbreak Agents

Thilo Hagendorff, Erik Derner, Nuria Oliver
2025-08-04
red teaming
2508.04039v1

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha
2025-08-04
2508.02063v1

July 28 - August 03, 2025

3 papers