Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

August 11 - August 17, 2025

9 papers

NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
2025-08-13
safety
2508.09473v1

Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

Aayush Gupta
2025-08-12
red teaming
2508.09288v1

Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams

Zane Witherspoon, Thet Mon Aye, YingYing Hao
2025-08-12
governance
2508.09036v1

Attacks and Defenses Against LLM Fingerprinting

Kevin Kurian, Ethan Holland, Sean Oesch
2025-08-12
safety
2508.09021v1

Securing Educational LLMs: A Generalised Taxonomy of Attacks on LLMs and DREAD Risk Assessment

Farzana Zahid, Anjalika Sewwandi, Lee Brandon, Vimal Kumar, Roopak Sinha
2025-08-12
red teaming
2508.08629v1

When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital

Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng
2025-08-11
safety
2508.08504v1

Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao
2025-08-11
red teaming
2508.09218v1

Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan
2025-08-11
safety
2508.08236v1

Multi-Turn Jailbreaks Are Simpler Than They Seem

Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz
2025-08-11
red teaming
2508.07646v1

August 04 - August 10, 2025

15 papers

A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection

Ivan Zhang
2025-08-10
red teaming
2508.07139v1

The Cost of Thinking: Increased Jailbreak Risk in Large Language Models

Fan Yang
2025-08-09
red teaming
2508.10032v1

Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs

Jinhwa Kim, Ian G. Harris
2025-08-09
2508.10031v1

Many-Turn Jailbreaking

Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang
2025-08-09
red teaming
2508.06755v1

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin, Meng Han
2025-08-08
red teaming
2508.10029v1

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach

Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
2025-08-08
red teaming
2508.09201v1

LLM Robustness Leaderboard v1 --Technical report

Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
2025-08-08
red teaming
2508.06296v2

LLM Robustness Leaderboard v1 --Technical report

Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
2025-08-08
red teaming
2508.06296v1

Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation

Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan
2025-08-08
red teaming
2508.06194v1

Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
2025-08-08
safety
2508.09190v1

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
2025-08-07
red teaming
2508.05775v2

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
2025-08-07
red teaming
2508.05775v1

AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety

Adi Levi, Or Levi, Sardhendu Mishra, Jonathan Morra
2025-08-07
safety
2508.05527v1

JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang
2025-08-07
red teaming
2508.05087v1

Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs)

Iyiola E. Olatunji, Franziska Boenisch, Jing Xu, Adam Dziedzic
2025-08-06
safety
2508.04894v1