Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 235 papers total

August 18 - August 24, 2025

4 papers

August 11 - August 17, 2025

15 papers

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi
2025-08-17
safety
2508.12531v1

Mitigating Jailbreaks with Intent-Aware LLMs

Wei Jie Yeo, Ranjan Satapathy, Erik Cambria
2025-08-16
red teaming
2508.12072v1

SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory

Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem
2025-08-15
safety
2508.11290v1

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang
2025-08-15
safety
2508.11222v1

Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
2025-08-14
red teaming
2508.10404v1

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu
2025-08-14
red teaming
2508.10390v1

NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
2025-08-13
safety
2508.09473v1

Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

Aayush Gupta
2025-08-12
red teaming
2508.09288v1

Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams

Zane Witherspoon, Thet Mon Aye, YingYing Hao
2025-08-12
governance
2508.09036v1

Attacks and Defenses Against LLM Fingerprinting

Kevin Kurian, Ethan Holland, Sean Oesch
2025-08-12
safety
2508.09021v1

Securing Educational LLMs: A Generalised Taxonomy of Attacks on LLMs and DREAD Risk Assessment

Farzana Zahid, Anjalika Sewwandi, Lee Brandon, Vimal Kumar, Roopak Sinha
2025-08-12
red teaming
2508.08629v1

When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital

Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng
2025-08-11
safety
2508.08504v1

Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao
2025-08-11
red teaming
2508.09218v1

Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan
2025-08-11
safety
2508.08236v1

Multi-Turn Jailbreaks Are Simpler Than They Seem

Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz
2025-08-11
red teaming
2508.07646v1

August 04 - August 10, 2025

5 papers