Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

August 04 - August 10, 2025

10 papers

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang
2025-08-06
red teaming
2508.04204v1

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil, Hiskias Dingeto, Haon Park
2025-08-06
red teaming
2508.04196v1

Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems

Alistair Reid, Simon O'Callaghan, Liam Carroll, Tiberio Caetano
2025-08-06
governance
2508.05687v1

Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu
2025-08-05
red teaming
2508.03864v1

When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin
2025-08-05
red teaming
2508.03365v1

Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang
2025-08-05
2508.03054v1

CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2025-08-05
red teaming
2508.02997v2

Highlight & Summarize: RAG without the jailbreaks

Giovanni Cherubin, Andrew Paverd
2025-08-04
2508.02872v1

Large Reasoning Models Are Autonomous Jailbreak Agents

Thilo Hagendorff, Erik Derner, Nuria Oliver
2025-08-04
red teaming
2508.04039v1

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha
2025-08-04
2508.02063v1

July 28 - August 03, 2025

13 papers

Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma
2025-08-03
red teaming
2508.01741v1

PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

Yelim Ahn, Jaejin Lee
2025-08-02
red teaming
2508.01306v1

LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks

Francesco Panebianco, Stefano Bonfanti, Francesco Trovò, Michele Carminati
2025-08-01
red teaming safety
2508.00602v1

Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu
2025-08-01
red teaming
2508.00555v1

Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking

Haoyu Wang, Chris M. Poskitt, Jun Sun, Jiali Wei
2025-08-01
safety
2508.00500v1

A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
2025-07-31
safety
2507.23486v1

Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto
2025-07-31
red teaming
2507.23465v1

Measuring Harmfulness of Computer-Using Agents

Aaron Xuxiang Tian, Ruofan Zhang, Janet Tang, Jiaxin Wen
2025-07-31
red teaming
2508.00935v1

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
2025-07-30
red teaming safety
2507.22564v1

Promoting Online Safety by Simulating Unsafe Conversations with LLMs

Owen Hoffman, Kangze Peng, Zehua You, Sajid Kamal, Sukrit Venkatagiri
2025-07-29
safety
2507.22267v1

Strategic Deflection: Defending LLMs from Logit Manipulation

Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni
2025-07-29
red teaming
2507.22160v1

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security

Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li
2025-07-29
red teaming
2507.22037v1

Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is

Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda
2025-07-29
red teaming
2507.21820v1