Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

September 01 - September 07, 2025

2 papers

August 25 - August 31, 2025

18 papers

The Resurgence of GCG Adversarial Attacks on Large Language Models

Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu
2025-08-30
red teaming
2509.00391v1

Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang
2025-08-30
2509.00373v1

Cybersecurity AI: Hacking the AI Hackers via Prompt Injection

Víctor Mayoral-Vilches, Per Mannermaa Rynning
2025-08-29
red teaming
2508.21669v1

PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance

Mengxiao Wang, Yuxuan Zhang, Guofei Gu
2025-08-28
red teaming
2508.20890v2

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang
2025-08-28
red teaming
2508.20848v1

From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality of LLM-Generated Behavioural Specifications from Food-Safety Regulations

Shabnam Hassani, Mehrdad Sabetzadeh, Daniel Amyot
2025-08-28
safety
2508.20744v1

Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
2025-08-28
2508.20570v1

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
2025-08-28
red teaming
2508.20325v1

IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
2025-08-27
red teaming
2508.20151v1

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
2025-08-27
red teaming, safety
2508.20038v2

Pruning Strategies for Backdoor Defense in LLMs

Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
2025-08-27
safety
2508.20032v1

Evaluating Language Model Reasoning about Confidential Information

Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter
2025-08-27
red teaming
2508.19980v1

AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee, Chi-An Fu, Hung-yi Lee
2025-08-27
red teaming
2509.00088v2

Safety Alignment Should Be Made More Than Just A Few Attention Heads

Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu
2025-08-27
red teaming
2508.19697v1

On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Haozhe Jiang, Nika Haghtalab
2025-08-26
red teaming
2508.19445v1

Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu
2025-08-25
red teaming
2508.19292v1

Speculative Safety-Aware Decoding

Xuekang Wang, Shengyu Zhu, Xueqi Cheng
2025-08-25
red teaming
2508.17739v1

August 18 - August 24, 2025

4 papers