Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1169 papers total

September 01 - September 07, 2025

6 papers

Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety

Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Bräunl, Jin B. Hong
2025-09-02
safety
2509.02163v1

Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue
2025-09-02
2509.01909v3

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Chongwen Zhao, Kaizhu Huang
2025-09-01
red teaming, safety
2509.01631v1

Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

Shiji Zhao, Ranjie Duan, Jiexi Liu, Xiaojun Jia, Fengxiang Wang, Cheng Wei, Ruoxi Cheng, Yong Xie, Chang Liu, Qing Guo, Jialing Tao, Hui Xue, Xingxing Wei
2025-09-01
red teaming, safety
2509.01444v1

LLM-empowered Agents Simulation Framework for Scenario Generation in Service Ecosystem Governance

Deyu Zhou, Yuqi Hou, Xiao Xue, Xudong Lu, Qingzhong Li, Lizhen Cui
2025-09-01
governance
2509.01441v1

Web Fraud Attacks Against LLM-Driven Multi-Agent Systems

Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, Meng Han
2025-09-01
2509.01211v1

August 25 - August 31, 2025

15 papers

The Resurgence of GCG Adversarial Attacks on Large Language Models

Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu
2025-08-30
red teaming
2509.00391v1

Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang
2025-08-30
2509.00373v1

Cybersecurity AI: Hacking the AI Hackers via Prompt Injection

Víctor Mayoral-Vilches, Per Mannermaa Rynning
2025-08-29
red teaming
2508.21669v1

PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance

Mengxiao Wang, Yuxuan Zhang, Guofei Gu
2025-08-28
red teaming
2508.20890v2

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang
2025-08-28
red teaming
2508.20848v1

From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality of LLM-Generated Behavioural Specifications from Food-Safety Regulations

Shabnam Hassani, Mehrdad Sabetzadeh, Daniel Amyot
2025-08-28
safety
2508.20744v1

Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
2025-08-28
2508.20570v1

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
2025-08-28
red teaming
2508.20325v1

IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
2025-08-27
red teaming
2508.20151v1

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
2025-08-27
red teaming, safety
2508.20038v2

Pruning Strategies for Backdoor Defense in LLMs

Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
2025-08-27
safety
2508.20032v1

Evaluating Language Model Reasoning about Confidential Information

Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter
2025-08-27
red teaming
2508.19980v1

AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee, Chi-An Fu, Hung-yi Lee
2025-08-27
red teaming
2509.00088v2

Safety Alignment Should Be Made More Than Just A Few Attention Heads

Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu
2025-08-27
red teaming
2508.19697v1

On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Haozhe Jiang, Nika Haghtalab
2025-08-26
red teaming
2508.19445v1