Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

November 24 - November 30, 2025

10 papers

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang
2025-11-24
red teaming safety
2511.19218v1

Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion

Yu Cui, Yifei Liu, Hang Fu, Sicheng Pan, Haibin Zhang, Cong Zuo, Licheng Wang
2025-11-24
red teaming
2511.19171v1

Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs

Andrew Maranhão Ventura D'addario
2025-11-24
safety
2511.21757v1

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation

Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, Wen Jiang
2025-11-24
2511.19009v1

Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations

Ryan Wong, Hosea David Yu Fei Ng, Dhananjai Sharma, Glenn Jun Jie Ng, Kavishvaran Srinivasan
2025-11-24
2511.18933v1

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Juncheng Li, Yige Li, Hanxun Huang, Yunhao Chen, Xin Wang, Yixu Wang, Xingjun Ma, Yu-Gang Jiang
2025-11-24
2511.18921v1

EAGER: Edge-Aligned LLM Defense for Robust, Efficient, and Accurate Cybersecurity Question Answering

Onat Gungor, Roshan Sood, Jiasheng Zhou, Tajana Rosing
2025-11-24
safety
2511.19523v1

RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation

Benyamin Tafreshian
2025-11-24
red teaming
2511.18790v1

Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Adarsh Kumarappan, Ayushi Mehrotra
2025-11-24
red teaming
2511.18721v1

Automating Deception: Scalable Multi-Turn LLM Jailbreaks

Adarsh Kumarappan, Ananya Mujoo
2025-11-24
red teaming
2511.19517v1

November 17 - November 23, 2025

13 papers

Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

Yanxi Li, Ruocheng Shan
2025-11-23
red teaming
2511.21752v1

TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization

Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia
2025-11-23
red teaming
2511.18581v2

Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems

Xiaoqing Wang, Keman Huang, Bin Liang, Hongyu Li, Xiaoyong Du
2025-11-23
safety
2511.18467v1

Z-Space: A Multi-Agent Tool Orchestration Framework for Enterprise-Grade LLM Automation

Qingsong He, Jing Nan, Jiayu Jiao, Liangjie Tang, Xiaodong Xu, Mengmeng Sun, Qingyao Wang, Minghui Yan
2025-11-23
2511.19483v1

Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Thong Bach, Thanh Nguyen-Tang, Dung Nguyen, Thao Minh Le, Truyen Tran
2025-11-22
safety
2511.18039v1

Building Browser Agents: Architecture, Security, and Practical Solutions

Aram Vardanyan
2025-11-22
2511.19477v1

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries

Yunyi Zhang, Shibo Cui, Baojun Liu, Jingkai Yu, Min Zhang, Fan Shi, Han Zheng
2025-11-22
2511.17874v1

Evaluating Adversarial Vulnerabilities in Modern Large Language Models

Tom Perel
2025-11-21
red teaming
2511.17666v1

The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng, Xinghua Li, Jianfeng Ma
2025-11-20
red teaming
2511.16347v1

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang, Han Sun, Yule Liu, Yun Shen, Xiangshan Gao, Yilong Yang, Shuai Liu, Yutao Yue, Xinlei He
2025-11-20
red teaming
2511.16278v1

Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security

Wei Zhao, Zhe Li, Yige Li, Jun Sun
2025-11-20
red teaming
2511.16229v1

An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou
2025-11-20
red teaming
2511.16163v1

D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models

Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka
2025-11-19
2511.15411v1