Paper Library
A collection of AI security research papers
Showing 770 papers total
November 24 - November 30, 2025
10 papers
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang
2025-11-24
red teaming
safety
2511.19218v1
Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion
Yu Cui, Yifei Liu, Hang Fu, Sicheng Pan, Haibin Zhang, Cong Zuo, Licheng Wang
2025-11-24
red teaming
2511.19171v1
Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs
Andrew Maranhão Ventura D'addario
2025-11-24
safety
2511.21757v1
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, Wen Jiang
2025-11-24
2511.19009v1
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Ryan Wong, Hosea David Yu Fei Ng, Dhananjai Sharma, Glenn Jun Jie Ng, Kavishvaran Srinivasan
2025-11-24
2511.18933v1
BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models
Juncheng Li, Yige Li, Hanxun Huang, Yunhao Chen, Xin Wang, Yixu Wang, Xingjun Ma, Yu-Gang Jiang
2025-11-24
2511.18921v1
EAGER: Edge-Aligned LLM Defense for Robust, Efficient, and Accurate Cybersecurity Question Answering
Onat Gungor, Roshan Sood, Jiasheng Zhou, Tajana Rosing
2025-11-24
safety
2511.19523v1
RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation
Benyamin Tafreshian
2025-11-24
red teaming
2511.18790v1
Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM
Adarsh Kumarappan, Ayushi Mehrotra
2025-11-24
red teaming
2511.18721v1
Automating Deception: Scalable Multi-Turn LLM Jailbreaks
Adarsh Kumarappan, Ananya Mujoo
2025-11-24
red teaming
2511.19517v1
November 17 - November 23, 2025
13 papers
Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification
Yanxi Li, Ruocheng Shan
2025-11-23
red teaming
2511.21752v1
TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization
Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia
2025-11-23
red teaming
2511.18581v2
Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems
Xiaoqing Wang, Keman Huang, Bin Liang, Hongyu Li, Xiaoyong Du
2025-11-23
safety
2511.18467v1
Z-Space: A Multi-Agent Tool Orchestration Framework for Enterprise-Grade LLM Automation
Qingsong He, Jing Nan, Jiayu Jiao, Liangjie Tang, Xiaodong Xu, Mengmeng Sun, Qingyao Wang, Minghui Yan
2025-11-23
2511.19483v1
Curvature-Aware Safety Restoration In LLMs Fine-Tuning
Thong Bach, Thanh Nguyen-Tang, Dung Nguyen, Thao Minh Le, Truyen Tran
2025-11-22
safety
2511.18039v1
Building Browser Agents: Architecture, Security, and Practical Solutions
Aram Vardanyan
2025-11-22
2511.19477v1
Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries
Yunyi Zhang, Shibo Cui, Baojun Liu, Jingkai Yu, Min Zhang, Fan Shi, Han Zheng
2025-11-22
2511.17874v1
Evaluating Adversarial Vulnerabilities in Modern Large Language Models
Tom Perel
2025-11-21
red teaming
2511.17666v1
The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks
Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng, Xinghua Li, Jianfeng Ma
2025-11-20
red teaming
2511.16347v1
"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios
Zhen Sun, Zongmin Zhang, Deqi Liang, Han Sun, Yule Liu, Yun Shen, Xiangshan Gao, Yilong Yang, Shuai Liu, Yutao Yue, Xinlei He
2025-11-20
red teaming
2511.16278v1
Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security
Wei Zhao, Zhe Li, Yige Li, Jun Sun
2025-11-20
red teaming
2511.16229v1
An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou
2025-11-20
red teaming
2511.16163v1
D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models
Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka
2025-11-19
2511.15411v1