Paper Library
A collection of AI security research papers
1331 papers total
September 01 - September 07, 2025
2 papers
LLM-empowered Agents Simulation Framework for Scenario Generation in Service Ecosystem Governance
Deyu Zhou, Yuqi Hou, Xiao Xue, Xudong Lu, Qingzhong Li, Lizhen Cui
2025-09-01
governance
2509.01441v1
Web Fraud Attacks Against LLM-Driven Multi-Agent Systems
Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, Meng Han
2025-09-01
2509.01211v1
August 25 - August 31, 2025
18 papers
The Resurgence of GCG Adversarial Attacks on Large Language Models
Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu
2025-08-30
red teaming
2509.00391v1
Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models
Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang
2025-08-30
2509.00373v1
Cybersecurity AI: Hacking the AI Hackers via Prompt Injection
Víctor Mayoral-Vilches, Per Mannermaa Rynning
2025-08-29
red teaming
2508.21669v1
PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance
Mengxiao Wang, Yuxuan Zhang, Guofei Gu
2025-08-28
red teaming
2508.20890v2
JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring
Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang
2025-08-28
red teaming
2508.20848v1
From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality of LLM-Generated Behavioural Specifications from Food-Safety Regulations
Shabnam Hassani, Mehrdad Sabetzadeh, Daniel Amyot
2025-08-28
safety
2508.20744v1
Towards Mechanistic Defenses Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
2025-08-28
2508.20570v1
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
2025-08-28
red teaming
2508.20325v1
IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement
Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
2025-08-27
red teaming
2508.20151v1
Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks
Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
2025-08-27
red teaming
safety
2508.20038v2
Pruning Strategies for Backdoor Defense in LLMs
Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
2025-08-27
safety
2508.20032v1
Evaluating Language Model Reasoning about Confidential Information
Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter
2025-08-27
red teaming
2508.19980v1
AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema
Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee, Chi-An Fu, Hung-yi Lee
2025-08-27
red teaming
2509.00088v2
Safety Alignment Should Be Made More Than Just A Few Attention Heads
Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu
2025-08-27
red teaming
2508.19697v1
On Surjectivity of Neural Networks: Can you elicit any behavior from your model?
Haozhe Jiang, Nika Haghtalab
2025-08-26
red teaming
2508.19445v1
Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience
Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu
2025-08-25
red teaming
2508.19292v1
Speculative Safety-Aware Decoding
Xuekang Wang, Shengyu Zhu, Xueqi Cheng
2025-08-25
red teaming
2508.17739v1
August 18 - August 24, 2025
4 papers
Activation Transport Operators
Andrzej Szablewski, Marek Masiak
2025-08-24
2508.17540v1
Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
2025-08-24
safety
2508.17450v1
Unveiling the Latent Directions of Reflection in Large Language Models
Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu
2025-08-23
red teaming
2508.16989v1
ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-08-23
red teaming
2508.16889v1