Paper Library
A collection of AI security research papers
1331 papers total
February 16 - February 22, 2026
8 papers
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut
2026-02-18
red teaming
2602.16346v1
The Vulnerability of LLM Rankers to Prompt Injection Attacks
Yu Yin, Shuai Wang, Bevan Koopman, Guido Zuccon
2026-02-18
red teaming
2602.16752v1
Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis
Scott Thornton
2026-02-18
safety
2602.16741v1
Intent Laundering: AI Safety Datasets Are Not What They Seem
Shahriar Golchin, Marc Wetter
2026-02-17
red teaming
2602.16729v1
Boundary Point Jailbreaking of Black-Box LLMs
Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, Yarin Gal
2026-02-16
red teaming
2602.15001v2
Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks
Lukas Struppek, Adam Gleave, Kellin Pelrine
2026-02-16
red teaming
2602.14689v1
Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models
In Chong Choi, Jiacheng Zhang, Feng Liu, Yiliao Song
2026-02-16
red teaming
2602.14399v1
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)
Tianyu Chen, Dongrui Liu, Xia Hu, Jingyi Yu, Wenjie Wang
2026-02-16
2602.14364v1
February 09 - February 15, 2026
13 papers
SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement
Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, Philip Torr
2026-02-15
red teaming
2602.14211v1
When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
Max Fomin
2026-02-15
red teaming
2602.14161v1
AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks
Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, Neil Gong
2026-02-14
red teaming
2602.13597v2
Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning
Yanbo Wang, Minzheng Wang, Jian Liang, Lu Wang, Yongcan Yu, Ran He
2026-02-14
safety
2602.13562v1
AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks
Weiming Song, Xuan Xie, Ruiping Yin
2026-02-14
2602.13547v1
OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage
Akshat Naik, Jay Culligan, Yarin Gal, Philip Torr, Rahaf Aljundi, Alasdair Paren, Adel Bibi
2026-02-13
red teaming
2602.13477v2
Buy versus Build an LLM: A Decision Framework for Governments
Jiahao Lu, Ziwei Xu, William Tjhi, Junnan Li, Antoine Bosselut, Pang Wei Koh, Mohan Kankanhalli
2026-02-13
governance
2602.13033v1
Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas
2026-02-12
red teaming
2602.12418v1
Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
Keita Broadwater
2026-02-12
safety
2602.11786v1
Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs
Dong Yan, Jian Liang, Ran He, Tieniu Tan
2026-02-12
safety
2602.11528v1
Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2026-02-12
red teaming
2602.11495v1
Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety
Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa
2026-02-11
safety
2602.11444v1
Optimizing Agent Planning for Security and Autonomy
Aashish Kolluri, Rishi Sharma, Manuel Costa, Boris Köpf, Tobias Nießen, Mark Russinovich, Shruti Tople, Santiago Zanella-Béguelin
2026-02-11
2602.11416v1