Paper Library
A collection of AI security research papers
1169 papers total
February 16 - February 22, 2026
16 papers
NeST: Neuron Selective Tuning for LLM Safety
Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami, Ahmad-Reza Sadeghi
2026-02-18
safety
2602.16835v1
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak, Sanchari Chowdhuri
2026-02-18
red teaming
2602.16832v1
Policy Compiler for Secure Agentic Systems
Nils Palumbo, Sarthak Choudhary, Jihye Choi, Prasad Chalasani, Somesh Jha
2026-02-18
2602.16708v2
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai
2026-02-18
safety
2602.16660v1
Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Doron Shavit
2026-02-18
red teaming
2602.16520v1
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut
2026-02-18
red teaming
2602.16346v2
The Vulnerability of LLM Rankers to Prompt Injection Attacks
Yu Yin, Shuai Wang, Bevan Koopman, Guido Zuccon
2026-02-18
red teaming
2602.16752v1
Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis
Scott Thornton
2026-02-18
safety
2602.16741v1
Intent Laundering: AI Safety Datasets Are Not What They Seem
Shahriar Golchin, Marc Wetter
2026-02-17
red teaming
2602.16729v1
Boundary Point Jailbreaking of Black-Box LLMs
Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, Yarin Gal
2026-02-16
red teaming
2602.15001v2
Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks
Lukas Struppek, Adam Gleave, Kellin Pelrine
2026-02-16
red teaming
2602.14689v1
Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models
In Chong Choi, Jiacheng Zhang, Feng Liu, Yiliao Song
2026-02-16
red teaming
2602.14399v1
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)
Tianyu Chen, Dongrui Liu, Xia Hu, Jingyi Yu, Wenjie Wang
2026-02-16
2602.14364v1
February 09 - February 15, 2026
8 papers
SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement
Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, Philip Torr
2026-02-15
red teaming
2602.14211v1
When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
Max Fomin
2026-02-15
red teaming
2602.14161v1
AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks
Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, Neil Gong
2026-02-14
red teaming
2602.13597v2
Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning
Yanbo Wang, Minzheng Wang, Jian Liang, Lu Wang, Yongcan Yu, Ran He
2026-02-14
safety
2602.13562v1
AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks
Weiming Song, Xuan Xie, Ruiping Yin
2026-02-14
2602.13547v1
OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage
Akshat Naik, Jay Culligan, Yarin Gal, Philip Torr, Rahaf Aljundi, Alasdair Paren, Adel Bibi
2026-02-13
red teaming
2602.13477v2