Paper Library
Collection of AI Security research papers
Showing 1160 papers total
February 23 - March 01, 2026
10 papers
ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang, Jianbo Gao, Zhong Chen, Wei Yang Bryan Lim
2026-02-24
red teaming
2602.20708v1
An LLM-driven Scenario Generation Pipeline Using an Extended Scenic DSL for Autonomous Driving Safety Validation
Fida Khandaker Safa, Yupeng Jiang, Xi Zheng
2026-02-24
safety
2602.20644v1
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko
2026-02-23
red teaming
2602.20156v3
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko
2026-02-23
red teaming
2602.20156v2
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko
2026-02-23
red teaming
2602.20156v1
BarrierSteer: LLM Safety via Learning Barrier Steering
Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao
2026-02-23
safety
2602.20102v1
The LLMbda Calculus: AI Agents, Conversations, and Information Flow
Zac Garby, Andrew D. Gordon, David Sands
2026-02-23
red teaming
2602.20064v1
Beyond the Binary: A nuanced path for open-weight advanced AI
Bengüsu Özcan, Alex Petropoulos, Max Reddel
2026-02-23
2602.19682v1
CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents
Lei Ba, Qinbin Li, Songze Li
2026-02-23
red teaming
2602.19547v1
Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
Amirhossein Farzam, Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro
2026-02-23
red teaming
2602.19396v1
February 16 - February 22, 2026
14 papers
AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems
Emmanuel Bamidele
2026-02-22
2603.04443v1
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary
2026-02-21
safety
2602.18782v1
FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim, Seonghun Jeong, Youngjun Kwak
2026-02-20
red teaming
2602.18154v1
Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models
Manuel Wirth
2026-02-19
red teaming
2602.18514v1
Fail-Closed Alignment for Large Language Models
Zachary Coalson, Beth Sohler, Aiden Gabriel, Sanghyun Hong
2026-02-19
red teaming
2602.16977v1
Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents
Arnold Cartagena, Ariane Teixeira
2026-02-18
safety
2602.16943v1
DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs
Justin Albrethsen, Yash Datta, Kunal Kumar, Sharath Rajasekar
2026-02-18
red teaming
2602.16935v1
NeST: Neuron Selective Tuning for LLM Safety
Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami, Ahmad-Reza Sadeghi
2026-02-18
safety
2602.16835v1
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak, Sanchari Chowdhuri
2026-02-18
red teaming
2602.16832v1
Policy Compiler for Secure Agentic Systems
Nils Palumbo, Sarthak Choudhary, Jihye Choi, Prasad Chalasani, Somesh Jha
2026-02-18
2602.16708v2
Policy Compiler for Secure Agentic Systems
Nils Palumbo, Sarthak Choudhary, Jihye Choi, Prasad Chalasani, Mihai Christodorescu, Somesh Jha
2026-02-18
2602.16708v1
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai
2026-02-18
safety
2602.16660v1
Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Doron Shavit
2026-02-18
red teaming
2602.16520v1
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut
2026-02-18
red teaming
2602.16346v2