Paper Library
A collection of AI security research papers
Showing 1331 papers total
February 09 - February 15, 2026
15 papers
Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
2026-02-11
red teaming
2602.11096v1
Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection
J Alex Corll
2026-02-11
red teaming
2602.11247v2
Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System
Zhenhua Zou, Sheng Guo, Qiuyang Zhan, Lepeng Zhao, Shuo Li, Qi Li, Ke Xu, Mingwei Xu, Zhuotao Liu
2026-02-11
2602.10915v3
When Skills Lie: Hidden-Comment Injection in LLM Agents
Qianli Wang, Boyang Ma, Minghui Xu, Yue Zhang
2026-02-11
red teaming
2602.10498v1
Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI
Mohan Rajagopalan, Vinay Rao
2026-02-11
2602.10481v1
The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis
Peiran Wang, Xinfeng Li, Chong Xiang, Jinghuai Zhang, Ying Li, Lixia Zhang, Xiaofeng Wang, Yuan Tian
2026-02-11
red teaming
2602.10453v1
Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction
Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen
2026-02-10
2602.13321v1
When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang
2026-02-10
2602.10179v1
Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
Hayfa Dhabhi, Kashyap Thimmaraju
2026-02-10
red teaming
safety
2602.09629v1
Autonomous Action Runtime Management (AARM): A System Specification for Securing AI-Driven Actions at Runtime
Herman Errico
2026-02-10
2602.09433v1
MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks
Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea
2026-02-09
red teaming
2602.09222v1
When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun
2026-02-09
2602.08995v1
Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks
Yu Yan, Sheng Sun, Shengjia Cheng, Teli Liu, Mingfeng Li, Min Liu
2026-02-09
red teaming
2602.10148v1
Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing
Jona te Lintelo, Lichao Wu, Stjepan Picek
2026-02-09
red teaming
2602.08741v1
Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs
Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang
2026-02-09
safety
2602.08621v1
February 02 - February 08, 2026
5 papers
Robustness of Vision Language Models Against Split-Image Harmful Input Attacks
Md Rafi Ur Rashid, MD Sadik Hossain Shanto, Vishnu Asutosh Dasu, Shagufta Mehnaz
2026-02-08
red teaming
2602.08136v1
Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation
Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini
2026-02-08
red teaming
2602.08062v1
Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
Krzysztof Wróbel, Jan Maria Kowalski, Jerzy Surma, Igor Ciuciura, Maciej Szymański
2026-02-08
safety
2602.07954v2
CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution
Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, Tomas Pfister
2026-02-08
red teaming
2602.07918v1
AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management
Ruoyao Wen, Hao Li, Chaowei Xiao, Ning Zhang
2026-02-07
red teaming
2602.07398v1