Paper Library
A collection of AI security research papers
Showing 1331 papers total
February 09 - February 15, 2026
15 papers
Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
2026-02-11
red teaming
2602.11096v1
Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection
J Alex Corll
2026-02-11
red teaming
2602.11247v2
Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System
Zhenhua Zou, Sheng Guo, Qiuyang Zhan, Lepeng Zhao, Shuo Li, Qi Li, Ke Xu, Mingwei Xu, Zhuotao Liu
2026-02-11
2602.10915v3
When Skills Lie: Hidden-Comment Injection in LLM Agents
Qianli Wang, Boyang Ma, Minghui Xu, Yue Zhang
2026-02-11
red teaming
2602.10498v1
Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI
Mohan Rajagopalan, Vinay Rao
2026-02-11
2602.10481v1
The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis
Peiran Wang, Xinfeng Li, Chong Xiang, Jinghuai Zhang, Ying Li, Lixia Zhang, Xiaofeng Wang, Yuan Tian
2026-02-11
red teaming
2602.10453v1
Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction
Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen
2026-02-10
2602.13321v1
When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang
2026-02-10
2602.10179v1
Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
Hayfa Dhabhi, Kashyap Thimmaraju
2026-02-10
red teaming
safety
2602.09629v1
Autonomous Action Runtime Management (AARM): A System Specification for Securing AI-Driven Actions at Runtime
Herman Errico
2026-02-10
2602.09433v1
MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks
Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea
2026-02-09
red teaming
2602.09222v1
When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun
2026-02-09
2602.08995v1
Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks
Yu Yan, Sheng Sun, Shengjia Cheng, Teli Liu, Mingfeng Li, Min Liu
2026-02-09
red teaming
2602.10148v1
Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing
Jona te Lintelo, Lichao Wu, Stjepan Picek
2026-02-09
red teaming
2602.08741v1
Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs
Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang
2026-02-09
safety
2602.08621v1
February 02 - February 08, 2026
5 papers
Robustness of Vision Language Models Against Split-Image Harmful Input Attacks
Md Rafi Ur Rashid, MD Sadik Hossain Shanto, Vishnu Asutosh Dasu, Shagufta Mehnaz
2026-02-08
red teaming
2602.08136v1
Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation
Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini
2026-02-08
red teaming
2602.08062v1
Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
Krzysztof Wróbel, Jan Maria Kowalski, Jerzy Surma, Igor Ciuciura, Maciej Szymański
2026-02-08
safety
2602.07954v2
CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution
Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, Tomas Pfister
2026-02-08
red teaming
2602.07918v1
AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management
Ruoyao Wen, Hao Li, Chaowei Xiao, Ning Zhang
2026-02-07
red teaming
2602.07398v1