Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1331 papers total

February 09 - February 15, 2026

15 papers

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
2026-02-11
red teaming
2602.11096v1

Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

J Alex Corll
2026-02-11
red teaming
2602.11247v2

Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System

Zhenhua Zou, Sheng Guo, Qiuyang Zhan, Lepeng Zhao, Shuo Li, Qi Li, Ke Xu, Mingwei Xu, Zhuotao Liu
2026-02-11
2602.10915v3

When Skills Lie: Hidden-Comment Injection in LLM Agents

Qianli Wang, Boyang Ma, Minghui Xu, Yue Zhang
2026-02-11
red teaming
2602.10498v1

Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI

Mohan Rajagopalan, Vinay Rao
2026-02-11
2602.10481v1

The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis

Peiran Wang, Xinfeng Li, Chong Xiang, Jinghuai Zhang, Ying Li, Lixia Zhang, Xiaofeng Wang, Yuan Tian
2026-02-11
red teaming
2602.10453v1

Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen
2026-02-10
2602.13321v1

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang
2026-02-10
2602.10179v1

Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

Hayfa Dhabhi, Kashyap Thimmaraju
2026-02-10
red teaming, safety
2602.09629v1

Autonomous Action Runtime Management (AARM): A System Specification for Securing AI-Driven Actions at Runtime

Herman Errico
2026-02-10
2602.09433v1

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea
2026-02-09
red teaming
2602.09222v1

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun
2026-02-09
2602.08995v1

Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

Yu Yan, Sheng Sun, Shengjia Cheng, Teli Liu, Mingfeng Li, Min Liu
2026-02-09
red teaming
2602.10148v1

Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Jona te Lintelo, Lichao Wu, Stjepan Picek
2026-02-09
red teaming
2602.08741v1

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang
2026-02-09
safety
2602.08621v1

February 02 - February 08, 2026

6 papers