Paper Library
A collection of AI security research papers
Showing 1169 papers total
February 09 - February 15, 2026
24 papers
Buy versus Build an LLM: A Decision Framework for Governments
Jiahao Lu, Ziwei Xu, William Tjhi, Junnan Li, Antoine Bosselut, Pang Wei Koh, Mohan Kankanhalli
2026-02-13
governance
2602.13033v1
Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas
2026-02-12
red teaming
2602.12418v1
Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
Keita Broadwater
2026-02-12
safety
2602.11786v1
Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs
Dong Yan, Jian Liang, Ran He, Tieniu Tan
2026-02-12
safety
2602.11528v1
Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2026-02-12
red teaming
2602.11495v1
Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety
Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa
2026-02-11
safety
2602.11444v1
Optimizing Agent Planning for Security and Autonomy
Aashish Kolluri, Rishi Sharma, Manuel Costa, Boris Köpf, Tobias Nießen, Mark Russinovich, Shruti Tople, Santiago Zanella-Béguelin
2026-02-11
2602.11416v1
Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
2026-02-11
red teaming
2602.11096v1
Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection
J Alex Corll
2026-02-11
red teaming
2602.11247v2
Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System
Zhenhua Zou, Sheng Guo, Qiuyang Zhan, Lepeng Zhao, Shuo Li, Qi Li, Ke Xu, Mingwei Xu, Zhuotao Liu
2026-02-11
2602.10915v3
When Skills Lie: Hidden-Comment Injection in LLM Agents
Qianli Wang, Boyang Ma, Minghui Xu, Yue Zhang
2026-02-11
red teaming
2602.10498v1
Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI
Mohan Rajagopalan, Vinay Rao
2026-02-11
2602.10481v1
The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis
Peiran Wang, Xinfeng Li, Chong Xiang, Jinghuai Zhang, Ying Li, Lixia Zhang, Xiaofeng Wang, Yuan Tian
2026-02-11
red teaming
2602.10453v1
Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction
Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen
2026-02-10
2602.13321v1
When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang
2026-02-10
2602.10179v1
Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
Hayfa Dhabhi, Kashyap Thimmaraju
2026-02-10
red teaming
safety
2602.09629v1
Autonomous Action Runtime Management (AARM): A System Specification for Securing AI-Driven Actions at Runtime
Herman Errico
2026-02-10
2602.09433v1
MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks
Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea
2026-02-09
red teaming
2602.09222v1
When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun
2026-02-09
2602.08995v1
Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks
Yu Yan, Sheng Sun, Shengjia Cheng, Teli Liu, Mingfeng Li, Min Liu
2026-02-09
red teaming
2602.10148v1
Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing
Jona te Lintelo, Lichao Wu, Stjepan Picek
2026-02-09
red teaming
2602.08741v1