Philipp Zimmermann

Paper Library

A collection of AI security research papers

1331 papers total

February 02 - February 08, 2026

23 papers

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Jafar Isbarov, Murat Kantarcioglu
2026-02-04
red teaming
2602.05066v1

Trust The Typical

Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
2026-02-04
2602.04581v1

LLM as a Risk Manager: LLM Semantic Filtering for Lead-Lag Trading in Prediction Markets

Sumin Kim, Minjae Kim, Jihoon Kwon, Yoon Kim, Nicole Kagan, Joo Won Lee, Oscar Levy, Alejandro Lopez-Lira, Yongjae Lee, Chanyeol Choi
2026-02-04
2602.07048v1

The Supportiveness-Safety Tradeoff in LLM Well-Being Agents

Himanshi Lalwani, Hanan Salam
2026-02-04
safety
2602.04487v1

RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang
2026-02-04
red teaming
2602.04448v1

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang
2026-02-04
red teaming, safety
2602.04294v1

RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu
2026-02-04
red teaming
2602.04224v1

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua
2026-02-04
2602.04196v1

WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong
2026-02-03
red teaming
2602.03792v1

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong, Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho
2026-02-03
red teaming
2602.04896v1

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li
2026-02-03
red teaming
2602.03402v1

Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan
2026-02-03
red teaming
2602.03265v1

Is It Possible to Make Chatbots Virtuous? Investigating a Virtue-Based Design Methodology Applied to LLMs

Matthew P. Lad, Louisa Conwill, Megan Levis Scheirer
2026-02-03
2602.03155v1

AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System

Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang, Chaowei Xiao
2026-02-03
red teaming
2602.03117v2

Monotonicity as an Architectural Bias for Robust Language Models

Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez
2026-02-02
2602.02686v1

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Samuel Nellessen, Tal Kachman
2026-02-02
red teaming
2602.02395v1

vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM

Ching-Yun Ko, Pin-Yu Chen
2026-02-02
2603.06588v1

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

Zeming Wei, Zhixin Zhang, Chengcan Wu, Yihao Zhang, Xiaokun Luan, Meng Sun
2026-02-02
red teaming, safety
2602.02280v1

Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron

Sicheng Shen, Mingyang Lv, Han Shen, Jialin Wu, Binghao Wang, Zhou Yang, Guobin Shen, Dongcheng Zhao, Feifei Zhao, Yi Zeng
2026-02-02
safety
2602.02027v1

Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework

Alsharif Abuadbba, Nazatul Sultan, Surya Nepal, Sanjay Jha
2026-02-02
2602.01942v1

Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models

Siqi Wen, Shu Yang, Shaopeng Fu, Jingfeng Zhang, Lijie Hu, Di Wang
2026-02-02
2602.01834v1

RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse

Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok-Yan Lam
2026-02-02
red teaming
2602.01795v1

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu, Yun-Nung Chen
2026-02-02
red teaming, safety
2602.01600v1