Paper Library
A collection of AI security research papers
1331 papers total
February 02 - February 08, 2026
24 papers
Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
Jafar Isbarov, Murat Kantarcioglu
2026-02-04
red teaming
2602.05066v1
Trust The Typical
Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
2026-02-04
2602.04581v1
LLM as a Risk Manager: LLM Semantic Filtering for Lead-Lag Trading in Prediction Markets
Sumin Kim, Minjae Kim, Jihoon Kwon, Yoon Kim, Nicole Kagan, Joo Won Lee, Oscar Levy, Alejandro Lopez-Lira, Yongjae Lee, Chanyeol Choi
2026-02-04
2602.07048v1
The Supportiveness-Safety Tradeoff in LLM Well-Being Agents
Himanshi Lalwani, Hanan Salam
2026-02-04
safety
2602.04487v1
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models
Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang
2026-02-04
red teaming
2602.04448v1
How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks
Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang
2026-02-04
red teaming
safety
2602.04294v1
RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu
2026-02-04
red teaming
2602.04224v1
The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment
Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua
2026-02-04
2602.04196v1
WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong
2026-02-03
red teaming
2602.03792v1
Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models
Chen Xiong, Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho
2026-02-03
red teaming
2602.04896v1
Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility
Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li
2026-02-03
red teaming
2602.03402v1
Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models
Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan
2026-02-03
red teaming
2602.03265v1
Is It Possible to Make Chatbots Virtuous? Investigating a Virtue-Based Design Methodology Applied to LLMs
Matthew P. Lad, Louisa Conwill, Megan Levis Scheirer
2026-02-03
2602.03155v1
AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System
Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang, Chaowei Xiao
2026-02-03
red teaming
2602.03117v2
Monotonicity as an Architectural Bias for Robust Language Models
Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez
2026-02-02
2602.02686v1
David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning
Samuel Nellessen, Tal Kachman
2026-02-02
red teaming
2602.02395v1
vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM
Ching-Yun Ko, Pin-Yu Chen
2026-02-02
2603.06588v1
RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
Zeming Wei, Zhixin Zhang, Chengcan Wu, Yihao Zhang, Xiaokun Luan, Meng Sun
2026-02-02
red teaming
safety
2602.02280v1
Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron
Sicheng Shen, Mingyang Lv, Han Shen, Jialin Wu, Binghao Wang, Zhou Yang, Guobin Shen, Dongcheng Zhao, Feifei Zhao, Yi Zeng
2026-02-02
safety
2602.02027v1
Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework
Alsharif Abuadbba, Nazatul Sultan, Surya Nepal, Sanjay Jha
2026-02-02
2602.01942v1
Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models
Siqi Wen, Shu Yang, Shaopeng Fu, Jingfeng Zhang, Lijie Hu, Di Wang
2026-02-02
2602.01834v1
RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse
Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok-Yan Lam
2026-02-02
red teaming
2602.01795v1
Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu, Yun-Nung Chen
2026-02-02
red teaming
safety
2602.01600v1