Paper Library
A collection of AI security research papers
Showing 1331 papers total
March 02 - March 08, 2026
20 papers
When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG
Junchen Li, Chao Qi, Rongzheng Wang, Qizhi Chen, Liang Xu, Di Liang, Bob Simons, Shuang Liang
2026-03-04
safety
2603.03919v1
Mozi: Governed Autonomy for Drug Discovery LLM Agents
He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li, Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, Yu Li
2026-03-04
governance
2603.03655v1
Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection
Yangyang Wei, Yijie Xu, Zhenyuan Li, Xiangmin Shen, Shouling Ji
2026-03-04
red teaming
2603.04469v1
Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
Neha Nagaraja, Lan Zhang, Zhilong Wang, Bo Zhang, Pawan Patil
2026-03-04
red teaming
2603.03637v1
Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
Neha Nagaraja, Hayretdin Bahsi
2026-03-04
red teaming
2603.03633v1
SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
Haochang Hao, Yifan Xu, Xinzhuo Li, Yingqiang Ge, Lu Cheng
2026-03-03
safety
2603.03536v1
Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
2026-03-03
2603.03205v1
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu
2026-03-03
red teaming
2603.03081v1
Contextualized Privacy Defense for LLM Agents
Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang
2026-03-03
safety
2603.02983v1
Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks
Junjie Chu, Xinyue Shen, Ye Leng, Michael Backes, Yun Shen, Yang Zhang
2026-03-03
safety
2603.04459v2
From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma
2026-03-03
red teaming
2603.02675v1
SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety
Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng
2026-03-03
2603.02635v1
AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation
Zhulin Jiang, Zetao Li, Cheng Wang, Ziwen Wang, Chen Xiong
2026-03-03
safety
2603.02542v1
Beyond Detection: Governing GenAI in Academic Peer Review as a Sociotechnical Challenge
Tatiana Chakravorti, Pranav Narayanan Venkit, Sourojit Ghosh, Sarah Rajtmajer
2026-03-02
2603.20214v1
Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots
Huw Day, Adrianna Jezierska, Jessica Woodgate
2026-03-02
2603.01942v1
Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution
Guoxin Shi, Haoyu Wang, Zaihui Yang, Yuxing Wang, Yongzhe Chang
2026-03-02
red teaming
2603.01784v1
DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern
Xiaoyi Pang, Xuanyi Hao, Pengyu Liu, Qi Luo, Song Guo, Zhibo Wang
2026-03-02
red teaming
2603.01574v1
Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision
Manisha Mukherjee, Vincent J. Hellendoorn
2026-03-02
safety
2603.01494v1
Jailbreaking Embodied LLMs via Action-level Manipulation
Xinyu Huang, Qiang Yang, Leming Shen, Zijing Ma, Yuanqing Zheng
2026-03-02
red teaming
2603.01414v1
February 23 - March 01, 2026
4 papers
JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
Masahiro Kaneko, Ayana Niwa, Timothy Baldwin
2026-03-01
red teaming
2603.01291v1
Tracking Capabilities for Safer Agents
Martin Odersky, Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham
2026-03-01
2603.00991v1
MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs
Yilian Liu, Xiaojun Jia, Guoshun Nan, Jiuyang Lyu, Zhican Chen, Tao Guan, Shuyuan Luo, Zhongyi Zhai, Yang Liu
2026-02-28
red teaming
2603.00565v1
From Goals to Aspects, Revisited: An NFR Pattern Language for Agentic AI Systems
Yijun Yu
2026-02-28
2603.00472v1