Paper Library
A collection of AI security research papers
1160 papers total
March 02 - March 08, 2026
23 papers
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren
2026-03-06
2603.05786v1
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models
Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen
2026-03-06
red teaming
2603.05773v1
Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads
Jinman Wu, Yi Xie, Shiqian Zhao, Xiaofeng Chen
2026-03-06
red teaming
2603.05772v1
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Hiroki Fukui
2026-03-05
safety
2603.04904v1
Efficient Refusal Ablation in LLM through Optimal Transport
Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob
2026-03-04
red teaming
2603.04355v1
When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG
Junchen Li, Chao Qi, Rongzheng Wang, Qizhi Chen, Liang Xu, Di Liang, Bob Simons, Shuang Liang
2026-03-04
safety
2603.03919v1
Mozi: Governed Autonomy for Drug Discovery LLM Agents
He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li, Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, Yu Li
2026-03-04
governance
2603.03655v1
Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection
Yangyang Wei, Yijie Xu, Zhenyuan Li, Xiangmin Shen, Shouling Ji
2026-03-04
red teaming
2603.04469v1
Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions
Neha Nagaraja, Lan Zhang, Zhilong Wang, Bo Zhang, Pawan Patil
2026-03-04
red teaming
2603.03637v1
Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study
Neha Nagaraja, Hayretdin Bahsi
2026-03-04
red teaming
2603.03633v1
SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems
Haochang Hao, Yifan Xu, Xinzhuo Li, Yingqiang Ge, Lu Cheng
2026-03-03
safety
2603.03536v1
Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
2026-03-03
2603.03205v1
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu
2026-03-03
red teaming
2603.03081v1
Contextualized Privacy Defense for LLM Agents
Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang
2026-03-03
safety
2603.02983v1
Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks
Junjie Chu, Xinyue Shen, Ye Leng, Michael Backes, Yun Shen, Yang Zhang
2026-03-03
safety
2603.04459v1
From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma
2026-03-03
red teaming
2603.02675v1
SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety
Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng
2026-03-03
2603.02635v1
AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation
Zhulin Jiang, Zetao Li, Cheng Wang, Ziwen Wang, Chen Xiong
2026-03-03
safety
2603.02542v1
Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots
Huw Day, Adrianna Jezierska, Jessica Woodgate
2026-03-02
2603.01942v1
Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution
Guoxin Shi, Haoyu Wang, Zaihui Yang, Yuxing Wang, Yongzhe Chang
2026-03-02
red teaming
2603.01784v1
DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern
Xiaoyi Pang, Xuanyi Hao, Pengyu Liu, Qi Luo, Song Guo, Zhibo Wang
2026-03-02
red teaming
2603.01574v1
Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision
Manisha Mukherjee, Vincent J. Hellendoorn
2026-03-02
safety
2603.01494v1
Jailbreaking Embodied LLMs via Action-level Manipulation
Xinyu Huang, Qiang Yang, Leming Shen, Zijing Ma, Yuanqing Zheng
2026-03-02
red teaming
2603.01414v1
February 23 - March 01, 2026
1 paper
JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
Masahiro Kaneko, Ayana Niwa, Timothy Baldwin
2026-03-01
red teaming
2603.01291v1