Paper Library
A collection of AI security research papers
Showing 1331 papers total
February 02 - February 08, 2026
3 papers
Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment
Zehua Cheng, Jianwei Yang, Wei Dai, Jiahao Sun
2026-02-02
red teaming
safety
2602.01587v1
MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety
Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
2026-02-02
safety
2602.01539v2
MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety
Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
2026-02-02
safety
2602.01539v1
January 26 - February 01, 2026
21 papers
Context Dependence and Reliability in Autoregressive Language Models
Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen
2026-02-01
2602.01378v1
Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models
Eliron Rahimi, Elad Hirshel, Rom Himelstein, Amit LeVi, Avi Mendelson, Chaim Baskin
2026-02-01
2602.02600v1
SMCP: Secure Model Context Protocol
Xinyi Hou, Shenao Wang, Yifan Zhang, Ziluo Xue, Yanjie Zhao, Cai Fu, Haoyu Wang
2026-02-01
2602.01129v1
Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang
2026-02-01
red teaming
2602.01025v1
A Causal Perspective for Enhancing Jailbreak Attack and Defense
Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren
2026-01-31
red teaming
2602.04893v1
Bypassing Prompt Injection Detectors through Evasive Injections
Md Jahedur Rahman, Ihsen Alouani
2026-01-31
red teaming
2602.00750v1
Jailbreaking LLMs via Calibration
Yuxuan Lu, Yongkang Guo, Yuqing Kong
2026-01-31
red teaming
2602.00619v1
Text is All You Need for Vision-Language Model Jailbreaking
Yihang Chen, Zhao Xu, Youyuan Jiang, Tianle Zheng, Cho-Jui Hsieh
2026-01-31
red teaming
2602.00420v1
A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode
Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu
2026-01-30
red teaming
safety
2602.00388v1
Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models
Ye Yu, Haibo Jin, Yaoning Yu, Jun Zhuang, Haohan Wang
2026-01-30
red teaming
2601.23255v1
Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures
Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, Jie Zhang
2026-01-30
red teaming
2601.23081v1
The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models
Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, Adel Bibi
2026-01-30
red teaming
2602.02557v1
Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling
Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao
2026-01-30
red teaming
2601.22636v1
Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection
Tanusree Debi, Wentian Zhu
2026-01-30
red teaming
2601.22569v1
FraudShield: Knowledge Graph Empowered Defense for LLMs against Fraud Attacks
Naen Xu, Jinghuai Zhang, Ping He, Chunyi Zhou, Jun Wang, Zhihui Fu, Tianyu Du, Zhaoxiang Wang, Shouling Ji
2026-01-30
safety
2601.22485v1
Jailbreaks on Vision Language Model via Multimodal Reasoning
Aarush Noheria, Yuguang Yao
2026-01-29
red teaming
2601.22398v1
Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy
2026-01-29
red teaming
2601.22313v1
A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
Pedro H. Barcha Correia, Ryan W. Achjian, Diego E. G. Caetano de Oliveira, Ygor Acacio Maria, Victor Takashi Hayashi, Marcos Lopes, Charles Christian Miers, Marcos A. Simplicio
2026-01-29
safety
2601.22240v1
OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
Jarrod Barnes
2026-01-28
red teaming
2601.21083v3
OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
Jarrod Barnes
2026-01-28
red teaming
2601.21083v2
OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
Jarrod Barnes
2026-01-28
red teaming
2601.21083v1