Paper Library
A collection of AI security research papers
1,169 papers total
February 02 - February 08, 2026
10 papers
vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM
Ching-Yun Ko, Pin-Yu Chen
2026-02-02
2603.06588v1
RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
Zeming Wei, Zhixin Zhang, Chengcan Wu, Yihao Zhang, Xiaokun Luan, Meng Sun
2026-02-02
red teaming
safety
2602.02280v1
Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron
Sicheng Shen, Mingyang Lv, Han Shen, Jialin Wu, Binghao Wang, Zhou Yang, Guobin Shen, Dongcheng Zhao, Feifei Zhao, Yi Zeng
2026-02-02
safety
2602.02027v1
Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework
Alsharif Abuadbba, Nazatul Sultan, Surya Nepal, Sanjay Jha
2026-02-02
2602.01942v1
Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models
Siqi Wen, Shu Yang, Shaopeng Fu, Jingfeng Zhang, Lijie Hu, Di Wang
2026-02-02
2602.01834v1
RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse
Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok-Yan Lam
2026-02-02
red teaming
2602.01795v1
Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu, Yun-Nung Chen
2026-02-02
red teaming
safety
2602.01600v1
Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment
Zehua Cheng, Jianwei Yang, Wei Dai, Jiahao Sun
2026-02-02
red teaming
safety
2602.01587v1
MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety
Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
2026-02-02
safety
2602.01539v2
MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety
Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
2026-02-02
safety
2602.01539v1
January 26 - February 01, 2026
14 papers
Context Dependence and Reliability in Autoregressive Language Models
Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen
2026-02-01
2602.01378v1
Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models
Eliron Rahimi, Elad Hirshel, Rom Himelstein, Amit LeVi, Avi Mendelson, Chaim Baskin
2026-02-01
2602.02600v1
SMCP: Secure Model Context Protocol
Xinyi Hou, Shenao Wang, Yifan Zhang, Ziluo Xue, Yanjie Zhao, Cai Fu, Haoyu Wang
2026-02-01
2602.01129v1
Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang
2026-02-01
red teaming
2602.01025v1
A Causal Perspective for Enhancing Jailbreak Attack and Defense
Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren
2026-01-31
red teaming
2602.04893v1
Bypassing Prompt Injection Detectors through Evasive Injections
Md Jahedur Rahman, Ihsen Alouani
2026-01-31
red teaming
2602.00750v1
Jailbreaking LLMs via Calibration
Yuxuan Lu, Yongkang Guo, Yuqing Kong
2026-01-31
red teaming
2602.00619v1
Text is All You Need for Vision-Language Model Jailbreaking
Yihang Chen, Zhao Xu, Youyuan Jiang, Tianle Zheng, Cho-Jui Hsieh
2026-01-31
red teaming
2602.00420v1
A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode
Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu
2026-01-30
red teaming
safety
2602.00388v1
Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models
Ye Yu, Haibo Jin, Yaoning Yu, Jun Zhuang, Haohan Wang
2026-01-30
red teaming
2601.23255v1
Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures
Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, Jie Zhang
2026-01-30
red teaming
2601.23081v1
The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models
Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, Adel Bibi
2026-01-30
red teaming
2602.02557v1
Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling
Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao
2026-01-30
red teaming
2601.22636v1
Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection
Tanusree Debi, Wentian Zhu
2026-01-30
red teaming
2601.22569v1