Paper Library
A collection of AI security research papers
1,169 papers total
January 26 - February 1, 2026
15 papers
FraudShield: Knowledge Graph Empowered Defense for LLMs against Fraud Attacks
Naen Xu, Jinghuai Zhang, Ping He, Chunyi Zhou, Jun Wang, Zhihui Fu, Tianyu Du, Zhaoxiang Wang, Shouling Ji
2026-01-30
safety
2601.22485v1
Jailbreaks on Vision Language Model via Multimodal Reasoning
Aarush Noheria, Yuguang Yao
2026-01-29
red teaming
2601.22398v1
Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy
2026-01-29
red teaming
2601.22313v1
A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
Pedro H. Barcha Correia, Ryan W. Achjian, Diego E. G. Caetano de Oliveira, Ygor Acacio Maria, Victor Takashi Hayashi, Marcos Lopes, Charles Christian Miers, Marcos A. Simplicio
2026-01-29
safety
2601.22240v1
OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
Jarrod Barnes
2026-01-28
red teaming
2601.21083v3
ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack
Xingwei Lin, Wenhao Lin, Sicong Cao, Jiahao Yu, Renke Huang, Lei Xue, Chunming Wu
2026-01-28
red teaming
2601.20903v1
RvB: Automating AI System Hardening via Iterative Red-Blue Games
Lige Huang, Zicheng Liu, Jie Zhang, Lewen Yan, Dongrui Liu, Jing Shao
2026-01-27
red teaming
2601.19726v1
LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment
Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Wenhai Wang
2026-01-27
2601.19487v1
SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks
Nirhoshan Sivaroopan, Kanchana Thilakarathna, Albert Zomaya, Manu, Yi Guo, Jo Plested, Tim Lynar, Jack Yang, Wangli Yang
2026-01-27
safety
2601.19174v1
Proactive Hardening of LLM Defenses with HASTE
Henry Chen, Victor Aranda, Samarth Keshari, Ryan Heartfield, Nicole Nichols
2026-01-27
red teaming
safety
2601.19051v1
Malicious Repurposing of Open Science Artefacts by Using Large Language Models
Zahra Hashemi, Zhiqiang Zhong, Jun Pang, Wei Zhao
2026-01-26
red teaming
2601.18998v1
TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun
2026-01-26
safety
2601.18292v1
Beyond Retention: Orchestrating Structural Safety and Plasticity in Continual Learning for LLMs
Fei Meng
2026-01-26
safety
2601.18255v1
From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants
Krittin Pachtrachai, Petmongkon Pornpichitsuwan, Wachiravit Modecrua, Touchapon Kraisingkorn
2026-01-26
red teaming
2602.15859v1
Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming
Alexandra Chouldechova, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, Hanna Wallach
2026-01-26
red teaming
2601.18076v1
January 19 - January 25, 2026
7 papers
Prompt Injection Evaluations: Refusal Boundary Instability and Artifact-Dependent Compliance in GPT-4-Series Models
Thomas Heverin
2026-01-25
red teaming
2601.17911v1
Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents
Narek Maloyan, Dmitry Namiot
2026-01-24
red teaming
2601.17549v1
Prompt Injection Attacks on Agentic Coding Assistants: A Systematic Analysis of Vulnerabilities in Skills, Tools, and Protocol Ecosystems
Narek Maloyan, Dmitry Namiot
2026-01-24
red teaming
2601.17548v1
Physical Prompt Injection Attacks on Large Vision-Language Models
Chen Ling, Kai Hu, Hangcheng Liu, Xingshuo Han, Tianwei Zhang, Changhai Ou
2026-01-24
red teaming
2601.17383v1
SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment
Xianya Fang, Xianying Luo, Yadong Wang, Xiang Chen, Yu Tian, Zequn Sun, Rui Liu, Jun Fang, Naiqiang Tan, Yuanning Cui, Sheng-Jun Huang
2026-01-23
red teaming
2601.16506v1
Persona Jailbreaking in Large Language Models
Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
2026-01-23
red teaming
2601.16466v1
Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP
Andres Karjus, Kais Allkivi, Silvia Maine, Katarin Leppik, Krister Kruusmaa, Merilin Aruvee
2026-01-22
2601.16314v1