Philipp Zimmermann

Paper Library

A collection of AI security research papers

1331 papers total

January 26 - February 01, 2026

10 papers

ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack

Xingwei Lin, Wenhao Lin, Sicong Cao, Jiahao Yu, Renke Huang, Lei Xue, Chunming Wu
2026-01-28
red teaming
2601.20903v1

RvB: Automating AI System Hardening via Iterative Red-Blue Games

Lige Huang, Zicheng Liu, Jie Zhang, Lewen Yan, Dongrui Liu, Jing Shao
2026-01-27
red teaming
2601.19726v1

LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Wenhai Wang
2026-01-27
2601.19487v1

SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

Nirhoshan Sivaroopan, Kanchana Thilakarathna, Albert Zomaya, Manu, Yi Guo, Jo Plested, Tim Lynar, Jack Yang, Wangli Yang
2026-01-27
safety
2601.19174v1

Proactive Hardening of LLM Defenses with HASTE

Henry Chen, Victor Aranda, Samarth Keshari, Ryan Heartfield, Nicole Nichols
2026-01-27
red teaming, safety
2601.19051v1

Malicious Repurposing of Open Science Artefacts by Using Large Language Models

Zahra Hashemi, Zhiqiang Zhong, Jun Pang, Wei Zhao
2026-01-26
red teaming
2601.18998v1

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun
2026-01-26
safety
2601.18292v1

Beyond Retention: Orchestrating Structural Safety and Plasticity in Continual Learning for LLMs

Fei Meng
2026-01-26
safety
2601.18255v1

From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants

Krittin Pachtrachai, Petmongkon Pornpichitsuwan, Wachiravit Modecrua, Touchapon Kraisingkorn
2026-01-26
red teaming
2602.15859v1

Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Alexandra Chouldechova, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, Hanna Wallach
2026-01-26
red teaming
2601.18076v1

January 19 - January 25, 2026

14 papers

Prompt Injection Evaluations: Refusal Boundary Instability and Artifact-Dependent Compliance in GPT-4-Series Models

Thomas Heverin
2026-01-25
red teaming
2601.17911v1

Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents

Narek Maloyan, Dmitry Namiot
2026-01-24
red teaming
2601.17549v1

Prompt Injection Attacks on Agentic Coding Assistants: A Systematic Analysis of Vulnerabilities in Skills, Tools, and Protocol Ecosystems

Narek Maloyan, Dmitry Namiot
2026-01-24
red teaming
2601.17548v1

Physical Prompt Injection Attacks on Large Vision-Language Models

Chen Ling, Kai Hu, Hangcheng Liu, Xingshuo Han, Tianwei Zhang, Changhai Ou
2026-01-24
red teaming
2601.17383v1

SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment

Xianya Fang, Xianying Luo, Yadong Wang, Xiang Chen, Yu Tian, Zequn Sun, Rui Liu, Jun Fang, Naiqiang Tan, Yuanning Cui, Sheng-Jun Huang
2026-01-23
red teaming
2601.16506v1

Persona Jailbreaking in Large Language Models

Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
2026-01-23
red teaming
2601.16466v1

Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP

Andres Karjus, Kais Allkivi, Silvia Maine, Katarin Leppik, Krister Kruusmaa, Merilin Aruvee
2026-01-22
2601.16314v1

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

Fengheng Chu, Jiahao Chen, Yuhong Wang, Jun Wang, Zhihui Fu, Shouling Ji, Songze Li
2026-01-22
red teaming
2601.15801v1

Improving Methodologies for LLM Evaluations Across Global Languages

Akriti Vij, Benjamin Chua, Darshini Ramiah, En Qi Ng, Mahran Morsidi, Naga Nikshith Gangarapu, Sharmini Johnson, Vanessa Wilfred, Vikneswaran Kumaran, Wan Sie Lee, Wenzhuo Yang, Yongsen Zheng, Bill Black, Boming Xia, Frank Sun, Hao Zhang, Qinghua Lu, Suyu Ma, Yue Liu, Chi-kiu Lo, Fatemeh Azadi, Isar Nejadgholi, Sowmya Vajjala, Agnes Delaborde, Nicolas Rolin, Tom Seimandi, Akiko Murakami, Haruto Ishi, Satoshi Sekine, Takayuki Semitsu, Tasuku Sasaki, Angela Kinuthia, Jean Wangari, Michael Michie, Stephanie Kasaon, Hankyul Baek, Jaewon Noh, Kihyuk Nam, Sang Seo, Sungpil Shin, Taewhi Lee, Yongsu Kim, Daisy Newbold-Harrop, Jessica Wang, Mahmoud Ghanem, Vy Hong
2026-01-22
2601.15706v1

Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, Sujuan Qin
2026-01-22
red teaming
2601.15698v1

Securing LLM-as-a-Service for Small Businesses: An Industry Case Study of a Distributed Chatbot Deployment Platform

Jiazhu Xie, Bowen Li, Heyu Fu, Chong Gao, Ziqi Xu, Fengling Han
2026-01-21
2601.15528v1

From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness

Linbo Cao, Lihao Sun, Yang Yue
2026-01-21
2602.12285v1

LLM Security and Safety: Insights from Homotopy-Inspired Prompt Obfuscation

Luis Lazo, Hamed Jelodar, Roozbeh Razavi-Far
2026-01-20
safety
2601.14528v1

RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models

Rishit Chugh
2026-01-20
red teaming
2601.15331v1