Philipp Zimmermann
← Back to Newsletter

Paper Library

Collection of AI Security research papers

Showing 947 papers total

January 12 - January 18, 2026

20 papers

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Haoze Guo, Ziqi Wei
2026-01-16
red teaming
2601.10923v2

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Haoze Guo, Ziqi Wei
2026-01-16
red teaming
2601.10923v1

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Hao Wang, Yanting Wang, Hao Li, Rui Li, Lei Sha
2026-01-15
red teaming
2601.10589v1

Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
2026-01-15
red teaming
2601.10543v1

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey
2026-01-15
red teaming
2601.10387v1

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, Leo Zhang
2026-01-15
red teaming
2601.10338v1

The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

Luoming Hu, Jingjie Zeng, Liang Yang, Hongfei Lin
2026-01-15
2601.10307v1

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao
2026-01-15
red teaming
2601.10173v1

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao
2026-01-15
safety
2601.10156v1

Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen, Lipeng He, Jiachen Ma, Jian Lou, Dan Li, Jian Liu, Xiaohu Yang, Ruoxi Jia
2026-01-15
red teaming safety
2601.10141v1

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Robert Mullins, Tom Blanchard, Nicolas Papernot, Kristina Nikolić, Florian Tramèr, Ilia Shumailov, Cheng Zhang, Yiren Zhao
2026-01-14
2601.09923v1

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multi-Step Malware

Ben Nassi, Bruce Schneier, Oleg Brodt
2026-01-14
red teaming
2601.09625v1

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

Zhiyi Mou, Jingyuan Yang, Zeheng Qian, Wangze Ni, Tianfang Xiao, Ning Liu, Chen Zhang, Zhan Qin, Kui Ren
2026-01-14
red teaming
2601.09321v1

BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

Erin Feiglin, Nir Hutnik, Raz Lapid
2026-01-13
2601.08490v1

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang
2026-01-13
2601.08441v1

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

Can Jin, Rui Wu, Tong Che, Qixin Zhang, Hongwu Peng, Jiahui Zhao, Zhenting Wang, Wenqi Wei, Ligong Han, Zhao Zhang, Yuan Cao, Ruixiang Tang, Dimitris N. Metaxas
2026-01-12
safety
2601.08000v1

SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations

Mohammed Himayath Ali, Mohammed Aqib Abdullah, Mohammed Mudassir Uddin, Shahnawaz Alam
2026-01-12
red teaming
2601.07835v1

When Bots Take the Bait: Exposing and Mitigating the Emerging Social Engineering Attack in Web Automation Agent

Xinyi Wu, Geng Hong, Yueyue Chen, MingXuan Liu, Feier Jin, Xudong Pan, Jiarun Dai, Baojun Liu
2026-01-12
2601.07263v1

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni, Hao Li, Charith Peris, Chaowei Xiao, Yue Zhao
2026-01-12
red teaming
2601.07185v1

MacPrompt: Maraconic-guided Jailbreak against Text-to-Image Models

Xi Ye, Yiwen Liu, Lina Wang, Run Wang, Geying Yang, Yufei Hou, Jiayi Yu
2026-01-12
red teaming
2601.07141v1

January 05 - January 11, 2026

4 papers