Paper Library

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Yipu Dou, Wang Yang

2026-01-16

red teaming

2601.10971v1

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Haoze Guo, Ziqi Wei

2026-01-16

red teaming

2601.10923v2

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Haoze Guo, Ziqi Wei

2026-01-16

red teaming

2601.10923v1

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Hao Wang, Yanting Wang, Hao Li, Rui Li, Lei Sha

2026-01-15

red teaming

2601.10589v1

Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang

2026-01-15

red teaming

2601.10543v1

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey

2026-01-15

red teaming

2601.10387v1

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, Leo Zhang

2026-01-15

red teaming

2601.10338v1

The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

Luoming Hu, Jingjie Zeng, Liang Yang, Hongfei Lin

2026-01-15

2601.10307v1

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao

2026-01-15

red teaming

2601.10173v1

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao

2026-01-15

safety

2601.10156v1

Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen, Lipeng He, Jiachen Ma, Jian Lou, Dan Li, Jian Liu, Xiaohu Yang, Ruoxi Jia

2026-01-15

red teaming safety

2601.10141v1

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, Yiren Zhao

2026-01-14

2601.09923v2

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Robert Mullins, Tom Blanchard, Nicolas Papernot, Kristina Nikolić, Florian Tramèr, Ilia Shumailov, Cheng Zhang, Yiren Zhao

2026-01-14

2601.09923v1

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

Oleg Brodt, Elad Feldman, Bruce Schneier, Ben Nassi

2026-01-14

red teaming

2601.09625v2

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multi-Step Malware

Ben Nassi, Bruce Schneier, Oleg Brodt

2026-01-14

red teaming

2601.09625v1

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

Zhiyi Mou, Jingyuan Yang, Zeheng Qian, Wangze Ni, Tianfang Xiao, Ning Liu, Chen Zhang, Zhan Qin, Kui Ren

2026-01-14

red teaming

2601.09321v1

BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

Erin Feiglin, Nir Hutnik, Raz Lapid

2026-01-13

2601.08490v1

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang

2026-01-13

2601.08441v1

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

Can Jin, Rui Wu, Tong Che, Qixin Zhang, Hongwu Peng, Jiahui Zhao, Zhenting Wang, Wenqi Wei, Ligong Han, Zhao Zhang, Yuan Cao, Ruixiang Tang, Dimitris N. Metaxas

2026-01-12

safety

2601.08000v1

SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations

Mohammed Himayath Ali, Mohammed Aqib Abdullah, Mohammed Mudassir Uddin, Shahnawaz Alam

2026-01-12

red teaming

2601.07835v1

When Bots Take the Bait: Exposing and Mitigating the Emerging Social Engineering Attack in Web Automation Agent

Xinyi Wu, Geng Hong, Yueyue Chen, MingXuan Liu, Feier Jin, Xudong Pan, Jiarun Dai, Baojun Liu

2026-01-12

2601.07263v1

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni, Hao Li, Charith Peris, Chaowei Xiao, Yue Zhao

2026-01-12

red teaming

2601.07185v1

MacPrompt: Maraconic-guided Jailbreak against Text-to-Image Models

Xi Ye, Yiwen Liu, Lina Wang, Run Wang, Geying Yang, Yufei Hou, Jiayi Yu

2026-01-12

red teaming

2601.07141v1

January 12 - January 18, 2026

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Understanding and Preserving Safety in Fine-Tuned LLMs

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multi-Step Malware

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations

When Bots Take the Bait: Exposing and Mitigating the Emerging Social Engineering Attack in Web Automation Agent

Defenses Against Prompt Attacks Learn Surface Heuristics

MacPrompt: Maraconic-guided Jailbreak against Text-to-Image Models

January 05 - January 11, 2026

When Should We Introduce Safety Interventions During Pretraining?