Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1331 papers total

March 09 - March 15, 2026

13 papers

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan
2026-03-11
red teaming
2603.11331v1

Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran
2026-03-11
red teaming
2603.11149v1

WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang
2026-03-11
red teaming
2603.11132v1

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Fabrizio Dimino, Bhaskarjit Sarmah, Stefano Pasquali
2026-03-11
red teaming
2603.10807v1

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Yu He, Haozhe Zhu, Yiming Li, Shuo Shao, Hongwei Yao, Zhihao Liu, Zhan Qin
2026-03-11
red teaming
2603.10749v1

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao
2026-03-11
red teaming
2603.10521v1

Compatibility at a Cost: Systematic Discovery and Exploitation of MCP Clause-Compliance Vulnerabilities

Nanzi Yang, Weiheng Bai, Kangjie Lu
2026-03-10
red teaming
2603.10163v1

Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference

Fan Yang
2026-03-10
red teaming, safety
2603.10091v1

Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models

Quanchen Zou, Moyang Chen, Zonghao Ying, Wenzhuo Xu, Yisong Xiao, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang
2026-03-10
red teaming
2603.09246v1

Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety

Trent R Northen, Mingxun Wang
2026-03-10
safety
2603.09154v1

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Harry Owiredu-Ashley
2026-03-10
red teaming
2603.10068v1

The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

Yonghong Deng, Zhen Yang, Ping Jian, Xinyue Zhang, Zhongbin Guo, Chengzhi Li
2026-03-09
red teaming
2603.08234v1

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Guangnian Wan, Xinyin Ma, Gongfan Fang, Xinchao Wang
2026-03-09
safety
2603.08104v1

March 02 - March 08, 2026

10 papers

DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang
2026-03-08
2603.07835v1

VoiceSHIELD-Small: Real-Time Malicious Speech Detection and Transcription

Sumit Ranjan, Sugandha Sharma, Ubaid Abbas, Puneeth N Ail
2026-03-08
2603.07708v1

Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

Chenxi Li, Xianggan Liu, Dake Shen, Yaosong Du, Zhibo Yao, Hao Jiang, Linyi Jiang, Chengwei Cao, Jingzhe Zhang, RanYi Peng, Peiling Bai, Xiande Huang
2026-03-08
red teaming
2603.07590v1

Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice

Yuxu Ge
2026-03-07
red teaming
2603.07191v2

Two Frames Matter: A Temporal Attack for Text-to-Video Model Jailbreaking

Moyang Chen, Zonghao Ying, Wenzhuo Xu, Quancheng Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang
2026-03-07
red teaming
2603.07028v1

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren
2026-03-06
2603.05786v1

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen
2026-03-06
red teaming
2603.05773v1

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu, Yi Xie, Shiqian Zhao, Xiaofeng Chen
2026-03-06
red teaming
2603.05772v1

Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

Hiroki Fukui
2026-03-05
safety
2603.04904v1

Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob
2026-03-04
red teaming
2603.04355v1