Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

February 16 - February 22, 2026

9 papers

February 09 - February 15, 2026

13 papers

SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, Philip Torr
2026-02-15
red teaming
2602.14211v1

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Max Fomin
2026-02-15
red teaming
2602.14161v1

AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, Neil Gong
2026-02-14
red teaming
2602.13597v2

Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning

Yanbo Wang, Minzheng Wang, Jian Liang, Lu Wang, Yongcan Yu, Ran He
2026-02-14
safety
2602.13562v1

AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks

Weiming Song, Xuan Xie, Ruiping Yin
2026-02-14
2602.13547v1

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage

Akshat Naik, Jay Culligan, Yarin Gal, Philip Torr, Rahaf Aljundi, Alasdair Paren, Adel Bibi
2026-02-13
red teaming
2602.13477v2

Buy versus Build an LLM: A Decision Framework for Governments

Jiahao Lu, Ziwei Xu, William Tjhi, Junnan Li, Antoine Bosselut, Pang Wei Koh, Mohan Kankanhalli
2026-02-13
governance
2602.13033v1

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas
2026-02-12
red teaming
2602.12418v1

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

Keita Broadwater
2026-02-12
safety
2602.11786v1

Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs

Dong Yan, Jian Liang, Ran He, Tieniu Tan
2026-02-12
safety
2602.11528v1

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2026-02-12
red teaming
2602.11495v1

Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety

Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa
2026-02-11
safety
2602.11444v1

Optimizing Agent Planning for Security and Autonomy

Aashish Kolluri, Rishi Sharma, Manuel Costa, Boris Köpf, Tobias Nießen, Mark Russinovich, Shruti Tople, Santiago Zanella-Béguelin
2026-02-11
2602.11416v1