Philipp Zimmermann

Paper Library

A collection of AI security research papers

Showing 1331 papers total

January 19 - January 25, 2026

5 papers

January 12 - January 18, 2026

19 papers

Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents

Arunkumar V, Gangadharan G. R., Rajkumar Buyya
2026-01-18
2601.12560v1

TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning

Zhixin Xie, Xurui Song, Jun Luo
2026-01-18
red teaming
2601.12460v1

AgenTRIM: Tool Risk Mitigation for Agentic AI

Roy Betser, Shamik Bose, Amit Giloni, Chiara Picardi, Sindhu Padakandla, Roman Vainshtein
2026-01-18
2601.12449v1

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

Anirudh Sekar, Mrinal Agarwal, Rachel Sharma, Akitsugu Tanaka, Jasmine Zhang, Arjun Damerla, Kevin Zhu
2026-01-18
red teaming, safety
2601.12359v1

DriveSafe: A Hierarchical Risk Taxonomy for Safety-Critical LLM-Based Driving Assistants

Abhishek Kumar, Riya Tapwal, Carsten Maple
2026-01-17
safety
2601.12138v1

Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection

Muhammad Alif Al Hakim, Alfan Farizki Wicaksono, Fajri Koto
2026-01-17
safety
2601.12033v1

Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence

Kaijie Mo, Siddhartha Venkatayogi, Chantal Shaib, Ramez Kouzy, Wei Xu, Byron C. Wallace, Junyi Jessy Li
2026-01-17
safety
2601.11886v1

Guardrails for trust, safety, and ethical development and deployment of Large Language Models (LLM)

Anjanava Biswas, Wrick Talukdar
2026-01-16
safety
2601.14298v1

Building Production-Ready Probes For Gemini

János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy
2026-01-16
red teaming
2601.11516v1

Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs

Marcantonio Bracale Syrnikov, Federico Pierucci, Marcello Galisai, Matteo Prandi, Piercosma Bisconti, Francesco Giarrusso, Olga Sorokoletova, Vincenzo Suriani, Daniele Nardi
2026-01-16
governance
2601.11369v2
SD-RAG: A Prompt-Injection-Resilient Framework for Selective Disclosure in Retrieval-Augmented Generation

Aiman Al Masoud, Marco Arazzi, Antonino Nocera
2026-01-16
2601.11199v1

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Yipu Dou, Wang Yang
2026-01-16
red teaming
2601.10971v1

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Haoze Guo, Ziqi Wei
2026-01-16
red teaming
2601.10923v2
Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Hao Wang, Yanting Wang, Hao Li, Rui Li, Lei Sha
2026-01-15
red teaming
2601.10589v1

Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
2026-01-15
red teaming
2601.10543v1

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey
2026-01-15
red teaming
2601.10387v1

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, Leo Zhang
2026-01-15
red teaming
2601.10338v1