Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1169 papers total

December 01 - December 07, 2025

10 papers

When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

Afshin Khadangi, Hanna Marxen, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
2025-12-02
2512.04124v2

Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions

Piercosma Bisconti, Marcello Galisai, Federico Pierucci, Marcantonio Bracale, Matteo Prandi
2025-12-02
safety
2512.02682v1

When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz
2025-12-02
safety
2512.02445v1

COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He, Junjie Xiong
2025-12-02
safety
2512.02318v2

DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses

Han Luo, Guy Laban
2025-12-01
safety
2512.02282v1

Ensemble Privacy Defense for Knowledge-Intensive LLMs against Membership Inference Attacks

Haowei Fu, Bo Ni, Han Xu, Kunpeng Liu, Dan Lin, Tyler Derr
2025-12-01
safety
2512.03100v1

GRASP: Guided Residual Adapters with Sample-wise Partitioning

Felix Nützel, Mischa Dombrowski, Bernhard Kainz
2025-12-01
2512.01675v1

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Rongzhe Wei, Peizhi Niu, Xinjie Shen, Tony Tu, Yifan Li, Ruihan Wu, Eli Chien, Pin-Yu Chen, Olgica Milenkovic, Pan Li
2025-12-01
red teaming
2512.01353v2

Securing Large Language Models (LLMs) from Prompt Injection Attacks

Omar Farooq Khan Suri, John McCrae
2025-12-01
red teaming
2512.01326v1

DefenSee: Dissecting Threat from Sight and Text - A Multi-View Defensive Pipeline for Multi-modal Jailbreaks

Zihao Wang, Kar Wai Fok, Vrizlynn L. L. Thing
2025-12-01
2512.01185v1

November 24 - November 30, 2025

10 papers

When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals

Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan, Ch. Md. Rakin Haider
2025-11-30
safety
2512.01037v1

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Mintong Kang, Chong Xiang, Sanjay Kariyappa, Chaowei Xiao, Bo Li, Edward Suh
2025-11-30
red teaming
2512.00966v1

On the Regulatory Potential of User Interfaces for AI Agent Governance

K. J. Kevin Feng, Tae Soo Kim, Rock Yuren Pang, Faria Huq, Tal August, Amy X. Zhang
2025-11-30
2512.00742v1

Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model

Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Qinglin Lu
2025-11-28
2511.23429v1

Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering

Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen, Xiachong Feng, Bing Qin
2025-11-28
2511.23231v1

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu, Joeun Yook, Bernhard Schölkopf, Rada Mihalcea, Alberto Cazzaniga, Zhijing Jin
2025-11-28
red teaming safety
2511.23174v1

Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks

Richard J. Young
2025-11-27
red teaming
2511.22047v1

Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong
2025-11-27
red teaming
2511.22044v1

DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models

Mingue Park, Prin Phunyaphibarn, Phillip Y. Lee, Minhyuk Sung
2025-11-26
2511.21415v1

Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang
2025-11-26
2511.21214v2