Paper Library
A collection of AI security research papers
1169 papers total
December 01 - December 07, 2025
10 papers
When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models
Afshin Khadangi, Hanna Marxen, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
2025-12-02
2512.04124v2
Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions
Piercosma Bisconti, Marcello Galisai, Federico Pierucci, Marcantonio Bracale, Matteo Prandi
2025-12-02
safety
2512.02682v1
When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents
Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz
2025-12-02
safety
2512.02445v1
COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers
Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He, Junjie Xiong
2025-12-02
safety
2512.02318v2
DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses
Han Luo, Guy Laban
2025-12-01
safety
2512.02282v1
Ensemble Privacy Defense for Knowledge-Intensive LLMs against Membership Inference Attacks
Haowei Fu, Bo Ni, Han Xu, Kunpeng Liu, Dan Lin, Tyler Derr
2025-12-01
safety
2512.03100v1
GRASP: Guided Residual Adapters with Sample-wise Partitioning
Felix Nützel, Mischa Dombrowski, Bernhard Kainz
2025-12-01
2512.01675v1
The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search
Rongzhe Wei, Peizhi Niu, Xinjie Shen, Tony Tu, Yifan Li, Ruihan Wu, Eli Chien, Pin-Yu Chen, Olgica Milenkovic, Pan Li
2025-12-01
red teaming
2512.01353v2
Securing Large Language Models (LLMs) from Prompt Injection Attacks
Omar Farooq Khan Suri, John McCrae
2025-12-01
red teaming
2512.01326v1
DefenSee: Dissecting Threat from Sight and Text - A Multi-View Defensive Pipeline for Multi-modal Jailbreaks
Zihao Wang, Kar Wai Fok, Vrizlynn L. L. Thing
2025-12-01
2512.01185v1
November 24 - November 30, 2025
10 papers
When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan, Ch. Md. Rakin Haider
2025-11-30
safety
2512.01037v1
Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis
Mintong Kang, Chong Xiang, Sanjay Kariyappa, Chaowei Xiao, Bo Li, Edward Suh
2025-11-30
red teaming
2512.00966v1
On the Regulatory Potential of User Interfaces for AI Agent Governance
K. J. Kevin Feng, Tae Soo Kim, Rock Yuren Pang, Faria Huq, Tal August, Amy X. Zhang
2025-11-30
2512.00742v1
Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model
Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Qinglin Lu
2025-11-28
2511.23429v1
Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering
Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen, Xiachong Feng, Bing Qin
2025-11-28
2511.23231v1
Are LLMs Good Safety Agents or a Propaganda Engine?
Neemesh Yadav, Francesco Ortu, Jiarui Liu, Joeun Yook, Bernhard Schölkopf, Rada Mihalcea, Alberto Cazzaniga, Zhijing Jin
2025-11-28
red teaming
safety
2511.23174v1
Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
Richard J. Young
2025-11-27
red teaming
2511.22047v1
Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression
Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong
2025-11-27
red teaming
2511.22044v1
DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models
Mingue Park, Prin Phunyaphibarn, Phillip Y. Lee, Minhyuk Sung
2025-11-26
2511.21415v1
Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang
2025-11-26
2511.21214v2