Paper Library
A collection of AI security research papers
Showing 1172 papers total
November 03 - November 09, 2025
5 papers
Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models
Daniyal Ganiuly, Assel Smaiyl
2025-11-03
red teaming
2511.01634v2
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo
2025-11-03
2511.01618v1
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Hamin Koo, Minseon Kim, Jaehyung Kim
2025-11-03
red teaming
2511.01375v1
MIQ-SAM3D: From Single-Point Prompt to Multi-Instance Segmentation via Competitive Query Refinement
Jierui Qu, Jianchun Zhao
2025-11-03
2511.01345v1
"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers
Qin Zhou, Zhexin Zhang, Zhi Li, Limin Sun
2025-11-03
red teaming
2511.01287v1
October 27 - November 02, 2025
19 papers
Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?
Berk Atil, Rebecca J. Passonneau, Fred Morstatter
2025-11-01
red teaming
2511.00689v2
ShadowLogic: Backdoors in Any Whitebox LLM
Kasimir Schulz, Amelia Kawasaki, Leo Ring
2025-11-01
red teaming
2511.00664v1
Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack
Peng Ding, Jun Kuang, Wen Sun, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang
2025-11-01
red teaming
safety
2511.00556v1
Reimagining Safety Alignment with An Image
Yifan Xia, Guorui Chen, Wenqian Yu, Zhijiang Li, Philip Torr, Jindong Gu
2025-11-01
2511.00509v1
DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Instruction Fusion
Ruofan Liu, Yun Lin, Zhiyong Huang, Jin Song Dong
2025-11-01
2511.00447v2
DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture
Ruofan Liu, Yun Lin, Jin Song Dong
2025-11-01
red teaming
2511.00447v1
Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs
Mina Taraghi, Yann Pequignot, Amin Nikanjam, Mohamed Amine Merzouk, Foutse Khomh
2025-11-01
safety
2511.00382v1
Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks
Kayua Oleques Paim, Rodrigo Brandao Mansilha, Diego Kreutz, Muriel Figueredo Franco, Weverton Cordeiro
2025-11-01
red teaming
2511.00346v1
Diffusion LLMs are Natural Adversaries for any LLM
David Lüdke, Tom Wollschläger, Paul Ungermann, Stephan Günnemann, Leo Schwinn
2025-10-31
red teaming
2511.00203v1
Prevalence of Security and Privacy Risk-Inducing Usage of AI-based Conversational Agents
Kathrin Grosse, Nico Ebert
2025-10-31
2510.27275v1
Consistency Training Helps Stop Sycophancy and Jailbreaks
Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah
2025-10-31
2510.27062v1
Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar
2025-10-30
red teaming
2511.04694v3
CATCH: A Modular Cross-domain Adaptive Template with Hook
Xinjin Li, Yulie Lu, Jinghan Cao, Yu Ma, Zhenglin Li, Yeyang Zhou
2025-10-30
2510.26582v1
Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token
Shaked Zychlinski, Yuval Kainan
2025-10-30
red teaming
2510.26847v1
Chain-of-Thought Hijacking
Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
2025-10-30
red teaming
2510.26418v1
Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections
David Schmotz, Sahar Abdelnabi, Maksym Andriushchenko
2025-10-30
red teaming
2510.26328v1
ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models
Weifei Jin, Yuxin Cao, Junjie Su, Minhui Xue, Jie Hao, Ke Xu, Jin Song Dong, Derui Wang
2025-10-30
red teaming
2510.26096v1
RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, Zhuo Li
2025-10-29
red teaming
2510.25941v1