Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1169 papers total

February 02 - February 08, 2026

24 papers

Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification

Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing
2026-02-05
red teaming
2602.05535v1

Clouding the Mirror: Stealthy Prompt Injection Attacks Targeting LLM-based Phishing Detection

Takashi Koide, Hiroki Nakano, Daiki Chiba
2026-02-05
red teaming
2602.05484v1

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong
2026-02-05
red teaming
2602.05444v2

Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks

Guangwei Zhang, Jianing Zhu, Cheng Qian, Neil Gong, Rada Mihalcea, Zhaozhuo Xu, Jingrui He, Jiaqi Ma, Yun Huang, Chaowei Xiao, Bo Li, Ahmed Abbasi, Dongwon Lee, Heng Ji, Denghui Zhang
2026-02-05
2602.05252v2

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Jafar Isbarov, Murat Kantarcioglu
2026-02-04
red teaming
2602.05066v2

Trust The Typical

Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
2026-02-04
2602.04581v1

LLM as a Risk Manager: LLM Semantic Filtering for Lead-Lag Trading in Prediction Markets

Sumin Kim, Minjae Kim, Jihoon Kwon, Yoon Kim, Nicole Kagan, Joo Won Lee, Oscar Levy, Alejandro Lopez-Lira, Yongjae Lee, Chanyeol Choi
2026-02-04
2602.07048v1

The Supportiveness-Safety Tradeoff in LLM Well-Being Agents

Himanshi Lalwani, Hanan Salam
2026-02-04
safety
2602.04487v1

RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang
2026-02-04
red teaming
2602.04448v1

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang
2026-02-04
red teaming, safety
2602.04294v1

RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu
2026-02-04
red teaming
2602.04224v1

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua
2026-02-04
2602.04196v1

WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong
2026-02-03
red teaming
2602.03792v1

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong, Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho
2026-02-03
red teaming
2602.04896v1

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li
2026-02-03
red teaming
2602.03402v1

Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan
2026-02-03
red teaming
2602.03265v1

Is It Possible to Make Chatbots Virtuous? Investigating a Virtue-Based Design Methodology Applied to LLMs

Matthew P. Lad, Louisa Conwill, Megan Levis Scheirer
2026-02-03
2602.03155v1

AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System

Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang, Chaowei Xiao
2026-02-03
red teaming
2602.03117v2

Monotonicity as an Architectural Bias for Robust Language Models

Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez
2026-02-02
2602.02686v1

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Samuel Nellessen, Tal Kachman
2026-02-02
red teaming
2602.02395v1