Paper Library
A collection of AI security research papers
Showing 1172 papers total
October 06 - October 12, 2025
8 papers
AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
Xiaogeng Liu, Chaowei Xiao
2025-10-06
red teaming
2510.05379v2
WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection
Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen
2025-10-06
2510.05305v1
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?
Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste
2025-10-06
red teaming
2510.05244v1
Proactive defense against LLM Jailbreak
Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang
2025-10-06
red teaming
safety
2510.05052v1
Imperceptible Jailbreaking against Large Language Models
Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
2025-10-06
red teaming
2510.05025v1
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo
2025-10-06
red teaming
2510.04885v1
Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers
Santhosh Kumar Ravindran
2025-10-06
red teaming
2510.04528v1
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
2025-10-06
safety
2510.04503v2
September 29 - October 05, 2025
14 papers
Read the Scene, Not the Script: Outcome-Aware Safety for LLMs
Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang
2025-10-05
red teaming
safety
2510.04320v1
VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy
Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo
2025-10-05
red teaming
2510.04261v1
AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
2025-10-05
red teaming
2510.04257v1
COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability
Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao
2025-10-05
2510.04196v1
Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi
2025-10-04
red teaming
2510.03705v1
Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models
Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh
2025-10-03
2510.03520v1
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
2025-10-03
red teaming
2510.03417v1
FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents
Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste
2025-10-03
red teaming
2510.03204v1
Untargeted Jailbreak Attack
Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren
2025-10-03
red teaming
2510.02999v1
External Data Extraction Attacks against Retrieval-Augmented Large Language Models
Yu He, Yifei Chen, Yiming Li, Shuo Shao, Leyi Qi, Boheng Li, Dacheng Tao, Zhan Qin
2025-10-03
red teaming
2510.02964v1
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Zhixin Xie, Xurui Song, Jun Luo
2025-10-03
red teaming
2510.02833v1
VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation
Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, Long T. Le
2025-10-03
safety
2510.05156v1
RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents
Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li
2025-10-02
red teaming
2510.02609v1
UpSafe°C: Upcycling for Controllable Safety in Large Language Models
Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
2025-10-02
2510.02194v1