Paper Library

ceLLMate: Sandboxing Browser AI Agents

Luoxi Meng, Henry Feng, Ilia Shumailov, Earlence Fernandes

2025-12-14

2512.12594v1

Detecting Prompt Injection Attacks Against Application Using Classifiers

Safwan Shaheer, G. M. Refatul Islam, Mohammad Rafid Hamid, Md. Abrar Faiaz Khan, Md. Omar Faruk, Yaseen Nur

2025-12-14

red teaming

2512.12583v1

Challenges of Evaluating LLM Safety for User Welfare

Manon Kempermann, Sai Suresh Macharla Vasu, Mahalakshmi Raveenthiran, Theo Farrell, Ingmar Weber

2025-12-11

safety

2512.10687v1

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Jahnvi Singh, Vinay Chamola, Yash Sinha, Murari Mandal, Dhruv Kumar

2025-12-11

red teaming

2512.10449v3

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Jahnvi Singh, Vinay Chamola, Yash Sinha, Murari Mandal, Dhruv Kumar

2025-12-11

red teaming

2512.10449v1

How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

Devanshu Sahoo, Vasudev Majhi, Arjun Neekhra, Yash Sinha, Murari Mandal, Dhruv Kumar

2025-12-11

red teaming

2512.10415v1

Phishing Email Detection Using Large Language Models

Najmul Hasan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu, Shaohu Zhang

2025-12-10

red teaming

2512.10104v2

LLM-PEA: Leveraging Large Language Models Against Phishing Email Attacks

Najmul Hassan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu, Shaohu Zhang

2025-12-10

red teaming

2512.10104v1

CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance

Jinru Ding, Chao Ding, Wenrao Pang, Boyi Xiao, Zhiqiang Liu, Pengcheng Chen, Jiayuan Chen, Tiantian Yuan, Junming Guan, Yidong Jiang, Dawei Cheng, Jie Xu

2025-12-10

red teaming

2512.09506v1

Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

Sohely Jahan, Ruimin Sun

2025-12-10

safety

2512.09403v1

ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data

Reachal Wang, Yuqi Jia, Neil Zhenqiang Gong

2025-12-10

red teaming

2512.09321v3

ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data

Ruiqi Wang, Yuqi Jia, Neil Zhenqiang Gong

2025-12-10

red teaming

2512.09321v1

Insured Agents: A Decentralized Trust Insurance Mechanism for Agentic Economy

Botao 'Amber' Hu, Bangdao Chen

2025-12-09

2512.08737v1

Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

Yinan Zhong, Qianhao Miao, Yanjiao Chen, Jiangyi Deng, Yushi Cheng, Wenyuan Xu

2025-12-09

red teaming

2512.08417v2

Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

Yinan Zhong, Qianhao Miao, Yanjiao Chen, Jiangyi Deng, Yushi Cheng, Wenyuan Xu

2025-12-09

red teaming

2512.08417v1

Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem

Shiva Gaire, Srijan Gyawali, Saroj Mishra, Suman Niroula, Dilip Thakur, Umesh Yadav

2025-12-09

red teaming

2512.08290v2

Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem

Shiva Gaire, Srijan Gyawali, Saroj Mishra, Suman Niroula, Dilip Thakur, Umesh Yadav

2025-12-09

red teaming

2512.08290v1

A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties

Jinghao Wang, Ping Zhang, Carter Yagemann

2025-12-09

red teaming

2512.08185v1

RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models

Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fuli Feng, Xiangnan He

2025-12-08

red teaming

2512.07761v1

Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang

2025-12-08

2512.07141v1

December 08 - December 14, 2025

ceLLMate: Sandboxing Browser AI Agents

Detecting Prompt Injection Attacks Against Application Using Classifiers

Challenges of Evaluating LLM Safety for User Welfare

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

Phishing Email Detection Using Large Language Models

LLM-PEA: Leveraging Large Language Models Against Phishing Email Attacks

CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance

Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data

ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data

Insured Agents: A Decentralized Trust Insurance Mechanism for Agentic Economy

Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem

Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem

A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties

RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models

Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

December 01 - December 07, 2025

SoK: Trust-Authorization Mismatch in LLM Agent Interactions

Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents

Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents

RunawayEvil: Jailbreaking the Image-to-Video Generative Models