Paper Library

EASE: Practical and Efficient Safety Alignment for Small Language Models

Haonan Shi, Guoli Wang, Tu Ouyang, An Wang

2025-11-09

red teaming

2511.06512v1

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Zhaoxia Yin

2025-11-09

red teaming

2511.07480v1

Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng

2025-11-09

red teaming safety

2511.06396v1

RelightMaster: Precise Video Relighting with Multi-plane Light Images

Weikang Bian, Xiaoyu Shi, Zhaoyang Huang, Jianhong Bai, Qinghe Wang, Xintao Wang, Pengfei Wan, Kun Gai, Hongsheng Li

2025-11-09

2511.06271v1

RAG-targeted Adversarial Attack on LLM-based Threat Detection and Mitigation Framework

Seif Ikbarieh, Kshitiz Aryal, Maanak Gupta

2025-11-09

red teaming

2511.06212v1

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci

2025-11-08

red teaming

2511.05919v2

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci

2025-11-08

red teaming

2511.05919v1

MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?

Jiayi Fu, Qiyao Sun

2025-11-08

2511.05867v2

When AI Meets the Web: Prompt Injection Risks in Third-Party AI Chatbot Plugins

Yigitcan Kaya, Anton Landerer, Stijn Pletinckx, Michelle Zimmermann, Christopher Kruegel, Giovanni Vigna

2025-11-08

red teaming

2511.05797v1

Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Teqi Hao, Xioayu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu

2025-11-07

2511.05286v1

Large Language Models for Cyber Security

Raunak Somani, Aswani Kumar Cherukuri

2025-11-06

2511.04508v1

AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Tim Beyer, Jonas Dornbusch, Jakob Steimle, Moritz Ladenburger, Leo Schwinn, Stephan Günnemann

2025-11-06

red teaming

2511.04316v1

Secure Code Generation at Scale with Reflexion

Arup Datta, Ahmed Aljohani, Hyunsook Do

2025-11-05

2511.03898v1

Whisper Leak: a side-channel attack on Large Language Models

Geoff McDonald, Jonathan Bar Or

2025-11-05

2511.03675v1

Inter-Agent Trust Models: A Comparative Study of Brief, Claim, Proof, Stake, Reputation and Constraint in Agentic Web Protocol Design-A2A, AP2, ERC-8004, and Beyond

Botao 'Amber' Hu, Helena Rong

2025-11-05

red teaming

2511.03434v1

Let the Bees Find the Weak Spots: A Path Planning Perspective on Multi-Turn Jailbreak Attacks against LLMs

Yize Liu, Yunyun Hou, Aina Sui

2025-11-05

red teaming

2511.03271v1

November 10 - November 16, 2025

Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

Alignment-Aware Quantization for LLM Safety

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Differentiated Directional Intervention A Framework for Evading LLM Safety Alignment

Differentiated Directional Intervention A Framework for Evading LLM Safety Alignment

SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

November 03 - November 09, 2025

EASE: Practical and Efficient Safety Alignment for Small Language Models

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

Efficient LLM Safety Evaluation through Multi-Agent Debate

RelightMaster: Precise Video Relighting with Multi-plane Light Images

RAG-targeted Adversarial Attack on LLM-based Threat Detection and Mitigation Framework

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?

When AI Meets the Web: Prompt Injection Risks in Third-Party AI Chatbot Plugins

Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Large Language Models for Cyber Security

AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Secure Code Generation at Scale with Reflexion

Whisper Leak: a side-channel attack on Large Language Models

Inter-Agent Trust Models: A Comparative Study of Brief, Claim, Proof, Stake, Reputation and Constraint in Agentic Web Protocol Design-A2A, AP2, ERC-8004, and Beyond

Let the Bees Find the Weak Spots: A Path Planning Perspective on Multi-Turn Jailbreak Attacks against LLMs