Paper Library

GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri

2025-12-07

red teaming safety

2512.06655v1

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia, Jie Liao, Qi Guo, Teng Ma, Simeng Qin, Ranjie Duan, Tianlin Li, Yihao Huang, Zhitao Zeng, Dongxian Wu, Yiming Li, Wenqi Ren, Xiaochun Cao, Yang Liu

2025-12-06

red teaming

2512.06589v1

Securing the Model Context Protocol: Defending LLMs Against Tool Poisoning and Adversarial Attacks

Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh, Naser Ezzati-Jivan

2025-12-06

red teaming

2512.06556v1

Metaphor-based Jailbreaking Attacks on Text-to-Image Models

Chenyu Zhang, Yiwen Ma, Lanjun Wang, Wenhui Li, Yi Tu, An-An Liu

2025-12-06

2512.10766v1

Beyond Model Jailbreak: Systematic Dissection of the "Ten DeadlySins" in Embodied Intelligence

Yuhang Huang, Junchao Li, Boyang Ma, Xuelong Dai, Minghui Xu, Kaidi Xu, Yue Zhang, Jianping Wang, Xiuzhen Cheng

2025-12-06

2512.06387v1

Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety

Junyu Mao, Anthony Hills, Talia Tseriotou, Maria Liakata, Aya Shamir, Dan Sayda, Dana Atzil-Slonim, Natalie Djohari, Arpan Mandal, Silke Roth, Pamela Ugwudike, Mahesan Niranjan, Stuart E. Middleton

2025-12-06

safety

2512.06227v1

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu, Jiyang Guan, Ranjie Duan, Jialing Tao, Hui Xue, Xingxing Wei

2025-12-05

red teaming

2512.05853v2

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu, Jiyang Guan, Ranjie Duan, Jialing Tao, Hui Xue, Xingxing Wei

2025-12-05

red teaming

2512.05853v1

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu, Ziqian Zeng, Kehua Zhang, Haoran Li, Huiping Zhuang, Ruidong Wang, Cen Chen, Hao Peng

2025-12-05

2512.05745v1

TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

Xiuyuan Chen, Jian Zhao, Yuxiang He, Yuan Xun, Xinwei Liu, Yanshu Li, Huilin Zhou, Wei Cai, Ziyan Shi, Yuchen Yuan, Tianle Zhang, Chi Zhang, Xuelong Li

2025-12-05

red teaming safety

2512.05485v2

TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

Xiuyuan Chen, Jian Zhao, Yuxiang He, Yuan Xun, Xinwei Liu, Yanshu Li, Huilin Zhou, Wei Cai, Ziyan Shi, Yuchen Yuan, Tianle Zhang, Chi Zhang, Xuelong Li

2025-12-05

red teaming safety

2512.05485v1

Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems

M Zeeshan, Saud Satti

2025-12-04

red teaming

2512.04895v1

STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions

Junjie Fan, Hongye Zhao, Linduo Wei, Jiayu Rao, Guijia Li, Jiaxin Yuan, Wenqi Xu, Yong Qi

2025-12-04

2512.04871v1

SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security

Wei Zhao, Zhe Li, Jun Sun

2025-12-04

red teaming

2512.04841v1

ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications

Eranga Bandara, Amin Hass, Ross Gore, Sachin Shetty, Ravi Mukkamala, Safdar H. Bouk, Xueping Liang, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

2025-12-04

red teaming

2512.04785v1

Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation

Chenlin Xu, Lei Zhang, Lituan Wang, Xinyu Pu, Pengfei Ma, Guangwu Qian, Zizhou Wang, Yan Wang

2025-12-04

2512.04520v1

DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang

2025-12-04

governance

2512.04416v2

GovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang

2025-12-04

governance

2512.04416v1

Executable Governance for AI: Translating Policies into Rules Using LLMs

Gautam Varma Datla, Anudeep Vurity, Tejaswani Dash, Tazeem Ahmad, Mohd Adnan, Saima Rafi

2025-12-04

governance

2512.04408v1

Multi-Scale Visual Prompting for Lightweight Small-Image Classification

Salim Khazem

2025-12-03

2512.03663v1

Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models

Jun Leng, Litian Zhang, Xi Zhang

2025-12-03

red teaming

2512.03356v1

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia

2025-12-02

2512.03041v1

Invasive Context Engineering to Control Large Language Models

Thomas Rivasseau

2025-12-02

2512.03001v1

Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li, Jing Shao

2025-12-02

red teaming

2512.02973v1

December 01 - December 07, 2025

GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Securing the Model Context Protocol: Defending LLMs Against Tool Poisoning and Adversarial Attacks

Metaphor-based Jailbreaking Attacks on Text-to-Image Models

Beyond Model Jailbreak: Systematic Dissection of the "Ten DeadlySins" in Embodied Intelligence

Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems

STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions

SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security

ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications

Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation

DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

GovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

Executable Governance for AI: Translating Policies into Rules Using LLMs

Multi-Scale Visual Prompting for Lightweight Small-Image Classification

Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Invasive Context Engineering to Control Large Language Models

Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities