Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1169 papers total

July 14 - July 20, 2025

3 papers

July 07 - July 13, 2025

16 papers

Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Jia Yi Goh, Shaun Khoo, Nyx Iskandar, Gabriel Chua, Leanne Tan, Jessica Foo
2025-07-13
safety
2507.09820v1

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v3

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v2

SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
2025-07-11
2507.08898v1

A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking

Zhengye Han, Quanyan Zhu
2025-07-10
red teaming safety
2507.08207v1

Defending Against Prompt Injection With a Few DefensiveTokens

Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner
2025-07-10
red teaming
2507.07974v2

GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang
2025-07-10
red teaming
2507.07735v1

May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks

Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes
2025-07-10
red teaming
2507.07417v1

An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs

Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao
2025-07-09
red teaming
2507.07146v1

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Stephen Obadinma, Xiaodan Zhu
2025-07-09
red teaming
2507.06489v1

Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

Aaron Dharna, Cong Lu, Jeff Clune
2025-07-09
red teaming
2507.06466v1

The bitter lesson of misuse detection

Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn
2025-07-08
red teaming
2507.06282v1

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian
2025-07-08
red teaming safety
2507.06043v1

TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath
2025-07-08
red teaming
2507.05660v1

Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao
2025-07-07
red teaming
2507.05248v1

Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

Wei Duan, Li Qian
2025-07-07
red teaming
2507.04673v1

June 30 - July 06, 2025

5 papers