Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

June 02 - June 08, 2025

16 papers

HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v2

HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
2025-06-05
2506.04704v1

Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering

Yi Ji, Runzhi Li, Baolei Mao
2025-06-05
red teaming
2506.06384v1

Adversarial Attacks on Robotic Vision Language Action Models

Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
2025-06-03
red teaming
2506.03350v1

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
2025-06-03
red teaming
2506.02873v1

From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV

Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida, Zhu Han
2025-06-03
2506.02649v1

IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages

Muhammad Falensi Azmi, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, Fajri Koto
2025-06-03
safety
2506.02573v1

BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

Kalyan Nakka, Nitesh Saxena
2025-06-03
red teaming
2506.02479v1

Should LLM Safety Be More Than Refusing Harmful Instructions?

Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v2

Should LLM Safety Be More Than Refusing Harmful Instructions?

Utsav Maskey, Mark Dras, Usman Naseem
2025-06-03
safety
2506.02442v1

AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output

Hisami Suzuki, Satoru Katsumata, Takashi Kodama, Tetsuro Takahashi, Kouta Nakayama, Satoshi Sekine
2025-06-03
safety
2506.02372v1

Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components

Ram Potham
2025-06-03
safety
2506.02357v1

ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs

Zeming Wei, Chengcan Wu, Meng Sun
2025-06-02
2506.01770v1

ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge

Chaoyue He, Xin Zhou, Yi Wu, Xinjia Yu, Yan Zhang, Lei Zhang, Di Wang, Shengfei Lyu, Hong Xu, Xiaoqiao Wang, Wei Liu, Chunyan Miao
2025-06-02
governance
2506.01646v1

Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models

Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, Richang Hong
2025-06-02
red teaming
2506.01307v1

MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine

Shufeng Kong, Xingru Yang, Yuanyuan Wei, Zijie Wang, Hao Tang, Jiuqi Qin, Shuting Lan, Yingheng Wang, Junwen Bai, Zhuangbin Chen, Zibin Zheng, Caihua Liu, Hao Liang
2025-06-02
safety
2506.01252v1

May 26 - June 01, 2025

8 papers