Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

August 04 - August 10, 2025

10 papers

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang
2025-08-06
red teaming
2508.04204v1

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil, Hiskias Dingeto, Haon Park
2025-08-06
red teaming
2508.04196v1

Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems

Alistair Reid, Simon O'Callaghan, Liam Carroll, Tiberio Caetano
2025-08-06
governance
2508.05687v1

Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu
2025-08-05
red teaming
2508.03864v1

When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin
2025-08-05
red teaming
2508.03365v1

Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang
2025-08-05
2508.03054v1

CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2025-08-05
red teaming
2508.02997v2

Highlight & Summarize: RAG without the jailbreaks

Giovanni Cherubin, Andrew Paverd
2025-08-04
2508.02872v1

Large Reasoning Models Are Autonomous Jailbreak Agents

Thilo Hagendorff, Erik Derner, Nuria Oliver
2025-08-04
red teaming
2508.04039v1

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha
2025-08-04
2508.02063v1

July 28 - August 03, 2025

13 papers

Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma
2025-08-03
red teaming
2508.01741v1

PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

Yelim Ahn, Jaejin Lee
2025-08-02
red teaming
2508.01306v1

LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks

Francesco Panebianco, Stefano Bonfanti, Francesco Trovò, Michele Carminati
2025-08-01
red teaming safety
2508.00602v1

Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu
2025-08-01
red teaming
2508.00555v1

Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking

Haoyu Wang, Chris M. Poskitt, Jun Sun, Jiali Wei
2025-08-01
safety
2508.00500v1

A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
2025-07-31
safety
2507.23486v1

Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto
2025-07-31
red teaming
2507.23465v1

Measuring Harmfulness of Computer-Using Agents

Aaron Xuxiang Tian, Ruofan Zhang, Janet Tang, Jiaxin Wen
2025-07-31
red teaming
2508.00935v1

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
2025-07-30
red teaming safety
2507.22564v1

Promoting Online Safety by Simulating Unsafe Conversations with LLMs

Owen Hoffman, Kangze Peng, Zehua You, Sajid Kamal, Sukrit Venkatagiri
2025-07-29
safety
2507.22267v1

Strategic Deflection: Defending LLMs from Logit Manipulation

Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni
2025-07-29
red teaming
2507.22160v1

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security

Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li
2025-07-29
red teaming
2507.22037v1

Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is

Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda
2025-07-29
red teaming
2507.21820v1