Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

September 01 - September 07, 2025

2 papers

August 25 - August 31, 2025

18 papers

The Resurgence of GCG Adversarial Attacks on Large Language Models

Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu
2025-08-30
red teaming
2509.00391v1

Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang
2025-08-30
2509.00373v1

Cybersecurity AI: Hacking the AI Hackers via Prompt Injection

Víctor Mayoral-Vilches, Per Mannermaa Rynning
2025-08-29
red teaming
2508.21669v1

PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance

Mengxiao Wang, Yuxuan Zhang, Guofei Gu
2025-08-28
red teaming
2508.20890v2

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang
2025-08-28
red teaming
2508.20848v1

From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality of LLM-Generated Behavioural Specifications from Food-Safety Regulations

Shabnam Hassani, Mehrdad Sabetzadeh, Daniel Amyot
2025-08-28
safety
2508.20744v1

Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
2025-08-28
2508.20570v1

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
2025-08-28
red teaming
2508.20325v1

IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
2025-08-27
red teaming
2508.20151v1

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
2025-08-27
red teaming, safety
2508.20038v2

Pruning Strategies for Backdoor Defense in LLMs

Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
2025-08-27
safety
2508.20032v1

Evaluating Language Model Reasoning about Confidential Information

Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter
2025-08-27
red teaming
2508.19980v1

AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee, Chi-An Fu, Hung-yi Lee
2025-08-27
red teaming
2509.00088v2

Safety Alignment Should Be Made More Than Just A Few Attention Heads

Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu
2025-08-27
red teaming
2508.19697v1

On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Haozhe Jiang, Nika Haghtalab
2025-08-26
red teaming
2508.19445v1

Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu
2025-08-25
red teaming
2508.19292v1

Speculative Safety-Aware Decoding

Xuekang Wang, Shengyu Zhu, Xueqi Cheng
2025-08-25
red teaming
2508.17739v1

August 18 - August 24, 2025

4 papers