Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 1331 papers total

August 11 - August 17, 2025

9 papers

NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
2025-08-13
safety
2508.09473v1

Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

Aayush Gupta
2025-08-12
red teaming
2508.09288v1

Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams

Zane Witherspoon, Thet Mon Aye, YingYing Hao
2025-08-12
governance
2508.09036v1

Attacks and Defenses Against LLM Fingerprinting

Kevin Kurian, Ethan Holland, Sean Oesch
2025-08-12
safety
2508.09021v1

Securing Educational LLMs: A Generalised Taxonomy of Attacks on LLMs and DREAD Risk Assessment

Farzana Zahid, Anjalika Sewwandi, Lee Brandon, Vimal Kumar, Roopak Sinha
2025-08-12
red teaming
2508.08629v1

When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital

Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng
2025-08-11
safety
2508.08504v1

Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao
2025-08-11
red teaming
2508.09218v1

Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan
2025-08-11
safety
2508.08236v1

Multi-Turn Jailbreaks Are Simpler Than They Seem

Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz
2025-08-11
red teaming
2508.07646v1

August 04 - August 10, 2025

15 papers

A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection

Ivan Zhang
2025-08-10
red teaming
2508.07139v1

The Cost of Thinking: Increased Jailbreak Risk in Large Language Models

Fan Yang
2025-08-09
red teaming
2508.10032v1

Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs

Jinhwa Kim, Ian G. Harris
2025-08-09
2508.10031v1

Many-Turn Jailbreaking

Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang
2025-08-09
red teaming
2508.06755v1

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin, Meng Han
2025-08-08
red teaming
2508.10029v1

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach

Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
2025-08-08
red teaming
2508.09201v1

LLM Robustness Leaderboard v1 --Technical report

Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
2025-08-08
red teaming
2508.06296v2

LLM Robustness Leaderboard v1 --Technical report

Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
2025-08-08
red teaming
2508.06296v1

Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation

Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan
2025-08-08
red teaming
2508.06194v1

Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
2025-08-08
safety
2508.09190v1

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
2025-08-07
red teaming
2508.05775v2

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
2025-08-07
red teaming
2508.05775v1

AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety

Adi Levi, Or Levi, Sardhendu Mishra, Jonathan Morra
2025-08-07
safety
2508.05527v1

JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang
2025-08-07
red teaming
2508.05087v1

Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs)

Iyiola E. Olatunji, Franziska Boenisch, Jing Xu, Adam Dziedzic
2025-08-06
safety
2508.04894v1