Paper Library
A collection of AI security research papers
235 papers total
August 04 - August 10, 2025
18 papers
Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach
Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
2025-08-08 · red teaming · arXiv:2508.09201v1
LLM Robustness Leaderboard v1 -- Technical report
Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
2025-08-08 · red teaming · arXiv:2508.06296v2
Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation
Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan
2025-08-08 · red teaming · arXiv:2508.06194v1
Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks
Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
2025-08-08 · safety · arXiv:2508.09190v1
Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM
Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
2025-08-07 · red teaming · arXiv:2508.05775v2
AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety
Adi Levi, Or Levi, Sardhendu Mishra, Jonathan Morra
2025-08-07 · safety · arXiv:2508.05527v1
JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering
Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang
2025-08-07 · red teaming · arXiv:2508.05087v1
Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs)
Iyiola E. Olatunji, Franziska Boenisch, Jing Xu, Adam Dziedzic
2025-08-06 · safety · arXiv:2508.04894v1
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang
2025-08-06 · red teaming · arXiv:2508.04204v1
Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models
Siddhant Panpatil, Hiskias Dingeto, Haon Park
2025-08-06 · red teaming · arXiv:2508.04196v1
Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems
Alistair Reid, Simon O'Callaghan, Liam Carroll, Tiberio Caetano
2025-08-06 · governance · arXiv:2508.05687v1
Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety
Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu
2025-08-05 · red teaming · arXiv:2508.03864v1
When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin
2025-08-05 · red teaming · arXiv:2508.03365v1
Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning
Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang
2025-08-05 · arXiv:2508.03054v1
CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2025-08-05 · red teaming · arXiv:2508.02997v2
Highlight & Summarize: RAG without the jailbreaks
Giovanni Cherubin, Andrew Paverd
2025-08-04 · arXiv:2508.02872v1
Large Reasoning Models Are Autonomous Jailbreak Agents
Thilo Hagendorff, Erik Derner, Nuria Oliver
2025-08-04 · red teaming · arXiv:2508.04039v1
TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
Amitava Das, Vinija Jain, Aman Chadha
2025-08-04 · arXiv:2508.02063v1
July 28 - August 03, 2025
3 papers
Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models
Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma
2025-08-03 · red teaming · arXiv:2508.01741v1
PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
Yelim Ahn, Jaejin Lee
2025-08-02 · red teaming · arXiv:2508.01306v1
LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks
Francesco Panebianco, Stefano Bonfanti, Francesco Trovò, Michele Carminati
2025-08-01 · red teaming, safety · arXiv:2508.00602v1