Paper Library
A collection of AI security research papers
235 papers in the library
May 26 - June 01, 2025
18 papers
CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
2025-06-01
red teaming
arXiv:2506.00781v1
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien
2025-06-01
safety
arXiv:2506.04250v1
SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning
Saad Hossain, Samanvay Vajpayee, Sirisha Rambhatla
2025-05-31
safety
arXiv:2506.00676v1
AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam
2025-05-31
safety
arXiv:2506.00641v1
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities
Jiahui Geng, Thy Thy Tran, Preslav Nakov, Iryna Gurevych
2025-05-31
red teaming
arXiv:2506.00548v1
Wide Reflective Equilibrium in LLM Alignment: Bridging Moral Epistemology and AI Safety
Matthew Brophy
2025-05-31
safety
arXiv:2506.00415v1
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
Xiaorui Wu, Xiaofeng Mao, Fei Li, Xin Zhang, Xuanhong Li, Chong Teng, Donghong Ji, Zhuang Li
2025-05-30
red teaming
arXiv:2505.24672v1
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization
Utsav Maskey, Chencheng Zhu, Usman Naseem
2025-05-30
red teaming
arXiv:2505.24621v1
AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders
Yuqi Zhang, Yuchun Miao, Zuchao Li, Liang Ding
2025-05-30
arXiv:2505.24519v1
Model Unlearning via Sparse Autoencoder Subspace Guided Projections
Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou
2025-05-30
arXiv:2505.24428v1
From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models
Haibo Jin, Peiyan Zhang, Peiran Wang, Man Luo, Haohan Wang
2025-05-30
red teaming
arXiv:2505.24232v1
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
Wenhan Yang, Spencer Stice, Ali Payani, Baharan Mirzasoleiman
2025-05-30
safety
arXiv:2505.24208v1
The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It
Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, Julia Kreutzer
2025-05-30
safety
arXiv:2505.24119v1
Understanding Refusal in Language Models with Sparse Autoencoders
Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Roy Ka-Wei Lee, Erik Cambria, Ranjan Satapathy
2025-05-29
red teaming
arXiv:2505.23556v1
Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models
Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin
2025-05-29
red teaming
arXiv:2505.23404v1
Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment
Krti Tallam, Emma Miller
2025-05-28
safety
arXiv:2505.22852v1
GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance
Zaixi Zhang, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang
2025-05-28
arXiv:2505.23839v1
JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models
Jiaxin Song, Yixu Wang, Jie Li, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang
2025-05-26
red teaming
arXiv:2505.19610v2
May 19 - May 25, 2025
1 paper
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, Zhixuan Chu
2025-05-23
safety
arXiv:2505.18325v2