Paper Library
A collection of AI security research papers
1331 papers total
November 10 - November 16, 2025
15 papers
Toward Honest Language Models for Deductive Reasoning
Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
2025-11-12
2511.09222v3
Toward Honest Language Models for Deductive Reasoning
Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
2025-11-12
2511.09222v2
StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak
Hongyi Li, Chengxuan Zhou, Chu Wang, Sicheng Liang, Yanting Chen, Qinlin Xie, Jiawei Ye, Jie Wu
2025-11-12
red teaming
2511.10692v1
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
Zixun Xiong, Gaoyi Wu, Qingyang Yu, Mingyu Derek Ma, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang
2025-11-12
2511.08905v2
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
Zixun Xiong, Gaoyi Wu, Qingyang Yu, Mingyu Derek Ma, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang
2025-11-12
2511.08905v1
Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models
Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens
2025-11-11
safety
2511.08484v1
SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio
2025-11-11
2511.08379v2
Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs
Yuxuan Zhou, Yuzhao Peng, Yang Bai, Kuofeng Gao, Yihao Zhang, Yechao Zhang, Xun Chen, Tao Yu, Tao Dai, Shu-Tao Xia
2025-11-11
red teaming
2511.08367v1
Alignment-Aware Quantization for LLM Safety
Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
2025-11-11
safety
2511.07842v1
JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework
Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia
2025-11-10
red teaming
2511.07315v1
EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong
2025-11-10
safety
2511.06890v1
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment
Peng Zhang, Peijie Sun
2025-11-10
safety
2511.06852v2
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment
Peng Zhang, Peijie Sun
2025-11-10
red teaming
safety
2511.06852v1
SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang
2025-11-10
safety
2511.06778v2
SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang
2025-11-10
safety
2511.06778v1
November 03 - November 09, 2025
9 papers
EASE: Practical and Efficient Safety Alignment for Small Language Models
Haonan Shi, Guoli Wang, Tu Ouyang, An Wang
2025-11-09
red teaming
2511.06512v1
KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs
Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Zhaoxia Yin
2025-11-09
red teaming
2511.07480v1
Efficient LLM Safety Evaluation through Multi-Agent Debate
Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng
2025-11-09
red teaming
safety
2511.06396v1
RelightMaster: Precise Video Relighting with Multi-plane Light Images
Weikang Bian, Xiaoyu Shi, Zhaoyang Huang, Jianhong Bai, Qinghe Wang, Xintao Wang, Pengfei Wan, Kun Gai, Hongsheng Li
2025-11-09
2511.06271v1
RAG-targeted Adversarial Attack on LLM-based Threat Detection and Mitigation Framework
Seif Ikbarieh, Kshitiz Aryal, Maanak Gupta
2025-11-09
red teaming
2511.06212v1
Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci
2025-11-08
red teaming
2511.05919v2
Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci
2025-11-08
red teaming
2511.05919v1
Can LLM Infer Risk Information From MCP Server System Logs?
Jiayi Fu, Yuansen Zhang, Yinggui Wang
2025-11-08
2511.05867v3
MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?
Jiayi Fu, Qiyao Sun
2025-11-08
2511.05867v2