Paper Library
A collection of AI security research papers.
1,172 papers total
November 10–16, 2025
24 papers
ICX360: In-Context eXplainability 360 Toolkit
Dennis Wei, Ronny Luss, Xiaomeng Hu, Lucas Monteiro Paes, Pin-Yu Chen, Karthikeyan Natesan Ramamurthy, Erik Miehling, Inge Vejsbjerg, Hendrik Strobelt
2025-11-14
red teaming
2511.10879v1
Can AI Models be Jailbroken to Phish Elderly Victims? An End-to-End Evaluation
Fred Heiding, Simon Lermen
2025-11-13
red teaming
2511.11759v1
PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization
Runpeng Geng, Yanting Wang, Chenlong Yin, Minhao Cheng, Ying Chen, Jinyuan Jia
2025-11-13
2511.10720v1
Say It Differently: Linguistic Styles as Jailbreak Vectors
Srikant Panda, Avinash Rai
2025-11-13
red teaming
2511.10519v1
EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
Jialin Wu, Kecen Li, Zhicong Huang, Xinfeng Li, Xiaofeng Wang, Cheng Hong
2025-11-13
2511.09880v1
A precessing magnetic jet as the engine of GRB 250702B
Tao An
2025-11-13
2511.09850v1
Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO
Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
2025-11-12
red teaming
2511.09780v1
Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models
Tiansheng Huang, Virat Shejwalkar, Oscar Chang, Milad Nasr, Ling Liu
2025-11-12
red teaming
2511.09682v1
Toward Honest Language Models for Deductive Reasoning
Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
2025-11-12
2511.09222v4
Toward Honest Language Models for Deductive Reasoning
Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
2025-11-12
2511.09222v3
Toward Honest Language Models for Deductive Reasoning
Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
2025-11-12
2511.09222v2
StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak
Hongyi Li, Chengxuan Zhou, Chu Wang, Sicheng Liang, Yanting Chen, Qinlin Xie, Jiawei Ye, Jie Wu
2025-11-12
red teaming
2511.10692v1
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
Zixun Xiong, Gaoyi Wu, Qingyang Yu, Mingyu Derek Ma, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang
2025-11-12
2511.08905v2
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
Zixun Xiong, Gaoyi Wu, Qingyang Yu, Mingyu Derek Ma, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang
2025-11-12
2511.08905v1
Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models
Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens
2025-11-11
safety
2511.08484v1
SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio
2025-11-11
2511.08379v2
Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs
Yuxuan Zhou, Yuzhao Peng, Yang Bai, Kuofeng Gao, Yihao Zhang, Yechao Zhang, Xun Chen, Tao Yu, Tao Dai, Shu-Tao Xia
2025-11-11
red teaming
2511.08367v1
Alignment-Aware Quantization for LLM Safety
Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
2025-11-11
safety
2511.07842v1
JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework
Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia
2025-11-10
red teaming
2511.07315v1
EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong
2025-11-10
safety
2511.06890v1
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment
Peng Zhang, Peijie Sun
2025-11-10
safety
2511.06852v2
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment
Peng Zhang, Peijie Sun
2025-11-10
red teaming
safety
2511.06852v1
SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang
2025-11-10
safety
2511.06778v2
SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang
2025-11-10
safety
2511.06778v1