Paper Library
A collection of AI security research papers
Showing 1331 papers total
October 06 - October 12, 2025
24 papers
Energy-Driven Steering: Reducing False Refusals in Large Language Models
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li
2025-10-09
2510.08646v1
Practical and Stealthy Touch-Guided Jailbreak Attacks on Deployed Mobile Vision-Language Agents
Renhua Ding, Xiao Yang, Zhengwei Fang, Jun Luo, Kun He, Jun Zhu
2025-10-09
red teaming
2510.07809v2
The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana
2025-10-09
safety
2510.07775v1
RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning
Artur Horal, Daniel Pina, Henrique Paz, Iago Paulo, João Soares, Rafael Ferreira, Diogo Tavares, Diogo Glória-Silva, João Magalhães, David Semedo
2025-10-08
red teaming
2510.06994v1
Exposing Citation Vulnerabilities in Generative Engines
Riku Mochizuki, Shusuke Komatsu, Souta Noguchi, Kazuto Ataka
2025-10-08
2510.06823v1
Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
Tavish McDonald, Bo Lei, Stanislav Fort, Bhavya Kailkhura, Brian Bartoldson
2025-10-08
red teaming
2510.06790v1
XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation
Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai
2025-10-08
2510.06672v2
Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2025-10-08
red teaming
2510.06594v2
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
Jiawei Mao, Yuhan Wang, Lifeng Chen, Can Zhao, Yucheng Tang, Dong Yang, Liangqiong Qu, Daguang Xu, Yuyin Zhou
2025-10-07
2510.06131v1
LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė, Maura Pintor, Amin Karbasi, Battista Biggio
2025-10-07
red teaming
2510.08604v1
Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling
Mary Llewellyn, Annie Gray, Josh Collyer, Michael Harries
2025-10-07
red teaming
2510.05709v1
Evaluating LLM Safety Across Child Development Stages: A Simulated Agent Approach
Abhejay Murali, Saleh Afroogh, Kevin Chen, David Atkinson, Amit Dhurandhar, Junfeng Jiao
2025-10-07
safety
2510.05484v1
Adversarial Reinforcement Learning for Large Language Model Agent Safety
Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser
2025-10-06
red teaming
2510.05442v1
AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
Xiaogeng Liu, Chaowei Xiao
2025-10-06
red teaming
2510.05379v2
WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection
Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen
2025-10-06
2510.05305v1
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?
Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste
2025-10-06
red teaming
2510.05244v1
Proactive defense against LLM Jailbreak
Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang
2025-10-06
red teaming
safety
2510.05052v1
Imperceptible Jailbreaking against Large Language Models
Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
2025-10-06
red teaming
2510.05025v1
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo
2025-10-06
red teaming
2510.04885v1
Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers
Santhosh Kumar Ravindran
2025-10-06
red teaming
2510.04528v1
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
2025-10-06
safety
2510.04503v2