Paper Library
Collection of AI Security research papers
1331 papers total
August 18 - August 24, 2025
16 papers
HAMSA: Hijacking Aligned Compact Models via Stealthy Automation
Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov
2025-08-22
red teaming
arXiv:2508.16484v1
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
2025-08-22
red teaming
arXiv:2508.16406v1
Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, Zhifei Zheng, Min Liu, Zhiyi Yin, Jianping Zhang
2025-08-22
red teaming
arXiv:2508.16347v1
LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts
Darpan Aswal, Céline Hudelot
2025-08-22
arXiv:2508.16325v1
MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine
Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai
2025-08-22
safety
arXiv:2508.16213v1
SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
2025-08-21
red teaming
arXiv:2508.15648v2
SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai
2025-08-21
safety
arXiv:2508.15526v1
IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents
Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji
2025-08-21
safety
arXiv:2508.15310v1
Retrieval-Augmented Review Generation for Poisoning Recommender Systems
Shiyi Yang, Xinshu Li, Guanglin Zhou, Chen Wang, Xiwei Xu, Liming Zhu, Lina Yao
2025-08-21
arXiv:2508.15252v1
SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, Rongxing Lu
2025-08-21
red teaming
arXiv:2508.15182v1
Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
2025-08-20
red teaming
arXiv:2508.14853v1
CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection
Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis
2025-08-19
red teaming
safety
arXiv:2508.14128v1
MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies
Weiwei Qi, Shuo Shao, Wei Gu, Tianhang Zheng, Puning Zhao, Zhan Qin, Kui Ren
2025-08-18
red teaming
arXiv:2508.13048v1
FuSaR: A Fuzzification-Based Method for LRM Safety-Reasoning Balance
Jianhao Chen, Mayi Xu, Xiaohu Li, Yongqi Li, Xiangyu Zhang, Jianjie Huang, Tieyun Qian
2025-08-18
red teaming
arXiv:2508.12897v1
Involuntary Jailbreak
Yangyang Guo, Yangyan Li, Mohan Kankanhalli
2025-08-18
red teaming
arXiv:2508.13246v1
CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
Seonglae Cho, Zekun Wu, Adriano Koshiyama
2025-08-18
red teaming
safety
arXiv:2508.12535v1
August 11 - August 17, 2025
7 papers
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi
2025-08-17
safety
arXiv:2508.12531v1
Too Easily Fooled? Prompt Injection Breaks LLMs on Frustratingly Simple Multiple-Choice Questions
Xuyang Guo, Zekai Huang, Zhao Song, Jiahao Zhang
2025-08-16
red teaming
arXiv:2508.13214v1
Mitigating Jailbreaks with Intent-Aware LLMs
Wei Jie Yeo, Ranjan Satapathy, Erik Cambria
2025-08-16
red teaming
arXiv:2508.12072v1
SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory
Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem
2025-08-15
safety
arXiv:2508.11290v1
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal
Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang
2025-08-15
safety
arXiv:2508.11222v1
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
2025-08-14
red teaming
arXiv:2508.10404v1
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu
2025-08-14
red teaming
arXiv:2508.10390v1