Paper Library
A collection of AI security research papers
Showing 1169 papers total
August 25 - August 31, 2025
2 papers
Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience
Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu
2025-08-25
red teaming
2508.19292v1
Speculative Safety-Aware Decoding
Xuekang Wang, Shengyu Zhu, Xueqi Cheng
2025-08-25
red teaming
2508.17739v1
August 18 - August 24, 2025
21 papers
Activation Transport Operators
Andrzej Szablewski, Marek Masiak
2025-08-24
2508.17540v1
Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
2025-08-24
safety
2508.17450v1
Unveiling the Latent Directions of Reflection in Large Language Models
Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu
2025-08-23
red teaming
2508.16989v1
ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-08-23
red teaming
2508.16889v1
HAMSA: Hijacking Aligned Compact Models via Stealthy Automation
Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov
2025-08-22
red teaming
2508.16484v1
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
2025-08-22
red teaming
2508.16406v1
Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, Zhifei Zheng, Min Liu, Zhiyi Yin, Jianping Zhang
2025-08-22
red teaming
2508.16347v1
LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts
Darpan Aswal, Céline Hudelot
2025-08-22
2508.16325v1
MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine
Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai
2025-08-22
safety
2508.16213v1
SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
2025-08-21
red teaming
2508.15648v2
SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai
2025-08-21
safety
2508.15526v1
IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents
Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji
2025-08-21
safety
2508.15310v1
Retrieval-Augmented Review Generation for Poisoning Recommender Systems
Shiyi Yang, Xinshu Li, Guanglin Zhou, Chen Wang, Xiwei Xu, Liming Zhu, Lina Yao
2025-08-21
2508.15252v1
SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, Rongxing Lu
2025-08-21
red teaming
2508.15182v1
Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
2025-08-20
red teaming
2508.14853v1
CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection
Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis
2025-08-19
red teaming
safety
2508.14128v1
MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies
Weiwei Qi, Shuo Shao, Wei Gu, Tianhang Zheng, Puning Zhao, Zhan Qin, Kui Ren
2025-08-18
red teaming
2508.13048v1
FuSaR: A Fuzzification-Based Method for LRM Safety-Reasoning Balance
Jianhao Chen, Mayi Xu, Xiaohu Li, Yongqi Li, Xiangyu Zhang, Jianjie Huang, Tieyun Qian
2025-08-18
red teaming
2508.12897v1
Involuntary Jailbreak
Yangyang Guo, Yangyan Li, Mohan Kankanhalli
2025-08-18
red teaming
2508.13246v1
CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
Seonglae Cho, Zekun Wu, Adriano Koshiyama
2025-08-18
red teaming
safety
2508.12535v1
August 11 - August 17, 2025
1 paper
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi
2025-08-17
safety
2508.12531v1