Paper Library
A collection of AI security research papers
Showing 235 papers total
August 25 - August 31, 2025
7 papers
Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks
Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
2025-08-27
red teaming
safety
2508.20038v1
Pruning Strategies for Backdoor Defense in LLMs
Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
2025-08-27
safety
2508.20032v1
Evaluating Language Model Reasoning about Confidential Information
Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter
2025-08-27
red teaming
2508.19980v1
Safety Alignment Should Be Made More Than Just A Few Attention Heads
Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu
2025-08-27
red teaming
2508.19697v1
On Surjectivity of Neural Networks: Can you elicit any behavior from your model?
Haozhe Jiang, Nika Haghtalab
2025-08-26
red teaming
2508.19445v1
Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience
Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu
2025-08-25
red teaming
2508.19292v1
Speculative Safety-Aware Decoding
Xuekang Wang, Shengyu Zhu, Xueqi Cheng
2025-08-25
red teaming
2508.17739v1
August 18 - August 24, 2025
17 papers
Activation Transport Operators
Andrzej Szablewski, Marek Masiak
2025-08-24
2508.17540v1
Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
2025-08-24
safety
2508.17450v1
Unveiling the Latent Directions of Reflection in Large Language Models
Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu
2025-08-23
red teaming
2508.16989v1
ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-08-23
red teaming
2508.16889v1
HAMSA: Hijacking Aligned Compact Models via Stealthy Automation
Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov
2025-08-22
red teaming
2508.16484v1
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
2025-08-22
red teaming
2508.16406v1
Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, Zhifei Zheng, Min Liu, Zhiyi Yin, Jianping Zhang
2025-08-22
red teaming
2508.16347v1
LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts
Darpan Aswal, Céline Hudelot
2025-08-22
2508.16325v1
MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine
Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai
2025-08-22
safety
2508.16213v1
SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
2025-08-21
red teaming
2508.15648v2
SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
2025-08-21
2508.15648v1
SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai
2025-08-21
safety
2508.15526v1
IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents
Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji
2025-08-21
safety
2508.15310v1
Retrieval-Augmented Review Generation for Poisoning Recommender Systems
Shiyi Yang, Xinshu Li, Guanglin Zhou, Chen Wang, Xiwei Xu, Liming Zhu, Lina Yao
2025-08-21
2508.15252v1
SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, Rongxing Lu
2025-08-21
red teaming
2508.15182v1
Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
2025-08-20
red teaming
2508.14853v1
CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection
Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis
2025-08-19
red teaming
safety
2508.14128v1