Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 770 papers total

August 25 - August 31, 2025

10 papers

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
2025-08-28
red teaming
2508.20325v1

IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
2025-08-27
red teaming
2508.20151v1

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao
2025-08-27
red teaming, safety
2508.20038v2

Pruning Strategies for Backdoor Defense in LLMs

Santosh Chapagain, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
2025-08-27
safety
2508.20032v1

Evaluating Language Model Reasoning about Confidential Information

Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter
2025-08-27
red teaming
2508.19980v1

AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee, Chi-An Fu, Hung-yi Lee
2025-08-27
red teaming
2509.00088v2

Safety Alignment Should Be Made More Than Just A Few Attention Heads

Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu
2025-08-27
red teaming
2508.19697v1

On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Haozhe Jiang, Nika Haghtalab
2025-08-26
red teaming
2508.19445v1

Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu
2025-08-25
red teaming
2508.19292v1

Speculative Safety-Aware Decoding

Xuekang Wang, Shengyu Zhu, Xueqi Cheng
2025-08-25
red teaming
2508.17739v1

August 18 - August 24, 2025

12 papers

Activation Transport Operators

Andrzej Szablewski, Marek Masiak
2025-08-24
2508.17540v1

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
2025-08-24
safety
2508.17450v1

Unveiling the Latent Directions of Reflection in Large Language Models

Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu
2025-08-23
red teaming
2508.16989v1

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-08-23
red teaming
2508.16889v1

HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov
2025-08-22
red teaming
2508.16484v1

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
2025-08-22
red teaming
2508.16406v1

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs

Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, Zhifei Zheng, Min Liu, Zhiyi Yin, Jianping Zhang
2025-08-22
red teaming
2508.16347v1

LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts

Darpan Aswal, Céline Hudelot
2025-08-22
2508.16325v1

MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine

Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai
2025-08-22
safety
2508.16213v1

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
2025-08-21
red teaming
2508.15648v2

SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking

Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai
2025-08-21
safety
2508.15526v1

IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents

Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji
2025-08-21
safety
2508.15310v1