Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 235 papers total

August 25 - August 31, 2025

7 papers

August 18 - August 24, 2025

17 papers

Activation Transport Operators

Andrzej Szablewski, Marek Masiak
2025-08-24
2508.17540v1

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
2025-08-24
safety
2508.17450v1

Unveiling the Latent Directions of Reflection in Large Language Models

Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu
2025-08-23
red teaming
2508.16989v1

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
2025-08-23
red teaming
2508.16889v1

HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov
2025-08-22
red teaming
2508.16484v1

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
2025-08-22
red teaming
2508.16406v1

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs

Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, Zhifei Zheng, Min Liu, Zhiyi Yin, Jianping Zhang
2025-08-22
red teaming
2508.16347v1

LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts

Darpan Aswal, Céline Hudelot
2025-08-22
2508.16325v1

MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine

Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai
2025-08-22
safety
2508.16213v1

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
2025-08-21
red teaming
2508.15648v2

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
2025-08-21
2508.15648v1

SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking

Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai
2025-08-21
safety
2508.15526v1

IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents

Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji
2025-08-21
safety
2508.15310v1

Retrieval-Augmented Review Generation for Poisoning Recommender Systems

Shiyi Yang, Xinshu Li, Guanglin Zhou, Chen Wang, Xiwei Xu, Liming Zhu, Lina Yao
2025-08-21
2508.15252v1

SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, Rongxing Lu
2025-08-21
red teaming
2508.15182v1

Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent

Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
2025-08-20
red teaming
2508.14853v1

CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection

Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis
2025-08-19
red teaming, safety
2508.14128v1