Philipp Zimmermann

Paper Library

Collection of AI Security research papers

Showing 235 papers total

June 09 - June 15, 2025

21 papers

Exploring the Secondary Risks of Large Language Models

Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su
2025-06-14
2506.12382v1

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety

Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng
2025-06-14
safety
2506.12299v1

InfoFlood: Jailbreaking Large Language Models with Information Overload

Advait Yadav, Haibo Jin, Man Luo, Jun Zhuang, Haohan Wang
2025-06-13
red teaming
2506.12274v1

Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models

Jinming Wen, Xinyi Wu, Shuai Zhao, Yanhao Jia, Yuwen Li
2025-06-13
red teaming
2506.11521v1

DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents

Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang, Chaowei Xiao
2025-06-13
safety
2506.12104v1

How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?

Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, Mor Geva
2025-06-12
red teaming
2506.10979v1

SoK: Evaluating Jailbreak Guardrails for Large Language Models

Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
2025-06-12
red teaming
2506.10597v1

Effective Red-Teaming of Policy-Adherent Agents

Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor
2025-06-11
red teaming
2506.09600v1

Risks & Benefits of LLMs & GenAI for Platform Integrity, Healthcare Diagnostics, Cybersecurity, Privacy & AI Safety: A Comprehensive Survey, Roadmap & Implementation Blueprint

Kiarash Ahi
2025-06-10
safety
2506.12088v1

AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI)

Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
2025-06-10
red teaming, safety
2506.08885v2

AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI)

Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
2025-06-10
red teaming, safety
2506.08885v1

Empirical Evaluation of the Security and Alignment of ChatGPT and Gemini: A Comparative Analysis of Vulnerabilities through Jailbreak Experiments

Rafaël Nouailles
2025-06-10
red teaming
2506.10029v1

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan
2025-06-10
safety
2506.08473v2

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan
2025-06-10
safety
2506.08473v1

RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua
2025-06-09
red teaming
2506.07736v2

RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua
2025-06-09
red teaming
2506.07736v1

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges

Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, Xuelong Li
2025-06-09
red teaming
2506.10022v1

Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn
2025-06-09
red teaming
2506.07645v1

TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts

Torsten Krauß, Hamid Dashtbani, Alexandra Dmitrienko
2025-06-09
red teaming
2506.07596v1

When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment

Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi
2025-06-09
red teaming
2506.07452v1

Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures

Yukai Zhou, Sibei Yang, Wenjie Wang
2025-06-09
red teaming
2506.07402v1

June 02 - June 08, 2025

3 papers