Paper Library
A collection of AI security research papers
947 papers total
January 5 - January 11, 2026
14 papers
Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense
Zejian Chen, Chaozhuo Li, Chao Li, Xi Zhang, Litian Zhang, Yiming He
2026-01-07
red teaming
2601.03594v1
STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules
Di Wu, Yanyan Zhao, Xin Lu, Mingzhe Li, Bing Qin
2026-01-07
2601.03537v1
Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks
Zhakshylyk Nurlanov, Frank R. Schmidt, Florian Bernard
2026-01-06
red teaming
2601.03420v1
GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia
2026-01-06
red teaming
2601.03416v1
JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification
Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Zhaoye Li, Bin Ji, Baosheng Wang, Jie Yu
2026-01-06
red teaming
2601.03005v1
Adversarial Contrastive Learning for LLM Quantization Attacks
Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li
2026-01-06
red teaming
2601.02680v1
TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering
Scott Thornton
2026-01-06
red teaming
2601.03300v1
Extracting books from production language models
Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
2026-01-06
red teaming
2601.02671v1
Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search
Devang Kulshreshtha, Hang Su, Chinmay Hegde, Haohan Wang
2026-01-06
red teaming
2601.02670v1
LLM-Empowered Functional Safety and Security by Design in Automotive Systems
Nenad Petrovic, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll
2026-01-05
safety
2601.02215v1
Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models
Antonio Colacicco, Vito Guida, Dario Di Palma, Fedelucio Narducci, Tommaso Di Noia
2026-01-05
2601.02002v1
Hidden State Poisoning Attacks against Mamba-based Language Models
Alexandre Le Mercier, Chris Develder, Thomas Demeester
2026-01-05
red teaming
2601.01972v2
Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, Ruoxi Jia
2026-01-05
safety
2601.01887v2
Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization
Jiwei Guan, Haibo Jin, Haohan Wang
2026-01-05
red teaming
2601.01747v2
December 29, 2025 - January 4, 2026
9 papers
JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato
2026-01-04
red teaming
2601.01627v1
OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs
Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao, Tianle Gu, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang, Xia Hu
2026-01-04
red teaming
2601.01592v1
How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference
Songyang Liu, Chaozhuo Li, Rui Pu, Litian Zhang, Chenxu Wang, Zejian Chen, Yuting Zhang, Yiming Hei
2026-01-04
red teaming
2601.03288v1
MCP-SandboxScan: WASM-based Secure Execution and Runtime Analysis for MCP Tools
Zhuoran Tan, Run Hao, Jeremy Singer, Yutian Tang, Christos Anagnostopoulos
2026-01-03
2601.01241v1
Emoji-Based Jailbreaking of Large Language Models
M P V S Gopinadh, S Mahaboob Hussain
2026-01-02
red teaming
2601.00936v1
CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns
Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu, Qiankun Li, Kun Wang, Zhigang Zeng
2026-01-02
safety
2601.00588v2
Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
Hyunjun Kim
2026-01-01
2601.00454v1
α³-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks
Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
2026-01-01
safety
2601.03281v1
Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
2026-01-01
red teaming
safety
2601.00213v1