Paper Library
A collection of AI security research papers
1,331 papers total
April 06 - April 12, 2026
3 papers
SALLIE: Safeguarding Against Latent Language & Image Exploits
Guy Azov, Ofer Rivlin, Guy Shtar
2026-04-06
2604.06247v1
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
Seamus Brady
2026-04-06
safety
2604.04660v1
ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems
Zhuowen Yuan, Zhaorun Chen, Zhen Xiang, Nathaniel D. Bastian, Seyyed Hadi Hashemi, Chaowei Xiao, Wenbo Guo, Bo Li
2026-04-06
2604.04426v1
March 30 - April 05, 2026
19 papers
LLM-Enabled Open-Source Systems in the Wild: An Empirical Study of Vulnerabilities in GitHub Security Advisories
Fariha Tanjim Shifat, Hariswar Baburaj, Ce Zhou, Jaydeb Sarker, Mia Mohammad Imran
2026-04-05
2604.04288v1
FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
Daniel Kuznetsov, Ofir Cohen, Karin Shistik, Rami Puzis, Asaf Shabtai
2026-04-05
red teaming
2604.04992v1
Predict, Don't React: Value-Based Safety Forecasting for LLM Streaming
Pride Kavumba, Koki Wataoka, Huy H. Nguyen, Jiaxuan Li, Masaya Ohagi
2026-04-05
safety
2604.03962v1
Automating Cloud Security and Forensics Through a Secure-by-Design Generative AI Framework
Dalal Alharthi, Ivan Roberto Kawaminami Garcia
2026-04-05
red teaming
2604.03912v1
Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs
Wenhui Zhu, Xuanzhao Dong, Xiwen Chen, Rui Cai, Peijie Qiu, Zhipeng Wang, Oana Frunza, Shao Tang, Jindong Gu, Yalin Wang
2026-04-04
red teaming
2604.03870v1
AttackEval: A Systematic Empirical Study of Prompt Injection Attack Effectiveness Against Large Language Models
Jackson Wang
2026-04-04
red teaming
2604.03598v1
Automated Analysis of Global AI Safety Initiatives: A Taxonomy-Driven LLM Approach
Takayuki Semitsu, Naoto Kiribuchi, Kengo Zenitani
2026-04-04
safety
2604.03533v1
Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study
Zhihao Chen, Ying Zhang, Yi Liu, Gelei Deng, Yuekang Li, Yanjun Zhang, Jianting Ning, Leo Yu Zhang, Lei Ma, Zhiqiang Li
2026-04-03
red teaming
2604.03070v1
LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation
Yilin Xiao, Jin Chen, Qinggang Zhang, Yujing Zhang, Chuang Zhou, Longhao Yang, Lingfei Ren, Xin Yang, Xiao Huang
2026-04-03
red teaming
2604.02954v1
Generalization Limits of Reinforcement Learning Alignment
Haruhi Shida, Koo Imai, Keigo Kansa
2026-04-03
red teaming
2604.02652v1
Understanding the Effects of Safety Unalignment on Large Language Models
John T. Halloran
2026-04-02
red teaming
2604.02574v1
Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters
Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda
2026-04-02
red teaming
2604.01888v1
CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han
2026-04-02
red teaming
2604.01604v1
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu
2026-04-01
red teaming
2604.01473v1
Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models
Weidi Luo, Xiaofei Wen, Tenghao Huang, Hongyi Wang, Zhen Xiang, Chaowei Xiao, Kristina Gligorić, Muhao Chen
2026-04-01
red teaming
2604.01444v2
ClawSafety: "Safe" LLMs, Unsafe Agents
Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge
2026-04-01
red teaming
2604.01438v2
AgentWatcher: A Rule-based Prompt Injection Monitor
Yanting Wang, Wei Zou, Runpeng Geng, Jinyuan Jia
2026-04-01
2604.01194v1
Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT Defense
Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung, Mohammad Hamdaqa
2026-04-01
safety
governance
2604.01127v1
The Persistent Vulnerability of Aligned AI Systems
Aengus Lynch
2026-03-31
red teaming
2604.00324v1