
Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

Authors: Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens

Published: 2025-11-11

arXiv ID: 2511.08484v1

Added to Library: 2025-11-12 04:00 UTC

Safety

📄 Abstract

We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving released models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This "patch" introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal), policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be "patched" much like software, offering vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.
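The sketch below illustrates the core idea of a "policy patch" as a small learnable prefix prepended to a frozen model's input embeddings, in the style of prefix tuning. It is a minimal illustration, not the authors' implementation: the class name `PolicyPatch`, the prefix length, and the hidden dimension (and the 7B-parameter base model used for the overhead arithmetic) are all illustrative assumptions.

```python
# Minimal sketch of a policy patch as a learnable prefix (prefix-tuning style).
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class PolicyPatch(nn.Module):
    """Learnable prefix prepended to the input embeddings of a frozen LLM."""

    def __init__(self, num_prefix_tokens: int = 50, hidden_dim: int = 4096):
        super().__init__()
        # 50 x 4096 ~= 205K trainable parameters, roughly 0.003% of an
        # (assumed) 7B-parameter base model -- the order of overhead the
        # abstract describes.
        self.prefix = nn.Parameter(torch.randn(num_prefix_tokens, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_dim) from the frozen base model.
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # Only the prefix is trained; the base model's weights stay untouched,
        # so the patch can be shipped and swapped like a software update.
        return torch.cat([prefix, input_embeds], dim=1)


patch = PolicyPatch()
dummy_embeds = torch.randn(2, 10, 4096)   # stand-in for token embeddings
patched = patch(dummy_embeds)             # shape: (2, 60, 4096)
print(patched.shape)
```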

🔍 Key Points

  • Introduction of a lightweight patching method for large language models (LLMs) that allows for rapid safety remediation without full model updates.
  • Demonstration of effective mitigation across three critical safety issues (toxicity, bias, and harmfulness), with results comparable to those of next-generation safety-aligned models.
  • Utilization of a two-stage training process (Supervised Fine-Tuning followed by Direct Preference Optimization) that stabilizes fluency while improving safety; see the sketch after this list.
  • The patching approach introduces only 0.003% additional parameters, making it efficient in terms of resource usage and easy to deploy as a software update.
  • Extensive experimental validation showcasing the policy patch's ability to generalize across different models and datasets.
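The snippet below sketches the two training objectives behind that two-stage recipe: a standard next-token cross-entropy loss for the SFT stage and the standard DPO preference loss for the second stage, applied here only to the patch parameters. The function names and the assumption that safe responses are the "chosen" side of each preference pair are illustrative; the authors' exact training setup is not reproduced.

```python
# Generic losses for the SFT -> DPO recipe, as a hedged sketch.
import torch
import torch.nn.functional as F


def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Stage 1: cross-entropy on responses from the safer reference model,
    updating only the prefix (patch) parameters."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))


def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Stage 2: standard DPO objective on (safe, unsafe) response pairs."""
    # Log-ratios of the patched policy against the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Prefer the safe (chosen) continuation over the unsafe (rejected) one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Keeping the SFT warm-up before preference optimization is what the key point above credits with stabilizing fluency: the prefix first learns to imitate the safer reference model before being pushed apart from unsafe continuations.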

💡 Why This Paper Matters

This paper presents a novel approach to enhance the safety of large language models through an efficient and modular patching method, addressing significant issues like toxicity and bias. With its lightweight architecture requiring minimal additional parameters, this solution allows for rapid updates and deployments, bridging the gap between high-stakes safety needs and the logistical challenges faced by LLM providers. The practical implications of this method are substantial, providing a viable way to enhance user safety while upholding model fluency.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly pertinent as it addresses critical vulnerabilities in large language models that can lead to harmful outputs. The proposed patching method offers a practical solution to mitigate risks associated with toxic language generation and bias, thus contributing to the field of AI safety and ethics. Furthermore, the framework for modular updates enhances the agility of AI systems in responding to emerging threats, making it a valuable reference for ongoing security research.

📚 Read the Full Paper: https://arxiv.org/abs/2511.08484v1