
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Authors: Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar

Published: 2025-06-18

arXiv ID: 2506.15751v1

Added to Library: 2025-06-23 04:03 UTC

Red Teaming

📄 Abstract

As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose $\textbf{Sysformer}$, a trans$\textbf{former}$ model that updates an initial $\textbf{sys}$tem prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. While keeping the LLM parameters frozen, the Sysformer is trained to refuse to respond to a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on $5$ LLMs from different families and $2$ recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, leading to up to an $80\%$ gain in the refusal rate on harmful prompts while enhancing compliance with safe prompts by up to $90\%$. Results also generalize well to sophisticated jailbreaking attacks, making LLMs up to $100\%$ more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.
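
The core mechanism described in the abstract can be illustrated with a minimal sketch. The module below is a hypothetical simplification, not the authors' released implementation: a small trainable transformer attends jointly over the embedded initial system prompt and the embedded user prompt, and emits an adapted system-prompt embedding that is prepended to the frozen LLM's input.

```python
# Minimal, hypothetical sketch of the Sysformer idea (not the authors' code):
# a small trainable module rewrites the system-prompt embeddings conditioned
# on the user prompt, while the LLM itself stays frozen.
import torch
import torch.nn as nn

class Sysformer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, sys_embeds: torch.Tensor, user_embeds: torch.Tensor) -> torch.Tensor:
        # Attend jointly over the initial system prompt and the user prompt,
        # then keep only the (now adapted) system-prompt positions.
        joint = torch.cat([sys_embeds, user_embeds], dim=1)
        encoded = self.encoder(joint)
        return encoded[:, : sys_embeds.size(1), :]

# Illustrative usage with a frozen instruction-tuned LLM (Hugging Face-style):
# llm.requires_grad_(False)
# sys_embeds  = llm.get_input_embeddings()(sys_ids)    # (B, S, d)
# user_embeds = llm.get_input_embeddings()(user_ids)   # (B, U, d)
# adapted_sys = sysformer(sys_embeds, user_embeds)
# inputs_embeds = torch.cat([adapted_sys, user_embeds], dim=1)
# logits = llm(inputs_embeds=inputs_embeds).logits
```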

🔍 Key Points

  • Introduces Sysformer, an adaptive transformer module that rewrites the system prompt conditioned on the user input, significantly improving safety compliance without retraining the LLM parameters (a training-objective sketch follows this list).
  • Achieves up to an 80% increase in refusal rates on harmful prompts while improving compliance with safe prompts by up to 90%, demonstrating effectiveness across several LLMs.
  • Provides a defense against sophisticated jailbreaking attacks, strengthening existing models' robustness in safety-critical applications through prompt adaptation alone.
  • Generalizes across models of different sizes and across benchmark datasets, supporting broader applicability in AI deployments.
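
As noted above, the module is trained with the LLM frozen so that it refuses harmful prompts and responds helpfully to safe ones. The step below is a hedged sketch of one plausible objective, assuming teacher-forced cross-entropy toward a fixed refusal string for harmful prompts and toward reference answers for safe prompts, with a separate weight on the refusal examples; none of these specifics are confirmed by the paper text shown here.

```python
# Illustrative training step (assumption: only the Sysformer parameters receive
# gradients; the LLM has already been frozen with llm.requires_grad_(False)).
import torch
import torch.nn.functional as F

def training_step(llm, sysformer, batch, optimizer, refusal_weight: float = 1.0):
    embed = llm.get_input_embeddings()
    sys_e = embed(batch["sys_ids"])            # (B, S, d) initial system prompt
    usr_e = embed(batch["user_ids"])           # (B, U, d) user prompt
    tgt_e = embed(batch["target_ids"])         # (B, T, d) refusal or ideal reply

    adapted = sysformer(sys_e, usr_e)          # adapted system-prompt embeddings
    inputs = torch.cat([adapted, usr_e, tgt_e], dim=1)
    logits = llm(inputs_embeds=inputs).logits  # frozen LLM forward pass

    # Score only the target positions (shifted by one for next-token prediction).
    T = batch["target_ids"].size(1)
    pred = logits[:, -T - 1 : -1, :]
    loss_tok = F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        batch["target_ids"].reshape(-1),
        reduction="none",
    ).view(batch["target_ids"].size())

    # Assumed weighting: harmful-prompt (refusal) examples vs. safe examples.
    weight = torch.ones_like(loss_tok)
    weight[batch["is_harmful"]] = refusal_weight
    loss = (weight * loss_tok).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```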

💡 Why This Paper Matters

This paper presents significant advancements in safeguarding large language models through the introduction of Sysformer, a novel mechanism that dynamically adapts the system prompt to ensure safety compliance in responses. Its findings highlight the potential to enhance AI robustness in the face of adversarial attacks while maintaining performance on benign inputs, making it a valuable contribution to the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper particularly relevant as it addresses the critical issue of ensuring safety in AI deployments, especially with the rising concerns over the misuse of large language models. By proposing a method that effectively mitigates the risks of harmful content generation without necessitating costly fine-tuning or retraining, the findings of this work provide practical solutions to enhance AI security frameworks against adversarial prompts and jailbreaking attempts.
