
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Authors: Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong

Published: 2025-06-06

arXiv ID: 2506.06444v1

Added to Library: 2025-06-10 04:03 UTC

Safety

📄 Abstract

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration-efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key-value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron, and our project homepage is at https://q-rz.github.io/p/saffron.
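
The efficiency argument above can be made concrete with a minimal sketch, assuming hypothetical `lm`, `prm`, and `mrm` callables rather than the released Saffron-1 interfaces: a process reward model must be invoked once per candidate continuation, whereas a multifurcation reward model scores every vocabulary token in a single forward pass. The combination rule with weight `alpha` is an illustrative assumption, not the paper's exact decoding procedure.

```python
import torch

# Hypothetical interfaces for illustration (not the released Saffron-1 API):
#   lm(prefix_ids)  -> next-token logits, shape (vocab_size,)
#   prm(seq_ids)    -> scalar safety score for one candidate sequence
#   mrm(prefix_ids) -> safety reward for every vocabulary token, shape (vocab_size,)

@torch.no_grad()
def prm_guided_step(lm, prm, prefix_ids, k=8):
    """Conventional guided step: k candidate tokens cost k PRM evaluations."""
    logits = lm(prefix_ids)                       # (vocab_size,)
    cand = torch.topk(logits, k).indices          # k candidate next tokens
    scores = torch.stack(
        [prm(torch.cat([prefix_ids, c.view(1)])) for c in cand]  # k reward calls
    )
    return cand[scores.argmax()]

@torch.no_grad()
def mrm_guided_step(lm, mrm, prefix_ids, alpha=1.0):
    """Multifurcation-style step: one reward call scores all candidate tokens."""
    logits = lm(prefix_ids)                       # (vocab_size,) LM logits
    rewards = mrm(prefix_ids)                     # (vocab_size,) token-level safety rewards
    # Trade off fluency against safety; one MRM evaluation replaces k PRM calls.
    return (logits.log_softmax(dim=-1) + alpha * rewards).argmax()
```

Under these assumptions, expanding k candidates costs k reward-model calls in the PRM baseline but a single call with the MRM, which is the source of the reduction in reward model evaluations that the abstract describes.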

🔍 Key Points

  • The paper shows that existing inference scaling methods are inadequate for keeping large language models (LLMs) safe against jailbreak attacks, and traces this failure to a newly identified 'exploration-efficiency dilemma': frequent process reward model (PRM) evaluations make thorough exploration computationally prohibitive.
  • The authors propose a novel inference scaling paradigm named Saffron, which introduces the multifurcation reward model (MRM) to allow simultaneous reward evaluations for all candidate tokens, thus reducing the number of reward model calls required during inference.
  • Saffron adds three supporting techniques: a partial supervision training objective for the MRM, a conservative exploration constraint that avoids out-of-distribution explorations, and a Trie-based key-value caching strategy that lets sequences share cache entries during tree search (a minimal trie-cache sketch follows this list), improving computational efficiency.
  • Extensive experiments show that Saffron substantially outperforms conventional inference scaling baselines in both safety assurance and scaling efficiency, making it a promising defense for LLMs against emerging threats.
  • The authors publicly release their trained MRM (Saffron-1) and a token-level safety reward dataset (Safety4M) to accelerate further research in LLM safety.
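
The Trie-based caching idea from the bullets above can be sketched as a prefix trie whose nodes each hold the cached key-value block of one token, so search branches that share a prefix reuse the same entries instead of recomputing them. This is an illustrative data structure under assumed interfaces, not the paper's implementation; `kv` is a placeholder for a transformer layer's per-token key-value tensors.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

@dataclass
class TrieNode:
    kv: Optional[Any] = None                              # cached KV block for this token
    children: Dict[int, "TrieNode"] = field(default_factory=dict)

class TrieKVCache:
    """Prefix trie that lets tree-search branches share cached KV blocks."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def lookup(self, token_ids: List[int]) -> Tuple[List[Any], int]:
        """Return KV blocks cached for the longest stored prefix of `token_ids`."""
        node, kvs = self.root, []
        for t in token_ids:
            child = node.children.get(t)
            if child is None or child.kv is None:
                break
            node = child
            kvs.append(child.kv)
        return kvs, len(kvs)          # cached blocks and how many prefix tokens they cover

    def insert(self, token_ids: List[int], kv_blocks: List[Any]) -> None:
        """Store one KV block per token so sibling sequences can reuse the prefix."""
        node = self.root
        for t, kv in zip(token_ids, kv_blocks):
            node = node.children.setdefault(t, TrieNode())
            if node.kv is None:
                node.kv = kv
```

In a tree search, each branch calls `lookup` before its forward pass and `insert` afterward, so the deeper the shared prefix, the less key-value computation is repeated across sibling sequences.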

💡 Why This Paper Matters

This paper is crucial for advancing the safety of LLMs, as it addresses the pressing vulnerabilities of current safety assurance methods and provides a scalable, effective solution that enhances the resilience of LLMs against exploitation. The introduction of Saffron marks a significant step toward integrating safety mechanisms into inference scaling, potentially improving the deployment of AI systems in sensitive areas.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers as it tackles the intersection of inference scalability and model safety—a critical area of concern given the increasing deployment of LLMs in real-world applications. The method proposed not only enhances the safety mechanisms of LLMs but also provides a framework that could be adapted and extended for various safety-critical AI applications, fostering the development of robust AI systems that can withstand adversarial attacks.

📚 Read the Full Paper

https://arxiv.org/abs/2506.06444