
Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M

Authors: Piyush Pant

Published: 2025-09-10

arXiv ID: 2509.09055v1

Added to Library: 2025-09-12 04:01 UTC

Safety

📄 Abstract

This research investigates the effectiveness of alignment techniques, namely Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach, in improving the safety and helpfulness of the OPT-350M language model. Utilizing the Anthropic Helpful-Harmless RLHF dataset, we train and evaluate four models: the base OPT-350M, an SFT model, a DPO model, and a model trained with both SFT and DPO. We introduce three key evaluation metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS), all derived from reward model outputs. The results show that while SFT outperforms DPO, the combined SFT+DPO model outperforms all others across all metrics, demonstrating the complementary nature of these techniques. Our findings also highlight challenges posed by noisy data, limited GPU resources, and training constraints. This study offers a comprehensive view of how fine-tuning strategies affect model alignment and provides a foundation for more robust alignment pipelines in future work.
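
For readers who want to reproduce this two-stage setup, the following is a minimal sketch of SFT followed by DPO on OPT-350M with the Anthropic HH-RLHF dataset. It assumes the Hugging Face transformers, datasets, and trl libraries; exact trainer arguments vary across TRL versions, and the prompt/response splitting shown here is a simplification, not necessarily the paper's own preprocessing.

```python
# Minimal sketch of the two-stage pipeline: SFT on preferred responses, then DPO on
# preference pairs, starting from facebook/opt-350m and the Anthropic HH-RLHF dataset.
# Assumes the Hugging Face transformers/datasets/trl stack; trainer arguments differ
# between TRL versions, so treat this as illustrative rather than exact.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each HH-RLHF row holds a full "chosen" and "rejected" dialogue sharing a common prefix.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# Stage 1: supervised fine-tuning on the preferred ("chosen") dialogues only.
sft_data = dataset.map(lambda row: {"text": row["chosen"]})
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="opt350m-sft", num_train_epochs=1),
    train_dataset=sft_data,
    processing_class=tokenizer,
)
sft_trainer.train()

def to_preference_format(row):
    # Split each dialogue at the final "\n\nAssistant:" turn: the shared prefix becomes
    # the prompt, the remainders become the chosen/rejected responses (a simplification).
    marker = "\n\nAssistant:"
    cut = row["chosen"].rfind(marker) + len(marker)
    return {
        "prompt": row["chosen"][:cut],
        "chosen": row["chosen"][cut:],
        "rejected": row["rejected"][cut:],
    }

# Stage 2: DPO on (prompt, chosen, rejected) triples, starting from the SFT checkpoint.
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="opt350m-sft-dpo", beta=0.1, num_train_epochs=1),
    train_dataset=dataset.map(to_preference_format),
    processing_class=tokenizer,
)
dpo_trainer.train()
```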

🔍 Key Points

  • The study evaluates the effectiveness of alignment techniques Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and their combination (SFT+DPO) on the OPT-350M language model, demonstrating that the combined approach achieves the highest performance across key evaluation metrics.
  • The paper introduces three new evaluation metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS), providing a structured way to assess model performance in terms of safety and helpfulness (a hypothetical sketch of how such metrics could be computed follows this list).
  • The results reveal challenges in using DPO alone, emphasizing the need for high-quality preference data and sufficient training epochs; without these, the model struggles to optimize alignment effectively.
  • The findings indicate that while SFT alone improves safety and helpfulness notably, the combined SFT+DPO approach harnesses the strengths of both methods, suggesting a complementary strategy can enhance model alignment.
  • The work highlights implications for smaller LLMs, indicating that effective alignment is feasible even with limited computational resources, making advancements accessible for various users such as startups and academic groups.
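
The exact metric definitions are given in the paper itself; as a rough illustration only, the sketch below assumes that HmR and HpR are the fractions of responses whose harmlessness and helpfulness reward-model scores exceed a threshold, and that CAS simply averages the two rates. The function names, threshold, and averaging rule are assumptions, not the paper's formulas.

```python
# Hypothetical sketch of the evaluation metrics (HmR, HpR, CAS) derived from
# reward-model outputs. ASSUMPTIONS: each rate is the fraction of responses whose
# reward score clears a threshold, and CAS is the mean of the two rates.
from dataclasses import dataclass
from typing import List

@dataclass
class AlignmentScores:
    harmlessness_rate: float   # HmR
    helpfulness_rate: float    # HpR
    combined_alignment: float  # CAS

def alignment_scores(
    harmlessness_rewards: List[float],  # per-response scores from a safety reward model
    helpfulness_rewards: List[float],   # per-response scores from a helpfulness reward model
    threshold: float = 0.0,             # assumed decision boundary; not specified here
) -> AlignmentScores:
    n = len(harmlessness_rewards)
    hmr = sum(r > threshold for r in harmlessness_rewards) / n
    hpr = sum(r > threshold for r in helpfulness_rewards) / n
    cas = (hmr + hpr) / 2  # assumed: simple mean of the two rates
    return AlignmentScores(hmr, hpr, cas)

# Example: score one model's responses with both reward models and aggregate.
print(alignment_scores([1.2, -0.3, 0.8], [0.5, 0.9, -0.1]))
```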

💡 Why This Paper Matters

This paper presents significant advancements in aligning smaller language models with human preferences for safety and helpfulness, showcasing the effectiveness of combining SFT and DPO techniques. By addressing the computational and resource constraints faced by many practitioners, it offers a scalable and practical approach to improving the deployment of language models in real-world applications, thus enhancing AI safety.

🎯 Why It's Interesting for AI Security Researchers

The findings of this research are particularly pertinent to AI security researchers, as they tackle the alignment issues associated with large language models that can produce harmful or biased outputs. Understanding the effectiveness of various training techniques to enhance model alignment directly contributes to the development of safer AI systems, which is critical for mitigating risks associated with the deployment of these technologies in sensitive environments. Furthermore, the study's insights into dataset quality and the challenges of training smaller LLMs are essential for researchers seeking to improve AI safety measures.

📚 Read the Full Paper: https://arxiv.org/abs/2509.09055v1