
JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs

Authors: Junlan Feng, Fanyu Meng, Chong Long, Pengyu Cong, Duqing Wang, Yan Zheng, Yuyao Zhang, Xuanchang Gao, Ye Yuan, Yunfei Ma, Zhijie Ren, Fan Yang, Na Wu, Di Jin, Chao Deng

Published: 2025-10-20

arXiv ID: 2510.17918v1

Safety

📄 Abstract

The hallucination and credibility concerns of large language models (LLMs) are global challenges that the industry is collectively addressing. Recently, significant advances have been made in post-training and inference techniques to mitigate these challenges. However, it is widely agreed that unsafe behaviors and hallucinations of LLMs intrinsically originate from pre-training, involving both the pre-training data and the next-token prediction learning mechanism. In this paper, we focus on enhancing pre-training data to improve the trustworthiness and safety of LLMs. Since the data are vast, it is almost impossible to entirely purge them of factual errors, logical inconsistencies, or distributional biases. Moreover, pre-training data lack grounding in real-world knowledge: each piece of data is treated as a sequence of tokens rather than as a representation of a part of the world. To overcome these issues, we propose approaches that enhance our pre-training data with its context in the world and add a substantial amount of data reflecting industrial scenarios. We argue that most source data are created by their authors for specific purposes in a certain spatial-temporal context; they have played a role in the real world. By incorporating related world-context information, we aim to better anchor pre-training data within real-world scenarios, thereby reducing uncertainty in model training and enhancing the model's safety and trustworthiness. We refer to our Data with World Context as DWC. We continue pre-training an earlier checkpoint of JT-35B-Base with 1.5 trillion DWC tokens, and we introduce post-training procedures to activate the potential of DWC. Compared with the Qwen model of a similar scale, JT-Safe-35B achieves an average performance improvement of 1.79% on safety and trustworthiness evaluation benchmarks, while being pre-trained with only 6.2 trillion tokens.

🔍 Key Points

  • Introduction of the Data with World Context (DWC): The paper proposes enhancing pre-training data for large language models (LLMs) by integrating contextual information that provides a more structured and reliable understanding of the world, thereby mitigating issues of hallucination and credibility.
  • Implementation of a multi-stage pre-training method: The authors describe a comprehensive pre-training pipeline that includes a general high-quality data stage, a DWC safety-enhancement stage, and a long-context annealing stage to systematically improve model performance in safety and trustworthiness.
  • Superior Results on Evaluation Benchmarks: The model JT-Safe-35B achieves 1.79% higher average performance on safety and trustworthiness benchmarks than comparably sized models while being trained on fewer tokens (6.2 trillion), and it shows notable gains across general and industry-specific capabilities.
  • Post-training Alignment Techniques: The paper elaborates on advanced reinforcement learning and supervised fine-tuning methods aimed at grounding the model in human values and safety expectations, enhancing its overall reliability and user trust.
  • Extensive Experimental Validation: The findings are supported by comprehensive experiments that analyze the performance of the model across multiple dimensions, indicating the successful synergy between DWC data and post-training techniques.
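
The "world context" idea in the first key point can be pictured as attaching provenance metadata to each document before tokenization, so the next-token objective sees the text anchored in the situation that produced it. The sketch below is purely illustrative: the paper does not publish a DWC schema, and every field name here (`source_url`, `author`, `created_at`, `locale`, `purpose`) is an assumption chosen to convey the spatial-temporal-context idea, not the authors' actual format.

```python
# Illustrative sketch only -- the DWC schema below is an assumption,
# not the format described in the paper.
from dataclasses import dataclass

@dataclass
class DWCRecord:
    text: str         # the original pre-training document
    source_url: str   # where the document appeared in the world
    author: str       # who created it
    created_at: str   # when it was created (temporal context)
    locale: str       # where it applies (spatial context)
    purpose: str      # why the author wrote it

    def to_training_text(self) -> str:
        """Prepend a context header so the document enters training
        anchored in its real-world context rather than as a bare
        token sequence."""
        header = (
            f"[SOURCE] {self.source_url}\n"
            f"[AUTHOR] {self.author}\n"
            f"[TIME] {self.created_at}\n"
            f"[PLACE] {self.locale}\n"
            f"[PURPOSE] {self.purpose}\n"
        )
        return header + self.text

# Hypothetical example document with its context attached.
record = DWCRecord(
    text="City maintenance crews will close Elm Street on Friday...",
    source_url="https://example.gov/notices/1234",
    author="Example City Public Works",
    created_at="2024-05-03",
    locale="Example City, US",
    purpose="public service announcement",
)
print(record.to_training_text().splitlines()[0])  # first line of the augmented document
```

The intuition the paper argues for is that grounding each document this way reduces ambiguity during training; how the context is actually encoded and weighted in JT-Safe's pipeline is not specified in this summary.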

💡 Why This Paper Matters

This paper presents crucial advancements in enhancing the safety and trustworthiness of LLMs through innovative data processing and training methodologies. By focusing on integrating world context into pre-training and employing systematic post-training techniques, the authors not only highlight the fundamental issues with existing models but also propose effective solutions that show measurable improvements in model performance. These contributions are vital for paving the way towards safer and more reliable AI applications, which are increasingly important in today's AI-driven world.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers as it directly addresses core challenges in LLMs related to hallucinations and credibility, which can lead to harmful outputs. The methods proposed here for enhancing pre-training data and implementing robust alignment techniques are relevant to building safer AI systems. Researchers focusing on AI security can explore how these advances improve model resilience against adversarial inputs and help ensure that models operate within ethical and safety constraints.

📚 Read the Full Paper