โ† Back to Library

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Authors: Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu, Qiankun Li, Kun Wang, Zhigang Zeng

Published: 2026-01-02

arXiv ID: 2601.00588v2

Added to Library: 2026-01-07 10:13 UTC

Safety

📄 Abstract

Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, while safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create a safety evaluation gap that is not well captured by existing English-focused benchmarks. This gap is particularly concerning for lightweight models, which may be more vulnerable to such adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench), which emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that Chinese-specific adversarial patterns are a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.
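
As a rough illustration of the adversarial pattern families the abstract names (pinyin mixing, homophone swaps, symbol-based splitting, and zero-width insertion), the Python sketch below applies each transformation to a benign placeholder query. The character mappings, the example query, and the function names are illustrative assumptions and are not drawn from CSSBench itself.

```python
# Hypothetical sketch of the Chinese-specific obfuscation patterns described
# in the paper (pinyin mixing, homophone swaps, symbol splitting, zero-width
# insertion). The toy mappings below are illustrative assumptions only and
# are NOT taken from CSSBench.

ZERO_WIDTH_SPACE = "\u200b"

# Toy pinyin / homophone tables for a couple of characters (assumed data).
PINYIN_MAP = {"药": "yao", "买": "mai"}
HOMOPHONE_MAP = {"药": "要"}  # same pronunciation "yao", different character


def pinyin_mix(text: str) -> str:
    """Replace selected characters with their pinyin romanization."""
    return "".join(PINYIN_MAP.get(ch, ch) for ch in text)


def homophone_swap(text: str) -> str:
    """Replace selected characters with same-sounding characters."""
    return "".join(HOMOPHONE_MAP.get(ch, ch) for ch in text)


def symbol_split(text: str, sep: str = "/") -> str:
    """Insert a visible separator symbol between every character."""
    return sep.join(text)


def zero_width_insert(text: str) -> str:
    """Insert invisible zero-width spaces between every character."""
    return ZERO_WIDTH_SPACE.join(text)


if __name__ == "__main__":
    query = "在哪里买药"  # benign placeholder query ("where to buy medicine")
    for fn in (pinyin_mix, homophone_swap, symbol_split, zero_width_insert):
        print(fn.__name__, "->", fn(query))
```

Transformations of this kind keep the query readable (pinyin, homophones, separators) or visually unchanged (zero-width characters) for a human while altering the token sequence the model sees, which is why they can slip past guardrails tuned on clean text.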

๐Ÿ” Key Points

  • Introduction of CSSBench, a Chinese-Specific Safety Benchmark for evaluating lightweight LLMs against unique Chinese adversarial patterns.
  • Identification of six safety domains commonly found in Chinese contexts, enhancing the understanding of LLM vulnerabilities in real-world scenarios.
  • Demonstration of the effectiveness of Chinese-specific adversarial patterns such as Pinyin mixes, homophones, and zero-width character insertions in evading safety measures.
  • Comprehensive analysis of safety performance across various lightweight LLMs, highlighting the trade-offs between safety and helpfulness through metrics such as attack success rate (ASR) and over-refusal rate (ORR); see the sketch after this list.
  • Identification of significant over-refusal behaviors in lightweight models, showing that many exhibit conservative refusals at the expense of helpfulness.
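
The metrics referenced above are attack success rate (ASR) on adversarial queries and over-refusal rate (ORR) on benign queries. Below is a minimal sketch of how such rates are commonly computed from labeled model responses; the record schema and field names are assumptions for illustration and do not reflect the paper's actual evaluation harness.

```python
# Minimal sketch of attack success rate (ASR) and over-refusal rate (ORR)
# computation. The record schema ("is_harmful_query", "complied", "refused")
# is an assumption for illustration, not CSSBench's actual format.

from typing import Iterable, Mapping


def attack_success_rate(records: Iterable[Mapping]) -> float:
    """Fraction of harmful (adversarial) queries the model complied with."""
    harmful = [r for r in records if r["is_harmful_query"]]
    if not harmful:
        return 0.0
    return sum(r["complied"] for r in harmful) / len(harmful)


def over_refusal_rate(records: Iterable[Mapping]) -> float:
    """Fraction of benign queries the model refused to answer."""
    benign = [r for r in records if not r["is_harmful_query"]]
    if not benign:
        return 0.0
    return sum(r["refused"] for r in benign) / len(benign)


if __name__ == "__main__":
    demo = [
        {"is_harmful_query": True, "complied": True, "refused": False},
        {"is_harmful_query": True, "complied": False, "refused": True},
        {"is_harmful_query": False, "complied": True, "refused": False},
        {"is_harmful_query": False, "complied": False, "refused": True},
    ]
    print("ASR:", attack_success_rate(demo))  # 0.5
    print("ORR:", over_refusal_rate(demo))    # 0.5
```

Read together, a higher ASR indicates weaker safety against adversarial queries, while a higher ORR indicates over-conservative refusals of benign queries, which is the safety-helpfulness trade-off the key points describe.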

💡 Why This Paper Matters

The findings and methodologies detailed in this paper highlight critical gaps in existing safety evaluations of lightweight language models, particularly in the context of Chinese language processing. By targeting specific adversarial strategies and introducing a benchmark tailored to these challenges, the study underscores the complexities of ensuring safety in AI models deployed in non-English settings. This work establishes a foundation for future research aimed at enhancing model robustness against diverse adversarial tactics, making it a significant contribution to the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

This paper addresses the critical issue of AI safety in a specific linguistic context, opening up avenues for understanding how language-specific nuances affect model vulnerability. For AI security researchers, the novel benchmark and evaluation methods presented offer a framework to investigate the robustness of language models against tailored adversarial attacks; such insights are increasingly essential as AI systems are deployed globally with varied linguistic considerations.

📚 Read the Full Paper