← Back to Library

CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance

Authors: Jinru Ding, Chao Ding, Wenrao Pang, Boyi Xiao, Zhiqiang Liu, Pengcheng Chen, Jiayuan Chen, Tiantian Yuan, Junming Guan, Yidong Jiang, Dawei Cheng, Jie Xu

Published: 2025-12-10

arXiv ID: 2512.09506v1

Added to Library: 2025-12-11 03:00 UTC

Red Teaming

📄 Abstract

Large language models are increasingly deployed across the financial sector for tasks such as research, compliance, risk analysis, and customer service, which makes rigorous safety evaluation essential. However, existing financial benchmarks primarily focus on textbook-style question answering and numerical problem solving, but fail to evaluate models' real-world safety behaviors. They weakly assess regulatory compliance and investor-protection norms, rarely stress-test multi-turn adversarial tactics such as jailbreaks or prompt injection, inconsistently ground answers in long filings, ignore tool- or RAG-induced over-reach risks, and rely on opaque or non-auditable evaluation protocols. To close these gaps, we introduce CNFinBench, a benchmark that employs finance-tailored red-team dialogues and is structured around a Capability-Compliance-Safety triad, including evidence-grounded reasoning over long reports and jurisdiction-aware rule/tax compliance tasks. For systematic safety quantification, we introduce the Harmful Instruction Compliance Score (HICS) to measure how consistently models resist harmful prompts across multi-turn adversarial dialogues. To ensure auditability, CNFinBench enforces strict output formats with dynamic option perturbation for objective tasks and employs a hybrid LLM-ensemble plus human-calibrated judge for open-ended evaluations. Experiments on 21 models across 15 subtasks confirm a persistent capability-compliance gap: models achieve an average score of 61.0 on capability tasks but fall to 34.18 on compliance and risk-control evaluations. Under multi-turn adversarial dialogue tests, most systems reach only partial resistance (HICS 60-79), demonstrating that refusal alone is not a reliable proxy for safety without cited and verifiable reasoning.

🔍 Key Points

  • CNFinBench introduces a novel benchmark for evaluating the safety and compliance of large language models (LLMs) specifically in financial contexts.
  • The benchmark incorporates a Capability-Compliance-Safety triad which emphasizes rigorous assessment of LLM behavior in real-world financial scenarios, unlike previous benchmarks that focus largely on numerical problem solving and factual correctness.
  • CNFinBench employs multi-turn adversarial consultations and dynamic output formats to test and quantify model resistance to harmful instructions, introducing the Harmful Instruction Compliance Score (HICS) for systematic evaluation.
  • The benchmark has been extensively evaluated on 21 models across 15 subtasks, revealing a significant discrepancy between capability (average score of 61.0) and compliance/risk-control performance (average score of 34.18).
  • The incorporation of an ensemble of LLMs with human-verified evaluations enhances the robustness and auditability of the results, providing clear insights into the reliable operation of LLMs in financial applications.

💡 Why This Paper Matters

The CNFinBench benchmark represents a critical advancement in the evaluation of LLMs' safety and compliance within the financial sector, highlighting the shortcomings of existing models and underscoring the need for comprehensive, domain-specific assessments. Its findings reinforce the necessity for models to not only provide accurate responses but also to comply with complex regulatory frameworks, thereby safeguarding against potential risks in high-stakes environments.

🎯 Why It's Interesting for AI Security Researchers

This paper will be of great interest to AI security researchers as it addresses the pressing need for improved safety and compliance measures in AI systems, particularly in industries as sensitive as finance. The introduction of methods like HICS and the focus on dynamic adversarial testing provide valuable insights into how LLMs can be rigorously scrutinized, contributing to robust security frameworks and enhancing the overall integrity of AI deployments.

📚 Read the Full Paper