
GovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

Authors: Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang

Published: 2025-12-04

arXiv ID: 2512.04416v1

Added to Library: 2025-12-05 03:05 UTC

Risk & Governance

📄 Abstract

Data governance ensures data quality, security, and compliance through policies and standards, a critical foundation for scaling modern AI development. Recently, large language models (LLMs) have emerged as a promising solution for automating data governance by translating user intent into executable transformation code. However, existing benchmarks for automated data science often emphasize snippet-level coding or high-level analytics, failing to capture the unique challenge of data governance: ensuring the correctness and quality of the data itself. To bridge this gap, we introduce GovBench, a benchmark featuring 150 diverse tasks grounded in real-world scenarios, built on data from actual cases. GovBench employs a novel "reversed-objective" methodology to synthesize realistic noise and utilizes rigorous metrics to assess end-to-end pipeline reliability. Our analysis on GovBench reveals that current models struggle with complex, multi-step workflows and lack robust error-correction mechanisms. Consequently, we propose DataGovAgent, a framework utilizing a Planner-Executor-Evaluator architecture that integrates constraint-based planning, retrieval-augmented generation, and sandboxed feedback-driven debugging. Experimental results show that DataGovAgent significantly boosts the Average Task Score (ATS) on complex tasks from 39.7 to 54.9 and reduces debugging iterations by over 77.9 percent compared to general-purpose baselines.

🔍 Key Points

  • Introduces GovBench, a benchmark with 150 tasks for evaluating large language models (LLMs) in data governance, emphasizing both single-step and complex multi-step workflows.
  • Presents a novel 'reversed-objective' methodology for realistic noise synthesis, enhancing the assessment of LLMs in handling real-world data governance scenarios.
  • Develops DataGovAgent, an integrated framework that uses a Planner-Executor-Evaluator architecture to convert natural language into executable data governance workflows, resulting in significant performance improvements over existing models.
  • Experimental results show DataGovAgent's effectiveness: it raises the Average Task Score (ATS) on complex tasks from 39.7 to 54.9 and cuts debugging iterations by over 77.9 percent, demonstrating gains in both correctness and efficiency.
  • The paper identifies current limitations of LLMs in data governance and provides a structured approach to address them, paving the way for advancements in automated data governance solutions.
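The Planner-Executor-Evaluator loop described above can be sketched in miniature. Everything here is illustrative: the function names (`plan_steps`, `run_step`, `evaluate`, `govern`), the toy deduplication-and-typing task, and the retry loop standing in for sandboxed feedback-driven debugging are assumptions for exposition, not the paper's actual implementation.

```python
def plan_steps(intent):
    """Planner: decompose a governance intent into ordered steps.

    A real planner would use an LLM with constraint-based planning and
    retrieval-augmented generation; here we return a fixed toy plan.
    """
    return ["drop_duplicates", "cast_types"]

def run_step(step, records):
    """Executor: apply one transformation step to the records."""
    if step == "drop_duplicates":
        seen, out = set(), []
        for r in records:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                out.append(r)
        return out
    if step == "cast_types":
        return [{**r, "age": int(r["age"])} for r in records]
    raise ValueError(f"unknown step: {step}")

def evaluate(records):
    """Evaluator: check output quality; return (ok, feedback)."""
    if any(not isinstance(r["age"], int) for r in records):
        return False, "age column must be integer"
    return True, "ok"

def govern(intent, records, max_iterations=3):
    """Run the pipeline; the retry loop is a crude stand-in for the
    paper's sandboxed, feedback-driven debugging."""
    for _ in range(max_iterations):
        out = records
        for step in plan_steps(intent):
            out = run_step(step, out)
        ok, feedback = evaluate(out)
        if ok:
            return out
    raise RuntimeError(f"pipeline failed: {feedback}")

dirty = [{"name": "a", "age": "30"},
         {"name": "a", "age": "30"},
         {"name": "b", "age": "41"}]
clean = govern("deduplicate and fix types", dirty)
```

The point of the structure is that the Evaluator's feedback, not just a pass/fail signal, would drive the next planning or debugging round in the real framework.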

💡 Why This Paper Matters

This paper is highly relevant as it addresses critical gaps in current benchmarks and modeling approaches for data governance, an area essential for ensuring the reliability and integrity of data in AI applications. By introducing GovBench and DataGovAgent, it not only highlights the challenges faced by existing models but also provides a practical framework to tackle these issues, demonstrating measurable improvements in task success rates and operational efficiency. The research contributes to the growing body of work aiming to automate and enhance data governance processes through advanced AI techniques, making it a significant resource for practitioners in the field.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper interesting because of its focus on data governance, which is integral to maintaining data integrity, privacy, and compliance in AI systems. The methodologies proposed, such as the 'reversed-objective' noise synthesis and the Planner-Executor-Evaluator framework, offer innovative ways to identify and mitigate potential data-related vulnerabilities. Furthermore, the performance metrics and findings can help assess the reliability of LLMs when they are integrated into systems with strict security requirements, which is a prerequisite for building secure AI applications.
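The 'reversed-objective' idea can be illustrated with a small sketch: rather than hand-labelling dirty data, start from clean records and apply the *inverse* of each governance objective to inject controlled noise, so every noisy input ships with a known ground truth. The corruption functions and names below are assumptions chosen for illustration, not the paper's actual task set.

```python
import random

def corrupt_whitespace(value):
    """Inverse of a 'trim whitespace' objective: pad the value."""
    return f"  {value} "

def corrupt_case(value):
    """Inverse of a 'normalize casing' objective: randomize case."""
    return "".join(c.upper() if random.random() < 0.5 else c.lower()
                   for c in value)

def synthesize(clean_records, corruptions, seed=0):
    """Produce (noisy, ground_truth) pairs with a known answer key."""
    random.seed(seed)
    pairs = []
    for record in clean_records:
        noisy = {k: random.choice(corruptions)(v)
                 for k, v in record.items()}
        pairs.append((noisy, record))
    return pairs

clean = [{"city": "berlin"}, {"city": "tokyo"}]
pairs = synthesize(clean, [corrupt_whitespace, corrupt_case])
```

Because the clean record is retained as ground truth, a model's proposed fix (here, something as simple as `value.strip().lower()`) can be scored exactly, which is what makes end-to-end pipeline metrics like ATS computable.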

📚 Read the Full Paper