
Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness

Authors: Haotian Deng, Chris Farber, Jiyoon Lee, David Tang

Published: 2025-12-21

arXiv ID: 2601.08843v1

Added to Library: 2026-01-15 03:01 UTC

Red Teaming

📄 Abstract

Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model's robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity. Our "Trust Curve" analysis demonstrates a clear trade-off where filtering low-confidence predictions improves accuracy on the remaining subset. Additionally, robustness experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions. Our work provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges, highlighting the importance of uncertainty estimation and robustness testing for reliable deployment.

🔍 Key Points

  • The paper systematically evaluates LLMs as rubric-based short-answer graders, finding that alignment with expert judgment degrades markedly as rubric granularity increases.
  • A consensus-based deferral mechanism manages grading uncertainty: low-confidence predictions are deferred rather than graded automatically, which improves accuracy on the subset the model still grades (a minimal sketch follows this list).
  • Robustness testing indicates that the model resists prompt-injection attacks but is sensitive to semantic perturbations such as synonym substitutions, exposing weaknesses in handling nuanced language variation (a second sketch below illustrates the perturbation check).
  • The work introduces a "Trust Curve" methodology that traces the trade-off between grading coverage and accuracy as the confidence threshold is varied, enabling deployments to choose how much grading to automate (see the first sketch below).
  • The study provides critical insights into the limitations of LLM grading, emphasizing the need for uncertainty estimation and robustness analysis before deploying automated grading systems.
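
The deferral mechanism and Trust Curve can be pictured with a minimal sketch. It assumes confidence is estimated by sampling the judge several times and taking agreement with the majority grade as the confidence score; the paper's actual prompts, sampling settings, and grading scale are not reproduced here, so `grade_fn`, `n_samples`, and the threshold grid are illustrative placeholders.

```python
"""Minimal sketch of consensus-based deferral and a Trust Curve.

Assumption: confidence = fraction of sampled judge grades agreeing with the
majority grade. `grade_fn` stands in for a call to the LLM judge.
"""
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def consensus_grade(
    grade_fn: Callable[[str], int], answer: str, n_samples: int = 5
) -> Tuple[int, float]:
    """Sample the judge n_samples times; return (majority grade, agreement)."""
    votes = [grade_fn(answer) for _ in range(n_samples)]
    grade, count = Counter(votes).most_common(1)[0]
    return grade, count / n_samples


def trust_curve(
    confidences: Sequence[float],
    predictions: Sequence[int],
    gold: Sequence[int],
    thresholds: Sequence[float] = (0.2, 0.4, 0.6, 0.8, 1.0),
) -> List[Tuple[float, float, float]]:
    """For each threshold, keep only predictions with confidence >= threshold
    and report (threshold, coverage, accuracy on the retained subset)."""
    curve = []
    for t in thresholds:
        kept = [(p, g) for c, p, g in zip(confidences, predictions, gold) if c >= t]
        coverage = len(kept) / len(gold)
        accuracy = sum(p == g for p, g in kept) / len(kept) if kept else float("nan")
        curve.append((t, coverage, accuracy))
    return curve
```

Items whose agreement falls below a chosen threshold would be deferred to a human grader; sweeping the threshold traces out the coverage-versus-accuracy points that make up the Trust Curve.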
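
The synonym-substitution robustness check can be sketched in the same spirit. The tiny synonym table, the substitution rate, and the `grade_fn` stub are assumptions for illustration only; the paper's actual perturbation pipeline is not described in this summary.

```python
"""Minimal sketch of a synonym-substitution robustness check: perturb student
answers and measure how often the judge's grade flips. The synonym table below
is a toy stand-in for a real lexical resource."""
import random
from typing import Callable, Dict, List

SYNONYMS: Dict[str, List[str]] = {
    "increases": ["rises", "grows"],
    "heat": ["thermal energy"],
    "because": ["since", "as"],
}


def perturb(answer: str, rate: float = 0.3, seed: int = 0) -> str:
    """Replace a fraction of known words with a randomly chosen synonym."""
    rng = random.Random(seed)
    words = answer.split()
    for i, w in enumerate(words):
        if w.lower() in SYNONYMS and rng.random() < rate:
            words[i] = rng.choice(SYNONYMS[w.lower()])
    return " ".join(words)


def flip_rate(grade_fn: Callable[[str], int], answers: List[str]) -> float:
    """Fraction of answers whose grade changes after perturbation."""
    flips = sum(grade_fn(a) != grade_fn(perturb(a)) for a in answers)
    return flips / len(answers)
```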

💡 Why This Paper Matters

This paper addresses key challenges in automated grading with LLMs, offering a detailed evaluation of alignment with expert judgment, uncertainty management, and robustness to adversarial inputs. The findings underscore the complexity of rubric-based grading and the need for human oversight, particularly in higher-stakes educational contexts. The proposed uncertainty-filtering and robustness-testing methods also pave the way for more reliable and transparent automated assessment tools.

🎯 Why It's Interesting for AI Security Researchers

This paper is of interest to AI security researchers because it examines robustness against adversarial attacks, a critical concern when deploying AI systems in real-world settings. By pinpointing specific vulnerabilities, such as sensitivity to input perturbations and the potential for hallucinated judgments, the research contributes to building more secure AI applications, particularly in educational environments where the integrity of evaluation is paramount.

📚 Read the Full Paper