← Back to Library

How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

Authors: Devanshu Sahoo, Vasudev Majhi, Arjun Neekhra, Yash Sinha, Murari Mandal, Dhruv Kumar

Published: 2025-12-11

arXiv ID: 2512.10415v1

Added to Library: 2025-12-12 03:01 UTC

Red Teaming

📄 Abstract

The use of Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly prevalent in academic environments. But their reliability can be compromised by students who may employ adversarial prompting strategies in order to induce misgrading and secure undeserved academic advantages. In this paper, we present the first large-scale study of jailbreaking LLM-based automated code evaluators in academic context. Our contributions are: (i) We systematically adapt 20+ jailbreaking strategies for jailbreaking AI code evaluators in the academic context, defining a new class of attacks termed academic jailbreaking. (ii) We release a poisoned dataset of 25K adversarial student submissions, specifically designed for the academic code-evaluation setting, sourced from diverse real-world coursework and paired with rubrics and human-graded references, and (iii) In order to capture the multidimensional impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evalulate the academic jailbreaking attacks using six LLMs. We find that these models exhibit significant vulnerability, particularly to persuasive and role-play-based attacks (up to 97% JSR). Our adversarial dataset and benchmark suite lay the groundwork for next-generation robust LLM-based evaluators in academic code assessment.

🔍 Key Points

  • Introduction of 'academic jailbreaking' as a distinct class of attacks targeting LLM-based code evaluators, highlighting vulnerabilities of LLMs in educational settings.
  • Release of a dataset comprising 25,000 adversarial student submissions, serving as a foundation for future evaluations of LLM robustness against academic jailbreaking.
  • Development of three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, Harmfulness) to quantitatively measure model robustness and grading reliability under adversarial conditions.
  • Comprehensive evaluation of six LLMs, revealing high susceptibility to specific attack strategies, particularly 'Role Play' and 'Disguised Intent'.
  • Proposals for defense strategies and protocols to improve the robustness of LLM-based evaluators, promoting enhanced reliability in academic contexts.

💡 Why This Paper Matters

This paper provides critical insights into the vulnerabilities of LLM-based grading systems in academic settings, establishing a new research area devoted to understanding and mitigating the risks of adversarial manipulation. Its findings and contributions are essential for enhancing trust and reliability in educational AI applications, ensuring fair and accurate evaluations of student work.

🎯 Why It's Interesting for AI Security Researchers

The paper is of significant interest to AI security researchers as it outlines a novel category of adversarial attacks specific to AI-assisted educational tools, calling attention to the safety implications of deploying such systems in sensitive environments. It offers a structured methodology for assessing vulnerabilities, which could inform future research focused on improving AI robustness and developing effective countermeasures against adversarial threats.

📚 Read the Full Paper