Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

Authors: Geert Trooskens, Aaron Karlsberg, Anmol Sharma, Lamara De Brouwer, Max Van Puyvelde, Matthew Young, John Thickstun, Gil Alterovitz, Walter A. De Brouwer

Published: 2026-04-06

arXiv ID: 2604.05150v1

Added to Library: 2026-04-08 02:00 UTC

📄 Abstract

We study compiled AI, a paradigm in which large language models generate executable code artifacts during a compilation phase, after which workflows execute deterministically without further model invocation. This paradigm has antecedents in prior work on declarative pipeline optimization (DSPy) and hybrid neural-symbolic planning (LLM+P); our contribution is a systems-oriented study of its application to high-stakes enterprise workflows, with particular emphasis on healthcare settings where reliability and auditability are critical. By constraining generation to narrow business-logic functions embedded in validated templates, compiled AI trades runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure. We introduce (i) a system architecture for constrained LLM-based code generation, (ii) a four-stage generation-and-validation pipeline that converts probabilistic model output into production-ready code artifacts, and (iii) an evaluation framework measuring operational metrics including token amortization, determinism, reliability, security, and cost. We evaluate on two task types: function-calling (BFCL, n=400) and document intelligence (DocILE, n=5,680 invoices). On function-calling, compiled AI achieves 96% task completion with zero execution tokens, breaking even with runtime inference at approximately 17 transactions and reducing token consumption by 57x at 1,000 transactions. On document intelligence, our Code Factory variant matches Direct LLM on key field extraction (KILE: 80.0%) while achieving the highest line item recognition accuracy (LIR: 80.4%). Security evaluation across 135 test cases demonstrates 96.7% accuracy on prompt injection detection and 87.5% on static code safety analysis with zero false positives.
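
The break-even arithmetic in the abstract can be made concrete with a small back-of-the-envelope calculation. The sketch below uses illustrative token costs (a hypothetical one-time compilation cost and a hypothetical per-transaction runtime cost) that are not figures from the paper; they are chosen only to show how a ~17-transaction break-even and a roughly 57x reduction at 1,000 transactions fall out once compiled execution consumes zero tokens.

```python
# Back-of-the-envelope token amortization: compiled AI vs. per-transaction runtime inference.
# All token counts are illustrative assumptions, not values reported in the paper.

RUNTIME_TOKENS_PER_TXN = 1_000    # hypothetical tokens spent per transaction with direct LLM calls
COMPILE_TOKENS_ONE_TIME = 17_000  # hypothetical one-time cost to generate and validate the artifact
EXEC_TOKENS_PER_TXN = 0           # compiled workflows execute without model invocation

def total_tokens_runtime(n_txns: int) -> int:
    """Total tokens when every transaction invokes the model."""
    return n_txns * RUNTIME_TOKENS_PER_TXN

def total_tokens_compiled(n_txns: int) -> int:
    """Total tokens when the model is invoked only during compilation."""
    return COMPILE_TOKENS_ONE_TIME + n_txns * EXEC_TOKENS_PER_TXN

# Break-even: smallest transaction count at which the compiled path is no more expensive.
break_even = COMPILE_TOKENS_ONE_TIME / RUNTIME_TOKENS_PER_TXN
print(f"break-even at ~{break_even:.0f} transactions")        # ~17

# Amortization ratio at 1,000 transactions.
ratio = total_tokens_runtime(1_000) / total_tokens_compiled(1_000)
print(f"~{ratio:.0f}x fewer tokens at 1,000 transactions")    # ~59x with these toy numbers
```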

🔍 Key Points

  • Introduces compiled AI, a paradigm in which LLMs generate executable code artifacts during a one-time compilation phase, after which workflows execute deterministically without further model invocation (a minimal sketch of this compile-then-execute split follows this list).
  • Constrains generation to narrow business-logic functions embedded in validated templates, trading runtime flexibility for predictability, auditability, cost efficiency, and reduced security exposure.
  • Contributes a system architecture for constrained LLM-based code generation, a four-stage generation-and-validation pipeline that turns probabilistic model output into production-ready artifacts, and an evaluation framework covering token amortization, determinism, reliability, security, and cost.
  • On function-calling (BFCL, n=400), achieves 96% task completion with zero execution tokens, breaking even with runtime inference at roughly 17 transactions and cutting token consumption 57x at 1,000 transactions; on document intelligence (DocILE, n=5,680 invoices), the Code Factory variant matches Direct LLM on key field extraction (KILE: 80.0%) and attains the highest line item recognition accuracy (LIR: 80.4%).
  • Security evaluation across 135 test cases reports 96.7% accuracy on prompt injection detection and 87.5% on static code safety analysis with zero false positives.
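
To make the compile-then-execute split concrete, the sketch below separates a one-time compilation phase (a stand-in for an LLM call fills a narrow business-logic body into a validated template, which is then parsed and loaded) from a runtime phase that executes the artifact with no model calls. The template, the generate_business_logic stub, and the invoice example are hypothetical illustrations of the idea in the abstract, not the authors' Code Factory implementation or their four-stage pipeline.

```python
# Minimal sketch of compile-then-execute, under the assumptions stated above.
import ast

TEMPLATE = '''
def process_invoice(invoice: dict) -> dict:
    """Validated wrapper: only the body below is model-generated."""
{body}
'''

def generate_business_logic(task_description: str) -> str:
    # Stand-in for the one-time LLM call of the compilation phase.
    # A real system would prompt a model here; this stub returns fixed code.
    return (
        "    total = sum(line['qty'] * line['unit_price'] for line in invoice['lines'])\n"
        "    return {'vendor': invoice['vendor'], 'total': round(total, 2)}\n"
    )

def compile_workflow(task_description: str):
    """Compilation phase: generate, statically check, and load the code artifact."""
    source = TEMPLATE.format(body=generate_business_logic(task_description))
    ast.parse(source)                      # reject syntactically invalid model output
    namespace: dict = {}
    exec(compile(source, "<artifact>", "exec"), namespace)
    return namespace["process_invoice"]

# Runtime phase: deterministic execution, zero tokens, identical output for identical input.
workflow = compile_workflow("extract vendor and total from an invoice")
invoice = {"vendor": "Acme", "lines": [{"qty": 2, "unit_price": 9.5}]}
print(workflow(invoice))   # {'vendor': 'Acme', 'total': 19.0}
```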

💡 Why This Paper Matters

The paper targets a practical pain point in deploying LLM-based workflow automation: per-transaction model calls are costly, nondeterministic, and hard to audit. By moving model invocation into a one-time compilation phase and executing validated code artifacts deterministically afterward, compiled AI offers predictability, auditability, and amortized token cost, properties that matter in high-stakes enterprise settings such as healthcare, where reliability and traceability are prerequisites for adoption.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find the paradigm relevant because it reduces the runtime attack surface: once a workflow is compiled, execution proceeds without further model invocation, and the generated artifacts are fixed objects that can be audited and statically analyzed before deployment. The paper supports this with a security evaluation across 135 test cases, reporting 96.7% accuracy on prompt injection detection and 87.5% on static code safety analysis with zero false positives, making it a useful reference point for securing LLM-generated code in production pipelines.

📚 Read the Full Paper