
Policy Compiler for Secure Agentic Systems

Authors: Nils Palumbo, Sarthak Choudhary, Jihye Choi, Prasad Chalasani, Somesh Jha

Published: 2026-02-18

arXiv ID: 2602.16708v2

Added to Library: 2026-02-20 03:02 UTC

📄 Abstract

LLM-based agents are increasingly being deployed in contexts requiring complex authorization policies: customer service protocols, approval workflows, data access restrictions, and regulatory compliance. Embedding these policies in prompts provides no enforcement guarantees. We present PCAS, a Policy Compiler for Agentic Systems that provides deterministic policy enforcement. Enforcing such policies requires tracking information flow across agents, which linear message histories cannot capture. Instead, PCAS models the agentic system state as a dependency graph capturing causal relationships among events such as tool calls, tool results, and messages. Policies are expressed in a Datalog-derived language, as declarative rules that account for transitive information flow and cross-agent provenance. A reference monitor intercepts all actions and blocks violations before execution, providing deterministic enforcement independent of model reasoning. PCAS takes an existing agent implementation and a policy specification, and compiles them into an instrumented system that is policy-compliant by construction, with no security-specific restructuring required. We evaluate PCAS on three case studies: information flow policies for prompt injection defense, approval workflows in a multi-agent pharmacovigilance system, and organizational policies for customer service. On customer service tasks, PCAS improves policy compliance from 48% to 93% across frontier models, with zero policy violations in instrumented runs.
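The abstract describes two core mechanisms: a dependency graph capturing causal relationships among events, and a reference monitor that blocks violating actions before execution. The sketch below illustrates that combination in miniature; it is not the PCAS implementation, and the event names, the `untrusted` label, and the single taint rule are illustrative assumptions standing in for the paper's Datalog-derived policy language.

```python
class DependencyGraph:
    """Tracks causal dependencies among agent events (tool calls, results, messages)."""

    def __init__(self):
        self.parents = {}  # event id -> set of events it causally depends on
        self.labels = {}   # event id -> set of labels, e.g. {"untrusted"}

    def add_event(self, event, parents=(), labels=()):
        self.parents[event] = set(parents)
        self.labels[event] = set(labels)

    def taint(self, event, label):
        # Transitive check: does any causal ancestor of `event` carry `label`?
        stack, seen = [event], set()
        while stack:
            e = stack.pop()
            if e in seen:
                continue
            seen.add(e)
            if label in self.labels.get(e, ()):
                return True
            stack.extend(self.parents.get(e, ()))
        return False


class ReferenceMonitor:
    """Intercepts proposed actions and blocks policy violations before execution."""

    def __init__(self, graph):
        self.graph = graph

    def check_tool_call(self, call_event):
        # Example rule (a prompt-injection defense): no externally visible
        # action may depend, even transitively, on untrusted input.
        if self.graph.taint(call_event, "untrusted"):
            raise PermissionError(f"blocked: {call_event} depends on untrusted data")
        return True


g = DependencyGraph()
g.add_event("web_result", labels={"untrusted"})   # tool result from the open web
g.add_event("agent_msg", parents={"web_result"})  # agent message derived from it
g.add_event("send_email", parents={"agent_msg"})  # proposed external action

monitor = ReferenceMonitor(g)
# monitor.check_tool_call("send_email") would raise PermissionError here,
# because the email transitively depends on the untrusted web result.
```

The transitive `taint` walk corresponds to the transitive information-flow tracking the abstract says linear message histories cannot capture; how PCAS actually encodes such rules in its Datalog-derived language is not shown here.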

🔍 Key Points

  • Presents PCAS, a policy compiler that takes an existing agent implementation and a policy specification and produces an instrumented system that is policy-compliant by construction, with no security-specific restructuring required.
  • Models agentic system state as a dependency graph over events (tool calls, tool results, messages), capturing causal relationships that linear message histories cannot.
  • Expresses policies as declarative rules in a Datalog-derived language that accounts for transitive information flow and cross-agent provenance.
  • Enforces policies deterministically via a reference monitor that intercepts all actions and blocks violations before execution, independent of model reasoning.
  • Evaluates three case studies (information flow policies for prompt injection defense, approval workflows in a multi-agent pharmacovigilance system, and organizational customer service policies), improving compliance from 48% to 93% across frontier models with zero policy violations in instrumented runs.

💡 Why This Paper Matters

The paper moves policy enforcement for LLM agents from prompt-based best effort to deterministic guarantees. Because PCAS compiles an existing agent and a declarative policy specification into a system that is compliant by construction, it offers a practical path to deploying agents in settings governed by approval workflows, data access restrictions, and regulatory compliance, without restructuring the agent for security.

🎯 Why It's Interesting for AI Security Researchers

This work treats agent security as a systems problem: a reference monitor mediates every action, and policies are evaluated over a causal dependency graph rather than a linear message history. That design targets threats that prompt-level defenses handle unreliably, notably prompt injection, and the reported jump from 48% to 93% policy compliance, with zero violations in instrumented runs, suggests that deterministic enforcement can complement model reasoning instead of depending on it.
