A New Approach to AI Agent Accountability
A regulator has questions about a decision your agent made last quarter. A customer is in litigation because an AI-generated action cost them money. An internal auditor needs to certify that your agent did not act outside its authorised scope between two dates.
In each case, the same question: what cryptographic record proves that?
If your AI stack looks like most stacks in 2026, the honest answer is — nothing does. You have logs. Logs are claims, not evidence. You have a vendor saying their classifier flagged X% of bad prompts. A vendor assertion is not an audit artefact. You have screenshots and incident write-ups. Those don’t survive cross-examination.
This is the gap I want to talk about. And it’s why, after three months of work, I am open-sourcing the project I’ve been building to close it.
Where the prompt-injection literature converged — and where it stopped
For five years, AI security research has had a single dominant framing: how do we detect bad prompts?
The arc has been:
- Pattern matching → easily evaded.
- Heuristic scoring → noisy, false-positive-prone.
- Fine-tuned classifiers → better, but never zero.
- Constitutional classifiers (Anthropic 2024) → state of the art, statistical guarantee.
- StruQ / SecAlign / ASIDE → moving the defence into the model architecture.
These are real advances. The AgentDojo benchmark’s strongest text-side defences in late 2025 sit somewhere between 14% and 31% attack-success rate. That is, against every modern attack template, a well-tuned defence still lets one in three to one in seven attacks through.
Critically: the trajectory of improvement here is asymptotic. We are not on a path to zero. We are on a path to “a bit better, a bit better, a bit better.” That is the structural property of statistical detection over unbounded natural language — you cannot refute every possible input.
For the operator of a real AI agent in a regulated industry, this is the wrong shape of solution. “My classifier catches 98% of bad prompts” does not survive an FCA audit. “We blocked 14% of attempted attacks in Q3” does not satisfy EU AI Act Article 12. The regulator does not want a statistic. The regulator wants evidence.
The shift: from detection to proof
There is a second insight in the prompt-injection literature, and it has been mostly ignored: the consequential part of an agent’s behaviour passes through a very narrow interface.
When an AI agent does something in the real world — moves money, sends an email, writes to a database, books a flight, queries a customer record — it does so via a tool call. A tool call is not unbounded natural language. It is a structured invocation: a tool name, a set of arguments, conforming to a JSON Schema the tool itself declared. The space of well-formed tool calls is bounded, decidable, and — critically — formally tractable.
This is where the security model can flip. If you cannot verify properties of unbounded natural language, but you can verify properties of bounded JSON Schemas, then your enforcement should move out of the model and onto the tool-call boundary.
This is what I have been building.
The project: raucle, and the capability receipt
Raucle is an open-source library that produces a capability receipt for every action an AI agent takes. The receipt is signed, content-addressed, and verifiable by any third party — a regulator, a partner organisation’s gate, a downstream tool, an internal AppSec team — without contacting your vendor.
The technique is Verified Capability Discipline (VCD). It composes three primitives, each useful in isolation, each engineered to compose:
- An SMT-backed prover. For each tool’s JSON Schema and security policy, the prover (Z3) either proves that every schema-valid call satisfies the policy, or extracts a concrete counterexample. The resulting
ProofResultis content-addressed. - Cryptographic capability tokens. An Ed25519-signed token binds an agent identity, a tool, a constraint set, an attenuation chain, and an expiry — and carries the hash of the policy proof it was minted against. Attenuation invariants are mechanised in Lean 4: a child token cannot broaden its parent’s permissions, and cannot outlive it. This is enforced cryptographically, not by trust.
- A gate. Every tool call passes a gate that runs eight verifications against the in-force token. Fail-closed by default. ALLOW or DENY, the gate’s decision is the payload of a capability receipt.
The receipt has the form:
{
"issuer": "acme.bank.kyc-platform",
"issuer_pubkey": "ed25519:MCowBQYDK2VwAyEA9f8e...",
"schema_hash": "sha256:9f8e2a47...",
"policy_proof_hash": "sha256:4b78e687...",
"lean_theorem_id": "vcd_kyc_lookup_v1.policy_completeness",
"attenuation_chain": ["root:acme.bank.root", "tok_kyc_2026Q3"],
"agent_id": "kyc-agent-prod-eu-west-1",
"tool": "crm.lookup_customer",
"call_args_hash": "sha256:8ba70481...",
"decision": "ALLOW",
"decision_reason": "constraints satisfied",
"timestamp": "2026-05-27T08:15:22.041Z",
"signature": "ed25519:rJk7Q1xS..."
}
A verifier holding only the deploying organisation’s published public key, schema registry, policy registry, and Lean development URL can confirm offline:
- The signature is valid.
- The
schema_hashmatches the published schema registry — the tool’s contract was the published one. - The
policy_proof_hashmatches the published policy registry — the agent ran under an approved policy. - The cited
lean_theorem_idcloses when the published Lean development is rebuilt — the soundness claim is machine-checkable. - The
call_args_hashsatisfies the policy’s constraint set for the cited tool — the specific call really was authorised.
None of these checks contact the operator. No information asymmetry remains.
Why this matters for regulated industries
For the audiences I work with — banks, fintechs, government departments, clinical platforms — the model is not “block the bad input.” The model is “prove what we did.” EU AI Act Article 12 (logging obligations) and Article 26 (deployer obligations) require audit logs that survive independent scrutiny. The FCA’s model-risk-management expectations name “defensible evidence of model behaviour.” ISO/IEC 42001 requires AI Management System artefacts that an auditor can verify.
A vendor log is none of these. A signed capability receipt is.
There is a second enterprise need the same primitive addresses, parallel to the external-audit case: internal governance of the development teams shipping AI agents. Every organisation with separate AppSec and engineering functions has the same unsolved question — how does AppSec know what the dev team’s agent really does in production? Today’s answer is “code review + log review + trust.” That does not scale and does not survive an internal audit. With capability receipts, AppSec maintains a registry of approved policy_proof_hash values; every agent’s receipts cite the policy proof hash it ran against; AppSec verifies nightly that no production agent cites a hash that isn’t on the approved list. No code review, no trust, just cryptographic comparison.
Same primitive. Two accountability needs. Different rooms inside the same organisation.
What the empirical evidence says
Beyond the formal soundness, I needed to establish that the technique works against real attacks at real cost. The evaluation has run on three frontier-class open-weight model generations (deepseek-v3.2, deepseek-v4-flash, deepseek-v4-pro), three attack families (important_instructions, direct, ignore_previous), and the four AgentDojo task suites (banking, slack, travel, workspace). Two additional providers — Alibaba’s Qwen3.5:397b and Moonshot’s Kimi-k2.6 — are reported alongside.
The headline numbers, from the IEEE Security & Privacy 2027 submission draft:
- 100% block rate on attacker-controlled tool calls across 720 LLM-driven banking attack attempts. Every non-zero residual maps to the same
user_task_15 × injection_task_4benchmark IBAN-coincidence artefact, documented in §6.5 of the paper. - +27 to +58 percentage-point more legitimate task completion than the strongest contemporary text-side defence (Microsoft Prompt Shields’ DeBERTa detector) at equivalent attack-success rate.
- Sub-100 µs per-call gate latency at p50 on commodity hardware. End-to-end agent wall time with raucle enabled is at or below the unprotected baseline on four of eight measured cohorts — when the gate denies an attacker-induced tool call, the agent’s reasoning loop returns to the legitimate task instead of looping further.
- A static upper bound verified by the gate’s own constraint logic establishes that 0 of 2,737 catalogued attack scenarios across AgentDojo and InjecAgent admit any attacker-controlled call.
The cleanest single data point is on Moonshot’s Kimi-k2.6. That model’s native instruction adherence is strong enough that the baseline attack-success rate against the AgentDojo banking suite is already 0.0%. No defence is required for security. Microsoft Prompt Shields, run against that configuration, nonetheless collapses benign task completion from 68.8% to 34.7% — a 34.1 percentage-point loss for security that wasn’t needed. Raucle imposes 2.8 points. The cost of statistical detection is independent of its security necessity; raucle’s cost scales with the actual work it does. This is the cleanest demonstration in the data that the trade-off curve is shaped differently for the two approaches.
The Microsoft moment
Two things to acknowledge directly.
In April 2026, Microsoft shipped the Agent Governance Toolkit — seven MIT-licensed packages covering policy enforcement, SPIFFE-based identity, Merkle-chained audit, and an MCP security gateway. This is, broadly, the same category raucle occupies. Microsoft has the distribution, the customer relationships, and the integration with the Microsoft Agent Framework and Azure AI Foundry that no third party can match.
Raucle’s defensible position is what AGT does not have:
- Mechanised soundness proofs. Three theorems in Lean 4, zero
sorrys, externally verifiable. AGT has conformance tests, not theorems. - Offline-verifiable receipts. AGT’s audit chain is operator-rooted. Raucle’s receipts are content-addressed and signed under a key your auditor fetches from a published location and then verifies without contacting you.
- SMT-verified policy completeness. AGT uses YAML / OPA / Cedar interpreters. Raucle proves the policy holds over every string the schema admits, then content-addresses the proof.
The strategic posture is with AGT, not against it. The Microsoft Agent Framework’s FunctionMiddleware is a documented, stable extensibility hook; raucle slots in there as a thin shim. Customers can run AGT for its in-process default policies and reach for raucle when their regulator demands a cryptographic record AGT cannot produce. Both are MIT-friendly enough to coexist on the same deployment.
The repository, and what I am asking for
The reference implementation is at github.com/craigamcw/raucle-detect, released under AGPL-3.0-or-later, with a commercial licence available for licence-incompatible uses. The Lean 4 development and the benchmark harness ship in the same repository. The IEEE S&P 2027 draft is in paper/.
I am asking for three kinds of contribution.
1. Eyes on the proofs. The three Lean theorems are in paper/lean/. They close with zero sorrys on the current code. If you can read Lean, please try to break them. A theorem that someone outside the project has tried to break is worth more than one that nobody has.
2. Integration adapters. The project ships adapters for AgentDojo, OpenClaw, and direct Python integration. The next pieces I want are a Microsoft Agent Framework FunctionMiddleware, a LangChain runnable, an AutoGen-equivalent, and a Vercel AI SDK middleware. If you maintain or know any of these stacks, the integration is approximately 200 lines of glue and meaningfully expands what raucle can be reached from.
3. Real-world policies. The empirical evaluation runs against AgentDojo’s pre-registered policy files. The hardest problem in production is authoring policies for real tools — banking APIs, healthcare records systems, payments networks. Contributing policy templates for tools you have integrated, with explanations of the threat model behind each, is the single most useful thing for the community.
If you are a security engineer, an AppSec leader, or a platform engineer at a regulated organisation, the most direct contribution is to try the library against a real agent in your environment and tell me what broke. I am especially interested in cases where:
- The schema fragment your tool uses falls outside the supported decidable subset.
- The intent-extraction step is awkward for your workflow.
- The gate’s eight checks are the wrong eight.
- Your auditor or regulator has a specific evidence requirement the receipt format does not cover.
What this is, and what it isn’t
This is not a complete solution to AI safety. It does not address:
- Free-form natural-language output attacks — the agent printing a secret it retrieved through a legitimate tool call.
- Side-channel exfiltration through allowed parameter values.
- Confused-deputy attacks across multiple legitimately-authorised calls (handled by a separate provenance-graph layer, in progress).
- Model-level vulnerabilities — sleeper agents, embedding-space attacks, the broader question of model trustworthiness.
It is the audit and accountability layer those other defences need to compose with. It produces a record that survives cross-examination, and it does so on a path that has formal mathematical guarantees, not statistical hopes.
Closing thought
The prompt-injection-defence problem and the agent-accountability problem look similar from a distance and are fundamentally different up close. The literature has chased the first. The audit problem regulated industries actually have is the second.
Raucle is my answer. It is open-source, mechanised, and benchmarked. The repository is active and the paper is in flight. If you are working on AI agents in any of the contexts where audit matters — finance, healthcare, government, enterprise platforms — I would be glad to compare notes.
Star the repo, file an issue, send a pull request. Or just open a discussion thread and tell me what your real-world friction looks like.
Raucle is open source under AGPL-3.0-or-later, with a commercial licence available for licence-incompatible uses. The IEEE Security & Privacy 2027 paper draft, the Lean 4 mechanisation, the AgentDojo / InjecAgent benchmark harness, and the reference deployment all ship in the same repository at github.com/craigamcw/raucle-detect.
