Expand ↗
Page list (1268)

Defeating Prompt Injections by Design

Reference: Debenedetti, Shumailov, Fan, Hayes, Carlini, Fabian, Kern, Shi, Terzis & Tramèr (2025). Defeating Prompt Injections by Design (CaMeL). arXiv:2503.18813 (Google DeepMind / ETH Zürich). URL. Code: https://github.com/google-research/camel-prompt-injection.

Summary

CaMeL (“CApabilities for MachinE Learning”) is a robust, by-design defence against Prompt Injection attacks on tool-using LLM Agents. Rather than trying to make the model itself injection-resistant — an approach that decade-long experience with content filters suggests will fail — CaMeL wraps an arbitrary LLM in a protective system layer that performs explicit control- and data-flow separation between the trusted user query and the untrusted data the agent retrieves from tools, websites, or shared memory.

The trusted query is first compiled into a structured plan: a small program whose control flow is fixed at parse time and whose data flow between steps is statically determined. Untrusted strings returned by tools are treated as inert data — they can populate variables but cannot rewrite the program, redirect tool calls, or change which downstream tools are invoked. To prevent exfiltration over authorised channels (the harder half of the problem, since some tools must be allowed to write outwards), CaMeL attaches Capabilities to each data value tracking its provenance and policy class; tool invocations are gated by Information Flow Control policies that check capabilities against an explicit security label lattice.

Evaluated on the AgentDojo benchmark, CaMeL solves 77 % of tasks with provable security guarantees, against 84 % for an undefended baseline — a small utility cost for a structural defence that does not depend on the LLM noticing the attack. The paper positions CaMeL as a successor to ad-hoc prompt-level mitigations and as a concrete instance of end-to-end security thinking applied to agentic AI.

Key Ideas

  • Threat model: prompt injection from any untrusted data source the agent reads — tools, web pages, files, memory, other agents.
  • Control-flow extraction: parse the trusted user query into a fixed control-flow plan; downstream model calls see only data, never code.
  • Data-flow tracking: every variable carries a provenance label; tools that consume “untrusted” labels cannot influence which subsequent tools are called.
  • Capabilities for tool calls: classic capability-based access control transplanted to LLM tool use; security policies enforced at the tool boundary.
  • Provable security: when a task is completed under CaMeL, the trace itself certifies that no untrusted data influenced control flow — a property auditable post hoc.
  • Empirical cost: 77 % vs 84 % task success — graceful degradation rather than catastrophic refusal.
  • Open source: reference implementation released; integrates with existing agent frameworks via tool-call interception.

Connections

Conceptual Contribution

Tags

#agent-security #prompt-injection #llm-agents #capabilities #information-flow-control #tool-use

Backlinks