Expand ↗
Page list (1268)

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Reference: Debenedetti, Zhang, Balunović, Beurer-Kellner, Fischer & Tramèr (2024). AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. NeurIPS 2024 Datasets & Benchmarks Track. arXiv:2406.13352 (ETH Zürich). URL. Project: https://agentdojo.spylab.ai/.

Summary

AgentDojo is the first benchmark designed to evaluate the adversarial robustness of tool-using LLM Agents against Prompt Injection attacks in realistic settings. The authors observe that existing prompt-injection evaluations are either toy (single-turn, one tool) or static (a fixed adversarial corpus that defences quickly memorise). AgentDojo instead provides an extensible execution environment: 97 realistic multi-step tasks across four simulated domains (Slack-like workspace, e-banking, travel booking, e-mail client) plus 629 injection test cases drawn from a structured threat taxonomy, with a clean separation between user tasks, injection tasks, and defence wrappers.

Each evaluation pair consists of (a) a legitimate user goal the agent must achieve and (b) an attacker-chosen secondary goal injected via tool output, document content, or third-party message. A run “succeeds for the attacker” if the agent completes the injected task; it “succeeds for the user” if the original goal is met regardless. This separation surfaces realistic costs: aggressive defences may stop attacks but also break the agent.

Empirically, state-of-the-art LLMs solve less than 66 % of the legitimate tasks even in the absence of attacks. Existing prompt-injection attacks succeed against the best agents in under 25 % of cases, and existing defences (delimiters, instruction-paraphrase detectors, secondary injection-detector LLMs) drop the attack success rate to ~8 % — leaving a wide gap from the “no attacks” baseline. AgentDojo has since become the standard arena for new defences (e.g. CaMeL) and adaptive attacks.

Key Ideas

  • Four realistic environments: Slack-style workspace, e-banking, travel booking, e-mail client — each with tens of stateful tools.
  • 97 user tasks × 629 injection tests: taxonomised by attacker goal (data exfiltration, unauthorised action, denial of service, etc.).
  • Dynamic, extensible API: new tasks/attacks/defences pluggable as Python classes; no fixed leaderboard.
  • Two orthogonal success criteria: user-task success and attack success are measured independently — surfacing the security–utility tradeoff.
  • Attack catalogue: indirect injection via tool returns, document poisoning, conversation hijack, social engineering; adaptive variants supported.
  • Defence catalogue: instruction delimiters, role labels, secondary classifier, tool-call gating, full-system mitigations like CaMeL.
  • Headline numbers: best agents solve <66 % of clean tasks; attacks succeed <25 % unaided; ~8 % with current defences — but still a gap, especially for adaptive attacks.

Connections

Conceptual Contribution

Tags

#agent-security #prompt-injection #llm-agents #benchmark #tool-use #neurips

Backlinks