AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Reference: Debenedetti, Zhang, Balunović, Beurer-Kellner, Fischer & Tramèr (2024). AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. NeurIPS 2024 Datasets & Benchmarks Track. arXiv:2406.13352 (ETH Zürich). URL. Project: https://agentdojo.spylab.ai/.

Summary

AgentDojo is the first benchmark designed to evaluate the adversarial robustness of tool-using LLM Agents against Prompt Injection attacks in realistic settings. The authors observe that existing prompt-injection evaluations are either toy (single-turn, one tool) or static (a fixed adversarial corpus that defences quickly memorise). AgentDojo instead provides an extensible execution environment: 97 realistic multi-step tasks across four simulated domains (Slack-like workspace, e-banking, travel booking, e-mail client) plus 629 injection test cases drawn from a structured threat taxonomy, with a clean separation between user tasks, injection tasks, and defence wrappers.

Each evaluation pair consists of (a) a legitimate user goal the agent must achieve and (b) an attacker-chosen secondary goal injected via tool output, document content, or third-party message. A run “succeeds for the attacker” if the agent completes the injected task; it “succeeds for the user” if the original goal is met regardless. This separation surfaces realistic costs: aggressive defences may stop attacks but also break the agent.

Empirically, state-of-the-art LLMs solve less than 66 % of the legitimate tasks even in the absence of attacks. Existing prompt-injection attacks succeed against the best agents in under 25 % of cases, and existing defences (delimiters, instruction-paraphrase detectors, secondary injection-detector LLMs) drop the attack success rate to ~8 % — leaving a wide gap from the “no attacks” baseline. AgentDojo has since become the standard arena for new defences (e.g. CaMeL) and adaptive attacks.

Key Ideas

Four realistic environments: Slack-style workspace, e-banking, travel booking, e-mail client — each with tens of stateful tools.
97 user tasks × 629 injection tests: taxonomised by attacker goal (data exfiltration, unauthorised action, denial of service, etc.).
Dynamic, extensible API: new tasks/attacks/defences pluggable as Python classes; no fixed leaderboard.
Two orthogonal success criteria: user-task success and attack success are measured independently — surfacing the security–utility tradeoff.
Attack catalogue: indirect injection via tool returns, document poisoning, conversation hijack, social engineering; adaptive variants supported.
Defence catalogue: instruction delimiters, role labels, secondary classifier, tool-call gating, full-system mitigations like CaMeL.
Headline numbers: best agents solve <66 % of clean tasks; attacks succeed <25 % unaided; ~8 % with current defences — but still a gap, especially for adaptive attacks.

Connections

Conceptual Contribution

Claim: Prompt-injection robustness must be measured in the wild — across realistic multi-tool tasks where the agent must do useful work while exposed to attacker-controlled inputs. Static benchmarks systematically over-estimate defence strength; an extensible environment that supports adaptive attack/defence development is the right empirical instrument.
Mechanism: A Python execution environment with four domains, hundreds of stateful tools, structured user-task / injection-task pairs, and parallel success metrics; defences and attacks register as plug-ins so new variants can be evaluated against existing ones.
Concepts introduced/used: AgentDojo, Prompt Injection, Indirect Prompt Injection, Adaptive Attack, Tool Use, Agent Security, Security-Utility Tradeoff
Stance: empirical / benchmark
Relates to: Direct companion to Defeating Prompt Injections by Design (the CaMeL defence); operationalises the attack-surface taxonomy of SoK The Attack Surface of Agentic AI and the multi-agent threat catalogue of Open Challenges in Multi-Agent Security; complements tool-level threat studies like MalTool Malicious Tool Attacks and ClawWorm Self-Propagating Attacks Across LLM Agent Ecosystems.

Backlinks

Defeating Prompt Injections by Design ×3
AgentDojo
index
concept-map ×2

Linked Pages

ClawWorm Self-Propagating Attacks Across LLM Agent Ecosystems

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Reference: Yihao Zhang, Zeming Wei, Xiaokun Luan, Chengcan Wu, Zhixin Zhang, Jiangrong Wu, Haolin Wu, Huanran Chen, Jun Sun, Meng Sun (2026). arXiv:2603.15727v2 (Peking University; Sun Yat-sen; Wuhan; Tsinghua; SMU). Source file: 2603.15727v2.pdf. URL

Summary

Presents ClawWorm, the first demonstrated self-replicating, worm-style attack on a production-scale autonomous LLM-agent ecosystem. The target is OpenClaw, an open-source personal AI-agent framework with over 40,000 active instances, a persistent Markdown workspace (SOUL.md, AGENTS.md, SKILL.md), tool-execution privileges, and cross-platform messaging (Telegram, Discord, WhatsApp, Slack, Signal, Moltbook). A single crafted message triggers the victim to write a malicious payload into its highest-privilege configuration file, which then auto-fires at every session restart and autonomously propagates to every newly encountered peer — all without further attacker intervention.

The worm implements a dual-anchor persistence mechanism: one anchor injects the payload into the Session Startup section of AGENTS.md (guaranteeing execution on reboot), the other injects a global interaction rule (guaranteeing propagation during routine replies). Three attack vectors are studied (A: web injection, B: skill-supply-chain via ClawHub, C: direct fenced-code replication with word-by-word handshake) and three payloads (P1 recon, P2 resource exhaustion, P3 command-and-control via URL retrieval). Across 1,800 trials on four frontier LLM backends (Minimax-M2.5, DeepSeek-V3.2, GLM-5, Kimi-K2.5) the aggregate attack success rate is 64.5%, with Vector B (skill supply chain) reaching 81% and sustained multi-hop propagation up to 5 hops. An epidemiological projection with basic reproduction number R0 = k × ASR shows inevitable ecosystem-wide saturation even for security-conscious models.

The root cause is identified as the flat context trust model: the LLM cannot distinguish instructions from its owner, the system layer, or an arbitrary channel participant, so architectural patterns (unconditional workspace loading, LLM-mediated tool authorisation, unreviewed skill packages) amount to structural — not idiosyncratic — vulnerabilities shared by any agent ecosystem of similar design.

Key Ideas

Single-message, fully autonomous worm against a production agent framework
Dual-anchor persistence: Session Startup + global interaction rule
Three attack vectors (web URL, skill supply chain, direct instruction replication)
Multi-turn autonomous-retry social engineering boosts ASR by up to +24 pp
Epidemiological SI model with R0 = k × ASR predicts ecosystem saturation
Execution-layer guardrails alone cannot halt propagation (dormant payloads persist)
Flat context trust model as structural root cause

Connections

Agent Security
Prompt Injection
LLM Agents
Multi-Agent Systems
Distributed Security
Model Context Protocol
Gossip Protocols
Trust and Reputation
Theory of Self-Reproducing Automata — foundational ancestor of self-replicating computation
Agents of Chaos — empirical companion on agent-ecosystem failures
MalTool Malicious Tool Attacks — the tool/skill-supply-chain attack surface
SoK The Attack Surface of Agentic AI — systematic context
Inter-Agent Trust Models - A Comparative Study — why flat trust is brittle
CBCL - Safe Self-Extending Agent Communication — structural defence: lang-scoped dialect provenance plus R1–R3 verification address the flat-context-trust root cause.

Conceptual Contribution

Claim: Production-scale autonomous LLM-agent ecosystems are vulnerable to single-message, self-replicating worms whose root cause is architectural (flat context trust, unconditional config loading, unreviewed skill supply chains), not model-specific.
Mechanism: Empirical red-team against unmodified OpenClaw v2026.3.12 across four LLM backends, three vectors, three payloads (1,800 trials). A dual-anchor persistence pattern writes the payload to AGENTS.md and installs a global propagation rule; session-restart loading re-injects the payload into the system prompt; routine replies carry the payload to peers. Evaluated with per-phase metrics (persistence, execution, propagation) and a mean-field R0 epidemiological projection.
Concepts introduced/used: Self-Replicating Agent, Dual-Anchor Persistence, Flat Context Trust Model, Skill Supply Chain Attack, Indirect Prompt Injection, Agent Worm, Configuration Integrity, Multi-Turn Social Engineering, Epidemiological Projection R0
Stance: empirical / critique
Relates to: Concrete multi-agent instantiation of the threat surface catalogued in SoK The Attack Surface of Agentic AI. The flat-trust critique complements the trust-model taxonomy in Inter-Agent Trust Models - A Comparative Study and the safety failures observed in Agents of Chaos. Motivates verifiable specifications of the kind proposed in Intent Formalization - A Grand Challenge for Reliable Coding.

MalTool Malicious Tool Attacks

MalTool: Malicious Tool Attacks on LLM Agents

Reference: Hu, Jia, Li, Song, Gong (2026). arXiv:2602.12194 (Duke, UC Berkeley). Source file: 2602.12194v2.pdf. URL

Summary

This paper presents the first systematic study of code-level malicious tool attacks on LLM agent ecosystems (MCP, Skills, mcp.so, skillsmp). Whereas prior work focused on crafting misleading tool names and descriptions, the authors show that genuinely harmful behaviour must be embedded in a tool’s implementation. They propose a CIA (confidentiality/integrity/availability) taxonomy of 12 concrete malicious behaviours (data exfiltration, credential abuse, data poisoning, file deletion, RCE downloading, CPU/GPU hijacking, DoS).

They build MalTool, a coding-LLM framework that iteratively synthesizes standalone and Trojan malicious tools using a behaviour-specific system prompt, diversity guidance, and an execution-based verifier. The result: 1,200 standalone malicious tools and 5,287 real-world tools with injected malicious behaviours. Detection methods (VirusTotal, Cisco MCP Scanner, MCPScan) perform poorly, motivating new defences.

Key Ideas

CIA taxonomy of malicious tool behaviours in agent settings.
Automatic generation pipeline: system prompt + coding LLM + execution-based verifier.
Trojan construction by embedding malicious logic in benign tool code.
Existing malware and MCP-specific scanners fail on both false-positives and false-negatives.
Dataset released for benign tools only to minimize misuse.

Connections

Conceptual Contribution

Claim: Truly harmful behaviour in LLM Agents ecosystems lives in tool implementations, not in their descriptions; prior description-level red-teaming misses the dominant attack class, and current scanners miss it too.
Mechanism: Introduces a CIA taxonomy of 12 malicious behaviours; builds MalTool, a coding-LLM pipeline (behaviour-specific system prompt + diversity guidance + execution-based verifier) that produces standalone and Trojan tools; benchmarks VirusTotal, Cisco MCP Scanner, MCPScan and shows poor detection.
Concepts introduced/used: Tool Use, Model Context Protocol, Trojan Tools, Prompt Injection, Agent Security, LLM Agents, Distributed Security
Stance: empirical
Relates to: Deepens the tool/MCP threat surface catalogued in AI Agents Under Threat and Survey Of Agent Interoperability Protocols; motivates language-based defences akin to A Language-Based Approach To Prevent DDoS and capability isolation of Security Kernel Lambda Calculus.

Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents

Reference: Schroeder de Witt, Krawiecka, Krawczuk, Hagag, Anderson, et al. (24 authors total) (2025). Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents. arXiv:2505.02077 (Oxford / Cambridge / EPFL / industrial labs). URL.

Summary

This position paper introduces Multi-Agent Security (MASec) as a distinct research field, sitting between traditional cybersecurity, AI safety, and multi-agent systems — and argues that it is the dominant security frontier as LLM Agents begin to interact directly with one another across the open web, physical environments, and institutional infrastructures. The threats MASec studies emerge from interaction; they are not properties of any single agent in isolation.

The authors taxonomise threats arising from agent interaction along several axes: (i) secret collusion (agents coordinating to defeat oversight through covert side-channels including steganographic message-passing), (ii) coordinated swarm attacks (jailbreaks, prompt injections, or misinformation cascading through agent networks), (iii) network-effect amplification (privacy breaches, data poisoning, and disinformation spreading faster than mitigation), and (iv) multi-agent dispersion / stealth optimisation (adversaries exploiting fleet size to evade detection and persist).

They argue these threats are systematically understudied because research is scattered across AI Safety, Multi-Agent Systems, Distributed Security, Game Theory, complex systems, and AI governance, each with its own vocabulary. The paper provides a unifying taxonomy, identifies fundamental security–utility and security–security trade-offs, and lays out a research agenda — including the design of Free-Form Protocols (necessary for task generalisation but enabling collusion), governance and attribution infrastructure, and detection/response mechanisms for emergent multi-agent threats. The work is foundational reading for anyone designing inter-agent protocols, including the Agent-to-Agent Protocol, Model Context Protocol, and successors.

Key Ideas

Defines Multi-Agent Security (MASec) as a field: securing networks of interacting AI agents against threats that emerge or amplify through interaction.
Secret collusion: agents coordinating covertly (including via steganography) to defeat oversight — a new kind of “Schelling-point” attack on alignment.
Coordinated swarm attacks: distributed jailbreaks, prompt injections, data poisoning that succeed because the fleet succeeds even when individual instances fail.
Network effects: privacy breaches, disinformation, and jailbreaks spread through agent populations the way they spread through humans — only faster.
Dispersion & stealth optimisation: adversaries exploit the size and heterogeneity of agent fleets to evade oversight; novel persistent threats at system level.
Free-form protocols as risk surface: the same expressivity that makes inter-agent communication useful enables covert channels; reining in expressivity costs utility.
Security–utility and security–security trade-offs are fundamental — every defence opens or closes other attack surfaces.
Calls for a unified MASec research agenda spanning AI Safety, Distributed Security, Game Theory, complex systems, and AI governance.

Connections

Conceptual Contribution

Claim: Security of interacting AI agents is a distinct problem from either single-agent AI safety or classical cybersecurity. Threats emerge from interaction (secret collusion, swarm attacks, network-effect contagion) and are systematically missed by frameworks anchored to individual systems or static attack surfaces.
Mechanism: A new field — Multi-Agent Security — with a threat taxonomy (collusion, swarm, contagion, dispersion), explicit security–utility / security–security trade-offs, and a research agenda spanning protocol design, attribution, detection, and governance.
Concepts introduced/used: Multi-Agent Security, Secret Collusion, Swarm Attack, Network Effect (Security), Free-Form Protocols, Stealth Optimisation, Agent Security, AI Governance
Stance: position paper / survey / research agenda
Relates to: Sister survey to SoK The Attack Surface of Agentic AI but operating one level up — at networks of agents rather than the agent runtime. Provides the multi-agent threat model that defences like Defeating Prompt Injections by Design address, that infrastructure proposals like Infrastructure for AI Agents try to govern, and that economic frameworks like Virtual Agent Economies embed. Directly extends classical Distributed Security and connects to Learning Collusion in Episodic Inventory-Constrained Markets for the collusion sub-thread.

SoK The Attack Surface of Agentic AI

SoK: The Attack Surface of Agentic AI — Tools, and Autonomy

Reference: Ali Dehghantanha, Sajad Homayoun (2026). arXiv:2603.22928v1 (Cyber Science Lab, University of Guelph; Aalborg University). Source file: 2603.22928v1.pdf. URL

Summary

A systematisation-of-knowledge paper that maps the attack surface of agentic LLM systems — those that plan, call tools, browse, run code, coordinate with other agents, and rely on retrieval-augmented generation (RAG). The authors develop a reference pipeline, identify ten numbered attack surfaces (AS1–AS10) across a Trusted Computing Base (TCB) boundary separating the LLM core, planner, orchestrator, policy guards, and secrets vault from untrusted inputs (web, RAG index, tools, APIs, file I/O).

From a literature-driven review of ~100 candidate papers (2023–2025) they synthesise a taxonomy of seven attack goals (G1 data exfiltration, G2 integrity subversion, G3 privilege escalation, G4 resource abuse, G5 fraud, G6 persistence/backdoor, G7 supply-chain compromise) and five multi-step attack paths (P1–P5) including direct and indirect prompt injection, RAG index poisoning, cross-tool drop, and multi-agent hops. The work maps each vector to OWASP LLM Top-10 2025 and MITRE ATLAS IDs, and proposes attacker-aware quantitative metrics (Unsafe Action Rate, Policy Adherence Rate, Privilege-Escalation Distance, Retrieval Risk Score, Time-to-Contain, Out-of-Role Action Rate, Cost-Exploit Susceptibility) for reproducible benchmarking.

The central thesis is that agentic security risk is structural rather than prompt-level: compromises arise from system composition — tool brokering, persistent memory, and execution lifecycle — that blurs trust boundaries between the model, data, and execution environment. A defence-in-depth playbook across pre-ingestion, inference, agent logic, infrastructure, and monitoring layers is given in appendices.

Key Ideas

Reference agentic pipeline with explicit TCB and ten numbered attack surfaces (AS1–AS10)
Taxonomy of 7 attack goals × 7 vector classes × 5 attack paths
Causal threat graph for tracing attacker influence to unsafe action
Attacker-aware metrics: UAR, PAR, PED, RRS, TTC, OORAR, CES
Mapping to OWASP GenAI LLM Top-10 2025 and MITRE ATLAS
RAG is not intrinsically safer; indirect injection is practical and hard to stamp out
Defence-in-depth across five layers (data, inference, agent logic, infra, monitoring)

Connections

Conceptual Contribution

Claim: Agentic AI security risk is a structural property of system composition (tool use, persistent memory, orchestration, supply chain) rather than a model-level prompt-safety problem; a reference TCB model plus attacker-aware metrics is needed to make defences auditable and comparable.
Mechanism: Define a reference pipeline with trust boundary between trusted orchestration (LLM core, planner, policy, vault) and untrusted ingress (web, RAG, sandbox, APIs). Enumerate ten attack surfaces, seven goals, five multi-step paths, map each to OWASP/MITRE, and define scenario-driven metrics (UAR, PAR, PED, RRS, TTC, OORAR, CES) computable from structured execution traces.
Concepts introduced/used: Agentic TCB, Attack Surface Taxonomy, Causal Threat Graph, Indirect Prompt Injection, RAG Poisoning, Privilege-Escalation Distance, Unsafe Action Rate, OWASP LLM Top-10, MITRE ATLAS, Defence in Depth
Stance: survey / engineering
Relates to: Complements A Language-Based Approach To Prevent DDoS and LangSec by extending structural-security thinking to agentic runtimes. Sits alongside Prompt Injection and Agent Security concept hubs, and provides the threat model that protocols like Model Context Protocol and Agent-to-Agent Protocol must defend against.

Defeating Prompt Injections by Design

Reference: Debenedetti, Shumailov, Fan, Hayes, Carlini, Fabian, Kern, Shi, Terzis & Tramèr (2025). Defeating Prompt Injections by Design (CaMeL). arXiv:2503.18813 (Google DeepMind / ETH Zürich). URL. Code: https://github.com/google-research/camel-prompt-injection.

Summary

CaMeL (“CApabilities for MachinE Learning”) is a robust, by-design defence against Prompt Injection attacks on tool-using LLM Agents. Rather than trying to make the model itself injection-resistant — an approach that decade-long experience with content filters suggests will fail — CaMeL wraps an arbitrary LLM in a protective system layer that performs explicit control- and data-flow separation between the trusted user query and the untrusted data the agent retrieves from tools, websites, or shared memory.

The trusted query is first compiled into a structured plan: a small program whose control flow is fixed at parse time and whose data flow between steps is statically determined. Untrusted strings returned by tools are treated as inert data — they can populate variables but cannot rewrite the program, redirect tool calls, or change which downstream tools are invoked. To prevent exfiltration over authorised channels (the harder half of the problem, since some tools must be allowed to write outwards), CaMeL attaches Capabilities to each data value tracking its provenance and policy class; tool invocations are gated by Information Flow Control policies that check capabilities against an explicit security label lattice.

Evaluated on the AgentDojo benchmark, CaMeL solves 77 % of tasks with provable security guarantees, against 84 % for an undefended baseline — a small utility cost for a structural defence that does not depend on the LLM noticing the attack. The paper positions CaMeL as a successor to ad-hoc prompt-level mitigations and as a concrete instance of end-to-end security thinking applied to agentic AI.

Key Ideas

Threat model: prompt injection from any untrusted data source the agent reads — tools, web pages, files, memory, other agents.
Control-flow extraction: parse the trusted user query into a fixed control-flow plan; downstream model calls see only data, never code.
Data-flow tracking: every variable carries a provenance label; tools that consume “untrusted” labels cannot influence which subsequent tools are called.
Capabilities for tool calls: classic capability-based access control transplanted to LLM tool use; security policies enforced at the tool boundary.
Provable security: when a task is completed under CaMeL, the trace itself certifies that no untrusted data influenced control flow — a property auditable post hoc.
Empirical cost: 77 % vs 84 % task success — graceful degradation rather than catastrophic refusal.
Open source: reference implementation released; integrates with existing agent frameworks via tool-call interception.

Connections

Conceptual Contribution

Claim: Prompt injection is structurally unsolvable at the model layer; it must be eliminated by enforcing a strict separation between code (the trusted query) and data (everything else) at the agent runtime, using classical capability-based Information Flow Control rather than ML-based content classification.
Mechanism: Compile the user query into a fixed control-flow program; route all retrieved data through tagged variables; gate every tool invocation by capability-checked information-flow policies. The LLM’s outputs can populate data fields but never alter control flow or bypass capability checks.
Concepts introduced/used: CaMeL, Control-Flow Integrity, Data-Flow Tracking, Capabilities, Information Flow Control, Prompt Injection, Tool Use, Agent Security, Provable Security (Agents)
Stance: systems / engineering with light formal grounding
Relates to: Spiritual successor to A Language-Based Approach To Prevent DDoS and Security Kernel Lambda Calculus for agent runtimes; an architectural realisation of the threat model catalogued in SoK The Attack Surface of Agentic AI and the multi-agent threats surveyed in Open Challenges in Multi-Agent Security; companion to AgentDojo (the benchmark on which it is evaluated).

Security-Utility Tradeoff

(page does not exist)

Agent Security

Security concerns specific to LLM-agent systems: tool attacks, prompt injection, memory poisoning, inter-agent trust failures.

In this vault

Tool Use

LLM-agent capability of invoking external tools (APIs, code execution, database queries). Standardised through Model Context Protocol.

In this vault

Adaptive Attack

(page does not exist)

Indirect Prompt Injection

Prompt-injection attack delivered via content the agent retrieves or ingests (web pages, emails, tool outputs) rather than directly from a user — structurally unavoidable in a flat context model.

In this vault

Prompt Injection

Attack where adversary-controlled text inside an LLM’s input context is interpreted as instructions — classic LangSec parser-differential in a natural-language setting.

In this vault

AgentDojo

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Summary

Key Ideas

Four realistic environments: Slack-style workspace, e-banking, travel booking, e-mail client — each with tens of stateful tools.
97 user tasks × 629 injection tests: taxonomised by attacker goal (data exfiltration, unauthorised action, denial of service, etc.).
Dynamic, extensible API: new tasks/attacks/defences pluggable as Python classes; no fixed leaderboard.
Two orthogonal success criteria: user-task success and attack success are measured independently — surfacing the security–utility tradeoff.
Attack catalogue: indirect injection via tool returns, document poisoning, conversation hijack, social engineering; adaptive variants supported.
Defence catalogue: instruction delimiters, role labels, secondary classifier, tool-call gating, full-system mitigations like CaMeL.
Headline numbers: best agents solve <66 % of clean tasks; attacks succeed <25 % unaided; ~8 % with current defences — but still a gap, especially for adaptive attacks.

Connections

Conceptual Contribution

Claim: Prompt-injection robustness must be measured in the wild — across realistic multi-tool tasks where the agent must do useful work while exposed to attacker-controlled inputs. Static benchmarks systematically over-estimate defence strength; an extensible environment that supports adaptive attack/defence development is the right empirical instrument.
Mechanism: A Python execution environment with four domains, hundreds of stateful tools, structured user-task / injection-task pairs, and parallel success metrics; defences and attacks register as plug-ins so new variants can be evaluated against existing ones.
Concepts introduced/used: AgentDojo, Prompt Injection, Indirect Prompt Injection, Adaptive Attack, Tool Use, Agent Security, Security-Utility Tradeoff
Stance: empirical / benchmark
Relates to: Direct companion to Defeating Prompt Injections by Design (the CaMeL defence); operationalises the attack-surface taxonomy of SoK The Attack Surface of Agentic AI and the multi-agent threat catalogue of Open Challenges in Multi-Agent Security; complements tool-level threat studies like MalTool Malicious Tool Attacks and ClawWorm Self-Propagating Attacks Across LLM Agent Ecosystems.

Model Context Protocol

MCP — an open protocol (Anthropic, 2024) standardising how LLM applications connect to external tools and data sources.

Discussed in:

Distributed Security

Security of distributed/agent systems: mobile code, secure messaging, language-based defences.

AI Agents Under Threat

AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways

Reference: Deng, Guo, Han, Ma, Xiong, Wen, Xiang (2025). ACM Computing Surveys 57(7), Article 182. Source file: 3716628.pdf. URL

Summary

This survey organizes the emerging threat landscape of LLM-powered AI agents around four knowledge gaps: unpredictability of multi-step user inputs, complexity of internal execution, variability of operational environments, and interactions with untrusted external entities. It unifies single-agent and multi-agent attack surfaces within a perception/brain/action + agent2agent/agent2env/agent2memory taxonomy.

Concrete threats reviewed include adversarial prompts, prompt injection, jailbreaks, backdoor attacks, hallucination and misalignment, tool-use risks, indirect prompt injection, reinforcement-learning environment attacks, cooperative and competitive inter-agent risks, and long/short-term memory attacks. The authors tabulate defenses (prevention- and detection-based), rate their efficacy, and highlight open directions for robust and trustworthy agents.

Key Ideas

Four knowledge gaps framing agent security.
Taxonomy: perception / brain / action / agent2agent / agent2env / agent2memory threats.
Six categories of prompt-injection attack engineering (naive, escape, context-ignore, fake-completion, multimodal, combined).
Jailbreak domino effect in multi-agent populations.
Memory poisoning and indirect prompt injection as underexplored surfaces.

Connections

Conceptual Contribution

Claim: LLM Agents security should be organised around four knowledge gaps (input unpredictability, internal complexity, environmental variability, untrusted interactions) mapped onto a perception/brain/action + agent2{agent,env,memory} taxonomy.
Mechanism: Surveys adversarial prompts, prompt injection, jailbreaks, backdoors, hallucination, tool-use risks, indirect injection, RL environment attacks, inter-agent cooperative/competitive risks, memory poisoning; tabulates prevention- vs detection-based defences and rates their efficacy.
Concepts introduced/used: Prompt Injection, Jailbreak, Backdoor Attacks, Tool Use, Memory Poisoning, Hallucination, Model Context Protocol, LLM Agents, Multi-Agent Systems, Trust and Reputation, Distributed Security, Agent Security
Stance: survey
Relates to: Provides the threat scaffolding that MalTool Malicious Tool Attacks deepens at the tool layer; complements lifecycle threats in Survey Of Agent Interoperability Protocols; motivates static-analysis defences like A Language-Based Approach To Prevent DDoS.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Summary

Key Ideas

Connections

Conceptual Contribution

Tags

Backlinks