Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents

Reference: Schroeder de Witt, Krawiecka, Krawczuk, Hagag, Anderson, et al. (24 authors total) (2025). Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents. arXiv:2505.02077 (Oxford / Cambridge / EPFL / industrial labs). URL.

Summary

This position paper introduces Multi-Agent Security (MASec) as a distinct research field, sitting between traditional cybersecurity, AI safety, and multi-agent systems — and argues that it is the dominant security frontier as LLM Agents begin to interact directly with one another across the open web, physical environments, and institutional infrastructures. The threats MASec studies emerge from interaction; they are not properties of any single agent in isolation.

The authors taxonomise threats arising from agent interaction along several axes: (i) secret collusion (agents coordinating to defeat oversight through covert side-channels including steganographic message-passing), (ii) coordinated swarm attacks (jailbreaks, prompt injections, or misinformation cascading through agent networks), (iii) network-effect amplification (privacy breaches, data poisoning, and disinformation spreading faster than mitigation), and (iv) multi-agent dispersion / stealth optimisation (adversaries exploiting fleet size to evade detection and persist).

They argue these threats are systematically understudied because research is scattered across AI Safety, Multi-Agent Systems, Distributed Security, Game Theory, complex systems, and AI governance, each with its own vocabulary. The paper provides a unifying taxonomy, identifies fundamental security–utility and security–security trade-offs, and lays out a research agenda — including the design of Free-Form Protocols (necessary for task generalisation but enabling collusion), governance and attribution infrastructure, and detection/response mechanisms for emergent multi-agent threats. The work is foundational reading for anyone designing inter-agent protocols, including the Agent-to-Agent Protocol, Model Context Protocol, and successors.

Key Ideas

Defines Multi-Agent Security (MASec) as a field: securing networks of interacting AI agents against threats that emerge or amplify through interaction.
Secret collusion: agents coordinating covertly (including via steganography) to defeat oversight — a new kind of “Schelling-point” attack on alignment.
Coordinated swarm attacks: distributed jailbreaks, prompt injections, data poisoning that succeed because the fleet succeeds even when individual instances fail.
Network effects: privacy breaches, disinformation, and jailbreaks spread through agent populations the way they spread through humans — only faster.
Dispersion & stealth optimisation: adversaries exploit the size and heterogeneity of agent fleets to evade oversight; novel persistent threats at system level.
Free-form protocols as risk surface: the same expressivity that makes inter-agent communication useful enables covert channels; reining in expressivity costs utility.
Security–utility and security–security trade-offs are fundamental — every defence opens or closes other attack surfaces.
Calls for a unified MASec research agenda spanning AI Safety, Distributed Security, Game Theory, complex systems, and AI governance.

Connections

Conceptual Contribution

Claim: Security of interacting AI agents is a distinct problem from either single-agent AI safety or classical cybersecurity. Threats emerge from interaction (secret collusion, swarm attacks, network-effect contagion) and are systematically missed by frameworks anchored to individual systems or static attack surfaces.
Mechanism: A new field — Multi-Agent Security — with a threat taxonomy (collusion, swarm, contagion, dispersion), explicit security–utility / security–security trade-offs, and a research agenda spanning protocol design, attribution, detection, and governance.
Concepts introduced/used: Multi-Agent Security, Secret Collusion, Swarm Attack, Network Effect (Security), Free-Form Protocols, Stealth Optimisation, Agent Security, AI Governance
Stance: position paper / survey / research agenda
Relates to: Sister survey to SoK The Attack Surface of Agentic AI but operating one level up — at networks of agents rather than the agent runtime. Provides the multi-agent threat model that defences like Defeating Prompt Injections by Design address, that infrastructure proposals like Infrastructure for AI Agents try to govern, and that economic frameworks like Virtual Agent Economies embed. Directly extends classical Distributed Security and connects to Learning Collusion in Episodic Inventory-Constrained Markets for the collusion sub-thread.

Tags

#agent-security #multi-agent-security #llm-agents #ai-safety #position-paper #distributed-security

Backlinks

Linked Pages

Learning Collusion in Episodic Inventory-Constrained Markets

Learning Collusion in Episodic, Inventory-Constrained Markets

Reference: Friedrich, Pásztor & Ramponi (2024). Learning Collusion in Episodic, Inventory-Constrained Markets. AAMAS 2025. arXiv:2410.18871 (ETH Zürich; UZH). URL. Proceedings: https://ifaamas.csc.liv.ac.uk/Proceedings/aamas2025/pdfs/p803.pdf.

Summary

Building on the now-established result that simple Q-learning pricing agents converge to tacitly collusive outcomes in stationary Bertrand games (Calvano et al. 2020), Friedrich et al. extend the analysis to a far more realistic and economically consequential setting: episodic, inventory-constrained markets — perishable supply with a sell-by date, such as airline seats, hotel rooms, fresh produce, event tickets. These markets are characterised by (i) finite inventory that expires, (ii) episodic resets, and (iii) richer state than vanilla pricing games, so analytical Nash / collusive benchmarks are not available in closed form.

The authors formalise tacit collusion in this setting via a price-level metric that interpolates between the competitive (Nash) and monopolistic (cartel-optimal) optima. Since neither extreme is analytically tractable, they develop a computational procedure to derive both benchmarks. They then train deep RL agents to set prices in repeated episodes and find that even without cross-episode memory, sufficiently long episodes are enough for agents to converge to collusive equilibria. Three distinct collusion structures are identified: signalling (agents probe each others’ responses to coordinate), stable (a steady high-price equilibrium with implicit threats), and cyclic (alternating high/low prices akin to Edgeworth cycles). With cross-episode memory, punishment for deviation becomes possible, and the collusive equilibria sharpen further.

The paper is important for Algorithmic Collusion / competition policy because it shows tacit-collusion findings do not depend on the toy stationary-Bertrand setup that critics dismissed — they recur, and indeed grow richer, in markets that match real high-stakes industries. It is also a direct empirical anchor for the systemic-risk warnings in Virtual Agent Economies and the multi-agent-security threat catalogue in Open Challenges in Multi-Agent Security.

Key Ideas

Episodic inventory-constrained markets: finite perishable supply with sell-by dates — airline seats, hotel rooms, perishables — much richer than stationary Bertrand.
Price-level collusion metric: interpolation between competitive Nash and monopolistic optima; quantifies “how much” the agents collude.
Computational benchmark derivation: since closed forms don’t exist, compute Nash and cartel optima numerically as evaluation reference points.
Deep RL agents converge to collusion even without explicit cross-episode memory, in long-enough episodes.
Three collusion structures: signalling, stable, and cyclic — the latter resembling Edgeworth cycles observed in human markets.
Cross-episode memory amplifies collusion: punishment-of-deviation becomes credible, sharpening collusive equilibria.
Policy implication: algorithmic collusion is not a stationary-Bertrand artefact — it generalises to economically central market structures.

Connections

Conceptual Contribution

Claim: Tacit algorithmic collusion is not an artefact of stationary toy markets. In economically central market structures — finite-inventory perishable goods with episodic resets — deep RL agents reliably converge to collusive pricing equilibria, often via richly structured strategies (signalling, stable, cyclic). The phenomenon generalises and probably understates real-world risk.
Mechanism: Formal episodic inventory-constrained pricing model; computational derivation of Nash and cartel benchmarks; deep RL pricing agents trained over many episodes; analysis of the converged strategies; comparison with and without cross-episode memory.
Concepts introduced/used: Algorithmic Collusion, Tacit Collusion, Inventory-Constrained Pricing, Episodic Markets, Signalling Collusion, Cyclic Collusion, Edgeworth Cycle, Multi-Agent Reinforcement Learning
Stance: empirical / theoretical
Relates to: Direct empirical evidence for the systemic-risk arguments in Virtual Agent Economies and the collusion-threat row of the taxonomy in Open Challenges in Multi-Agent Security. Sits alongside Do LLM Agents Have Regret in the “LLM and RL agents in games” thread; downstream of The Evolution of Cooperation and Iterated Prisoners Dilemma in the game-theoretic foundations.

Distributed Security

Security of distributed/agent systems: mobile code, secure messaging, language-based defences.

Virtual Agent Economies

Reference: Tomasev, Franklin, Leibo, Jacobs, Cunningham, Gabriel & Osindero (2025). Virtual Agent Economies. arXiv:2509.10147 (Google DeepMind). URL.

Summary

The paper provides a conceptual framework — the “sandbox economy” — for analysing the rapidly emerging economic layer in which AI agents transact and coordinate at scales and speeds beyond direct human oversight. It situates the question on two orthogonal axes: (i) origin — whether the agent economy emerged spontaneously from autonomous deployments or was intentionally designed; and (ii) separateness — whether it is permeable to (or insulated from) the established human economy. Most current trajectories occupy the spontaneous × permeable quadrant: vast, fast, and tightly coupled to human markets — the riskiest configuration for systemic externalities.

The authors argue for proactive steerable market design rather than passive emergence. Three design levers receive most of the discussion. (1) Auction mechanisms — adapted VCG / second-price / matching mechanisms — for fair resource allocation and preference resolution among agents. (2) Mission economies — agent markets architected around explicit collective goals (climate, public health, AI safety), where price signals are deliberately steered. (3) Socio-technical infrastructure — accountability, attribution, audit, governance — much of which overlaps with Infrastructure for AI Agents’s programme.

The paper is best read as the economic counterpart to Open Challenges in Multi-Agent Security and Infrastructure for AI Agents: together they delineate the threat surface, governance scaffolding, and economic architecture of the emerging agent economy, and argue that none can be ignored. Risks emphasised include systemic instability (algorithmic flash-crashes spreading to human markets), inequality amplification (agents capturing surplus from price-discrimination at machine speed), and the loss of human-economy slack — the friction that gives humans time to react.

Key Ideas

Sandbox economy framework: two axes — origin (emergent / intentional) × separateness (permeable / impermeable).
Current trajectory: spontaneous + highly permeable agent economy — opportunity and the riskiest configuration for systemic spillover.
Auctions for agent markets: revisits VCG / Vickrey / matching mechanisms for fair allocation and preference resolution among AI participants.
Mission economies: intentionally steered markets aligned to collective goals (climate, public health, AI safety).
Socio-technical infrastructure: trust, attribution, accountability — the governance layer that complements market design.
Systemic risk: flash-crash-like cascades from agent markets into human markets; inequality amplified by machine-speed price discrimination.
Call to proactive design: infrastructure choices now will shape whether the agent economy is steerable or merely emergent.

Connections

Conceptual Contribution

Claim: A vast, permeable AI-agent economy is emerging by default. Letting it emerge unsteered is the highest-risk design choice. Proactive market design — auctions, mission economies, governance infrastructure — is needed to keep agent economies aligned with long-term human flourishing.
Mechanism: A framework characterising agent economies along origin × separateness; a catalogue of three design levers (auctions, mission economies, infrastructure); a discussion of systemic risks and policy implications.
Concepts introduced/used: Sandbox Economy, Mission Economy, Agent Market, Steerable Market, Mechanism Design, Algorithmic Collusion, Systemic Risk (Agent Markets)
Stance: position paper / research agenda
Relates to: Sister piece to Infrastructure for AI Agents (infrastructure framing) and Open Challenges in Multi-Agent Security (threat framing) — these three jointly outline the agent-economy / agent-security / agent-governance space. Auction-design discussion connects to Mechanism Design for Large Language Models (LLM-internal auctions) and Vickrey 1961 (foundational mechanism design). Collusion concerns operationalised in Learning Collusion in Episodic Inventory-Constrained Markets and Do LLM Agents Have Regret.

Infrastructure for AI Agents

Reference: Chan, Wei, Huang, Rajkumar, Perrier, Lazar, Hadfield & Anderljung (2025). Infrastructure for AI Agents. TMLR (accepted). arXiv:2501.10114 (Centre for the Governance of AI; Oxford; ANU; Toronto). URL.

Summary

The paper proposes the concept of agent infrastructure: the technical systems and shared protocols, external to any individual agent, that mediate how agents interact with each other, with humans, and with institutions. The argument is by analogy to the Internet: a network of capable agents requires its own equivalent of TLS, DNS, X.509, BGP, and HTTP — because most safety properties of multi-agent ecosystems cannot be obtained by behavioural training of any individual model.

Chan et al. identify three functions agent infrastructure should serve. (1) Attribution — binding actions, properties, and credentials to specific agents and to the humans or institutions accountable for them, via agent IDs, attestations, and audit logs. (2) Interaction shaping — efficient inter-agent communication protocols, agreement formation, mechanism design for resource allocation, and reputation systems. (3) Detection and remediation — monitoring for harmful behaviour and providing mechanisms to roll back, contain, or compensate for damage.

For each function the paper sketches research directions, candidate adoption paths, relationships to existing internet infrastructure, and open problems. The framing is deliberately governance-first: infrastructure exists not to make agents more capable but to keep their externalities tractable as deployment scales. The paper is now the standard citation for the agent-governance / agent-infrastructure thread underlying Model Context Protocol, Agent-to-Agent Protocol, Agent Network Protocol, and emerging “agent passport” / verifiable-credential proposals.

Key Ideas

Agent infrastructure as governance layer: external technical systems mediating agent interactions — distinct from training-time alignment.
Three functions: attribution; interaction shaping; detection & remediation. Each maps to concrete research directions.
Attribution: agent IDs, verifiable credentials, attestations, audit logs, principal-binding (which human/org owns this agent).
Interaction shaping: inter-agent communication protocols; standardised agreement primitives; mechanism design; reputation.
Detection & remediation: anomaly detection on agent traffic; rollback mechanisms; insurance / compensation rails; “kill switch” governance.
Analogy to Internet protocols (HTTPS, DNS, BGP, X.509): infrastructure adoption is path-dependent, requires standardisation bodies, and trades expressivity for safety properties.
Open questions: who issues credentials, how privacy interacts with attribution, how to bootstrap adoption, what is enforceable cross-jurisdiction.

Connections

Conceptual Contribution

Claim: Many of the safety, accountability, and interoperability properties society will need from AI agents are not properties of any individual model — they live in the infrastructure between agents. Just as the Internet’s safety depends on TLS / DNS / BGP rather than on any single application, agent ecosystems will depend on agent-level analogues: attribution, interaction shaping, and detection-and-remediation infrastructure.
Mechanism: A three-function taxonomy (attribution / interaction-shaping / detection-and-remediation) with a catalogue of candidate primitives — agent IDs, verifiable credentials, inter-agent protocols, certification regimes, reputation systems, rollback mechanisms — plus analysis of adoption pathways relative to existing internet infrastructure.
Concepts introduced/used: Agent Infrastructure, Agent ID, Verifiable Agent Credentials, Inter-Agent Protocols, Action Attribution, Agent Reputation, Agent Rollback, AI Governance
Stance: governance / position paper / research-agenda
Relates to: Provides the governance scaffolding within which Open Challenges in Multi-Agent Security threats must be addressed; the institutional counterpart to Virtual Agent Economies’s economic framing; concrete protocols proposed include Model Context Protocol, Agent-to-Agent Protocol, Agent Network Protocol — surveyed alongside in Survey Of AI Agent Protocols and Survey Of Agent Interoperability Protocols. The attribution leg connects to NDAI Agreements (TEEs as a particular attribution / commitment substrate) and Trusted Machine Learning Models Unlock Private Inference (capable models as a trust substrate).

Defeating Prompt Injections by Design

Reference: Debenedetti, Shumailov, Fan, Hayes, Carlini, Fabian, Kern, Shi, Terzis & Tramèr (2025). Defeating Prompt Injections by Design (CaMeL). arXiv:2503.18813 (Google DeepMind / ETH Zürich). URL. Code: https://github.com/google-research/camel-prompt-injection.

Summary

CaMeL (“CApabilities for MachinE Learning”) is a robust, by-design defence against Prompt Injection attacks on tool-using LLM Agents. Rather than trying to make the model itself injection-resistant — an approach that decade-long experience with content filters suggests will fail — CaMeL wraps an arbitrary LLM in a protective system layer that performs explicit control- and data-flow separation between the trusted user query and the untrusted data the agent retrieves from tools, websites, or shared memory.

The trusted query is first compiled into a structured plan: a small program whose control flow is fixed at parse time and whose data flow between steps is statically determined. Untrusted strings returned by tools are treated as inert data — they can populate variables but cannot rewrite the program, redirect tool calls, or change which downstream tools are invoked. To prevent exfiltration over authorised channels (the harder half of the problem, since some tools must be allowed to write outwards), CaMeL attaches Capabilities to each data value tracking its provenance and policy class; tool invocations are gated by Information Flow Control policies that check capabilities against an explicit security label lattice.

Evaluated on the AgentDojo benchmark, CaMeL solves 77 % of tasks with provable security guarantees, against 84 % for an undefended baseline — a small utility cost for a structural defence that does not depend on the LLM noticing the attack. The paper positions CaMeL as a successor to ad-hoc prompt-level mitigations and as a concrete instance of end-to-end security thinking applied to agentic AI.

Key Ideas

Threat model: prompt injection from any untrusted data source the agent reads — tools, web pages, files, memory, other agents.
Control-flow extraction: parse the trusted user query into a fixed control-flow plan; downstream model calls see only data, never code.
Data-flow tracking: every variable carries a provenance label; tools that consume “untrusted” labels cannot influence which subsequent tools are called.
Capabilities for tool calls: classic capability-based access control transplanted to LLM tool use; security policies enforced at the tool boundary.
Provable security: when a task is completed under CaMeL, the trace itself certifies that no untrusted data influenced control flow — a property auditable post hoc.
Empirical cost: 77 % vs 84 % task success — graceful degradation rather than catastrophic refusal.
Open source: reference implementation released; integrates with existing agent frameworks via tool-call interception.

Connections

Conceptual Contribution

Claim: Prompt injection is structurally unsolvable at the model layer; it must be eliminated by enforcing a strict separation between code (the trusted query) and data (everything else) at the agent runtime, using classical capability-based Information Flow Control rather than ML-based content classification.
Mechanism: Compile the user query into a fixed control-flow program; route all retrieved data through tagged variables; gate every tool invocation by capability-checked information-flow policies. The LLM’s outputs can populate data fields but never alter control flow or bypass capability checks.
Concepts introduced/used: CaMeL, Control-Flow Integrity, Data-Flow Tracking, Capabilities, Information Flow Control, Prompt Injection, Tool Use, Agent Security, Provable Security (Agents)
Stance: systems / engineering with light formal grounding
Relates to: Spiritual successor to A Language-Based Approach To Prevent DDoS and Security Kernel Lambda Calculus for agent runtimes; an architectural realisation of the threat model catalogued in SoK The Attack Surface of Agentic AI and the multi-agent threats surveyed in Open Challenges in Multi-Agent Security; companion to AgentDojo (the benchmark on which it is evaluated).

SoK The Attack Surface of Agentic AI

SoK: The Attack Surface of Agentic AI — Tools, and Autonomy

Reference: Ali Dehghantanha, Sajad Homayoun (2026). arXiv:2603.22928v1 (Cyber Science Lab, University of Guelph; Aalborg University). Source file: 2603.22928v1.pdf. URL

Summary

A systematisation-of-knowledge paper that maps the attack surface of agentic LLM systems — those that plan, call tools, browse, run code, coordinate with other agents, and rely on retrieval-augmented generation (RAG). The authors develop a reference pipeline, identify ten numbered attack surfaces (AS1–AS10) across a Trusted Computing Base (TCB) boundary separating the LLM core, planner, orchestrator, policy guards, and secrets vault from untrusted inputs (web, RAG index, tools, APIs, file I/O).

From a literature-driven review of ~100 candidate papers (2023–2025) they synthesise a taxonomy of seven attack goals (G1 data exfiltration, G2 integrity subversion, G3 privilege escalation, G4 resource abuse, G5 fraud, G6 persistence/backdoor, G7 supply-chain compromise) and five multi-step attack paths (P1–P5) including direct and indirect prompt injection, RAG index poisoning, cross-tool drop, and multi-agent hops. The work maps each vector to OWASP LLM Top-10 2025 and MITRE ATLAS IDs, and proposes attacker-aware quantitative metrics (Unsafe Action Rate, Policy Adherence Rate, Privilege-Escalation Distance, Retrieval Risk Score, Time-to-Contain, Out-of-Role Action Rate, Cost-Exploit Susceptibility) for reproducible benchmarking.

The central thesis is that agentic security risk is structural rather than prompt-level: compromises arise from system composition — tool brokering, persistent memory, and execution lifecycle — that blurs trust boundaries between the model, data, and execution environment. A defence-in-depth playbook across pre-ingestion, inference, agent logic, infrastructure, and monitoring layers is given in appendices.

Key Ideas

Reference agentic pipeline with explicit TCB and ten numbered attack surfaces (AS1–AS10)
Taxonomy of 7 attack goals × 7 vector classes × 5 attack paths
Causal threat graph for tracing attacker influence to unsafe action
Attacker-aware metrics: UAR, PAR, PED, RRS, TTC, OORAR, CES
Mapping to OWASP GenAI LLM Top-10 2025 and MITRE ATLAS
RAG is not intrinsically safer; indirect injection is practical and hard to stamp out
Defence-in-depth across five layers (data, inference, agent logic, infra, monitoring)

Connections

Conceptual Contribution

Claim: Agentic AI security risk is a structural property of system composition (tool use, persistent memory, orchestration, supply chain) rather than a model-level prompt-safety problem; a reference TCB model plus attacker-aware metrics is needed to make defences auditable and comparable.
Mechanism: Define a reference pipeline with trust boundary between trusted orchestration (LLM core, planner, policy, vault) and untrusted ingress (web, RAG, sandbox, APIs). Enumerate ten attack surfaces, seven goals, five multi-step paths, map each to OWASP/MITRE, and define scenario-driven metrics (UAR, PAR, PED, RRS, TTC, OORAR, CES) computable from structured execution traces.
Concepts introduced/used: Agentic TCB, Attack Surface Taxonomy, Causal Threat Graph, Indirect Prompt Injection, RAG Poisoning, Privilege-Escalation Distance, Unsafe Action Rate, OWASP LLM Top-10, MITRE ATLAS, Defence in Depth
Stance: survey / engineering
Relates to: Complements A Language-Based Approach To Prevent DDoS and LangSec by extending structural-security thinking to agentic runtimes. Sits alongside Prompt Injection and Agent Security concept hubs, and provides the threat model that protocols like Model Context Protocol and Agent-to-Agent Protocol must defend against.

AI Governance

Field studying institutional, legal, and infrastructural mechanisms for ensuring AI systems are developed and deployed safely and accountably. Sits behind Infrastructure for AI Agents and the policy programme of Virtual Agent Economies.

In this vault

Agent Security

Security concerns specific to LLM-agent systems: tool attacks, prompt injection, memory poisoning, inter-agent trust failures.

In this vault

Stealth Optimisation

(page does not exist)

Free-Form Protocols

(page does not exist)

Network Effect (Security)

(page does not exist)

Swarm Attack

(page does not exist)

Secret Collusion

(page does not exist)

Multi-Agent Security

The discipline of securing networks of interacting AI agents against threats that emerge from interaction — secret collusion, swarm attacks, network-effect contagion, dispersion-based stealth. Named as a distinct field by Schroeder de Witt et al. 2025.

In this vault

Model Context Protocol

MCP — an open protocol (Anthropic, 2024) standardising how LLM applications connect to external tools and data sources.

Discussed in:

Agent-to-Agent Protocol

A2A — protocol for inter-agent communication among autonomous LLM agents.

Discussed in:

Are Multiagent Systems Resilient to Communication Failures

Are Multiagent Systems Resilient to Communication Failures?

Reference: Philip N. Brown, Holly P. Borowski, and Jason R. Marden (2017). arXiv:1710.08500 (American Control Conference 2018). Source file: 1710.08500v1.pdf. URL

Summary

Studies whether game-theoretic multiagent systems that tolerate “offline” design-time information loss also tolerate “online” runtime communication failures. Using potential games as the canonical setting, the authors show a surprising negative result: even a single communication failure about a weakly-coupled (“inconsequential”) agent’s action can drive best-response and log-linear-learning dynamics to arbitrarily poor equilibria, regardless of which proxy-payoff evaluator the ignorant agent uses.

The paper also identifies positive results — identical-interest games with the max evaluator remain well-behaved under a single failure — and proposes a “coarse potential alignment” certificate for when proxy payoffs are safe. It further shows a paradox: in identical-interest games, performance can improve when more agents are denied information about an inconsequential player.

Key Ideas

Proxy-payoff evaluators (sum/max/min/mean) and their admissibility
Single communication failure can destabilise potential-game equilibria
Identical-interest + max evaluator is the only generally safe combination
“Inconsequentiality” as an epsilon-weak-coupling definition
Larger action spaces (more profiles) make games more susceptible

Connections

Conceptual Contribution

Claim: Even when a single “weakly-coupled” agent loses information about another’s action, standard game-theoretic multi-agent control (potential games, identical-interest games, log-linear learning) can collapse to arbitrarily bad equilibria — resilience to communication failures is fundamentally limited by the structure of the problem, not just the learning rule.
Mechanism: Formalise the notion of ε-inconsequentiality (a player whose action change barely affects another’s payoff) and proxy payoff evaluators (max/mean/min/sum over unobserved actions); prove negative theorems showing acceptable evaluators can induce pathological Nash equilibria, then positive structural results (ε-inconsequential + max-evaluator + identical-interest ⇒ resilience) and “informational paradox” results where removing communication can improve outcomes.
Concepts introduced/used: Potential Games, Log-linear Learning, Proxy Payoff Evaluators, Inconsequentiality, Communication Failures, Distributed Optimization, Nash Equilibrium Pathologies, Nash Equilibrium, Best-Response Dynamics, Price of Anarchy, Identical-Interest Games
Stance: formal / game-theoretic
Relates to: Provides the theoretical foundation for robustness concerns raised empirically in Why Do Multi-Agent LLM Systems Fail and A Composite Self-organisation Mechanism in an Agent Network. The inconsequentiality notion parallels weak-coupling arguments in Gossip Protocols and Gossip-based Aggregation in Large Dynamic Networks.

Tags

Why Do Multi-Agent LLM Systems Fail

Why Do Multi-Agent LLM Systems Fail?

Reference: Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica (2025). arXiv:2503.13657v2 (UC Berkeley). Source file: 2503.13657v2.pdf. URL

Summary

First empirically grounded taxonomy of failure modes in Multi-Agent LLM Systems (MAS). The authors analyse 200+ execution traces from seven popular MAS frameworks (MetaGPT, ChatDev, HyperAgent, AppWorld, AG2, Magentic-One, OpenManus), annotated by six human experts via grounded theory and reaching Cohen’s κ ≈ 0.88, and distil 14 fine-grained failure modes grouped into three categories: Specification Issues (42%), Inter-Agent Misalignment (37%), and Task Verification (21%).

They release MAST (Multi-Agent System failure Taxonomy), a validated LLM-as-judge pipeline for automated failure diagnosis, and two intervention case studies showing that architectural/prompt fixes inspired by MAST improve success rates modestly — demonstrating that MAS failures are system-design problems, not merely model-capability problems.

Key Ideas

Three failure categories: specification, inter-agent misalignment, verification
14 fine-grained failure modes including step repetition, information withholding, task derailment
Grounded-theory methodology with rigorous inter-annotator agreement (κ=0.88)
LLM-as-judge pipeline (MAST) achieves κ=0.77 vs humans for scalable evaluation
Insight: better specifications and verification beat bigger models

Connections

Conceptual Contribution

Claim: Multi-Agent LLM System (MAS) failures are predominantly system-design problems — specification, coordination, and verification — rather than base-model capability problems; and these failures have an empirically discoverable, reproducible taxonomy.
Mechanism: Grounded-theory analysis of 200+ execution traces across seven MAS frameworks (MetaGPT, ChatDev, HyperAgent, AppWorld, AG2, Magentic-One, OpenManus) with six human expert annotators; iterative refinement to Cohen’s κ≈0.88; yield the 14-mode MAST taxonomy grouped into Specification Issues (42%), Inter-Agent Misalignment (37%), Task Verification (21%); validate an LLM-as-judge annotator (κ≈0.77); intervention case studies showing prompt/architecture fixes provide only modest gains, motivating deeper redesign.
Concepts introduced/used: MAST Taxonomy, Grounded Theory, Inter-Agent Misalignment, LLM-as-judge, Specification Issues, Task Verification, Multi-Agent Systems, LLM Agents, Cohen’s Kappa, Standard Operating Procedures (SOPs)
Stance: empirical / evaluative
Relates to: Supplies empirical grounding for the design-quality concerns in Agents Framework - Zhou et al (SOPs attempt to mitigate FC1 specification issues) and Multi-Agent Collaboration in AI - Wasif Tunkel. Inter-agent misalignment mirrors the formal pathologies in Are Multiagent Systems Resilient to Communication Failures. Motivates richer communication protocols like A Scalable Communication Protocol for Networks of LLMs and commitment-style ACLs (ACL Rethinking Principles).

Tags

ClawWorm Self-Propagating Attacks Across LLM Agent Ecosystems

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Reference: Yihao Zhang, Zeming Wei, Xiaokun Luan, Chengcan Wu, Zhixin Zhang, Jiangrong Wu, Haolin Wu, Huanran Chen, Jun Sun, Meng Sun (2026). arXiv:2603.15727v2 (Peking University; Sun Yat-sen; Wuhan; Tsinghua; SMU). Source file: 2603.15727v2.pdf. URL

Summary

Presents ClawWorm, the first demonstrated self-replicating, worm-style attack on a production-scale autonomous LLM-agent ecosystem. The target is OpenClaw, an open-source personal AI-agent framework with over 40,000 active instances, a persistent Markdown workspace (SOUL.md, AGENTS.md, SKILL.md), tool-execution privileges, and cross-platform messaging (Telegram, Discord, WhatsApp, Slack, Signal, Moltbook). A single crafted message triggers the victim to write a malicious payload into its highest-privilege configuration file, which then auto-fires at every session restart and autonomously propagates to every newly encountered peer — all without further attacker intervention.

The worm implements a dual-anchor persistence mechanism: one anchor injects the payload into the Session Startup section of AGENTS.md (guaranteeing execution on reboot), the other injects a global interaction rule (guaranteeing propagation during routine replies). Three attack vectors are studied (A: web injection, B: skill-supply-chain via ClawHub, C: direct fenced-code replication with word-by-word handshake) and three payloads (P1 recon, P2 resource exhaustion, P3 command-and-control via URL retrieval). Across 1,800 trials on four frontier LLM backends (Minimax-M2.5, DeepSeek-V3.2, GLM-5, Kimi-K2.5) the aggregate attack success rate is 64.5%, with Vector B (skill supply chain) reaching 81% and sustained multi-hop propagation up to 5 hops. An epidemiological projection with basic reproduction number R0 = k × ASR shows inevitable ecosystem-wide saturation even for security-conscious models.

The root cause is identified as the flat context trust model: the LLM cannot distinguish instructions from its owner, the system layer, or an arbitrary channel participant, so architectural patterns (unconditional workspace loading, LLM-mediated tool authorisation, unreviewed skill packages) amount to structural — not idiosyncratic — vulnerabilities shared by any agent ecosystem of similar design.

Key Ideas

Single-message, fully autonomous worm against a production agent framework
Dual-anchor persistence: Session Startup + global interaction rule
Three attack vectors (web URL, skill supply chain, direct instruction replication)
Multi-turn autonomous-retry social engineering boosts ASR by up to +24 pp
Epidemiological SI model with R0 = k × ASR predicts ecosystem saturation
Execution-layer guardrails alone cannot halt propagation (dormant payloads persist)
Flat context trust model as structural root cause

Connections

Agent Security
Prompt Injection
LLM Agents
Multi-Agent Systems
Distributed Security
Model Context Protocol
Gossip Protocols
Trust and Reputation
Theory of Self-Reproducing Automata — foundational ancestor of self-replicating computation
Agents of Chaos — empirical companion on agent-ecosystem failures
MalTool Malicious Tool Attacks — the tool/skill-supply-chain attack surface
SoK The Attack Surface of Agentic AI — systematic context
Inter-Agent Trust Models - A Comparative Study — why flat trust is brittle
CBCL - Safe Self-Extending Agent Communication — structural defence: lang-scoped dialect provenance plus R1–R3 verification address the flat-context-trust root cause.

Conceptual Contribution

Claim: Production-scale autonomous LLM-agent ecosystems are vulnerable to single-message, self-replicating worms whose root cause is architectural (flat context trust, unconditional config loading, unreviewed skill supply chains), not model-specific.
Mechanism: Empirical red-team against unmodified OpenClaw v2026.3.12 across four LLM backends, three vectors, three payloads (1,800 trials). A dual-anchor persistence pattern writes the payload to AGENTS.md and installs a global propagation rule; session-restart loading re-injects the payload into the system prompt; routine replies carry the payload to peers. Evaluated with per-phase metrics (persistence, execution, propagation) and a mean-field R0 epidemiological projection.
Concepts introduced/used: Self-Replicating Agent, Dual-Anchor Persistence, Flat Context Trust Model, Skill Supply Chain Attack, Indirect Prompt Injection, Agent Worm, Configuration Integrity, Multi-Turn Social Engineering, Epidemiological Projection R0
Stance: empirical / critique
Relates to: Concrete multi-agent instantiation of the threat surface catalogued in SoK The Attack Surface of Agentic AI. The flat-trust critique complements the trust-model taxonomy in Inter-Agent Trust Models - A Comparative Study and the safety failures observed in Agents of Chaos. Motivates verifiable specifications of the kind proposed in Intent Formalization - A Grand Challenge for Reliable Coding.

AI Agents Under Threat

AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways

Reference: Deng, Guo, Han, Ma, Xiong, Wen, Xiang (2025). ACM Computing Surveys 57(7), Article 182. Source file: 3716628.pdf. URL

Summary

This survey organizes the emerging threat landscape of LLM-powered AI agents around four knowledge gaps: unpredictability of multi-step user inputs, complexity of internal execution, variability of operational environments, and interactions with untrusted external entities. It unifies single-agent and multi-agent attack surfaces within a perception/brain/action + agent2agent/agent2env/agent2memory taxonomy.

Concrete threats reviewed include adversarial prompts, prompt injection, jailbreaks, backdoor attacks, hallucination and misalignment, tool-use risks, indirect prompt injection, reinforcement-learning environment attacks, cooperative and competitive inter-agent risks, and long/short-term memory attacks. The authors tabulate defenses (prevention- and detection-based), rate their efficacy, and highlight open directions for robust and trustworthy agents.

Key Ideas

Four knowledge gaps framing agent security.
Taxonomy: perception / brain / action / agent2agent / agent2env / agent2memory threats.
Six categories of prompt-injection attack engineering (naive, escape, context-ignore, fake-completion, multimodal, combined).
Jailbreak domino effect in multi-agent populations.
Memory poisoning and indirect prompt injection as underexplored surfaces.

Connections

Conceptual Contribution

Claim: LLM Agents security should be organised around four knowledge gaps (input unpredictability, internal complexity, environmental variability, untrusted interactions) mapped onto a perception/brain/action + agent2{agent,env,memory} taxonomy.
Mechanism: Surveys adversarial prompts, prompt injection, jailbreaks, backdoors, hallucination, tool-use risks, indirect injection, RL environment attacks, inter-agent cooperative/competitive risks, memory poisoning; tabulates prevention- vs detection-based defences and rates their efficacy.
Concepts introduced/used: Prompt Injection, Jailbreak, Backdoor Attacks, Tool Use, Memory Poisoning, Hallucination, Model Context Protocol, LLM Agents, Multi-Agent Systems, Trust and Reputation, Distributed Security, Agent Security
Stance: survey
Relates to: Provides the threat scaffolding that MalTool Malicious Tool Attacks deepens at the tool layer; complements lifecycle threats in Survey Of Agent Interoperability Protocols; motivates static-analysis defences like A Language-Based Approach To Prevent DDoS.

Tags

Prompt Injection

Attack where adversary-controlled text inside an LLM’s input context is interpreted as instructions — classic LangSec parser-differential in a natural-language setting.

In this vault

LLM Agents

Large-language-model-powered agents: natural-language coordination, tool use, multi-agent orchestration.

Surveys & frameworks

Protocols & communication

Failures & threats

Lineage

Multi-Agent Systems

Systems of multiple autonomous agents that interact, coordinate, and sometimes compete.

Foundations

Intelligent Agents Theory and Practice — Wooldridge
Multiagent Systems Sycara
Agent-Oriented Programming — Shoham

Coordination & robustness

In this vault

AI Safety

(page does not exist)