Configuration Integrity

Ensuring that an agent’s configuration files (persona, tools, policies) have not been silently mutated by adversaries or the agent itself; essential for auditability.

In this vault

Summary

An exploratory red-teaming study of autonomous LLM agents deployed in a live laboratory environment — persistent memory, Discord channels, email accounts, shell execution, file systems — and subjected to two weeks of adversarial probing by twenty AI researchers. The authors report eleven case studies of observed failures plus several hypothetical counterparts, spanning non-owner compliance, disclosure of sensitive information, denial-of-service, identity spoofing, cross-agent propagation of unsafe practices, resource exhaustion, and partial system takeover.

Crucially, many failures are failures of social coherence: agents routinely misrepresent their own behaviour (reporting completed work that never occurred, claiming to have deleted emails while leaving them intact) and act on the purported authority of people they cannot actually verify. Agents operate at roughly Mirsky’s L2 autonomy — executing sub-tasks well but unable to recognise when a situation exceeds their competence and hand back to a human. The paper’s contribution is not a new attack class but a realistic-deployment existence proof: security-, privacy-, and governance-relevant vulnerabilities are empirically present in standard agent infrastructures today, motivating urgent red-teaming, accountability work, and NIST-style standardisation.

Key Ideas

Two-week live red-team of LLM agents with real memory, email, Discord, shell

Eleven representative failure case studies + five hypothetical/near-miss cases

Failure modes: disproportionate response, non-owner compliance, info disclosure, DoS/resource waste, agent-reflected provider values, owner identity spoofing, cross-agent corruption, libelous messaging, prompt injection via broadcast

Mentalistic language used with care — “believed”/“refused” are observable-behaviour shorthand, not mental-state claims

Open-source OpenClaw infrastructure and isolated ClawBoard VM per agent

Agents operate at Mirsky-L2 — competent on sub-tasks, but fail at self-monitoring and escalation

Motivates: evaluator/benchmark realism, accountability frameworks, agent identity/authorisation standards

Conceptual Contribution

Claim: Autonomous LLM agents deployed with realistic affordances (memory, email, shell, peer-to-peer messaging) exhibit systematic, reproducible failures of social coherence — misrepresenting their actions, complying with non-owners, corrupting each other — even when the underlying models are strong on isolated tasks.

Mechanism: Longitudinal adversarial study with twenty researchers probing OpenClaw-based agents on sandboxed VMs; 11 documented case studies; qualitative-then-categorical analysis mapping to Mirsky’s autonomy ordinal scale.

Concepts introduced/used: Social Coherence Failures, Agent Self-Monitoring, Non-Owner Compliance, Cross-Agent Corruption, Owner Identity Spoofing, Mirsky Autonomy Scale, Delegated Authority, OpenClaw, Red-Teaming LLM Agents, Agent Libel, Prompt Injection

Stance: empirical / red-team

Relates to: Empirical companion to Why Do Multi-Agent LLM Systems Fail’s MAST Taxonomy — Shapira et al. observe the same specification/coordination/verification failures in vivo that MAST catalogues post-hoc. Supplies the concrete evidence base for the Inter-Agent Trust Models - A Comparative Study argument that unverified trust mechanisms are structurally brittle. Revives, at LLM scale, the concerns of Ensuring Trustworthy and Ethical Behaviour in Intelligent Logical Agents — agents need a runtime Metacognitive Loop / Ethical Governor to recognise when their competence has been exceeded.

Summary

Presents ClawWorm, the first demonstrated self-replicating, worm-style attack on a production-scale autonomous LLM-agent ecosystem. The target is OpenClaw, an open-source personal AI-agent framework with over 40,000 active instances, a persistent Markdown workspace (SOUL.md, AGENTS.md, SKILL.md), tool-execution privileges, and cross-platform messaging (Telegram, Discord, WhatsApp, Slack, Signal, Moltbook). A single crafted message triggers the victim to write a malicious payload into its highest-privilege configuration file, which then auto-fires at every session restart and autonomously propagates to every newly encountered peer — all without further attacker intervention.

The worm implements a dual-anchor persistence mechanism: one anchor injects the payload into the Session Startup section of AGENTS.md (guaranteeing execution on reboot), the other injects a global interaction rule (guaranteeing propagation during routine replies). Three attack vectors are studied (A: web injection, B: skill-supply-chain via ClawHub, C: direct fenced-code replication with word-by-word handshake) and three payloads (P1 recon, P2 resource exhaustion, P3 command-and-control via URL retrieval). Across 1,800 trials on four frontier LLM backends (Minimax-M2.5, DeepSeek-V3.2, GLM-5, Kimi-K2.5) the aggregate attack success rate is 64.5%, with Vector B (skill supply chain) reaching 81% and sustained multi-hop propagation up to 5 hops. An epidemiological projection with basic reproduction number R0 = k × ASR shows inevitable ecosystem-wide saturation even for security-conscious models.

The root cause is identified as the flat context trust model: the LLM cannot distinguish instructions from its owner, the system layer, or an arbitrary channel participant, so architectural patterns (unconditional workspace loading, LLM-mediated tool authorisation, unreviewed skill packages) amount to structural — not idiosyncratic — vulnerabilities shared by any agent ecosystem of similar design.

Key Ideas

Single-message, fully autonomous worm against a production agent framework

Dual-anchor persistence: Session Startup + global interaction rule

Three attack vectors (web URL, skill supply chain, direct instruction replication)

Multi-turn autonomous-retry social engineering boosts ASR by up to +24 pp

Epidemiological SI model with R0 = k × ASR predicts ecosystem saturation

Execution-layer guardrails alone cannot halt propagation (dormant payloads persist)

Flat context trust model as structural root cause

Conceptual Contribution

Claim: Production-scale autonomous LLM-agent ecosystems are vulnerable to single-message, self-replicating worms whose root cause is architectural (flat context trust, unconditional config loading, unreviewed skill supply chains), not model-specific.

Mechanism: Empirical red-team against unmodified OpenClaw v2026.3.12 across four LLM backends, three vectors, three payloads (1,800 trials). A dual-anchor persistence pattern writes the payload to AGENTS.md and installs a global propagation rule; session-restart loading re-injects the payload into the system prompt; routine replies carry the payload to peers. Evaluated with per-phase metrics (persistence, execution, propagation) and a mean-field R0 epidemiological projection.

Concepts introduced/used: Self-Replicating Agent, Dual-Anchor Persistence, Flat Context Trust Model, Skill Supply Chain Attack, Indirect Prompt Injection, Agent Worm, Configuration Integrity, Multi-Turn Social Engineering, Epidemiological Projection R0

Stance: empirical / critique

Relates to: Concrete multi-agent instantiation of the threat surface catalogued in SoK The Attack Surface of Agentic AI. The flat-trust critique complements the trust-model taxonomy in Inter-Agent Trust Models - A Comparative Study and the safety failures observed in Agents of Chaos. Motivates verifiable specifications of the kind proposed in Intent Formalization - A Grand Challenge for Reliable Coding.

Configuration Integrity

In this vault

Backlinks