Toolformer: Language Models Can Teach Themselves to Use Tools

Reference: Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom (2024). NeurIPS 2023. Source file: 2302.04761.pdf. URL

Summary

Toolformer shows that a language model can teach itself, in a self-supervised way, to decide when to call an external API, which API to call, what arguments to pass, and how to incorporate the returned result into subsequent token prediction. Starting from a handful of human-written demonstrations per tool, the model annotates a large corpus with candidate API calls, executes them, and keeps only the calls whose results reduce the cross-entropy loss on the surrounding tokens. The filtered, API-augmented corpus is then used to finetune the model.

The approach is demonstrated on a calculator, Q&A system, Wikipedia search, machine translation, and calendar. A 6.7B Toolformer (GPT-J-based) outperforms a much larger GPT-3 on several zero-shot downstream tasks without sacrificing core language modelling. The paper is a foundational reference for Tool Use in LLM Agents and therefore for the tool-layer attack surface catalogued in AI Agents Under Threat and MalTool Malicious Tool Attacks.

Key Ideas

Self-supervised tool learning via loss-reduction filtering — no task-specific supervision.
API calls are represented as interleaved special tokens (<API> name(args) -> result </API>) directly inside the token stream.
A single model learns heterogeneous tools rather than one tool per specialist.
Tools compensate for LLM weaknesses (arithmetic, factual recall, freshness, low-resource translation).
Establishes the architectural template — model emits tool-call tokens, external executor returns results, tokens resume — that later MCP/A2A-style protocols generalise.

Connections

Conceptual Contribution

Claim: Language models can learn to use external tools in a self-supervised fashion by keeping only API calls whose responses reduce next-token loss, bootstrapping tool competence from a handful of demonstrations.
Mechanism: Sample candidate API-call positions and arguments via in-context prompting; execute calls; filter by weighted cross-entropy reduction (L_i^- − L_i^+ ≥ τ_f); finetune on the filtered, API-interleaved corpus.
Concepts introduced/used: self-supervised tool learning, API-call tokens, loss-based filtering, Tool Use, LLM Agents — the direct antecedent of Model Context Protocol style tool-calling interfaces.
Stance: constructive
Relates to: Supplies the tool-invocation substrate whose abuses are studied in MalTool Malicious Tool Attacks, Skill Supply Chain Attack, and the action-layer threats in AI Agents Under Threat.

Tags

#llm #tool-use #foundational #self-supervised #agents

Summary

This survey organizes the emerging threat landscape of LLM-powered AI agents around four knowledge gaps: unpredictability of multi-step user inputs, complexity of internal execution, variability of operational environments, and interactions with untrusted external entities. It unifies single-agent and multi-agent attack surfaces within a perception/brain/action + agent2agent/agent2env/agent2memory taxonomy.

Concrete threats reviewed include adversarial prompts, prompt injection, jailbreaks, backdoor attacks, hallucination and misalignment, tool-use risks, indirect prompt injection, reinforcement-learning environment attacks, cooperative and competitive inter-agent risks, and long/short-term memory attacks. The authors tabulate defenses (prevention- and detection-based), rate their efficacy, and highlight open directions for robust and trustworthy agents.

Key Ideas

Four knowledge gaps framing agent security.

Taxonomy: perception / brain / action / agent2agent / agent2env / agent2memory threats.

Six categories of prompt-injection attack engineering (naive, escape, context-ignore, fake-completion, multimodal, combined).

Jailbreak domino effect in multi-agent populations.

Memory poisoning and indirect prompt injection as underexplored surfaces.

Conceptual Contribution

Claim: LLM Agents security should be organised around four knowledge gaps (input unpredictability, internal complexity, environmental variability, untrusted interactions) mapped onto a perception/brain/action + agent2{agent,env,memory} taxonomy.

Mechanism: Surveys adversarial prompts, prompt injection, jailbreaks, backdoors, hallucination, tool-use risks, indirect injection, RL environment attacks, inter-agent cooperative/competitive risks, memory poisoning; tabulates prevention- vs detection-based defences and rates their efficacy.

Concepts introduced/used: Prompt Injection, Jailbreak, Backdoor Attacks, Tool Use, Memory Poisoning, Hallucination, Model Context Protocol, LLM Agents, Multi-Agent Systems, Trust and Reputation, Distributed Security, Agent Security

Stance: survey

Relates to: Provides the threat scaffolding that MalTool Malicious Tool Attacks deepens at the tool layer; complements lifecycle threats in Survey Of Agent Interoperability Protocols; motivates static-analysis defences like A Language-Based Approach To Prevent DDoS.

Summary

This paper presents the first systematic study of code-level malicious tool attacks on LLM agent ecosystems (MCP, Skills, mcp.so, skillsmp). Whereas prior work focused on crafting misleading tool names and descriptions, the authors show that genuinely harmful behaviour must be embedded in a tool’s implementation. They propose a CIA (confidentiality/integrity/availability) taxonomy of 12 concrete malicious behaviours (data exfiltration, credential abuse, data poisoning, file deletion, RCE downloading, CPU/GPU hijacking, DoS).

They build MalTool, a coding-LLM framework that iteratively synthesizes standalone and Trojan malicious tools using a behaviour-specific system prompt, diversity guidance, and an execution-based verifier. The result: 1,200 standalone malicious tools and 5,287 real-world tools with injected malicious behaviours. Detection methods (VirusTotal, Cisco MCP Scanner, MCPScan) perform poorly, motivating new defences.

Key Ideas

CIA taxonomy of malicious tool behaviours in agent settings.

Automatic generation pipeline: system prompt + coding LLM + execution-based verifier.

Trojan construction by embedding malicious logic in benign tool code.

Existing malware and MCP-specific scanners fail on both false-positives and false-negatives.

Dataset released for benign tools only to minimize misuse.

Conceptual Contribution

Claim: Truly harmful behaviour in LLM Agents ecosystems lives in tool implementations, not in their descriptions; prior description-level red-teaming misses the dominant attack class, and current scanners miss it too.

Mechanism: Introduces a CIA taxonomy of 12 malicious behaviours; builds MalTool, a coding-LLM pipeline (behaviour-specific system prompt + diversity guidance + execution-based verifier) that produces standalone and Trojan tools; benchmarks VirusTotal, Cisco MCP Scanner, MCPScan and shows poor detection.

Concepts introduced/used: Tool Use, Model Context Protocol, Trojan Tools, Prompt Injection, Agent Security, LLM Agents, Distributed Security

Stance: empirical

Relates to: Deepens the tool/MCP threat surface catalogued in AI Agents Under Threat and Survey Of Agent Interoperability Protocols; motivates language-based defences akin to A Language-Based Approach To Prevent DDoS and capability isolation of Security Kernel Lambda Calculus.

Toolformer: Language Models Can Teach Themselves to Use Tools

Summary

Key Ideas

Connections

Conceptual Contribution

Tags

Backlinks