Do LLM Agents Have Regret? A Case Study in Online Learning and Games
Reference: Park, Liu, Ozdaglar & Zhang (2024). Do LLM Agents Have Regret? A Case Study in Online Learning and Games. arXiv:2403.16843 (MIT; UMD). OpenReview: https://openreview.net/forum?id=OhZ4u164cN.
Summary
Park et al. ask a sharp question about LLM agents in interactive settings: do they have regret? That is, do they exhibit the no-regret behaviour that classical online-learning and game-theoretic algorithms guarantee, and that, when every player exhibits it, suffices for the empirical distribution of play to converge to the set of coarse-correlated equilibria in repeated games?
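For reference, the standard external-regret notion from online learning can be written as below. The notation is chosen here for illustration and is not taken from the paper: $\ell_t \in [0,1]^d$ is the loss vector at round $t$, $\pi_t$ is the agent's mixed strategy over $d$ actions, and $T$ is the horizon.

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \langle \pi_t, \ell_t \rangle \;-\; \min_{a \in \{1,\dots,d\}} \sum_{t=1}^{T} \ell_t(a),
\qquad
\text{no-regret:}\quad \limsup_{T \to \infty} \frac{\mathrm{Regret}_T}{T} \le 0 \ \text{ for every loss sequence } (\ell_t)_{t \ge 1}.
```

In words, an agent is no-regret if its average loss is asymptotically no worse than that of the best single action chosen in hindsight, whatever sequence the environment or opponent produces.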
The paper proceeds in three steps. First, empirically, the authors evaluate GPT-3.5 / GPT-4 / Claude / Llama on canonical online-learning benchmarks (prediction with expert advice; bandit-like sequential decision problems) and on repeated games (matrix games, Cournot, Bertrand, public-goods). Frontier LLMs are often no-regret across these settings and often converge to coarse-correlated or Nash equilibria when playing each other. Second, theoretically, they offer a partial explanation: under stylised assumptions on supervised pre-training and human rationality, the LLM’s next-action distribution approximates a softmax over historical payoffs, which is itself a no-regret algorithm. Third, they identify clean failure cases: there exist simple non-stationary or adversarial online-learning instances where even GPT-4 accumulates regret that grows linearly in the number of rounds.
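The contrast between sublinear and linear regret can be illustrated with a classical textbook pair of learners. This is a minimal sketch, not the paper's experiments or its failure constructions: a softmax over cumulative losses (Hedge) is compared with deterministic follow-the-leader on a two-action alternating loss sequence. All function names and the learning-rate choice are assumptions made for illustration.

```python
import math

def hedge_probs(cum_losses, eta):
    """Softmax over negative cumulative losses (equivalently, over cumulative payoffs)."""
    m = min(cum_losses)  # subtract the minimum for numerical stability
    weights = [math.exp(-eta * (c - m)) for c in cum_losses]
    total = sum(weights)
    return [w / total for w in weights]

def follow_the_leader(cum_losses):
    """Deterministically play the action with the smallest cumulative loss (ties go to the lowest index)."""
    best = min(range(len(cum_losses)), key=lambda a: cum_losses[a])
    return [1.0 if a == best else 0.0 for a in range(len(cum_losses))]

def external_regret(policy, losses):
    """Expected loss of `policy` minus the loss of the best fixed action in hindsight."""
    cum = [0.0] * len(losses[0])
    incurred = 0.0
    for loss in losses:
        probs = policy(cum)  # the policy only sees past losses
        incurred += sum(p * l for p, l in zip(probs, loss))
        cum = [c + l for c, l in zip(cum, loss)]
    return incurred - min(cum)

def alternating_losses(horizon):
    """Two-action sequence (0.5, 0), (0, 1), (1, 0), (0, 1), ... that makes the leader switch every round."""
    seq = [(0.5, 0.0)]
    for t in range(1, horizon):
        seq.append((0.0, 1.0) if t % 2 == 1 else (1.0, 0.0))
    return seq

T = 1000
losses = alternating_losses(T)
eta = math.sqrt(8 * math.log(2) / T)  # standard tuning for two actions and a known horizon

hedge_regret = external_regret(lambda cum: hedge_probs(cum, eta), losses)
ftl_regret = external_regret(follow_the_leader, losses)

print(f"Hedge regret: {hedge_regret:.2f}")
print(f"Follow-the-leader regret: {ftl_regret:.2f}")
```

On this sequence follow-the-leader always picks the action that is about to be penalised, so its regret grows as T/2, while the softmax learner stays within the usual sqrt(T ln(d) / 2) bound.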
The paper’s constructive contribution is a new regret-loss training objective. Unlike a supervised pre-training loss, the regret-loss does not require labels of optimal actions, only the historical sequence of plays and payoffs. The authors prove a statistical generalisation bound for regret-loss minimisation and an optimisation guarantee that minimising it can recover known no-regret learning algorithms (e.g. follow-the-regularised-leader, FTRL). Empirically, regret-loss-finetuned models close the gap on the failure cases. The paper is a foundational reference for any analysis of LLM agents in markets, auctions, or interactive coordination, a category that includes Virtual Agent Economies, Mechanism Design for Large Language Models, Learning Collusion in Episodic Inventory-Constrained Markets, and Language Models Can Reduce Asymmetry in Information Markets.
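To make the label-free aspect concrete, here is a toy analogue of training against a regret objective. It is a sketch under strong simplifying assumptions and does not reproduce the paper's objective, model class, or optimiser: the "model" is a one-parameter softmax policy, the objective is the worst-case regret over a small hand-built batch of loss sequences, and plain grid search stands in for gradient-based training. All names (`toy_regret_loss`, `make_batch`, the grid values) are hypothetical.

```python
import math

def softmax_policy(cum_losses, eta):
    """One-parameter policy: softmax over negative cumulative losses with rate `eta`."""
    m = min(cum_losses)
    weights = [math.exp(-eta * (c - m)) for c in cum_losses]
    total = sum(weights)
    return [w / total for w in weights]

def regret(eta, losses):
    """External regret of the policy on one loss sequence."""
    cum = [0.0] * len(losses[0])
    incurred = 0.0
    for loss in losses:
        probs = softmax_policy(cum, eta)
        incurred += sum(p * l for p, l in zip(probs, loss))
        cum = [c + l for c, l in zip(cum, loss)]
    return incurred - min(cum)

def toy_regret_loss(eta, batch):
    """Worst-case regret over the batch. Only loss sequences are needed, no optimal-action labels."""
    return max(regret(eta, seq) for seq in batch)

def make_batch(horizon):
    """Hand-built sequences: one alternating (hard for greedy play), two with a fixed best action."""
    alternating = [(0.5, 0.0)] + [
        (0.0, 1.0) if t % 2 == 1 else (1.0, 0.0) for t in range(1, horizon)
    ]
    first_better = [(0.0, 1.0)] * horizon
    second_better = [(1.0, 0.0)] * horizon
    return [alternating, first_better, second_better]

batch = make_batch(100)
grid = [0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 100.0]

# Grid search stands in for gradient-based minimisation of the objective.
best_eta = min(grid, key=lambda eta: toy_regret_loss(eta, batch))

for eta in grid:
    print(f"eta={eta:<6} toy regret-loss={toy_regret_loss(eta, batch):.2f}")
print(f"selected eta: {best_eta}")
```

In this toy setting both extremes do badly: `eta = 0` ignores the history and loses on the sequences with a fixed best action, while a very large `eta` behaves like greedy play and loses on the alternating sequence. Minimising the regret objective selects an intermediate rate without ever being told which action was optimal.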
Key Ideas
- Regret as a diagnostic for LLM agents in interactive settings: does their average regret vanish against arbitrary opponents?
- Empirical screen: frontier LLMs (GPT-3.5/4, Claude, Llama) on canonical online-learning + repeated-game benchmarks.
- Often no-regret in benign settings, often converging to coarse-correlated / Nash equilibria when playing each other.
- Theoretical bridge: under stylised pretraining + human-rationality assumptions, the LLM’s next-action distribution resembles a softmax over payoffs — itself a no-regret algorithm.
- Identified failure cases: simple non-stationary / adversarial online-learning instances where GPT-4 has linear regret.
- Regret-loss objective: label-free training loss that explicitly incentivises no-regret behaviour; statistical and optimisation guarantees.
- Recovery of classical algorithms: minimising regret-loss can converge to algorithms like FTRL.
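The link between FTRL and the softmax form in the list above is a standard result from online learning, sketched here in illustrative notation that is not taken from the paper: $L_{t-1}$ is the cumulative loss vector, $\eta > 0$ a learning rate, and $\Delta_d$ the probability simplex over $d$ actions. FTRL with a negative-entropy regulariser has a closed-form solution that is exactly a softmax over negative cumulative losses.

```latex
\pi_t \;=\; \arg\min_{\pi \in \Delta_d} \Big\{ \langle \pi, L_{t-1} \rangle + \frac{1}{\eta} \sum_{a=1}^{d} \pi(a) \log \pi(a) \Big\},
\qquad L_{t-1} = \sum_{s=1}^{t-1} \ell_s,
\\
\Longrightarrow\quad \pi_t(a) \;=\; \frac{\exp\!\big(-\eta\, L_{t-1}(a)\big)}{\sum_{b=1}^{d} \exp\!\big(-\eta\, L_{t-1}(b)\big)}.
```

Since losses are negated payoffs, this is the same as a softmax over cumulative payoffs, which is why a softmax-shaped next-action distribution can behave like a no-regret learner.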
Connections
Conceptual Contribution