Mutual Information

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X): the reduction in uncertainty about X given knowledge of Y (and, symmetrically, about Y given X). One of Shannon’s foundational quantities (A Mathematical Theory of Communication) and the operational basis of Channel Capacity (C = max_{p(x)} I(X;Y)). Used in multi-agent systems (MAS) as a measure of communication influence in emergent-language metrics, with the caveat from On the Pitfalls of Measuring Emergent Communication that mutual information between messages and actions can be high even when messages have no effect on behaviour.
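A minimal sketch of the definition, assuming X and Y are discrete and their joint distribution is available as a table. The function name `mutual_information` and the example probabilities are illustrative, not taken from any source cited here. It uses the equivalent form I(X;Y) = Σ p(x,y) log₂ [p(x,y) / (p(x) p(y))].

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # 0 * log(0) is taken as 0
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Illustrative tables: X is a fair bit; Y copies X with probability 0.9,
# or is independent of X.
noisy_copy = [[0.45, 0.05],
              [0.05, 0.45]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

mi_noisy = mutual_information(noisy_copy)
mi_indep = mutual_information(independent)
print(f"noisy copy: {mi_noisy:.3f} bits, independent: {mi_indep:.3f} bits")
```

The noisy copy retains roughly half a bit about X, while the independent table carries none, which matches the reading of I(X;Y) as a reduction in uncertainty.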

In this vault

Backlinks

Linked Pages

Causal Influence of Communication

Causal-intervention metric quantifying how much a message changed the recipient’s action distribution — a robust alternative to correlational measures.
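One way to make the idea concrete, as a sketch under stated assumptions rather than the paper's implementation: messages and actions are discrete, the listener's response to a forced message is known as a table p(a | do(m)), and the intervention draws messages from a chosen distribution. Influence is then scored as the mutual information between the intervened message and the action. The names `causal_influence`, `attentive`, and `deaf` are hypothetical.

```python
import math

def causal_influence(listener_policy, message_probs):
    """Mutual information (bits) between an intervened message and the action.

    listener_policy[m][a] is p(action = a | do(message = m)).
    message_probs[m] is the distribution the intervention draws messages from.
    """
    n_actions = len(listener_policy[0])
    # Marginal action distribution under the intervention.
    p_action = [
        sum(pm * listener_policy[m][a] for m, pm in enumerate(message_probs))
        for a in range(n_actions)
    ]
    score = 0.0
    for m, pm in enumerate(message_probs):
        for a, pa_given_m in enumerate(listener_policy[m]):
            if pm > 0 and pa_given_m > 0:
                score += pm * pa_given_m * math.log2(pa_given_m / p_action[a])
    return score

uniform = [0.5, 0.5]
attentive = [[0.9, 0.1], [0.1, 0.9]]   # action usually follows the message
deaf = [[0.5, 0.5], [0.5, 0.5]]        # action distribution ignores the message

cic_attentive = causal_influence(attentive, uniform)
cic_deaf = causal_influence(deaf, uniform)
print(f"attentive: {cic_attentive:.3f} bits, deaf: {cic_deaf:.3f} bits")
```

Because the message is set by the experimenter rather than by the speaker, any dependence that remains between message and action has to flow through the listener.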

In this vault

On the Pitfalls of Measuring Emergent Communication

Channel Capacity

Shannon’s C = max_{p(x)} I(X;Y): the supremum of the mutual information between the input and output of a memoryless channel, taken over all input distributions p(x). The noisy-channel coding theorem (A Mathematical Theory of Communication) establishes its operational meaning: any rate R < C is achievable with arbitrarily low error probability via sufficiently long block codes; no rate R > C is. For the additive-white-Gaussian-noise channel of bandwidth B and signal-to-noise ratio S/N, C = B log₂(1 + S/N) bits per second, with B in hertz and S/N a linear power ratio (not decibels). Capacity bounds the throughput of any communication channel, including LLM-mediated agent communication.
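The AWGN formula can be evaluated directly. The sketch below assumes bandwidth in hertz and converts a decibel figure to the linear ratio the formula expects; the 3 kHz bandwidth and 30 dB SNR are illustrative numbers, and the function names are hypothetical.

```python
import math

def db_to_linear(snr_db):
    """Convert a signal-to-noise ratio from decibels to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)

def awgn_capacity(bandwidth_hz, snr_linear):
    """Capacity in bits per second: C = B * log2(1 + S/N)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

# Illustrative channel: 3 kHz of bandwidth at 30 dB SNR.
capacity = awgn_capacity(3000.0, db_to_linear(30.0))
print(f"{capacity:.0f} bits per second")
```

Capacity grows linearly with bandwidth but only logarithmically with signal power, so at high SNR extra bandwidth buys far more than extra power.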

In this vault

Shannon Entropy

The expected self-information of a random variable, H(X) = -Σ p(x) log₂ p(x), measuring its average uncertainty in bits (the base-2 logarithm fixes the unit). It bounds the best achievable lossless compression and, via its relationship to Kolmogorov complexity, links statistical and algorithmic notions of information.
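A short sketch of the formula, assuming a discrete distribution given either as probabilities or as a sample of symbols. The helper names `entropy_bits` and `empirical_entropy` are illustrative.

```python
import math
from collections import Counter

def entropy_bits(probs):
    """H = -sum p log2 p, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def empirical_entropy(symbols):
    """Entropy of the empirical symbol frequencies in a sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return entropy_bits(c / total for c in counts.values())

h_fair = entropy_bits([0.5, 0.5])             # fair coin
h_skewed = entropy_bits([0.5, 0.25, 0.25])    # three outcomes, one more likely
h_text = empirical_entropy("aabb")            # two equally frequent symbols
print(h_fair, h_skewed, h_text)
```

Note that the empirical estimate treats symbols as independent draws, so it ignores any structure in their ordering.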

In this vault

A Mathematical Theory of Communication

Reference: Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), pp. 379–423 and 27(4), pp. 623–656. (Republished 1949 with additional commentary by Warren Weaver as The Mathematical Theory of Communication, University of Illinois Press.)

Summary

Shannon’s two-part 1948 paper founds information theory as a discipline. The setting is the engineering problem of communication: a source produces messages, an encoder transforms them into signals sent over a noisy channel, and a decoder attempts to reconstruct the original. Shannon’s first move is to argue that the meaning of messages is irrelevant to the engineering problem; only their statistical structure matters. He then develops the foundational notions: entropy H = -Σ p_i log p_i as the average information content per symbol of a source; mutual information I(X;Y) = H(X) - H(X|Y) as the information one variable carries about another; and channel capacity C = max_{p(x)} I(X;Y) as the supremum of mutual information over all input distributions.

The technical heart consists of two coding theorems. Source coding (noiseless coding): any source with entropy H can be losslessly compressed at a rate arbitrarily close to H bits per symbol, but no lower. Channel coding (noisy-channel coding): a source whose entropy rate is below the channel capacity C can be transmitted with arbitrarily low error probability given sufficient block length, whereas at rates above C the error probability cannot be made arbitrarily small. Together these establish the operational meanings of entropy and capacity and bound what any communication system can achieve.

The companion book The Mathematical Theory of Communication (1949) adds Weaver’s expository introduction, popularising the framework and inaugurating the broader engagement of philosophy and the social sciences with information theory. Shannon’s framework supplies the technical foundation for every communication system, the conceptual foundation for algorithmic information theory and the MDL principle, and a recurring background reference in agent-communication design, most explicitly in Why AI Agents Communicate In Human Language, which frames the case against natural-language inter-agent communication in Shannon-theoretic terms (lossy channel, low capacity per token, ambiguous code).
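As a worked illustration of the channel coding statement, consider a binary symmetric channel that flips each transmitted bit independently with probability p. The channel and the value p = 0.11 are assumptions chosen for this example, not taken from the summary; H_b denotes the binary entropy function.

```latex
\begin{align*}
I(X;Y) &= H(Y) - H(Y \mid X) = H(Y) - H_b(p) \le 1 - H_b(p), \\
C &= \max_{p(x)} I(X;Y) = 1 - H_b(p), \qquad
H_b(p) = -p \log_2 p - (1-p) \log_2 (1-p), \\
p = 0.11: \quad H_b(0.11) &\approx 0.500, \qquad C \approx 0.500 \text{ bits per channel use}.
\end{align*}
```

The bound is met with equality for a uniform input, since that makes the output uniform and H(Y) = 1. Under these assumptions, codes with arbitrarily low error probability exist at any rate below roughly half an information bit per transmitted bit, and at no rate above it.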

Key Ideas

Engineering decoupling from meaning: the engineering problem of communication is independent of semantic content; only the statistical structure of the source matters.
Entropy as average information: H(X) = -Σ p_i log_2 p_i measures the average uncertainty of a random variable in bits; for an i.i.d. source emitting symbols with probabilities p_i, H is the lower bound on bits-per-symbol for lossless compression.
Mutual information: I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) symmetrically measures how much knowing Y reduces uncertainty about X (and vice versa).
Channel capacity: C = max_{p(x)} I(X;Y), the maximum mutual information across a memoryless channel, taken over all input distributions p(x). Capacity is the supremum of the rates at which reliable transmission is possible.
Source coding theorem: lossless compression can achieve rate arbitrarily close to H (entropy) but no lower; the practical realisation is via Huffman coding, arithmetic coding, and variants.
Noisy channel coding theorem: for any rate R < C, there exists a coding scheme achieving error probability arbitrarily close to zero with sufficient block length; for R > C, error probability is bounded away from zero. The result is non-constructive — Shannon’s proof uses random coding — but the existence guarantee is what made coding theory a field.
Continuous channels and AWGN capacity: the second part extends the discrete results to continuous channels, deriving the famous C = B log₂(1 + S/N) bits per second for additive-white-Gaussian-noise channels of bandwidth B and signal-to-noise ratio S/N.
Ergodic processes and Markov sources: Shannon’s analysis carefully extends from i.i.d. to ergodic and Markov sources, motivating the asymptotic equipartition property (AEP) and laying the groundwork for source coding of non-i.i.d. data.
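The source coding bound in the list above can be checked numerically with Huffman coding, one of the practical realisations it mentions. This is a minimal sketch for a memoryless source coded one symbol at a time; the function name `huffman_code_lengths` and the probability vector are illustrative. It computes only codeword lengths, which is all the comparison with entropy requires.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the codeword length of each symbol in a binary Huffman code."""
    # Heap entries: (subtree probability, unique tie-breaker, symbol indices).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                 # merging adds one bit to every member
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)
avg_length = sum(p * n for p, n in zip(probs, lengths))
source_entropy = -sum(p * math.log2(p) for p in probs)
print(lengths, avg_length, source_entropy)
```

For this dyadic distribution the average length equals the entropy exactly. In general a symbol-by-symbol Huffman code satisfies H ≤ L < H + 1, and coding longer blocks closes the gap.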

Connections

Conceptual Contribution

Claim: The engineering problem of communication is mathematically separable from semantics; the right primitives are entropy (uncertainty per symbol of a source), mutual information (information one variable carries about another), and channel capacity (supremum of reliable transmission rate over a noisy channel). Two coding theorems establish the operational meaning of these quantities — entropy as the lossless-compression lower bound, capacity as the reliable-transmission upper bound.
Mechanism: Probabilistic model of source / channel / receiver; definition of entropy, conditional entropy, joint entropy, and mutual information; noiseless source-coding theorem (lossless compression to within ε of H); noisy channel-coding theorem (reliable transmission below C, error-bounded above C); extension to continuous channels and Gaussian noise.
Concepts introduced/used: Shannon Entropy, Shannon Information, Mutual Information, Channel Capacity, Source Coding Theorem, Noisy-Channel Coding Theorem, Asymptotic Equipartition Property.
Stance: founding paper of an entire discipline.
Relates to: Direct technical predecessor of Kolmogorov complexity / Algorithmic Information Theory (Solomonoff, Kolmogorov, Chaitin 1960s) — where Shannon measures the average information of an ensemble, AIT measures the absolute information of an individual object as the length of its shortest description; both are operationally founded on the same coding-theorem intuition. Shannon’s deliberate decoupling of the engineering problem from semantics inaugurates the running tension in agent communication between Shannon information (statistical, channel-bounded) and meaningful information (semantic, conventional, illocutionary). The agent-communication-language tradition exists precisely to address what Shannon’s framework deliberately set aside; nevertheless, the engineering bounds Shannon establishes constrain any agent communication channel, including LLM-mediated natural-language exchange. Why AI Agents Communicate In Human Language makes this explicit: natural language as inter-agent code has low channel capacity per token, high error probability under lossy LLM compression, and ambiguous decoding — Shannon-theoretic objections that motivate the case for structured ACLs. Conceptually adjacent to Chomsky 1956, which addresses the generative-grammatical structure of language as a parallel layer to Shannon’s statistical structure; together they delimit the design space of any communication system.

On the Pitfalls of Measuring Emergent Communication

Summary

This paper critically examines metrics used to detect and measure emergent communication in multi-agent reinforcement learning. The authors show that commonly used indicators, such as speaker consistency (SC), context independence (CI), mutual information between messages and actions, and message entropy, can be misleading: agents trained with a communication channel that does not influence their behavior may still exhibit high values on these metrics, producing the illusion of communication.

To disentangle the phenomenon, they propose decomposing communication into positive signaling (messages carry information about a speaker’s observations) and positive listening (messages influence a listener’s subsequent actions). They introduce causal influence of communication (CIC), a causal-intervention-based metric measuring how an agent’s message changes another agent’s action distribution, and demonstrate its properties on matrix communication games (MCGs). They offer concrete recommendations for when each metric should be trusted.
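The gap between signalling and listening can be reproduced in a toy setting. The sketch below is a hypothetical two-agent example, not one of the paper's matrix communication games: both agents see the same fair-coin observation, the speaker copies it into the message, and the listener acts on its own observation while ignoring the message. All names are illustrative.

```python
import math
from itertools import product

def mutual_information(weighted_pairs):
    """I(U;V) in bits from an iterable of ((u, v), probability) entries."""
    joint, pu, pv = {}, {}, {}
    for (u, v), p in weighted_pairs:
        joint[(u, v)] = joint.get((u, v), 0.0) + p
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

def speaker(obs):
    return obs            # message encodes the observation (positive signalling)

def listener(obs, message):
    return obs            # action ignores the message (no positive listening)

observations = [0, 1]     # shared observation, uniformly distributed
messages = [0, 1]

# Observational: the message is whatever the speaker chose to send.
observational = [((speaker(o), listener(o, speaker(o))), 0.5)
                 for o in observations]
# Interventional: the message is set uniformly at random, independent of o.
interventional = [((m, listener(o, m)), 0.25)
                  for o, m in product(observations, messages)]

mi_observational = mutual_information(observational)
mi_interventional = mutual_information(interventional)
print(mi_observational, mi_interventional)
```

Observed messages and actions share a full bit because both depend on the observation, yet the dependence vanishes once the message is set by intervention. This is the pattern the decomposition is meant to expose.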

Key Ideas

Speaker consistency can be positive even when no communication happens.
Separate positive signalling from positive listening.
Causal influence of communication (CIC) via interventions on messages.
Matrix Communication Games as a minimal testbed.
Entropy-based metrics are shape-dependent and deceptive.

Connections

Conceptual Contribution

Claim: Popular metrics for emergent communication in MARL cannot distinguish real communication from spurious correlations; communication must be analysed causally, decomposed into signalling and listening.
Mechanism: The authors construct Matrix Communication Games where a policy with a non-influential communication channel still scores high on speaker consistency, context independence, mutual information, and entropy-based measures. They then define positive signalling and positive listening as separate properties, and introduce Causal Influence of Communication (CIC), an interventional metric (in the sense of the do-operator) that measures how replacing the sent message would change the listener’s action distribution.
Concepts introduced/used: Positive Signalling, Positive Listening, Causal Influence of Communication, Speaker Consistency, Context Independence, Matrix Communication Games, Emergent Communication
Stance: critique
Relates to: Sharpens the empirical agenda of Emergence of Grounded Compositional Language in Multi-Agent Populations and Emergent Communication; its causal framing complements the decision-theoretic message-value account in Towards Automating the Evolution of Linguistic Competence and offers a verifiability foothold missing from the mentalistic semantics critiqued by Agent Communication Languages - Rethinking the Principles.