Deriving Entropy from Self-Information

In the last post, Aliens and Self-Information, we derived the formula for self-information.

The self-information $I(x)$ of an event $x$ with probability $p(x)$ is defined as:

$$I(x) = -\log p(x)$$

It quantifies how surprising or informative an event is. Less probable events carry more information.
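To make this concrete, here is a minimal Python sketch computing $I(x)$ in bits for a few probabilities (the helper name `self_information` is just illustrative, not a library function):

```python
import math

def self_information(p: float) -> float:
    """Self-information I(x) = -log2 p(x), measured in bits."""
    return -math.log2(p)

# Less probable events are more surprising, i.e. carry more information.
for p in [0.5, 0.25, 0.01]:
    print(f"p = {p:>5}: I = {self_information(p):.2f} bits")
# p =   0.5: I = 1.00 bits
# p =  0.25: I = 2.00 bits
# p =  0.01: I = 6.64 bits
```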

Entropy as Expected Value of Self-Information

Entropy $H(X)$ measures the average uncertainty or information content in a probability distribution $p(x)$ over all possible outcomes of a random variable $X$. It is defined as the expected value of self-information:

$$H(X) = \mathbb{E}[I(X)]$$

Substituting $I(x) = -\log p(x)$:

$$H(X) = \sum_{x \in \mathcal{X}} p(x) \cdot I(x)$$

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$
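As a sketch of this sum in code, assuming base 2 so the result is in bits (the `entropy` helper below is hypothetical, not a library API):

```python
import math

def entropy(probs: list[float]) -> float:
    """H(X) = -sum over x of p(x) * log2 p(x); zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits (less uncertain)
```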

Interpretation

The units depend on the base of the logarithm: base 2 gives bits, base $e$ gives nats, and base 10 gives hartleys.
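Changing the base only rescales the result by a constant factor, since $\log_b x = \ln x / \ln b$; a quick sketch with a fair coin:

```python
import math

probs = [0.5, 0.5]                               # a fair coin
h_bits = -sum(p * math.log2(p) for p in probs)   # base 2 -> bits
h_nats = -sum(p * math.log(p) for p in probs)    # base e -> nats
print(h_bits)                 # 1.0
print(h_nats)                 # ~0.693
print(h_nats / math.log(2))   # dividing by ln 2 converts nats back to bits: 1.0
```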

$H(X)$ is maximized when all outcomes are equally likely ($p(x) = \frac{1}{|\mathcal{X}|}$), in which case $H(X) = \log |\mathcal{X}|$.

Intuition

Examples:

  1. If $p(x) = 1$ (certainty), then $I(x) = 0$ and $H(X) = 0$.
  2. If $p(x)$ is uniform, $H(X)$ is maximized because uncertainty is highest. Both cases are verified numerically in the sketch below.
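A minimal sketch of both cases, reusing the same entropy sum and taking $0 \log 0 = 0$ by convention:

```python
import math

def entropy(probs: list[float]) -> float:
    """H(X) = -sum p(x) log2 p(x), with 0 * log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Case 1: a certain outcome -> no surprise, zero entropy.
print(entropy([1.0, 0.0, 0.0]))        # 0.0

# Case 2: uniform over 4 outcomes -> the maximum, log2(4) = 2 bits.
print(entropy([0.25] * 4))             # 2.0
# Any non-uniform distribution over the same outcomes has lower entropy:
print(entropy([0.7, 0.1, 0.1, 0.1]))   # ~1.357 bits
```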

This derivation links the concept of individual surprise (self-information) to the broader idea of average uncertainty across a distribution (entropy).