Deriving Entropy from Self-Information

In the last post, Aliens and Self-Information, we derived the formula for self-information.

The self-information $I(x)$ of an event $x$ with probability $p(x)$ is defined as:

$$I(x) = -\log p(x)$$

It quantifies how surprising or informative an event is. Less probable events carry more information.
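To make this concrete, here is a minimal Python sketch computing $I(x)$ in bits for a few probabilities (the helper name `self_information` is just illustrative, not a library function):

```python
import math

def self_information(p: float) -> float:
    """Self-information I(x) = -log2 p(x), measured in bits."""
    return -math.log2(p)

# Less probable events are more surprising, i.e. carry more information.
for p in [0.5, 0.25, 0.01]:
    print(f"p = {p:>5}: I = {self_information(p):.2f} bits")
# p =   0.5: I = 1.00 bits
# p =  0.25: I = 2.00 bits
# p =  0.01: I = 6.64 bits
```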

Entropy as Expected Value of Self-Information

Entropy $H(X)$ measures the average uncertainty or information content in a probability distribution $p(x)$ over all possible outcomes of a random variable $X$. It is defined as the expected value of self-information:

$$H(X) = \mathbb{E}[I(X)]$$

Substituting $I(x) = -\log p(x)$:

$$H(X) = \sum_{x \in \mathcal{X}} p(x) \cdot I(x)$$

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$
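As a sketch of this sum in code, assuming base 2 so the result is in bits (the `entropy` helper below is hypothetical, not a library API):

```python
import math

def entropy(probs: list[float]) -> float:
    """H(X) = -sum over x of p(x) * log2 p(x); zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits (less uncertain)
```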

Interpretation

The units depend on the base of the logarithm: base 2 gives bits, base $e$ gives nats, and base 10 gives hartleys.
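Changing the base only rescales the result by a constant factor, since $\log_b x = \ln x / \ln b$; a quick sketch with a fair coin:

```python
import math

probs = [0.5, 0.5]                               # a fair coin
h_bits = -sum(p * math.log2(p) for p in probs)   # base 2 -> bits
h_nats = -sum(p * math.log(p) for p in probs)    # base e -> nats
print(h_bits)                 # 1.0
print(h_nats)                 # ~0.693
print(h_nats / math.log(2))   # dividing by ln 2 converts nats back to bits: 1.0
```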

$H(X)$ is maximized when all outcomes are equally likely ($p(x) = \frac{1}{|\mathcal{X}|}$), in which case $H(X) = \log |\mathcal{X}|$.

Intuition

Examples:

  1. If $p(x) = 1$ (certainty), then $I(x) = 0$ and $H(X) = 0$.
  2. If $p(x)$ is uniform, $H(X)$ is maximized because uncertainty is highest. Both cases are verified numerically in the sketch below.
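A minimal sketch of both cases, reusing the same entropy sum and taking $0 \log 0 = 0$ by convention:

```python
import math

def entropy(probs: list[float]) -> float:
    """H(X) = -sum p(x) log2 p(x), with 0 * log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Case 1: a certain outcome -> no surprise, zero entropy.
print(entropy([1.0, 0.0, 0.0]))        # 0.0

# Case 2: uniform over 4 outcomes -> the maximum, log2(4) = 2 bits.
print(entropy([0.25] * 4))             # 2.0
# Any non-uniform distribution over the same outcomes has lower entropy:
print(entropy([0.7, 0.1, 0.1, 0.1]))   # ~1.357 bits
```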

This derivation links the concept of individual surprise (self-information) to the broader idea of average uncertainty across a distribution (entropy).