Information theory of multiple random variables

In the previous articles [1] [2] we defined the information of an observation as the negative logarithm of its probability, and the entropy of a random variable as the expected amount of information gained from observing it (it is thus a measure of how little information the prior carries). In the multivariate case it is natural to define a "joint information" as the negative logarithm of the joint probability, which gives the joint entropy:

$$H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

Because the full joint distribution is at least as informative as the marginal distributions taken separately, we have the property:

$$H(X,Y) \le H(X) + H(Y)$$
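As a quick numerical check of this inequality, here is a minimal sketch in Python. The joint distribution `joint` and the helper names `entropy` and `marginal` are made up for illustration; any valid joint distribution would do.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Sum a joint distribution {(x, y): p} over the other variable."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

# Hypothetical joint distribution p(x, y) of two correlated binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

h_xy = entropy(joint.values())              # joint entropy H(X,Y)
h_x = entropy(marginal(joint, 0).values())  # marginal entropy H(X)
h_y = entropy(marginal(joint, 1).values())  # marginal entropy H(Y)

print(f"H(X,Y) = {h_xy:.4f}, H(X) + H(Y) = {h_x + h_y:.4f}")
```

For this example the joint entropy is about 1.72 bits, while the two marginal entropies add up to 2 bits.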

We often informally explain notions of independence and relationships between random variables in terms of information – with information theory, we can now formalize these descriptions. The Venn diagram below shows the additive relationships between various entropies of multiple random variables:

[Figure: Venn diagram of the entropies H(X), H(Y) and H(X,Y), the conditional entropies, and the mutual entropy]

H(X) and H(Y) are represented by the respective circles, H(X,Y) by their combined area, and the mutual entropy by their overlap; the conditional entropies are as indicated. The mutual entropy (more commonly called mutual information) is the expectation, over the joint distribution, of the mutual information of a single observation (more commonly called pointwise mutual information):

$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
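The definition above can be sketched numerically as follows. The joint distribution is a made-up example, and the names `pmi` and `mutual_information` are chosen here for illustration only.

```python
import math

# Hypothetical joint distribution p(x, y) of two correlated binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y), obtained by summing out the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

def pmi(x, y):
    """Pointwise mutual information of the observation (x, y), in bits."""
    return math.log2(joint[(x, y)] / (p_x[x] * p_y[y]))

# Mutual entropy: the expectation of pmi under the joint distribution.
mutual_information = sum(p * pmi(x, y) for (x, y), p in joint.items() if p > 0)

for x, y in joint:
    print(f"pmi({x};{y}) = {pmi(x, y):+.4f}")
print(f"I(X;Y) = {mutual_information:.4f}")
```

In this example the pointwise values for matching outcomes are positive and those for mismatched outcomes are negative, while their weighted average comes out to roughly 0.28 bits.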

Even though the pointwise mutual information may be positive or negative (an observed x can make a particular y more or less probable), its expectation is never negative, in a way analogous to conservation of expected evidence. These ideas can be generalized beyond two variables:

[Figure: the entropy diagram generalized to more than two variables]

The mutual entropy represents the reduction in the number of bits necessary to encode two correlated variables together, as opposed to separately. This is a special case of the /entropy gain/ (or "Kullback-Leibler divergence") between two probability distributions p and q: the expected number of extra bits used when encoding a p(x)-distributed random variable with a code optimized for q(x). The mutual entropy is the entropy gain of the joint distribution p(x,y) relative to the product of the marginals p(x)p(y).

$$\begin{align*}KL(p(X) \parallel q(X)) &= -\sum_x p(x) \log q(x) - \left(-\sum_x p(x) \log p(x)\right) \\&= \sum_x p(x) \log \frac{p(x)}{q(x)}\end{align*}$$

The first term of this expression (the number of bits required to express the random variable in the incorrect encoding) is also known as the cross-entropy.
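The decomposition into cross-entropy minus entropy can be illustrated with a small sketch. The distributions `p`, `q`, `joint` and `product` are made up for the example, and the code assumes q(x) > 0 wherever p(x) > 0.

```python
import math

def cross_entropy(p, q):
    """Expected bits to encode p-distributed outcomes with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Entropy gain: cross-entropy of p under q, minus the entropy of p."""
    return cross_entropy(p, q) - cross_entropy(p, p)

# Hypothetical true distribution p and mismatched coding distribution q.
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
extra_bits = kl_divergence(p, q)

# Mutual entropy as the entropy gain of a joint distribution relative to
# the product of its marginals (here both marginals are uniform).
joint = [0.4, 0.1, 0.1, 0.4]        # p(x,y) for (0,0), (0,1), (1,0), (1,1)
product = [0.25, 0.25, 0.25, 0.25]  # p(x)p(y) for the same outcomes
mutual_information = kl_divergence(joint, product)

print(f"extra bits = {extra_bits:.4f}, I(X;Y) = {mutual_information:.4f}")
```

Here the entropy of `p` is 1.5 bits and its cross-entropy under the uniform `q` is log2(3) bits, so the entropy gain is the difference between the two.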