Mutual information

In probability theory and information theory, the mutual information of two random variables is a quantity that measures the mutual dependence of the two variables. With the base-2 logarithms used below, the unit of measurement of mutual information is the bit.
Informally, mutual information measures the information of X that is shared by Y. If X and Y are independent, then X contains no information about Y and vice versa, so their mutual information is zero. If X and Y are identical then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa, so the mutual information is the same as the information conveyed by X (or Y) alone, namely the entropy of X. In a specific sense (see below), mutual information quantifies the distance between the joint distribution of X and Y and the product of their marginal distributions.
Formally, in the discrete case, if the joint probability mass function of X and Y is p, with p(x, y) = Pr(X=x, Y=y), the probability mass function of X alone is f, with f(x) = Pr(X=x), and the probability mass function of Y alone is g, with g(y) = Pr(Y=y), then the mutual information of X and Y, written I(X, Y), is defined as:
 <math> I(X,Y) = \sum_{x,y} p(x,y) \, \log_2 \frac{p(x,y)}{f(x)\,g(y)}. </math>
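The discrete formula can be computed directly from a joint probability table. The following sketch (the helper name `mutual_information` and the dict-based interface are this example's own, not from the article) forms the marginals f and g and evaluates the sum:

```python
import math

def mutual_information(joint):
    """Mutual information I(X,Y) in bits from a joint probability table.

    `joint` maps (x, y) pairs to Pr(X=x, Y=y). Illustrative helper only.
    """
    # Marginal distributions f(x) = Pr(X=x) and g(y) = Pr(Y=y).
    f, g = {}, {}
    for (x, y), p in joint.items():
        f[x] = f.get(x, 0.0) + p
        g[y] = g.get(y, 0.0) + p
    # I(X,Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (f(x) g(y)) ).
    return sum(p * math.log2(p / (f[x] * g[y]))
               for (x, y), p in joint.items() if p > 0)

# X = Y uniform on {0, 1}: the mutual information equals H(X) = 1 bit.
identical = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(identical))  # 1.0
```

For independent X and Y (a joint table that factors into its marginals), the same function returns 0, matching the independence property discussed below.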
In the continuous case, p, f and g are probability density functions and the summation is replaced by a double integral:
 <math> I(X,Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x,y) \, \log_2 \frac{p(x,y)}{f(x)\,g(y)} \; dx \, dy. </math>
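The continuous case can be checked numerically against a distribution whose mutual information is known in closed form: for a standard bivariate normal with correlation ρ, I(X,Y) = −½ log₂(1 − ρ²). The sketch below (the function name, grid limits, and step size are this example's choices) approximates the double integral with a midpoint Riemann sum:

```python
import math

def gaussian_mi_numeric(rho, lim=6.0, step=0.02):
    """Approximate I(X,Y) in bits for a standard bivariate normal with
    correlation `rho`, via a midpoint Riemann sum over [-lim, lim]^2."""
    c = 1.0 / (2 * math.pi * math.sqrt(1 - rho ** 2))  # joint density constant
    n = int(2 * lim / step)
    total = 0.0
    for i in range(n):
        x = -lim + (i + 0.5) * step
        fx = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # marginal f(x)
        for j in range(n):
            y = -lim + (j + 0.5) * step
            gy = math.exp(-y * y / 2) / math.sqrt(2 * math.pi)  # marginal g(y)
            p = c * math.exp(-(x * x - 2 * rho * x * y + y * y)
                             / (2 * (1 - rho ** 2)))  # joint density p(x,y)
            if p > 1e-300:
                total += p * math.log2(p / (fx * gy)) * step * step
    return total

rho = 0.5
exact = -0.5 * math.log2(1 - rho ** 2)  # closed form for the bivariate normal
print(gaussian_mi_numeric(rho), exact)
```

The numeric estimate agrees with the closed form to several decimal places, which is a useful sanity check on the definition.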
Mutual information is a measure of independence in the following sense: I(X, Y) = 0 iff X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = f(x) × g(y), and therefore:
 <math> \log_2 \frac{p(x,y)}{f(x)\,g(y)} = \log_2 1 = 0. </math>
Moreover, mutual information is nonnegative (i.e. I(X,Y) ≥ 0; see below) and symmetric (i.e. I(X,Y) = I(Y,X)).
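Both properties can be observed numerically. In the sketch below (the helper `mi` and the toy joint distribution are this example's own), swapping the roles of X and Y in an asymmetric joint table leaves the mutual information unchanged, and a dependent pair yields a strictly positive value:

```python
import math

def mi(joint):
    # I(X,Y) in bits from a dict {(x, y): Pr(X=x, Y=y)} (toy helper).
    f, g = {}, {}
    for (x, y), p in joint.items():
        f[x] = f.get(x, 0.0) + p
        g[y] = g.get(y, 0.0) + p
    return sum(p * math.log2(p / (f[x] * g[y]))
               for (x, y), p in joint.items() if p > 0)

# An asymmetric, dependent joint distribution (toy numbers).
p = {(0, 0): 0.5, (0, 1): 0.2, (1, 1): 0.3}
swapped = {(y, x): v for (x, y), v in p.items()}  # roles of X and Y exchanged

print(mi(p) > 0)                       # nonnegativity: True
print(math.isclose(mi(p), mi(swapped)))  # symmetry: True
```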
Several generalizations of mutual information to more than two random variables have been proposed, but a widely agreed on definition has not yet emerged.
Relation to other quantities
Mutual information can be equivalently expressed as
 <math> I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) </math>
where H(X) and H(Y) are entropies, H(X|Y) and H(Y|X) are conditional entropies, and H(X,Y) is the joint entropy of X and Y.
Since H(X) ≥ H(X|Y) (conditioning never increases entropy), this characterization is consistent with the nonnegativity property stated above.
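The entropy identity I(X,Y) = H(X) + H(Y) − H(X,Y) follows from expanding the logarithm in the definition, and is easy to verify numerically. In this sketch, the helper `H` and the toy joint table are the example's own:

```python
import math

def H(dist):
    # Shannon entropy in bits of a dict {outcome: probability}.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Joint distribution of (X, Y) as a dict (toy numbers).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
fx, gy = {}, {}
for (x, y), v in p.items():
    fx[x] = fx.get(x, 0.0) + v
    gy[y] = gy.get(y, 0.0) + v

# Direct definition of I(X,Y).
I = sum(v * math.log2(v / (fx[x] * gy[y])) for (x, y), v in p.items() if v > 0)

# Entropy identity: I(X,Y) = H(X) + H(Y) - H(X,Y).
print(math.isclose(I, H(fx) + H(gy) - H(p)))  # True
# Conditional form: I(X,Y) = H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y).
print(math.isclose(I, H(fx) - (H(p) - H(gy))))  # True
```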
Mutual information can also be expressed in terms of the Kullback-Leibler divergence between the joint distribution of two random variables X and Y and the product of their marginal distributions. Let q(x, y) = f(x) × g(y); then
 <math> I(X,Y) = \mathit{KL}(p, q). </math>
Furthermore, let h_{y}(x) = p(x, y) / g(y). Then
 <math> I(X,Y) = \sum_y g(y) \sum_x h_y(x) \, \log_2 \frac{h_y(x)}{f(x)} </math>
 <math> = \sum_y g(y) \; \mathit{KL}(h_y, f) </math>
 <math> = \mathrm{E}_Y[\mathit{KL}(h_y, f)]. </math>
Thus mutual information can also be understood as the expectation of the Kullback-Leibler divergence between the conditional distribution h_y of X given Y and the marginal distribution f of X: the more different the distributions h_y and f, the greater the information gain.
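This expected-KL formulation can be checked against the direct definition. In the sketch below, the helper `kl` and the toy joint table are the example's own; for each y, the conditional distribution h_y(x) = p(x, y) / g(y) is formed and its divergence from f is averaged under g:

```python
import math

def kl(a, b):
    # Kullback-Leibler divergence KL(a || b) in bits over a shared outcome set.
    return sum(pa * math.log2(pa / b[x]) for x, pa in a.items() if pa > 0)

p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
f, g = {}, {}
for (x, y), v in p.items():
    f[x] = f.get(x, 0.0) + v
    g[y] = g.get(y, 0.0) + v

# I(X,Y) by the direct definition ...
I = sum(v * math.log2(v / (f[x] * g[y])) for (x, y), v in p.items() if v > 0)

# ... and as E_Y[ KL(h_y || f) ], where h_y(x) = p(x, y) / g(y).
ys = {y for (_, y) in p}
expected_kl = sum(g[y] * kl({x: p.get((x, y), 0.0) / g[y] for x in f}, f)
                  for y in ys)
print(math.isclose(I, expected_kl))  # True
```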
Applications of mutual information
In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:
 Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
 Mutual information has been used as a criterion for feature selection and feature transformations in machine learning.
 Mutual information is often used as a significance function for the computation of collocations in corpus linguistics.
 Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan) and a second image which must be put into the same coordinate system as the reference, the second image is deformed until the mutual information between it and the reference image is maximized.
References
 Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill, 1984. (See Chapter 15.)
 Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography, Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989. http://www.aclweb.org/anthology/P89-1010