Local KL Divergence
February 2, 2013
The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions $p$ and $q$, the KL divergence is defined as

$$KL(p \,\|\, q) := \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$
Note that $KL(p \,\|\, q) \neq KL(q \,\|\, p)$ in general. Intuitively, a small KL(p || q) means that there are few points that p assigns high probability to but that q does not. We can also think of KL(p || q) as the number of bits of information needed to update from the distribution q to the distribution p.
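As a quick numerical illustration (not from the original post), here is a sketch of the definition for discrete distributions, represented as plain probability lists; the names `kl`, `p`, and `q` are my own:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists.

    Uses the natural log (so the answer is in nats rather than bits);
    the convention 0 * log(0/q) = 0 is handled by skipping entries
    where p is zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(kl(p, q))  # ~0.232: p's high-probability points also get mass under q
print(kl(q, p))  # ~0.315: the divergence is not symmetric
```

Dividing by $\log 2$ would convert the answer from nats to bits, matching the coding interpretation above.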
Suppose that p and q are both mixtures of other distributions: $p = \sum_i w_i p_i$ and $q = \sum_i v_i q_i$. Can we bound $KL(p \,\|\, q)$ in terms of the $KL(p_i \,\|\, q_i)$? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:
Theorem: If $w$ and $v$ are probability vectors and the $p_i$ and $q_i$ are all probability distributions, then

$$KL\left(\sum_i w_i p_i \,\middle\|\, \sum_i v_i q_i\right) \leq KL(w \,\|\, v) + \sum_i w_i \, KL(p_i \,\|\, q_i).$$
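Before the proof, a sanity check (my addition, not part of the original post): the bound can be tested numerically on random discrete mixtures. The helper names `kl` and `random_dist` are hypothetical:

```python
import math
import random

random.seed(0)

def kl(p, q):
    # KL(p || q) in nats for probability vectors of equal length
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n):
    # a random probability vector of length n
    xs = [random.random() for _ in range(n)]
    s = sum(xs)
    return [x / s for x in xs]

k, m = 4, 6  # k mixture components over an m-point sample space
w, v = random_dist(k), random_dist(k)    # mixture weights
ps = [random_dist(m) for _ in range(k)]  # components of p
qs = [random_dist(m) for _ in range(k)]  # components of q

# form the two mixtures pointwise
p = [sum(w[i] * ps[i][x] for i in range(k)) for x in range(m)]
q = [sum(v[i] * qs[i][x] for i in range(k)) for x in range(m)]

lhs = kl(p, q)
rhs = kl(w, v) + sum(w[i] * kl(ps[i], qs[i]) for i in range(k))
print(lhs <= rhs)  # the mixture bound holds
```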
Proof: If we expand the definition, then we are trying to prove that

$$\int \left(\sum_i w_i p_i(x)\right) \log \frac{\sum_i w_i p_i(x)}{\sum_i v_i q_i(x)} \, dx \;\leq\; \sum_i w_i \log \frac{w_i}{v_i} + \sum_i w_i \int p_i(x) \log \frac{p_i(x)}{q_i(x)} \, dx.$$
Since $\int p_i(x) \, dx = 1$, the right-hand side can be combined into the single expression $\sum_i \int w_i p_i(x) \log \frac{w_i p_i(x)}{v_i q_i(x)} \, dx$. We will in fact show that the inequality is true for every value of $x$, so that it is certainly true for the integral. Using $\log \frac{a}{b} = -\log \frac{b}{a}$, re-write the condition for a given value of $x$ as

$$\left(\sum_i w_i p_i(x)\right) \log \frac{\sum_i v_i q_i(x)}{\sum_i w_i p_i(x)} \;\geq\; \sum_i w_i p_i(x) \log \frac{v_i q_i(x)}{w_i p_i(x)}.$$
(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen's inequality on the (concave) $\log$ function:

$$\log \frac{\sum_i v_i q_i(x)}{\sum_j w_j p_j(x)} = \log \left( \sum_i \frac{w_i p_i(x)}{\sum_j w_j p_j(x)} \cdot \frac{v_i q_i(x)}{w_i p_i(x)} \right) \geq \sum_i \frac{w_i p_i(x)}{\sum_j w_j p_j(x)} \log \frac{v_i q_i(x)}{w_i p_i(x)},$$

where Jensen applies because the coefficients $\frac{w_i p_i(x)}{\sum_j w_j p_j(x)}$ sum to 1. Multiplying both sides by $\sum_j w_j p_j(x)$ yields exactly the rewritten condition above.
This proves the inequality and therefore the theorem.
Remark: Intuitively, if we want to describe $p$ in terms of $q$, it is enough to first locate the $i$th term in the sum and then to describe $p_i$ in terms of $q_i$. The theorem is a formalization of this intuition. In the case that $p_i = q_i$ for all $i$, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence $KL(w \,\|\, v)$ between the mixture weights.
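The special case in the remark can also be checked numerically; this sketch (my addition, with hypothetical helpers `kl` and `random_dist`) mixes one shared set of components under two different weight vectors:

```python
import math
import random

random.seed(1)

def kl(p, q):
    # KL(p || q) in nats for probability vectors of equal length
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n):
    xs = [random.random() for _ in range(n)]
    s = sum(xs)
    return [x / s for x in xs]

k, m = 5, 8
cs = [random_dist(m) for _ in range(k)]  # shared components p_i = q_i
w, v = random_dist(k), random_dist(k)    # two different sets of weights

p = [sum(w[i] * cs[i][x] for i in range(k)) for x in range(m)]
q = [sum(v[i] * cs[i][x] for i in range(k)) for x in range(m)]

print(kl(p, q) <= kl(w, v))  # mixture divergence bounded by weight divergence
```

Note that the gap can be large: mixing contracts the divergence, so $KL(w \,\|\, v)$ is often a loose bound on $KL(p \,\|\, q)$.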