Local KL Divergence

The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions p and q, the KL divergence is defined as

KL(p || q) := \int p(x) \log(p(x)/q(x)) dx

Note that KL(p || q) \neq KL(q || p). Intuitively, a small KL(p || q) means that there are few points that p assigns high probability to but that q does not. We can also think of KL(p || q) as the number of bits of information needed to update from the distribution q to the distribution p.

Suppose that p and q are both mixtures of other distributions: p(x) = \sum_i \alpha_i F_i(x) and q(x) = \sum_i \beta_i G_i(x). Can we bound KL(p || q) in terms of the KL(F_i || G_i)? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:

Theorem: If \sum_i \alpha_i = \sum_i \beta_i = 1 and F_i and G_i are all probability distributions, then

KL\left(\sum_i \alpha_i F_i || \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i || G_i)\right).

Proof: If we expand the definition, then we are trying to prove that

\int \left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \alpha_i F_i(x)}{\sum \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_iF_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx

We will in fact show that this is true for every value of x, so that it is certainly true for the integral. Using \log(x/y) = -\log(y/x), re-write the condition for a given value of x as

\left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \beta_i G_i(x)}{\sum \alpha_i F_i(x)}\right) \geq \sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen’s inequality on the \log function:

\sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_iF_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)

This proves the inequality and therefore the theorem. \square

Remark: Intuitively, if we want to describe \sum \alpha_i F_i in terms of \sum \beta_i G_i, it is enough to first locate the ith term in the sum and then to describe F_i in terms of G_i. The theorem is a formalization of this intuition. In the case that F_i = G_i, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: