A Fun Optimization Problem

I spent the last several hours trying to come up with an efficient algorithm for the following problem:

Problem: Suppose that we have a sequence of l pairs of non-negative numbers (a_1,b_1),\ldots,(a_l,b_l) such that \sum_{i=1}^l a_i \leq A and \sum_{i=1}^l b_i \leq B. Devise an efficient algorithm to find the k pairs (a_{i_1},b_{i_1}),\ldots,(a_{i_k},b_{i_k}) that maximize

\left[\sum_{r=1}^k a_{i_r}\log(a_{i_r}/b_{i_r})\right] + \left[A-\sum_{r=1}^k a_{i_r}\right]\log\left(\frac{A-\sum_{r=1}^k a_{i_r}}{B-\sum_{r=1}^k b_{i_r}}\right).
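To make the objective concrete, here is a minimal sketch in Python (the function name, the toy instance, and the positivity assumption are mine, not part of the problem statement) that evaluates the objective for a candidate index set and brute-forces a tiny instance; the brute force is exponential in k, which is exactly why an efficient algorithm is wanted.

import itertools
import math

def objective(pairs, indices, A, B):
    # pairs:   list of (a_i, b_i), assumed strictly positive here for simplicity
    # indices: the chosen indices i_1, ..., i_k
    # A, B:    upper bounds on sum(a_i) and sum(b_i)
    sum_a = sum(pairs[i][0] for i in indices)
    sum_b = sum(pairs[i][1] for i in indices)
    # first bracket: sum over chosen pairs of a * log(a / b)
    chosen = sum(a * math.log(a / b) for a, b in (pairs[i] for i in indices))
    # second bracket: the term for the leftover mass (A - sum_a, B - sum_b)
    leftover = (A - sum_a) * math.log((A - sum_a) / (B - sum_b))
    return chosen + leftover

# toy instance (made up for illustration); exhaustive search over size-k subsets
pairs = [(1.0, 2.0), (3.0, 1.0), (2.0, 2.0), (0.5, 1.5)]
A, B, k = 10.0, 10.0, 2
best = max(itertools.combinations(range(len(pairs)), k),
           key=lambda S: objective(pairs, S, A, B))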

Commentary: I don’t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in O\left(\frac{l \log(l)}{\epsilon}\right) time and finds 2k pairs that do at least (1-\epsilon) times as well as the best set of k pairs. It’s possible I need to assume something like \sum_{i=1}^l a_i \leq A/2 instead of just A (and similarly for the b_i), although I’m happy to make that assumption.

While attempting to solve this problem, I’ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I’d be interested to know.

Eigenvalue Bounds

While grading homeworks today, I came across the following bound:

Theorem 1: If A and B are symmetric n\times n matrices with eigenvalues \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n and \mu_1 \geq \mu_2 \geq \ldots \geq \mu_n respectively, then Trace(A^TB) \leq \sum_{i=1}^n \lambda_i \mu_i.

For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the Cauchy Interlacing Theorem:

Theorem 2: If A is an n\times n symmetric matrix and B is an (n-k) \times (n-k) principal submatrix of A, then \lambda_{i+k}(A) \leq \lambda_i(B) \leq \lambda_i(A), where \lambda_i(X) is the ith largest eigenvalue of X.

As a corollary we have:

Corollary 1: For any symmetric matrix X, \sum_{i=1}^k X_{ii} \leq \sum_{i=1}^k \lambda_i(X).

Proof: The left-hand side is just the trace of the upper-left k\times k principal submatrix of X, which equals the sum of that submatrix's eigenvalues; by Theorem 2 these are bounded, term by term, by the k largest eigenvalues of X. \square
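As a quick numerical sanity check of the corollary (my own sketch, using numpy, on a random symmetric matrix):

import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
M = rng.standard_normal((n, n))
X = (M + M.T) / 2                             # random symmetric matrix
eigs = np.sort(np.linalg.eigvalsh(X))[::-1]   # eigenvalues, largest first
lhs = np.trace(X[:k, :k])                     # sum of the first k diagonal entries
rhs = eigs[:k].sum()                          # sum of the k largest eigenvalues
assert lhs <= rhs + 1e-9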

The final ingredient we will need is a sort of “majorization” inequality based on Abel summation:

Theorem 3: If x_1,\ldots,x_n and y_1,\ldots,y_n are such that \sum_{i=1}^k x_i \leq \sum_{i=1}^k y_i for all k (with equality when k=n), and c_1 \geq c_2 \geq \ldots \geq c_n, then \sum_{i=1}^n c_ix_i \leq \sum_{i=1}^n c_iy_i.

Proof: We have:

\sum_{i=1}^n c_ix_i = c_n(x_1+\cdots+x_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+\cdots+x_i) \leq c_n(y_1+\cdots+y_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+\cdots+y_i) = \sum_{i=1}^n c_iy_i

where the two equalities are instances of Abel summation, and the inequality uses the facts that c_i - c_{i+1} \geq 0, that each partial sum of the x_i is at most the corresponding partial sum of the y_i, and that the full sums are equal. \square
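Here is a small sketch of my own (in Python) that builds a random instance satisfying the hypotheses of Theorem 3 and checks the conclusion:

import numpy as np

rng = np.random.default_rng(1)
n = 5
y = rng.standard_normal(n)
T = np.cumsum(y)                              # partial sums of the y_i
slack = np.abs(rng.standard_normal(n))        # nonnegative slack below each T_k
slack[-1] = 0.0                               # force equality when k = n
S = T - slack                                 # partial sums of the x_i (each <= T_k)
x = np.diff(S, prepend=0.0)                   # recover the x_i from their partial sums
c = np.sort(rng.standard_normal(n))[::-1]     # decreasing coefficients c_1 >= ... >= c_n
assert np.dot(c, x) <= np.dot(c, y) + 1e-9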

Now, we are finally ready to prove the original theorem:

Proof of Theorem 1: First note that since the trace is invariant under similarity transforms, we can without loss of generality assume that A = \mathrm{diag}(\lambda_1,\ldots,\lambda_n) (conjugate both A and B by the orthogonal matrix that diagonalizes A; this changes neither Trace(A^TB) nor the two spectra), in which case we want to prove that \sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i. But by Corollary 1, we also know that \sum_{i=1}^k B_{ii} \leq \sum_{i=1}^k \mu_i for all k, with equality when k=n since both sides are then Trace(B). Since by assumption the \lambda_i are a decreasing sequence, Theorem 3 then implies that \sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i, which is what we wanted to show. \square
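And a corresponding numerical check of Theorem 1 itself (again just a sketch, with random symmetric A and B):

import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # random symmetric A
B = rng.standard_normal((n, n)); B = (B + B.T) / 2   # random symmetric B
lam = np.sort(np.linalg.eigvalsh(A))[::-1]           # lambda_1 >= ... >= lambda_n
mu = np.sort(np.linalg.eigvalsh(B))[::-1]            # mu_1 >= ... >= mu_n
assert np.trace(A.T @ B) <= np.dot(lam, mu) + 1e-9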

Local KL Divergence

The KL divergence is an important tool for measuring the difference between two probability distributions. Formally, given two distributions p and q, the KL divergence is defined as

KL(p || q) := \int p(x) \log(p(x)/q(x)) dx

Note that KL(p || q) \neq KL(q || p). Intuitively, a small KL(p || q) means that there are few points that p assigns high probability to but that q does not. We can also think of KL(p || q) as the number of bits of information needed to update from the distribution q to the distribution p.
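For concreteness, here is a minimal sketch of the discrete analogue (sums instead of integrals, natural logarithms so the answer is in nats rather than bits; the distributions below are made up):

import numpy as np

def kl(p, q):
    # discrete KL divergence; assumes p and q are positive arrays summing to 1
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # the two directions generally differ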

Suppose that p and q are both mixtures of other distributions: p(x) = \sum_i \alpha_i F_i(x) and q(x) = \sum_i \beta_i G_i(x). Can we bound KL(p || q) in terms of the KL(F_i || G_i)? In some sense this is asking to upper bound the KL divergence in terms of more “local” KL divergences. It turns out this can be done:

Theorem: If \sum_i \alpha_i = \sum_i \beta_i = 1 and F_i and G_i are all probability distributions, then

KL\left(\sum_i \alpha_i F_i || \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i || G_i)\right).

Proof: If we expand the definition, then we are trying to prove that

\int \left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \alpha_i F_i(x)}{\sum \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_iF_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx

We will in fact show that this is true for every value of x, so that it is certainly true for the integral. Using \log(x/y) = -\log(y/x), re-write the condition for a given value of x as

\left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \beta_i G_i(x)}{\sum \alpha_i F_i(x)}\right) \geq \sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by applying Jensen’s inequality to the (concave) \log function:

\sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_iF_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)

This proves the inequality and therefore the theorem. \square
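As a numerical sanity check of the theorem (a sketch with made-up discrete mixtures; kl is the same helper as in the earlier sketch):

import numpy as np

def kl(p, q):
    # discrete KL divergence in nats; p and q positive and summing to 1
    return float(np.sum(p * np.log(p / q)))

alpha = np.array([0.6, 0.4])                 # mixture weights for p
beta  = np.array([0.3, 0.7])                 # mixture weights for q
F = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])              # rows are the components F_i
G = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.2, 0.6]])              # rows are the components G_i
p = alpha @ F                                # p = sum_i alpha_i F_i
q = beta @ G                                 # q = sum_i beta_i G_i
lhs = kl(p, q)
rhs = sum(a * (np.log(a / b) + kl(f, g))
          for a, b, f, g in zip(alpha, beta, F, G))
assert lhs <= rhs + 1e-9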

Remark: Intuitively, if we want to describe \sum \alpha_i F_i in terms of \sum \beta_i G_i, it is enough to first specify which component i of the mixture a sample came from and then to describe F_i in terms of G_i. The theorem is a formalization of this intuition. In the case that F_i = G_i, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.