## A Fun Optimization Problem

I spent the last several hours trying to come up with an efficient algorithm to the following problem:

Problem: Suppose that we have a sequence of $l$ pairs of non-negative numbers $(a_1,b_1),\ldots,(a_l,b_l)$ such that $\sum_{i=1}^l a_i \leq A$ and $\sum_{i=1}^l b_i \leq B$. Devise an efficient algorithm to find the $k$ pairs $(a_{i_1},b_{i_1}),\ldots,(a_{i_k},b_{i_k})$ that maximize $\left[\sum_{r=1}^k a_{i_r}\log(a_{i_r}/b_{i_r})\right] + \left[A-\sum_{r=1}^k a_{i_r}\right]\log\left(\frac{A-\sum_{r=1}^k a_{i_r}}{B-\sum_{r=1}^k b_{i_r}}\right).$

Commentary: I don’t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in $O\left(\frac{l \log(l)}{\epsilon}\right)$ time and finds $2k$ pairs that do at least $1-\epsilon$ as well as the best set of $k$ pairs. It’s possible I need to assume something like $\sum_{i=1}^l a_i \leq A/2$ instead of just $A$ (and similarly for the $b_i$), although I’m happy to make that assumption.

While attempting to solve this problem, I’ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I’d be interested to know.

## Eigenvalue Bounds

While grading homeworks today, I came across the following bound:

Theorem 1: If A and B are symmetric $n\times n$ matrices with eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ and $\mu_1 \geq \mu_2 \geq \ldots \geq \mu_n$ respectively, then $Trace(A^TB) \leq \sum_{i=1}^n \lambda_i \mu_i$.

For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the Cauchy Interlacing Theorem:

Theorem 2: If A is an $n\times n$ symmetric matrix and B is an $(n-k) \times (n-k)$ principle submatrix of A, then $\lambda_{i-k}(A) \leq \lambda_i(B) \leq \lambda_i(A)$, where $\lambda_i(X)$ is the ith largest eigenvalue of X.

As a corollary we have:

Corollary 1: For any symmetric matrix X, $\sum_{i=1}^k X_{ii} \leq \sum_{i=1}^k \lambda_i(X)$.

Proof: The left-hand-side is just the trace of the upper-left $k\times k$ principle submatrix of X, whose eigenvalues are by Theorem 2 bounded by the k largest eigenvalues of X. $\square$

The final ingredient we will need is a sort of “majorization” inequality based on Abel summation:

Theorem 3: If $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ are such that $\sum_{i=1}^k x_i \leq \sum_{i=1}^k y_i$ for all k (with equality when $k=n$), and $c_1 \geq c_2 \geq \ldots \geq c_n$, then $\sum_{i=1}^n c_ix_i \leq \sum_{i=1}^n c_iy_i$.

Proof: We have: $\sum_{i=1}^n c_ix_i = c_n(x_1+\cdots+x_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+\cdots+x_i) \leq c_n(y_1+\cdots+y_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+\cdots+y_i) = \sum_{i=1}^n c_iy_i$

where the equalities come from the Abel summation method. $\square$

Now, we are finally ready to prove the original theorem:

Proof of Theorem 1: First note that since the trace is invariant under similarity transforms, we can without loss of generality assume that A is diagonal, in which case we want to prove that $\sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i$. But by Corollary 1, we also know that $\sum_{i=1}^k B_{ii} \leq \sum_{i=1}^k \mu_i$ for all k. Since by assumption the $\lambda_i$ are a decreasing sequence, Theorem 3 then implies that $\sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i$, which is what we wanted to show. $\square$

## Local KL Divergence

The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions $p$ and $q$, the KL divergence is defined as $KL(p || q) := \int p(x) \log(p(x)/q(x)) dx$

Note that $KL(p || q) \neq KL(q || p)$. Intuitively, a small KL(p || q) means that there are few points that p assigns high probability to but that q does not. We can also think of KL(p || q) as the number of bits of information needed to update from the distribution q to the distribution p.

Suppose that p and q are both mixtures of other distributions: $p(x) = \sum_i \alpha_i F_i(x)$ and $q(x) = \sum_i \beta_i G_i(x)$. Can we bound $KL(p || q)$ in terms of the $KL(F_i || G_i)$? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:

Theorem: If $\sum_i \alpha_i = \sum_i \beta_i = 1$ and $F_i$ and $G_i$ are all probability distributions, then $KL\left(\sum_i \alpha_i F_i || \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i || G_i)\right)$.

Proof: If we expand the definition, then we are trying to prove that $\int \left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \alpha_i F_i(x)}{\sum \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_iF_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx$

We will in fact show that this is true for every value of $x$, so that it is certainly true for the integral. Using $\log(x/y) = -\log(y/x)$, re-write the condition for a given value of $x$ as $\left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \beta_i G_i(x)}{\sum \alpha_i F_i(x)}\right) \geq \sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)$

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen’s inequality on the $\log$ function: $\sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_iF_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)$

This proves the inequality and therefore the theorem. $\square$

Remark: Intuitively, if we want to describe $\sum \alpha_i F_i$ in terms of $\sum \beta_i G_i$, it is enough to first locate the $i$th term in the sum and then to describe $F_i$ in terms of $G_i$. The theorem is a formalization of this intuition. In the case that $F_i = G_i$, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.