Eigenvalue Bounds

While grading homeworks today, I came across the following bound:

Theorem 1: If A and B are symmetric n\times n matrices with eigenvalues \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n and \mu_1 \geq \mu_2 \geq \ldots \geq \mu_n respectively, then Trace(A^TB) \leq \sum_{i=1}^n \lambda_i \mu_i.

For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the Cauchy Interlacing Theorem:

Theorem 2: If A is an n\times n symmetric matrix and B is an (n-k) \times (n-k) principal submatrix of A, then \lambda_{i+k}(A) \leq \lambda_i(B) \leq \lambda_i(A), where \lambda_i(X) denotes the ith largest eigenvalue of X.

As a corollary we have:

Corollary 1: For any symmetric matrix X, \sum_{i=1}^k X_{ii} \leq \sum_{i=1}^k \lambda_i(X).

Proof: The left-hand side is just the trace of the upper-left k\times k principal submatrix of X, whose eigenvalues are, by Theorem 2, bounded above by the k largest eigenvalues of X. \square
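Here is a quick NumPy sanity check of Corollary 1 on a random symmetric matrix (the code and names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
X = (M + M.T) / 2                            # random symmetric matrix
eigs = np.sort(np.linalg.eigvalsh(X))[::-1]  # eigenvalues, largest first

# sum of first k diagonal entries <= sum of k largest eigenvalues
for k in range(1, n + 1):
    assert X.diagonal()[:k].sum() <= eigs[:k].sum() + 1e-9
```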

The final ingredient we will need is a sort of “majorization” inequality based on Abel summation:

Theorem 3: If x_1,\ldots,x_n and y_1,\ldots,y_n are such that \sum_{i=1}^k x_i \leq \sum_{i=1}^k y_i for all k (with equality when k=n), and c_1 \geq c_2 \geq \ldots \geq c_n, then \sum_{i=1}^n c_ix_i \leq \sum_{i=1}^n c_iy_i.

Proof: We have:

\sum_{i=1}^n c_ix_i = c_n(x_1+\cdots+x_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+\cdots+x_i)

\leq c_n(y_1+\cdots+y_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+\cdots+y_i)

= \sum_{i=1}^n c_iy_i

where the two equalities come from the Abel summation method. The middle inequality uses both hypotheses: each c_i - c_{i+1} is nonnegative, so the prefix-sum bounds \sum_{j=1}^i x_j \leq \sum_{j=1}^i y_j can be applied term by term, and the c_n terms are equal because the full sums agree when k = n (so the sign of c_n does not matter). \square
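A quick numerical check of Theorem 3, on random data built to satisfy its hypotheses (the particular construction of x from y is mine, chosen so the prefix-sum condition holds with equality at k = n):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
y = rng.standard_normal(n)
# shift a unit of mass from the front of y to the back: every proper
# prefix sum drops by 1, and the total sum is unchanged
x = y.copy()
x[0] -= 1.0
x[-1] += 1.0
c = np.sort(rng.standard_normal(n))[::-1]  # decreasing weights

assert all(x[:k].sum() <= y[:k].sum() + 1e-12 for k in range(1, n + 1))
assert np.isclose(x.sum(), y.sum())
assert (c * x).sum() <= (c * y).sum() + 1e-9  # conclusion of Theorem 3
```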

Now, we are finally ready to prove the original theorem:

Proof of Theorem 1: First note that Trace(A^TB) is unchanged if we conjugate A and B by the same orthogonal matrix, so we can without loss of generality assume that A is diagonal with diagonal entries \lambda_1 \geq \ldots \geq \lambda_n, in which case we want to prove that \sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i. But by Corollary 1, we also know that \sum_{i=1}^k B_{ii} \leq \sum_{i=1}^k \mu_i for all k (with equality at k = n, since both sides equal Trace(B)). Since by assumption the \lambda_i are a decreasing sequence, Theorem 3 then implies that \sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i, which is what we wanted to show. \square
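And a NumPy spot-check of Theorem 1 itself on random symmetric matrices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
B = rng.standard_normal((n, n)); B = (B + B.T) / 2
lam = np.sort(np.linalg.eigvalsh(A))[::-1]  # eigenvalues of A, decreasing
mu = np.sort(np.linalg.eigvalsh(B))[::-1]   # eigenvalues of B, decreasing

# Trace(A^T B) <= sum_i lambda_i mu_i
assert np.trace(A.T @ B) <= (lam * mu).sum() + 1e-9
```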

Quadratically Independent Monomials

Today Arun asked me the following question:

“Under what conditions will a set \{p_1,\ldots,p_n\} of polynomials be quadratically independent, in the sense that \{p_1^2, p_1p_2, p_2^2, p_1p_3,\ldots,p_{n-1}p_n, p_n^2\} is a linearly independent set?”

I wasn’t able to make much progress on this general question, but in the specific setting where the p_i are all polynomials in one variable, and we further restrict to just monomials (i.e. p_i(x) = x^{d_i} for some d_i), the condition is just that there are no distinct unordered pairs (i_1,j_1),(i_2,j_2) such that d_{i_1} + d_{j_1} = d_{i_2} + d_{j_2}. Arun was interested in how large such a set could be for a given maximum degree D, so we are left with the following interesting combinatorics problem:

“What is the largest subset S of \{1,\ldots,D\} such that no two distinct pairs of elements of S have the same sum?”

For convenience of notation let n denote the size of S. A simple upper bound is \binom{n+1}{2} \leq 2D-1, since there are \binom{n+1}{2} pairs to take a sum of (allowing an element to be paired with itself), and all pairwise sums lie between 2 and 2D, giving only 2D-1 possible values. We therefore have n = O(\sqrt{D}).

What about lower bounds on n? If we let S be the powers of 2 less than or equal to D, then we get a lower bound of \log_2(D); we can do slightly better by taking the Fibonacci numbers instead, but this still only gives us logarithmic growth. So the question is, can we find sets that grow polynomially in D?
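To make the condition concrete, here is a small checker for the property (the function name is mine), counting pairs with repetition, applied to the powers-of-2 construction:

```python
from itertools import combinations_with_replacement

def has_distinct_pair_sums(S):
    """True if no two unordered pairs from S (repetition allowed) share a sum."""
    sums = [a + b for a, b in combinations_with_replacement(sorted(S), 2)]
    return len(sums) == len(set(sums))

D = 1024
powers = [2 ** i for i in range(11) if 2 ** i <= D]  # 1, 2, 4, ..., 1024
assert has_distinct_pair_sums(powers)
# a set with a repeated sum fails the check: 1 + 4 = 2 + 3
assert not has_distinct_pair_sums([1, 2, 3, 4])
```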

It turns out the answer is yes, and we can do so by choosing randomly. Let each element of \{1,\ldots,D\} be placed in S independently with probability p. Now consider any k, 2 \leq k \leq 2D. If k is odd, then there are at most (k-1)/2 possible pairs that could add up to k: (1,k-1), (2,k-2),…,((k-1)/2,(k+1)/2). The probability of each such pair existing is p^2. Note that these events are independent, since the pairs are disjoint.

S is invalid if and only if there exists some k such that more than one of these pairs is present in S. The probability of any two given pairs being simultaneously present is p^4 (they involve four distinct elements), and there are \binom{(k-1)/2}{2} \leq \binom{D}{2} such pairs of pairs for a given k, hence at most (D-1)\binom{D}{2} \leq D^3/2 pairs of pairs total (since we were just looking at odd k). Therefore, the probability of an odd value of k invalidating S is at most p^4D^3/2.

For even k we get much the same result except that the probability for a given value of k comes out to the slightly more complicated formula \binom{k/2-1}{2}p^4 + (k/2-1)p^3 + p^2 \leq D^2p^4/2 + Dp^3 + p^2, so that the total probability of an even value of k invalidating S is at most p^4D^3/2 + p^3D^2 + p^2D.

Putting this all together gives us a bound of p^4D^3 + p^3D^2 + p^2D. If we set p to be \frac{1}{2}D^{-\frac{3}{4}}, then the probability of S being invalid is at most \frac{1}{16} + \frac{1}{8} D^{-\frac{1}{4}} + \frac{1}{4}D^{-\frac{1}{2}} \leq \frac{7}{16}, so with probability at least \frac{9}{16} a set S with elements chosen randomly with probability \frac{1}{2}D^{-\frac{3}{4}} will be valid. On the other hand, such a set has \frac{1}{2}D^{\frac{1}{4}} elements in expectation, and asymptotically the probability of having at least this many elements is \frac{1}{2}. Therefore, with probability at least \frac{1}{16} a randomly chosen set will be both valid and have size at least \frac{1}{2}D^{\frac{1}{4}}, which shows that the largest value of n is \Omega\left(D^{1/4}\right).

We can actually do better: if all elements are chosen with probability \frac{1}{2}D^{-2/3}, then one can show that the expected number of invalid pairs is at most \frac{1}{8}D^{1/3} + O(1), and hence we can pick randomly with probability p = \frac{1}{2}D^{-2/3}, remove one element of each of the invalid pairs, and still be left with \Omega(D^{1/3}) elements in S.

So, to recap: choosing elements randomly gives us S of size \Omega(D^{1/4}); choosing randomly and then removing any offending pairs gives us S of size \Omega(D^{1/3}); and we have an upper bound of O(D^{1/2}). What is the actual asymptotic answer? I don’t actually know the answer to this, but I thought I’d share what I have so far because I think the techniques involved are pretty cool.

Algebra trick of the day

I’ve decided to start recording algebra tricks as I end up using them. Today I actually have two tricks, but they end up being used together a lot. I don’t know if they have more formal names, but I call them the “trace trick” and the “rank 1 relaxation”.

Suppose that we want to maximize the Rayleigh quotient \frac{x^TAx}{x^Tx} of a matrix A. There are many reasons we might want to do this; for instance, if A is symmetric then the maximum corresponds to the largest eigenvalue. There are also many ways to do this, and the one that I’m about to describe is definitely not the most efficient, but it has the advantage of being flexible, in that it easily generalizes to constrained maximizations, etc.

The first observation is that \frac{x^TAx}{x^Tx} is homogeneous, meaning that scaling x doesn’t affect the result. So, we can assume without loss of generality that x^Tx = 1, and we end up with the optimization problem:

maximize x^TAx

subject to x^Tx = 1

This is where the trace trick comes in. Recall that the trace of a matrix is the sum of its diagonal entries. We are going to use two facts: first, the trace of a 1\times 1 matrix (i.e. a scalar) is just that scalar itself. Second, trace(AB) = trace(BA). (Note, however, that trace(ABC) is not in general equal to trace(BAC), although trace(ABC) is equal to trace(CAB).) We use these two properties as follows. First, we re-write the optimization problem as:

maximize Trace(x^TAx)

subject to Trace(x^Tx) = 1

Second, we re-write it again using the invariance of trace under cyclic permutations:

maximize Trace(Axx^T)

subject to Trace(xx^T) = 1
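These rewrites are just the identities x^TAx = Trace(Axx^T) and x^Tx = Trace(xx^T); a quick NumPy check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

# x^T A x = Trace(A x x^T), by cyclic invariance of the trace
assert np.isclose(x @ A @ x, np.trace(A @ np.outer(x, x)))
# x^T x = Trace(x x^T)
assert np.isclose(x @ x, np.trace(np.outer(x, x)))
```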

Now we make the substitution X = xx^T:

maximize Trace(AX)

subject to Trace(X) = 1, X = xx^T

Finally, note that a matrix X can be written as xx^T if and only if X is positive semi-definite and has rank 1. Therefore, we can further write this as

maximize Trace(AX)

subject to Trace(X) = 1, Rank(X) = 1, X \succeq 0

Aside from the rank 1 constraint, this would be a semidefinite program, a type of problem that can be solved efficiently. What happens if we drop the rank 1 constraint? Then I claim that the solution to this program would be the same as if I had kept the constraint in!

Why is this? Let’s look at the eigendecomposition of X, written as \sum_{i=1}^n \lambda_i x_ix_i^T, with \lambda_i \geq 0 (by positive semidefiniteness) and \sum_{i=1}^n \lambda_i = 1 (by the trace constraint). Let’s also look at Trace(AX), which can be written as \sum_{i=1}^n \lambda_i Trace(Ax_ix_i^T). Since Trace(AX) is just a convex combination of the Trace(Ax_ix_i^T), we might as well have just picked X to be x_ix_i^T, where i is chosen to maximize Trace(Ax_ix_i^T): setting that \lambda_i to 1 and all the rest to 0 maintains all of the constraints without decreasing Trace(AX). What we have shown, then, is that some optimal X has rank 1, so that the rank 1 constraint was unnecessary.
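To see this concretely, here is a NumPy check (illustrative, not an SDP solver) that any feasible X gives Trace(AX) at most the largest eigenvalue of A, with equality at the rank-1 outer product of the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2

# a random feasible point: PSD with unit trace
M = rng.standard_normal((n, n))
X = M @ M.T
X /= np.trace(X)

w, V = np.linalg.eigh(A)
lam_max = w[-1]
X_star = np.outer(V[:, -1], V[:, -1])  # rank 1, trace 1, built from top eigenvector

assert np.trace(A @ X) <= lam_max + 1e-9        # feasible points can't beat lam_max
assert np.isclose(np.trace(A @ X_star), lam_max)  # and the rank-1 point attains it
```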

Technically, X could be a convex combination of rank 1 matrices that all have the same value of Trace(AX), but in that case we could just pick any one of those matrices. So what I have really shown is that at least one optimal point has rank 1, and we can recover such a point from any solution, even if the original solution was not rank 1.

Here is a problem that uses a similar trick. Suppose we want to find x that simultaneously satisfies the equations:

b_i = |a_i^Tx|^2

for each i = 1,\ldots,n (this example was inspired by the recent NIPS paper by Ohlsson, Yang, Dong, and Sastry, although the idea itself goes back at least to Candes, Strohmer, and Voroninski). Note that this is basically equivalent to solving a system of linear equations where we only know each equation up to a sign (or a phase, in the complex case). Therefore, in general, this problem will not have a unique solution. To ensure the solution is unique, let us assume the very strong condition that whenever a_i^TVa_i = 0 for all i = 1,\ldots,n, the matrix V must itself be zero (note: Candes et al. get away with a much weaker condition). Given this, can we phrase the problem as a semidefinite program? I highly recommend trying to solve this problem on your own, or at least reducing it to a rank-constrained SDP, so I’ll include the solution below the fold.
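Without spoiling the solution, here is a NumPy sanity check (variable names mine) of the identity that makes the trace trick relevant to these constraints, namely |a_i^Tx|^2 = Trace(a_ia_i^T xx^T):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
x = rng.standard_normal(n)
a_i = rng.standard_normal(n)

# |a_i^T x|^2 equals Trace(a_i a_i^T  x x^T), by the same trace trick as above
assert np.isclose((a_i @ x) ** 2,
                  np.trace(np.outer(a_i, a_i) @ np.outer(x, x)))
```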
