Sets with Small Intersection

Suppose that we want to construct subsets S_1, \ldots, S_m \subseteq \{1,\ldots,n\} with the following properties:

  1. |S_i| \geq k for all i
  2. |S_i \cap S_j| \leq 1 for all i \neq j

The goal is to construct as large a family of such subsets as possible (i.e., to make m as large as possible). If k \geq 2\sqrt{n}, then up to constants it is not hard to show that the optimal number of sets is \frac{n}{k} (that is, the trivial construction with all sets disjoint is essentially the best we can do).

Here I am interested in the case when k \ll \sqrt{n}. In this case I claim that we can substantially outperform the trivial construction: we can take m = \Omega(n^2 / k^3). The proof is a very nice application of the asymmetric Lovász Local Lemma. (Readers who want to refresh their memory on what the asymmetric LLL says may wish to consult a standard reference before reading on.)

Proof. We will use a randomized construction. For i \in \{1,\ldots,n\}, a \in \{1,\ldots,m\}, let X_{i,a} be the event that i \in S_a. We take the X_{i,a} to be independent, each occurring with probability \frac{2k}{n}. Also define the events

Y_{i,j,a,b} = I[i \in S_a \wedge j \in S_a \wedge i \in S_b \wedge j \in S_b]

Z_{a} = I[|S_a| < k]

It suffices to show that with non-zero probability, all of the Y_{i,j,a,b} and Z_{a} are false. Note that each Y_{i,j,a,b} depends only on those Y_{i',j',a',b'} that share one of its four membership events X_{i,a}, X_{j,a}, X_{i,b}, X_{j,b}, together with Z_a and Z_b. Thus each Y depends on at most 4nm other Y and 2 of the Z, and each Z_a depends on at most n^2m/2 of the Y (those having a as one of their set indices). Also note that P(Y_{i,j,a,b}) = (2k/n)^4 and P(Z_a) \leq \exp(-k/4) (by the Chernoff bound, since \mathbb{E}[|S_a|] = 2k). It thus suffices to find constants y, z \in (0,1) such that

(2k/n)^4 \leq y(1-y)^{4nm}(1-z)^2

\exp(-k/4) \leq z(1-y)^{n^2m/2}

We will guess y = \frac{k}{4n^2m}, z = \frac{1}{2}, in which case the bottom inequality is approximately \exp(-k/4) \leq \frac{1}{2}\exp(-k/8) (which is satisfied for large enough k), and the top inequality is approximately \frac{16k^4}{n^4} \leq \frac{k}{16n^2m}\exp(-k/n), which is satisfied for m \leq \frac{n^2}{256ek^3} (assuming k \leq n). Hence in particular we can indeed take m = \Omega(n^2/k^3), as claimed.
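The randomized construction is easy to simulate. The sketch below (with illustrative parameters n = 100, k = 4) does not implement the Local Lemma argument itself; instead it simply discards any candidate set that is too small or that intersects a previously kept set in more than one element. The lemma is what guarantees that, for m = \Omega(n^2/k^3), a valid family exists at all.

```python
import random

random.seed(0)         # fixed seed for reproducibility
n, k = 100, 4          # k well below sqrt(n) = 10
m = n * n // k**3      # target family size, on the order of n^2 / k^3
p = 2 * k / n          # each element joins each set independently with prob 2k/n

# Sample m candidate sets by independent coin flips.
candidates = [{i for i in range(n) if random.random() < p} for _ in range(m)]

# Greedily keep candidates that satisfy both constraints against the kept family.
kept = []
for S in candidates:
    if len(S) >= k and all(len(S & T) <= 1 for T in kept):
        kept.append(S)
```

By construction every kept set has size at least k and pairwise intersections of size at most 1; how many survive depends on the random draws.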


Linear algebra fact

Here is an interesting linear algebra fact: let A be an n \times n matrix and u be a vector such that u^{\top}A = \lambda u^{\top}. Then for any matrix B such that \lambda I - B is invertible, u^{\top}((A-B)(\lambda I - B)^{-1}) = u^{\top}.

The proof is just basic algebra: u^{\top}(A-B)(\lambda I - B)^{-1} = (\lambda u^{\top} - u^{\top}B)(\lambda I - B)^{-1} = u^{\top}(\lambda I - B)(\lambda I - B)^{-1} = u^{\top}.

Why care about this? Let’s imagine that A is a (not necessarily symmetric) stochastic matrix, so 1^{\top}A = 1^{\top}. Let A-B be a low-rank approximation to A (so A-B retains the large singular values of A, and B consists of the small ones). Unfortunately, since A is not symmetric, this low-rank approximation need not preserve the eigenvalues of A, and so we need not have 1^{\top}(A-B) = 1^{\top}. The factor (I-B)^{-1} (the identity above with \lambda = 1) can be thought of as a “correction” term: the resulting matrix (A-B)(I-B)^{-1} is still low-rank, but now preserves the eigenvector 1^{\top} of A.
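The identity is easy to verify numerically. The sketch below uses a small column-stochastic A (so that u is the all-ones vector and \lambda = 1) together with an arbitrary B; all of the matrix entries are made up for illustration.

```python
def matmul(X, Y):
    """Product of two 2x2 matrices (as nested lists)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]

def vecmat(u, X):
    """Row vector times a 2x2 matrix."""
    return [sum(u[t] * X[t][j] for t in range(2)) for j in range(2)]

def inv2(X):
    """Inverse of a 2x2 matrix."""
    a, b = X[0]
    c, d = X[1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[0.7, 0.4], [0.3, 0.6]]   # columns sum to 1, so [1,1] A = [1,1]
B = [[0.2, 0.1], [0.05, 0.3]]  # arbitrary B with I - B invertible

I_minus_B = [[1 - B[0][0], -B[0][1]], [-B[1][0], 1 - B[1][1]]]
A_minus_B = [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]

u = [1.0, 1.0]
result = vecmat(u, matmul(A_minus_B, inv2(I_minus_B)))  # should equal u
```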

Maximal Maximum-Entropy Sets

Consider a probability distribution {p(y)} on a space {\mathcal{Y}}. Suppose we want to construct a set {\mathcal{P}} of probability distributions on {\mathcal{Y}} such that {p(y)} is the maximum-entropy distribution over {\mathcal{P}}:

\displaystyle H(p) = \max_{q \in \mathcal{P}} H(q),

where {H(p) = \mathbb{E}_{p}[-\log p(y)]} is the entropy. We call such a set a maximum-entropy set for {p}. Furthermore, we would like {\mathcal{P}} to be as large as possible, subject to the constraint that {\mathcal{P}} is convex.

Does such a maximal convex maximum-entropy set {\mathcal{P}} exist? That is, is there some convex set {\mathcal{P}} such that {p} is the maximum-entropy distribution in {\mathcal{P}}, and for any {\mathcal{Q}} satisfying the same property, {\mathcal{Q} \subseteq \mathcal{P}}? It turns out that the answer is yes, and there is even a simple characterization of {\mathcal{P}}:

Proposition 1 For any distribution {p} on {\mathcal{Y}}, the set

\displaystyle \mathcal{P} = \{q \mid \mathbb{E}_{q}[-\log p(y)] \leq H(p)\}

is the maximal convex maximum-entropy set for {p}.

To see why this is, first note that, clearly, {p \in \mathcal{P}}, and for any {q \in \mathcal{P}} we have

\displaystyle \begin{array}{rcl} H(q) &=& \mathbb{E}_{q}[-\log q(y)] \\ &\leq& \mathbb{E}_{q}[-\log p(y)] \\ &\leq& H(p), \end{array}

so {p} is indeed the maximum-entropy distribution in {\mathcal{P}}. On the other hand, let {\mathcal{Q}} be any other convex set whose maximum-entropy distribution is {p}. Then in particular, for any {q \in \mathcal{Q}}, we must have {H((1-\epsilon)p + \epsilon q) \leq H(p)}. Let us suppose for the sake of contradiction that {q \not\in \mathcal{P}}, so that {\mathbb{E}_{q}[-\log p(y)] > H(p)}. Then we have

\displaystyle \begin{array}{rcl} H((1-\epsilon)p + \epsilon q) &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}[-\log((1-\epsilon)p(y)+\epsilon q(y))] \\ &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}[-\log(p(y) + \epsilon (q(y)-p(y)))] \\ &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}\left[-\log(p(y)) - \epsilon \frac{q(y)-p(y)}{p(y)} + \mathcal{O}(\epsilon^2)\right] \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) - \epsilon \mathbb{E}_{(1-\epsilon)p+\epsilon q}\left[\frac{q(y)-p(y)}{p(y)}\right] + \mathcal{O}(\epsilon^2) \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) - \epsilon^2 \mathbb{E}_{q}\left[\frac{q(y)-p(y)}{p(y)}\right] + \mathcal{O}(\epsilon^2) \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) + \mathcal{O}(\epsilon^2), \end{array}

where the second-to-last step uses \mathbb{E}_{p}\left[\frac{q(y)-p(y)}{p(y)}\right] = \sum_y (q(y)-p(y)) = 0.

Since {\mathbb{E}_{q}[-\log p(y)] - H(p) > 0}, for sufficiently small {\epsilon} this will exceed {H(p)}, which is a contradiction. Therefore we must have {q \in \mathcal{P}} for all {q \in \mathcal{Q}}, and hence {\mathcal{Q} \subseteq \mathcal{P}}, so that {\mathcal{P}} is indeed the maximal convex maximum-entropy set for {p}.
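Both directions of the proposition are easy to check numerically on a small example: sweeping a grid of distributions q on three outcomes, every q with \mathbb{E}_q[-\log p(y)] \leq H(p) should indeed satisfy H(q) \leq H(p). The distribution p below is an arbitrary illustrative choice.

```python
from math import log

def entropy(q):
    return sum(-x * log(x) for x in q if x > 0)

def cross_entropy(q, p):
    # E_q[-log p(y)]; requires p to have full support.
    return sum(-x * log(y) for x, y in zip(q, p))

p = [0.5, 0.3, 0.2]
Hp = entropy(p)

# Collect all grid distributions q that lie in the candidate set P.
members = []
steps = 20
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        q = [i / steps, j / steps, (steps - i - j) / steps]
        if cross_entropy(q, p) <= Hp:   # membership condition defining P
            members.append(q)
```

Every member of the set should have entropy at most H(p), matching the chain of inequalities above.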

Convex Conditions for Strong Convexity

An important concept in online learning and convex optimization is that of strong convexity: a twice-differentiable function f is said to be strongly convex with respect to a norm \|\cdot\| if

z^T\frac{\partial^2 f}{\partial x^2}z \geq \|z\|^2

for all z (for functions that are not twice-differentiable, there is an analogous criterion in terms of the Bregman divergence). To check strong convexity, then, we basically need to check a condition on the Hessian, namely that z^THz \geq \|z\|^2. So, under what conditions does this hold?

For the l^2 norm, the answer is easy: z^THz \geq \|z\|_2^2 if and only if H \succeq I (i.e., H-I is positive semidefinite). This can be shown in many ways, perhaps the easiest is by noting that z^THz-\|z\|_2^2 = z^T(H-I)z.

For the l^{\infty} norm, the answer is a bit trickier but still not too complicated. Recall that we want necessary and sufficient conditions under which z^THz \geq \|z\|_{\infty}^2. Note that this is equivalent to asking that z^THz \geq (z_i)^2 for each coordinate i of z, which in turn is equivalent to H \succeq e_ie_i^T for each coordinate vector e_i (these are the vectors that are 1 in the ith coordinate and 0 everywhere else).

More generally, for any norm \|\cdot\|, there exists a dual norm \|\cdot\|_* which satisfies, among other properties, the relationship \|z\| = \sup_{\|w\|_* = 1} w^Tz. So, in general, z^THz \geq \|z\|^2 is equivalent to asking that z^THz \geq (w^Tz)^2 for all w with \|w\|_* = 1. But this is in turn equivalent to asking that

H \succeq ww^T for all w such that \|w\|_* = 1.

In fact, it suffices to pick a subset of the w whose convex hull is the entire unit ball \{w : \|w\|_* \leq 1\}; this is why we were able to obtain such a clean formulation in the l^{\infty} case: the dual norm to l^{\infty} is l^1, whose unit ball is the cross-polytope, a polytope with only 2n vertices (namely, the signed unit vectors \pm e_i).

We can also derive a simple (but computationally expensive) criterion for l^1 strong convexity: here the dual norm is l^{\infty}, whose unit ball is the n-dimensional hypercube, with vertices given by all 2^n vectors of the form [ \pm 1 \ \cdots \ \pm 1]. Thus z^THz \geq \|z\|_1^2 if and only if H \succeq ss^T for all 2^n sign vectors s.
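In two dimensions the l^1 criterion can be checked exhaustively: there are only four sign vectors, and for a 2 \times 2 symmetric matrix positive semidefiniteness reduces to a nonnegative diagonal and determinant. The matrix H below is an illustrative choice.

```python
from itertools import product

def is_psd_2x2(M):
    """PSD test for a symmetric 2x2 matrix: nonnegative diagonal and determinant."""
    return (M[0][0] >= 0 and M[1][1] >= 0
            and M[0][0] * M[1][1] - M[0][1] * M[1][0] >= 0)

H = [[3.0, 1.0], [1.0, 3.0]]

# Criterion: H - s s^T is PSD for every sign vector s in {-1, +1}^2.
sign_condition = all(
    is_psd_2x2([[H[i][j] - s[i] * s[j] for j in range(2)] for i in range(2)])
    for s in product([-1, 1], repeat=2)
)

# Spot-check the implied inequality z^T H z >= ||z||_1^2 on a grid of z.
quad_ok = all(
    H[0][0]*z1*z1 + 2*H[0][1]*z1*z2 + H[1][1]*z2*z2
    >= (abs(z1) + abs(z2))**2 - 1e-9
    for z1 in [x / 4 for x in range(-8, 9)]
    for z2 in [x / 4 for x in range(-8, 9)]
)
```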

Finally, we re-examine the l^2 case; even though the l^2-ball is not a polytope, we were still able to obtain a very simple expression. This was because the condition H \succeq I manages to capture simultaneously all dual vectors such that w^Tw \leq 1. We thus have the general criterion:

Theorem. H \succeq M_jM_j^T for j = 1,\ldots,m if and only if z^THz \geq \|z\|^2 for all z, where \|\cdot\| is the norm whose dual unit ball is the convex hull of the transformed unit balls M_j\mathcal{B}_j, j = 1, \ldots, m; here \mathcal{B}_j is the l^2 unit ball whose dimension matches the number of columns of M_j.

Proof. H \succeq M_jM_j^T for all j if and only if z^THz \geq \max_{j=1}^m \|M_j^Tz\|_2^2 for all z. Now note that \|M_j^Tz\|_2 = \sup_{w \in \mathcal{B}_j} w^TM_j^Tz = \sup_{w' \in M_j\mathcal{B}_j} (w')^Tz. If we define \|z\| = \max_{j=1}^m \|M_j^Tz\|_2, it is then apparent that the dual norm unit ball is the convex hull of the M_j\mathcal{B}_j. \square

Convexity counterexample

Here’s a fun counterexample: a function \mathbb{R}^n \to \mathbb{R} that is jointly convex in any n-1 of the variables, but not in all variables at once. The function is

f(x_1,\ldots,x_n) = \frac{1}{2}(n-1.5)\sum_{i=1}^n x_i^2 - \sum_{i < j} x_ix_j

To see why this is, note that the Hessian of f is equal to

\left[ \begin{array}{cccc} n-1.5 & -1 & \cdots & -1 \\ -1 & n-1.5 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & n-1.5 \end{array} \right]

This matrix is equal to (n-0.5)I - J, where I is the identity matrix and J is the all-ones matrix, which is rank 1 and whose single non-zero eigenvalue is n. Therefore, this matrix has n-1 eigenvalues of n-0.5, as well as a single eigenvalue of -0.5, and hence is not positive definite.

On the other hand, any principal submatrix of size (n-1) \times (n-1) is of the form (n-0.5)I-J, where now I and J are (n-1) \times (n-1). This matrix has n-2 eigenvalues equal to n-0.5, together with a single eigenvalue of n-0.5-(n-1) = 0.5, and hence is positive definite. Therefore, the Hessian is positive definite when restricted to any n-1 of the variables, so f is convex in any n-1 variables, but not in all n variables jointly.
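The two claims are easy to probe numerically. Since f is a homogeneous quadratic, f(z) = \frac{1}{2}z^THz, so the sign of f(z) directly reads off the curvature along the direction z. Taking n = 4 for illustration:

```python
n = 4

def f(x):
    # f(x) = (1/2)(n - 1.5) sum_i x_i^2 - sum_{i<j} x_i x_j
    return 0.5 * (n - 1.5) * sum(t * t for t in x) - sum(
        x[i] * x[j] for i in range(n) for j in range(i + 1, n))

# Along the all-ones direction the quadratic form is negative: f is not convex.
f_ones = f([1.0] * n)   # 0.5*2.5*4 - 6 = -1

# With one variable fixed to 0, the restricted Hessian is positive definite,
# so f should be positive at any nonzero point of that coordinate hyperplane.
samples = [[1, -1, 0, 0], [1, 1, 1, 0], [2, -1, 3, 0], [0, 1, 0, 0]]
restricted_positive = all(f(z) > 0 for z in samples)
```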

A Fun Optimization Problem

I spent the last several hours trying to come up with an efficient algorithm for the following problem:

Problem: Suppose that we have a sequence of l pairs of non-negative numbers (a_1,b_1),\ldots,(a_l,b_l) such that \sum_{i=1}^l a_i \leq A and \sum_{i=1}^l b_i \leq B. Devise an efficient algorithm to find the k pairs (a_{i_1},b_{i_1}),\ldots,(a_{i_k},b_{i_k}) that maximize

\left[\sum_{r=1}^k a_{i_r}\log(a_{i_r}/b_{i_r})\right] + \left[A-\sum_{r=1}^k a_{i_r}\right]\log\left(\frac{A-\sum_{r=1}^k a_{i_r}}{B-\sum_{r=1}^k b_{i_r}}\right).
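For small l, the objective can at least be evaluated directly and optimized by brute force over all \binom{l}{k} subsets. The sketch below does exactly that; the instance is made up purely for illustration (it satisfies \sum_i a_i \leq A and \sum_i b_i \leq B), and we adopt the convention 0 \log 0 = 0.

```python
from itertools import combinations
from math import log

def xlogx(a, b):
    """a * log(a/b), with the convention that it is 0 when a = 0."""
    return a * log(a / b) if a > 0 else 0.0

def objective(pairs, subset, A, B):
    sa = sum(pairs[i][0] for i in subset)
    sb = sum(pairs[i][1] for i in subset)
    return sum(xlogx(*pairs[i]) for i in subset) + xlogx(A - sa, B - sb)

def brute_force(pairs, k, A, B):
    """Exhaustively find the k-subset maximizing the objective."""
    return max(combinations(range(len(pairs)), k),
               key=lambda sub: objective(pairs, sub, A, B))

# Illustrative instance (not from the original problem statement).
pairs = [(1.0, 2.0), (2.0, 1.0), (1.0, 1.0), (3.0, 1.0)]
A, B, k = 10.0, 8.0, 2
best = brute_force(pairs, k, A, B)
```

This is exponential in k and only meant as a correctness baseline against which a faster algorithm could be tested.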

Commentary: I don’t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in O\left(\frac{l \log(l)}{\epsilon}\right) time and finds 2k pairs that do at least (1-\epsilon) times as well as the best set of k pairs. It’s possible I need to assume something like \sum_{i=1}^l a_i \leq A/2 instead of just A (and similarly for the b_i), although I’m happy to make that assumption.

While attempting to solve this problem, I’ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I’d be interested to know.

Eigenvalue Bounds

While grading homeworks today, I came across the following bound:

Theorem 1: If A and B are symmetric n\times n matrices with eigenvalues \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n and \mu_1 \geq \mu_2 \geq \ldots \geq \mu_n respectively, then Trace(A^TB) \leq \sum_{i=1}^n \lambda_i \mu_i.

For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the Cauchy Interlacing Theorem:

Theorem 2: If A is an n\times n symmetric matrix and B is an (n-k) \times (n-k) principal submatrix of A, then \lambda_{i+k}(A) \leq \lambda_i(B) \leq \lambda_i(A), where \lambda_i(X) denotes the ith largest eigenvalue of X.

As a corollary we have:

Corollary 1: For any symmetric matrix X, \sum_{i=1}^k X_{ii} \leq \sum_{i=1}^k \lambda_i(X).

Proof: The left-hand side is just the trace of the upper-left k\times k principal submatrix of X, whose eigenvalues are by Theorem 2 bounded by the k largest eigenvalues of X. \square

The final ingredient we will need is a sort of “majorization” inequality based on Abel summation:

Theorem 3: If x_1,\ldots,x_n and y_1,\ldots,y_n are such that \sum_{i=1}^k x_i \leq \sum_{i=1}^k y_i for all k (with equality when k=n), and c_1 \geq c_2 \geq \ldots \geq c_n, then \sum_{i=1}^n c_ix_i \leq \sum_{i=1}^n c_iy_i.

Proof: We have:

\sum_{i=1}^n c_ix_i = c_n(x_1+\cdots+x_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+\cdots+x_i) \leq c_n(y_1+\cdots+y_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+\cdots+y_i) = \sum_{i=1}^n c_iy_i

where the equalities come from the Abel summation method. \square
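Theorem 3 can be sanity-checked on a tiny made-up instance: the prefix sums of x are dominated by those of y (with equality at k = n), and the weights c are decreasing, so the weighted sum of x cannot exceed that of y.

```python
x = [1, 2, 3]
y = [2, 2, 2]
c = [3, 2, 1]  # decreasing weights

# Hypotheses of Theorem 3: prefix-sum domination, with equality at k = n.
prefixes_ok = all(sum(x[:k]) <= sum(y[:k]) for k in range(1, len(x) + 1))
equal_at_n = sum(x) == sum(y)

lhs = sum(ci * xi for ci, xi in zip(c, x))  # sum_i c_i x_i = 3 + 4 + 3 = 10
rhs = sum(ci * yi for ci, yi in zip(c, y))  # sum_i c_i y_i = 6 + 4 + 2 = 12
```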

Now, we are finally ready to prove the original theorem:

Proof of Theorem 1: First note that Trace(A^TB) is invariant under applying an orthogonal change of basis to A and B simultaneously (Trace((Q^TAQ)^T(Q^TBQ)) = Trace(Q^TA^TBQ) = Trace(A^TB)), so we can without loss of generality assume that A is diagonal, with diagonal entries \lambda_1 \geq \ldots \geq \lambda_n; in this case we want to prove that \sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i. By Corollary 1, we know that \sum_{i=1}^k B_{ii} \leq \sum_{i=1}^k \mu_i for all k, with equality when k = n (both sides are then the trace of B). Since the \lambda_i are a decreasing sequence, Theorem 3 then implies that \sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i, which is what we wanted to show. \square
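Theorem 1 can also be checked numerically in two dimensions, where the eigenvalues of a symmetric matrix have a closed form; the matrices below are arbitrary illustrative choices.

```python
from math import sqrt

def eig2(M):
    """Eigenvalues of a symmetric 2x2 matrix, largest first."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    mean = (a + d) / 2
    r = sqrt(((a - d) / 2) ** 2 + b * b)
    return mean + r, mean - r

def trace_prod(A, B):
    """Trace(A^T B) = sum of entrywise products."""
    return sum(A[i][j] * B[i][j] for i in range(2) for j in range(2))

A = [[2.0, 1.0], [1.0, 3.0]]
B = [[1.0, -1.0], [-1.0, 4.0]]

lam = eig2(A)
mu = eig2(B)
lhs = trace_prod(A, B)                 # Trace(A^T B)
rhs = lam[0] * mu[0] + lam[1] * mu[1]  # sum_i lambda_i mu_i
```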