## Convexity counterexample

Here’s a fun counterexample: a function $\mathbb{R}^n \to \mathbb{R}$ that is jointly convex in any $n-1$ of the variables, but not in all variables at once. The function is

$f(x_1,\ldots,x_n) = \frac{1}{2}(n-1.5)\sum_{i=1}^n x_i^2 - \sum_{i < j} x_ix_j$

To see why this is the case, note that the Hessian of $f$ is equal to

$\left[ \begin{array}{cccc} n-1.5 & -1 & \cdots & -1 \\ -1 & n-1.5 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & n-1.5 \end{array} \right]$

This matrix is equal to $(n-0.5)I - J$, where $I$ is the identity matrix and $J$ is the all-ones matrix, which is rank 1 and whose single non-zero eigenvalue is $n$. Therefore, this matrix has $n-1$ eigenvalues of $n-0.5$, as well as a single eigenvalue of $-0.5$, and hence is not positive definite.

On the other hand, restricting to any $n-1$ of the variables (holding the remaining one fixed) restricts the Hessian to an $(n-1) \times (n-1)$ principal submatrix, which is again of the form $(n-0.5)I-J$, but where now $J$ is only $(n-1) \times (n-1)$. This matrix has $n-2$ eigenvalues of $n-0.5$, together with a single eigenvalue of $0.5$, and hence is positive definite. Therefore, the Hessian is positive definite when restricted to any $n-1$ variables, and hence $f$ is convex in any $n-1$ variables, but not in all $n$ variables jointly.
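This is easy to sanity-check numerically; the following NumPy snippet (with $n=5$, chosen arbitrarily) confirms both eigenvalue claims:

```python
import numpy as np

n = 5
# Hessian of f: (n - 0.5) * I - J, where J is the all-ones matrix.
H = (n - 0.5) * np.eye(n) - np.ones((n, n))

# Full Hessian: n-1 eigenvalues equal to n - 0.5, one equal to -0.5,
# so H is not positive semidefinite.
eigs = np.sort(np.linalg.eigvalsh(H))

# Restricting to any n-1 variables: delete one row and column.
# The resulting principal submatrix has n-2 eigenvalues of n - 0.5
# and a single eigenvalue of 0.5, so it is positive definite.
sub = np.delete(np.delete(H, 0, axis=0), 0, axis=1)
sub_eigs = np.sort(np.linalg.eigvalsh(sub))
```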

## Probabilistic Abstractions I

(This post represents research in progress. I may think about these concepts entirely differently a few months from now, but for my own benefit I’m trying to exposit on them in order to force myself to understand them better.)

For many inference tasks, especially ones with either non-linearities or non-convexities, it is common to use particle-based methods such as beam search, particle filters, sequential Monte Carlo, or Markov Chain Monte Carlo. In these methods, we approximate a distribution by a collection of samples from that distribution, then update the samples as new information is added. For instance, in beam search, if we are trying to build up a tree, we might build up a collection of $K$ samples for the left and right subtrees, then look at all $K^2$ ways of combining them into the entire tree, but then downsample again to the $K$ trees with the highest scores. This allows us to search through the exponentially large space of all trees efficiently (albeit at the cost of possibly missing high-scoring trees).

One major problem with such particle-based methods is diversity: the particles will tend to cluster around the highest-scoring mode, rather than exploring multiple local optima if they exist. This can be bad because it makes learning algorithms overly myopic. Another problem, especially in combinatorial domains, is difficulty of partial evaluation: if we have some training data that we are trying to fit to, and we have chosen settings of some, but not all, variables in our model, it can be difficult to know if that setting is on the right track (for instance, it can be difficult to know whether a partially-built tree is a promising candidate or not). For time-series modeling, this isn’t nearly as large of a problem, since we can evaluate against a prefix of the time series to get a good idea (this perhaps explains the success of particle filters in these domains).

I’ve been working on a method that tries to deal with both of these problems, which I call probabilistic abstractions. The idea is to improve the diversity of particle-based methods by creating “fat” particles which cover multiple states at once; the reason that such fat particles help is that they allow us to first optimize for coverage (by placing down relatively large particles that cover the entire space), then later worry about more local details (by placing down many particles near promising-looking local optima).

To be more concrete, if we have a probability distribution over a set of random variables $(X_1,\ldots,X_d)$, then our particles will be sets obtained by specifying the values of some of the $X_i$ and leaving the rest to vary arbitrarily. So, for instance, if $d=4$, then $\{(X_1,X_2,X_3,X_4) \mid X_2 = 1, X_4 = 7\}$ might be a possible “fat” particle.

By choosing some number of fat particles and assigning probabilities to them, we are implicitly specifying a polytope of possible probability distributions; for instance, if our particles are $S_1,\ldots,S_k$, and we assign probability $\pi_i$ to $S_i$, then we have the polytope of distributions $p$ that satisfy the constraints $p(S_1) = \pi_1, p(S_2) = \pi_2$, etc.

Given such a polytope, is there a way to pick a canonical representative from it? One such representative is the maximum entropy distribution in that polytope. This distribution has the property of minimizing the worst-case relative entropy to any other distribution within the polytope (and that worst-case relative entropy is just the entropy of the distribution).

Suppose that we have a polytope for two independent distributions, and we want to compute the polytope for their product. This is easy — just take the Cartesian product of each particle of the first distribution with each particle of the second distribution. If each individual distribution has $k$ particles, then the product distribution has $k^2$ particles — this could be problematic computationally, so we also want a way to narrow down to a subset of the $k$ most informative particles: the $k$ particles whose corresponding polytope has the smallest maximum entropy. Finding this subset is NP-hard in general, but I’m currently working on good heuristics for computing it.
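As a concrete sketch of the product step — entirely illustrative, none of these names come from an actual implementation — we can represent a fat particle as a dict fixing some variables (the unfixed ones vary freely), so that products of particles are dict merges with masses multiplying:

```python
from itertools import product

# Fat particles for two independent distributions: each particle is a
# (fixed-variables, mass) pair. Variable names are purely illustrative.
particles_1 = [({"X1": 0}, 0.6), ({"X1": 1, "X2": 1}, 0.4)]
particles_2 = [({"X3": 7}, 0.5), ({"X4": 2}, 0.5)]

def cross(ps, qs):
    """Particles of the product distribution: cartesian products of the
    input particles, with the corresponding masses multiplied."""
    return [({**s, **t}, a * b) for (s, a), (t, b) in product(ps, qs)]

joint = cross(particles_1, particles_2)  # k particles each -> k^2 particles
```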

Next, suppose that we have a distribution on a space $X$ and want to apply a function $f : X \to Y$ to it. If $f$ is a complicated function, it might be difficult to propagate the fat particles (even though it would have been easy to propagate particles composed of single points). To get around this, we need what is called a valid abstraction of $f$: a function $\tilde{f} : 2^X \to 2^Y$ such that $\tilde{f}(S) \supseteq f(S)$ for all $S \in 2^X$. In this case, if we map a particle $S$ to $\tilde{f}(S)$, our equality constraint on the mass assigned to $S$ becomes a lower bound on the mass assigned to $\tilde{f}(S)$ — we thus still have a polytope of possible probability distributions. Depending on the exact structure of the particles (i.e. the exact way in which the different sets overlap), it may be necessary to add additional constraints to the polytope to get good performance — I feel like I have some understanding of this, but it’s something I’ll need to investigate empirically as well. It’s also interesting to note that $\tilde{f}$ (when combined with conditioning on data, which is discussed below) allows us to assign partial credit to promising particles, which was the other property I discussed at the beginning.

Finally, suppose that I want to condition on data. In this case the polytope approach doesn’t work as well, because conditioning on data can blow up the polytope by an arbitrarily large amount. Instead, we just take the maximum-entropy distribution in our polytope and treat that as our “true” distribution, then condition. I haven’t been able to make any formal statements about this procedure, but it seems to work at least somewhat reasonably. It is worth noting that conditioning may not be straightforward, since the likelihood function may not be constant across a given fat particle. To deal with this, we can replace the likelihood function by its average (which I think can be justified in terms of maximum entropy as well, although the details here are a bit hazier).

So, in summary, we have a notion of fat particles, which provide better coverage than point particles, and can combine them, apply functions to them, subsample them, and condition on data. This is essentially all of the operations we want to be able to apply for particle-based methods, so we in theory should now be able to implement versions of these particle-based methods that get better coverage.

## Pairwise Independence vs. Independence

For collections of independent random variables, the Chernoff bound and related bounds give us very sharp concentration inequalities — if $X_1,\ldots,X_n$ are independent, then their sum has a distribution that decays like $e^{-x^2}$. For random variables that are only pairwise independent, the strongest bound we have is Chebyshev’s inequality, which says that their sum decays like $\frac{1}{x^2}$.

The point of this post is to construct an equality case for Chebyshev: a collection of pairwise independent random variables whose sum does not satisfy the concentration bound of Chernoff, and instead decays like $\frac{1}{x^2}$.

The construction is as follows: let $X_1,\ldots,X_d$ be independent uniform $\{0,1\}$-valued random variables, and for any $S \subseteq \{1,\ldots,d\}$, let $Y_S = \sum_{i \in S} X_i$, where the sum is taken mod 2. Then we can easily check that the $Y_S$ are pairwise independent. Now consider the random variable $Z = \sum_{S} Y_S$. If any of the $X_i$ is equal to 1, then we can pair up the subsets by either adding or removing $i$ from $S$ to get the other element of the pair; each such pair $(S, S \cup \{i\})$ satisfies $Y_S + Y_{S \cup \{i\}} = 1$, so $Z = 2^{d-1}$ in this case. On the other hand, if all of the $X_i$ are equal to 0, then $Z = 0$ as well. Thus, with probability $\frac{1}{2^d}$, $Z$ deviates from its mean by $2^{d-1}-\frac{1}{2}$, whereas the variance of $Z$ is $2^{d-2}-\frac{1}{4}$. The bound on this probability from Chebyshev is $\frac{2^{d-2}-1/4}{(2^{d-1}-1/2)^2} = \frac{1}{2^d-1}$, which is very close to $\frac{1}{2^d}$, so this constitutes something very close to the Chebyshev equality case.
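A quick simulation (just a sanity check, not part of the original argument) confirms that $Z$ only ever takes the values $0$ and $2^{d-1}$:

```python
import itertools
import random

random.seed(0)
d = 4

def sample_Z(d):
    """Draw X_1,...,X_d uniformly from {0,1} and return Z = sum_S Y_S,
    where Y_S is the mod-2 sum of the X_i over i in S."""
    x = [random.randint(0, 1) for _ in range(d)]
    return sum(sum(x[i] for i in S) % 2
               for r in range(d + 1)
               for S in itertools.combinations(range(d), r))

values = {sample_Z(d) for _ in range(500)}
# Z = 2^(d-1) = 8 unless all X_i are 0 (probability 1/2^d), in which case Z = 0.
```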

Anyways, I just thought this was a cool example that demonstrates the difference between pairwise and full independence.

## A Fun Optimization Problem

I spent the last several hours trying to come up with an efficient algorithm for the following problem:

Problem: Suppose that we have a sequence of $l$ pairs of non-negative numbers $(a_1,b_1),\ldots,(a_l,b_l)$ such that $\sum_{i=1}^l a_i \leq A$ and $\sum_{i=1}^l b_i \leq B$. Devise an efficient algorithm to find the $k$ pairs $(a_{i_1},b_{i_1}),\ldots,(a_{i_k},b_{i_k})$ that maximize

$\left[\sum_{r=1}^k a_{i_r}\log(a_{i_r}/b_{i_r})\right] + \left[A-\sum_{r=1}^k a_{i_r}\right]\log\left(\frac{A-\sum_{r=1}^k a_{i_r}}{B-\sum_{r=1}^k b_{i_r}}\right).$
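For concreteness, here is a brute-force reference implementation of the objective (all names are mine, and the search is exponential in $k$ — the whole point of the problem is to do better):

```python
import math
from itertools import combinations

def objective(pairs, chosen, A, B):
    """The quantity to maximize, for a chosen tuple of indices."""
    sa = sum(pairs[i][0] for i in chosen)
    sb = sum(pairs[i][1] for i in chosen)
    picked = sum(a * math.log(a / b) for a, b in (pairs[i] for i in chosen))
    return picked + (A - sa) * math.log((A - sa) / (B - sb))

def brute_force(pairs, k, A, B):
    """Exhaustively search all k-subsets (exponential; reference only)."""
    return max(combinations(range(len(pairs)), k),
               key=lambda c: objective(pairs, c, A, B))

pairs = [(1.0, 2.0), (3.0, 1.0), (0.5, 0.5)]
best = brute_force(pairs, k=2, A=10.0, B=10.0)
```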

Commentary: I don’t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in $O\left(\frac{l \log(l)}{\epsilon}\right)$ time and finds $2k$ pairs that do at least $1-\epsilon$ times as well as the best set of $k$ pairs. It’s possible I need to assume something like $\sum_{i=1}^l a_i \leq A/2$ instead of just $A$ (and similarly for the $b_i$), although I’m happy to make that assumption.

While attempting to solve this problem, I’ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I’d be interested to know.

## Eigenvalue Bounds

While grading homeworks today, I came across the following bound:

Theorem 1: If $A$ and $B$ are symmetric $n\times n$ matrices with eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ and $\mu_1 \geq \mu_2 \geq \ldots \geq \mu_n$ respectively, then $\mathrm{Trace}(A^TB) \leq \sum_{i=1}^n \lambda_i \mu_i$.

For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the Cauchy Interlacing Theorem:

Theorem 2: If $A$ is an $n\times n$ symmetric matrix and $B$ is an $(n-k) \times (n-k)$ principal submatrix of $A$, then $\lambda_{i+k}(A) \leq \lambda_i(B) \leq \lambda_i(A)$, where $\lambda_i(X)$ is the $i$th largest eigenvalue of $X$.

As a corollary we have:

Corollary 1: For any symmetric matrix $X$, $\sum_{i=1}^k X_{ii} \leq \sum_{i=1}^k \lambda_i(X)$.

Proof: The left-hand side is just the trace of the upper-left $k\times k$ principal submatrix of $X$, whose eigenvalues are, by Theorem 2, bounded above by the $k$ largest eigenvalues of $X$. $\square$

The final ingredient we will need is a sort of “majorization” inequality based on Abel summation:

Theorem 3: If $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ are such that $\sum_{i=1}^k x_i \leq \sum_{i=1}^k y_i$ for all $k$ (with equality when $k=n$), and $c_1 \geq c_2 \geq \ldots \geq c_n$, then $\sum_{i=1}^n c_ix_i \leq \sum_{i=1}^n c_iy_i$.

Proof: We have:

$\sum_{i=1}^n c_ix_i = c_n(x_1+\cdots+x_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+\cdots+x_i) \leq c_n(y_1+\cdots+y_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+\cdots+y_i) = \sum_{i=1}^n c_iy_i$

where the equalities come from the Abel summation method. $\square$

Now, we are finally ready to prove the original theorem:

Proof of Theorem 1: First note that $\mathrm{Trace}(A^TB)$ is unchanged if we replace $A$ and $B$ by $Q^TAQ$ and $Q^TBQ$ for an orthogonal matrix $Q$, so we can without loss of generality assume that $A$ is diagonal with diagonal entries $\lambda_1 \geq \ldots \geq \lambda_n$, in which case we want to prove that $\sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i$. But by Corollary 1, we know that $\sum_{i=1}^k B_{ii} \leq \sum_{i=1}^k \mu_i$ for all $k$, with equality when $k=n$ (both sides then equal the trace of $B$). Since by assumption the $\lambda_i$ are a decreasing sequence, Theorem 3 then implies that $\sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i$, which is what we wanted to show. $\square$
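Theorem 1 is also easy to spot-check numerically on random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random symmetric matrices A and B.
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
B = rng.standard_normal((n, n)); B = (B + B.T) / 2

# Eigenvalues sorted in decreasing order, matching the theorem statement.
lam = np.sort(np.linalg.eigvalsh(A))[::-1]
mu = np.sort(np.linalg.eigvalsh(B))[::-1]

lhs = float(np.trace(A.T @ B))
rhs = float(np.sum(lam * mu))  # Theorem 1: lhs <= rhs
```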

## Local KL Divergence

The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions $p$ and $q$, the KL divergence is defined as

$KL(p || q) := \int p(x) \log(p(x)/q(x)) dx$

Note that $KL(p || q) \neq KL(q || p)$. Intuitively, a small $KL(p || q)$ means that there are few points that $p$ assigns high probability to but that $q$ does not. We can also think of $KL(p || q)$ as the number of bits of information needed to update from the distribution $q$ to the distribution $p$.

Suppose that p and q are both mixtures of other distributions: $p(x) = \sum_i \alpha_i F_i(x)$ and $q(x) = \sum_i \beta_i G_i(x)$. Can we bound $KL(p || q)$ in terms of the $KL(F_i || G_i)$? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:

Theorem: If $\sum_i \alpha_i = \sum_i \beta_i = 1$ and $F_i$ and $G_i$ are all probability distributions, then

$KL\left(\sum_i \alpha_i F_i || \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i || G_i)\right)$.

Proof: If we expand the definition, then we are trying to prove that

$\int \left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \alpha_i F_i(x)}{\sum \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_iF_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx$

We will in fact show that this is true for every value of $x$, so that it is certainly true for the integral. Using $\log(x/y) = -\log(y/x)$, re-write the condition for a given value of $x$ as

$\left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \beta_i G_i(x)}{\sum \alpha_i F_i(x)}\right) \geq \sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)$

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen’s inequality on the $\log$ function:

$\sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_iF_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)$

This proves the inequality and therefore the theorem. $\square$

Remark: Intuitively, if we want to describe $\sum \alpha_i F_i$ in terms of $\sum \beta_i G_i$, it is enough to first locate the $i$th term in the sum and then to describe $F_i$ in terms of $G_i$. The theorem is a formalization of this intuition. In the case that $F_i = G_i$, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.
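The theorem is straightforward to verify numerically for discrete mixtures (a sanity check with arbitrary random distributions, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def rand_dist(m):
    v = rng.random(m) + 0.1  # bounded away from zero so all logs are finite
    return v / v.sum()

m, k = 5, 3
F = [rand_dist(m) for _ in range(k)]
G = [rand_dist(m) for _ in range(k)]
alpha, beta = rand_dist(k), rand_dist(k)

p = sum(a * Fi for a, Fi in zip(alpha, F))  # mixture sum_i alpha_i F_i
q = sum(b * Gi for b, Gi in zip(beta, G))   # mixture sum_i beta_i G_i

lhs = kl(p, q)
rhs = sum(a * (np.log(a / b) + kl(Fi, Gi))
          for a, b, Fi, Gi in zip(alpha, beta, F, G))
# Theorem: lhs <= rhs
```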

## Quadratically Independent Monomials

Today Arun asked me the following question:

“Under what conditions will a set $\{p_1,\ldots,p_n\}$ of polynomials be quadratically independent, in the sense that $\{p_1^2, p_1p_2, p_2^2, p_1p_3,\ldots,p_{n-1}p_n, p_n^2\}$ is a linearly independent set?”

I wasn’t able to make much progress on this general question, but in the specific setting where the $p_i$ are all polynomials in one variable, and we further restrict to monomials (i.e. $p_i(x) = x^{d_i}$ for some $d_i$), the condition is just that there are no distinct unordered pairs $(i_1,j_1),(i_2,j_2)$ such that $d_{i_1} + d_{j_1} = d_{i_2} + d_{j_2}$. Arun was interested in how large such a set could be for a given maximum degree $D$, so we are left with the following interesting combinatorics problem:

“What is the largest subset $S$ of $\{1,\ldots,D\}$ such that no two distinct pairs of elements of $S$ have the same sum?”

For convenience of notation let $n$ denote the size of $S$. A simple upper bound is $\binom{n+1}{2} \leq 2D-1$: the $\binom{n+1}{2}$ pairwise sums (allowing repeats) must all be distinct, and each lies between $2$ and $2D$. We therefore have $n = O(\sqrt{D})$.

What about lower bounds on $n$? If we let $S$ be the powers of 2 less than or equal to $D$, then we get a lower bound of $\log_2(D)$; we can do slightly better by taking the Fibonacci numbers instead, but this still only gives us logarithmic growth. So the question is, can we find sets that grow polynomially in $D$?

It turns out the answer is yes, and we can do so by choosing randomly. Let each element of $\{1,\ldots,D\}$ be placed in $S$ independently with probability $p$. Now consider any $k$ with $2 \leq k \leq 2D$. If $k$ is odd, then there are $(k-1)/2$ possible pairs that could add up to $k$: $(1,k-1), (2,k-2), \ldots, ((k-1)/2,(k+1)/2)$. The probability of each such pair being contained in $S$ is $p^2$. Note that these events are independent, since the pairs are disjoint.

$S$ is invalid if and only if there exists some $k$ such that more than one of these pairs is contained in $S$. The probability of any two given pairs being simultaneously contained in $S$ is $p^4$, and there are $\binom{(k-1)/2}{2} \leq \binom{D}{2}$ such pairs of pairs for a given $k$, hence at most $(D-1)\binom{D}{2} \leq D^3/2$ pairs of pairs in total (since we were just looking at odd $k$). Therefore, the probability of an odd value of $k$ invalidating $S$ is at most $p^4D^3/2$.

For even $k$ we get much the same result, except that the probability for a given value of $k$ comes out to the slightly more complicated formula $\binom{k/2-1}{2}p^4 + (k/2-1)p^3 + p^2 \leq D^2p^4/2 + Dp^3 + p^2$ (the extra terms arise from the repeated pair $(k/2,k/2)$), so that the total probability of an even value of $k$ invalidating $S$ is at most $p^4D^3/2 + p^3D^2 + p^2D$.

Putting this all together gives us a bound of $p^4D^3 + p^3D^2 + p^2D$. If we set $p$ to be $\frac{1}{2}D^{-\frac{3}{4}}$ then the probability of $S$ being invalid is at most $\frac{1}{16} + \frac{1}{8} D^{-\frac{1}{4}} + \frac{1}{4}D^{-\frac{1}{2}} \leq \frac{7}{16}$, so with probability at least $\frac{9}{16}$ a set $S$ with elements chosen randomly with probability $\frac{1}{2}D^{-\frac{3}{4}}$ will be valid. On the other hand, such a set has $\frac{1}{2}D^{1/4}$ elements in expectation, and asymptotically the probability of having at least this many elements is $\frac{1}{2}$. Therefore, with probability at least $\frac{1}{16}$ a randomly chosen set will be both valid and have size at least $\frac{1}{2}D^{1/4}$, which shows that the largest value of $n$ is at least $\Omega\left(D^{1/4}\right)$.

We can actually do better: if all elements are chosen with probability $\frac{1}{2}D^{-2/3}$, then one can show that the expected number of invalid pairs is at most $\frac{1}{8}D^{1/3} + O(1)$, and hence we can pick randomly with probability $p = \frac{1}{2}D^{-2/3}$, remove one element of each of the invalid pairs, and still be left with $\Omega(D^{1/3})$ elements in S.
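Here is a sketch of that choose-then-repair procedure (an illustration with my own simplifications, not a tuned implementation — the repair step just deletes one element per collision found, rather than exactly one per invalid pair):

```python
import random
from itertools import combinations_with_replacement

def is_valid(S):
    """No two distinct unordered pairs of elements of S share a sum."""
    sums = [a + b for a, b in combinations_with_replacement(sorted(S), 2)]
    return len(sums) == len(set(sums))

def random_distinct_sum_set(D, p, seed=0):
    """Include each element of {1,...,D} independently with probability p,
    then delete one element from each colliding pair until S is valid."""
    rng = random.Random(seed)
    S = {x for x in range(1, D + 1) if rng.random() < p}
    while True:
        seen = {}
        clash = None
        for pair in combinations_with_replacement(sorted(S), 2):
            s = pair[0] + pair[1]
            if s in seen and seen[s] != pair:
                clash = pair
                break
            seen[s] = pair
        if clash is None:
            return S
        S.remove(clash[0])  # repair: drop one offending element, re-check

D = 1000
S = random_distinct_sum_set(D, 0.5 * D ** (-2.0 / 3.0))  # p = D^(-2/3) / 2
```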

So, to recap: choosing elements randomly gives us S of size $\Omega(D^{1/4})$; choosing randomly and then removing any offending pairs gives us S of size $\Omega(D^{1/3})$; and we have an upper bound of $O(D^{1/2})$. What is the actual asymptotic answer? I don’t actually know the answer to this, but I thought I’d share what I have so far because I think the techniques involved are pretty cool.