Probabilistic Abstractions I

(This post represents research in progress. I may think about these concepts entirely differently a few months from now, but for my own benefit I’m trying to exposit on them in order to force myself to understand them better.)

For many inference tasks, especially ones with either non-linearities or non-convexities, it is common to use particle-based methods such as beam search, particle filters, sequential Monte Carlo, or Markov Chain Monte Carlo. In these methods, we approximate a distribution by a collection of samples from that distribution, then update the samples as new information is added. For instance, in beam search, if we are trying to build up a tree, we might build up a collection of K samples for the left and right subtrees, then look at all K^2 ways of combining them into the entire tree, but then downsample again to the K trees with the highest scores. This allows us to search through the exponentially large space of all trees efficiently (albeit at the cost of possibly missing high-scoring trees).

One major problem with such particle-based methods is diversity: the particles will tend to cluster around the highest-scoring mode, rather than exploring multiple local optima if they exist. This can be bad because it makes learning algorithms overly myopic. Another problem, especially in combinatorial domains, is difficulty of partial evaluation: if we have some training data that we are trying to fit to, and we have chosen settings of some, but not all, variables in our model, it can be difficult to know if that setting is on the right track (for instance, it can be difficult to know whether a partially-built tree is a promising candidate or not). For time-series modeling, this isn’t nearly as large of a problem, since we can evaluate against a prefix of the time series to get a good idea (this perhaps explains the success of particle filters in these domains).

I’ve been working on a method that tries to deal with both of these problems, which I call probabilistic abstractions. The idea is to improve the diversity of particle-based methods by creating “fat” particles which cover multiple states at once; the reason that such fat particles help is that they allow us to first optimize for coverage (by placing down relatively large particles that cover the entire space), then later worry about more local details (by placing down many particles near promising-looking local optima).

To be more concrete, if we have a probability distribution over a set of random variables (X_1,\ldots,X_d), then our particles will be sets obtained by specifying the values of some of the X_i and leaving the rest to vary arbitrarily. So, for instance, if d=4, then \{(X_1,X_2,X_3,X_4) \mid X_2 = 1, x_4 = 7\} might be a possible “fat” particle.

By choosing some number of fat particles and assigning probabilities to them, we are implicitly specifying a polytope of possible probability distributions; for instance, if our particles are S_1,\ldots,S_k, and we assign probability \pi_i to S_i, then we have the polytope of distributions p that satisfy the constraints p(S_1) = \pi_1, p(S_2) = \pi_2, etc.

Given such a polytope, is there a way to pick a canonical representative from it? One such representative is the maximum entropy distribution in that polytope. This distribution has the property of minimizing the worst-case relative entropy to any other distribution within the polytope (and that worst-case relative entropy is just the entropy of the distribution).

Suppose that we have a polytope for two independent distributions, and we want to compute the polytope for their product. This is easy — just look at the cartesian products of each particle of the first distribution with each particle of the second distribution. If each individual distribution has k particles, then the product distribution has k^2 particles — this could be problematic computationally, so we also want a way to narrow down to a subset of the k most informative particles. These will be the k particles such that the corresponding polytope minimizes the maximum entropy of that polytope. Finding this is NP-hard in general, but I’m currently working on good heuristics for computing it.

Next, suppose that we have a distribution on a space X and want to apply a function f : X \to Y to it. If f is a complicated function, it might be difficult to propagate the fat particles (even though it would have been easy to propagate particles composed of single points). To get around this, we need what is called a valid abstraction of f: a function \tilde{f} : 2^X \to 2^Y such that \tilde{f}(S) \supseteq f(S) for all S \in 2^X. In this case, if we map a particle S to \tilde{f}(S), our equality constraint on the mass assigned to S becomes a lower bound on the mass assigned to \tilde{f}(S) — we thus still have a polytope of possible probability distributions. Depending on the exact structure of the particles (i.e. the exact way in which the different sets overlap), it may be necessary to add additional constraints to the polytope to get good performance — I feel like I have some understanding of this, but it’s something I’ll need to investigate empirically as well. It’s also interesting to note that \tilde{f} (when combined with conditioning on data, which is discussed below) allows us to assign partial credit to promising particles, which was the other property I discussed at the beginning.

Finally, suppose that I want to condition on data. In this case the polytope approach doesn’t work as well, because conditioning on data can blow up the polytope by an arbitrarily large amount. Instead, we just take the maximum-entropy distribution in our polytope and treat that as our “true” distribution, then condition. I haven’t been able to make any formal statements about this procedure, but it seems to work at least somewhat reasonably. It is worth noting that conditioning may not be straightforward, since the likelihood function may not be constant across a given fat particle. To deal with this, we can replace the likelihood function by its average (which I think can be justified in terms of maximum entropy as well, although the details here are a bit hazier).

So, in summary, we have a notion of fat particles, which provide better coverage than point particles, and can combine them, apply functions to them, subsample them, and condition on data. This is essentially all of the operations we want to be able to apply for particle-based methods, so we in theory should now be able to implement versions of these particle-based methods that get better coverage.

Pairwise Independence vs. Independence

For collections of independent random variables, the Chernoff bound and related bounds give us very sharp concentration inequalities — if X_1,\ldots,X_n are independent, then their sum has a distribution that decays like e^{-x^2}. For random variables that are only pairwise independent, the strongest bound we have is Chebyshev’s inequality, which says that their sum decays like \frac{1}{x^2}.

The point of this post is to construct an equality case for Chebyshev: a collection of pairwise independent random variables whose sum does not satisfy the concentration bound of Chernoff, and instead decays like \frac{1}{x^2}.

The construction is as follows: let X_1,\ldots,X_d be independent binary random variables, and for any S \subset \{1,\ldots,d\}, let Y_S = \sum_{i \in S} X_i, where the sum is taken mod 2. Then we can easily check that the Y_S are pairwise independent. Now consider  the random variable Z = \sum_{S} Y_S. If any of the X_i is equal to 1, then we can pair up the Y_S by either adding or removing i from S to get the other element of the pair. If we do this, we see that Z = 2^{d-1} in this case. On the other hand, if all of the X_i are equal to 0, then Z = 0 as well. Thus, with probability \frac{1}{2^d}, Z deviates from its mean by 2^{d-1}-\frac{1}{2}, whereas the variance of Z is 2^{d-2}-\frac{1}{4}. The bound on this probability form Chebyshev is \frac{2^{d-2}-1/4}{(2^{d-1}-1/2)^2}, which is very close to \frac{1}{2^d}, so this constitutes something very close to the Chebyshev equality case.

Anyways, I just thought this was a cool example that demonstrates the difference between pairwise and full independence.