Probabilistic Abstractions I

(This post represents research in progress. I may think about these concepts entirely differently a few months from now, but for my own benefit I’m trying to exposit on them in order to force myself to understand them better.)

For many inference tasks, especially ones with either non-linearities or non-convexities, it is common to use particle-based methods such as beam search, particle filters, sequential Monte Carlo, or Markov Chain Monte Carlo. In these methods, we approximate a distribution by a collection of samples from that distribution, then update the samples as new information is added. For instance, in beam search, if we are trying to build up a tree, we might build up a collection of K samples for the left and right subtrees, then look at all K^2 ways of combining them into the entire tree, but then downsample again to the K trees with the highest scores. This allows us to search through the exponentially large space of all trees efficiently (albeit at the cost of possibly missing high-scoring trees).

One major problem with such particle-based methods is diversity: the particles will tend to cluster around the highest-scoring mode, rather than exploring multiple local optima if they exist. This can be bad because it makes learning algorithms overly myopic. Another problem, especially in combinatorial domains, is difficulty of partial evaluation: if we have some training data that we are trying to fit to, and we have chosen settings of some, but not all, variables in our model, it can be difficult to know if that setting is on the right track (for instance, it can be difficult to know whether a partially-built tree is a promising candidate or not). For time-series modeling, this isn’t nearly as large of a problem, since we can evaluate against a prefix of the time series to get a good idea (this perhaps explains the success of particle filters in these domains).

I’ve been working on a method that tries to deal with both of these problems, which I call probabilistic abstractions. The idea is to improve the diversity of particle-based methods by creating “fat” particles which cover multiple states at once; the reason that such fat particles help is that they allow us to first optimize for coverage (by placing down relatively large particles that cover the entire space), then later worry about more local details (by placing down many particles near promising-looking local optima).

To be more concrete, if we have a probability distribution over a set of random variables (X_1,\ldots,X_d), then our particles will be sets obtained by specifying the values of some of the X_i and leaving the rest to vary arbitrarily. So, for instance, if d=4, then \{(X_1,X_2,X_3,X_4) \mid X_2 = 1, x_4 = 7\} might be a possible “fat” particle.

By choosing some number of fat particles and assigning probabilities to them, we are implicitly specifying a polytope of possible probability distributions; for instance, if our particles are S_1,\ldots,S_k, and we assign probability \pi_i to S_i, then we have the polytope of distributions p that satisfy the constraints p(S_1) = \pi_1, p(S_2) = \pi_2, etc.

Given such a polytope, is there a way to pick a canonical representative from it? One such representative is the maximum entropy distribution in that polytope. This distribution has the property of minimizing the worst-case relative entropy to any other distribution within the polytope (and that worst-case relative entropy is just the entropy of the distribution).

Suppose that we have a polytope for two independent distributions, and we want to compute the polytope for their product. This is easy — just look at the cartesian products of each particle of the first distribution with each particle of the second distribution. If each individual distribution has k particles, then the product distribution has k^2 particles — this could be problematic computationally, so we also want a way to narrow down to a subset of the k most informative particles. These will be the k particles such that the corresponding polytope minimizes the maximum entropy of that polytope. Finding this is NP-hard in general, but I’m currently working on good heuristics for computing it.

Next, suppose that we have a distribution on a space X and want to apply a function f : X \to Y to it. If f is a complicated function, it might be difficult to propagate the fat particles (even though it would have been easy to propagate particles composed of single points). To get around this, we need what is called a valid abstraction of f: a function \tilde{f} : 2^X \to 2^Y such that \tilde{f}(S) \supseteq f(S) for all S \in 2^X. In this case, if we map a particle S to \tilde{f}(S), our equality constraint on the mass assigned to S becomes a lower bound on the mass assigned to \tilde{f}(S) — we thus still have a polytope of possible probability distributions. Depending on the exact structure of the particles (i.e. the exact way in which the different sets overlap), it may be necessary to add additional constraints to the polytope to get good performance — I feel like I have some understanding of this, but it’s something I’ll need to investigate empirically as well. It’s also interesting to note that \tilde{f} (when combined with conditioning on data, which is discussed below) allows us to assign partial credit to promising particles, which was the other property I discussed at the beginning.

Finally, suppose that I want to condition on data. In this case the polytope approach doesn’t work as well, because conditioning on data can blow up the polytope by an arbitrarily large amount. Instead, we just take the maximum-entropy distribution in our polytope and treat that as our “true” distribution, then condition. I haven’t been able to make any formal statements about this procedure, but it seems to work at least somewhat reasonably. It is worth noting that conditioning may not be straightforward, since the likelihood function may not be constant across a given fat particle. To deal with this, we can replace the likelihood function by its average (which I think can be justified in terms of maximum entropy as well, although the details here are a bit hazier).

So, in summary, we have a notion of fat particles, which provide better coverage than point particles, and can combine them, apply functions to them, subsample them, and condition on data. This is essentially all of the operations we want to be able to apply for particle-based methods, so we in theory should now be able to implement versions of these particle-based methods that get better coverage.

Pairwise Independence vs. Independence

For collections of independent random variables, the Chernoff bound and related bounds give us very sharp concentration inequalities — if X_1,\ldots,X_n are independent, then their sum has a distribution that decays like e^{-x^2}. For random variables that are only pairwise independent, the strongest bound we have is Chebyshev’s inequality, which says that their sum decays like \frac{1}{x^2}.

The point of this post is to construct an equality case for Chebyshev: a collection of pairwise independent random variables whose sum does not satisfy the concentration bound of Chernoff, and instead decays like \frac{1}{x^2}.

The construction is as follows: let X_1,\ldots,X_d be independent binary random variables, and for any S \subset \{1,\ldots,d\}, let Y_S = \sum_{i \in S} X_i, where the sum is taken mod 2. Then we can easily check that the Y_S are pairwise independent. Now consider  the random variable Z = \sum_{S} Y_S. If any of the X_i is equal to 1, then we can pair up the Y_S by either adding or removing i from S to get the other element of the pair. If we do this, we see that Z = 2^{d-1} in this case. On the other hand, if all of the X_i are equal to 0, then Z = 0 as well. Thus, with probability \frac{1}{2^d}, Z deviates from its mean by 2^{d-1}-\frac{1}{2}, whereas the variance of Z is 2^{d-2}-\frac{1}{4}. The bound on this probability form Chebyshev is \frac{2^{d-2}-1/4}{(2^{d-1}-1/2)^2}, which is very close to \frac{1}{2^d}, so this constitutes something very close to the Chebyshev equality case.

Anyways, I just thought this was a cool example that demonstrates the difference between pairwise and full independence.

A Fun Optimization Problem

I spent the last several hours trying to come up with an efficient algorithm to the following problem:

Problem: Suppose that we have a sequence of l pairs of non-negative numbers (a_1,b_1),\ldots,(a_l,b_l) such that \sum_{i=1}^l a_i \leq A and \sum_{i=1}^l b_i \leq B. Devise an efficient algorithm to find the k pairs (a_{i_1},b_{i_1}),\ldots,(a_{i_k},b_{i_k}) that maximize

\left[\sum_{r=1}^k a_{i_r}\log(a_{i_r}/b_{i_r})\right] + \left[A-\sum_{r=1}^k a_{i_r}\right]\log\left(\frac{A-\sum_{r=1}^k a_{i_r}}{B-\sum_{r=1}^k b_{i_r}}\right).

Commentary: I don’t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in O\left(\frac{l \log(l)}{\epsilon}\right) time and finds 2k pairs that do at least 1-\epsilon as well as the best set of k pairs. It’s possible I need to assume something like \sum_{i=1}^l a_i \leq A/2 instead of just A (and similarly for the b_i), although I’m happy to make that assumption.

While attempting to solve this problem, I’ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I’d be interested to know.

Eigenvalue Bounds

While grading homeworks today, I came across the following bound:

Theorem 1: If A and B are symmetric n\times n matrices with eigenvalues \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n and \mu_1 \geq \mu_2 \geq \ldots \geq \mu_n respectively, then Trace(A^TB) \leq \sum_{i=1}^n \lambda_i \mu_i.

For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the Cauchy Interlacing Theorem:

Theorem 2: If A is an n\times n symmetric matrix and B is an (n-k) \times (n-k) principle submatrix of A, then \lambda_{i-k}(A) \leq \lambda_i(B) \leq \lambda_i(A), where \lambda_i(X) is the ith largest eigenvalue of X.

As a corollary we have:

Corollary 1: For any symmetric matrix X, \sum_{i=1}^k X_{ii} \leq \sum_{i=1}^k \lambda_i(X).

Proof: The left-hand-side is just the trace of the upper-left k\times k principle submatrix of X, whose eigenvalues are by Theorem 2 bounded by the k largest eigenvalues of X. \square

The final ingredient we will need is a sort of “majorization” inequality based on Abel summation:

Theorem 3: If x_1,\ldots,x_n and y_1,\ldots,y_n are such that \sum_{i=1}^k x_i \leq \sum_{i=1}^k y_i for all k (with equality when k=n), and c_1 \geq c_2 \geq \ldots \geq c_n, then \sum_{i=1}^n c_ix_i \leq \sum_{i=1}^n c_iy_i.

Proof: We have:

\sum_{i=1}^n c_ix_i = c_n(x_1+\cdots+x_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+\cdots+x_i) \leq c_n(y_1+\cdots+y_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+\cdots+y_i) = \sum_{i=1}^n c_iy_i

where the equalities come from the Abel summation method. \square

Now, we are finally ready to prove the original theorem:

Proof of Theorem 1: First note that since the trace is invariant under similarity transforms, we can without loss of generality assume that A is diagonal, in which case we want to prove that \sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i. But by Corollary 1, we also know that \sum_{i=1}^k B_{ii} \leq \sum_{i=1}^k \mu_i for all k. Since by assumption the \lambda_i are a decreasing sequence, Theorem 3 then implies that \sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i, which is what we wanted to show. \square

Local KL Divergence

The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions p and q, the KL divergence is defined as

KL(p || q) := \int p(x) \log(p(x)/q(x)) dx

Note that KL(p || q) \neq KL(q || p). Intuitively, a small KL(p || q) means that there are few points that p assigns high probability to but that q does not. We can also think of KL(p || q) as the number of bits of information needed to update from the distribution q to the distribution p.

Suppose that p and q are both mixtures of other distributions: p(x) = \sum_i \alpha_i F_i(x) and q(x) = \sum_i \beta_i G_i(x). Can we bound KL(p || q) in terms of the KL(F_i || G_i)? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:

Theorem: If \sum_i \alpha_i = \sum_i \beta_i = 1 and F_i and G_i are all probability distributions, then

KL\left(\sum_i \alpha_i F_i || \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i || G_i)\right).

Proof: If we expand the definition, then we are trying to prove that

\int \left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \alpha_i F_i(x)}{\sum \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_iF_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx

We will in fact show that this is true for every value of x, so that it is certainly true for the integral. Using \log(x/y) = -\log(y/x), re-write the condition for a given value of x as

\left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \beta_i G_i(x)}{\sum \alpha_i F_i(x)}\right) \geq \sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen’s inequality on the \log function:

\sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_iF_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)

This proves the inequality and therefore the theorem. \square

Remark: Intuitively, if we want to describe \sum \alpha_i F_i in terms of \sum \beta_i G_i, it is enough to first locate the ith term in the sum and then to describe F_i in terms of G_i. The theorem is a formalization of this intuition. In the case that F_i = G_i, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.

Quadratically Independent Monomials

Today Arun asked me the following question:

“Under what conditions will a set \{p_1,\ldots,p_n\} of polynomials be quadratically independent, in the sense that \{p_1^2, p_1p_2, p_2^2, p_1p_3,\ldots,p_{n-1}p_n, p_n^2\} is a linearly independent set?”

I wasn’t able to make much progress on this general question, but in the specific setting where the p_i are all polynomials in one variable, and we further restrict to just monomials, (i.e. p_i(x) = x^{d_i} for some d_i), the condition is just that there are no distinct unordered pairs (i_1,j_1),(i_2,j_2) such that d_{i_1} + d_{j_1} = d_{i_2} + d_{j_2}. Arun was interested in the largest such a set could be for a given maximum degree D, so we are left with the following interesting combinatorics problem:

“What is the largest subset S of \{1,\ldots,D\} such that no two distinct pairs of elements of S have the same sum?”

For convenience of notation let n denote the size of S. A simple upper bound is \binom{N+1}{2} \leq 2D-1, since there are \binom{N+1}{2} pairs to take a sum of, and all pairwise sums lie between 2 and 2D. We therefore have n = O(\sqrt{D}).

What about lower bounds on n? If we let S be the powers of 2 less than or equal to D, then we get a lower bound of \log_2(D); we can do slightly better by taking the Fibonacci numbers instead, but this still only gives us logarithmic growth. So the question is, can we find sets that grow polynomially in D?

It turns out the answer is yes, and we can do so by choosing randomly. Let each element of \{1,\ldots,D\} be placed in S with probability p. Now consider any k, 2 \leq k \leq 2D. If k is odd, then there are (k-1)/2 possible pairs that could add up to k: (1,k-1), (2,k-2),…,((k-1)/2,(k+1)/2). The probability of each such pair existing is p^2. Note that each of these events is independent.

S is invalid if and only if there exists some k such that more than one of these pairs is active in S. The probability of any two given pairs being simultaneously active is p^4, and there are \binom{(k-1)/2}{2} \leq \binom{D}{2} such pairs for a given k, hence (D-1)\binom{D}{2} \leq D^3/2 such pairs total (since we were just looking at odd k). Therefore, the probability of an odd value of k invalidating S is at most p^4D^3/2.

For even k we get much the same result except that the probability for a given value of k comes out to the slightly more complicated formula \binom{k/2-1}{2}p^4 + (k/2-1)p^3 + p^2 \leq D^2p^4/2 + Dp^3 + p^2, so that the total probability of an even value of k invalidating S is at most p^4D^3/2 + p^3D^2 + p^2D.

Putting this all together gives us a bound of p^4D^3 + p^3D^2 + p^2D. If we set p to be $\frac{1}{2}D^{-\frac{3}{4}}$ then the probability of S being invalid is then at most \frac{1}{16} + \frac{1}{8} D^{-\frac{1}{4}} + \frac{1}{4}D^{-\frac{1}{2}} \leq \frac{7}{16}, so with probability at least \frac{7}{16} a set S with elements chosen randomly with probability \frac{1}{2}D^{-\frac{3}{4}} will be valid. On the other hand, such a set has D^{1/4} elements in expectation, and asymptotically the probability of having at least this many elements is \frac{1}{2}. Therefore, with probability at least \frac{1}{16} a randomly chosen set will be both valid and have size greater than \frac{1}{2}, which shows that the largest value of n is at least \Omega\left(D^{1/4}\right).

We can actually do better: if all elements are chosen with probability \frac{1}{2}D^{-2/3}, then one can show that the expected number of invalid pairs is at most \frac{1}{8}D^{1/3} + O(1), and hence we can pick randomly with probability p = \frac{1}{2}D^{-2/3}, remove one element of each of the invalid pairs, and still be left with \Omega(D^{1/3}) elements in S.

So, to recap: choosing elements randomly gives us S of size \Omega(D^{1/4}); choosing randomly and then removing any offending pairs gives us S of size \Omega(D^{1/3}); and we have an upper bound of O(D^{1/2}). What is the actual asymptotic answer? I don’t actually know the answer to this, but I thought I’d share what I have so far because I think the techniques involved are pretty cool.

Exponential Families

In my last post I discussed log-linear models. In this post I’d like to take another perspective on log-linear models, by thinking of them as members of an exponential family. There are many reasons to take this perspective: exponential families give us efficient representations of log-linear models, which is important for continuous domains; they always have conjugate priors, which provide an analytically tractable regularization method; finally, they can be viewed as maximum-entropy models for a given set of sufficient statistics. Don’t worry if these terms are unfamiliar; I will explain all of them by the end of this post. Also note that most of this material is available on the Wikipedia page on exponential families, which I used quite liberally in preparing the below exposition.

1. Exponential Families

An exponential family is a family of probability distributions, parameterized by {\theta \in \mathbb{R}^n}, of the form

\displaystyle p(x \mid \theta) \propto h(x)\exp(\theta^T\phi(x)). \ \ \ \ \ (1)

Notice the similarity to the definition of a log-linear model, which is

\displaystyle p(x \mid \theta) \propto \exp(\theta^T\phi(x)). \ \ \ \ \ (2)

So, a log-linear model is simply an exponential family model with {h(x) = 1}. Note that we can re-write the right-hand-side of (1) as {\exp(\theta^T\phi(x)+\log h(x))}, so an exponential family is really just a log-linear model with one of the coordinates of \theta constrained to equal {1}. Also note that the normalization constant in (1) is a function of \theta (since \theta fully specifies the distribution over {x}), so we can express (1) more explicitly as

\displaystyle p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta)), \ \ \ \ \ (3)

where

\displaystyle A(\theta) = \log\left(\int h(x)\exp(\theta^T\phi(x)) d(x)\right). \ \ \ \ \ (4)

Exponential families are capable of capturing almost all of the common distributions you are familiar with. There is an extensive table on Wikipedia; I’ve also included some of the most common below:

  1. Gaussian distributions. Let {\phi(x) = \left[ \begin{array}{c} x \\ x^2\end{array} \right]}. Then {p(x \mid \theta) \propto \exp(\theta_1x+\theta_2x^2)}. If we let {\theta = \left[\frac{\mu}{\sigma^2},-\frac{1}{2\sigma^2}\right]}, then {p(x \mid \theta) \propto \exp(\frac{\mu x}{\sigma^2}-\frac{x^2}{2\sigma^2}) \propto \exp(-\frac{1}{2\sigma^2}(x-\mu)^2)}. We therefore see that Gaussian distributions are an exponential family for {\phi(x) = \left[ \begin{array}{c} x \\ x^2 \end{array} \right]}.
  2. Poisson distributions. Let {\phi(x) = [x]} and {h(x) = \left\{\begin{array}{ccc} \frac{1}{x!} & : & x \in \{0,1,2,\ldots\} \\ 0 & : & \mathrm{else} \end{array}\right.}. Then {p(k \mid \theta) \propto \frac{1}{k!}\exp(\theta x)}. If we let {\theta_1 = \log(\lambda)} then we get {p(k \mid \theta) \propto \frac{\lambda^k}{k!}}; we thus see that Poisson distributions are also an exponential family.
  3. Multinomial distributions. Suppose that {X = \{1,2,\ldots,n\}}. Let {\phi(k)} be an {n}-dimensional vector whose {k}th element is {1} and where all other elements are zero. Then {p(k \mid \theta) \propto \exp(\theta_k) \propto \frac{\exp(\theta_k)}{\sum_{k=1}^n \exp(\theta_k)}}. If {\theta_k = \log P(x=k)}, then we obtain an arbitrary multinomial distribution. Therefore, multinomial distributions are also an exponential family.

2. Sufficient Statistics

A statistic of a random variable {X} is any deterministic function of that variable. For instance, if {X = [X_1,\ldots,X_n]^T} is a vector of Gaussian random variables, then the sample mean {\hat{\mu} := (X_1+\ldots+X_n)/n} and sample variance {\hat{\sigma}^2 := (X_1^2+\cdots+X_n^2)/n-(X_1+\cdots+X_n)^2/n^2} are both statistics.

Let {\mathcal{F}} be a family of distributions parameterized by \theta, and let {X} be a random variable with distribution given by some unknown {\theta_0}. Then a vector {T(X)} of statistics are called sufficient statistics for {\theta_0} if they contain all possible information about {\theta_0}, that is, for any function {f}, we have

\displaystyle \mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] = S(f,T_0), \ \ \ \ \ (5)

for some function {S} that has no dependence on {\theta_0}.

For instance, let {X} be a vector of {n} independent Gaussian random variables {X_1,\ldots,X_n} with unknown mean {\mu} and variance {\sigma}. It turns out that {T(X) := [\hat{\mu},\hat{\sigma}^2]} is a sufficient statistic for {\mu} and {\sigma}. This is not immediately obvious; a very useful tool for determining whether statistics are sufficient is the Fisher-Neyman factorization theorem:

Theorem 1 (Fisher-Neyman) Suppose that {X} has a probability density function {p(X \mid \theta)}. Then the statistics {T(X)} are sufficient for \theta if and only if {p(X \mid \theta)} can be written in the form

\displaystyle p(X \mid \theta) = h(X)g_\theta(T(X)). \ \ \ \ \ (6)

In other words, the probability of {X} can be factored into a part that does not depend on \theta, and a part that depends on \theta only via {T(X)}.

What is going on here, intuitively? If {p(X \mid \theta)} depended only on {T(X)}, then {T(X)} would definitely be a sufficient statistic. But that isn’t the only way for {T(X)} to be a sufficient statistic — {p(X \mid \theta)} could also just not depend on \theta at all, in which case {T(X)} would trivially be a sufficient statistic (as would anything else). The Fisher-Neyman theorem essentially says that the only way in which {T(X)} can be a sufficient statistic is if its density is a product of these two cases.

Proof: If (6) holds, then we can check that (5) is satisfied:

\displaystyle \begin{array}{rcl} \mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] &=& \frac{\int_{T(X) = T_0} f(X) dp(X \mid \theta=\theta_0)}{\int_{T(X) = T_0} dp(X \mid \theta=\theta_0)}\\ \\ &=& \frac{\int_{T(X)=T_0} f(X)h(X)g_\theta(T_0) dX}{\int_{T(X)=T_0} h(X)g_\theta(T_0) dX}\\ \\ &=& \frac{\int_{T(X)=T_0} f(X)h(X)dX}{\int_{T(X)=T_0} h(X) dX}, \end{array}

where the right-hand-side has no dependence on \theta.

On the other hand, if we compute {\mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0]} for an arbitrary density {p(X)}, we get

\displaystyle \begin{array}{rcl} \mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] &=& \int_{T(X) = T_0} f(X) \frac{p(X \mid \theta=\theta_0)}{\int_{T(X)=T_0} p(X \mid \theta=\theta_0) dX} dX. \end{array}

If the right-hand-side cannot depend on \theta for any choice of {f}, then the term that we multiply {f} by must not depend on \theta; that is, {\frac{p(X \mid \theta=\theta_0)}{\int_{T(X) = T_0} p(X \mid \theta=\theta_0) dX}} must be some function {h_0(X, T_0)} that depends only on {X} and {T_0} and not on \theta. On the other hand, the denominator {\int_{T(X)=T_0} p(X \mid \theta=\theta_0) dX} depends only on {\theta_0} and {T_0}; call this dependence {g_{\theta_0}(T_0)}. Finally, note that {T_0} is a deterministic function of {X}, so let {h(X) := h_0(X,T(X))}. We then see that {p(X \mid \theta=\theta_0) = h_0(X, T_0)g_{\theta_0}(T_0) = h(X)g_{\theta_0}(T(X))}, which is the same form as (6), thus completing the proof of the theorem. \Box

Now, let us apply the Fisher-Neyman theorem to exponential families. By definition, the density for an exponential family factors as

\displaystyle p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta)).

If we let {T(x) = \phi(x)} and {g_\theta(\phi(x)) = \exp(\theta^T\phi(x)-A(\theta))}, then the Fisher-Neyman condition is met; therefore, {\phi(x)} is a vector of sufficient statistics for the exponential family. In fact, we can go further:

Theorem 2 Let {X_1,\ldots,X_n} be drawn independently from an exponential family distribution with fixed parameter \theta. Then the empirical expectation {\hat{\phi} := \frac{1}{n} \sum_{i=1}^n \phi(X_i)} is a sufficient statistic for \theta.

Proof: The density for {X_1,\ldots,X_n} given \theta is

\displaystyle \begin{array}{rcl} p(X_1,\ldots,X_n \mid \theta) &=& h(X_1)\cdots h(X_n) \exp(\theta^T\sum_{i=1}^n \phi(X_i) - nA(\theta)) \\ &=& h(X_1)\cdots h(X_n)\exp(n [\hat{\phi}-A(\theta)]). \end{array}

Letting {h(X_1,\ldots,X_n) = h(X_1)\cdots h(X_n)} and {g_\theta(\hat{\phi}) = \exp(n[\hat{\phi}-A(\theta)])}, we see that the Fisher-Neyman conditions are satisfied, so that {\hat{\phi}} is indeed a sufficient statistic. \Box

Finally, we note (without proof) the same relationship as in the log-linear case to the gradient and Hessian of {p(X_1,\ldots,X_n \mid \theta)} with respect to the model parameters:

Theorem 3 Again let {X_1,\ldots,X_n} be drawn from an exponential family distribution with parameter \theta. Then the gradient of {p(X_1,\ldots,X_n \mid \theta)} with respect to \theta is

\displaystyle n \times \left(\hat{\phi}-\mathbb{E}[\phi \mid \theta]\right)

and the Hessian is

\displaystyle n \times \left(\mathbb{E}[\phi \mid \theta]\mathbb{E}[\phi \mid \theta]^T - \mathbb{E}[\phi\phi^T \mid \theta]\right).

This theorem provides an efficient algorithm for fitting the parameters of an exponential family distribution (for details on the algorithm, see the part near the end of the log-linear models post on parameter estimation).

3. Moments of an Exponential Family

If {X} is a real-valued random variable, then the {p}th moment of {X} is {\mathbb{E}[X^p]}. In general, if {X = [X_1,\ldots,X_n]^T} is a random variable on {\mathbb{R}^n}, then for every sequence {p_1,\ldots,p_n} of non-negative integers, there is a corresponding moment {M_{p_1,\cdots,p_n} := \mathbb{E}[X_1^{p_1}\cdots X_n^{p_n}]}.

In exponential families there is a very nice relationship between the normalization constant {A(\theta)} and the moments of {X}. Before we establish this relationship, let us define the moment generating function of a random variable {X} as {f(\lambda) = \mathbb{E}[\exp(\lambda^TX)]}.

Lemma 4 The moment generating function for a random variable {X} is equal to

\displaystyle \sum_{p_1,\ldots,p_n=0}^{\infty} M_{p_1,\cdots,p_n} \frac{\lambda_1^{p_1}\cdots \lambda_n^{p_n}}{p_1!\cdots p_n!}.

The proof of Lemma 4 is a straightforward application of Taylor’s theorem, together with linearity of expectation (note that in one dimension, the expression in Lemma 4 would just be {\sum_{p=0}^{\infty} \mathbb{E}[X^p] \frac{\lambda^p}{p!}}).

We now see why {f(\lambda)} is called the moment generating function: it is the exponential generating function for the moments of {X}. The moment generating function for the sufficient statistics of an exponential family is particularly easy to compute:

Lemma 5 If {p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta))}, then {\mathbb{E}[\exp(\lambda^T\phi(x))] = \exp(A(\theta+\lambda)-A(\theta))}.

Proof:

\displaystyle \begin{array}{rcl} \mathbb{E}[\exp(\lambda^Tx)] &=& \int \exp(\lambda^Tx) p(x \mid \theta) dx \\ &=& \int \exp(\lambda^Tx)h(x)\exp(\theta^T\phi(x)-A(\theta)) dx \\ &=& \int h(x)\exp((\theta+\lambda)^T\phi(x)-A(\theta)) dx \\ &=& \int h(x)\exp((\theta+\lambda)^T\phi(x)-A(\theta+\lambda))dx \times \exp(A(\theta+\lambda)-A(\theta)) \\ &=& \int p(x \mid \theta+\lambda) dx \times \exp(A(\theta+\lambda)-A(\theta)) \\ &=& \exp(A(\theta+\lambda)-A(\theta)), \end{array}

where the last step uses the fact that {p(x \mid \theta+\lambda)} is a probability density and hence {\int p(x \mid \theta+\lambda) dx = 1}. \Box

Now, by Lemma 4, {M_{p_1,\cdots,p_n}} is just the {(p_1,\ldots,p_n)} coefficient in the Taylor series for the moment generating function {f(\lambda)}, and hence we can compute {M_{p_1,\cdots,p_n}} as {\frac{\partial^{p_1+\cdots+p_n} f(\lambda)}{\partial^{p_1}\lambda_1\cdots \partial^{p_n}\lambda_n}}. Combining this with Lemma 5 gives us a closed-form expression for {M_{p_1,\cdots,p_n}} in terms of the normalization constant {A(\theta)}:

Lemma 6 The moments of an exponential family can be computed as

\displaystyle M_{p_1,\ldots,p_n} = \frac{\partial^{p_1+\cdots+p_n} \exp(A(\theta+\lambda)-A(\theta))}{\partial^{p_1}\lambda_1\cdots \partial^{p_n}\lambda_n}.

For those who prefer cumulants to moments, I will note that there is a version of Lemma 6 for cumulants with an even simpler formula.

Exercise: Use Lemma 6 to compute {\mathbb{E}[X^6]}, where {X} is a Gaussian with mean {\mu} and variance {\sigma^2}.

4. Conjugate Priors

Given a family of distributions {p(X \mid \theta)}, a conjugate prior family {p(\theta \mid \alpha)} is a family that has the property that

\displaystyle p(\theta \mid X, \alpha) = p(\theta \mid \alpha')

for some {\alpha'} depending on {\alpha} and {X}. In other words, if the prior over \theta lies in the conjugate family, and we observe {X}, then the posterior over \theta also lies in the conjugate family. This is very useful algebraically as it means that we can get our posterior simply by updating the parameters of the prior. The following are examples of conjugate families:

  1. (Gaussian-Gaussian) Let {p(X \mid \mu) \propto \exp((X-\mu)^2/2)}, and let {p(\mu \mid \mu_0, \sigma_0) \propto \exp((\mu-\mu_0)^2/2\sigma_0^2)}. Then, by Bayes’ rule,

\displaystyle \begin{array}{rcl} p(\mu \mid X=x, \mu_0, \sigma_0) &\propto \exp((x-\mu)^2/2)\exp((\mu-\mu_0)^2/2\sigma_0^2) \\ &= &\exp\left(\frac{(\mu-\mu_0)^2+\sigma_0^2(\mu-x)^2}{2\sigma_0^2}\right) \\ &\propto& \exp\left(\frac{(1+\sigma_0)^2\mu^2-2(\mu_0+\sigma_0^2x)\mu}{2\sigma_0^2}\right) \\ &\propto& \exp\left(\frac{\mu^2-2\frac{\mu_0+x\sigma_0^2}{1+\sigma_0^2}\mu}{2\sigma_0^2/(1+\sigma_0^2)}\right) \\ &\propto& \exp\left(\frac{(\mu-(\mu_0+x\sigma_0^2)/(1+\sigma_0^2))^2}{2\sigma_0^2/(1+\sigma_0^2)}\right) \\ &\propto& p\left(\mu \mid \frac{\mu_0+x\sigma_0^2}{1+\sigma_0^2}, \frac{\sigma_0}{\sqrt{1+\sigma_0^2}}\right). \end{array}

Therefore, {\mu_0, \sigma_0} parameterize a family of priors over {\mu} that is conjugate to {X \mid \mu}.

  • (Beta-Bernoulli) Let {X \in \{0,1\}}, {\theta \in [0,1]}, {p(X=1 \mid \theta) = \theta}, and {p(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}}. The distribution over {X} given \theta is then called a Bernoulli distribution, and that of \theta given {\alpha} and {\beta} is called a beta distribution. Note that {p(X\mid \theta)} can also be written as {\theta^X(1-\theta)^{1-X}}. From this, we see that the family of beta distributions is a conjugate prior to the family of Bernoulli distributions, since

\displaystyle \begin{array}{rcl} p(\theta \mid X=x, \alpha, \beta) &\propto& \theta^x(1-\theta)^{1-x} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} \\ &=& \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1} \\ &\propto& p(\theta \mid \alpha+x, \beta+(1-x)). \end{array}

  • (Gamma-Poisson) Let {p(X=k \mid \lambda) = \frac{\lambda^k}{e^{\lambda}k!}} for {k \in \mathbb{Z}_{\geq 0}}. Let {p(\lambda \mid \alpha, \beta) \propto \lambda^{\alpha-1}\exp(-\beta \lambda)}. As noted before, the distribution for {X} given {\lambda} is called a Poisson distribution; the distribution for {\lambda} given {\alpha} and {\beta} is called a gamma distribution. We can check that the family of gamma distributions is conjugate to the family of Poisson distributions.Important note: unlike in the last two examples, the normalization constant for the Poisson distribution actually depends on {\lambda}, and so we need to include it in our calculations:

    \displaystyle \begin{array}{rcl} p(\lambda \mid X=k, \alpha, \beta) &\propto& \frac{\lambda^k}{e^{\lambda}k!} \times \lambda^{\alpha-1}\exp(-\beta\lambda) \\ &\propto& \lambda^{\alpha+k-1}\exp(-(\beta+1)\lambda) \\ &\propto& p(\lambda \mid \alpha+k, \beta+1). \end{array}

    Note that, in general, a family of distributions will always have some conjugate family, as if nothing else the family of all probability distributions over \theta will be a conjugate family. What we really care about is a conjugate family that itself has nice properties, such as tractably computable moments.

    Conjugate priors have a very nice relationship to exponential families, established in the following theorem:

    Theorem 7 Let {p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta))} be an exponential family. Then {p(\theta \mid \eta, \kappa) \propto h_2(\theta)\exp\left(\eta^T\theta-\kappa A(\theta)\right)} is a conjugate prior for {x \mid \theta} for any choice of {h_2}. The update formula is {p(\theta \mid x, \eta, \kappa) = p(\theta \mid \eta+\phi(x), \kappa+1)}. Furthermore, {\theta \mid \phi, \kappa} is itself an exponential family, with sufficient statistics {[\theta; A(\theta)]}.

    Checking the theorem is a matter of straightforward algebra, so I will leave the proof as an exercise to the reader. Note that, as before, there is no guarantee that {p(\theta \mid \eta, \kappa)} will be tractable; however, in many cases the conjugate prior given by Theorem 7 is a well-behaved family. See this Wikipedia page for examples of conjugate priors, many of which correspond to exponential family distributions.

    5. Maximum Entropy and Duality

    The final property of exponential families I would like to establish is a certain duality property. What I mean by this is that exponential families can be thought of as the maximum entropy distributions subject to a constraint on the expected value of their sufficient statistics. For those unfamiliar with the term, the entropy of a distribution over {X} with density {p(X)} is {\mathbb{E}[-\log p(X)] := -\int p(x)\log(p(x)) dx}. Intuitively, higher entropy corresponds to higher uncertainty, so a maximum entropy distribution is one specifying as much uncertainty as possible given a certain set of information (such as the values of various moments). This makes them appealing, at least in theory, from a modeling perspective, since they “encode exactly as much information as is given and no more”. (Caveat: this intuition isn’t entirely valid, and in practice maximum-entropy distributions aren’t always necessarily appropriate.)

    In any case, the duality property is captured in the following theorem:

    Theorem 8 The distribution over {X} with maximum entropy such that {\mathbb{E}[\phi(X)] = T} lies in the exponential family with sufficient statistic {\phi(X)} and {h(X) = 1}.

    Proving this fully rigorously requires the calculus of variations; I will instead give the “physicist’s proof”. Proof: } Let {p(X)} be the density for {X}. Then we can view {p} as the solution to the constrained maximization problem:

    \displaystyle \begin{array}{rcl} \mathrm{maximize} && -\int p(X) \log p(X) dX \\ \mathrm{subject \ to} && \int p(X) dX = 1 \\ && \int p(X) \phi(X) dX = T. \end{array}

    By the method of Lagrange multipliers, there exist {\alpha} and {\lambda} such that

    \displaystyle \frac{d}{dp}\left(-\int p(X)\log p(X) dX - \alpha [\int p(X) dX-1] - \lambda^T[\int \phi(X) p(X) dX-T]\right) = 0.

    This simplifies to:

    \displaystyle -\log p(X) - 1 - \alpha -\lambda^T \phi(X) = 0,

    which implies

    \displaystyle p(X) = \exp(-1-\alpha-\lambda^T\phi(X))

    for some {\alpha} and {\lambda}. In particular, if we let {\lambda = -\theta} and {\alpha = A(\theta)-1}, then we recover the exponential family with {h(X) = 1}, as claimed. \Box

    6. Conclusion

    Hopefully I have by now convinced you that exponential families have many nice properties: they have conjugate priors, simple-to-fit parameters, and easily-computed moments. While exponential families aren’t always appropriate models for a given situation, their tractability makes them the model of choice when no other information is present; and, since they can be obtained as maximum-entropy families, they are actually appropriate models in a wide family of circumstances.

Follow

Get every new post delivered to your Inbox.