## Exponential Families

In my last post I discussed log-linear models. In this post I’d like to take another perspective on log-linear models, by thinking of them as members of an exponential family. There are many reasons to take this perspective: exponential families give us efficient representations of log-linear models, which is important for continuous domains; they always have conjugate priors, which provide an analytically tractable regularization method; finally, they can be viewed as maximum-entropy models for a given set of sufficient statistics. Don’t worry if these terms are unfamiliar; I will explain all of them by the end of this post. Also note that most of this material is available on the Wikipedia page on exponential families, which I used quite liberally in preparing the below exposition.

1. Exponential Families

An exponential family is a family of probability distributions, parameterized by ${\theta \in \mathbb{R}^n}$, of the form

$\displaystyle p(x \mid \theta) \propto h(x)\exp(\theta^T\phi(x)). \ \ \ \ \ (1)$

Notice the similarity to the definition of a log-linear model, which is

$\displaystyle p(x \mid \theta) \propto \exp(\theta^T\phi(x)). \ \ \ \ \ (2)$

So, a log-linear model is simply an exponential family model with ${h(x) = 1}$. Note that we can re-write the right-hand-side of (1) as ${\exp(\theta^T\phi(x)+\log h(x))}$, so an exponential family is really just a log-linear model with one of the coordinates of $\theta$ constrained to equal ${1}$. Also note that the normalization constant in (1) is a function of $\theta$ (since $\theta$ fully specifies the distribution over ${x}$), so we can express (1) more explicitly as

$\displaystyle p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta)), \ \ \ \ \ (3)$

where

$\displaystyle A(\theta) = \log\left(\int h(x)\exp(\theta^T\phi(x)) d(x)\right). \ \ \ \ \ (4)$

Exponential families are capable of capturing almost all of the common distributions you are familiar with. There is an extensive table on Wikipedia; I’ve also included some of the most common below:

1. Gaussian distributions. Let ${\phi(x) = \left[ \begin{array}{c} x \\ x^2\end{array} \right]}$. Then ${p(x \mid \theta) \propto \exp(\theta_1x+\theta_2x^2)}$. If we let ${\theta = \left[\frac{\mu}{\sigma^2},-\frac{1}{2\sigma^2}\right]}$, then ${p(x \mid \theta) \propto \exp(\frac{\mu x}{\sigma^2}-\frac{x^2}{2\sigma^2}) \propto \exp(-\frac{1}{2\sigma^2}(x-\mu)^2)}$. We therefore see that Gaussian distributions are an exponential family for ${\phi(x) = \left[ \begin{array}{c} x \\ x^2 \end{array} \right]}$.
2. Poisson distributions. Let ${\phi(x) = [x]}$ and ${h(x) = \left\{\begin{array}{ccc} \frac{1}{x!} & : & x \in \{0,1,2,\ldots\} \\ 0 & : & \mathrm{else} \end{array}\right.}$. Then ${p(k \mid \theta) \propto \frac{1}{k!}\exp(\theta x)}$. If we let ${\theta_1 = \log(\lambda)}$ then we get ${p(k \mid \theta) \propto \frac{\lambda^k}{k!}}$; we thus see that Poisson distributions are also an exponential family.
3. Multinomial distributions. Suppose that ${X = \{1,2,\ldots,n\}}$. Let ${\phi(k)}$ be an ${n}$-dimensional vector whose ${k}$th element is ${1}$ and where all other elements are zero. Then ${p(k \mid \theta) \propto \exp(\theta_k) \propto \frac{\exp(\theta_k)}{\sum_{k=1}^n \exp(\theta_k)}}$. If ${\theta_k = \log P(x=k)}$, then we obtain an arbitrary multinomial distribution. Therefore, multinomial distributions are also an exponential family.

2. Sufficient Statistics

A statistic of a random variable ${X}$ is any deterministic function of that variable. For instance, if ${X = [X_1,\ldots,X_n]^T}$ is a vector of Gaussian random variables, then the sample mean ${\hat{\mu} := (X_1+\ldots+X_n)/n}$ and sample variance ${\hat{\sigma}^2 := (X_1^2+\cdots+X_n^2)/n-(X_1+\cdots+X_n)^2/n^2}$ are both statistics.

Let ${\mathcal{F}}$ be a family of distributions parameterized by $\theta$, and let ${X}$ be a random variable with distribution given by some unknown ${\theta_0}$. Then a vector ${T(X)}$ of statistics are called sufficient statistics for ${\theta_0}$ if they contain all possible information about ${\theta_0}$, that is, for any function ${f}$, we have

$\displaystyle \mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] = S(f,T_0), \ \ \ \ \ (5)$

for some function ${S}$ that has no dependence on ${\theta_0}$.

For instance, let ${X}$ be a vector of ${n}$ independent Gaussian random variables ${X_1,\ldots,X_n}$ with unknown mean ${\mu}$ and variance ${\sigma}$. It turns out that ${T(X) := [\hat{\mu},\hat{\sigma}^2]}$ is a sufficient statistic for ${\mu}$ and ${\sigma}$. This is not immediately obvious; a very useful tool for determining whether statistics are sufficient is the Fisher-Neyman factorization theorem:

Theorem 1 (Fisher-Neyman) Suppose that ${X}$ has a probability density function ${p(X \mid \theta)}$. Then the statistics ${T(X)}$ are sufficient for $\theta$ if and only if ${p(X \mid \theta)}$ can be written in the form

$\displaystyle p(X \mid \theta) = h(X)g_\theta(T(X)). \ \ \ \ \ (6)$

In other words, the probability of ${X}$ can be factored into a part that does not depend on $\theta$, and a part that depends on $\theta$ only via ${T(X)}$.

What is going on here, intuitively? If ${p(X \mid \theta)}$ depended only on ${T(X)}$, then ${T(X)}$ would definitely be a sufficient statistic. But that isn’t the only way for ${T(X)}$ to be a sufficient statistic — ${p(X \mid \theta)}$ could also just not depend on $\theta$ at all, in which case ${T(X)}$ would trivially be a sufficient statistic (as would anything else). The Fisher-Neyman theorem essentially says that the only way in which ${T(X)}$ can be a sufficient statistic is if its density is a product of these two cases.

Proof: If (6) holds, then we can check that (5) is satisfied:

$\displaystyle \begin{array}{rcl} \mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] &=& \frac{\int_{T(X) = T_0} f(X) dp(X \mid \theta=\theta_0)}{\int_{T(X) = T_0} dp(X \mid \theta=\theta_0)}\\ \\ &=& \frac{\int_{T(X)=T_0} f(X)h(X)g_\theta(T_0) dX}{\int_{T(X)=T_0} h(X)g_\theta(T_0) dX}\\ \\ &=& \frac{\int_{T(X)=T_0} f(X)h(X)dX}{\int_{T(X)=T_0} h(X) dX}, \end{array}$

where the right-hand-side has no dependence on $\theta$.

On the other hand, if we compute ${\mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0]}$ for an arbitrary density ${p(X)}$, we get

$\displaystyle \begin{array}{rcl} \mathbb{E}[f(X) \mid T(X) = T_0, \theta = \theta_0] &=& \int_{T(X) = T_0} f(X) \frac{p(X \mid \theta=\theta_0)}{\int_{T(X)=T_0} p(X \mid \theta=\theta_0) dX} dX. \end{array}$

If the right-hand-side cannot depend on $\theta$ for any choice of ${f}$, then the term that we multiply ${f}$ by must not depend on $\theta$; that is, ${\frac{p(X \mid \theta=\theta_0)}{\int_{T(X) = T_0} p(X \mid \theta=\theta_0) dX}}$ must be some function ${h_0(X, T_0)}$ that depends only on ${X}$ and ${T_0}$ and not on $\theta$. On the other hand, the denominator ${\int_{T(X)=T_0} p(X \mid \theta=\theta_0) dX}$ depends only on ${\theta_0}$ and ${T_0}$; call this dependence ${g_{\theta_0}(T_0)}$. Finally, note that ${T_0}$ is a deterministic function of ${X}$, so let ${h(X) := h_0(X,T(X))}$. We then see that ${p(X \mid \theta=\theta_0) = h_0(X, T_0)g_{\theta_0}(T_0) = h(X)g_{\theta_0}(T(X))}$, which is the same form as (6), thus completing the proof of the theorem. $\Box$

Now, let us apply the Fisher-Neyman theorem to exponential families. By definition, the density for an exponential family factors as

$\displaystyle p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta)).$

If we let ${T(x) = \phi(x)}$ and ${g_\theta(\phi(x)) = \exp(\theta^T\phi(x)-A(\theta))}$, then the Fisher-Neyman condition is met; therefore, ${\phi(x)}$ is a vector of sufficient statistics for the exponential family. In fact, we can go further:

Theorem 2 Let ${X_1,\ldots,X_n}$ be drawn independently from an exponential family distribution with fixed parameter $\theta$. Then the empirical expectation ${\hat{\phi} := \frac{1}{n} \sum_{i=1}^n \phi(X_i)}$ is a sufficient statistic for $\theta$.

Proof: The density for ${X_1,\ldots,X_n}$ given $\theta$ is

$\displaystyle \begin{array}{rcl} p(X_1,\ldots,X_n \mid \theta) &=& h(X_1)\cdots h(X_n) \exp(\theta^T\sum_{i=1}^n \phi(X_i) - nA(\theta)) \\ &=& h(X_1)\cdots h(X_n)\exp(n [\hat{\phi}-A(\theta)]). \end{array}$

Letting ${h(X_1,\ldots,X_n) = h(X_1)\cdots h(X_n)}$ and ${g_\theta(\hat{\phi}) = \exp(n[\hat{\phi}-A(\theta)])}$, we see that the Fisher-Neyman conditions are satisfied, so that ${\hat{\phi}}$ is indeed a sufficient statistic. $\Box$

Finally, we note (without proof) the same relationship as in the log-linear case to the gradient and Hessian of ${p(X_1,\ldots,X_n \mid \theta)}$ with respect to the model parameters:

Theorem 3 Again let ${X_1,\ldots,X_n}$ be drawn from an exponential family distribution with parameter $\theta$. Then the gradient of ${p(X_1,\ldots,X_n \mid \theta)}$ with respect to $\theta$ is

$\displaystyle n \times \left(\hat{\phi}-\mathbb{E}[\phi \mid \theta]\right)$

and the Hessian is

$\displaystyle n \times \left(\mathbb{E}[\phi \mid \theta]\mathbb{E}[\phi \mid \theta]^T - \mathbb{E}[\phi\phi^T \mid \theta]\right).$

This theorem provides an efficient algorithm for fitting the parameters of an exponential family distribution (for details on the algorithm, see the part near the end of the log-linear models post on parameter estimation).

3. Moments of an Exponential Family

If ${X}$ is a real-valued random variable, then the ${p}$th moment of ${X}$ is ${\mathbb{E}[X^p]}$. In general, if ${X = [X_1,\ldots,X_n]^T}$ is a random variable on ${\mathbb{R}^n}$, then for every sequence ${p_1,\ldots,p_n}$ of non-negative integers, there is a corresponding moment ${M_{p_1,\cdots,p_n} := \mathbb{E}[X_1^{p_1}\cdots X_n^{p_n}]}$.

In exponential families there is a very nice relationship between the normalization constant ${A(\theta)}$ and the moments of ${X}$. Before we establish this relationship, let us define the moment generating function of a random variable ${X}$ as ${f(\lambda) = \mathbb{E}[\exp(\lambda^TX)]}$.

Lemma 4 The moment generating function for a random variable ${X}$ is equal to

$\displaystyle \sum_{p_1,\ldots,p_n=0}^{\infty} M_{p_1,\cdots,p_n} \frac{\lambda_1^{p_1}\cdots \lambda_n^{p_n}}{p_1!\cdots p_n!}.$

The proof of Lemma 4 is a straightforward application of Taylor’s theorem, together with linearity of expectation (note that in one dimension, the expression in Lemma 4 would just be ${\sum_{p=0}^{\infty} \mathbb{E}[X^p] \frac{\lambda^p}{p!}}$).

We now see why ${f(\lambda)}$ is called the moment generating function: it is the exponential generating function for the moments of ${X}$. The moment generating function for the sufficient statistics of an exponential family is particularly easy to compute:

Lemma 5 If ${p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta))}$, then ${\mathbb{E}[\exp(\lambda^T\phi(x))] = \exp(A(\theta+\lambda)-A(\theta))}$.

Proof:

$\displaystyle \begin{array}{rcl} \mathbb{E}[\exp(\lambda^Tx)] &=& \int \exp(\lambda^Tx) p(x \mid \theta) dx \\ &=& \int \exp(\lambda^Tx)h(x)\exp(\theta^T\phi(x)-A(\theta)) dx \\ &=& \int h(x)\exp((\theta+\lambda)^T\phi(x)-A(\theta)) dx \\ &=& \int h(x)\exp((\theta+\lambda)^T\phi(x)-A(\theta+\lambda))dx \times \exp(A(\theta+\lambda)-A(\theta)) \\ &=& \int p(x \mid \theta+\lambda) dx \times \exp(A(\theta+\lambda)-A(\theta)) \\ &=& \exp(A(\theta+\lambda)-A(\theta)), \end{array}$

where the last step uses the fact that ${p(x \mid \theta+\lambda)}$ is a probability density and hence ${\int p(x \mid \theta+\lambda) dx = 1}$. $\Box$

Now, by Lemma 4, ${M_{p_1,\cdots,p_n}}$ is just the ${(p_1,\ldots,p_n)}$ coefficient in the Taylor series for the moment generating function ${f(\lambda)}$, and hence we can compute ${M_{p_1,\cdots,p_n}}$ as ${\frac{\partial^{p_1+\cdots+p_n} f(\lambda)}{\partial^{p_1}\lambda_1\cdots \partial^{p_n}\lambda_n}}$. Combining this with Lemma 5 gives us a closed-form expression for ${M_{p_1,\cdots,p_n}}$ in terms of the normalization constant ${A(\theta)}$:

Lemma 6 The moments of an exponential family can be computed as

$\displaystyle M_{p_1,\ldots,p_n} = \frac{\partial^{p_1+\cdots+p_n} \exp(A(\theta+\lambda)-A(\theta))}{\partial^{p_1}\lambda_1\cdots \partial^{p_n}\lambda_n}.$

For those who prefer cumulants to moments, I will note that there is a version of Lemma 6 for cumulants with an even simpler formula.

Exercise: Use Lemma 6 to compute ${\mathbb{E}[X^6]}$, where ${X}$ is a Gaussian with mean ${\mu}$ and variance ${\sigma^2}$.

4. Conjugate Priors

Given a family of distributions ${p(X \mid \theta)}$, a conjugate prior family ${p(\theta \mid \alpha)}$ is a family that has the property that

$\displaystyle p(\theta \mid X, \alpha) = p(\theta \mid \alpha')$

for some ${\alpha'}$ depending on ${\alpha}$ and ${X}$. In other words, if the prior over $\theta$ lies in the conjugate family, and we observe ${X}$, then the posterior over $\theta$ also lies in the conjugate family. This is very useful algebraically as it means that we can get our posterior simply by updating the parameters of the prior. The following are examples of conjugate families:

1. (Gaussian-Gaussian) Let ${p(X \mid \mu) \propto \exp((X-\mu)^2/2)}$, and let ${p(\mu \mid \mu_0, \sigma_0) \propto \exp((\mu-\mu_0)^2/2\sigma_0^2)}$. Then, by Bayes’ rule,

$\displaystyle \begin{array}{rcl} p(\mu \mid X=x, \mu_0, \sigma_0) &\propto \exp((x-\mu)^2/2)\exp((\mu-\mu_0)^2/2\sigma_0^2) \\ &= &\exp\left(\frac{(\mu-\mu_0)^2+\sigma_0^2(\mu-x)^2}{2\sigma_0^2}\right) \\ &\propto& \exp\left(\frac{(1+\sigma_0)^2\mu^2-2(\mu_0+\sigma_0^2x)\mu}{2\sigma_0^2}\right) \\ &\propto& \exp\left(\frac{\mu^2-2\frac{\mu_0+x\sigma_0^2}{1+\sigma_0^2}\mu}{2\sigma_0^2/(1+\sigma_0^2)}\right) \\ &\propto& \exp\left(\frac{(\mu-(\mu_0+x\sigma_0^2)/(1+\sigma_0^2))^2}{2\sigma_0^2/(1+\sigma_0^2)}\right) \\ &\propto& p\left(\mu \mid \frac{\mu_0+x\sigma_0^2}{1+\sigma_0^2}, \frac{\sigma_0}{\sqrt{1+\sigma_0^2}}\right). \end{array}$

Therefore, ${\mu_0, \sigma_0}$ parameterize a family of priors over ${\mu}$ that is conjugate to ${X \mid \mu}$.

• (Beta-Bernoulli) Let ${X \in \{0,1\}}$, ${\theta \in [0,1]}$, ${p(X=1 \mid \theta) = \theta}$, and ${p(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}}$. The distribution over ${X}$ given $\theta$ is then called a Bernoulli distribution, and that of $\theta$ given ${\alpha}$ and ${\beta}$ is called a beta distribution. Note that ${p(X\mid \theta)}$ can also be written as ${\theta^X(1-\theta)^{1-X}}$. From this, we see that the family of beta distributions is a conjugate prior to the family of Bernoulli distributions, since

$\displaystyle \begin{array}{rcl} p(\theta \mid X=x, \alpha, \beta) &\propto& \theta^x(1-\theta)^{1-x} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} \\ &=& \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1} \\ &\propto& p(\theta \mid \alpha+x, \beta+(1-x)). \end{array}$

• (Gamma-Poisson) Let ${p(X=k \mid \lambda) = \frac{\lambda^k}{e^{\lambda}k!}}$ for ${k \in \mathbb{Z}_{\geq 0}}$. Let ${p(\lambda \mid \alpha, \beta) \propto \lambda^{\alpha-1}\exp(-\beta \lambda)}$. As noted before, the distribution for ${X}$ given ${\lambda}$ is called a Poisson distribution; the distribution for ${\lambda}$ given ${\alpha}$ and ${\beta}$ is called a gamma distribution. We can check that the family of gamma distributions is conjugate to the family of Poisson distributions.Important note: unlike in the last two examples, the normalization constant for the Poisson distribution actually depends on ${\lambda}$, and so we need to include it in our calculations:

$\displaystyle \begin{array}{rcl} p(\lambda \mid X=k, \alpha, \beta) &\propto& \frac{\lambda^k}{e^{\lambda}k!} \times \lambda^{\alpha-1}\exp(-\beta\lambda) \\ &\propto& \lambda^{\alpha+k-1}\exp(-(\beta+1)\lambda) \\ &\propto& p(\lambda \mid \alpha+k, \beta+1). \end{array}$

Note that, in general, a family of distributions will always have some conjugate family, as if nothing else the family of all probability distributions over $\theta$ will be a conjugate family. What we really care about is a conjugate family that itself has nice properties, such as tractably computable moments.

Conjugate priors have a very nice relationship to exponential families, established in the following theorem:

Theorem 7 Let ${p(x \mid \theta) = h(x)\exp(\theta^T\phi(x)-A(\theta))}$ be an exponential family. Then ${p(\theta \mid \eta, \kappa) \propto h_2(\theta)\exp\left(\eta^T\theta-\kappa A(\theta)\right)}$ is a conjugate prior for ${x \mid \theta}$ for any choice of ${h_2}$. The update formula is ${p(\theta \mid x, \eta, \kappa) = p(\theta \mid \eta+\phi(x), \kappa+1)}$. Furthermore, ${\theta \mid \phi, \kappa}$ is itself an exponential family, with sufficient statistics ${[\theta; A(\theta)]}$.

Checking the theorem is a matter of straightforward algebra, so I will leave the proof as an exercise to the reader. Note that, as before, there is no guarantee that ${p(\theta \mid \eta, \kappa)}$ will be tractable; however, in many cases the conjugate prior given by Theorem 7 is a well-behaved family. See this Wikipedia page for examples of conjugate priors, many of which correspond to exponential family distributions.

5. Maximum Entropy and Duality

The final property of exponential families I would like to establish is a certain duality property. What I mean by this is that exponential families can be thought of as the maximum entropy distributions subject to a constraint on the expected value of their sufficient statistics. For those unfamiliar with the term, the entropy of a distribution over ${X}$ with density ${p(X)}$ is ${\mathbb{E}[-\log p(X)] := -\int p(x)\log(p(x)) dx}$. Intuitively, higher entropy corresponds to higher uncertainty, so a maximum entropy distribution is one specifying as much uncertainty as possible given a certain set of information (such as the values of various moments). This makes them appealing, at least in theory, from a modeling perspective, since they “encode exactly as much information as is given and no more”. (Caveat: this intuition isn’t entirely valid, and in practice maximum-entropy distributions aren’t always necessarily appropriate.)

In any case, the duality property is captured in the following theorem:

Theorem 8 The distribution over ${X}$ with maximum entropy such that ${\mathbb{E}[\phi(X)] = T}$ lies in the exponential family with sufficient statistic ${\phi(X)}$ and ${h(X) = 1}$.

Proving this fully rigorously requires the calculus of variations; I will instead give the “physicist’s proof”. Proof: } Let ${p(X)}$ be the density for ${X}$. Then we can view ${p}$ as the solution to the constrained maximization problem:

$\displaystyle \begin{array}{rcl} \mathrm{maximize} && -\int p(X) \log p(X) dX \\ \mathrm{subject \ to} && \int p(X) dX = 1 \\ && \int p(X) \phi(X) dX = T. \end{array}$

By the method of Lagrange multipliers, there exist ${\alpha}$ and ${\lambda}$ such that

$\displaystyle \frac{d}{dp}\left(-\int p(X)\log p(X) dX - \alpha [\int p(X) dX-1] - \lambda^T[\int \phi(X) p(X) dX-T]\right) = 0.$

This simplifies to:

$\displaystyle -\log p(X) - 1 - \alpha -\lambda^T \phi(X) = 0,$

which implies

$\displaystyle p(X) = \exp(-1-\alpha-\lambda^T\phi(X))$

for some ${\alpha}$ and ${\lambda}$. In particular, if we let ${\lambda = -\theta}$ and ${\alpha = A(\theta)-1}$, then we recover the exponential family with ${h(X) = 1}$, as claimed. $\Box$

6. Conclusion

Hopefully I have by now convinced you that exponential families have many nice properties: they have conjugate priors, simple-to-fit parameters, and easily-computed moments. While exponential families aren’t always appropriate models for a given situation, their tractability makes them the model of choice when no other information is present; and, since they can be obtained as maximum-entropy families, they are actually appropriate models in a wide family of circumstances.

## Algebra trick of the day

I’ve decided to start recording algebra tricks as I end up using them. Today I actually have two tricks, but they end up being used together a lot. I don’t know if they have more formal names, but I call them the “trace trick” and the “rank 1 relaxation”.

Suppose that we want to maximize the Rayleigh quotient $\frac{x^TAx}{x^Tx}$ of a matrix $A$. There are many reasons we might want to do this, for instance of $A$ is symmetric then the maximum corresponds to the largest eigenvalue. There are also many ways to do this, and the one that I’m about to describe is definitely not the most efficient, but it has the advantage of being flexible, in that it easily generalizes to constrained maximizations, etc.

The first observation is that $\frac{x^TAx}{x^Tx}$ is homogeneous, meaning that scaling $x$ doesn’t affect the result. So, we can assume without loss of generality that $x^Tx = 1$, and we end up with the optimization problem:

maximize $x^TAx$

subject to $x^Tx = 1$

This is where the trace trick comes in. Recall that the trace of a matrix is the sum of its diagonal entries. We are going to use two facts: first, the trace of a number is just the number itself. Second, trace(AB) = trace(BA). (Note, however, that trace(ABC) is not in general equal to trace(BAC), although trace(ABC) is equal to trace(CAB).) We use these two properties as follows — first, we re-write the optimization problem as:

maximize $Trace(x^TAx)$

subject to $Trace(x^Tx) = 1$

Second, we re-write it again using the invariance of trace under cyclic permutations:

maximize $Trace(Axx^T)$

subject to $Trace(xx^T) = 1$

Now we make the substitution $X = xx^T$:

maximize $Trace(AX)$

subject to $Trace(X) = 1, X = xx^T$

Finally, note that a matrix $X$ can be written as $xx^T$ if and only if $X$ is positive semi-definite and has rank 1. Therefore, we can further write this as

maximize $Trace(AX)$

subject to $Trace(X) = 1, Rank(X) = 1, X \succeq 0$

Aside from the rank 1 constraint, this would be a semidefinite program, a type of problem that can be solved efficiently. What happens if we drop the rank 1 constraint? Then I claim that the solution to this program would be the same as if I had kept the constraint in! Why is this? Let’s look at the eigendecomposition of $X$, written as $\sum_{i=1}^n \lambda_i x_ix_i^T$, with $\lambda_i \geq 0$ (by positive semidefiniteness) and $\sum_{i=1}^n \lambda_i = 1$ (by the trace constraint). Let’s also look at $Trace(AX)$, which can be written as $\sum_{i=1}^n \lambda_i Trace(Ax_ix_i^T)$. Since $Trace(AX)$ is just a convex combination of the $Trace(Ax_ix_i^T)$, we might as well have just picked $X$ to be $x_ix_i^T$, where $i$ is chosen to maximize $Trace(Ax_ix_i^T)$. If we set that $\lambda_i$ to 1 and all the rest to 0, then we maintain all of the constraints while increasing $Trace(AX)$, meaning that we couldn’t have been at the optimum value of $X$ unless $n$ was equal to 1. What we have shown, then, is that the rank of $X$ must be 1, so that the rank 1 constraint was unnecessary.

Technically, $X$ could be a linear combination of rank 1 matrices that all have the same value of $Trace(AX)$, but in that case we could just pick any one of those matrices. So what I have really shown is that at least one optimal point has rank 1, and we can recover such a point from any solution, even if the original solution was not rank 1.

Here is a problem that uses a similar trick. Suppose we want to find $x$ that simultaneously satisfies the equations:

$b_i = |a_i^Tx|^2$

for each $i = 1,\ldots,n$ (this example was inspired from the recent NIPS paper by Ohlsson, Yang, Dong, and Sastry, although the idea itself goes at least back to Candes, Strohmer, and Voroninski). Note that this is basically equivalent to solving a system of linear equations where we only know each equation up to a sign (or a phase, in the complex case). Therefore, in general, this problem will not have a unique solution. To ensure the solution is unique, let us assume the very strong condition that whenever $a_i^TVa_i = 0$ for all $i = 1,\ldots,n$, the matrix $V$ must itself be zero (note: Candes et al. get away with a much weaker condition). Given this, can we phrase the problem as a semidefinite program? I highly recommend trying to solve this problem on your own, or at least reducing it to a rank-constrained SDP, so I’ll include the solution below a fold.

## Log-Linear Models

I’ve spent most of my research career trying to build big, complex nonparametric models; however, I’ve more recently delved into the realm of natural language processing, where how awesome your model looks on paper is irrelevant compared to how well it models your data. In the spirit of this new work (and to lay the groundwork for a later post on NLP), I’d like to go over a family of models that I think is often overlooked due to not being terribly sexy (or at least, I overlooked it for a good while). This family is the family of log-linear models, which are models of the form:

$\displaystyle p(x \mid \theta) \propto e^{\phi(x)^T\theta},$

where ${\phi}$ maps a data point to a feature vector; they are called log-linear because the log of the probability is a linear function of ${\phi(x)}$. We refer to ${\phi(x)^T\theta}$ as the score of ${x}$.

This model class might look fairly restricted at first, but the real magic comes in with the feature vector ${\phi}$. In fact, every probabilistic model that is absolutely continuous with respect to Lebesgue measure can be represented as a log-linear model for sufficient choices of ${\phi}$ and $\theta$. This is actually trivially true, as we can just take ${\phi : X \rightarrow \mathbb{R}}$ to be ${\log p(x)}$ and $\theta$ to be ${1}$.

You might object to this choice of ${\phi}$, since it maps into ${\mathbb{R}}$ rather than ${\{0,1\}^n}$, and feature vectors are typically discrete. However, we can do just as well by letting ${\phi : X \rightarrow \{0,1\}^{\infty}}$, where the ${i}$th coordinate of ${\phi(x)}$ is the ${i}$th digit in the binary representation of ${\log p(x)}$, then let $\theta$ be the vector ${\left(\frac{1}{2},\frac{1}{4},\frac{1}{8},\ldots\right)}$.

It is important to distinguish between the ability to represent an arbitrary model as log-linear and the ability to represent an arbitrary family of models as a log-linear family (that is, as the set of models we get if we fix a choice of features ${\phi}$ and then vary $\theta$). When we don’t know the correct model in advance and want to learn it, this latter consideration can be crucial. Below, I give two examples of model families and discuss how they fit (or do not fit) into the log-linear framework. Important caveat: in both of the models below, it is typically the case that at least some of the variables involved are unobserved. However, we will ignore this for now, and assume that, at least at training time, all of the variables are fully observed (in other words, we can see ${x_i}$ and ${y_i}$ in the hidden Markov model and we can see the full tree of productions in the probabilistic context free grammar).

Hidden Markov Models. A hidden Markov model, or HMM, is a model with latent (unobserved) variables ${x_1,\ldots,x_n}$ together with observed variables ${y_1,\ldots,y_n}$. The distribution for ${y_i}$ depends only on ${x_i}$, and the distribution for ${x_i}$ depends only on ${x_{i-1}}$ (in the sense that ${x_i}$ is conditionally independent of ${x_1,\ldots,x_{i-2}}$ given ${x_{i-1}}$). We can thus summarize the information in an HMM with the distributions ${p(x_{i} = t \mid x_{i-1} = s)}$ and ${p(y_i = u \mid x_i = s)}$.

We can express a hidden Markov model as a log-linear model by defining two classes of features: (i) features ${\phi_{s,t}}$ that count the number of ${i}$ such that ${x_{i-1} = s}$ and ${x_i = t}$; and (ii) features ${\psi_{s,u}}$ that count the number of ${i}$ such that ${x_i = s}$ and ${y_i = u}$. While this choice of features yields a model family capable of expressing an arbitrary hidden Markov model, it is also capable of learning models that are not hidden Markov models. In particular, we would like to think of ${\theta_{s,t}}$ (the index of $\theta$ corresponding to ${\phi_{s,t}}$) as ${\log p(x_i=t \mid x_{i-1}=s)}$, but there is no constraint that ${\sum_{t} \exp(\theta_{s,t}) = 1}$ for each ${s}$, whereas we do necessarily have ${\sum_{t} p(x_i = t \mid x_{i-1}=s) = 1}$ for each ${s}$. If ${n}$ is fixed, we still do obtain an HMM for any setting of $\theta$, although ${\theta_{s,t}}$ will have no simple relationship with ${\log p(x_i = t \mid x_{i-1} = s)}$. Furthermore, the relationship depends on ${n}$, and will therefore not work if we care about multiple Markov chains with different lengths.

Is the ability to express models that are not HMMs good or bad? It depends. If we know for certain that our data satisfy the HMM assumption, then expanding our model family to include models that violate that assumption can only end up hurting us. If the data do not satisfy the HMM assumption, then increasing the size of the model family may allow us to overcome what would otherwise be a model mis-specification. I personally would prefer to have as much control as possible about what assumptions I make, so I tend to see the over-expressivity of HMMs as a bug rather than a feature.

Probabilistic Context Free Grammars. A probabilistic context free grammar, or PCFG, is simply a context free grammar where we place a probability distribution over the production rules for each non-terminal. For those unfamiliar with context free grammars, a context free grammar is specified by:

1. A set ${\mathcal{S}}$ of non-terminal symbols, including a distinguished initial symbol ${E}$.
2. A set ${\mathcal{T}}$ of terminal symbols.
3. For each ${s \in S}$, one or more production rules of the form ${s \mapsto w_1w_2\cdots w_k}$, where ${k \geq 0}$ and ${w_i \in \mathcal{S} \cup \mathcal{T}}$.

For instance, a context free grammar for arithmetic expressions might have ${\mathcal{S} = \{E\}}$, ${\mathcal{T} = \{+,-,\times,/,(,)\} \cup \mathbb{R}}$, and the following production rules:

• ${E \mapsto x}$ for all ${x \in \mathbb{R}}$
• ${E \mapsto E + E}$
• ${E \mapsto E - E}$
• ${E \mapsto E \times E}$
• ${E \mapsto E / E}$
• ${E \mapsto (E)}$

The language corresponding to a context free grammar is the set of all strings that can be obtained by starting from ${E}$ and applying production rules until we only have terminal symbols. The language corresponding to the above grammar is, in fact, the set of well-formed arithmetic expressions, such as ${5-4-2}$, ${2-3\times (4.3)}$, and ${5/9927.12/(3-3\times 1)}$.

As mentioned above, a probabilistic context free grammar simply places a distribution over the production rules for any given non-terminal symbol. By repeatedly sampling from these distributions until we are left with only terminal symbols, we obtain a probability distribution over the language of the grammar.

We can represent a PCFG as a log-linear model by using a feature ${\phi_r}$ for each production rule ${r}$. For instance, we have a feature that counts the number of times that the rule ${E \mapsto E + E}$ gets applied, and another feature that counts the number of times that ${E \mapsto (E)}$ gets applied. Such features yield a log-linear model family that contains all probabilistic context free grammars for a given (deterministic) context free grammar. However, it also contains additional models that do not correspond to PCFGs; this is because we run into the same problem as for HMMs, which is that the sum of ${\exp(\theta_r)}$ over production rules of a given non-terminal does not necessarily add up to ${1}$. In fact, the problem is even worse here. For instance, suppose that ${\theta_{E \mapsto E + E} = 0.1}$ in the model above. Then the expression ${E+E+E+E+E+E}$ gets a score of ${0.5}$, and longer chains of ${E}$s get even higher scores. In particular, there is an infinite sequence of expressions with increasing scores and therefore the model doesn’t normalize (since the sum of the exponentiated scores of all possible productions is infinite).

So, log-linear models over-represent PCFGs in the same way as they over-represent HMMs, but the problems are even worse than before. Let’s ignore these issues for now, and suppose that we want to learn PCFGs with an unknown underlying CFG. To be a bit more concrete, suppose that we have a large collection of possible production rules for each non-terminal ${s \in \mathcal{S}}$, and we think that a small but unknown subset of those production rules should actually appear in the grammar. Then there is no way to encode this directly within the context of a log-linear model family, although we can encode such “sparsity constraints” using simple extensions to log-linear models (for instance, by adding a penalty for the number of non-zero entries in $\theta$). So, we have found another way in which the log-linear representation is not entirely adequate.

Conclusion. Based on the examples above, we have seen that log-linear models have difficulty placing constraints on latent variables. This showed up in two different ways: first, we are unable to constrain subsets of variables to add up to ${1}$ (what I call “local normalization” constraints); second, we are unable to encode sparsity constraints within the model. In both of these cases, it is possible to extend the log-linear framework to address these sorts of constraints, although that is outside the scope of this post.

Parameter Estimation for Log-Linear Models

I’ve explained what a log-linear model is, and partially characterized its representational power. I will now answer the practical question of how to estimate the parameters of a log-linear model (i.e., how to fit $\theta$ based on observed data). Recall that a log-linear model places a distribution over a space ${X}$ by choosing ${\phi : X \rightarrow \mathbb{R}^n}$ and ${\theta \in \mathbb{R}^n}$ and defining

$\displaystyle p(x \mid \theta) \propto \exp(\phi(x)^T\theta)$

More precisely (assuming ${X}$ is a discrete space), we have

$\displaystyle p(x \mid \theta) = \frac{\exp(\phi(x)^T\theta)}{\sum_{x' \in X} \exp(\phi(x')^T\theta)}$

Given observations ${x_1,\ldots,x_n}$, which we assume to be independent given $\theta$, our goal is to choose $\theta$ maximizing ${p(x_1,\ldots,x_n \mid \theta)}$, or, equivalently, ${\log p(x_1,\ldots,x_n \mid \theta)}$. In equations, we want

$\displaystyle \theta^* = \arg\max\limits_{\theta} \sum_{i=1}^n \left[\phi(x_i)^T\theta - \log\left(\sum_{x \in X} \exp(\phi(x)^T\theta)\right) \right]. \ \ \ \ \ (1)$

We typically use gradient methods (such as gradient descent, stochastic gradient descent, or L-BFGS) to minimize the right-hand side of (1). If we compute the gradient of (1) then we get:

$\displaystyle \sum_{i=1}^n \left(\phi(x_i)-\frac{\sum_{x \in X} \exp(\phi(x)^T\theta)\phi(x)}{\sum_{x \in X} \exp(\phi(x)^T\theta)}\right). \ \ \ \ \ (2)$

We can re-write (2) in the following more compact form:

$\displaystyle \sum_{i=1}^n \left(\phi(x_i) - \mathbb{E}[\phi(x) \mid \theta]\right). \ \ \ \ \ (3)$

In other words, the contribution of each training example ${x_i}$ to the gradient is the extent to which the features values for ${x_i}$ exceed their expected values conditioned on $\theta$.

One important consideration for such gradient-based numerical optimizers is convexity. If the objective function we are trying to minimize is convex (or concave), then gradient methods are guaranteed to converge to the global optimum. If the objective function is non-convex, then a gradient-based approach (or any other type of local search) may converge to a local optimum that is very far from the global optimum. In order to assess convexity, we compute the Hessian (matrix of second derivatives) and check whether it is positive definite. (In this case, we actually care about concavity, so we want the Hessian to be negative definite.) We can compute the Hessian by differentiating (2), which gives us

$\displaystyle n \times \left[\left(\frac{\sum_{x \in X} \exp(\phi(x)^T\theta)\phi(x)}{\sum_{x \in X} \exp(\phi(x)^T\theta)}\right)\left(\frac{\sum_{x \in X} \exp(\phi(x)^T\theta)\phi(x)}{\sum_{x \in X} \exp(\phi(x)^T\theta)}\right)^T-\frac{\sum_{x \in X} \exp(\phi(x)^T\theta)\phi(x)\phi(x)^T}{\sum_{x \in X} \exp(\phi(x)^T \theta)}\right]. \ \ \ \ \ (4)$

Again, we can re-write this more compactly as

$\displaystyle n\times \left(\mathbb{E}[\phi(x) \mid \theta]\mathbb{E}[\phi(x) \mid \theta]^T - \mathbb{E}[\phi(x)\phi(x)^T \mid \theta]\right). \ \ \ \ \ (5)$

The term inside the parentheses of (5) is exactly the negative of the covariance matrix of ${\phi(x)}$ given $\theta$, and is therefore necessarily negative definite, so the objective function we are trying to minimize is indeed concave, which, as noted before, implies that our gradient methods will always reach the global optimum.

Regularization and Concavity

We may in practice wish to encode additional prior knowledge about $\theta$ in our model, especially if the dimensionality of $\theta$ is large relative to the amount of data we have. Can we do this and still maintain concavity? The answer in many cases is yes: since the ${L^p}$-norm is convex for all ${p \geq 1}$, we can add an ${L^p}$ penalty to the objective for any such ${p}$ and still have a concave objective function.

Conclusion

Log-linear models provide a universal representation for individual probability distributions, but not for arbitrary families of probability distributions (for instance, due to the inability to capture local normalization constraints or sparsity constraints). However, for the families they do express, parameter optimization can be performed efficiently due to a likelihood function that is log-concave in its parameters. Log-linear models also have tie-ins to many other beautiful areas of statistics, such as exponential families, which will be the subject of the next post.