Log-Linear Models
December 6, 2012
I’ve spent most of my research career trying to build big, complex nonparametric models; however, I’ve more recently delved into the realm of natural language processing, where how awesome your model looks on paper is irrelevant compared to how well it models your data. In the spirit of this new work (and to lay the groundwork for a later post on NLP), I’d like to go over a family of models that I think is often overlooked due to not being terribly sexy (or at least, I overlooked it for a good while). This family is the family of log-linear models, which are models of the form:

$$p(x \mid \theta) \propto \exp\left(\phi(x)^T \theta\right),$$

where $\phi$ maps a data point to a feature vector; they are called log-linear because the log of the probability is a linear function of $\phi(x)$. We refer to $\phi(x)^T \theta$ as the score of $x$.
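To make the definition concrete, here is a minimal sketch of a log-linear model over a tiny finite space. The three outcomes, the two indicator features, and the function name `log_linear_probs` are all made up for illustration; the only thing taken from the definition is that probabilities are proportional to the exponentiated score.

```python
import math

def log_linear_probs(points, phi, theta):
    """Return {x: p(x | theta)} for a finite set of points."""
    # Score of x is the inner product of its feature vector with theta.
    scores = {x: sum(f * t for f, t in zip(phi(x), theta)) for x in points}
    # Subtract the max score before exponentiating, for numerical stability.
    m = max(scores.values())
    weights = {x: math.exp(s - m) for x, s in scores.items()}
    z = sum(weights.values())  # normalization constant (up to the factor e^m)
    return {x: w / z for x, w in weights.items()}

# Hypothetical example: three outcomes, two indicator features.
points = ["a", "b", "c"]
phi = lambda x: (1.0 if x == "a" else 0.0, 1.0 if x in ("a", "b") else 0.0)
theta = (math.log(2.0), math.log(3.0))
probs = log_linear_probs(points, phi, theta)
```

With these weights the exponentiated scores are 6, 3, and 1, so the outcomes get probabilities 0.6, 0.3, and 0.1.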
This model class might look fairly restricted at first, but the real magic comes in with the feature vector $\phi$. In fact, every probabilistic model that is absolutely continuous with respect to Lebesgue measure can be represented as a log-linear model for suitable choices of $\phi$ and $\theta$. This is actually trivially true, as we can just take $\phi : X \to \mathbb{R}$ to be $\phi(x) = \log p(x)$ and $\theta$ to be $1$.
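You might object to this choice of $\phi$, since it maps into $\mathbb{R}$ rather than $\{0,1\}^d$, and feature vectors are typically discrete. However, we can do just as well by letting $\phi : X \to \{0,1\}^{\infty}$, where the $i$th coordinate of $\phi(x)$ is the $i$th digit in the binary representation of $\log p(x)$, and then letting $\theta$ be the vector of the corresponding place values (powers of $2$, such as $\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots\right)$ for the digits after the binary point).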
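The binary-digit construction can be sketched numerically. The version below makes some assumptions of its own: it truncates to finitely many digits (`INT_BITS` and `FRAC_BITS` are arbitrary), it works with a discrete distribution so that $-\log p(x)$ is non-negative, and it encodes the digits of $-\log p(x)$ with negated place values as the weights. It is one way to realize the idea, not necessarily the exact encoding intended above.

```python
import math

INT_BITS, FRAC_BITS = 8, 30  # truncation levels, chosen only for illustration

def binary_features(value):
    """Binary digits of a non-negative number below 2**INT_BITS, most significant first."""
    scaled = int(value * 2 ** FRAC_BITS)  # truncates toward zero
    total = INT_BITS + FRAC_BITS
    return [(scaled >> (total - 1 - i)) & 1 for i in range(total)]

# Place value of each digit, negated because the digits encode -log p(x).
theta = [-(2.0 ** (INT_BITS - 1 - i)) for i in range(INT_BITS + FRAC_BITS)]

# Hypothetical distribution to represent.
p = {"a": 0.5, "b": 0.3, "c": 0.2}

scores = {}
for x, px in p.items():
    phi = binary_features(-math.log(px))     # a 0/1 feature vector
    scores[x] = sum(f * t for f, t in zip(phi, theta))

# Largest gap between the score and log p(x); bounded by the truncation error.
max_error = max(abs(scores[x] - math.log(p[x])) for x in p)
```

Each score recovers $\log p(x)$ up to the truncation error of $2^{-30}$, using only binary features.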
It is important to distinguish between the ability to represent an arbitrary model as log-linear and the ability to represent an arbitrary family of models as a log-linear family (that is, as the set of models we get if we fix a choice of features $\phi$ and then vary $\theta$). When we don’t know the correct model in advance and want to learn it, this latter consideration can be crucial. Below, I give two examples of model families and discuss how they fit (or do not fit) into the log-linear framework. Important caveat: in both of the models below, it is typically the case that at least some of the variables involved are unobserved. However, we will ignore this for now, and assume that, at least at training time, all of the variables are fully observed (in other words, we can see $x_i$ and $y_i$ in the hidden Markov model and we can see the full tree of productions in the probabilistic context free grammar).
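Hidden Markov Models. A hidden Markov model, or HMM, is a model with latent (unobserved) variables $y_1, \ldots, y_n$ together with observed variables $x_1, \ldots, x_n$. The distribution for $y_{i+1}$ depends only on $y_i$, and the distribution for $x_i$ depends only on $y_i$ (in the sense that $y_{i+1}$ is conditionally independent of $y_1, \ldots, y_{i-1}$ and $x_1, \ldots, x_i$ given $y_i$). We can thus summarize the information in an HMM with the distributions $p(y_{i+1} = t \mid y_i = s)$ and $p(x_i = u \mid y_i = s)$.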
We can express a hidden Markov model as a log-linear model by defining two classes of features: (i) features $\phi_{s,t}$ that count the number of $i$ such that $y_i = s$ and $y_{i+1} = t$; and (ii) features $\psi_{s,u}$ that count the number of $i$ such that $y_i = s$ and $x_i = u$. While this choice of features yields a model family capable of expressing an arbitrary hidden Markov model, it is also capable of learning models that are not hidden Markov models. In particular, we would like to think of $\theta_{s,t}$ (the coordinate of $\theta$ corresponding to $\phi_{s,t}$) as $\log p(y_{i+1} = t \mid y_i = s)$, but there is no constraint that $\sum_t \exp(\theta_{s,t}) = 1$ for each $s$, whereas we do necessarily have $\sum_t p(y_{i+1} = t \mid y_i = s) = 1$ for each $s$. If $n$ is fixed, we still do obtain an HMM for any setting of $\theta$, although $\theta_{s,t}$ will have no simple relationship with $\log p(y_{i+1} = t \mid y_i = s)$. Furthermore, the relationship depends on $n$, and will therefore not work if we care about multiple Markov chains with different lengths.
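Here is a small sketch of the two feature classes in action. The two-state HMM (states `H` and `C`, observations `1` and `2`) and its probability tables are invented for illustration, and the score is compared against the log-probability conditioned on the first latent state, since the features described above do not include an initial-state term.

```python
import math
from collections import Counter

def hmm_features(ys, xs):
    """Count transitions (s, t) and emissions (s, u) in one fully observed sequence."""
    trans = Counter(zip(ys, ys[1:]))
    emit = Counter(zip(ys, xs))
    return trans, emit

def score(ys, xs, theta_trans, theta_emit):
    """Inner product of the feature counts with the weights."""
    trans, emit = hmm_features(ys, xs)
    return (sum(c * theta_trans[k] for k, c in trans.items())
            + sum(c * theta_emit[k] for k, c in emit.items()))

# Hypothetical HMM tables: A[(s, t)] = p(next = t | current = s), B[(s, u)] = p(obs = u | state = s).
A = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
B = {("H", "1"): 0.2, ("H", "2"): 0.8, ("C", "1"): 0.9, ("C", "2"): 0.1}
theta_trans = {k: math.log(v) for k, v in A.items()}
theta_emit = {k: math.log(v) for k, v in B.items()}

ys, xs = ["H", "H", "C"], ["2", "2", "1"]
s = score(ys, xs, theta_trans, theta_emit)
# Log-probability of the sequence given the first state, read off the tables directly.
direct = math.log(0.7 * 0.3 * 0.8 * 0.8 * 0.9)

# With theta = log-probabilities, each row is locally normalized...
row_sum = sum(math.exp(theta_trans[("H", t)]) for t in ("H", "C"))
# ...but nothing in the log-linear family stops us from shifting the weights.
shifted = {k: v + 1.0 for k, v in theta_trans.items()}
shifted_row_sum = sum(math.exp(shifted[("H", t)]) for t in ("H", "C"))
```

The score matches the HMM log-probability when the weights are log-probabilities, while the shifted weights are a perfectly valid member of the log-linear family whose rows no longer sum to one.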
Is the ability to express models that are not HMMs good or bad? It depends. If we know for certain that our data satisfy the HMM assumption, then expanding our model family to include models that violate that assumption can only end up hurting us. If the data do not satisfy the HMM assumption, then increasing the size of the model family may allow us to overcome what would otherwise be a model mis-specification. I personally would prefer to have as much control as possible over what assumptions I make, so I tend to see the over-expressivity of the log-linear representation of HMMs as a bug rather than a feature.
Probabilistic Context Free Grammars. A probabilistic context free grammar, or PCFG, is simply a context free grammar where we place a probability distribution over the production rules for each non-terminal. For those unfamiliar with context free grammars, a context free grammar is specified by:
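- A set $\mathcal{S}$ of non-terminal symbols, including a distinguished initial symbol $E$.
- A set $\mathcal{T}$ of terminal symbols.
- For each $S \in \mathcal{S}$, one or more production rules of the form $S \to X_1 X_2 \cdots X_k$, where $k \geq 0$ and $X_i \in \mathcal{S} \cup \mathcal{T}$.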
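For instance, a context free grammar for arithmetic expressions might have $\mathcal{S} = \{E\}$, $\mathcal{T} = \{+, -, \times, /, (, )\} \cup \mathbb{R}$, and the following production rules:
- $E \to x$ for all $x \in \mathbb{R}$
- $E \to E + E$
- $E \to E - E$
- $E \to E \times E$
- $E \to E / E$
- $E \to (E)$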
The language corresponding to a context free grammar is the set of all strings that can be obtained by starting from $E$ and applying production rules until we only have terminal symbols. The language corresponding to the above grammar is, in fact, the set of well-formed arithmetic expressions, such as $5$, $4 - 6 \times 2$, and $(3 + 8) / 2$.
As mentioned above, a probabilistic context free grammar simply places a distribution over the production rules for any given non-terminal symbol. By repeatedly sampling from these distributions until we are left with only terminal symbols, we obtain a probability distribution over the language of the grammar.
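The sampling procedure can be sketched in a few lines. The rule probabilities below are invented, the terminals are restricted to single digits rather than arbitrary numbers, and only four rules are included; the probabilities were picked so that each expansion produces fewer than one new non-terminal on average, which keeps the sampled expressions finite.

```python
import random

# (probability, right-hand side) for the single non-terminal E.
# "num" is a placeholder for "pick a digit"; all values here are hypothetical.
RULES = [
    (0.60, ["num"]),
    (0.15, ["E", "+", "E"]),
    (0.15, ["E", "*", "E"]),
    (0.10, ["(", "E", ")"]),
]

def sample_expression(rng):
    """Expand E by sampling production rules until only terminals remain."""
    _, rhs = rng.choices(RULES, weights=[p for p, _ in RULES])[0]
    out = []
    for sym in rhs:
        if sym == "E":
            out.append(sample_expression(rng))   # recurse on the non-terminal
        elif sym == "num":
            out.append(str(rng.randint(0, 9)))   # terminal: a single digit
        else:
            out.append(sym)                      # terminal: operator or parenthesis
    return "".join(out)

rng = random.Random(0)
samples = [sample_expression(rng) for _ in range(5)]
```

The probability of any particular derivation is the product of the probabilities of the rules it uses, which is what makes the log-linear view natural: the log-probability is a sum over rule applications.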
We can represent a PCFG as a log-linear model by using a feature $\phi_r$ for each production rule $r$. For instance, we have a feature that counts the number of times that the rule $E \to E + E$ gets applied, and another feature that counts the number of times that $E \to (E)$ gets applied. Such features yield a log-linear model family that contains all probabilistic context free grammars for a given (deterministic) context free grammar. However, it also contains additional models that do not correspond to PCFGs; this is because we run into the same problem as for HMMs, which is that the sum of $\exp(\theta_r)$ over production rules of a given non-terminal does not necessarily add up to $1$. In fact, the problem is even worse here. For instance, suppose that $\theta_{E \to (E)} > 0$ in the model above. Then the expression $((\cdots(x)\cdots))$ with $n$ pairs of parentheses gets a score of $n \, \theta_{E \to (E)}$ plus the weight of the rule that produces $x$, and longer chains of $($s get even higher scores. In particular, there is an infinite sequence of expressions with increasing scores and therefore the model doesn’t normalize (since the sum of the exponentiated scores of all possible productions is infinite).
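A quick numerical sketch of the normalization failure, under assumed weights: a positive weight of 0.1 on the parenthesis rule and a weight of 0 on the rule that produces the innermost symbol. Both values are arbitrary; any positive weight on the parenthesis rule behaves the same way.

```python
import math

theta_paren = 0.1  # assumed positive weight on the rule E -> (E)
theta_leaf = 0.0   # assumed weight on the rule that produces the innermost symbol

def nested_score(n):
    """Score of a symbol wrapped in n pairs of parentheses: E -> (E) fires n times."""
    return n * theta_paren + theta_leaf

# Partial sums of the exponentiated scores over this one family of expressions.
partial_sums = []
total = 0.0
for n in range(200):
    total += math.exp(nested_score(n))
    partial_sums.append(total)
```

The partial sums form a geometric series with ratio $e^{0.1} > 1$, so they grow without bound, and this family is only a subset of the language. The full normalization constant is therefore infinite.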
So, log-linear models over-represent PCFGs in the same way as they over-represent HMMs, but the problems are even worse than before. Let’s ignore these issues for now, and suppose that we want to learn PCFGs with an unknown underlying CFG. To be a bit more concrete, suppose that we have a large collection of possible production rules for each non-terminal $S$, and we think that a small but unknown subset of those production rules should actually appear in the grammar. Then there is no way to encode this directly within the context of a log-linear model family, although we can encode such “sparsity constraints” using simple extensions to log-linear models (for instance, by adding a penalty for the number of non-zero entries in $\theta$). So, we have found another way in which the log-linear representation is not entirely adequate.
Conclusion. Based on the examples above, we have seen that log-linear models have difficulty placing constraints on latent variables. This showed up in two different ways: first, we are unable to constrain subsets of variables to add up to $1$ (what I call “local normalization” constraints); second, we are unable to encode sparsity constraints within the model. In both of these cases, it is possible to extend the log-linear framework to address these sorts of constraints, although that is outside the scope of this post.
Parameter Estimation for Log-Linear Models
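I’ve explained what a log-linear model is, and partially characterized its representational power. I will now answer the practical question of how to estimate the parameters of a log-linear model (i.e., how to fit $\theta$ based on observed data). Recall that a log-linear model places a distribution over a space $X$ by choosing $\phi : X \to \mathbb{R}^d$ and $\theta \in \mathbb{R}^d$ and defining

$$p(x \mid \theta) \propto \exp\left(\phi(x)^T \theta\right).$$

More precisely (assuming $X$ is a discrete space), we have

$$p(x \mid \theta) = \frac{\exp\left(\phi(x)^T \theta\right)}{\sum_{x' \in X} \exp\left(\phi(x')^T \theta\right)}.$$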
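Given observations $x_1, \ldots, x_n$, which we assume to be independent given $\theta$, our goal is to choose $\theta$ maximizing $p(x_1, \ldots, x_n \mid \theta)$, or, equivalently, $\log p(x_1, \ldots, x_n \mid \theta)$. In equations, we want

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^n \left[ \phi(x_i)^T \theta - \log \sum_{x \in X} \exp\left(\phi(x)^T \theta\right) \right]. \qquad (1)$$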
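We typically use gradient methods (such as gradient descent, stochastic gradient descent, or L-BFGS) to maximize the right-hand side of (1), or equivalently to minimize its negative. If we compute the gradient of (1) then we get:

$$\sum_{i=1}^n \left[ \phi(x_i) - \frac{\sum_{x \in X} \exp\left(\phi(x)^T \theta\right) \phi(x)}{\sum_{x \in X} \exp\left(\phi(x)^T \theta\right)} \right]. \qquad (2)$$

We can re-write (2) in the following more compact form:

$$\sum_{i=1}^n \left( \phi(x_i) - \mathbb{E}\left[\phi(x) \mid \theta\right] \right). \qquad (3)$$

In other words, the contribution of each training example to the gradient is the extent to which the feature values for $x_i$ exceed their expected values conditioned on $\theta$.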
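The gradient described above (observed features minus expected features) can be sketched for a small finite space as follows. The three-point space, the indicator features, the data, and the step size are all made up for illustration, and plain gradient ascent stands in for whichever optimizer one would actually use.

```python
import math

def log_likelihood_and_grad(data, space, phi, theta):
    """Log-likelihood of the data and its gradient, for a finite space."""
    d = len(theta)
    scores = [sum(f * t for f, t in zip(phi(x), theta)) for x in space]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log-sum-exp
    probs = [math.exp(s - log_z) for s in scores]
    # Expected feature vector under the current theta.
    expected = [sum(p * phi(x)[j] for p, x in zip(probs, space)) for j in range(d)]
    ll = sum(sum(f * t for f, t in zip(phi(x), theta)) - log_z for x in data)
    # Observed feature totals minus n times the expected features.
    grad = [sum(phi(x)[j] for x in data) - len(data) * expected[j] for j in range(d)]
    return ll, grad

# Hypothetical setup: outcomes 0, 1, 2 with indicator features for 0 and 1.
space = [0, 1, 2]
phi = lambda x: (float(x == 0), float(x == 1))
data = [0, 0, 1, 2, 0, 1]

theta = [0.0, 0.0]
for _ in range(2000):
    ll, grad = log_likelihood_and_grad(data, space, phi, theta)
    theta = [t + 0.1 * g for t, g in zip(theta, grad)]  # gradient ascent step

final_ll, final_grad = log_likelihood_and_grad(data, space, phi, theta)
```

At the optimum the gradient vanishes, which means the expected features match the empirical averages. Here that yields $\theta = (\log 3, \log 2)$, i.e. fitted probabilities $1/2$, $1/3$, $1/6$, matching the observed frequencies.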
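One important consideration for such gradient-based numerical optimizers is convexity. If the objective function we are trying to minimize is convex (or the one we are trying to maximize is concave), then every local optimum is a global optimum, and gradient methods with suitable step sizes converge to the global optimum whenever one exists. If the objective function is non-convex, then a gradient-based approach (or any other type of local search) may converge to a local optimum that is very far from the global optimum. In order to assess convexity, we compute the Hessian (matrix of second derivatives) and check whether it is positive semidefinite. (In this case, we actually care about concavity, so we want the Hessian to be negative semidefinite.) We can compute the Hessian by differentiating (2), which gives us

$$n \left[ \frac{\left(\sum_{x \in X} \exp\left(\phi(x)^T \theta\right) \phi(x)\right) \left(\sum_{x \in X} \exp\left(\phi(x)^T \theta\right) \phi(x)\right)^T}{\left(\sum_{x \in X} \exp\left(\phi(x)^T \theta\right)\right)^2} - \frac{\sum_{x \in X} \exp\left(\phi(x)^T \theta\right) \phi(x) \phi(x)^T}{\sum_{x \in X} \exp\left(\phi(x)^T \theta\right)} \right]. \qquad (4)$$

Again, we can re-write this more compactly as

$$n \left( \mathbb{E}\left[\phi(x) \mid \theta\right] \mathbb{E}\left[\phi(x) \mid \theta\right]^T - \mathbb{E}\left[\phi(x) \phi(x)^T \mid \theta\right] \right). \qquad (5)$$

The term inside the parentheses of (5) is exactly the negative of the covariance matrix of $\phi(x)$ given $\theta$, and is therefore necessarily negative semidefinite (it is strictly negative definite only when no nontrivial linear combination of the features is constant), so the objective function we are trying to maximize is indeed concave, which, as noted before, implies that any optimum our gradient methods reach is the global optimum.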
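As a sanity check on the curvature argument, the sketch below computes the covariance matrix of the features under a log-linear model and evaluates its quadratic form on random directions. The space, features, and weights are invented; a constant feature is included on purpose to show a direction in which the covariance is exactly flat, so the matrix is positive semidefinite without being strictly positive definite.

```python
import math
import random

def feature_covariance(space, phi, theta):
    """Covariance matrix of phi(x) when x is drawn from the log-linear model."""
    d = len(theta)
    scores = [sum(f * t for f, t in zip(phi(x), theta)) for x in space]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    probs = [wi / z for wi in w]
    feats = [phi(x) for x in space]
    mean = [sum(p * f[j] for p, f in zip(probs, feats)) for j in range(d)]
    return [[sum(p * (f[j] - mean[j]) * (f[k] - mean[k]) for p, f in zip(probs, feats))
             for k in range(d)] for j in range(d)]

# Hypothetical setup; the third feature is constant, so it has zero variance.
space = [0, 1, 2, 3]
phi = lambda x: (float(x), float(x * x), 1.0)
theta = [0.3, -0.2, 0.5]
cov = feature_covariance(space, phi, theta)

# v^T cov v is the variance of v . phi(x), so it should never be negative.
rng = random.Random(0)
quad_forms = []
for _ in range(100):
    v = [rng.gauss(0, 1) for _ in range(3)]
    quad_forms.append(sum(v[j] * cov[j][k] * v[k] for j in range(3) for k in range(3)))

# Squared size of the covariance row belonging to the constant feature.
flat = sum(cov[2][k] ** 2 for k in range(3))
```

Since the Hessian of the log-likelihood is $-n$ times this matrix, non-negative quadratic forms here correspond to a concave objective, and the flat direction corresponds to a direction in which the objective does not change at all.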
Regularization and Concavity
We may in practice wish to encode additional prior knowledge about $\theta$ in our model, especially if the dimensionality of $\theta$ is large relative to the amount of data we have. Can we do this and still maintain concavity? The answer in many cases is yes: since the $L^p$-norm is convex for all $p \geq 1$, we can subtract an $L^p$ penalty from the objective (equivalently, add it to the negative log-likelihood) for any such $p$ and still have a concave objective function.
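The concavity claim can be spot-checked numerically. The sketch below uses a made-up three-outcome model, an arbitrary penalty strength `lam`, and subtracts the norm of the weights from the log-likelihood; it then tests midpoint concavity on random pairs of weight vectors for $p = 1$ and $p = 2$. A random check like this is evidence rather than proof.

```python
import math
import random

# Hypothetical model and data.
space = [0, 1, 2]
phi = lambda x: (float(x == 0), float(x == 1))
data = [0, 0, 1, 2, 0, 1]
lam = 0.5  # assumed penalty strength

def penalized_objective(theta, p=2):
    """Log-likelihood minus lam times the L^p norm of theta."""
    scores = [sum(f * t for f, t in zip(phi(x), theta)) for x in space]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    ll = sum(sum(f * t for f, t in zip(phi(x), theta)) - log_z for x in data)
    norm = sum(abs(t) ** p for t in theta) ** (1.0 / p)
    return ll - lam * norm

# Midpoint concavity: f((a + b) / 2) >= (f(a) + f(b)) / 2 for every pair a, b.
rng = random.Random(1)
gaps = []
for _ in range(200):
    a = [rng.uniform(-3, 3) for _ in range(2)]
    b = [rng.uniform(-3, 3) for _ in range(2)]
    mid = [(s + t) / 2 for s, t in zip(a, b)]
    for p in (1, 2):
        gaps.append(penalized_objective(mid, p)
                    - 0.5 * (penalized_objective(a, p) + penalized_objective(b, p)))
```

Every gap should be non-negative up to floating-point error, as expected for the sum of a concave log-likelihood and the negative of a convex penalty.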
Conclusion
Log-linear models provide a universal representation for individual probability distributions, but not for arbitrary families of probability distributions (for instance, due to the inability to capture local normalization constraints or sparsity constraints). However, for the families they do express, parameter optimization can be performed efficiently due to a likelihood function that is log-concave in its parameters. Log-linear models also have tie-ins to many other beautiful areas of statistics, such as exponential families, which will be the subject of the next post.