<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Academically Interesting</title>
	<atom:link href="http://jsteinhardt.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://jsteinhardt.wordpress.com</link>
	<description>Where I exposit about whatever interests me</description>
	<lastBuildDate>Wed, 12 Jun 2013 21:04:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='jsteinhardt.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Academically Interesting</title>
		<link>http://jsteinhardt.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://jsteinhardt.wordpress.com/osd.xml" title="Academically Interesting" />
	<atom:link rel='hub' href='http://jsteinhardt.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Convexity counterexample</title>
		<link>http://jsteinhardt.wordpress.com/2013/06/12/convexity-counterexample/</link>
		<comments>http://jsteinhardt.wordpress.com/2013/06/12/convexity-counterexample/#comments</comments>
		<pubDate>Wed, 12 Jun 2013 21:04:27 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Math]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=495</guid>
		<description><![CDATA[Here&#8217;s a fun counterexample: a function R^n &#8594; R that is jointly convex in any n-1 of the variables, but not in all variables at once. The function is f(x_1,&#8230;,x_n) = (1/2)(n-1.5) sum_{i=1}^n x_i^2 - sum_{i &lt; j} x_i x_j. To see why this is, note that the Hessian of f is equal to the matrix with n-1.5 on the diagonal and -1 everywhere else. This matrix is equal to (n-0.5)I - J, where I is the identity matrix and J is the all-ones matrix, which [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=495&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Here&#8217;s a fun counterexample: a function <img src='http://s0.wp.com/latex.php?latex=%5Cmathbb%7BR%7D%5En+%5Cto+%5Cmathbb%7BR%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;mathbb{R}^n &#92;to &#92;mathbb{R}' title='&#92;mathbb{R}^n &#92;to &#92;mathbb{R}' class='latex' /> that is jointly convex in any <img src='http://s0.wp.com/latex.php?latex=n-1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n-1' title='n-1' class='latex' /> of the variables, but not in all variables at once. The function is</p>
<p><img src='http://s0.wp.com/latex.php?latex=f%28x_1%2C%5Cldots%2Cx_n%29+%3D+%5Cfrac%7B1%7D%7B2%7D%28n-1.5%29%5Csum_%7Bi%3D1%7D%5En+x_i%5E2+-+%5Csum_%7Bi+%3C+j%7D+x_ix_j&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='f(x_1,&#92;ldots,x_n) = &#92;frac{1}{2}(n-1.5)&#92;sum_{i=1}^n x_i^2 - &#92;sum_{i &lt; j} x_ix_j' title='f(x_1,&#92;ldots,x_n) = &#92;frac{1}{2}(n-1.5)&#92;sum_{i=1}^n x_i^2 - &#92;sum_{i &lt; j} x_ix_j' class='latex' /></p>
<p>To see why this is, note that the Hessian of <img src='http://s0.wp.com/latex.php?latex=f&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='f' title='f' class='latex' /> is equal to</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cleft%5B+%5Cbegin%7Barray%7D%7Bcccc%7D+n-1.5+%26+-1+%26+%5Ccdots+%26+-1+%5C%5C+-1+%26+n-1.5+%26+%5Ccdots+%26+-1+%5C%5C+%5Cvdots+%26+%5Cvdots+%26+%5Cddots+%26+%5Cvdots+%5C%5C+-1+%26+-1+%26+%5Ccdots+%26+n-1.5+%5Cend%7Barray%7D+%5Cright%5D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;left[ &#92;begin{array}{cccc} n-1.5 &amp; -1 &amp; &#92;cdots &amp; -1 &#92;&#92; -1 &amp; n-1.5 &amp; &#92;cdots &amp; -1 &#92;&#92; &#92;vdots &amp; &#92;vdots &amp; &#92;ddots &amp; &#92;vdots &#92;&#92; -1 &amp; -1 &amp; &#92;cdots &amp; n-1.5 &#92;end{array} &#92;right]' title='&#92;left[ &#92;begin{array}{cccc} n-1.5 &amp; -1 &amp; &#92;cdots &amp; -1 &#92;&#92; -1 &amp; n-1.5 &amp; &#92;cdots &amp; -1 &#92;&#92; &#92;vdots &amp; &#92;vdots &amp; &#92;ddots &amp; &#92;vdots &#92;&#92; -1 &amp; -1 &amp; &#92;cdots &amp; n-1.5 &#92;end{array} &#92;right]' class='latex' /></p>
<p>This matrix is equal to <img src='http://s0.wp.com/latex.php?latex=%28n-0.5%29I+-+J&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(n-0.5)I - J' title='(n-0.5)I - J' class='latex' />, where <img src='http://s0.wp.com/latex.php?latex=I&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='I' title='I' class='latex' /> is the identity matrix and <img src='http://s0.wp.com/latex.php?latex=J&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='J' title='J' class='latex' /> is the all-ones matrix, which is rank 1 and whose single non-zero eigenvalue is <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n' title='n' class='latex' />. Therefore, this matrix has <img src='http://s0.wp.com/latex.php?latex=n-1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n-1' title='n-1' class='latex' /> eigenvalues of <img src='http://s0.wp.com/latex.php?latex=n-0.5&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n-0.5' title='n-0.5' class='latex' />, as well as a single eigenvalue of <img src='http://s0.wp.com/latex.php?latex=-0.5&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='-0.5' title='-0.5' class='latex' />. Since it has a negative eigenvalue, it is not positive semidefinite.</p>
<p>On the other hand, any submatrix of size <img src='http://s0.wp.com/latex.php?latex=n-1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n-1' title='n-1' class='latex' /> is of the form <img src='http://s0.wp.com/latex.php?latex=%28n-0.5%29I-J&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(n-0.5)I-J' title='(n-0.5)I-J' class='latex' />, but where now <img src='http://s0.wp.com/latex.php?latex=J&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='J' title='J' class='latex' /> is only <img src='http://s0.wp.com/latex.php?latex=%28n-1%29+%5Ctimes+%28n-1%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(n-1) &#92;times (n-1)' title='(n-1) &#92;times (n-1)' class='latex' />. This matrix now has <img src='http://s0.wp.com/latex.php?latex=n-2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n-2' title='n-2' class='latex' /> eigenvalues of <img src='http://s0.wp.com/latex.php?latex=n-0.5&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n-0.5' title='n-0.5' class='latex' />, together with a single eigenvalue of <img src='http://s0.wp.com/latex.php?latex=0.5&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='0.5' title='0.5' class='latex' />, and hence is positive definite. Therefore, the Hessian is positive definite when restricted to any <img src='http://s0.wp.com/latex.php?latex=n-1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n-1' title='n-1' class='latex' /> variables, and hence <img src='http://s0.wp.com/latex.php?latex=f&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='f' title='f' class='latex' /> is convex in any <img src='http://s0.wp.com/latex.php?latex=n-1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n-1' title='n-1' class='latex' /> variables, but not in all <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n' title='n' class='latex' /> variables jointly.</p>
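<p>The argument above can be checked numerically for a small case. The following is a minimal sketch in plain Python; the helper names (<code>hessian</code>, <code>is_positive_definite</code>, <code>drop_variable</code>) and the choice n = 5 are illustrative assumptions, not part of the original argument. Positive definiteness is tested by attempting a Cholesky factorization, which succeeds exactly when the matrix is positive definite.</p>

```python
def hessian(n):
    """Hessian of f: n - 1.5 on the diagonal, -1 everywhere else."""
    return [[(n - 1.5) if i == j else -1.0 for j in range(n)] for i in range(n)]


def is_positive_definite(m):
    """Attempt a Cholesky factorization; it succeeds exactly when m is positive definite."""
    size = len(m)
    low = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if not s > 0.0:  # non-positive pivot: the matrix is not positive definite
                    return False
                low[i][i] = s ** 0.5
            else:
                low[i][j] = s / low[j][j]
    return True


def drop_variable(m, k):
    """Principal submatrix with row and column k removed."""
    return [[v for j, v in enumerate(row) if j != k] for i, row in enumerate(m) if i != k]


n = 5  # illustrative size
H = hessian(n)
full_pd = is_positive_definite(H)
subs_pd = all(is_positive_definite(drop_variable(H, k)) for k in range(n))
print(full_pd, subs_pd)
```

<p>For n = 5 the full Hessian fails the test while every principal submatrix on 4 variables passes, matching the eigenvalue computation.</p>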
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/jsteinhardt.wordpress.com/495/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/jsteinhardt.wordpress.com/495/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=495&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2013/06/12/convexity-counterexample/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>Probabilistic Abstractions I</title>
		<link>http://jsteinhardt.wordpress.com/2013/03/15/probabilistic-abstractions-i/</link>
		<comments>http://jsteinhardt.wordpress.com/2013/03/15/probabilistic-abstractions-i/#comments</comments>
		<pubDate>Fri, 15 Mar 2013 03:45:04 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=487</guid>
		<description><![CDATA[(This post represents research in progress. I may think about these concepts entirely differently a few months from now, but for my own benefit I&#8217;m trying to exposit on them in order to force myself to understand them better.) For many inference tasks, especially ones with either non-linearities or non-convexities, it is common to use [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=487&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>(This post represents research in progress. I may think about these concepts entirely differently a few months from now, but for my own benefit I&#8217;m trying to exposit on them in order to force myself to understand them better.)</p>
<p>For many inference tasks, especially ones with either non-linearities or non-convexities, it is common to use particle-based methods such as beam search, particle filters, sequential Monte Carlo, or Markov Chain Monte Carlo. In these methods, we approximate a distribution by a collection of samples from that distribution, then update the samples as new information is added. For instance, in beam search, if we are trying to build up a tree, we might build up a collection of <img src='http://s0.wp.com/latex.php?latex=K&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='K' title='K' class='latex' /> samples for the left and right subtrees, then look at all <img src='http://s0.wp.com/latex.php?latex=K%5E2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='K^2' title='K^2' class='latex' /> ways of combining them into the entire tree, but then downsample again to the <img src='http://s0.wp.com/latex.php?latex=K&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='K' title='K' class='latex' /> trees with the highest scores. This allows us to search through the exponentially large space of all trees efficiently (albeit at the cost of possibly missing high-scoring trees).</p>
<p>One major problem with such particle-based methods is diversity: the particles will tend to cluster around the highest-scoring mode, rather than exploring multiple local optima if they exist. This can be bad because it makes learning algorithms overly myopic. Another problem, especially in combinatorial domains, is the difficulty of partial evaluation: if we have some training data that we are trying to fit, and we have chosen settings of some, but not all, variables in our model, it can be difficult to know if that setting is on the right track (for instance, it can be difficult to know whether a partially-built tree is a promising candidate or not). For time-series modeling, this isn&#8217;t nearly as large a problem, since we can evaluate against a prefix of the time series to get a good idea of how a particle is doing (this perhaps explains the success of particle filters in these domains).</p>
<p>I&#8217;ve been working on a method that tries to deal with both of these problems, which I call <strong>probabilistic abstractions</strong>. The idea is to improve the diversity of particle-based methods by creating &#8220;fat&#8221; particles which cover multiple states at once; the reason that such fat particles help is that they allow us to first optimize for coverage (by placing down relatively large particles that cover the entire space), then later worry about more local details (by placing down many particles near promising-looking local optima).</p>
<p>To be more concrete, if we have a probability distribution over a set of random variables <img src='http://s0.wp.com/latex.php?latex=%28X_1%2C%5Cldots%2CX_d%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(X_1,&#92;ldots,X_d)' title='(X_1,&#92;ldots,X_d)' class='latex' />, then our particles will be sets obtained by specifying the values of some of the <img src='http://s0.wp.com/latex.php?latex=X_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X_i' title='X_i' class='latex' /> and leaving the rest to vary arbitrarily. So, for instance, if <img src='http://s0.wp.com/latex.php?latex=d%3D4&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='d=4' title='d=4' class='latex' />, then <img src='http://s0.wp.com/latex.php?latex=%5C%7B%28X_1%2CX_2%2CX_3%2CX_4%29+%5Cmid+X_2+%3D+1%2C+X_4+%3D+7%5C%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;{(X_1,X_2,X_3,X_4) &#92;mid X_2 = 1, X_4 = 7&#92;}' title='&#92;{(X_1,X_2,X_3,X_4) &#92;mid X_2 = 1, X_4 = 7&#92;}' class='latex' /> might be a possible &#8220;fat&#8221; particle.</p>
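<p>One way to picture a fat particle in code is as a partial assignment. The sketch below is only an illustration: the dictionary encoding (1-based variable index mapped to its pinned value) and the helper name <code>in_particle</code> are assumptions made for this example.</p>

```python
def in_particle(particle, state):
    """True when `state` agrees with every variable that the particle pins down."""
    return all(state[i] == value for i, value in particle.items())


# The example particle with d = 4: X_2 = 1 and X_4 = 7, with X_1 and X_3 left free.
particle = {2: 1, 4: 7}

inside = {1: 0, 2: 1, 3: 5, 4: 7}   # agrees on X_2 and X_4, so it lies in the particle
outside = {1: 0, 2: 1, 3: 5, 4: 8}  # disagrees on X_4, so it does not

print(in_particle(particle, inside), in_particle(particle, outside))
```

<p>A point particle is then just the special case in which every variable is pinned.</p>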
<p>By choosing some number of fat particles and assigning probabilities to them, we are implicitly specifying a polytope of possible probability distributions; for instance, if our particles are <img src='http://s0.wp.com/latex.php?latex=S_1%2C%5Cldots%2CS_k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S_1,&#92;ldots,S_k' title='S_1,&#92;ldots,S_k' class='latex' />, and we assign probability <img src='http://s0.wp.com/latex.php?latex=%5Cpi_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;pi_i' title='&#92;pi_i' class='latex' /> to <img src='http://s0.wp.com/latex.php?latex=S_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S_i' title='S_i' class='latex' />, then we have the polytope of distributions <img src='http://s0.wp.com/latex.php?latex=p&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p' title='p' class='latex' /> that satisfy the constraints <img src='http://s0.wp.com/latex.php?latex=p%28S_1%29+%3D+%5Cpi_1%2C+p%28S_2%29+%3D+%5Cpi_2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p(S_1) = &#92;pi_1, p(S_2) = &#92;pi_2' title='p(S_1) = &#92;pi_1, p(S_2) = &#92;pi_2' class='latex' />, etc.</p>
<p>Given such a polytope, is there a way to pick a canonical representative from it? One such representative is the <strong>maximum entropy distribution</strong> in that polytope. This distribution has the property of minimizing the worst-case relative entropy to any other distribution within the polytope (and that worst-case relative entropy is just the entropy of the distribution).</p>
<p>Suppose that we have a polytope for two independent distributions, and we want to compute the polytope for their product. This is easy: just look at the Cartesian products of each particle of the first distribution with each particle of the second distribution. If each individual distribution has <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k' title='k' class='latex' /> particles, then the product distribution has <img src='http://s0.wp.com/latex.php?latex=k%5E2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k^2' title='k^2' class='latex' /> particles. This could be problematic computationally, so we also want a way to narrow down to a subset of the <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k' title='k' class='latex' /> most informative particles. These will be the <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k' title='k' class='latex' /> particles whose corresponding polytope has the smallest maximum entropy. Finding this subset is NP-hard in general, but I&#8217;m currently working on good heuristics for computing it.</p>
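<p>The product step can be sketched as follows. This is a minimal illustration, assuming particles are stored as (partial assignment, mass) pairs over disjoint sets of variables; the name <code>product_particles</code> and the numbers are made up for the example. Independence is what lets the mass of a product set be the product of the two masses. The subsampling step back down to k particles is not shown.</p>

```python
from itertools import product


def product_particles(first, second):
    """Pair every particle of `first` with every particle of `second`.

    Each input is a list of (assignment, mass) pairs over disjoint variables;
    by independence the product set gets mass p(S) * p(T).
    """
    return [({**s, **t}, ps * pt) for (s, ps), (t, pt) in product(first, second)]


left = [({1: 0}, 0.3), ({1: 1}, 0.7)]   # two particles constraining X_1
right = [({3: 2}, 0.6), ({3: 5}, 0.4)]  # two particles constraining X_3
combined = product_particles(left, right)

for assignment, mass in combined:
    print(assignment, round(mass, 2))
```

<p>With two particles on each side the result has four particles, which is the quadratic growth that motivates subsampling.</p>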
<p>Next, suppose that we have a distribution on a space <img src='http://s0.wp.com/latex.php?latex=X&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X' title='X' class='latex' /> and want to <strong>apply a function</strong> <img src='http://s0.wp.com/latex.php?latex=f+%3A+X+%5Cto+Y&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='f : X &#92;to Y' title='f : X &#92;to Y' class='latex' /> to it. If <img src='http://s0.wp.com/latex.php?latex=f&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='f' title='f' class='latex' /> is a complicated function, it might be difficult to propagate the fat particles (even though it would have been easy to propagate particles composed of single points). To get around this, we need what is called a <strong>valid abstraction</strong> of <img src='http://s0.wp.com/latex.php?latex=f&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='f' title='f' class='latex' />: a function <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7Bf%7D+%3A+2%5EX+%5Cto+2%5EY&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;tilde{f} : 2^X &#92;to 2^Y' title='&#92;tilde{f} : 2^X &#92;to 2^Y' class='latex' /> such that <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7Bf%7D%28S%29+%5Csupseteq+f%28S%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;tilde{f}(S) &#92;supseteq f(S)' title='&#92;tilde{f}(S) &#92;supseteq f(S)' class='latex' /> for all <img src='http://s0.wp.com/latex.php?latex=S+%5Cin+2%5EX&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S &#92;in 2^X' title='S &#92;in 2^X' class='latex' />. 
In this case, if we map a particle <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S' title='S' class='latex' /> to <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7Bf%7D%28S%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;tilde{f}(S)' title='&#92;tilde{f}(S)' class='latex' />, our equality constraint on the mass assigned to <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S' title='S' class='latex' /> becomes a lower bound on the mass assigned to <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7Bf%7D%28S%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;tilde{f}(S)' title='&#92;tilde{f}(S)' class='latex' /> &#8212; we thus still have a polytope of possible probability distributions. Depending on the exact structure of the particles (i.e. the exact way in which the different sets overlap), it may be necessary to add additional constraints to the polytope to get good performance &#8212; I feel like I have some understanding of this, but it&#8217;s something I&#8217;ll need to investigate empirically as well. It&#8217;s also interesting to note that <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7Bf%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;tilde{f}' title='&#92;tilde{f}' class='latex' /> (when combined with conditioning on data, which is discussed below) allows us to assign partial credit to promising particles, which was the other property I discussed at the beginning.</p>
<p>Finally, suppose that I want to <strong>condition</strong> on data. In this case the polytope approach doesn&#8217;t work as well, because conditioning on data can blow up the polytope by an arbitrarily large amount. Instead, we just take the maximum-entropy distribution in our polytope and treat that as our &#8220;true&#8221; distribution, then condition. I haven&#8217;t been able to make any formal statements about this procedure, but it seems to work at least somewhat reasonably. It is worth noting that conditioning may not be straightforward, since the likelihood function may not be constant across a given fat particle. To deal with this, we can replace the likelihood function by its average (which I think can be justified in terms of maximum entropy as well, although the details here are a bit hazier).</p>
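<p>As a rough sketch of the conditioning step, suppose the fat particles are disjoint finite sets, so that the maximum-entropy distribution spreads each particle&#8217;s mass uniformly over its points. Those assumptions, the function name <code>condition</code>, and the toy likelihood are all illustrative and are not claimed to be the final method. Under them, replacing the likelihood by its average over each particle gives the reweighting below.</p>

```python
def condition(particles, likelihood):
    """Reweight each particle by its average likelihood, then renormalize.

    `particles` is a list of (states, mass) pairs, where `states` is a finite
    list of points assumed to carry the particle's mass uniformly.
    """
    weights = []
    for states, mass in particles:
        average = sum(likelihood(x) for x in states) / len(states)
        weights.append(mass * average)
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: two particles of two points each, with a likelihood that favors x = 3.
particles = [([0, 1], 0.5), ([2, 3], 0.5)]
posterior = condition(particles, lambda x: 0.9 if x == 3 else 0.1)
print(posterior)
```

<p>Here the second particle ends up with five times the mass of the first, even though only one of its two points fits the data well.</p>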
<p>So, in summary, we have a notion of fat particles, which provide better coverage than point particles, and we can combine them, apply functions to them, subsample them, and condition on data. These are essentially all of the operations we want for particle-based methods, so in theory we should now be able to implement versions of these particle-based methods that get better coverage.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/jsteinhardt.wordpress.com/487/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/jsteinhardt.wordpress.com/487/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=487&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2013/03/15/probabilistic-abstractions-i/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>Pairwise Independence vs. Independence</title>
		<link>http://jsteinhardt.wordpress.com/2013/03/13/pairwise-independence-vs-independence/</link>
		<comments>http://jsteinhardt.wordpress.com/2013/03/13/pairwise-independence-vs-independence/#comments</comments>
		<pubDate>Wed, 13 Mar 2013 06:53:52 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=484</guid>
		<description><![CDATA[For collections of independent random variables, the Chernoff bound and related bounds give us very sharp concentration inequalities: if X_1,&#8230;,X_n are independent, then their sum has a distribution that decays like e^{-x^2}. For random variables that are only pairwise independent, the strongest bound we have is Chebyshev&#8217;s inequality, which says that their sum decays like [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=484&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>For collections of independent random variables, the Chernoff bound and related bounds give us very sharp concentration inequalities &#8212; if <img src='http://s0.wp.com/latex.php?latex=X_1%2C%5Cldots%2CX_n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X_1,&#92;ldots,X_n' title='X_1,&#92;ldots,X_n' class='latex' /> are independent, then their sum has a distribution that decays like <img src='http://s0.wp.com/latex.php?latex=e%5E%7B-x%5E2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='e^{-x^2}' title='e^{-x^2}' class='latex' />. For random variables that are only pairwise independent, the strongest bound we have is Chebyshev&#8217;s inequality, which says that their sum decays like <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7Bx%5E2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{x^2}' title='&#92;frac{1}{x^2}' class='latex' />.</p>
<p>The point of this post is to construct an equality case for Chebyshev: a collection of pairwise independent random variables whose sum does not satisfy the concentration bound of Chernoff, and instead decays like <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7Bx%5E2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{x^2}' title='&#92;frac{1}{x^2}' class='latex' />.</p>
<p>The construction is as follows: let <img src='http://s0.wp.com/latex.php?latex=X_1%2C%5Cldots%2CX_d&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X_1,&#92;ldots,X_d' title='X_1,&#92;ldots,X_d' class='latex' /> be independent, uniformly random binary variables, and for any non-empty <img src='http://s0.wp.com/latex.php?latex=S+%5Csubset+%5C%7B1%2C%5Cldots%2Cd%5C%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S &#92;subset &#92;{1,&#92;ldots,d&#92;}' title='S &#92;subset &#92;{1,&#92;ldots,d&#92;}' class='latex' />, let <img src='http://s0.wp.com/latex.php?latex=Y_S+%3D+%5Csum_%7Bi+%5Cin+S%7D+X_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Y_S = &#92;sum_{i &#92;in S} X_i' title='Y_S = &#92;sum_{i &#92;in S} X_i' class='latex' />, where the sum is taken mod 2. Then we can easily check that the <img src='http://s0.wp.com/latex.php?latex=Y_S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Y_S' title='Y_S' class='latex' /> are pairwise independent. Now consider the random variable <img src='http://s0.wp.com/latex.php?latex=Z+%3D+%5Csum_%7BS%7D+Y_S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Z = &#92;sum_{S} Y_S' title='Z = &#92;sum_{S} Y_S' class='latex' />. If any of the <img src='http://s0.wp.com/latex.php?latex=X_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X_i' title='X_i' class='latex' /> is equal to 1, then we can pair up the <img src='http://s0.wp.com/latex.php?latex=Y_S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Y_S' title='Y_S' class='latex' /> by either adding or removing <img src='http://s0.wp.com/latex.php?latex=i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='i' title='i' class='latex' /> from <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S' title='S' class='latex' /> to get the other element of the pair (the set {i} itself is paired with the empty set, which contributes 0). Exactly one member of each pair is equal to 1, so we see that <img src='http://s0.wp.com/latex.php?latex=Z+%3D+2%5E%7Bd-1%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Z = 2^{d-1}' title='Z = 2^{d-1}' class='latex' /> in this case. 
On the other hand, if all of the <img src='http://s0.wp.com/latex.php?latex=X_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X_i' title='X_i' class='latex' /> are equal to 0, then <img src='http://s0.wp.com/latex.php?latex=Z+%3D+0&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Z = 0' title='Z = 0' class='latex' /> as well. Thus, with probability <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%5Ed%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{2^d}' title='&#92;frac{1}{2^d}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=Z&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Z' title='Z' class='latex' /> deviates from its mean by <img src='http://s0.wp.com/latex.php?latex=2%5E%7Bd-1%7D-%5Cfrac%7B1%7D%7B2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='2^{d-1}-&#92;frac{1}{2}' title='2^{d-1}-&#92;frac{1}{2}' class='latex' />, whereas the variance of <img src='http://s0.wp.com/latex.php?latex=Z&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Z' title='Z' class='latex' /> is <img src='http://s0.wp.com/latex.php?latex=2%5E%7Bd-2%7D-%5Cfrac%7B1%7D%7B4%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='2^{d-2}-&#92;frac{1}{4}' title='2^{d-2}-&#92;frac{1}{4}' class='latex' />. The bound on this probability from Chebyshev is <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B2%5E%7Bd-2%7D-1%2F4%7D%7B%282%5E%7Bd-1%7D-1%2F2%29%5E2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{2^{d-2}-1/4}{(2^{d-1}-1/2)^2}' title='&#92;frac{2^{d-2}-1/4}{(2^{d-1}-1/2)^2}' class='latex' />, which simplifies to <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%5Ed-1%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{2^d-1}' title='&#92;frac{1}{2^d-1}' class='latex' /> and is very close to <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%5Ed%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{2^d}' title='&#92;frac{1}{2^d}' class='latex' />, so this constitutes something very close to the Chebyshev equality case.</p>
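<p>For a small value of d the whole construction can be enumerated directly. The sketch below assumes fair bits and non-empty subsets, with d = 4 chosen only to keep the enumeration tiny; the helper names are illustrative. It checks pairwise independence by counting joint outcomes and computes the mean and variance of Z exactly.</p>

```python
from itertools import combinations, product

d = 4  # illustrative size; the enumeration has 2**d equally likely outcomes
subsets = [s for r in range(1, d + 1) for s in combinations(range(d), r)]


def parities(bits):
    """Y_S for every non-empty subset S: the sum of the selected bits mod 2."""
    return [sum(bits[i] for i in s) % 2 for s in subsets]


outcomes = [parities(bits) for bits in product([0, 1], repeat=d)]
z_values = [sum(ys) for ys in outcomes]


def pair_is_uniform(a, b):
    """True when (Y_a, Y_b) takes each of the four values in {0,1}^2 equally often."""
    counts = {}
    for ys in outcomes:
        key = (ys[a], ys[b])
        counts[key] = counts.get(key, 0) + 1
    return len(counts) == 4 and set(counts.values()) == {len(outcomes) // 4}


pairwise = all(pair_is_uniform(a, b) for a, b in combinations(range(len(subsets)), 2))
mean = sum(z_values) / len(z_values)
variance = sum((z - mean) ** 2 for z in z_values) / len(z_values)
print(sorted(set(z_values)), mean, variance, pairwise)
```

<p>With d = 4, Z only takes the values 0 and 8, the mean is 7.5 and the variance is 3.75, so Chebyshev bounds the probability of a deviation of 7.5 by 3.75 / 7.5^2 = 1/15, against a true probability of 1/16 for Z = 0.</p>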
<p>Anyways, I just thought this was a cool example that demonstrates the difference between pairwise and full independence.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/jsteinhardt.wordpress.com/484/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/jsteinhardt.wordpress.com/484/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=484&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2013/03/13/pairwise-independence-vs-independence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>A Fun Optimization Problem</title>
		<link>http://jsteinhardt.wordpress.com/2013/02/09/a-fun-optimization-problem/</link>
		<comments>http://jsteinhardt.wordpress.com/2013/02/09/a-fun-optimization-problem/#comments</comments>
		<pubDate>Sat, 09 Feb 2013 06:56:29 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Math]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=423</guid>
		<description><![CDATA[I spent the last several hours trying to come up with an efficient algorithm for the following problem: Problem: Suppose that we have a sequence of l pairs of non-negative numbers (a_1,b_1),&#8230;,(a_l,b_l) such that sum_{i=1}^l a_i &#8804; A and sum_{i=1}^l b_i &#8804; B. Devise an efficient algorithm to find the k pairs (a_{i_1},b_{i_1}),&#8230;,(a_{i_k},b_{i_k}) that maximize [sum_{r=1}^k a_{i_r} log(a_{i_r}/b_{i_r})] + [A - sum_{r=1}^k a_{i_r}] log((A - sum_{r=1}^k a_{i_r})/(B - sum_{r=1}^k b_{i_r})). Commentary: I don&#8217;t have a fully satisfactory solution to this yet, although [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=423&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I spent the last several hours trying to come up with an efficient algorithm for the following problem:</p>
<p><strong>Problem:</strong> Suppose that we have a sequence of <img src='http://s0.wp.com/latex.php?latex=l&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='l' title='l' class='latex' /> pairs of non-negative numbers <img src='http://s0.wp.com/latex.php?latex=%28a_1%2Cb_1%29%2C%5Cldots%2C%28a_l%2Cb_l%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(a_1,b_1),&#92;ldots,(a_l,b_l)' title='(a_1,b_1),&#92;ldots,(a_l,b_l)' class='latex' /> such that <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5El+a_i+%5Cleq+A&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^l a_i &#92;leq A' title='&#92;sum_{i=1}^l a_i &#92;leq A' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5El+b_i+%5Cleq+B&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^l b_i &#92;leq B' title='&#92;sum_{i=1}^l b_i &#92;leq B' class='latex' />. Devise an efficient algorithm to find the <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k' title='k' class='latex' /> pairs <img src='http://s0.wp.com/latex.php?latex=%28a_%7Bi_1%7D%2Cb_%7Bi_1%7D%29%2C%5Cldots%2C%28a_%7Bi_k%7D%2Cb_%7Bi_k%7D%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(a_{i_1},b_{i_1}),&#92;ldots,(a_{i_k},b_{i_k})' title='(a_{i_1},b_{i_1}),&#92;ldots,(a_{i_k},b_{i_k})' class='latex' /> that maximize</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=%5Cleft%5B%5Csum_%7Br%3D1%7D%5Ek+a_%7Bi_r%7D%5Clog%28a_%7Bi_r%7D%2Fb_%7Bi_r%7D%29%5Cright%5D+%2B+%5Cleft%5BA-%5Csum_%7Br%3D1%7D%5Ek+a_%7Bi_r%7D%5Cright%5D%5Clog%5Cleft%28%5Cfrac%7BA-%5Csum_%7Br%3D1%7D%5Ek+a_%7Bi_r%7D%7D%7BB-%5Csum_%7Br%3D1%7D%5Ek+b_%7Bi_r%7D%7D%5Cright%29.&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;left[&#92;sum_{r=1}^k a_{i_r}&#92;log(a_{i_r}/b_{i_r})&#92;right] + &#92;left[A-&#92;sum_{r=1}^k a_{i_r}&#92;right]&#92;log&#92;left(&#92;frac{A-&#92;sum_{r=1}^k a_{i_r}}{B-&#92;sum_{r=1}^k b_{i_r}}&#92;right).' title='&#92;left[&#92;sum_{r=1}^k a_{i_r}&#92;log(a_{i_r}/b_{i_r})&#92;right] + &#92;left[A-&#92;sum_{r=1}^k a_{i_r}&#92;right]&#92;log&#92;left(&#92;frac{A-&#92;sum_{r=1}^k a_{i_r}}{B-&#92;sum_{r=1}^k b_{i_r}}&#92;right).' class='latex' /></p>
<p><strong>Commentary:</strong> I don&#8217;t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in <img src='http://s0.wp.com/latex.php?latex=O%5Cleft%28%5Cfrac%7Bl+%5Clog%28l%29%7D%7B%5Cepsilon%7D%5Cright%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='O&#92;left(&#92;frac{l &#92;log(l)}{&#92;epsilon}&#92;right)' title='O&#92;left(&#92;frac{l &#92;log(l)}{&#92;epsilon}&#92;right)' class='latex' /> time and finds <img src='http://s0.wp.com/latex.php?latex=2k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='2k' title='2k' class='latex' /> pairs that do at least <img src='http://s0.wp.com/latex.php?latex=1-%5Cepsilon&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='1-&#92;epsilon' title='1-&#92;epsilon' class='latex' /> as well as the best set of <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k' title='k' class='latex' /> pairs. It&#8217;s possible I need to assume something like <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5El+a_i+%5Cleq+A%2F2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^l a_i &#92;leq A/2' title='&#92;sum_{i=1}^l a_i &#92;leq A/2' class='latex' /> instead of just <img src='http://s0.wp.com/latex.php?latex=A&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='A' title='A' class='latex' /> (and similarly for the <img src='http://s0.wp.com/latex.php?latex=b_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='b_i' title='b_i' class='latex' />), although I&#8217;m happy to make that assumption.</p>
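<p>For experimenting with heuristics, it helps to have the objective written out along with an exhaustive baseline for tiny instances. The sketch below is only that: it is exponential in the worst case, and it assumes all values are strictly positive and that the chosen sums stay strictly below A and B so every logarithm is defined. The names <code>objective</code> and <code>brute_force</code> and the sample numbers are illustrative.</p>

```python
from itertools import combinations
from math import log


def objective(pairs, chosen, A, B):
    """Value of the objective for the index set `chosen`.

    Adds a * log(a / b) over the chosen pairs to the remainder term
    (A - sum a) * log((A - sum a) / (B - sum b)).
    """
    a_sum = sum(pairs[i][0] for i in chosen)
    b_sum = sum(pairs[i][1] for i in chosen)
    head = sum(pairs[i][0] * log(pairs[i][0] / pairs[i][1]) for i in chosen)
    return head + (A - a_sum) * log((A - a_sum) / (B - b_sum))


def brute_force(pairs, k, A, B):
    """Try every size-k index set; a reference answer for tiny inputs only."""
    return max(combinations(range(len(pairs)), k), key=lambda c: objective(pairs, c, A, B))


pairs = [(0.1, 0.3), (0.2, 0.2), (0.3, 0.1)]  # made-up (a_i, b_i) values
best = brute_force(pairs, 1, 1.0, 1.0)
print(best, objective(pairs, best, 1.0, 1.0))
```

<p>On this toy input the best single pair is the last one, where a_i is largest relative to b_i; choosing the middle pair, with a_i = b_i, gives an objective of exactly 0.</p>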
<p>While attempting to solve this problem, I&#8217;ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I&#8217;d be interested to know.</p>
]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2013/02/09/a-fun-optimization-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>Eigenvalue Bounds</title>
		<link>http://jsteinhardt.wordpress.com/2013/02/05/eigenvalue-bounds/</link>
		<comments>http://jsteinhardt.wordpress.com/2013/02/05/eigenvalue-bounds/#comments</comments>
		<pubDate>Tue, 05 Feb 2013 06:32:11 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[Tricks]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=465</guid>
		<description><![CDATA[While grading homeworks today, I came across the following bound: Theorem 1: If A and B are symmetric n × n matrices with eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_n and μ_1 ≥ μ_2 ≥ … ≥ μ_n respectively, then Trace(A^T B) ≤ λ_1 μ_1 + … + λ_n μ_n. For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=465&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>While grading homeworks today, I came across the following bound:</p>
<p><strong>Theorem 1: </strong>If A and B are symmetric <img src='http://s0.wp.com/latex.php?latex=n%5Ctimes+n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n&#92;times n' title='n&#92;times n' class='latex' /> matrices with eigenvalues <img src='http://s0.wp.com/latex.php?latex=%5Clambda_1+%5Cgeq+%5Clambda_2+%5Cgeq+%5Cldots+%5Cgeq+%5Clambda_n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;lambda_1 &#92;geq &#92;lambda_2 &#92;geq &#92;ldots &#92;geq &#92;lambda_n' title='&#92;lambda_1 &#92;geq &#92;lambda_2 &#92;geq &#92;ldots &#92;geq &#92;lambda_n' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Cmu_1+%5Cgeq+%5Cmu_2+%5Cgeq+%5Cldots+%5Cgeq+%5Cmu_n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;mu_1 &#92;geq &#92;mu_2 &#92;geq &#92;ldots &#92;geq &#92;mu_n' title='&#92;mu_1 &#92;geq &#92;mu_2 &#92;geq &#92;ldots &#92;geq &#92;mu_n' class='latex' /> respectively, then <img src='http://s0.wp.com/latex.php?latex=Trace%28A%5ETB%29+%5Cleq+%5Csum_%7Bi%3D1%7D%5En+%5Clambda_i+%5Cmu_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(A^TB) &#92;leq &#92;sum_{i=1}^n &#92;lambda_i &#92;mu_i' title='Trace(A^TB) &#92;leq &#92;sum_{i=1}^n &#92;lambda_i &#92;mu_i' class='latex' />.</p>
<p>For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the <a href="http://en.wikipedia.org/wiki/Min-max_theorem#Cauchy_interlacing_theorem">Cauchy Interlacing Theorem</a>:</p>
<p><strong>Theorem 2: </strong>If A is an <img src='http://s0.wp.com/latex.php?latex=n%5Ctimes+n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n&#92;times n' title='n&#92;times n' class='latex' /> symmetric matrix and B is an <img src='http://s0.wp.com/latex.php?latex=%28n-k%29+%5Ctimes+%28n-k%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(n-k) &#92;times (n-k)' title='(n-k) &#92;times (n-k)' class='latex' /> principal submatrix of A, then <img src='http://s0.wp.com/latex.php?latex=%5Clambda_%7Bi%2Bk%7D%28A%29+%5Cleq+%5Clambda_i%28B%29+%5Cleq+%5Clambda_i%28A%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;lambda_{i+k}(A) &#92;leq &#92;lambda_i(B) &#92;leq &#92;lambda_i(A)' title='&#92;lambda_{i+k}(A) &#92;leq &#92;lambda_i(B) &#92;leq &#92;lambda_i(A)' class='latex' />, where <img src='http://s0.wp.com/latex.php?latex=%5Clambda_i%28X%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;lambda_i(X)' title='&#92;lambda_i(X)' class='latex' /> is the ith largest eigenvalue of X.</p>
<p>As a corollary we have:</p>
<p><strong>Corollary 1:</strong> For any symmetric matrix X, <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5Ek+X_%7Bii%7D+%5Cleq+%5Csum_%7Bi%3D1%7D%5Ek+%5Clambda_i%28X%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^k X_{ii} &#92;leq &#92;sum_{i=1}^k &#92;lambda_i(X)' title='&#92;sum_{i=1}^k X_{ii} &#92;leq &#92;sum_{i=1}^k &#92;lambda_i(X)' class='latex' />.</p>
<p><strong>Proof:</strong> The left-hand side is just the trace of the upper-left <img src='http://s0.wp.com/latex.php?latex=k%5Ctimes+k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k&#92;times k' title='k&#92;times k' class='latex' /> principal submatrix of X, whose eigenvalues are, by Theorem 2, bounded term by term by the k largest eigenvalues of X. Since the trace is the sum of the eigenvalues, the inequality follows. <img src='http://s0.wp.com/latex.php?latex=%5Csquare&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;square' title='&#92;square' class='latex' /></p>
<p>The final ingredient we will need is a sort of &#8220;majorization&#8221; inequality based on Abel summation:</p>
<p><strong>Theorem 3:</strong> If <img src='http://s0.wp.com/latex.php?latex=x_1%2C%5Cldots%2Cx_n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x_1,&#92;ldots,x_n' title='x_1,&#92;ldots,x_n' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=y_1%2C%5Cldots%2Cy_n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='y_1,&#92;ldots,y_n' title='y_1,&#92;ldots,y_n' class='latex' /> are such that <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5Ek+x_i+%5Cleq+%5Csum_%7Bi%3D1%7D%5Ek+y_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^k x_i &#92;leq &#92;sum_{i=1}^k y_i' title='&#92;sum_{i=1}^k x_i &#92;leq &#92;sum_{i=1}^k y_i' class='latex' /> for all k (with equality when <img src='http://s0.wp.com/latex.php?latex=k%3Dn&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k=n' title='k=n' class='latex' />), and <img src='http://s0.wp.com/latex.php?latex=c_1+%5Cgeq+c_2+%5Cgeq+%5Cldots+%5Cgeq+c_n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='c_1 &#92;geq c_2 &#92;geq &#92;ldots &#92;geq c_n' title='c_1 &#92;geq c_2 &#92;geq &#92;ldots &#92;geq c_n' class='latex' />, then <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5En+c_ix_i+%5Cleq+%5Csum_%7Bi%3D1%7D%5En+c_iy_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^n c_ix_i &#92;leq &#92;sum_{i=1}^n c_iy_i' title='&#92;sum_{i=1}^n c_ix_i &#92;leq &#92;sum_{i=1}^n c_iy_i' class='latex' />.</p>
<p><strong>Proof:</strong> We have:</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5En+c_ix_i+%3D+c_n%28x_1%2B%5Ccdots%2Bx_n%29+%2B+%5Csum_%7Bi%3D1%7D%5E%7Bn-1%7D+%28c_i-c_%7Bi%2B1%7D%29%28x_1%2B%5Ccdots%2Bx_i%29+%5Cleq+c_n%28y_1%2B%5Ccdots%2By_n%29+%2B+%5Csum_%7Bi%3D1%7D%5E%7Bn-1%7D+%28c_i-c_%7Bi%2B1%7D%29%28y_1%2B%5Ccdots%2By_i%29+%3D+%5Csum_%7Bi%3D1%7D%5En+c_iy_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^n c_ix_i = c_n(x_1+&#92;cdots+x_n) + &#92;sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+&#92;cdots+x_i) &#92;leq c_n(y_1+&#92;cdots+y_n) + &#92;sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+&#92;cdots+y_i) = &#92;sum_{i=1}^n c_iy_i' title='&#92;sum_{i=1}^n c_ix_i = c_n(x_1+&#92;cdots+x_n) + &#92;sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+&#92;cdots+x_i) &#92;leq c_n(y_1+&#92;cdots+y_n) + &#92;sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+&#92;cdots+y_i) = &#92;sum_{i=1}^n c_iy_i' class='latex' /></p>
<p style="text-align:left;">where the equalities come from the <a href="http://en.wikipedia.org/wiki/Summation_by_parts">Abel summation method</a>. <img src='http://s0.wp.com/latex.php?latex=%5Csquare&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;square' title='&#92;square' class='latex' /></p>
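<p style="text-align:left;">As a sanity check, here is a minimal Python sketch of the Abel summation identity and of the conclusion of Theorem 3. The function names (abel_sum, prefix_dominated, weighted_gap) and the tolerance are illustrative choices, not part of the argument above; the example uses the fact that any reordering of a decreasing sequence has partial sums dominated by those of the sorted sequence.</p>

```python
import random


def abel_sum(c, x):
    """Compute sum_i c[i] * x[i] by summation by parts (Abel summation)."""
    n = len(c)
    prefix = 0.0  # running partial sum x[0] + ... + x[i]
    total = 0.0
    for i in range(n - 1):
        prefix += x[i]
        total += (c[i] - c[i + 1]) * prefix
    prefix += x[n - 1]
    return total + c[n - 1] * prefix


def prefix_dominated(x, y, tol=1e-9):
    """Check the hypothesis of Theorem 3: every partial sum of x is at most
    the matching partial sum of y, with equality for the full sums."""
    sx = sy = 0.0
    for xi, yi in zip(x, y):
        sx += xi
        sy += yi
        if sx > sy + tol:
            return False
    return tol >= abs(sx - sy)


def weighted_gap(x, y, c):
    """Return sum_i c[i] * y[i] minus sum_i c[i] * x[i]."""
    return abel_sum(c, y) - abel_sum(c, x)


if __name__ == "__main__":
    rng = random.Random(0)
    y = sorted((rng.uniform(-5, 5) for _ in range(6)), reverse=True)
    x = y[:]
    rng.shuffle(x)  # a reordering of a decreasing y satisfies the hypothesis
    c = sorted((rng.uniform(-5, 5) for _ in range(6)), reverse=True)
    print(prefix_dominated(x, y), weighted_gap(x, y, c) >= -1e-9)
```

<p style="text-align:left;">Each term of the rewritten sum pairs a nonnegative difference of consecutive c values with a partial sum, which is why the partial-sum hypothesis is exactly what the proof needs.</p>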
<p style="text-align:left;">Now, we are finally ready to prove the original theorem:</p>
<p style="text-align:left;"><strong>Proof of Theorem 1: </strong>First note that since the trace is invariant under similarity transforms, we can without loss of generality assume that A is diagonal with its eigenvalues in decreasing order along the diagonal (conjugating A and B by the same orthogonal matrix changes neither the trace nor the eigenvalues of B), in which case we want to prove that <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5En+%5Clambda_i+B_%7Bii%7D+%5Cleq+%5Csum_%7Bi%3D1%7D%5En+%5Clambda_i+%5Cmu_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^n &#92;lambda_i B_{ii} &#92;leq &#92;sum_{i=1}^n &#92;lambda_i &#92;mu_i' title='&#92;sum_{i=1}^n &#92;lambda_i B_{ii} &#92;leq &#92;sum_{i=1}^n &#92;lambda_i &#92;mu_i' class='latex' />. But by Corollary 1, we also know that <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5Ek+B_%7Bii%7D+%5Cleq+%5Csum_%7Bi%3D1%7D%5Ek+%5Cmu_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^k B_{ii} &#92;leq &#92;sum_{i=1}^k &#92;mu_i' title='&#92;sum_{i=1}^k B_{ii} &#92;leq &#92;sum_{i=1}^k &#92;mu_i' class='latex' /> for all k, with equality when k = n since both sides then equal the trace of B. Since by assumption the <img src='http://s0.wp.com/latex.php?latex=%5Clambda_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;lambda_i' title='&#92;lambda_i' class='latex' /> are a decreasing sequence, Theorem 3 then implies that <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5En+%5Clambda_i+B_%7Bii%7D+%5Cleq+%5Csum_%7Bi%3D1%7D%5En+%5Clambda_i+%5Cmu_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^n &#92;lambda_i B_{ii} &#92;leq &#92;sum_{i=1}^n &#92;lambda_i &#92;mu_i' title='&#92;sum_{i=1}^n &#92;lambda_i B_{ii} &#92;leq &#92;sum_{i=1}^n &#92;lambda_i &#92;mu_i' class='latex' />, which is what we wanted to show. <img src='http://s0.wp.com/latex.php?latex=%5Csquare&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;square' title='&#92;square' class='latex' /></p>
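<p style="text-align:left;">Theorem 1 is easy to spot-check numerically in the smallest interesting case. The sketch below is restricted to 2 by 2 symmetric matrices, where the eigenvalues have a closed form, so it needs only the Python standard library; the (a, b, c) triple encoding of the matrix [[a, b], [b, c]] and the function names are assumptions made for this illustration.</p>

```python
import math
import random


def eig2(a, b, c):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, c]]."""
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return mid + rad, mid - rad


def trace_product(m1, m2):
    """Trace(A^T B) for symmetric 2x2 matrices given as (a, b, c) triples."""
    a1, b1, c1 = m1
    a2, b2, c2 = m2
    # The off-diagonal entry appears twice in each matrix.
    return a1 * a2 + 2.0 * b1 * b2 + c1 * c2


def trace_bound_gap(m1, m2):
    """Return sum_i lambda_i * mu_i minus Trace(A^T B).

    Theorem 1 says this quantity is nonnegative."""
    lam = eig2(*m1)
    mu = eig2(*m2)
    return lam[0] * mu[0] + lam[1] * mu[1] - trace_product(m1, m2)


if __name__ == "__main__":
    rng = random.Random(0)
    worst = min(
        trace_bound_gap(
            tuple(rng.uniform(-3, 3) for _ in range(3)),
            tuple(rng.uniform(-3, 3) for _ in range(3)),
        )
        for _ in range(10000)
    )
    print("smallest gap seen:", worst)
```

<p style="text-align:left;">A random search like this cannot prove the bound, but a negative gap beyond rounding error would signal a mistake in the statement.</p>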
]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2013/02/05/eigenvalue-bounds/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>Local KL Divergence</title>
		<link>http://jsteinhardt.wordpress.com/2013/02/02/local-kl-divergence/</link>
		<comments>http://jsteinhardt.wordpress.com/2013/02/02/local-kl-divergence/#comments</comments>
		<pubDate>Sat, 02 Feb 2013 04:27:27 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Math]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=453</guid>
		<description><![CDATA[The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions p and q, the KL divergence is defined as KL(p &#124;&#124; q) := ∫ p(x) log(p(x)/q(x)) dx. Note that KL(p &#124;&#124; q) ≠ KL(q &#124;&#124; p). Intuitively, a small KL(p &#124;&#124; q) means that there are few points that p assigns high probability to but that q does not. We can [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=453&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions <img src='http://s0.wp.com/latex.php?latex=p&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p' title='p' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=q&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='q' title='q' class='latex' />, the KL divergence is defined as</p>
<p><img src='http://s0.wp.com/latex.php?latex=KL%28p+%7C%7C+q%29+%3A%3D+%5Cint+p%28x%29+%5Clog%28p%28x%29%2Fq%28x%29%29+dx&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='KL(p || q) := &#92;int p(x) &#92;log(p(x)/q(x)) dx' title='KL(p || q) := &#92;int p(x) &#92;log(p(x)/q(x)) dx' class='latex' /></p>
<p>Note that <img src='http://s0.wp.com/latex.php?latex=KL%28p+%7C%7C+q%29+%5Cneq+KL%28q+%7C%7C+p%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='KL(p || q) &#92;neq KL(q || p)' title='KL(p || q) &#92;neq KL(q || p)' class='latex' />. Intuitively, a small KL(p || q) means that there are few points that p assigns high probability to but that q does not. We can also think of KL(p || q) as the number of bits of information needed to update from the distribution q to the distribution p.</p>
<p>Suppose that p and q are both mixtures of other distributions: <img src='http://s0.wp.com/latex.php?latex=p%28x%29+%3D+%5Csum_i+%5Calpha_i+F_i%28x%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p(x) = &#92;sum_i &#92;alpha_i F_i(x)' title='p(x) = &#92;sum_i &#92;alpha_i F_i(x)' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=q%28x%29+%3D+%5Csum_i+%5Cbeta_i+G_i%28x%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='q(x) = &#92;sum_i &#92;beta_i G_i(x)' title='q(x) = &#92;sum_i &#92;beta_i G_i(x)' class='latex' />. Can we bound <img src='http://s0.wp.com/latex.php?latex=KL%28p+%7C%7C+q%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='KL(p || q)' title='KL(p || q)' class='latex' /> in terms of the <img src='http://s0.wp.com/latex.php?latex=KL%28F_i+%7C%7C+G_i%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='KL(F_i || G_i)' title='KL(F_i || G_i)' class='latex' />? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:</p>
<p><strong>Theorem:</strong> If <img src='http://s0.wp.com/latex.php?latex=%5Csum_i+%5Calpha_i+%3D+%5Csum_i+%5Cbeta_i+%3D+1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_i &#92;alpha_i = &#92;sum_i &#92;beta_i = 1' title='&#92;sum_i &#92;alpha_i = &#92;sum_i &#92;beta_i = 1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=F_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='F_i' title='F_i' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=G_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='G_i' title='G_i' class='latex' /> are all probability distributions, then</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=KL%5Cleft%28%5Csum_i+%5Calpha_i+F_i+%7C%7C+%5Csum_i+%5Cbeta_i+G_i%5Cright%29+%5Cleq+%5Csum_i+%5Calpha_i+%5Cleft%28%5Clog%28%5Calpha_i%2F%5Cbeta_i%29+%2B+KL%28F_i+%7C%7C+G_i%29%5Cright%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='KL&#92;left(&#92;sum_i &#92;alpha_i F_i || &#92;sum_i &#92;beta_i G_i&#92;right) &#92;leq &#92;sum_i &#92;alpha_i &#92;left(&#92;log(&#92;alpha_i/&#92;beta_i) + KL(F_i || G_i)&#92;right)' title='KL&#92;left(&#92;sum_i &#92;alpha_i F_i || &#92;sum_i &#92;beta_i G_i&#92;right) &#92;leq &#92;sum_i &#92;alpha_i &#92;left(&#92;log(&#92;alpha_i/&#92;beta_i) + KL(F_i || G_i)&#92;right)' class='latex' />.</p>
<p><strong>Proof:</strong> If we expand the definition, then we are trying to prove that</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=%5Cint+%5Cleft%28%5Csum+%5Calpha_i+F_i%28x%29%5Cright%29+%5Clog%5Cleft%28%5Cfrac%7B%5Csum+%5Calpha_i+F_i%28x%29%7D%7B%5Csum+%5Cbeta_i+G_i%28x%29%7D%5Cright%29+dx+%5Cleq+%5Cint+%5Cleft%28%5Csum_i+%5Calpha_iF_i%28x%29+%5Clog%5Cleft%28%5Cfrac%7B%5Calpha_i+F_i%28x%29%7D%7B%5Cbeta_i+G_i%28x%29%7D%5Cright%29%5Cright%29+dx&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;int &#92;left(&#92;sum &#92;alpha_i F_i(x)&#92;right) &#92;log&#92;left(&#92;frac{&#92;sum &#92;alpha_i F_i(x)}{&#92;sum &#92;beta_i G_i(x)}&#92;right) dx &#92;leq &#92;int &#92;left(&#92;sum_i &#92;alpha_iF_i(x) &#92;log&#92;left(&#92;frac{&#92;alpha_i F_i(x)}{&#92;beta_i G_i(x)}&#92;right)&#92;right) dx' title='&#92;int &#92;left(&#92;sum &#92;alpha_i F_i(x)&#92;right) &#92;log&#92;left(&#92;frac{&#92;sum &#92;alpha_i F_i(x)}{&#92;sum &#92;beta_i G_i(x)}&#92;right) dx &#92;leq &#92;int &#92;left(&#92;sum_i &#92;alpha_iF_i(x) &#92;log&#92;left(&#92;frac{&#92;alpha_i F_i(x)}{&#92;beta_i G_i(x)}&#92;right)&#92;right) dx' class='latex' /></p>
<p style="text-align:left;">We will in fact show that this is true for every value of <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x' title='x' class='latex' />, so that it is certainly true for the integral. Using <img src='http://s0.wp.com/latex.php?latex=%5Clog%28x%2Fy%29+%3D+-%5Clog%28y%2Fx%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;log(x/y) = -&#92;log(y/x)' title='&#92;log(x/y) = -&#92;log(y/x)' class='latex' />, re-write the condition for a given value of <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x' title='x' class='latex' /> as</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=%5Cleft%28%5Csum+%5Calpha_i+F_i%28x%29%5Cright%29+%5Clog%5Cleft%28%5Cfrac%7B%5Csum+%5Cbeta_i+G_i%28x%29%7D%7B%5Csum+%5Calpha_i+F_i%28x%29%7D%5Cright%29+%5Cgeq+%5Csum_i+%5Calpha_iF_i%28x%29+%5Clog%5Cleft%28%5Cfrac%7B%5Cbeta_i+G_i%28x%29%7D%7B%5Calpha_i+F_i%28x%29%7D%5Cright%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;left(&#92;sum &#92;alpha_i F_i(x)&#92;right) &#92;log&#92;left(&#92;frac{&#92;sum &#92;beta_i G_i(x)}{&#92;sum &#92;alpha_i F_i(x)}&#92;right) &#92;geq &#92;sum_i &#92;alpha_iF_i(x) &#92;log&#92;left(&#92;frac{&#92;beta_i G_i(x)}{&#92;alpha_i F_i(x)}&#92;right)' title='&#92;left(&#92;sum &#92;alpha_i F_i(x)&#92;right) &#92;log&#92;left(&#92;frac{&#92;sum &#92;beta_i G_i(x)}{&#92;sum &#92;alpha_i F_i(x)}&#92;right) &#92;geq &#92;sum_i &#92;alpha_iF_i(x) &#92;log&#92;left(&#92;frac{&#92;beta_i G_i(x)}{&#92;alpha_i F_i(x)}&#92;right)' class='latex' /></p>
<p style="text-align:left;">(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen&#8217;s inequality on the <img src='http://s0.wp.com/latex.php?latex=%5Clog&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;log' title='&#92;log' class='latex' /> function:</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=%5Csum_i+%5Calpha_iF_i%28x%29+%5Clog%5Cleft%28%5Cfrac%7B%5Cbeta_i+G_i%28x%29%7D%7B%5Calpha_i+F_i%28x%29%7D%5Cright%29+%5Cleq+%5Cleft%28%5Csum_i+%5Calpha_iF_i%28x%29%5Cright%29+%5Clog%5Cleft%28%5Cfrac%7B%5Csum_i+%5Cfrac%7B%5Cbeta_i+G_i%28x%29%7D%7B%5Calpha_i+F_i%28x%29%7D+%5Calpha_i+F_i%28x%29%7D%7B%5Csum+%5Calpha_i+F_i%28x%29%7D%5Cright%29+%3D+%5Cleft%28%5Csum_i+%5Calpha_i+F_i%28x%29%5Cright%29+%5Clog%5Cleft%28%5Cfrac%7B%5Csum_i+%5Cbeta_i+G_i%28x%29%7D%7B%5Csum_i+%5Calpha_i+F_i%28x%29%7D%5Cright%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_i &#92;alpha_iF_i(x) &#92;log&#92;left(&#92;frac{&#92;beta_i G_i(x)}{&#92;alpha_i F_i(x)}&#92;right) &#92;leq &#92;left(&#92;sum_i &#92;alpha_iF_i(x)&#92;right) &#92;log&#92;left(&#92;frac{&#92;sum_i &#92;frac{&#92;beta_i G_i(x)}{&#92;alpha_i F_i(x)} &#92;alpha_i F_i(x)}{&#92;sum &#92;alpha_i F_i(x)}&#92;right) = &#92;left(&#92;sum_i &#92;alpha_i F_i(x)&#92;right) &#92;log&#92;left(&#92;frac{&#92;sum_i &#92;beta_i G_i(x)}{&#92;sum_i &#92;alpha_i F_i(x)}&#92;right)' title='&#92;sum_i &#92;alpha_iF_i(x) &#92;log&#92;left(&#92;frac{&#92;beta_i G_i(x)}{&#92;alpha_i F_i(x)}&#92;right) &#92;leq &#92;left(&#92;sum_i &#92;alpha_iF_i(x)&#92;right) &#92;log&#92;left(&#92;frac{&#92;sum_i &#92;frac{&#92;beta_i G_i(x)}{&#92;alpha_i F_i(x)} &#92;alpha_i F_i(x)}{&#92;sum &#92;alpha_i F_i(x)}&#92;right) = &#92;left(&#92;sum_i &#92;alpha_i F_i(x)&#92;right) &#92;log&#92;left(&#92;frac{&#92;sum_i &#92;beta_i G_i(x)}{&#92;sum_i &#92;alpha_i F_i(x)}&#92;right)' class='latex' /></p>
<p style="text-align:left;">This proves the inequality and therefore the theorem. <img src='http://s0.wp.com/latex.php?latex=%5Csquare&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;square' title='&#92;square' class='latex' /></p>
<p style="text-align:left;"><strong>Remark:</strong> Intuitively, if we want to describe <img src='http://s0.wp.com/latex.php?latex=%5Csum+%5Calpha_i+F_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum &#92;alpha_i F_i' title='&#92;sum &#92;alpha_i F_i' class='latex' /> in terms of <img src='http://s0.wp.com/latex.php?latex=%5Csum+%5Cbeta_i+G_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum &#92;beta_i G_i' title='&#92;sum &#92;beta_i G_i' class='latex' />, it is enough to first locate the <img src='http://s0.wp.com/latex.php?latex=i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='i' title='i' class='latex' />th term in the sum and then to describe <img src='http://s0.wp.com/latex.php?latex=F_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='F_i' title='F_i' class='latex' /> in terms of <img src='http://s0.wp.com/latex.php?latex=G_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='G_i' title='G_i' class='latex' />. The theorem is a formalization of this intuition. In the case that <img src='http://s0.wp.com/latex.php?latex=F_i+%3D+G_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='F_i = G_i' title='F_i = G_i' class='latex' />, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.</p>
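<p style="text-align:left;">To make the theorem concrete, here is a small Python sketch for discrete distributions, where the integral becomes a finite sum. The list-based representation, the function names, and the convention that terms with zero probability under the first distribution contribute zero are assumptions of this illustration; it also assumes the second distribution is positive wherever the first one is.</p>

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p_i = 0 contribute 0; q_i must be positive where p_i is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mixture(weights, components):
    """Pointwise mixture sum_i weights[i] * components[i]."""
    size = len(components[0])
    return [
        sum(w * comp[j] for w, comp in zip(weights, components))
        for j in range(size)
    ]


def kl_mixture_bound(alpha, beta, fs, gs):
    """Right-hand side of the theorem:
    sum_i alpha_i * (log(alpha_i / beta_i) + KL(F_i || G_i))."""
    return sum(
        a * (math.log(a / b) + kl(f, g))
        for a, b, f, g in zip(alpha, beta, fs, gs)
        if a > 0
    )


if __name__ == "__main__":
    alpha, beta = [0.7, 0.3], [0.4, 0.6]
    fs = [[0.5, 0.5, 0.0], [0.1, 0.2, 0.7]]
    gs = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
    lhs = kl(mixture(alpha, fs), mixture(beta, gs))
    rhs = kl_mixture_bound(alpha, beta, fs, gs)
    print("KL of mixtures:", lhs, "bound:", rhs)
```

<p style="text-align:left;">Passing the same list for fs and gs reproduces the special case in the remark: the bound reduces to the KL divergence between the weight vectors alpha and beta.</p>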
]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2013/02/02/local-kl-divergence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>Quadratically Independent Monomials</title>
		<link>http://jsteinhardt.wordpress.com/2013/01/31/quadratically-independent-monomials/</link>
		<comments>http://jsteinhardt.wordpress.com/2013/01/31/quadratically-independent-monomials/#comments</comments>
		<pubDate>Thu, 31 Jan 2013 08:42:05 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Math]]></category>
		<category><![CDATA[Tricks]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=450</guid>
		<description><![CDATA[Today Arun asked me the following question: &#8220;Under what conditions will a set {p_1, …, p_n} of polynomials be quadratically independent, in the sense that {p_1^2, p_1 p_2, p_2^2, p_1 p_3, …, p_{n-1} p_n, p_n^2} is a linearly independent set?&#8221; I wasn&#8217;t able to make much progress on this general question, but in the specific setting where the p_i are all polynomials in one variable, and we further restrict [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=450&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Today Arun asked me the following question:</p>
<p>&#8220;Under what conditions will a set <img src='http://s0.wp.com/latex.php?latex=%5C%7Bp_1%2C%5Cldots%2Cp_n%5C%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;{p_1,&#92;ldots,p_n&#92;}' title='&#92;{p_1,&#92;ldots,p_n&#92;}' class='latex' /> of polynomials be quadratically independent, in the sense that <img src='http://s0.wp.com/latex.php?latex=%5C%7Bp_1%5E2%2C+p_1p_2%2C+p_2%5E2%2C+p_1p_3%2C%5Cldots%2Cp_%7Bn-1%7Dp_n%2C+p_n%5E2%5C%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;{p_1^2, p_1p_2, p_2^2, p_1p_3,&#92;ldots,p_{n-1}p_n, p_n^2&#92;}' title='&#92;{p_1^2, p_1p_2, p_2^2, p_1p_3,&#92;ldots,p_{n-1}p_n, p_n^2&#92;}' class='latex' /> is a linearly independent set?&#8221;</p>
<p>I wasn&#8217;t able to make much progress on this general question, but in the specific setting where the <img src='http://s0.wp.com/latex.php?latex=p_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p_i' title='p_i' class='latex' /> are all polynomials in one variable, and we further restrict to just monomials (i.e., <img src='http://s0.wp.com/latex.php?latex=p_i%28x%29+%3D+x%5E%7Bd_i%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p_i(x) = x^{d_i}' title='p_i(x) = x^{d_i}' class='latex' /> for some <img src='http://s0.wp.com/latex.php?latex=d_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='d_i' title='d_i' class='latex' />), the condition is just that there are no distinct unordered pairs <img src='http://s0.wp.com/latex.php?latex=%28i_1%2Cj_1%29%2C%28i_2%2Cj_2%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(i_1,j_1),(i_2,j_2)' title='(i_1,j_1),(i_2,j_2)' class='latex' /> such that <img src='http://s0.wp.com/latex.php?latex=d_%7Bi_1%7D+%2B+d_%7Bj_1%7D+%3D+d_%7Bi_2%7D+%2B+d_%7Bj_2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='d_{i_1} + d_{j_1} = d_{i_2} + d_{j_2}' title='d_{i_1} + d_{j_1} = d_{i_2} + d_{j_2}' class='latex' />. Arun was interested in how large such a set could be for a given maximum degree <img src='http://s0.wp.com/latex.php?latex=D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='D' title='D' class='latex' />, so we are left with the following interesting combinatorics problem:</p>
<p>&#8220;What is the largest subset <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S' title='S' class='latex' /> of <img src='http://s0.wp.com/latex.php?latex=%5C%7B1%2C%5Cldots%2CD%5C%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;{1,&#92;ldots,D&#92;}' title='&#92;{1,&#92;ldots,D&#92;}' class='latex' /> such that no two distinct pairs of elements of <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S' title='S' class='latex' /> have the same sum?&#8221;</p>
<p>For convenience of notation let <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n' title='n' class='latex' /> denote the size of <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='S' title='S' class='latex' />. A simple upper bound is <img src='http://s0.wp.com/latex.php?latex=%5Cbinom%7Bn%2B1%7D%7B2%7D+%5Cleq+2D-1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;binom{n+1}{2} &#92;leq 2D-1' title='&#92;binom{n+1}{2} &#92;leq 2D-1' class='latex' />, since there are <img src='http://s0.wp.com/latex.php?latex=%5Cbinom%7Bn%2B1%7D%7B2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;binom{n+1}{2}' title='&#92;binom{n+1}{2}' class='latex' /> unordered pairs (allowing a repeated element) to take a sum of, and all pairwise sums lie between <img src='http://s0.wp.com/latex.php?latex=2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='2' title='2' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=2D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='2D' title='2D' class='latex' />. We therefore have <img src='http://s0.wp.com/latex.php?latex=n+%3D+O%28%5Csqrt%7BD%7D%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n = O(&#92;sqrt{D})' title='n = O(&#92;sqrt{D})' class='latex' />.</p>
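<p>For small D the counting bound can be compared against the true optimum by exhaustive search. The following Python sketch is only an illustration: the names counting_upper_bound and largest_valid_subset are made up here, pairs are taken to include an element paired with itself (matching the count above), and the backtracking search is exponential, so it is only meant for small D.</p>

```python
import math


def counting_upper_bound(d):
    """Largest n with n * (n + 1) / 2 at most 2 * d - 1."""
    m = 2 * d - 1
    return (math.isqrt(8 * m + 1) - 1) // 2


def largest_valid_subset(d):
    """Exhaustively find a largest subset of {1, ..., d} whose pairwise sums
    (including each element added to itself) are all distinct."""
    best = []

    def extend(current, sums, start):
        nonlocal best
        if len(current) > len(best):
            best = current[:]
        for v in range(start, d + 1):
            # Sums created by adding v: v with each earlier element, and v + v.
            new = {v + u for u in current} | {2 * v}
            if new.isdisjoint(sums):
                current.append(v)
                extend(current, sums | new, v + 1)
                current.pop()

    extend([], set(), 1)
    return best


if __name__ == "__main__":
    for d in range(1, 13):
        print(d, len(largest_valid_subset(d)), counting_upper_bound(d))
```

<p>For example, with D = 4 the set {1, 2, 4} has six distinct pair sums, so the counting bound of 3 is attained there.</p>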
<p>What about lower bounds on n? If we let S be the powers of 2 less than or equal to D, then we get a lower bound of <img src='http://s0.wp.com/latex.php?latex=%5Clog_2%28D%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;log_2(D)' title='&#92;log_2(D)' class='latex' />; the Fibonacci numbers do slightly better if we only compare pairs of distinct elements (once an element may be paired with itself they collide, e.g. 1+3 = 2+2), but this still only gives us logarithmic growth. So the question is, can we find sets that grow polynomially in D?</p>
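<p>These simple constructions are easy to test directly. The sketch below is a hypothetical helper, not something from the discussion above: has_distinct_pair_sums checks the no-repeated-sum condition, with an allow_repeats flag (an assumption of this sketch) that controls whether an element may be paired with itself.</p>

```python
from itertools import combinations, combinations_with_replacement


def has_distinct_pair_sums(values, allow_repeats=True):
    """True if no two distinct unordered pairs drawn from values share a sum.

    With allow_repeats=True a pair may use the same element twice."""
    pairs = combinations_with_replacement if allow_repeats else combinations
    seen = set()
    for a, b in pairs(sorted(set(values)), 2):
        if a + b in seen:
            return False
        seen.add(a + b)
    return True


def powers_of_two_up_to(d):
    """Powers of two 1, 2, 4, ... that are at most d."""
    out, v = [], 1
    while d >= v:
        out.append(v)
        v *= 2
    return out


def fibonacci_up_to(d):
    """Distinct Fibonacci numbers 1, 2, 3, 5, ... that are at most d."""
    out, a, b = [], 1, 2
    while d >= a:
        out.append(a)
        a, b = b, a + b
    return out


if __name__ == "__main__":
    d = 1000
    print(has_distinct_pair_sums(powers_of_two_up_to(d)))
    print(has_distinct_pair_sums(fibonacci_up_to(d), allow_repeats=False))
    print(has_distinct_pair_sums(fibonacci_up_to(d)))
```

<p>The powers of two pass under either convention, while the Fibonacci numbers pass only when pairs must use two different elements.</p>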
<p>It turns out the answer is yes, and we can do so by choosing randomly. Let each element of <img src='http://s0.wp.com/latex.php?latex=%5C%7B1%2C%5Cldots%2CD%5C%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;{1,&#92;ldots,D&#92;}' title='&#92;{1,&#92;ldots,D&#92;}' class='latex' /> be placed in S with probability p. Now consider any k, <img src='http://s0.wp.com/latex.php?latex=2+%5Cleq+k+%5Cleq+2D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='2 &#92;leq k &#92;leq 2D' title='2 &#92;leq k &#92;leq 2D' class='latex' />. If k is odd, then there are (k-1)/2 possible pairs that could add up to k: (1,k-1), (2,k-2),&#8230;,((k-1)/2,(k+1)/2). The probability of each such pair existing is <img src='http://s0.wp.com/latex.php?latex=p%5E2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p^2' title='p^2' class='latex' />. Note that each of these events is independent.</p>
<p>S is invalid if and only if there exists some k such that more than one of these pairs is active in S. The probability of any two given pairs being simultaneously active is <img src='http://s0.wp.com/latex.php?latex=p%5E4&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p^4' title='p^4' class='latex' />, and there are <img src='http://s0.wp.com/latex.php?latex=%5Cbinom%7B%28k-1%29%2F2%7D%7B2%7D+%5Cleq+%5Cbinom%7BD%7D%7B2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;binom{(k-1)/2}{2} &#92;leq &#92;binom{D}{2}' title='&#92;binom{(k-1)/2}{2} &#92;leq &#92;binom{D}{2}' class='latex' /> such pairs for a given <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k' title='k' class='latex' />, hence <img src='http://s0.wp.com/latex.php?latex=%28D-1%29%5Cbinom%7BD%7D%7B2%7D+%5Cleq+D%5E3%2F2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='(D-1)&#92;binom{D}{2} &#92;leq D^3/2' title='(D-1)&#92;binom{D}{2} &#92;leq D^3/2' class='latex' /> such pairs total (since we were just looking at odd k). Therefore, the probability of an odd value of k invalidating S is at most <img src='http://s0.wp.com/latex.php?latex=p%5E4D%5E3%2F2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p^4D^3/2' title='p^4D^3/2' class='latex' />.</p>
<p>For even <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k' title='k' class='latex' /> we get much the same result except that the probability for a given value of <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='k' title='k' class='latex' /> comes out to the slightly more complicated formula <img src='http://s0.wp.com/latex.php?latex=%5Cbinom%7Bk%2F2-1%7D%7B2%7Dp%5E4+%2B+%28k%2F2-1%29p%5E3+%2B+p%5E2+%5Cleq+D%5E2p%5E4%2F2+%2B+Dp%5E3+%2B+p%5E2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;binom{k/2-1}{2}p^4 + (k/2-1)p^3 + p^2 &#92;leq D^2p^4/2 + Dp^3 + p^2' title='&#92;binom{k/2-1}{2}p^4 + (k/2-1)p^3 + p^2 &#92;leq D^2p^4/2 + Dp^3 + p^2' class='latex' />, so that the total probability of an even value of k invalidating S is at most <img src='http://s0.wp.com/latex.php?latex=p%5E4D%5E3%2F2+%2B+p%5E3D%5E2+%2B+p%5E2D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p^4D^3/2 + p^3D^2 + p^2D' title='p^4D^3/2 + p^3D^2 + p^2D' class='latex' />.</p>
<p>Putting this all together gives us a bound of <img src='http://s0.wp.com/latex.php?latex=p%5E4D%5E3+%2B+p%5E3D%5E2+%2B+p%5E2D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p^4D^3 + p^3D^2 + p^2D' title='p^4D^3 + p^3D^2 + p^2D' class='latex' />. If we set p to be <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7DD%5E%7B-%5Cfrac%7B3%7D%7B4%7D%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{2}D^{-&#92;frac{3}{4}}' title='&#92;frac{1}{2}D^{-&#92;frac{3}{4}}' class='latex' />, then the probability of S being invalid is at most <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B16%7D+%2B+%5Cfrac%7B1%7D%7B8%7D+D%5E%7B-%5Cfrac%7B1%7D%7B4%7D%7D+%2B+%5Cfrac%7B1%7D%7B4%7DD%5E%7B-%5Cfrac%7B1%7D%7B2%7D%7D+%5Cleq+%5Cfrac%7B7%7D%7B16%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{16} + &#92;frac{1}{8} D^{-&#92;frac{1}{4}} + &#92;frac{1}{4}D^{-&#92;frac{1}{2}} &#92;leq &#92;frac{7}{16}' title='&#92;frac{1}{16} + &#92;frac{1}{8} D^{-&#92;frac{1}{4}} + &#92;frac{1}{4}D^{-&#92;frac{1}{2}} &#92;leq &#92;frac{7}{16}' class='latex' />, so with probability at least <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B9%7D%7B16%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{9}{16}' title='&#92;frac{9}{16}' class='latex' /> a set S with elements chosen randomly with probability <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7DD%5E%7B-%5Cfrac%7B3%7D%7B4%7D%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{2}D^{-&#92;frac{3}{4}}' title='&#92;frac{1}{2}D^{-&#92;frac{3}{4}}' class='latex' /> will be valid. On the other hand, such a set has <img src='http://s0.wp.com/latex.php?latex=D%5E%7B1%2F4%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='D^{1/4}' title='D^{1/4}' class='latex' /> elements in expectation, and asymptotically the probability of having at least this many elements is <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{2}' title='&#92;frac{1}{2}' class='latex' />. 
Therefore, with probability at least <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B16%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{16}' title='&#92;frac{1}{16}' class='latex' /> (since 9/16 + 1/2 - 1 = 1/16) a randomly chosen set will be both valid and at least as large as its expected size, which shows that the largest value of <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n' title='n' class='latex' /> is at least <img src='http://s0.wp.com/latex.php?latex=%5COmega%5Cleft%28D%5E%7B1%2F4%7D%5Cright%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;Omega&#92;left(D^{1/4}&#92;right)' title='&#92;Omega&#92;left(D^{1/4}&#92;right)' class='latex' />.</p>
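<p>As a quick sanity check on the arithmetic above, here is a minimal Python sketch that evaluates the bound at the chosen value of p and compares it with the closed form. The function names are hypothetical and introduced only for illustration; the sketch checks the algebra and nothing more, and it does not simulate the random sets themselves.</p>

```python
import math


def invalid_bound(D, p):
    """Union bound p^4 D^3 + p^3 D^2 + p^2 D on the probability that S is invalid."""
    return p**4 * D**3 + p**3 * D**2 + p**2 * D


def bound_at_chosen_p(D):
    """Evaluate the bound at p = (1/2) D^(-3/4), directly and via the closed form."""
    p = 0.5 * D ** (-0.75)
    direct = invalid_bound(D, p)
    closed_form = 1 / 16 + (1 / 8) * D ** (-0.25) + (1 / 4) * D ** (-0.5)
    return direct, closed_form


if __name__ == "__main__":
    for D in (1, 16, 10**4, 10**8):
        direct, closed_form = bound_at_chosen_p(D)
        # The last column is the resulting lower bound on the probability that S is valid.
        print(D, direct, closed_form, 1 - direct)
```

<p>The two columns should agree up to floating-point error, and the bound is largest at D = 1, where it equals 7/16.</p>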
<p>We can actually do better: if all elements are chosen with probability <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7DD%5E%7B-2%2F3%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{2}D^{-2/3}' title='&#92;frac{1}{2}D^{-2/3}' class='latex' />, then one can show that the expected number of invalid pairs is at most <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B8%7DD%5E%7B1%2F3%7D+%2B+O%281%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{1}{8}D^{1/3} + O(1)' title='&#92;frac{1}{8}D^{1/3} + O(1)' class='latex' />, and hence we can pick randomly with probability <img src='http://s0.wp.com/latex.php?latex=p+%3D+%5Cfrac%7B1%7D%7B2%7DD%5E%7B-2%2F3%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='p = &#92;frac{1}{2}D^{-2/3}' title='p = &#92;frac{1}{2}D^{-2/3}' class='latex' />, remove one element of each of the invalid pairs, and still be left with <img src='http://s0.wp.com/latex.php?latex=%5COmega%28D%5E%7B1%2F3%7D%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;Omega(D^{1/3})' title='&#92;Omega(D^{1/3})' class='latex' /> elements in S.</p>
<p>So, to recap: choosing elements randomly gives us S of size <img src='http://s0.wp.com/latex.php?latex=%5COmega%28D%5E%7B1%2F4%7D%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;Omega(D^{1/4})' title='&#92;Omega(D^{1/4})' class='latex' />; choosing randomly and then removing any offending pairs gives us S of size <img src='http://s0.wp.com/latex.php?latex=%5COmega%28D%5E%7B1%2F3%7D%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;Omega(D^{1/3})' title='&#92;Omega(D^{1/3})' class='latex' />; and we have an upper bound of <img src='http://s0.wp.com/latex.php?latex=O%28D%5E%7B1%2F2%7D%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='O(D^{1/2})' title='O(D^{1/2})' class='latex' />. What is the actual asymptotic answer? I don&#8217;t actually know the answer to this, but I thought I&#8217;d share what I have so far because I think the techniques involved are pretty cool.</p>
]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2013/01/31/quadratically-independent-monomials/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>Exponential Families</title>
		<link>http://jsteinhardt.wordpress.com/2012/12/21/exponential-families/</link>
		<comments>http://jsteinhardt.wordpress.com/2012/12/21/exponential-families/#comments</comments>
		<pubDate>Fri, 21 Dec 2012 08:06:24 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=444</guid>
		<description><![CDATA[In my last post I discussed log-linear models. In this post I&#8217;d like to take another perspective on log-linear models, by thinking of them as members of an exponential family. There are many reasons to take this perspective: exponential families give us efficient representations of log-linear models, which is important for continuous domains; they always [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In my <a href="http://jsteinhardt.wordpress.com/2012/12/06/log-linear-models/">last post</a> I discussed log-linear models. In this post I&#8217;d like to take another perspective on log-linear models, by thinking of them as members of an <em>exponential family</em>. There are many reasons to take this perspective: exponential families give us efficient representations of log-linear models, which is important for continuous domains; they always have conjugate priors, which provide an analytically tractable regularization method; finally, they can be viewed as maximum-entropy models for a given set of sufficient statistics. Don&#8217;t worry if these terms are unfamiliar; I will explain all of them by the end of this post. Also note that most of this material is available on the Wikipedia page on exponential families, which I used quite liberally in preparing the below exposition.</p>
<p><b>1. Exponential Families </b></p>
<p>An <em>exponential family</em> is a family of probability distributions, parameterized by <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%5Cin+%5Cmathbb%7BR%7D%5En%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta &#92;in &#92;mathbb{R}^n}' title='{&#92;theta &#92;in &#92;mathbb{R}^n}' class='latex' />, of the form</p>
<p><a name="eqnexp-def-0"></a></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28x+%5Cmid+%5Ctheta%29+%5Cpropto+h%28x%29%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29%29.+%5C+%5C+%5C+%5C+%5C+%281%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle p(x &#92;mid &#92;theta) &#92;propto h(x)&#92;exp(&#92;theta^T&#92;phi(x)). &#92; &#92; &#92; &#92; &#92; (1)' title='&#92;displaystyle p(x &#92;mid &#92;theta) &#92;propto h(x)&#92;exp(&#92;theta^T&#92;phi(x)). &#92; &#92; &#92; &#92; &#92; (1)' class='latex' /></p>
<p>Notice the similarity to the definition of a log-linear model, which is</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28x+%5Cmid+%5Ctheta%29+%5Cpropto+%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29%29.+%5C+%5C+%5C+%5C+%5C+%282%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle p(x &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;theta^T&#92;phi(x)). &#92; &#92; &#92; &#92; &#92; (2)' title='&#92;displaystyle p(x &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;theta^T&#92;phi(x)). &#92; &#92; &#92; &#92; &#92; (2)' class='latex' /></p>
<p>So, a log-linear model is simply an exponential family model with <img src='http://s0.wp.com/latex.php?latex=%7Bh%28x%29+%3D+1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{h(x) = 1}' title='{h(x) = 1}' class='latex' />. Note that we can re-write the right-hand-side of (<a href="#eqnexp-def-0">1</a>) as <img src='http://s0.wp.com/latex.php?latex=%7B%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29%2B%5Clog+h%28x%29%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;exp(&#92;theta^T&#92;phi(x)+&#92;log h(x))}' title='{&#92;exp(&#92;theta^T&#92;phi(x)+&#92;log h(x))}' class='latex' />, so an exponential family is really just a log-linear model with one of the coordinates of <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> constrained to equal <img src='http://s0.wp.com/latex.php?latex=%7B1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{1}' title='{1}' class='latex' />. Also note that the normalization constant in (<a href="#eqnexp-def-0">1</a>) is a function of <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> (since <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> fully specifies the distribution over <img src='http://s0.wp.com/latex.php?latex=%7Bx%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x}' title='{x}' class='latex' />), so we can express (<a href="#eqnexp-def-0">1</a>) more explicitly as</p>
<p><a name="eqnexp-def-1"></a></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28x+%5Cmid+%5Ctheta%29+%3D+h%28x%29%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29-A%28%5Ctheta%29%29%2C+%5C+%5C+%5C+%5C+%5C+%283%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle p(x &#92;mid &#92;theta) = h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta)), &#92; &#92; &#92; &#92; &#92; (3)' title='&#92;displaystyle p(x &#92;mid &#92;theta) = h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta)), &#92; &#92; &#92; &#92; &#92; (3)' class='latex' /></p>
<p>where</p>
<p><a name="eqnA"></a></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+A%28%5Ctheta%29+%3D+%5Clog%5Cleft%28%5Cint+h%28x%29%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29%29+dx%5Cright%29.+%5C+%5C+%5C+%5C+%5C+%284%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle A(&#92;theta) = &#92;log&#92;left(&#92;int h(x)&#92;exp(&#92;theta^T&#92;phi(x)) dx&#92;right). &#92; &#92; &#92; &#92; &#92; (4)' title='&#92;displaystyle A(&#92;theta) = &#92;log&#92;left(&#92;int h(x)&#92;exp(&#92;theta^T&#92;phi(x)) dx&#92;right). &#92; &#92; &#92; &#92; &#92; (4)' class='latex' /></p>
<p>Exponential families are capable of capturing almost all of the common distributions you are familiar with. There is an extensive <a href="http://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions">table</a> on Wikipedia; I&#8217;ve also included some of the most common below:</p>
<ol>
<li><em>Gaussian distributions.</em> Let <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29+%3D+%5Cleft%5B+%5Cbegin%7Barray%7D%7Bc%7D+x+%5C%5C+x%5E2%5Cend%7Barray%7D+%5Cright%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x) = &#92;left[ &#92;begin{array}{c} x &#92;&#92; x^2&#92;end{array} &#92;right]}' title='{&#92;phi(x) = &#92;left[ &#92;begin{array}{c} x &#92;&#92; x^2&#92;end{array} &#92;right]}' class='latex' />. Then <img src='http://s0.wp.com/latex.php?latex=%7Bp%28x+%5Cmid+%5Ctheta%29+%5Cpropto+%5Cexp%28%5Ctheta_1x%2B%5Ctheta_2x%5E2%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(x &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;theta_1x+&#92;theta_2x^2)}' title='{p(x &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;theta_1x+&#92;theta_2x^2)}' class='latex' />. If we let <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%3D+%5Cleft%5B%5Cfrac%7B%5Cmu%7D%7B%5Csigma%5E2%7D%2C-%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%5Cright%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta = &#92;left[&#92;frac{&#92;mu}{&#92;sigma^2},-&#92;frac{1}{2&#92;sigma^2}&#92;right]}' title='{&#92;theta = &#92;left[&#92;frac{&#92;mu}{&#92;sigma^2},-&#92;frac{1}{2&#92;sigma^2}&#92;right]}' class='latex' />, then <img src='http://s0.wp.com/latex.php?latex=%7Bp%28x+%5Cmid+%5Ctheta%29+%5Cpropto+%5Cexp%28%5Cfrac%7B%5Cmu+x%7D%7B%5Csigma%5E2%7D-%5Cfrac%7Bx%5E2%7D%7B2%5Csigma%5E2%7D%29+%5Cpropto+%5Cexp%28-%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%28x-%5Cmu%29%5E2%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(x &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;frac{&#92;mu x}{&#92;sigma^2}-&#92;frac{x^2}{2&#92;sigma^2}) &#92;propto &#92;exp(-&#92;frac{1}{2&#92;sigma^2}(x-&#92;mu)^2)}' title='{p(x &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;frac{&#92;mu x}{&#92;sigma^2}-&#92;frac{x^2}{2&#92;sigma^2}) &#92;propto &#92;exp(-&#92;frac{1}{2&#92;sigma^2}(x-&#92;mu)^2)}' class='latex' />. 
We therefore see that Gaussian distributions are an exponential family for <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29+%3D+%5Cleft%5B+%5Cbegin%7Barray%7D%7Bc%7D+x+%5C%5C+x%5E2+%5Cend%7Barray%7D+%5Cright%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x) = &#92;left[ &#92;begin{array}{c} x &#92;&#92; x^2 &#92;end{array} &#92;right]}' title='{&#92;phi(x) = &#92;left[ &#92;begin{array}{c} x &#92;&#92; x^2 &#92;end{array} &#92;right]}' class='latex' />.</li>
<li><em>Poisson distributions.</em> Let <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29+%3D+%5Bx%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x) = [x]}' title='{&#92;phi(x) = [x]}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7Bh%28x%29+%3D+%5Cleft%5C%7B%5Cbegin%7Barray%7D%7Bccc%7D+%5Cfrac%7B1%7D%7Bx%21%7D+%26+%3A+%26+x+%5Cin+%5C%7B0%2C1%2C2%2C%5Cldots%5C%7D+%5C%5C+0+%26+%3A+%26+%5Cmathrm%7Belse%7D+%5Cend%7Barray%7D%5Cright.%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{h(x) = &#92;left&#92;{&#92;begin{array}{ccc} &#92;frac{1}{x!} &amp; : &amp; x &#92;in &#92;{0,1,2,&#92;ldots&#92;} &#92;&#92; 0 &amp; : &amp; &#92;mathrm{else} &#92;end{array}&#92;right.}' title='{h(x) = &#92;left&#92;{&#92;begin{array}{ccc} &#92;frac{1}{x!} &amp; : &amp; x &#92;in &#92;{0,1,2,&#92;ldots&#92;} &#92;&#92; 0 &amp; : &amp; &#92;mathrm{else} &#92;end{array}&#92;right.}' class='latex' />. Then <img src='http://s0.wp.com/latex.php?latex=%7Bp%28k+%5Cmid+%5Ctheta%29+%5Cpropto+%5Cfrac%7B1%7D%7Bk%21%7D%5Cexp%28%5Ctheta+k%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(k &#92;mid &#92;theta) &#92;propto &#92;frac{1}{k!}&#92;exp(&#92;theta k)}' title='{p(k &#92;mid &#92;theta) &#92;propto &#92;frac{1}{k!}&#92;exp(&#92;theta k)}' class='latex' />. If we let <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%3D+%5Clog%28%5Clambda%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta = &#92;log(&#92;lambda)}' title='{&#92;theta = &#92;log(&#92;lambda)}' class='latex' /> then we get <img src='http://s0.wp.com/latex.php?latex=%7Bp%28k+%5Cmid+%5Ctheta%29+%5Cpropto+%5Cfrac%7B%5Clambda%5Ek%7D%7Bk%21%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(k &#92;mid &#92;theta) &#92;propto &#92;frac{&#92;lambda^k}{k!}}' title='{p(k &#92;mid &#92;theta) &#92;propto &#92;frac{&#92;lambda^k}{k!}}' class='latex' />; we thus see that Poisson distributions are also an exponential family.</li>
<li><em>Multinomial distributions.</em> Suppose that <img src='http://s0.wp.com/latex.php?latex=%7BX+%3D+%5C%7B1%2C2%2C%5Cldots%2Cn%5C%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X = &#92;{1,2,&#92;ldots,n&#92;}}' title='{X = &#92;{1,2,&#92;ldots,n&#92;}}' class='latex' />. Let <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28k%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(k)}' title='{&#92;phi(k)}' class='latex' /> be an <img src='http://s0.wp.com/latex.php?latex=%7Bn%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{n}' title='{n}' class='latex' />-dimensional vector whose <img src='http://s0.wp.com/latex.php?latex=%7Bk%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{k}' title='{k}' class='latex' />th element is <img src='http://s0.wp.com/latex.php?latex=%7B1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{1}' title='{1}' class='latex' /> and where all other elements are zero. Then <img src='http://s0.wp.com/latex.php?latex=%7Bp%28k+%5Cmid+%5Ctheta%29+%5Cpropto+%5Cexp%28%5Ctheta_k%29+%5Cpropto+%5Cfrac%7B%5Cexp%28%5Ctheta_k%29%7D%7B%5Csum_%7Bj%3D1%7D%5En+%5Cexp%28%5Ctheta_j%29%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(k &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;theta_k) &#92;propto &#92;frac{&#92;exp(&#92;theta_k)}{&#92;sum_{j=1}^n &#92;exp(&#92;theta_j)}}' title='{p(k &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;theta_k) &#92;propto &#92;frac{&#92;exp(&#92;theta_k)}{&#92;sum_{j=1}^n &#92;exp(&#92;theta_j)}}' class='latex' />. If <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_k+%3D+%5Clog+P%28x%3Dk%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_k = &#92;log P(x=k)}' title='{&#92;theta_k = &#92;log P(x=k)}' class='latex' />, then we obtain an arbitrary multinomial distribution. Therefore, multinomial distributions are also an exponential family.</li>
</ol>
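<p>The three examples above can be checked numerically. The following is a minimal Python sketch, using only the standard library, that builds each density from its natural parameters and compares it with the familiar formula. All function names are hypothetical, the Poisson normalizer is approximated by a truncated series (an assumption that is only reasonable for moderate values of the rate), and the Gaussian log-normalizer uses the standard closed form for a Gaussian integral rather than anything stated in this post.</p>

```python
import math


def poisson_expfam_pmf(k, theta, terms=200):
    """p(k | theta) = h(k) exp(theta*k - A(theta)) with h(k) = 1/k! and phi(k) = k."""
    # A(theta) = log sum_j exp(theta*j) / j!, computed as a truncated log-sum-exp.
    log_terms = [theta * j - math.lgamma(j + 1) for j in range(terms)]
    m = max(log_terms)
    A = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return math.exp(theta * k - math.lgamma(k + 1) - A)


def poisson_pmf(k, lam):
    """Textbook Poisson probability mass function."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def gaussian_expfam_pdf(x, mu, sigma2):
    """p(x | theta) = exp(theta_1 x + theta_2 x^2 - A(theta)) with h(x) = 1."""
    theta1, theta2 = mu / sigma2, -1.0 / (2.0 * sigma2)
    # Closed form of log integral exp(theta_1 x + theta_2 x^2) dx for negative theta_2.
    A = -theta1**2 / (4.0 * theta2) + 0.5 * math.log(math.pi / -theta2)
    return math.exp(theta1 * x + theta2 * x * x - A)


def gaussian_pdf(x, mu, sigma2):
    """Textbook Gaussian density with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def categorical_expfam(theta):
    """p(k | theta) proportional to exp(theta_k), normalized over all coordinates."""
    m = max(theta)
    weights = [math.exp(t - m) for t in theta]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    print(poisson_expfam_pmf(2, math.log(3.5)), poisson_pmf(2, 3.5))
    print(gaussian_expfam_pdf(0.3, 1.2, 2.0), gaussian_pdf(0.3, 1.2, 2.0))
    print(categorical_expfam([math.log(q) for q in (0.2, 0.5, 0.3)]))
```

<p>Each pair of printed values should agree up to floating-point error, and the last line should recover the probabilities 0.2, 0.5 and 0.3.</p>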
<p><b>2. Sufficient Statistics </b></p>
<p>A <em>statistic</em> of a random variable <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is any deterministic function of that variable. For instance, if <img src='http://s0.wp.com/latex.php?latex=%7BX+%3D+%5BX_1%2C%5Cldots%2CX_n%5D%5ET%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X = [X_1,&#92;ldots,X_n]^T}' title='{X = [X_1,&#92;ldots,X_n]^T}' class='latex' /> is a vector of Gaussian random variables, then the sample mean <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%7B%5Cmu%7D+%3A%3D+%28X_1%2B%5Cldots%2BX_n%29%2Fn%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;hat{&#92;mu} := (X_1+&#92;ldots+X_n)/n}' title='{&#92;hat{&#92;mu} := (X_1+&#92;ldots+X_n)/n}' class='latex' /> and sample variance <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%7B%5Csigma%7D%5E2+%3A%3D+%28X_1%5E2%2B%5Ccdots%2BX_n%5E2%29%2Fn-%28X_1%2B%5Ccdots%2BX_n%29%5E2%2Fn%5E2%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;hat{&#92;sigma}^2 := (X_1^2+&#92;cdots+X_n^2)/n-(X_1+&#92;cdots+X_n)^2/n^2}' title='{&#92;hat{&#92;sigma}^2 := (X_1^2+&#92;cdots+X_n^2)/n-(X_1+&#92;cdots+X_n)^2/n^2}' class='latex' /> are both statistics.</p>
<p>Let <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathcal%7BF%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathcal{F}}' title='{&#92;mathcal{F}}' class='latex' /> be a family of distributions parameterized by <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />, and let <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> be a random variable with distribution given by some unknown <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_0}' title='{&#92;theta_0}' class='latex' />. Then a vector <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X)}' title='{T(X)}' class='latex' /> of statistics is called a vector of <em>sufficient statistics</em> for <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_0}' title='{&#92;theta_0}' class='latex' /> if it contains all possible information about <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_0}' title='{&#92;theta_0}' class='latex' />; that is, for any function <img src='http://s0.wp.com/latex.php?latex=%7Bf%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{f}' title='{f}' class='latex' />, we have</p>
<p><a name="eqnsuff-def"></a></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cmathbb%7BE%7D%5Bf%28X%29+%5Cmid+T%28X%29+%3D+T_0%2C+%5Ctheta+%3D+%5Ctheta_0%5D+%3D+S%28f%2CT_0%29%2C+%5C+%5C+%5C+%5C+%5C+%285%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;mathbb{E}[f(X) &#92;mid T(X) = T_0, &#92;theta = &#92;theta_0] = S(f,T_0), &#92; &#92; &#92; &#92; &#92; (5)' title='&#92;displaystyle &#92;mathbb{E}[f(X) &#92;mid T(X) = T_0, &#92;theta = &#92;theta_0] = S(f,T_0), &#92; &#92; &#92; &#92; &#92; (5)' class='latex' /></p>
<p>for some function <img src='http://s0.wp.com/latex.php?latex=%7BS%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{S}' title='{S}' class='latex' /> that has no dependence on <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_0}' title='{&#92;theta_0}' class='latex' />.</p>
<p>For instance, let <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> be a vector of <img src='http://s0.wp.com/latex.php?latex=%7Bn%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{n}' title='{n}' class='latex' /> independent Gaussian random variables <img src='http://s0.wp.com/latex.php?latex=%7BX_1%2C%5Cldots%2CX_n%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X_1,&#92;ldots,X_n}' title='{X_1,&#92;ldots,X_n}' class='latex' /> with unknown mean <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmu%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mu}' title='{&#92;mu}' class='latex' /> and variance <img src='http://s0.wp.com/latex.php?latex=%7B%5Csigma%5E2%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;sigma^2}' title='{&#92;sigma^2}' class='latex' />. It turns out that <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29+%3A%3D+%5B%5Chat%7B%5Cmu%7D%2C%5Chat%7B%5Csigma%7D%5E2%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X) := [&#92;hat{&#92;mu},&#92;hat{&#92;sigma}^2]}' title='{T(X) := [&#92;hat{&#92;mu},&#92;hat{&#92;sigma}^2]}' class='latex' /> is a sufficient statistic for <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmu%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mu}' title='{&#92;mu}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Csigma%5E2%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;sigma^2}' title='{&#92;sigma^2}' class='latex' />. This is not immediately obvious; a very useful tool for determining whether statistics are sufficient is the <b>Fisher-Neyman factorization theorem</b>:</p>
<blockquote><p><b>Theorem 1 (Fisher-Neyman)</b> <em> <a name="thmfisher-neyman"></a> Suppose that <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> has a probability density function <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X &#92;mid &#92;theta)}' title='{p(X &#92;mid &#92;theta)}' class='latex' />. Then the statistics <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X)}' title='{T(X)}' class='latex' /> are sufficient for <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> if and only if <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X &#92;mid &#92;theta)}' title='{p(X &#92;mid &#92;theta)}' class='latex' /> can be written in the form</em></p>
<p><a name="eqnfisher-neyman"></a></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28X+%5Cmid+%5Ctheta%29+%3D+h%28X%29g_%5Ctheta%28T%28X%29%29.+%5C+%5C+%5C+%5C+%5C+%286%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle p(X &#92;mid &#92;theta) = h(X)g_&#92;theta(T(X)). &#92; &#92; &#92; &#92; &#92; (6)' title='&#92;displaystyle p(X &#92;mid &#92;theta) = h(X)g_&#92;theta(T(X)). &#92; &#92; &#92; &#92; &#92; (6)' class='latex' /></p>
<p>In other words, the probability of <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> can be factored into a part that does not depend on <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />, and a part that depends on <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> only via <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X)}' title='{T(X)}' class='latex' />.</p></blockquote>
<p>What is going on here, intuitively? If <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X &#92;mid &#92;theta)}' title='{p(X &#92;mid &#92;theta)}' class='latex' /> depended only on <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X)}' title='{T(X)}' class='latex' />, then <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X)}' title='{T(X)}' class='latex' /> would definitely be a sufficient statistic. But that isn&#8217;t the only way for <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X)}' title='{T(X)}' class='latex' /> to be a sufficient statistic &#8212; <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X &#92;mid &#92;theta)}' title='{p(X &#92;mid &#92;theta)}' class='latex' /> could also just not depend on <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> at all, in which case <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X)}' title='{T(X)}' class='latex' /> would trivially be a sufficient statistic (as would anything else). The Fisher-Neyman theorem essentially says that the only way in which <img src='http://s0.wp.com/latex.php?latex=%7BT%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(X)}' title='{T(X)}' class='latex' /> can be a sufficient statistic is if the density of <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is a product of these two cases.</p>
<p><em>Proof:</em> If (<a href="#eqnfisher-neyman">6</a>) holds, then we can check that (<a href="#eqnsuff-def">5</a>) is satisfied:</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+%5Cmathbb%7BE%7D%5Bf%28X%29+%5Cmid+T%28X%29+%3D+T_0%2C+%5Ctheta+%3D+%5Ctheta_0%5D+%26%3D%26+%5Cfrac%7B%5Cint_%7BT%28X%29+%3D+T_0%7D+f%28X%29+dp%28X+%5Cmid+%5Ctheta%3D%5Ctheta_0%29%7D%7B%5Cint_%7BT%28X%29+%3D+T_0%7D+dp%28X+%5Cmid+%5Ctheta%3D%5Ctheta_0%29%7D%5C%5C+%5C%5C+%26%3D%26+%5Cfrac%7B%5Cint_%7BT%28X%29%3DT_0%7D+f%28X%29h%28X%29g_%5Ctheta%28T_0%29+dX%7D%7B%5Cint_%7BT%28X%29%3DT_0%7D+h%28X%29g_%5Ctheta%28T_0%29+dX%7D%5C%5C+%5C%5C+%26%3D%26+%5Cfrac%7B%5Cint_%7BT%28X%29%3DT_0%7D+f%28X%29h%28X%29dX%7D%7B%5Cint_%7BT%28X%29%3DT_0%7D+h%28X%29+dX%7D%2C+%5Cend%7Barray%7D+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;begin{array}{rcl} &#92;mathbb{E}[f(X) &#92;mid T(X) = T_0, &#92;theta = &#92;theta_0] &amp;=&amp; &#92;frac{&#92;int_{T(X) = T_0} f(X) dp(X &#92;mid &#92;theta=&#92;theta_0)}{&#92;int_{T(X) = T_0} dp(X &#92;mid &#92;theta=&#92;theta_0)}&#92;&#92; &#92;&#92; &amp;=&amp; &#92;frac{&#92;int_{T(X)=T_0} f(X)h(X)g_&#92;theta(T_0) dX}{&#92;int_{T(X)=T_0} h(X)g_&#92;theta(T_0) dX}&#92;&#92; &#92;&#92; &amp;=&amp; &#92;frac{&#92;int_{T(X)=T_0} f(X)h(X)dX}{&#92;int_{T(X)=T_0} h(X) dX}, &#92;end{array} ' title='&#92;displaystyle &#92;begin{array}{rcl} &#92;mathbb{E}[f(X) &#92;mid T(X) = T_0, &#92;theta = &#92;theta_0] &amp;=&amp; &#92;frac{&#92;int_{T(X) = T_0} f(X) dp(X &#92;mid &#92;theta=&#92;theta_0)}{&#92;int_{T(X) = T_0} dp(X &#92;mid &#92;theta=&#92;theta_0)}&#92;&#92; &#92;&#92; &amp;=&amp; &#92;frac{&#92;int_{T(X)=T_0} f(X)h(X)g_&#92;theta(T_0) dX}{&#92;int_{T(X)=T_0} h(X)g_&#92;theta(T_0) dX}&#92;&#92; &#92;&#92; &amp;=&amp; &#92;frac{&#92;int_{T(X)=T_0} f(X)h(X)dX}{&#92;int_{T(X)=T_0} h(X) dX}, &#92;end{array} ' class='latex' /></p>
<p>where the right-hand-side has no dependence on <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />.</p>
<p>On the other hand, if we compute <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathbb%7BE%7D%5Bf%28X%29+%5Cmid+T%28X%29+%3D+T_0%2C+%5Ctheta+%3D+%5Ctheta_0%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathbb{E}[f(X) &#92;mid T(X) = T_0, &#92;theta = &#92;theta_0]}' title='{&#92;mathbb{E}[f(X) &#92;mid T(X) = T_0, &#92;theta = &#92;theta_0]}' class='latex' /> for an arbitrary density <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X)}' title='{p(X)}' class='latex' />, we get</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+%5Cmathbb%7BE%7D%5Bf%28X%29+%5Cmid+T%28X%29+%3D+T_0%2C+%5Ctheta+%3D+%5Ctheta_0%5D+%26%3D%26+%5Cint_%7BT%28X%29+%3D+T_0%7D+f%28X%29+%5Cfrac%7Bp%28X+%5Cmid+%5Ctheta%3D%5Ctheta_0%29%7D%7B%5Cint_%7BT%28X%29%3DT_0%7D+p%28X+%5Cmid+%5Ctheta%3D%5Ctheta_0%29+dX%7D+dX.+%5Cend%7Barray%7D+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;begin{array}{rcl} &#92;mathbb{E}[f(X) &#92;mid T(X) = T_0, &#92;theta = &#92;theta_0] &amp;=&amp; &#92;int_{T(X) = T_0} f(X) &#92;frac{p(X &#92;mid &#92;theta=&#92;theta_0)}{&#92;int_{T(X)=T_0} p(X &#92;mid &#92;theta=&#92;theta_0) dX} dX. &#92;end{array} ' title='&#92;displaystyle &#92;begin{array}{rcl} &#92;mathbb{E}[f(X) &#92;mid T(X) = T_0, &#92;theta = &#92;theta_0] &amp;=&amp; &#92;int_{T(X) = T_0} f(X) &#92;frac{p(X &#92;mid &#92;theta=&#92;theta_0)}{&#92;int_{T(X)=T_0} p(X &#92;mid &#92;theta=&#92;theta_0) dX} dX. &#92;end{array} ' class='latex' /></p>
<p>If the right-hand-side cannot depend on <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> for <em>any</em> choice of <img src='http://s0.wp.com/latex.php?latex=%7Bf%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{f}' title='{f}' class='latex' />, then the term that we multiply <img src='http://s0.wp.com/latex.php?latex=%7Bf%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{f}' title='{f}' class='latex' /> by must not depend on <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />; that is, <img src='http://s0.wp.com/latex.php?latex=%7B%5Cfrac%7Bp%28X+%5Cmid+%5Ctheta%3D%5Ctheta_0%29%7D%7B%5Cint_%7BT%28X%29+%3D+T_0%7D+p%28X+%5Cmid+%5Ctheta%3D%5Ctheta_0%29+dX%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;frac{p(X &#92;mid &#92;theta=&#92;theta_0)}{&#92;int_{T(X) = T_0} p(X &#92;mid &#92;theta=&#92;theta_0) dX}}' title='{&#92;frac{p(X &#92;mid &#92;theta=&#92;theta_0)}{&#92;int_{T(X) = T_0} p(X &#92;mid &#92;theta=&#92;theta_0) dX}}' class='latex' /> must be some function <img src='http://s0.wp.com/latex.php?latex=%7Bh_0%28X%2C+T_0%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{h_0(X, T_0)}' title='{h_0(X, T_0)}' class='latex' /> that depends only on <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7BT_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T_0}' title='{T_0}' class='latex' /> and not on <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />. 
On the other hand, the denominator <img src='http://s0.wp.com/latex.php?latex=%7B%5Cint_%7BT%28X%29%3DT_0%7D+p%28X+%5Cmid+%5Ctheta%3D%5Ctheta_0%29+dX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;int_{T(X)=T_0} p(X &#92;mid &#92;theta=&#92;theta_0) dX}' title='{&#92;int_{T(X)=T_0} p(X &#92;mid &#92;theta=&#92;theta_0) dX}' class='latex' /> depends only on <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_0}' title='{&#92;theta_0}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7BT_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T_0}' title='{T_0}' class='latex' />; call this dependence <img src='http://s0.wp.com/latex.php?latex=%7Bg_%7B%5Ctheta_0%7D%28T_0%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{g_{&#92;theta_0}(T_0)}' title='{g_{&#92;theta_0}(T_0)}' class='latex' />. Finally, note that <img src='http://s0.wp.com/latex.php?latex=%7BT_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T_0}' title='{T_0}' class='latex' /> is a deterministic function of <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />, so let <img src='http://s0.wp.com/latex.php?latex=%7Bh%28X%29+%3A%3D+h_0%28X%2CT%28X%29%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{h(X) := h_0(X,T(X))}' title='{h(X) := h_0(X,T(X))}' class='latex' />. We then see that <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X+%5Cmid+%5Ctheta%3D%5Ctheta_0%29+%3D+h_0%28X%2C+T_0%29g_%7B%5Ctheta_0%7D%28T_0%29+%3D+h%28X%29g_%7B%5Ctheta_0%7D%28T%28X%29%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X &#92;mid &#92;theta=&#92;theta_0) = h_0(X, T_0)g_{&#92;theta_0}(T_0) = h(X)g_{&#92;theta_0}(T(X))}' title='{p(X &#92;mid &#92;theta=&#92;theta_0) = h_0(X, T_0)g_{&#92;theta_0}(T_0) = h(X)g_{&#92;theta_0}(T(X))}' class='latex' />, which is the same form as (<a href="#eqnfisher-neyman">6</a>), thus completing the proof of the theorem. 
<img src='http://s0.wp.com/latex.php?latex=%5CBox&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;Box' title='&#92;Box' class='latex' /></p>
<p>Now, let us apply the Fisher-Neyman theorem to exponential families. By definition, the density for an exponential family factors as</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28x+%5Cmid+%5Ctheta%29+%3D+h%28x%29%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29-A%28%5Ctheta%29%29.+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle p(x &#92;mid &#92;theta) = h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta)). ' title='&#92;displaystyle p(x &#92;mid &#92;theta) = h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta)). ' class='latex' /></p>
<p>If we let <img src='http://s0.wp.com/latex.php?latex=%7BT%28x%29+%3D+%5Cphi%28x%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{T(x) = &#92;phi(x)}' title='{T(x) = &#92;phi(x)}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7Bg_%5Ctheta%28%5Cphi%28x%29%29+%3D+%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29-A%28%5Ctheta%29%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{g_&#92;theta(&#92;phi(x)) = &#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta))}' title='{g_&#92;theta(&#92;phi(x)) = &#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta))}' class='latex' />, then the Fisher-Neyman condition is met; therefore, <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x)}' title='{&#92;phi(x)}' class='latex' /> is a vector of sufficient statistics for the exponential family. In fact, we can go further:</p>
<blockquote><p><b>Theorem 2</b> <em> Let <img src='http://s0.wp.com/latex.php?latex=%7BX_1%2C%5Cldots%2CX_n%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X_1,&#92;ldots,X_n}' title='{X_1,&#92;ldots,X_n}' class='latex' /> be drawn independently from an exponential family distribution with fixed parameter <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />. Then the empirical expectation <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%7B%5Cphi%7D+%3A%3D+%5Cfrac%7B1%7D%7Bn%7D+%5Csum_%7Bi%3D1%7D%5En+%5Cphi%28X_i%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;hat{&#92;phi} := &#92;frac{1}{n} &#92;sum_{i=1}^n &#92;phi(X_i)}' title='{&#92;hat{&#92;phi} := &#92;frac{1}{n} &#92;sum_{i=1}^n &#92;phi(X_i)}' class='latex' /> is a sufficient statistic for <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />. </em></p></blockquote>
<p><em>Proof:</em> The density for <img src='http://s0.wp.com/latex.php?latex=%7BX_1%2C%5Cldots%2CX_n%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X_1,&#92;ldots,X_n}' title='{X_1,&#92;ldots,X_n}' class='latex' /> given <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> is</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+p%28X_1%2C%5Cldots%2CX_n+%5Cmid+%5Ctheta%29+%26%3D%26+h%28X_1%29%5Ccdots+h%28X_n%29+%5Cexp%28%5Ctheta%5ET%5Csum_%7Bi%3D1%7D%5En+%5Cphi%28X_i%29+-+nA%28%5Ctheta%29%29+%5C%5C+%26%3D%26+h%28X_1%29%5Ccdots+h%28X_n%29%5Cexp%28n+%5B%5Ctheta%5ET%5Chat%7B%5Cphi%7D-A%28%5Ctheta%29%5D%29.+%5Cend%7Barray%7D+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;begin{array}{rcl} p(X_1,&#92;ldots,X_n &#92;mid &#92;theta) &amp;=&amp; h(X_1)&#92;cdots h(X_n) &#92;exp(&#92;theta^T&#92;sum_{i=1}^n &#92;phi(X_i) - nA(&#92;theta)) &#92;&#92; &amp;=&amp; h(X_1)&#92;cdots h(X_n)&#92;exp(n [&#92;theta^T&#92;hat{&#92;phi}-A(&#92;theta)]). &#92;end{array} ' title='&#92;displaystyle &#92;begin{array}{rcl} p(X_1,&#92;ldots,X_n &#92;mid &#92;theta) &amp;=&amp; h(X_1)&#92;cdots h(X_n) &#92;exp(&#92;theta^T&#92;sum_{i=1}^n &#92;phi(X_i) - nA(&#92;theta)) &#92;&#92; &amp;=&amp; h(X_1)&#92;cdots h(X_n)&#92;exp(n [&#92;theta^T&#92;hat{&#92;phi}-A(&#92;theta)]). &#92;end{array} ' class='latex' /></p>
<p>Letting <img src='http://s0.wp.com/latex.php?latex=%7Bh%28X_1%2C%5Cldots%2CX_n%29+%3D+h%28X_1%29%5Ccdots+h%28X_n%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{h(X_1,&#92;ldots,X_n) = h(X_1)&#92;cdots h(X_n)}' title='{h(X_1,&#92;ldots,X_n) = h(X_1)&#92;cdots h(X_n)}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7Bg_%5Ctheta%28%5Chat%7B%5Cphi%7D%29+%3D+%5Cexp%28n%5B%5Ctheta%5ET%5Chat%7B%5Cphi%7D-A%28%5Ctheta%29%5D%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{g_&#92;theta(&#92;hat{&#92;phi}) = &#92;exp(n[&#92;theta^T&#92;hat{&#92;phi}-A(&#92;theta)])}' title='{g_&#92;theta(&#92;hat{&#92;phi}) = &#92;exp(n[&#92;theta^T&#92;hat{&#92;phi}-A(&#92;theta)])}' class='latex' />, we see that the Fisher-Neyman conditions are satisfied, so that <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%7B%5Cphi%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;hat{&#92;phi}}' title='{&#92;hat{&#92;phi}}' class='latex' /> is indeed a sufficient statistic. <img src='http://s0.wp.com/latex.php?latex=%5CBox&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;Box' title='&#92;Box' class='latex' /></p>
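<p>As a quick numerical illustration of Theorem 2 (a sketch, not part of the proof), we can take the Poisson family written in exponential-family form. The choices <code>phi(x) = x</code>, <code>h(x) = 1/x!</code> and <code>A(theta) = exp(theta)</code>, as well as the helper name <code>log_density</code>, are assumptions made for this example. Two samples with the same empirical mean should have a likelihood ratio that does not change with the parameter, which is what sufficiency predicts.</p>

```python
import math

def log_density(xs, theta):
    """Log-likelihood of i.i.d. Poisson data in natural parameterization.

    Assumes phi(x) = x, h(x) = 1/x!, A(theta) = exp(theta).
    """
    n = len(xs)
    log_h = -sum(math.lgamma(x + 1) for x in xs)
    return log_h + theta * sum(xs) - n * math.exp(theta)

# Two different samples with the same empirical mean of phi (both sum to 6).
sample_a = [1, 2, 3]
sample_b = [0, 0, 6]

# The log-likelihood ratio should be the same for every theta,
# because only h(X_1)...h(X_n) differs between the two samples.
log_ratios = [log_density(sample_a, t) - log_density(sample_b, t)
              for t in (-1.0, 0.0, 0.5, 2.0)]
print(log_ratios)
```

<p>Every entry of <code>log_ratios</code> should agree up to floating-point error, since the only part of the density that distinguishes the two samples is the product of the <code>h</code> terms.</p>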
<p>Finally, we note (without proof) the same relationship as in the log-linear case to the gradient and Hessian of <img src='http://s0.wp.com/latex.php?latex=%7B%5Clog+p%28X_1%2C%5Cldots%2CX_n+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;log p(X_1,&#92;ldots,X_n &#92;mid &#92;theta)}' title='{&#92;log p(X_1,&#92;ldots,X_n &#92;mid &#92;theta)}' class='latex' /> with respect to the model parameters:</p>
<blockquote><p><b>Theorem 3</b> <em> Again let <img src='http://s0.wp.com/latex.php?latex=%7BX_1%2C%5Cldots%2CX_n%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X_1,&#92;ldots,X_n}' title='{X_1,&#92;ldots,X_n}' class='latex' /> be drawn independently from an exponential family distribution with parameter <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />. Then the gradient of <img src='http://s0.wp.com/latex.php?latex=%7B%5Clog+p%28X_1%2C%5Cldots%2CX_n+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;log p(X_1,&#92;ldots,X_n &#92;mid &#92;theta)}' title='{&#92;log p(X_1,&#92;ldots,X_n &#92;mid &#92;theta)}' class='latex' /> with respect to <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> is</em></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+n+%5Ctimes+%5Cleft%28%5Chat%7B%5Cphi%7D-%5Cmathbb%7BE%7D%5B%5Cphi+%5Cmid+%5Ctheta%5D%5Cright%29+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle n &#92;times &#92;left(&#92;hat{&#92;phi}-&#92;mathbb{E}[&#92;phi &#92;mid &#92;theta]&#92;right) ' title='&#92;displaystyle n &#92;times &#92;left(&#92;hat{&#92;phi}-&#92;mathbb{E}[&#92;phi &#92;mid &#92;theta]&#92;right) ' class='latex' /></p>
<p>and the Hessian is</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+n+%5Ctimes+%5Cleft%28%5Cmathbb%7BE%7D%5B%5Cphi+%5Cmid+%5Ctheta%5D%5Cmathbb%7BE%7D%5B%5Cphi+%5Cmid+%5Ctheta%5D%5ET+-+%5Cmathbb%7BE%7D%5B%5Cphi%5Cphi%5ET+%5Cmid+%5Ctheta%5D%5Cright%29.+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle n &#92;times &#92;left(&#92;mathbb{E}[&#92;phi &#92;mid &#92;theta]&#92;mathbb{E}[&#92;phi &#92;mid &#92;theta]^T - &#92;mathbb{E}[&#92;phi&#92;phi^T &#92;mid &#92;theta]&#92;right). ' title='&#92;displaystyle n &#92;times &#92;left(&#92;mathbb{E}[&#92;phi &#92;mid &#92;theta]&#92;mathbb{E}[&#92;phi &#92;mid &#92;theta]^T - &#92;mathbb{E}[&#92;phi&#92;phi^T &#92;mid &#92;theta]&#92;right). ' class='latex' /></p>
</blockquote>
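<p>This theorem shows that the Hessian is the negative of a covariance matrix, so the log-likelihood is concave in <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />; together with the gradient formula, this provides an efficient algorithm for fitting the parameters of an exponential family distribution (for details on the algorithm, see the part near the end of the <a href="http://jsteinhardt.wordpress.com/2012/12/06/log-linear-models/">log-linear models post</a> on parameter estimation).</p>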
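<p>To make the use of Theorem 3 concrete, here is a minimal sketch of Newton&#8217;s method for one particular family. It assumes a Poisson model in natural form (<code>phi(x) = x</code>, <code>A(theta) = exp(theta)</code>), a fixed number of steps, and a hypothetical function name <code>fit_poisson_natural</code>; it is not necessarily the algorithm described in the linked post.</p>

```python
import math

def fit_poisson_natural(xs, theta=0.0, steps=25):
    """Newton's method on the Poisson log-likelihood in theta = log(rate)."""
    n = len(xs)
    phi_hat = sum(xs) / n            # empirical mean of phi(x) = x
    for _ in range(steps):
        mean = math.exp(theta)       # E[phi | theta]
        second = mean + mean ** 2    # E[phi^2 | theta] for a Poisson
        grad = n * (phi_hat - mean)            # gradient from Theorem 3
        hess = n * (mean * mean - second)      # Hessian from Theorem 3
        theta -= grad / hess                   # Newton update
    return theta

data = [2, 0, 3, 1, 4, 2, 2]
theta_hat = fit_poisson_natural(data)
print(theta_hat, math.log(sum(data) / len(data)))
```

<p>For this family the maximum-likelihood estimate is known in closed form (the logarithm of the sample mean), so the two printed numbers should agree; the point of the sketch is only that the update needs nothing beyond the empirical and model expectations of the sufficient statistics.</p>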
<p><b>3. Moments of an Exponential Family </b></p>
<p>If <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is a real-valued random variable, then the <em><img src='http://s0.wp.com/latex.php?latex=%7Bp%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p}' title='{p}' class='latex' />th moment</em> of <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathbb%7BE%7D%5BX%5Ep%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathbb{E}[X^p]}' title='{&#92;mathbb{E}[X^p]}' class='latex' />. In general, if <img src='http://s0.wp.com/latex.php?latex=%7BX+%3D+%5BX_1%2C%5Cldots%2CX_n%5D%5ET%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X = [X_1,&#92;ldots,X_n]^T}' title='{X = [X_1,&#92;ldots,X_n]^T}' class='latex' /> is a random variable on <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathbb%7BR%7D%5En%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathbb{R}^n}' title='{&#92;mathbb{R}^n}' class='latex' />, then for every sequence <img src='http://s0.wp.com/latex.php?latex=%7Bp_1%2C%5Cldots%2Cp_n%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p_1,&#92;ldots,p_n}' title='{p_1,&#92;ldots,p_n}' class='latex' /> of non-negative integers, there is a corresponding moment <img src='http://s0.wp.com/latex.php?latex=%7BM_%7Bp_1%2C%5Ccdots%2Cp_n%7D+%3A%3D+%5Cmathbb%7BE%7D%5BX_1%5E%7Bp_1%7D%5Ccdots+X_n%5E%7Bp_n%7D%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{M_{p_1,&#92;cdots,p_n} := &#92;mathbb{E}[X_1^{p_1}&#92;cdots X_n^{p_n}]}' title='{M_{p_1,&#92;cdots,p_n} := &#92;mathbb{E}[X_1^{p_1}&#92;cdots X_n^{p_n}]}' class='latex' />.</p>
<p>In exponential families there is a very nice relationship between the normalization constant <img src='http://s0.wp.com/latex.php?latex=%7BA%28%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{A(&#92;theta)}' title='{A(&#92;theta)}' class='latex' /> and the moments of <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />. Before we establish this relationship, let us define the <em>moment generating function</em> of a random variable <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> as <img src='http://s0.wp.com/latex.php?latex=%7Bf%28%5Clambda%29+%3D+%5Cmathbb%7BE%7D%5B%5Cexp%28%5Clambda%5ETX%29%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{f(&#92;lambda) = &#92;mathbb{E}[&#92;exp(&#92;lambda^TX)]}' title='{f(&#92;lambda) = &#92;mathbb{E}[&#92;exp(&#92;lambda^TX)]}' class='latex' />.</p>
<blockquote><p><b>Lemma 4</b> <em> <a name="lemmgf"></a> The moment generating function for a random variable <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is equal to</em></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Csum_%7Bp_1%2C%5Cldots%2Cp_n%3D0%7D%5E%7B%5Cinfty%7D+M_%7Bp_1%2C%5Ccdots%2Cp_n%7D+%5Cfrac%7B%5Clambda_1%5E%7Bp_1%7D%5Ccdots+%5Clambda_n%5E%7Bp_n%7D%7D%7Bp_1%21%5Ccdots+p_n%21%7D.+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;sum_{p_1,&#92;ldots,p_n=0}^{&#92;infty} M_{p_1,&#92;cdots,p_n} &#92;frac{&#92;lambda_1^{p_1}&#92;cdots &#92;lambda_n^{p_n}}{p_1!&#92;cdots p_n!}. ' title='&#92;displaystyle &#92;sum_{p_1,&#92;ldots,p_n=0}^{&#92;infty} M_{p_1,&#92;cdots,p_n} &#92;frac{&#92;lambda_1^{p_1}&#92;cdots &#92;lambda_n^{p_n}}{p_1!&#92;cdots p_n!}. ' class='latex' /></p>
</blockquote>
<p>The proof of Lemma <a href="#lemmgf">4</a> is a straightforward application of Taylor&#8217;s theorem, together with linearity of expectation (note that in one dimension, the expression in Lemma <a href="#lemmgf">4</a> would just be <img src='http://s0.wp.com/latex.php?latex=%7B%5Csum_%7Bp%3D0%7D%5E%7B%5Cinfty%7D+%5Cmathbb%7BE%7D%5BX%5Ep%5D+%5Cfrac%7B%5Clambda%5Ep%7D%7Bp%21%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;sum_{p=0}^{&#92;infty} &#92;mathbb{E}[X^p] &#92;frac{&#92;lambda^p}{p!}}' title='{&#92;sum_{p=0}^{&#92;infty} &#92;mathbb{E}[X^p] &#92;frac{&#92;lambda^p}{p!}}' class='latex' />).</p>
<p>We now see why <img src='http://s0.wp.com/latex.php?latex=%7Bf%28%5Clambda%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{f(&#92;lambda)}' title='{f(&#92;lambda)}' class='latex' /> is called the moment generating function: it is the <a href="http://en.wikipedia.org/wiki/Generating_function#Exponential_generating_function">exponential generating function</a> for the moments of <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />. The moment generating function for the sufficient statistics of an exponential family is particularly easy to compute:</p>
<blockquote><p><b>Lemma 5</b> <em> <a name="lemmgf-exp"></a> If <img src='http://s0.wp.com/latex.php?latex=%7Bp%28x+%5Cmid+%5Ctheta%29+%3D+h%28x%29%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29-A%28%5Ctheta%29%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(x &#92;mid &#92;theta) = h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta))}' title='{p(x &#92;mid &#92;theta) = h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta))}' class='latex' />, then <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathbb%7BE%7D%5B%5Cexp%28%5Clambda%5ET%5Cphi%28x%29%29%5D+%3D+%5Cexp%28A%28%5Ctheta%2B%5Clambda%29-A%28%5Ctheta%29%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathbb{E}[&#92;exp(&#92;lambda^T&#92;phi(x))] = &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta))}' title='{&#92;mathbb{E}[&#92;exp(&#92;lambda^T&#92;phi(x))] = &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta))}' class='latex' />. </em></p></blockquote>
<p><em>Proof:</em></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+%5Cmathbb%7BE%7D%5B%5Cexp%28%5Clambda%5ET%5Cphi%28x%29%29%5D+%26%3D%26+%5Cint+%5Cexp%28%5Clambda%5ET%5Cphi%28x%29%29+p%28x+%5Cmid+%5Ctheta%29+dx+%5C%5C+%26%3D%26+%5Cint+%5Cexp%28%5Clambda%5ET%5Cphi%28x%29%29h%28x%29%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29-A%28%5Ctheta%29%29+dx+%5C%5C+%26%3D%26+%5Cint+h%28x%29%5Cexp%28%28%5Ctheta%2B%5Clambda%29%5ET%5Cphi%28x%29-A%28%5Ctheta%29%29+dx+%5C%5C+%26%3D%26+%5Cint+h%28x%29%5Cexp%28%28%5Ctheta%2B%5Clambda%29%5ET%5Cphi%28x%29-A%28%5Ctheta%2B%5Clambda%29%29dx+%5Ctimes+%5Cexp%28A%28%5Ctheta%2B%5Clambda%29-A%28%5Ctheta%29%29+%5C%5C+%26%3D%26+%5Cint+p%28x+%5Cmid+%5Ctheta%2B%5Clambda%29+dx+%5Ctimes+%5Cexp%28A%28%5Ctheta%2B%5Clambda%29-A%28%5Ctheta%29%29+%5C%5C+%26%3D%26+%5Cexp%28A%28%5Ctheta%2B%5Clambda%29-A%28%5Ctheta%29%29%2C+%5Cend%7Barray%7D+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;begin{array}{rcl} &#92;mathbb{E}[&#92;exp(&#92;lambda^T&#92;phi(x))] &amp;=&amp; &#92;int &#92;exp(&#92;lambda^T&#92;phi(x)) p(x &#92;mid &#92;theta) dx &#92;&#92; &amp;=&amp; &#92;int &#92;exp(&#92;lambda^T&#92;phi(x))h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta)) dx &#92;&#92; &amp;=&amp; &#92;int h(x)&#92;exp((&#92;theta+&#92;lambda)^T&#92;phi(x)-A(&#92;theta)) dx &#92;&#92; &amp;=&amp; &#92;int h(x)&#92;exp((&#92;theta+&#92;lambda)^T&#92;phi(x)-A(&#92;theta+&#92;lambda))dx &#92;times &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta)) &#92;&#92; &amp;=&amp; &#92;int p(x &#92;mid &#92;theta+&#92;lambda) dx &#92;times &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta)) &#92;&#92; &amp;=&amp; &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta)), &#92;end{array} ' title='&#92;displaystyle &#92;begin{array}{rcl} &#92;mathbb{E}[&#92;exp(&#92;lambda^T&#92;phi(x))] &amp;=&amp; &#92;int &#92;exp(&#92;lambda^T&#92;phi(x)) p(x &#92;mid &#92;theta) dx &#92;&#92; &amp;=&amp; &#92;int &#92;exp(&#92;lambda^T&#92;phi(x))h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta)) dx &#92;&#92; &amp;=&amp; &#92;int h(x)&#92;exp((&#92;theta+&#92;lambda)^T&#92;phi(x)-A(&#92;theta)) dx &#92;&#92; &amp;=&amp; &#92;int h(x)&#92;exp((&#92;theta+&#92;lambda)^T&#92;phi(x)-A(&#92;theta+&#92;lambda))dx &#92;times &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta)) &#92;&#92; &amp;=&amp; &#92;int p(x &#92;mid &#92;theta+&#92;lambda) dx &#92;times &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta)) &#92;&#92; &amp;=&amp; &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta)), &#92;end{array} ' class='latex' /></p>
<p>where the last step uses the fact that <img src='http://s0.wp.com/latex.php?latex=%7Bp%28x+%5Cmid+%5Ctheta%2B%5Clambda%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(x &#92;mid &#92;theta+&#92;lambda)}' title='{p(x &#92;mid &#92;theta+&#92;lambda)}' class='latex' /> is a probability density and hence <img src='http://s0.wp.com/latex.php?latex=%7B%5Cint+p%28x+%5Cmid+%5Ctheta%2B%5Clambda%29+dx+%3D+1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;int p(x &#92;mid &#92;theta+&#92;lambda) dx = 1}' title='{&#92;int p(x &#92;mid &#92;theta+&#92;lambda) dx = 1}' class='latex' />. <img src='http://s0.wp.com/latex.php?latex=%5CBox&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;Box' title='&#92;Box' class='latex' /></p>
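<p>Lemma <a href="#lemmgf-exp">5</a> is easy to sanity-check numerically. The sketch below again assumes the Poisson family in natural form (<code>phi(x) = x</code>, <code>h(x) = 1/x!</code>, <code>A(theta) = exp(theta)</code>) and uses a hypothetical helper <code>mgf_by_summation</code> that truncates the sum over outcomes at a fixed number of terms.</p>

```python
import math

def A(theta):
    """Log-normalizer of the Poisson family in natural form (an assumed example)."""
    return math.exp(theta)

def mgf_by_summation(theta, lam, terms=200):
    """E[exp(lam * phi(X))] with phi(x) = x, h(x) = 1/x!, summed term by term."""
    total = 0.0
    for k in range(terms):
        log_p = -math.lgamma(k + 1) + theta * k - A(theta)   # log p(k | theta)
        total += math.exp(lam * k + log_p)
    return total

theta, lam = 0.3, 0.7
lhs = mgf_by_summation(theta, lam)              # direct expectation
rhs = math.exp(A(theta + lam) - A(theta))       # formula from Lemma 5
print(lhs, rhs)
```

<p>The two printed values should match to many digits; the truncation at 200 terms is harmless here because the Poisson tail is negligible for these parameter values.</p>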
<p>Now, by Lemma <a href="#lemmgf">4</a>, <img src='http://s0.wp.com/latex.php?latex=%7BM_%7Bp_1%2C%5Ccdots%2Cp_n%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{M_{p_1,&#92;cdots,p_n}}' title='{M_{p_1,&#92;cdots,p_n}}' class='latex' /> is just the <img src='http://s0.wp.com/latex.php?latex=%7B%28p_1%2C%5Cldots%2Cp_n%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{(p_1,&#92;ldots,p_n)}' title='{(p_1,&#92;ldots,p_n)}' class='latex' /> coefficient (up to the factorial factors) in the Taylor series for the moment generating function <img src='http://s0.wp.com/latex.php?latex=%7Bf%28%5Clambda%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{f(&#92;lambda)}' title='{f(&#92;lambda)}' class='latex' />, and hence we can compute <img src='http://s0.wp.com/latex.php?latex=%7BM_%7Bp_1%2C%5Ccdots%2Cp_n%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{M_{p_1,&#92;cdots,p_n}}' title='{M_{p_1,&#92;cdots,p_n}}' class='latex' /> as <img src='http://s0.wp.com/latex.php?latex=%7B%5Cfrac%7B%5Cpartial%5E%7Bp_1%2B%5Ccdots%2Bp_n%7D+f%28%5Clambda%29%7D%7B%5Cpartial%5E%7Bp_1%7D%5Clambda_1%5Ccdots+%5Cpartial%5E%7Bp_n%7D%5Clambda_n%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;frac{&#92;partial^{p_1+&#92;cdots+p_n} f(&#92;lambda)}{&#92;partial^{p_1}&#92;lambda_1&#92;cdots &#92;partial^{p_n}&#92;lambda_n}}' title='{&#92;frac{&#92;partial^{p_1+&#92;cdots+p_n} f(&#92;lambda)}{&#92;partial^{p_1}&#92;lambda_1&#92;cdots &#92;partial^{p_n}&#92;lambda_n}}' class='latex' /> evaluated at <img src='http://s0.wp.com/latex.php?latex=%7B%5Clambda+%3D+0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;lambda = 0}' title='{&#92;lambda = 0}' class='latex' />. Combining this with Lemma <a href="#lemmgf-exp">5</a>, which gives the moment generating function of the sufficient statistics <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x)}' title='{&#92;phi(x)}' class='latex' />, gives us a closed-form expression for their moments <img src='http://s0.wp.com/latex.php?latex=%7BM_%7Bp_1%2C%5Ccdots%2Cp_n%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{M_{p_1,&#92;cdots,p_n}}' title='{M_{p_1,&#92;cdots,p_n}}' class='latex' /> in terms of the normalization constant <img src='http://s0.wp.com/latex.php?latex=%7BA%28%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{A(&#92;theta)}' title='{A(&#92;theta)}' class='latex' />:</p>
<blockquote><p><b>Lemma 6</b> <em> <a name="lemexp-moment"></a> The moments of the sufficient statistics of an exponential family can be computed as<br />
</em></p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+M_%7Bp_1%2C%5Cldots%2Cp_n%7D+%3D+%5Cleft.%5Cfrac%7B%5Cpartial%5E%7Bp_1%2B%5Ccdots%2Bp_n%7D+%5Cexp%28A%28%5Ctheta%2B%5Clambda%29-A%28%5Ctheta%29%29%7D%7B%5Cpartial%5E%7Bp_1%7D%5Clambda_1%5Ccdots+%5Cpartial%5E%7Bp_n%7D%5Clambda_n%7D%5Cright%7C_%7B%5Clambda%3D0%7D.+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle M_{p_1,&#92;ldots,p_n} = &#92;left.&#92;frac{&#92;partial^{p_1+&#92;cdots+p_n} &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta))}{&#92;partial^{p_1}&#92;lambda_1&#92;cdots &#92;partial^{p_n}&#92;lambda_n}&#92;right|_{&#92;lambda=0}. ' title='&#92;displaystyle M_{p_1,&#92;ldots,p_n} = &#92;left.&#92;frac{&#92;partial^{p_1+&#92;cdots+p_n} &#92;exp(A(&#92;theta+&#92;lambda)-A(&#92;theta))}{&#92;partial^{p_1}&#92;lambda_1&#92;cdots &#92;partial^{p_n}&#92;lambda_n}&#92;right|_{&#92;lambda=0}. ' class='latex' /></p>
</blockquote>
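<p>As a rough check of Lemma <a href="#lemexp-moment">6</a>, the sketch below differentiates <code>exp(A(theta + lam) - A(theta))</code> numerically at <code>lam = 0</code> and compares the result with the known first and second moments of a Poisson random variable. The Poisson choice <code>A(t) = exp(t)</code>, the step size, and the use of central finite differences in place of exact derivatives are all assumptions of this example.</p>

```python
import math

def mgf(lam, theta):
    """exp(A(theta + lam) - A(theta)) for the Poisson family, A(t) = exp(t)."""
    return math.exp(math.exp(theta + lam) - math.exp(theta))

theta = 0.5
step = 1e-4

# Central finite differences in lam, evaluated at lam = 0.
first_moment = (mgf(step, theta) - mgf(-step, theta)) / (2 * step)
second_moment = (mgf(step, theta) - 2 * mgf(0.0, theta) + mgf(-step, theta)) / step ** 2

rate = math.exp(theta)   # Poisson rate implied by theta
print(first_moment, rate)
print(second_moment, rate + rate ** 2)
```

<p>Each printed pair should agree to several digits; finite differences lose accuracy quickly for higher orders, so for high moments symbolic differentiation would be the more sensible tool.</p>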
<p>For those who prefer <a href="http://en.wikipedia.org/wiki/Cumulant">cumulants</a> to moments, I will note that there is a version of Lemma <a href="#lemexp-moment">6</a> for cumulants with an even simpler formula.</p>
<p><b>Exercise:</b> Use Lemma <a href="#lemexp-moment">6</a> to compute <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathbb%7BE%7D%5BX%5E6%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathbb{E}[X^6]}' title='{&#92;mathbb{E}[X^6]}' class='latex' />, where <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is a Gaussian with mean <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmu%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mu}' title='{&#92;mu}' class='latex' /> and variance <img src='http://s0.wp.com/latex.php?latex=%7B%5Csigma%5E2%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;sigma^2}' title='{&#92;sigma^2}' class='latex' />.</p>
<p><b>4. Conjugate Priors </b></p>
<p>Given a family of distributions <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X &#92;mid &#92;theta)}' title='{p(X &#92;mid &#92;theta)}' class='latex' />, a <em>conjugate prior family</em> <img src='http://s0.wp.com/latex.php?latex=%7Bp%28%5Ctheta+%5Cmid+%5Calpha%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(&#92;theta &#92;mid &#92;alpha)}' title='{p(&#92;theta &#92;mid &#92;alpha)}' class='latex' /> is a family that has the property that</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28%5Ctheta+%5Cmid+X%2C+%5Calpha%29+%3D+p%28%5Ctheta+%5Cmid+%5Calpha%27%29+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle p(&#92;theta &#92;mid X, &#92;alpha) = p(&#92;theta &#92;mid &#92;alpha&#039;) ' title='&#92;displaystyle p(&#92;theta &#92;mid X, &#92;alpha) = p(&#92;theta &#92;mid &#92;alpha&#039;) ' class='latex' /></p>
<p>for some <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%27%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;alpha&#039;}' title='{&#92;alpha&#039;}' class='latex' /> depending on <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;alpha}' title='{&#92;alpha}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />. In other words, if the prior over <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> lies in the conjugate family, and we observe <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />, then the posterior over <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> also lies in the conjugate family. This is very useful algebraically as it means that we can get our posterior simply by updating the parameters of the prior. The following are examples of conjugate families:</p>
<ol>
<li>(Gaussian-Gaussian) Let <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X+%5Cmid+%5Cmu%29+%5Cpropto+%5Cexp%28-%28X-%5Cmu%29%5E2%2F2%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X &#92;mid &#92;mu) &#92;propto &#92;exp(-(X-&#92;mu)^2/2)}' title='{p(X &#92;mid &#92;mu) &#92;propto &#92;exp(-(X-&#92;mu)^2/2)}' class='latex' />, and let <img src='http://s0.wp.com/latex.php?latex=%7Bp%28%5Cmu+%5Cmid+%5Cmu_0%2C+%5Csigma_0%29+%5Cpropto+%5Cexp%28-%28%5Cmu-%5Cmu_0%29%5E2%2F2%5Csigma_0%5E2%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(&#92;mu &#92;mid &#92;mu_0, &#92;sigma_0) &#92;propto &#92;exp(-(&#92;mu-&#92;mu_0)^2/2&#92;sigma_0^2)}' title='{p(&#92;mu &#92;mid &#92;mu_0, &#92;sigma_0) &#92;propto &#92;exp(-(&#92;mu-&#92;mu_0)^2/2&#92;sigma_0^2)}' class='latex' />. Then, by Bayes&#8217; rule,</li>
</ol>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+p%28%5Cmu+%5Cmid+X%3Dx%2C+%5Cmu_0%2C+%5Csigma_0%29+%26%5Cpropto%26+%5Cexp%28-%28x-%5Cmu%29%5E2%2F2%29%5Cexp%28-%28%5Cmu-%5Cmu_0%29%5E2%2F2%5Csigma_0%5E2%29+%5C%5C+%26%3D%26+%5Cexp%5Cleft%28-%5Cfrac%7B%28%5Cmu-%5Cmu_0%29%5E2%2B%5Csigma_0%5E2%28%5Cmu-x%29%5E2%7D%7B2%5Csigma_0%5E2%7D%5Cright%29+%5C%5C+%26%5Cpropto%26+%5Cexp%5Cleft%28-%5Cfrac%7B%281%2B%5Csigma_0%5E2%29%5Cmu%5E2-2%28%5Cmu_0%2B%5Csigma_0%5E2x%29%5Cmu%7D%7B2%5Csigma_0%5E2%7D%5Cright%29+%5C%5C+%26%5Cpropto%26+%5Cexp%5Cleft%28-%5Cfrac%7B%5Cmu%5E2-2%5Cfrac%7B%5Cmu_0%2Bx%5Csigma_0%5E2%7D%7B1%2B%5Csigma_0%5E2%7D%5Cmu%7D%7B2%5Csigma_0%5E2%2F%281%2B%5Csigma_0%5E2%29%7D%5Cright%29+%5C%5C+%26%5Cpropto%26+%5Cexp%5Cleft%28-%5Cfrac%7B%28%5Cmu-%28%5Cmu_0%2Bx%5Csigma_0%5E2%29%2F%281%2B%5Csigma_0%5E2%29%29%5E2%7D%7B2%5Csigma_0%5E2%2F%281%2B%5Csigma_0%5E2%29%7D%5Cright%29+%5C%5C+%26%5Cpropto%26+p%5Cleft%28%5Cmu+%5Cmid+%5Cfrac%7B%5Cmu_0%2Bx%5Csigma_0%5E2%7D%7B1%2B%5Csigma_0%5E2%7D%2C+%5Cfrac%7B%5Csigma_0%7D%7B%5Csqrt%7B1%2B%5Csigma_0%5E2%7D%7D%5Cright%29.+%5Cend%7Barray%7D+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;begin{array}{rcl} p(&#92;mu &#92;mid X=x, &#92;mu_0, &#92;sigma_0) &amp;&#92;propto&amp; &#92;exp(-(x-&#92;mu)^2/2)&#92;exp(-(&#92;mu-&#92;mu_0)^2/2&#92;sigma_0^2) &#92;&#92; &amp;=&amp; &#92;exp&#92;left(-&#92;frac{(&#92;mu-&#92;mu_0)^2+&#92;sigma_0^2(&#92;mu-x)^2}{2&#92;sigma_0^2}&#92;right) &#92;&#92; &amp;&#92;propto&amp; &#92;exp&#92;left(-&#92;frac{(1+&#92;sigma_0^2)&#92;mu^2-2(&#92;mu_0+&#92;sigma_0^2x)&#92;mu}{2&#92;sigma_0^2}&#92;right) &#92;&#92; &amp;&#92;propto&amp; &#92;exp&#92;left(-&#92;frac{&#92;mu^2-2&#92;frac{&#92;mu_0+x&#92;sigma_0^2}{1+&#92;sigma_0^2}&#92;mu}{2&#92;sigma_0^2/(1+&#92;sigma_0^2)}&#92;right) &#92;&#92; &amp;&#92;propto&amp; &#92;exp&#92;left(-&#92;frac{(&#92;mu-(&#92;mu_0+x&#92;sigma_0^2)/(1+&#92;sigma_0^2))^2}{2&#92;sigma_0^2/(1+&#92;sigma_0^2)}&#92;right) &#92;&#92; &amp;&#92;propto&amp; p&#92;left(&#92;mu &#92;mid &#92;frac{&#92;mu_0+x&#92;sigma_0^2}{1+&#92;sigma_0^2}, &#92;frac{&#92;sigma_0}{&#92;sqrt{1+&#92;sigma_0^2}}&#92;right). &#92;end{array} ' title='&#92;displaystyle &#92;begin{array}{rcl} p(&#92;mu &#92;mid X=x, &#92;mu_0, &#92;sigma_0) &amp;&#92;propto&amp; &#92;exp(-(x-&#92;mu)^2/2)&#92;exp(-(&#92;mu-&#92;mu_0)^2/2&#92;sigma_0^2) &#92;&#92; &amp;=&amp; &#92;exp&#92;left(-&#92;frac{(&#92;mu-&#92;mu_0)^2+&#92;sigma_0^2(&#92;mu-x)^2}{2&#92;sigma_0^2}&#92;right) &#92;&#92; &amp;&#92;propto&amp; &#92;exp&#92;left(-&#92;frac{(1+&#92;sigma_0^2)&#92;mu^2-2(&#92;mu_0+&#92;sigma_0^2x)&#92;mu}{2&#92;sigma_0^2}&#92;right) &#92;&#92; &amp;&#92;propto&amp; &#92;exp&#92;left(-&#92;frac{&#92;mu^2-2&#92;frac{&#92;mu_0+x&#92;sigma_0^2}{1+&#92;sigma_0^2}&#92;mu}{2&#92;sigma_0^2/(1+&#92;sigma_0^2)}&#92;right) &#92;&#92; &amp;&#92;propto&amp; &#92;exp&#92;left(-&#92;frac{(&#92;mu-(&#92;mu_0+x&#92;sigma_0^2)/(1+&#92;sigma_0^2))^2}{2&#92;sigma_0^2/(1+&#92;sigma_0^2)}&#92;right) &#92;&#92; &amp;&#92;propto&amp; p&#92;left(&#92;mu &#92;mid &#92;frac{&#92;mu_0+x&#92;sigma_0^2}{1+&#92;sigma_0^2}, &#92;frac{&#92;sigma_0}{&#92;sqrt{1+&#92;sigma_0^2}}&#92;right). &#92;end{array} ' class='latex' /></p>
<p>Therefore, <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmu_0%2C+%5Csigma_0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mu_0, &#92;sigma_0}' title='{&#92;mu_0, &#92;sigma_0}' class='latex' /> parameterize a family of priors over <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmu%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mu}' title='{&#92;mu}' class='latex' /> that is conjugate to <img src='http://s0.wp.com/latex.php?latex=%7BX+%5Cmid+%5Cmu%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X &#92;mid &#92;mu}' title='{X &#92;mid &#92;mu}' class='latex' />.</p>
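<p>The closed-form posterior parameters for a unit-variance Gaussian likelihood with a Gaussian prior on the mean can be cross-checked by brute force. The following sketch multiplies likelihood and prior on a grid and compares the resulting mean and standard deviation with the formulas; the function names, grid bounds and test values are assumptions chosen for illustration.</p>

```python
import math

def posterior_params(x, mu0, sigma0):
    """Closed-form update for a unit-variance Gaussian likelihood."""
    var0 = sigma0 ** 2
    return (mu0 + x * var0) / (1 + var0), sigma0 / math.sqrt(1 + var0)

def posterior_by_grid(x, mu0, sigma0, lo=-20.0, hi=20.0, n=40001):
    """Mean and standard deviation of likelihood * prior on a uniform grid."""
    width = (hi - lo) / (n - 1)
    grid = [lo + i * width for i in range(n)]
    weights = [math.exp(-(x - m) ** 2 / 2 - (m - mu0) ** 2 / (2 * sigma0 ** 2))
               for m in grid]
    total = sum(weights)
    mean = sum(w * m for w, m in zip(weights, grid)) / total
    var = sum(w * (m - mean) ** 2 for w, m in zip(weights, grid)) / total
    return mean, math.sqrt(var)

print(posterior_params(1.5, -0.5, 2.0))
print(posterior_by_grid(1.5, -0.5, 2.0))
```

<p>Both lines should print approximately the same pair of numbers. Note that the posterior standard deviation does not depend on the observed value at all, only on the prior standard deviation.</p>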
<ul>
<li>(Beta-Bernoulli) Let <img src='http://s0.wp.com/latex.php?latex=%7BX+%5Cin+%5C%7B0%2C1%5C%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X &#92;in &#92;{0,1&#92;}}' title='{X &#92;in &#92;{0,1&#92;}}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%5Cin+%5B0%2C1%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta &#92;in [0,1]}' title='{&#92;theta &#92;in [0,1]}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X%3D1+%5Cmid+%5Ctheta%29+%3D+%5Ctheta%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X=1 &#92;mid &#92;theta) = &#92;theta}' title='{p(X=1 &#92;mid &#92;theta) = &#92;theta}' class='latex' />, and <img src='http://s0.wp.com/latex.php?latex=%7Bp%28%5Ctheta+%5Cmid+%5Calpha%2C+%5Cbeta%29+%5Cpropto+%5Ctheta%5E%7B%5Calpha-1%7D%281-%5Ctheta%29%5E%7B%5Cbeta-1%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(&#92;theta &#92;mid &#92;alpha, &#92;beta) &#92;propto &#92;theta^{&#92;alpha-1}(1-&#92;theta)^{&#92;beta-1}}' title='{p(&#92;theta &#92;mid &#92;alpha, &#92;beta) &#92;propto &#92;theta^{&#92;alpha-1}(1-&#92;theta)^{&#92;beta-1}}' class='latex' />. The distribution over <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> given <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> is then called a <em>Bernoulli distribution</em>, and that of <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> given <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;alpha}' title='{&#92;alpha}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Cbeta%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;beta}' title='{&#92;beta}' class='latex' /> is called a <em>beta distribution</em>. 
Note that <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X&#92;mid &#92;theta)}' title='{p(X&#92;mid &#92;theta)}' class='latex' /> can also be written as <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%5EX%281-%5Ctheta%29%5E%7B1-X%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta^X(1-&#92;theta)^{1-X}}' title='{&#92;theta^X(1-&#92;theta)^{1-X}}' class='latex' />. From this, we see that the family of beta distributions is a conjugate prior to the family of Bernoulli distributions, since</li>
</ul>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+p%28%5Ctheta+%5Cmid+X%3Dx%2C+%5Calpha%2C+%5Cbeta%29+%26%5Cpropto%26+%5Ctheta%5Ex%281-%5Ctheta%29%5E%7B1-x%7D+%5Ctimes+%5Ctheta%5E%7B%5Calpha-1%7D%281-%5Ctheta%29%5E%7B%5Cbeta-1%7D+%5C%5C+%26%3D%26+%5Ctheta%5E%7B%5Calpha%2Bx-1%7D%281-%5Ctheta%29%5E%7B%5Cbeta%2B%281-x%29-1%7D+%5C%5C+%26%5Cpropto%26+p%28%5Ctheta+%5Cmid+%5Calpha%2Bx%2C+%5Cbeta%2B%281-x%29%29.+%5Cend%7Barray%7D+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;begin{array}{rcl} p(&#92;theta &#92;mid X=x, &#92;alpha, &#92;beta) &amp;&#92;propto&amp; &#92;theta^x(1-&#92;theta)^{1-x} &#92;times &#92;theta^{&#92;alpha-1}(1-&#92;theta)^{&#92;beta-1} &#92;&#92; &amp;=&amp; &#92;theta^{&#92;alpha+x-1}(1-&#92;theta)^{&#92;beta+(1-x)-1} &#92;&#92; &amp;&#92;propto&amp; p(&#92;theta &#92;mid &#92;alpha+x, &#92;beta+(1-x)). &#92;end{array} ' title='&#92;displaystyle &#92;begin{array}{rcl} p(&#92;theta &#92;mid X=x, &#92;alpha, &#92;beta) &amp;&#92;propto&amp; &#92;theta^x(1-&#92;theta)^{1-x} &#92;times &#92;theta^{&#92;alpha-1}(1-&#92;theta)^{&#92;beta-1} &#92;&#92; &amp;=&amp; &#92;theta^{&#92;alpha+x-1}(1-&#92;theta)^{&#92;beta+(1-x)-1} &#92;&#92; &amp;&#92;propto&amp; p(&#92;theta &#92;mid &#92;alpha+x, &#92;beta+(1-x)). &#92;end{array} ' class='latex' /></p>
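<p>In code, the Beta-Bernoulli update amounts to incrementing one of two counters per observation. The sketch below applies it to a short sequence of independent observations; the function name <code>beta_bernoulli_update</code> and the example prior are hypothetical.</p>

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Apply the update (alpha, beta) to (alpha + x, beta + 1 - x) per observation."""
    for x in observations:
        if x not in (0, 1):
            raise ValueError("Bernoulli observations must be 0 or 1")
        alpha, beta = alpha + x, beta + (1 - x)
    return alpha, beta

# Start from a Beta(2, 2) prior and observe three ones and one zero.
alpha_post, beta_post = beta_bernoulli_update(2.0, 2.0, [1, 0, 1, 1])
posterior_mean = alpha_post / (alpha_post + beta_post)   # mean of a beta distribution
print(alpha_post, beta_post, posterior_mean)
```

<p>Because each posterior is again a beta distribution, it can serve as the prior for the next observation, which is why the loop can process the data one point at a time.</p>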
<ul>
<li>(Gamma-Poisson) Let <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X%3Dk+%5Cmid+%5Clambda%29+%3D+%5Cfrac%7B%5Clambda%5Ek%7D%7Be%5E%7B%5Clambda%7Dk%21%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X=k &#92;mid &#92;lambda) = &#92;frac{&#92;lambda^k}{e^{&#92;lambda}k!}}' title='{p(X=k &#92;mid &#92;lambda) = &#92;frac{&#92;lambda^k}{e^{&#92;lambda}k!}}' class='latex' /> for <img src='http://s0.wp.com/latex.php?latex=%7Bk+%5Cin+%5Cmathbb%7BZ%7D_%7B%5Cgeq+0%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{k &#92;in &#92;mathbb{Z}_{&#92;geq 0}}' title='{k &#92;in &#92;mathbb{Z}_{&#92;geq 0}}' class='latex' />. Let <img src='http://s0.wp.com/latex.php?latex=%7Bp%28%5Clambda+%5Cmid+%5Calpha%2C+%5Cbeta%29+%5Cpropto+%5Clambda%5E%7B%5Calpha-1%7D%5Cexp%28-%5Cbeta+%5Clambda%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(&#92;lambda &#92;mid &#92;alpha, &#92;beta) &#92;propto &#92;lambda^{&#92;alpha-1}&#92;exp(-&#92;beta &#92;lambda)}' title='{p(&#92;lambda &#92;mid &#92;alpha, &#92;beta) &#92;propto &#92;lambda^{&#92;alpha-1}&#92;exp(-&#92;beta &#92;lambda)}' class='latex' />. As noted before, the distribution for <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> given <img src='http://s0.wp.com/latex.php?latex=%7B%5Clambda%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;lambda}' title='{&#92;lambda}' class='latex' /> is called a <em>Poisson distribution</em>; the distribution for <img src='http://s0.wp.com/latex.php?latex=%7B%5Clambda%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;lambda}' title='{&#92;lambda}' class='latex' /> given <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;alpha}' title='{&#92;alpha}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Cbeta%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;beta}' title='{&#92;beta}' class='latex' /> is called a <em>gamma distribution</em>. 
We can check that the family of gamma distributions is conjugate to the family of Poisson distributions. <em><b>Important note:</b></em> unlike in the last two examples, the normalization constant for the Poisson distribution actually depends on <img src='http://s0.wp.com/latex.php?latex=%7B%5Clambda%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;lambda}' title='{&#92;lambda}' class='latex' />, and so we need to include it in our calculations:
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+p%28%5Clambda+%5Cmid+X%3Dk%2C+%5Calpha%2C+%5Cbeta%29+%26%5Cpropto%26+%5Cfrac%7B%5Clambda%5Ek%7D%7Be%5E%7B%5Clambda%7Dk%21%7D+%5Ctimes+%5Clambda%5E%7B%5Calpha-1%7D%5Cexp%28-%5Cbeta%5Clambda%29+%5C%5C+%26%5Cpropto%26+%5Clambda%5E%7B%5Calpha%2Bk-1%7D%5Cexp%28-%28%5Cbeta%2B1%29%5Clambda%29+%5C%5C+%26%5Cpropto%26+p%28%5Clambda+%5Cmid+%5Calpha%2Bk%2C+%5Cbeta%2B1%29.+%5Cend%7Barray%7D+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;begin{array}{rcl} p(&#92;lambda &#92;mid X=k, &#92;alpha, &#92;beta) &amp;&#92;propto&amp; &#92;frac{&#92;lambda^k}{e^{&#92;lambda}k!} &#92;times &#92;lambda^{&#92;alpha-1}&#92;exp(-&#92;beta&#92;lambda) &#92;&#92; &amp;&#92;propto&amp; &#92;lambda^{&#92;alpha+k-1}&#92;exp(-(&#92;beta+1)&#92;lambda) &#92;&#92; &amp;&#92;propto&amp; p(&#92;lambda &#92;mid &#92;alpha+k, &#92;beta+1). &#92;end{array} ' title='&#92;displaystyle &#92;begin{array}{rcl} p(&#92;lambda &#92;mid X=k, &#92;alpha, &#92;beta) &amp;&#92;propto&amp; &#92;frac{&#92;lambda^k}{e^{&#92;lambda}k!} &#92;times &#92;lambda^{&#92;alpha-1}&#92;exp(-&#92;beta&#92;lambda) &#92;&#92; &amp;&#92;propto&amp; &#92;lambda^{&#92;alpha+k-1}&#92;exp(-(&#92;beta+1)&#92;lambda) &#92;&#92; &amp;&#92;propto&amp; p(&#92;lambda &#92;mid &#92;alpha+k, &#92;beta+1). &#92;end{array} ' class='latex' /></p>
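<p>As a quick numerical sanity check of this update (a minimal sketch; the particular values of alpha, beta, and k are illustrative assumptions, not taken from the derivation), we can verify that the likelihood times the prior differs from the unnormalized gamma density with parameters (alpha + k, beta + 1) only by a factor that does not depend on lambda:</p>

```python
import math

def gamma_unnormalized(lam, alpha, beta):
    """Unnormalized gamma density: lam^(alpha-1) * exp(-beta*lam)."""
    return lam ** (alpha - 1) * math.exp(-beta * lam)

def poisson_pmf(k, lam):
    """Poisson probability of observing k given rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def posterior_ratio(lam, k, alpha, beta):
    """(likelihood * prior) divided by the claimed posterior shape."""
    numerator = poisson_pmf(k, lam) * gamma_unnormalized(lam, alpha, beta)
    return numerator / gamma_unnormalized(lam, alpha + k, beta + 1)

# Illustrative values only (assumptions for this sketch).
alpha, beta, k = 2.0, 3.0, 4
ratios = [posterior_ratio(lam, k, alpha, beta) for lam in (0.5, 1.0, 2.5, 7.0)]
```

<p>Every entry of <code>ratios</code> should equal 1/k!, which is constant in lambda and is therefore absorbed into the normalization constant.</p>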
<p>Note that, in general, a family of distributions will always have some conjugate family, as if nothing else the family of all probability distributions over <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> will be a conjugate family. What we really care about is a conjugate family that itself has nice properties, such as tractably computable moments.</p>
<p>Conjugate priors have a very nice relationship to exponential families, established in the following theorem:</p>
<blockquote><p><b>Theorem 7</b> <em> <a name="thmconjugate"></a> Let <img src='http://s0.wp.com/latex.php?latex=%7Bp%28x+%5Cmid+%5Ctheta%29+%3D+h%28x%29%5Cexp%28%5Ctheta%5ET%5Cphi%28x%29-A%28%5Ctheta%29%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(x &#92;mid &#92;theta) = h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta))}' title='{p(x &#92;mid &#92;theta) = h(x)&#92;exp(&#92;theta^T&#92;phi(x)-A(&#92;theta))}' class='latex' /> be an exponential family. Then <img src='http://s0.wp.com/latex.php?latex=%7Bp%28%5Ctheta+%5Cmid+%5Ceta%2C+%5Ckappa%29+%5Cpropto+h_2%28%5Ctheta%29%5Cexp%5Cleft%28%5Ceta%5ET%5Ctheta-%5Ckappa+A%28%5Ctheta%29%5Cright%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(&#92;theta &#92;mid &#92;eta, &#92;kappa) &#92;propto h_2(&#92;theta)&#92;exp&#92;left(&#92;eta^T&#92;theta-&#92;kappa A(&#92;theta)&#92;right)}' title='{p(&#92;theta &#92;mid &#92;eta, &#92;kappa) &#92;propto h_2(&#92;theta)&#92;exp&#92;left(&#92;eta^T&#92;theta-&#92;kappa A(&#92;theta)&#92;right)}' class='latex' /> is a conjugate prior for <img src='http://s0.wp.com/latex.php?latex=%7Bx+%5Cmid+%5Ctheta%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x &#92;mid &#92;theta}' title='{x &#92;mid &#92;theta}' class='latex' /> for any choice of <img src='http://s0.wp.com/latex.php?latex=%7Bh_2%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{h_2}' title='{h_2}' class='latex' />. The update formula is <img src='http://s0.wp.com/latex.php?latex=%7Bp%28%5Ctheta+%5Cmid+x%2C+%5Ceta%2C+%5Ckappa%29+%3D+p%28%5Ctheta+%5Cmid+%5Ceta%2B%5Cphi%28x%29%2C+%5Ckappa%2B1%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(&#92;theta &#92;mid x, &#92;eta, &#92;kappa) = p(&#92;theta &#92;mid &#92;eta+&#92;phi(x), &#92;kappa+1)}' title='{p(&#92;theta &#92;mid x, &#92;eta, &#92;kappa) = p(&#92;theta &#92;mid &#92;eta+&#92;phi(x), &#92;kappa+1)}' class='latex' />. 
Furthermore, <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%5Cmid+%5Ceta%2C+%5Ckappa%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta &#92;mid &#92;eta, &#92;kappa}' title='{&#92;theta &#92;mid &#92;eta, &#92;kappa}' class='latex' /> is itself an exponential family, with sufficient statistics <img src='http://s0.wp.com/latex.php?latex=%7B%5B%5Ctheta%3B+A%28%5Ctheta%29%5D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{[&#92;theta; A(&#92;theta)]}' title='{[&#92;theta; A(&#92;theta)]}' class='latex' />. </em></p></blockquote>
<p>Checking the theorem is a matter of straightforward algebra, so I will leave the proof as an exercise to the reader. Note that, as before, there is no guarantee that <img src='http://s0.wp.com/latex.php?latex=%7Bp%28%5Ctheta+%5Cmid+%5Ceta%2C+%5Ckappa%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(&#92;theta &#92;mid &#92;eta, &#92;kappa)}' title='{p(&#92;theta &#92;mid &#92;eta, &#92;kappa)}' class='latex' /> will be tractable; however, in many cases the conjugate prior given by Theorem <a href="#thmconjugate">7</a> is a well-behaved family. See <a href="http://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions">this Wikipedia page</a> for examples of conjugate priors, many of which correspond to exponential family distributions.</p>
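<p>For readers who would rather see the update formula in action than work through the algebra, here is a small numerical sketch. It assumes the Bernoulli family written in natural form, with phi(x) = x and A(theta) = log(1 + exp(theta)), takes h_2 = 1, and normalizes on a finite grid of theta values; the grid and hyperparameters are illustrative choices only.</p>

```python
import math

def log_partition(theta):
    """A(theta) for the Bernoulli family p(x | theta) = exp(theta*x - A(theta))."""
    return math.log(1.0 + math.exp(theta))

def likelihood(x, theta):
    """Bernoulli likelihood in natural-parameter form, with phi(x) = x."""
    return math.exp(theta * x - log_partition(theta))

def prior_unnormalized(theta, eta, kappa):
    """Conjugate prior from the theorem, taking h_2 = 1."""
    return math.exp(eta * theta - kappa * log_partition(theta))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative grid and hyperparameters (assumptions for this sketch).
grid = [-4.0 + 0.1 * i for i in range(81)]
eta, kappa, x = 1.5, 4.0, 1

posterior = normalize([likelihood(x, t) * prior_unnormalized(t, eta, kappa) for t in grid])
updated_prior = normalize([prior_unnormalized(t, eta + x, kappa + 1) for t in grid])
max_gap = max(abs(p - q) for p, q in zip(posterior, updated_prior))
```

<p>Here <code>max_gap</code> should be zero up to floating-point error, since multiplying by the likelihood simply adds phi(x) to eta and 1 to kappa.</p>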
<p><b>5. Maximum Entropy and Duality </b></p>
<p>The final property of exponential families I would like to establish is a certain <em>duality property</em>. What I mean by this is that exponential families can be thought of as the maximum entropy distributions subject to a constraint on the expected value of their sufficient statistics. For those unfamiliar with the term, the <em>entropy</em> of a distribution over <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> with density <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X)}' title='{p(X)}' class='latex' /> is <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathbb%7BE%7D%5B-%5Clog+p%28X%29%5D+%3A%3D+-%5Cint+p%28x%29%5Clog%28p%28x%29%29+dx%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathbb{E}[-&#92;log p(X)] := -&#92;int p(x)&#92;log(p(x)) dx}' title='{&#92;mathbb{E}[-&#92;log p(X)] := -&#92;int p(x)&#92;log(p(x)) dx}' class='latex' />. Intuitively, higher entropy corresponds to higher uncertainty, so a maximum entropy distribution is one specifying as much uncertainty as possible given a certain set of information (such as the values of various moments). This makes such distributions appealing, at least in theory, from a modeling perspective, since they &#8220;encode exactly as much information as is given and no more&#8221;. (Caveat: this intuition isn&#8217;t entirely valid, and in practice maximum-entropy distributions aren&#8217;t always appropriate.)</p>
<p>In any case, the duality property is captured in the following theorem:</p>
<blockquote><p><b>Theorem 8</b> <em> <a name="thmmax-ent"></a> The distribution over <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> with maximum entropy such that <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathbb%7BE%7D%5B%5Cphi%28X%29%5D+%3D+T%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathbb{E}[&#92;phi(X)] = T}' title='{&#92;mathbb{E}[&#92;phi(X)] = T}' class='latex' /> lies in the exponential family with sufficient statistic <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(X)}' title='{&#92;phi(X)}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7Bh%28X%29+%3D+1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{h(X) = 1}' title='{h(X) = 1}' class='latex' />. </em></p></blockquote>
<p>Proving this fully rigorously requires the calculus of variations; I will instead give the &#8220;physicist&#8217;s proof&#8221;. <em>Proof:</em> Let <img src='http://s0.wp.com/latex.php?latex=%7Bp%28X%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(X)}' title='{p(X)}' class='latex' /> be the density for <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />. Then we can view <img src='http://s0.wp.com/latex.php?latex=%7Bp%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p}' title='{p}' class='latex' /> as the solution to the constrained maximization problem:</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+%5Cmathrm%7Bmaximize%7D+%26%26+-%5Cint+p%28X%29+%5Clog+p%28X%29+dX+%5C%5C+%5Cmathrm%7Bsubject+%5C+to%7D+%26%26+%5Cint+p%28X%29+dX+%3D+1+%5C%5C+%26%26+%5Cint+p%28X%29+%5Cphi%28X%29+dX+%3D+T.+%5Cend%7Barray%7D+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;begin{array}{rcl} &#92;mathrm{maximize} &amp;&amp; -&#92;int p(X) &#92;log p(X) dX &#92;&#92; &#92;mathrm{subject &#92; to} &amp;&amp; &#92;int p(X) dX = 1 &#92;&#92; &amp;&amp; &#92;int p(X) &#92;phi(X) dX = T. &#92;end{array} ' title='&#92;displaystyle &#92;begin{array}{rcl} &#92;mathrm{maximize} &amp;&amp; -&#92;int p(X) &#92;log p(X) dX &#92;&#92; &#92;mathrm{subject &#92; to} &amp;&amp; &#92;int p(X) dX = 1 &#92;&#92; &amp;&amp; &#92;int p(X) &#92;phi(X) dX = T. &#92;end{array} ' class='latex' /></p>
<p>By the method of Lagrange multipliers, there exist <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;alpha}' title='{&#92;alpha}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Clambda%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;lambda}' title='{&#92;lambda}' class='latex' /> such that</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cfrac%7Bd%7D%7Bdp%7D%5Cleft%28-%5Cint+p%28X%29%5Clog+p%28X%29+dX+-+%5Calpha+%5B%5Cint+p%28X%29+dX-1%5D+-+%5Clambda%5ET%5B%5Cint+%5Cphi%28X%29+p%28X%29+dX-T%5D%5Cright%29+%3D+0.+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;frac{d}{dp}&#92;left(-&#92;int p(X)&#92;log p(X) dX - &#92;alpha [&#92;int p(X) dX-1] - &#92;lambda^T[&#92;int &#92;phi(X) p(X) dX-T]&#92;right) = 0. ' title='&#92;displaystyle &#92;frac{d}{dp}&#92;left(-&#92;int p(X)&#92;log p(X) dX - &#92;alpha [&#92;int p(X) dX-1] - &#92;lambda^T[&#92;int &#92;phi(X) p(X) dX-T]&#92;right) = 0. ' class='latex' /></p>
<p>This simplifies to:</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+-%5Clog+p%28X%29+-+1+-+%5Calpha+-%5Clambda%5ET+%5Cphi%28X%29+%3D+0%2C+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle -&#92;log p(X) - 1 - &#92;alpha -&#92;lambda^T &#92;phi(X) = 0, ' title='&#92;displaystyle -&#92;log p(X) - 1 - &#92;alpha -&#92;lambda^T &#92;phi(X) = 0, ' class='latex' /></p>
<p>which implies</p>
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28X%29+%3D+%5Cexp%28-1-%5Calpha-%5Clambda%5ET%5Cphi%28X%29%29+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle p(X) = &#92;exp(-1-&#92;alpha-&#92;lambda^T&#92;phi(X)) ' title='&#92;displaystyle p(X) = &#92;exp(-1-&#92;alpha-&#92;lambda^T&#92;phi(X)) ' class='latex' /></p>
<p>for some <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;alpha}' title='{&#92;alpha}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Clambda%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;lambda}' title='{&#92;lambda}' class='latex' />. In particular, if we let <img src='http://s0.wp.com/latex.php?latex=%7B%5Clambda+%3D+-%5Ctheta%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;lambda = -&#92;theta}' title='{&#92;lambda = -&#92;theta}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha+%3D+A%28%5Ctheta%29-1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;alpha = A(&#92;theta)-1}' title='{&#92;alpha = A(&#92;theta)-1}' class='latex' />, then we recover the exponential family with <img src='http://s0.wp.com/latex.php?latex=%7Bh%28X%29+%3D+1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{h(X) = 1}' title='{h(X) = 1}' class='latex' />, as claimed. <img src='http://s0.wp.com/latex.php?latex=%5CBox&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;Box' title='&#92;Box' class='latex' /></p>
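<p>The theorem can be illustrated numerically on a small discrete example. The sketch below makes several assumptions that are not part of the proof: the support is the finite set {0, ..., 5} (so sums replace integrals), the sufficient statistic is phi(x) = x, and the target mean T = 1.5 is arbitrary. It fits the member of the exponential family with that mean by bisection, then compares its entropy against two other distributions satisfying the same constraint.</p>

```python
import math

SUPPORT = range(6)  # illustrative finite support {0, ..., 5}

def exp_family(theta):
    """p(x) proportional to exp(theta * x) on SUPPORT, i.e. phi(x) = x, h(x) = 1."""
    weights = [math.exp(theta * x) for x in SUPPORT]
    total = sum(weights)
    return [w / total for w in weights]

def mean(p):
    return sum(x * px for x, px in zip(SUPPORT, p))

def entropy(p):
    """Discrete entropy, skipping zero-probability outcomes."""
    return -sum(px * math.log(px) for px in p if px > 0)

def fit_theta(target, lo=-20.0, hi=20.0, steps=200):
    """Bisection on theta; the mean of exp_family(theta) is increasing in theta."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if mean(exp_family(mid)) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

target = 1.5
max_ent = exp_family(fit_theta(target))

# Two other distributions on SUPPORT with the same mean of 1.5.
uniform_on_first_four = [0.25, 0.25, 0.25, 0.25, 0.0, 0.0]
two_point = [0.0, 0.5, 0.5, 0.0, 0.0, 0.0]
```

<p>Comparing <code>entropy(max_ent)</code> with the entropies of the two alternatives should show the exponential-family member coming out ahead, as the theorem predicts.</p>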
<p><b>6. Conclusion </b></p>
<p>Hopefully I have by now convinced you that exponential families have many nice properties: they have conjugate priors, simple-to-fit parameters, and easily-computed moments. While exponential families aren&#8217;t always appropriate models for a given situation, their tractability makes them the model of choice when no other information is present; and, since they can be obtained as maximum-entropy families, they are actually appropriate models in a wide range of circumstances.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2012/12/21/exponential-families/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>Algebra trick of the day</title>
		<link>http://jsteinhardt.wordpress.com/2012/12/17/algebra-trick-of-the-day/</link>
		<comments>http://jsteinhardt.wordpress.com/2012/12/17/algebra-trick-of-the-day/#comments</comments>
		<pubDate>Mon, 17 Dec 2012 04:43:19 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Tricks]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=435</guid>
		<description><![CDATA[I&#8217;ve decided to start recording algebra tricks as I end up using them. Today I actually have two tricks, but they end up being used together a lot. I don&#8217;t know if they have more formal names, but I call them the &#8220;trace trick&#8221; and the &#8220;rank 1 relaxation&#8221;. Suppose that we want to maximize [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=435&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve decided to start recording algebra tricks as I end up using them. Today I actually have two tricks, but they end up being used together a lot. I don&#8217;t know if they have more formal names, but I call them the &#8220;trace trick&#8221; and the &#8220;rank 1 relaxation&#8221;.</p>
<p>Suppose that we want to maximize the <a href="http://en.wikipedia.org/wiki/Rayleigh_quotient">Rayleigh quotient</a> <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7Bx%5ETAx%7D%7Bx%5ETx%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{x^TAx}{x^Tx}' title='&#92;frac{x^TAx}{x^Tx}' class='latex' /> of a matrix <img src='http://s0.wp.com/latex.php?latex=A&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='A' title='A' class='latex' />. There are many reasons we might want to do this; for instance, if <img src='http://s0.wp.com/latex.php?latex=A&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='A' title='A' class='latex' /> is symmetric then the maximum corresponds to the largest eigenvalue. There are also many ways to do this, and the one that I&#8217;m about to describe is definitely not the most efficient, but it has the advantage of being flexible, in that it easily generalizes to constrained maximizations, etc.</p>
<p>The first observation is that <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7Bx%5ETAx%7D%7Bx%5ETx%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;frac{x^TAx}{x^Tx}' title='&#92;frac{x^TAx}{x^Tx}' class='latex' /> is homogeneous, meaning that scaling <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x' title='x' class='latex' /> doesn&#8217;t affect the result. So, we can assume without loss of generality that <img src='http://s0.wp.com/latex.php?latex=x%5ETx+%3D+1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x^Tx = 1' title='x^Tx = 1' class='latex' />, and we end up with the optimization problem:</p>
<p style="text-align:center;">maximize <img src='http://s0.wp.com/latex.php?latex=x%5ETAx&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x^TAx' title='x^TAx' class='latex' /></p>
<p style="text-align:center;">subject to <img src='http://s0.wp.com/latex.php?latex=x%5ETx+%3D+1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x^Tx = 1' title='x^Tx = 1' class='latex' /></p>
<p>This is where the trace trick comes in. Recall that the trace of a matrix is the sum of its diagonal entries. We are going to use two facts: first, the trace of a number is just the number itself. Second, trace(AB) = trace(BA). (Note, however, that trace(ABC) is <em>not</em> in general equal to trace(BAC), although trace(ABC) <em>is</em> equal to trace(CAB).) We use these two properties as follows &#8212; first, we re-write the optimization problem as:</p>
<p style="text-align:center;">maximize <img src='http://s0.wp.com/latex.php?latex=Trace%28x%5ETAx%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(x^TAx)' title='Trace(x^TAx)' class='latex' /></p>
<p style="text-align:center;">subject to <img src='http://s0.wp.com/latex.php?latex=Trace%28x%5ETx%29+%3D+1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(x^Tx) = 1' title='Trace(x^Tx) = 1' class='latex' /></p>
<p>Second, we re-write it again using the invariance of trace under cyclic permutations:</p>
<p style="text-align:center;">maximize <img src='http://s0.wp.com/latex.php?latex=Trace%28Axx%5ET%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(Axx^T)' title='Trace(Axx^T)' class='latex' /></p>
<p style="text-align:center;">subject to <img src='http://s0.wp.com/latex.php?latex=Trace%28xx%5ET%29+%3D+1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(xx^T) = 1' title='Trace(xx^T) = 1' class='latex' /></p>
<p>Now we make the substitution <img src='http://s0.wp.com/latex.php?latex=X+%3D+xx%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X = xx^T' title='X = xx^T' class='latex' />:</p>
<p style="text-align:center;">maximize <img src='http://s0.wp.com/latex.php?latex=Trace%28AX%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(AX)' title='Trace(AX)' class='latex' /></p>
<p style="text-align:center;">subject to <img src='http://s0.wp.com/latex.php?latex=Trace%28X%29+%3D+1%2C+X+%3D+xx%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(X) = 1, X = xx^T' title='Trace(X) = 1, X = xx^T' class='latex' /></p>
<p>Finally, note that a matrix <img src='http://s0.wp.com/latex.php?latex=X&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X' title='X' class='latex' /> can be written as <img src='http://s0.wp.com/latex.php?latex=xx%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='xx^T' title='xx^T' class='latex' /> if and only if <img src='http://s0.wp.com/latex.php?latex=X&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X' title='X' class='latex' /> is positive semi-definite and has rank 1. Therefore, we can further write this as</p>
<p style="text-align:center;">maximize <img src='http://s0.wp.com/latex.php?latex=Trace%28AX%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(AX)' title='Trace(AX)' class='latex' /></p>
<p style="text-align:center;">subject to <img src='http://s0.wp.com/latex.php?latex=Trace%28X%29+%3D+1%2C+Rank%28X%29+%3D+1%2C+X+%5Csucceq+0&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(X) = 1, Rank(X) = 1, X &#92;succeq 0' title='Trace(X) = 1, Rank(X) = 1, X &#92;succeq 0' class='latex' /></p>
<p style="text-align:left;">Aside from the rank 1 constraint, this would be a <a href="http://en.wikipedia.org/wiki/Semidefinite_programming#Equivalent_formulations">semidefinite program</a>, a type of problem that can be solved efficiently. What happens if we drop the rank 1 constraint? Then I claim that the solution to this program would be the same as if I had kept the constraint in! Why is this? Let&#8217;s look at the eigendecomposition of <img src='http://s0.wp.com/latex.php?latex=X&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X' title='X' class='latex' />, written as <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5En+%5Clambda_i+x_ix_i%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^n &#92;lambda_i x_ix_i^T' title='&#92;sum_{i=1}^n &#92;lambda_i x_ix_i^T' class='latex' />, with <img src='http://s0.wp.com/latex.php?latex=%5Clambda_i+%5Cgeq+0&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;lambda_i &#92;geq 0' title='&#92;lambda_i &#92;geq 0' class='latex' /> (by positive semidefiniteness) and <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5En+%5Clambda_i+%3D+1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^n &#92;lambda_i = 1' title='&#92;sum_{i=1}^n &#92;lambda_i = 1' class='latex' /> (by the trace constraint). Let&#8217;s also look at <img src='http://s0.wp.com/latex.php?latex=Trace%28AX%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(AX)' title='Trace(AX)' class='latex' />, which can be written as <img src='http://s0.wp.com/latex.php?latex=%5Csum_%7Bi%3D1%7D%5En+%5Clambda_i+Trace%28Ax_ix_i%5ET%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;sum_{i=1}^n &#92;lambda_i Trace(Ax_ix_i^T)' title='&#92;sum_{i=1}^n &#92;lambda_i Trace(Ax_ix_i^T)' class='latex' />. 
Since <img src='http://s0.wp.com/latex.php?latex=Trace%28AX%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(AX)' title='Trace(AX)' class='latex' /> is just a convex combination of the <img src='http://s0.wp.com/latex.php?latex=Trace%28Ax_ix_i%5ET%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(Ax_ix_i^T)' title='Trace(Ax_ix_i^T)' class='latex' />, we might as well have just picked <img src='http://s0.wp.com/latex.php?latex=X&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X' title='X' class='latex' /> to be <img src='http://s0.wp.com/latex.php?latex=x_ix_i%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x_ix_i^T' title='x_ix_i^T' class='latex' />, where <img src='http://s0.wp.com/latex.php?latex=i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='i' title='i' class='latex' /> is chosen to maximize <img src='http://s0.wp.com/latex.php?latex=Trace%28Ax_ix_i%5ET%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(Ax_ix_i^T)' title='Trace(Ax_ix_i^T)' class='latex' />. If we set that <img src='http://s0.wp.com/latex.php?latex=%5Clambda_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;lambda_i' title='&#92;lambda_i' class='latex' /> to 1 and all the rest to 0, then we maintain all of the constraints while increasing <img src='http://s0.wp.com/latex.php?latex=Trace%28AX%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(AX)' title='Trace(AX)' class='latex' />, meaning that we couldn&#8217;t have been at the optimum value of <img src='http://s0.wp.com/latex.php?latex=X&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X' title='X' class='latex' /> unless <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='n' title='n' class='latex' /> was equal to 1. What we have shown, then, is that the rank of <img src='http://s0.wp.com/latex.php?latex=X&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X' title='X' class='latex' /> must be 1, so that the rank 1 constraint was unnecessary.</p>
<p style="text-align:left;">Technically, <img src='http://s0.wp.com/latex.php?latex=X&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X' title='X' class='latex' /> could be a linear combination of rank 1 matrices that all have the same value of <img src='http://s0.wp.com/latex.php?latex=Trace%28AX%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='Trace(AX)' title='Trace(AX)' class='latex' />, but in that case we could just pick any one of those matrices. So what I have really shown is that <em>at least one</em> optimal point has rank 1, and we can recover such a point from any solution, even if the original solution was not rank 1.</p>
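<p style="text-align:left;">To make the argument concrete, here is a small sketch on a 2-by-2 example. The matrix A, the angles, and the mixing weights 0.6 and 0.4 are all illustrative assumptions. It checks the trace trick directly, scans unit vectors to find the best rank 1 feasible point, and evaluates the objective at a rank 2 feasible point (positive semidefinite with trace 1) for comparison with the largest eigenvalue.</p>

```python
import math

A = [[2.0, 1.0], [1.0, 3.0]]  # illustrative symmetric matrix

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def trace(M):
    return M[0][0] + M[1][1]

def outer(x):
    """The rank 1 matrix x x^T."""
    return [[x[0] * x[0], x[0] * x[1]], [x[1] * x[0], x[1] * x[1]]]

def unit(angle):
    return [math.cos(angle), math.sin(angle)]

def quadratic_form(x):
    """x^T A x computed directly."""
    return sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

# Trace trick: x^T A x equals Trace(A x x^T).
x = unit(0.7)
direct, lifted = quadratic_form(x), trace(matmul(A, outer(x)))

# Largest eigenvalue of a symmetric 2x2 matrix in closed form.
half_tr = (A[0][0] + A[1][1]) / 2
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
lambda_max = half_tr + math.sqrt(half_tr ** 2 - det)

# Best rank 1 feasible point found by scanning unit vectors.
best_rank_one = max(quadratic_form(unit(math.pi * i / 10000)) for i in range(10000))

# A rank 2 feasible point: PSD with trace 1, mixing two orthonormal directions.
u, v = unit(0.3), unit(0.3 + math.pi / 2)
X_mixed = [[0.6 * a + 0.4 * b for a, b in zip(ru, rv)] for ru, rv in zip(outer(u), outer(v))]
mixed_value = trace(matmul(A, X_mixed))
```

<p style="text-align:left;">The scan over rank 1 matrices should land on the largest eigenvalue (up to the grid resolution), while the rank 2 point, being a convex combination of two rank 1 values, cannot exceed it.</p>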
<p style="text-align:left;">Here is a problem that uses a similar trick. Suppose we want to find <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x' title='x' class='latex' /> that simultaneously satisfies the equations:</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=b_i+%3D+%7Ca_i%5ETx%7C%5E2&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='b_i = |a_i^Tx|^2' title='b_i = |a_i^Tx|^2' class='latex' /></p>
<p style="text-align:left;">for each <img src='http://s0.wp.com/latex.php?latex=i+%3D+1%2C%5Cldots%2Cn&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='i = 1,&#92;ldots,n' title='i = 1,&#92;ldots,n' class='latex' /> (this example was inspired by the recent NIPS paper by <a href="http://www.eecs.berkeley.edu/~yang/paper/nips2012.pdf">Ohlsson, Yang, Dong, and Sastry</a>, although the idea itself goes back at least to <a href="http://arxiv.org/abs/1109.4499">Candes, Strohmer, and Voroninski</a>). Note that this is basically equivalent to solving a system of linear equations where we only know each equation up to a sign (or a phase, in the complex case). Therefore, in general, this problem will not have a unique solution. To ensure the solution is unique, let us assume the very strong condition that whenever <img src='http://s0.wp.com/latex.php?latex=a_i%5ETVa_i+%3D+0&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='a_i^TVa_i = 0' title='a_i^TVa_i = 0' class='latex' /> for all <img src='http://s0.wp.com/latex.php?latex=i+%3D+1%2C%5Cldots%2Cn&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='i = 1,&#92;ldots,n' title='i = 1,&#92;ldots,n' class='latex' />, the matrix <img src='http://s0.wp.com/latex.php?latex=V&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='V' title='V' class='latex' /> must itself be zero (note: Candes et al. get away with a much weaker condition). Given this, can we phrase the problem as a semidefinite program? I highly recommend trying to solve this problem on your own, or at least reducing it to a rank-constrained SDP, so I&#8217;ll include the solution below a fold.</p>
<p style="text-align:left;"><span id="more-435"></span></p>
<p style="text-align:left;"><strong>Solution.</strong> We can, as before, re-write the equations as:</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=b_i+%3D+Trace%28a_ia_i%5ETxx%5ET%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='b_i = Trace(a_ia_i^Txx^T)' title='b_i = Trace(a_ia_i^Txx^T)' class='latex' /></p>
<p style="text-align:left;">and further write this as</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=b_i+%3D+Trace%28a_ia_i%5ETX%29%2C+X+%5Csucceq+0%2C+rank%28X%29+%3D+1&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='b_i = Trace(a_ia_i^TX), X &#92;succeq 0, rank(X) = 1' title='b_i = Trace(a_ia_i^TX), X &#92;succeq 0, rank(X) = 1' class='latex' /></p>
<p style="text-align:left;">As before, drop the rank 1 constraint and let <img src='http://s0.wp.com/latex.php?latex=X+%3D+%5Csum_%7Bj%3D1%7D%5Em+%5Clambda_j+x_jx_j%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X = &#92;sum_{j=1}^m &#92;lambda_j x_jx_j^T' title='X = &#92;sum_{j=1}^m &#92;lambda_j x_jx_j^T' class='latex' />. Then we get:</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=b_i+%3D+%5Csum_%7Bj%3D1%7D%5Em+Trace%28a_ia_i%5ETx_jx_j%5ET%29%5Clambda_j&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='b_i = &#92;sum_{j=1}^m Trace(a_ia_i^Tx_jx_j^T)&#92;lambda_j' title='b_i = &#92;sum_{j=1}^m Trace(a_ia_i^Tx_jx_j^T)&#92;lambda_j' class='latex' />,</p>
<p style="text-align:left;">which we can re-write as <img src='http://s0.wp.com/latex.php?latex=b_i+%3D+a_i%5ET%5Cleft%28%5Csum_%7Bj%3D1%7D%5Em+%5Clambda_jx_jx_j%5ET%5Cright%29a_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='b_i = a_i^T&#92;left(&#92;sum_{j=1}^m &#92;lambda_jx_jx_j^T&#92;right)a_i' title='b_i = a_i^T&#92;left(&#92;sum_{j=1}^m &#92;lambda_jx_jx_j^T&#92;right)a_i' class='latex' />. But if <img src='http://s0.wp.com/latex.php?latex=x%5E%2A&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x^*' title='x^*' class='latex' /> is the true solution, then we also know that <img src='http://s0.wp.com/latex.php?latex=b_i+%3D+a_i%5ETx%5E%2A%28x%5E%2A%29%5ETa_i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='b_i = a_i^Tx^*(x^*)^Ta_i' title='b_i = a_i^Tx^*(x^*)^Ta_i' class='latex' />, so that <img src='http://s0.wp.com/latex.php?latex=a_i%5ET%5Cleft%28-x%5E%2A%28x%5E%2A%29%5ET%2B%5Csum_%7Bj%3D1%7D%5Em+%5Clambda_jx_jx_j%5ET%5Cright%29a_i+%3D+0&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='a_i^T&#92;left(-x^*(x^*)^T+&#92;sum_{j=1}^m &#92;lambda_jx_jx_j^T&#92;right)a_i = 0' title='a_i^T&#92;left(-x^*(x^*)^T+&#92;sum_{j=1}^m &#92;lambda_jx_jx_j^T&#92;right)a_i = 0' class='latex' /> for all <img src='http://s0.wp.com/latex.php?latex=i&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='i' title='i' class='latex' />. By the non-degeneracy assumption, this implies that</p>
<p style="text-align:center;"><img src='http://s0.wp.com/latex.php?latex=x%5E%2A%28x%5E%2A%29%5ET+%3D+%5Csum_%7Bj%3D1%7D%5Em+%5Clambda_jx_jx_j%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='x^*(x^*)^T = &#92;sum_{j=1}^m &#92;lambda_jx_jx_j^T' title='x^*(x^*)^T = &#92;sum_{j=1}^m &#92;lambda_jx_jx_j^T' class='latex' />,</p>
<p style="text-align:left;">so in particular <img src='http://s0.wp.com/latex.php?latex=X+%3D+x%5E%2A%28x%5E%2A%29%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X = x^*(x^*)^T' title='X = x^*(x^*)^T' class='latex' />. Therefore, <img src='http://s0.wp.com/latex.php?latex=X+%3D+x%5E%2A%28x%5E%2A%29%5ET&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='X = x^*(x^*)^T' title='X = x^*(x^*)^T' class='latex' /> is the only solution to the semidefinite program even after dropping the rank constraint.</p>
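<p style="text-align:left;">The lifting step in this solution can be checked on a toy instance. In the sketch below the measurement vectors and the signal are hypothetical values chosen for illustration. It confirms that the quadratic measurements are linear in the lifted variable X, and that flipping the sign of x leaves X unchanged, which is why the sign ambiguity disappears after lifting.</p>

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def outer(u, v):
    """The matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def trace_of_product(P, Q):
    """Trace(PQ) for square matrices, without forming the product."""
    n = len(P)
    return sum(P[i][k] * Q[k][i] for i in range(n) for k in range(n))

# Hypothetical measurement vectors and signal (assumptions for this sketch).
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_true = [2.0, -1.0]
x_flipped = [-c for c in x_true]

b = [dot(a_i, x_true) ** 2 for a_i in a]   # b_i = |a_i^T x|^2
X = outer(x_true, x_true)                   # lifted variable X = x x^T
b_lifted = [trace_of_product(outer(a_i, a_i), X) for a_i in a]
```

<p style="text-align:left;">For these three measurement vectors, a symmetric 2-by-2 matrix V with all three quadratic forms equal to zero must be zero (its two diagonal entries vanish, and then so does the off-diagonal entry), so the example satisfies the non-degeneracy condition when V is restricted to symmetric matrices.</p>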
]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2012/12/17/algebra-trick-of-the-day/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
		<item>
		<title>Log-Linear Models</title>
		<link>http://jsteinhardt.wordpress.com/2012/12/06/log-linear-models/</link>
		<comments>http://jsteinhardt.wordpress.com/2012/12/06/log-linear-models/#comments</comments>
		<pubDate>Thu, 06 Dec 2012 19:46:23 +0000</pubDate>
		<dc:creator>jsteinhardt</dc:creator>
				<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://jsteinhardt.wordpress.com/?p=425</guid>
		<description><![CDATA[I&#8217;ve spent most of my research career trying to build big, complex nonparametric models; however, I&#8217;ve more recently delved into the realm of natural language processing, where how awesome your model looks on paper is irrelevant compared to how well it models your data. In the spirit of this new work (and to lay the [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=jsteinhardt.wordpress.com&#038;blog=8824138&#038;post=425&#038;subd=jsteinhardt&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve spent most of my research career trying to build big, complex <a href="http://jmlr.csail.mit.edu/proceedings/papers/v22/steinhardt12/steinhardt12.pdf">nonparametric models</a>; however, I&#8217;ve more recently delved into the realm of natural language processing, where how awesome your model looks on paper is irrelevant compared to how well it models your data. In the spirit of this new work (and to lay the groundwork for a later post on NLP), I&#8217;d like to go over a family of models that I think is often overlooked due to not being terribly sexy (or at least, I overlooked it for a good while). This family is the family of log-linear models, which are models of the form:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28x+%5Cmid+%5Ctheta%29+%5Cpropto+e%5E%7B%5Cphi%28x%29%5ET%5Ctheta%7D%2C+&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle p(x &#92;mid &#92;theta) &#92;propto e^{&#92;phi(x)^T&#92;theta}, ' title='&#92;displaystyle p(x &#92;mid &#92;theta) &#92;propto e^{&#92;phi(x)^T&#92;theta}, ' class='latex' /></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi}' title='{&#92;phi}' class='latex' /> maps a data point to a feature vector; they are called log-linear because the log of the probability is a linear function of <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x)}' title='{&#92;phi(x)}' class='latex' />. We refer to <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29%5ET%5Ctheta%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x)^T&#92;theta}' title='{&#92;phi(x)^T&#92;theta}' class='latex' /> as the <em>score</em> of <img src='http://s0.wp.com/latex.php?latex=%7Bx%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x}' title='{x}' class='latex' />.</p>
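<p>To make the definition concrete, here is a minimal sketch for a finite space. The three-point space, the two binary features, the parameter values, and the helper name <code>log_linear_probs</code> are all hypothetical choices made for illustration; the only thing taken from the definition above is that the probability is proportional to the exponentiated score.</p>

```python
import math

def log_linear_probs(features, theta):
    """Return p(x | theta) for every x in a finite space.

    features: dict mapping each point x to its feature vector phi(x)
    theta:    parameter vector with the same length as phi(x)
    """
    # Score of x is the inner product phi(x)^T theta.
    scores = {x: sum(f * t for f, t in zip(phi, theta)) for x, phi in features.items()}
    # Subtract the largest score before exponentiating for numerical stability;
    # this does not change the normalized probabilities.
    top = max(scores.values())
    weights = {x: math.exp(s - top) for x, s in scores.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical three-point space with two binary features.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
theta = [0.5, -1.0]
probs = log_linear_probs(features, theta)
```

<p>Because the normalizer cancels in ratios, the log of <code>probs["a"] / probs["b"]</code> is just the difference of the two scores, which is the sense in which the log-probability is linear in the features.</p>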
<p>This model class might look fairly restricted at first, but the real magic comes in with the feature vector <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi}' title='{&#92;phi}' class='latex' />. In fact, every probabilistic model that is <a href="http://en.wikipedia.org/wiki/Radon-Nikodym_theorem">absolutely continuous</a> with respect to Lebesgue measure can be represented as a log-linear model for suitable choices of <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi}' title='{&#92;phi}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />. This is actually trivially true, as we can just take <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi+%3A+X+%5Crightarrow+%5Cmathbb%7BR%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi : X &#92;rightarrow &#92;mathbb{R}}' title='{&#92;phi : X &#92;rightarrow &#92;mathbb{R}}' class='latex' /> to be <img src='http://s0.wp.com/latex.php?latex=%7B%5Clog+p%28x%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;log p(x)}' title='{&#92;log p(x)}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> to be <img src='http://s0.wp.com/latex.php?latex=%7B1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{1}' title='{1}' class='latex' />.</p>
<p>You might object to this choice of <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi}' title='{&#92;phi}' class='latex' />, since it maps into <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathbb%7BR%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathbb{R}}' title='{&#92;mathbb{R}}' class='latex' /> rather than <img src='http://s0.wp.com/latex.php?latex=%7B%5C%7B0%2C1%5C%7D%5En%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;{0,1&#92;}^n}' title='{&#92;{0,1&#92;}^n}' class='latex' />, and feature vectors are typically discrete. However, we can do just as well by letting <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi+%3A+X+%5Crightarrow+%5C%7B0%2C1%5C%7D%5E%7B%5Cinfty%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi : X &#92;rightarrow &#92;{0,1&#92;}^{&#92;infty}}' title='{&#92;phi : X &#92;rightarrow &#92;{0,1&#92;}^{&#92;infty}}' class='latex' />, where the <img src='http://s0.wp.com/latex.php?latex=%7Bi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{i}' title='{i}' class='latex' />th coordinate of <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x)}' title='{&#92;phi(x)}' class='latex' /> is the <img src='http://s0.wp.com/latex.php?latex=%7Bi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{i}' title='{i}' class='latex' />th digit in the binary representation of <img src='http://s0.wp.com/latex.php?latex=%7B%5Clog+p%28x%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;log p(x)}' title='{&#92;log p(x)}' class='latex' />, then let <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> be the vector <img src='http://s0.wp.com/latex.php?latex=%7B%5Cleft%28%5Cfrac%7B1%7D%7B2%7D%2C%5Cfrac%7B1%7D%7B4%7D%2C%5Cfrac%7B1%7D%7B8%7D%2C%5Cldots%5Cright%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' 
alt='{&#92;left(&#92;frac{1}{2},&#92;frac{1}{4},&#92;frac{1}{8},&#92;ldots&#92;right)}' title='{&#92;left(&#92;frac{1}{2},&#92;frac{1}{4},&#92;frac{1}{8},&#92;ldots&#92;right)}' class='latex' />.</p>
<p>It is important to distinguish between the ability to represent an arbitrary model as log-linear and the ability to represent an arbitrary <em>family</em> of models as a log-linear family (that is, as the set of models we get if we fix a choice of features <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi}' title='{&#92;phi}' class='latex' /> and then vary <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />). When we don&#8217;t know the correct model in advance and want to learn it, this latter consideration can be crucial. Below, I give two examples of model families and discuss how they fit (or do not fit) into the log-linear framework. <b>Important caveat:</b> in both of the models below, it is typically the case that at least some of the variables involved are unobserved. However, we will ignore this for now, and assume that, at least at training time, all of the variables are fully observed (in other words, we can see <img src='http://s0.wp.com/latex.php?latex=%7Bx_i%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_i}' title='{x_i}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7By_i%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{y_i}' title='{y_i}' class='latex' /> in the hidden Markov model and we can see the full tree of productions in the probabilistic context free grammar).</p>
<p><strong>Hidden Markov Models.</strong> A hidden Markov model, or HMM, is a model with <em>latent</em> (unobserved) variables <img src='http://s0.wp.com/latex.php?latex=%7Bx_1%2C%5Cldots%2Cx_n%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_1,&#92;ldots,x_n}' title='{x_1,&#92;ldots,x_n}' class='latex' /> together with observed variables <img src='http://s0.wp.com/latex.php?latex=%7By_1%2C%5Cldots%2Cy_n%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{y_1,&#92;ldots,y_n}' title='{y_1,&#92;ldots,y_n}' class='latex' />. The distribution for <img src='http://s0.wp.com/latex.php?latex=%7By_i%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{y_i}' title='{y_i}' class='latex' /> depends only on <img src='http://s0.wp.com/latex.php?latex=%7Bx_i%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_i}' title='{x_i}' class='latex' />, and the distribution for <img src='http://s0.wp.com/latex.php?latex=%7Bx_i%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_i}' title='{x_i}' class='latex' /> depends only on <img src='http://s0.wp.com/latex.php?latex=%7Bx_%7Bi-1%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_{i-1}}' title='{x_{i-1}}' class='latex' /> (in the sense that <img src='http://s0.wp.com/latex.php?latex=%7Bx_i%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_i}' title='{x_i}' class='latex' /> is conditionally independent of <img src='http://s0.wp.com/latex.php?latex=%7Bx_1%2C%5Cldots%2Cx_%7Bi-2%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_1,&#92;ldots,x_{i-2}}' title='{x_1,&#92;ldots,x_{i-2}}' class='latex' /> given <img src='http://s0.wp.com/latex.php?latex=%7Bx_%7Bi-1%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_{i-1}}' title='{x_{i-1}}' class='latex' />). 
We can thus summarize the information in an HMM with the distributions <img src='http://s0.wp.com/latex.php?latex=%7Bp%28x_%7Bi%7D+%3D+t+%5Cmid+x_%7Bi-1%7D+%3D+s%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(x_{i} = t &#92;mid x_{i-1} = s)}' title='{p(x_{i} = t &#92;mid x_{i-1} = s)}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7Bp%28y_i+%3D+u+%5Cmid+x_i+%3D+s%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(y_i = u &#92;mid x_i = s)}' title='{p(y_i = u &#92;mid x_i = s)}' class='latex' />.</p>
<p>We can express a hidden Markov model as a log-linear model by defining two classes of features: (i) features <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi_%7Bs%2Ct%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi_{s,t}}' title='{&#92;phi_{s,t}}' class='latex' /> that count the number of <img src='http://s0.wp.com/latex.php?latex=%7Bi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{i}' title='{i}' class='latex' /> such that <img src='http://s0.wp.com/latex.php?latex=%7Bx_%7Bi-1%7D+%3D+s%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_{i-1} = s}' title='{x_{i-1} = s}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7Bx_i+%3D+t%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_i = t}' title='{x_i = t}' class='latex' />; and (ii) features <img src='http://s0.wp.com/latex.php?latex=%7B%5Cpsi_%7Bs%2Cu%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;psi_{s,u}}' title='{&#92;psi_{s,u}}' class='latex' /> that count the number of <img src='http://s0.wp.com/latex.php?latex=%7Bi%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{i}' title='{i}' class='latex' /> such that <img src='http://s0.wp.com/latex.php?latex=%7Bx_i+%3D+s%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_i = s}' title='{x_i = s}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7By_i+%3D+u%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{y_i = u}' title='{y_i = u}' class='latex' />. While this choice of features yields a model family capable of expressing an arbitrary hidden Markov model, it is also capable of learning models that are not hidden Markov models. 
In particular, we would like to think of <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_%7Bs%2Ct%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_{s,t}}' title='{&#92;theta_{s,t}}' class='latex' /> (the index of <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> corresponding to <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi_%7Bs%2Ct%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi_{s,t}}' title='{&#92;phi_{s,t}}' class='latex' />) as <img src='http://s0.wp.com/latex.php?latex=%7B%5Clog+p%28x_i%3Dt+%5Cmid+x_%7Bi-1%7D%3Ds%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;log p(x_i=t &#92;mid x_{i-1}=s)}' title='{&#92;log p(x_i=t &#92;mid x_{i-1}=s)}' class='latex' />, but there is no constraint that <img src='http://s0.wp.com/latex.php?latex=%7B%5Csum_%7Bt%7D+%5Cexp%28%5Ctheta_%7Bs%2Ct%7D%29+%3D+1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;sum_{t} &#92;exp(&#92;theta_{s,t}) = 1}' title='{&#92;sum_{t} &#92;exp(&#92;theta_{s,t}) = 1}' class='latex' /> for each <img src='http://s0.wp.com/latex.php?latex=%7Bs%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{s}' title='{s}' class='latex' />, whereas we do necessarily have <img src='http://s0.wp.com/latex.php?latex=%7B%5Csum_%7Bt%7D+p%28x_i+%3D+t+%5Cmid+x_%7Bi-1%7D%3Ds%29+%3D+1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;sum_{t} p(x_i = t &#92;mid x_{i-1}=s) = 1}' title='{&#92;sum_{t} p(x_i = t &#92;mid x_{i-1}=s) = 1}' class='latex' /> for each <img src='http://s0.wp.com/latex.php?latex=%7Bs%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{s}' title='{s}' class='latex' />. 
If <img src='http://s0.wp.com/latex.php?latex=%7Bn%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{n}' title='{n}' class='latex' /> is fixed, we still do obtain an HMM for any setting of <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />, although <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_%7Bs%2Ct%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_{s,t}}' title='{&#92;theta_{s,t}}' class='latex' /> will have no simple relationship with <img src='http://s0.wp.com/latex.php?latex=%7B%5Clog+p%28x_i+%3D+t+%5Cmid+x_%7Bi-1%7D+%3D+s%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;log p(x_i = t &#92;mid x_{i-1} = s)}' title='{&#92;log p(x_i = t &#92;mid x_{i-1} = s)}' class='latex' />. Furthermore, the relationship depends on <img src='http://s0.wp.com/latex.php?latex=%7Bn%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{n}' title='{n}' class='latex' />, and will therefore not work if we care about multiple Markov chains with different lengths.</p>
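<p>The two feature classes above can be sketched as follows. The two-state HMM, its probabilities, and the helper names <code>hmm_features</code> and <code>score</code> are hypothetical and only there for illustration. The sketch also assumes the first hidden state is given, since the features above say nothing about an initial-state distribution. With each parameter set to the log of the corresponding HMM probability, the score matches the HMM log-probability of the sequence conditioned on that first state.</p>

```python
import math
from collections import Counter

def hmm_features(xs, ys):
    """Count transition features phi[(s, t)] and emission features psi[(s, u)]."""
    phi = Counter(zip(xs[:-1], xs[1:]))  # pairs (x_{i-1}, x_i)
    psi = Counter(zip(xs, ys))           # pairs (x_i, y_i)
    return phi, psi

def score(xs, ys, theta_trans, theta_emit):
    """Log-linear score: inner product of the feature counts with theta."""
    phi, psi = hmm_features(xs, ys)
    return (sum(count * theta_trans[key] for key, count in phi.items())
            + sum(count * theta_emit[key] for key, count in psi.items()))

# Hypothetical two-state HMM (states "H" and "C", observations "1" and "2").
trans = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
emit = {("H", "1"): 0.2, ("H", "2"): 0.8, ("C", "1"): 0.9, ("C", "2"): 0.1}

# Setting theta to log-probabilities recovers the HMM exactly.
theta_trans = {key: math.log(p) for key, p in trans.items()}
theta_emit = {key: math.log(p) for key, p in emit.items()}

xs = ["H", "H", "C", "C"]
ys = ["2", "2", "1", "1"]
log_linear_score = score(xs, ys, theta_trans, theta_emit)

# The same quantity computed directly from the HMM factorization,
# conditioned on the first hidden state.
direct = math.log(0.7 * 0.3 * 0.6) + math.log(0.8 * 0.8 * 0.9 * 0.9)
```

<p>Nothing in <code>score</code> forces the exponentiated entries of <code>theta_trans</code> for a given source state to sum to one, which is exactly the missing constraint discussed above.</p>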
<p>Is the ability to express models that are not HMMs good or bad? It depends. If we know for certain that our data satisfy the HMM assumption, then expanding our model family to include models that violate that assumption can only end up hurting us. If the data do not satisfy the HMM assumption, then increasing the size of the model family may allow us to overcome what would otherwise be a model mis-specification. I personally would prefer to have as much control as possible over which assumptions I make, so I tend to see the over-expressivity of the log-linear representation of HMMs as a bug rather than a feature.</p>
<p><strong>Probabilistic Context Free Grammars.</strong> A probabilistic context free grammar, or PCFG, is simply a context free grammar where we place a probability distribution over the production rules for each non-terminal. For those unfamiliar with context free grammars, a <em>context free grammar</em> is specified by:</p>
<ol>
<li>A set <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathcal%7BS%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathcal{S}}' title='{&#92;mathcal{S}}' class='latex' /> of non-terminal symbols, including a distinguished <em>initial symbol</em> <img src='http://s0.wp.com/latex.php?latex=%7BE%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E}' title='{E}' class='latex' />.</li>
<li>A set <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathcal%7BT%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathcal{T}}' title='{&#92;mathcal{T}}' class='latex' /> of terminal symbols.</li>
<li>For each <img src='http://s0.wp.com/latex.php?latex=%7Bs+%5Cin+%5Cmathcal%7BS%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{s &#92;in &#92;mathcal{S}}' title='{s &#92;in &#92;mathcal{S}}' class='latex' />, one or more <em>production rules</em> of the form <img src='http://s0.wp.com/latex.php?latex=%7Bs+%5Cmapsto+w_1w_2%5Ccdots+w_k%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{s &#92;mapsto w_1w_2&#92;cdots w_k}' title='{s &#92;mapsto w_1w_2&#92;cdots w_k}' class='latex' />, where <img src='http://s0.wp.com/latex.php?latex=%7Bk+%5Cgeq+0%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{k &#92;geq 0}' title='{k &#92;geq 0}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7Bw_i+%5Cin+%5Cmathcal%7BS%7D+%5Ccup+%5Cmathcal%7BT%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{w_i &#92;in &#92;mathcal{S} &#92;cup &#92;mathcal{T}}' title='{w_i &#92;in &#92;mathcal{S} &#92;cup &#92;mathcal{T}}' class='latex' />.</li>
</ol>
<p>For instance, a context free grammar for arithmetic expressions might have <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathcal%7BS%7D+%3D+%5C%7BE%5C%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathcal{S} = &#92;{E&#92;}}' title='{&#92;mathcal{S} = &#92;{E&#92;}}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=%7B%5Cmathcal%7BT%7D+%3D+%5C%7B%2B%2C-%2C%5Ctimes%2C%2F%2C%28%2C%29%5C%7D+%5Ccup+%5Cmathbb%7BR%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;mathcal{T} = &#92;{+,-,&#92;times,/,(,)&#92;} &#92;cup &#92;mathbb{R}}' title='{&#92;mathcal{T} = &#92;{+,-,&#92;times,/,(,)&#92;} &#92;cup &#92;mathbb{R}}' class='latex' />, and the following production rules:</p>
<ul>
<li><img src='http://s0.wp.com/latex.php?latex=%7BE+%5Cmapsto+x%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E &#92;mapsto x}' title='{E &#92;mapsto x}' class='latex' /> for all <img src='http://s0.wp.com/latex.php?latex=%7Bx+%5Cin+%5Cmathbb%7BR%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x &#92;in &#92;mathbb{R}}' title='{x &#92;in &#92;mathbb{R}}' class='latex' /></li>
<li><img src='http://s0.wp.com/latex.php?latex=%7BE+%5Cmapsto+E+%2B+E%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E &#92;mapsto E + E}' title='{E &#92;mapsto E + E}' class='latex' /></li>
<li><img src='http://s0.wp.com/latex.php?latex=%7BE+%5Cmapsto+E+-+E%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E &#92;mapsto E - E}' title='{E &#92;mapsto E - E}' class='latex' /></li>
<li><img src='http://s0.wp.com/latex.php?latex=%7BE+%5Cmapsto+E+%5Ctimes+E%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E &#92;mapsto E &#92;times E}' title='{E &#92;mapsto E &#92;times E}' class='latex' /></li>
<li><img src='http://s0.wp.com/latex.php?latex=%7BE+%5Cmapsto+E+%2F+E%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E &#92;mapsto E / E}' title='{E &#92;mapsto E / E}' class='latex' /></li>
<li><img src='http://s0.wp.com/latex.php?latex=%7BE+%5Cmapsto+%28E%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E &#92;mapsto (E)}' title='{E &#92;mapsto (E)}' class='latex' /></li>
</ul>
<p>The <em>language</em> corresponding to a context free grammar is the set of all strings that can be obtained by starting from <img src='http://s0.wp.com/latex.php?latex=%7BE%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E}' title='{E}' class='latex' /> and applying production rules until we only have terminal symbols. The language corresponding to the above grammar is, in fact, the set of well-formed arithmetic expressions, such as <img src='http://s0.wp.com/latex.php?latex=%7B5-4-2%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{5-4-2}' title='{5-4-2}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=%7B2-3%5Ctimes+%284.3%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{2-3&#92;times (4.3)}' title='{2-3&#92;times (4.3)}' class='latex' />, and <img src='http://s0.wp.com/latex.php?latex=%7B5%2F9927.12%2F%283-3%5Ctimes+1%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{5/9927.12/(3-3&#92;times 1)}' title='{5/9927.12/(3-3&#92;times 1)}' class='latex' />.</p>
<p>As mentioned above, a probabilistic context free grammar simply places a distribution over the production rules for any given non-terminal symbol. By repeatedly sampling from these distributions until we are left with only terminal symbols, we obtain a probability distribution over the language of the grammar.</p>
<p>We can represent a PCFG as a log-linear model by using a feature <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi_r%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi_r}' title='{&#92;phi_r}' class='latex' /> for each production rule <img src='http://s0.wp.com/latex.php?latex=%7Br%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{r}' title='{r}' class='latex' />. For instance, we have a feature that counts the number of times that the rule <img src='http://s0.wp.com/latex.php?latex=%7BE+%5Cmapsto+E+%2B+E%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E &#92;mapsto E + E}' title='{E &#92;mapsto E + E}' class='latex' /> gets applied, and another feature that counts the number of times that <img src='http://s0.wp.com/latex.php?latex=%7BE+%5Cmapsto+%28E%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E &#92;mapsto (E)}' title='{E &#92;mapsto (E)}' class='latex' /> gets applied. Such features yield a log-linear model family that contains all probabilistic context free grammars for a given (deterministic) context free grammar. However, it also contains additional models that do not correspond to PCFGs; this is because we run into the same problem as for HMMs, which is that the sum of <img src='http://s0.wp.com/latex.php?latex=%7B%5Cexp%28%5Ctheta_r%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;exp(&#92;theta_r)}' title='{&#92;exp(&#92;theta_r)}' class='latex' /> over production rules of a given non-terminal does not necessarily add up to <img src='http://s0.wp.com/latex.php?latex=%7B1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{1}' title='{1}' class='latex' />. In fact, the problem is even worse here. For instance, suppose that <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_%7BE+%5Cmapsto+E+%2B+E%7D+%3D+0.1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta_{E &#92;mapsto E + E} = 0.1}' title='{&#92;theta_{E &#92;mapsto E + E} = 0.1}' class='latex' /> in the model above. 
Then the expression <img src='http://s0.wp.com/latex.php?latex=%7BE%2BE%2BE%2BE%2BE%2BE%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E+E+E+E+E+E}' title='{E+E+E+E+E+E}' class='latex' /> gets a score of <img src='http://s0.wp.com/latex.php?latex=%7B0.5%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{0.5}' title='{0.5}' class='latex' />, and longer chains of <img src='http://s0.wp.com/latex.php?latex=%7BE%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{E}' title='{E}' class='latex' />s get even higher scores. In particular, there is an infinite sequence of expressions with increasing scores and therefore the model doesn&#8217;t normalize (since the sum of the exponentiated scores of all possible productions is infinite).</p>
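<p>The blow-up can be checked numerically. The sketch below is only illustrative: like the example above, it scores a chain using the weight of the addition rule alone and ignores every other rule, and it counts a single derivation per chain length, so <code>partial_sum</code> is a lower bound on the true normalizer rather than the normalizer itself. The names <code>chain_score</code> and <code>partial_sum</code> are hypothetical.</p>

```python
import math

THETA_PLUS = 0.1  # weight of the rule E -> E + E, as in the example above

def chain_score(num_es):
    """Score of the form E + E + ... + E containing num_es copies of E.

    Producing it takes num_es - 1 applications of E -> E + E, and only
    that rule contributes to the score in this simplified setting.
    """
    return THETA_PLUS * (num_es - 1)

def partial_sum(max_es):
    """Sum of exponentiated scores over chains with 1..max_es copies of E."""
    return sum(math.exp(chain_score(k)) for k in range(1, max_es + 1))

six_chain = chain_score(6)  # five applications of the addition rule
growth = [partial_sum(m) for m in (50, 100, 200)]
```

<p>Every term in the sum is at least one, so the partial sums grow without bound as longer chains are included, and no finite normalizing constant exists.</p>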
<p>So, log-linear models over-represent PCFGs in the same way as they over-represent HMMs, but the problems are even worse than before. Let&#8217;s ignore these issues for now, and suppose that we want to learn PCFGs with an <em>unknown</em> underlying CFG. To be a bit more concrete, suppose that we have a large collection of possible production rules for each non-terminal <img src='http://s0.wp.com/latex.php?latex=%7Bs+%5Cin+%5Cmathcal%7BS%7D%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{s &#92;in &#92;mathcal{S}}' title='{s &#92;in &#92;mathcal{S}}' class='latex' />, and we think that a small but unknown subset of those production rules should actually appear in the grammar. Then there is no way to encode this directly within the context of a log-linear model family, although we can encode such &#8220;sparsity constraints&#8221; using simple extensions to log-linear models (for instance, by adding a penalty for the number of non-zero entries in <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />). So, we have found another way in which the log-linear representation is not entirely adequate.</p>
<p><strong>Conclusion.</strong> Based on the examples above, we have seen that log-linear models have difficulty encoding certain constraints on their parameters. This showed up in two different ways: first, we are unable to constrain the exponentiated parameters within a group (such as the production rules of a single non-terminal) to add up to <img src='http://s0.wp.com/latex.php?latex=%7B1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{1}' title='{1}' class='latex' /> (what I call &#8220;local normalization&#8221; constraints); second, we are unable to encode sparsity constraints within the model. In both of these cases, it is possible to extend the log-linear framework to address these sorts of constraints, although that is outside the scope of this post.</p>
<p><b> Parameter Estimation for Log-Linear Models </b></p>
<p>I&#8217;ve explained what a log-linear model is, and partially characterized its representational power. I will now answer the practical question of how to estimate the parameters of a log-linear model (i.e., how to fit <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> based on observed data). Recall that a log-linear model places a distribution over a space <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> by choosing <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi+%3A+X+%5Crightarrow+%5Cmathbb%7BR%7D%5En%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi : X &#92;rightarrow &#92;mathbb{R}^n}' title='{&#92;phi : X &#92;rightarrow &#92;mathbb{R}^n}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%5Cin+%5Cmathbb%7BR%7D%5En%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;theta &#92;in &#92;mathbb{R}^n}' title='{&#92;theta &#92;in &#92;mathbb{R}^n}' class='latex' /> and defining</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28x+%5Cmid+%5Ctheta%29+%5Cpropto+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;displaystyle p(x &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;phi(x)^T&#92;theta)' title='&#92;displaystyle p(x &#92;mid &#92;theta) &#92;propto &#92;exp(&#92;phi(x)^T&#92;theta)' class='latex' /></p>
<p>More precisely (assuming <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is a discrete space), we have</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+p%28x+%5Cmid+%5Ctheta%29+%3D+%5Cfrac%7B%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%7D%7B%5Csum_%7Bx%27+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%27%29%5ET%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=555555&amp;s=0' alt='&#92;displaystyle p(x &#92;mid &#92;theta) = &#92;frac{&#92;exp(&#92;phi(x)^T&#92;theta)}{&#92;sum_{x&#039; &#92;in X} &#92;exp(&#92;phi(x&#039;)^T&#92;theta)}' title='&#92;displaystyle p(x &#92;mid &#92;theta) = &#92;frac{&#92;exp(&#92;phi(x)^T&#92;theta)}{&#92;sum_{x&#039; &#92;in X} &#92;exp(&#92;phi(x&#039;)^T&#92;theta)}' class='latex' /></p>
<p>Given observations <img src='http://s0.wp.com/latex.php?latex=%7Bx_1%2C%5Cldots%2Cx_n%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_1,&#92;ldots,x_n}' title='{x_1,&#92;ldots,x_n}' class='latex' />, which we assume to be independent given <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />, our goal is to choose <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> maximizing <img src='http://s0.wp.com/latex.php?latex=%7Bp%28x_1%2C%5Cldots%2Cx_n+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p(x_1,&#92;ldots,x_n &#92;mid &#92;theta)}' title='{p(x_1,&#92;ldots,x_n &#92;mid &#92;theta)}' class='latex' />, or, equivalently, <img src='http://s0.wp.com/latex.php?latex=%7B%5Clog+p%28x_1%2C%5Cldots%2Cx_n+%5Cmid+%5Ctheta%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;log p(x_1,&#92;ldots,x_n &#92;mid &#92;theta)}' title='{&#92;log p(x_1,&#92;ldots,x_n &#92;mid &#92;theta)}' class='latex' />. In equations, we want</p>
<p><a name="eqnobj"></a></p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Ctheta%5E%2A+%3D+%5Carg%5Cmax%5Climits_%7B%5Ctheta%7D+%5Csum_%7Bi%3D1%7D%5En+%5Cleft%5B%5Cphi%28x_i%29%5ET%5Ctheta+-+%5Clog%5Cleft%28%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%5Cright%29+%5Cright%5D.+%5C+%5C+%5C+%5C+%5C+%281%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;theta^* = &#92;arg&#92;max&#92;limits_{&#92;theta} &#92;sum_{i=1}^n &#92;left[&#92;phi(x_i)^T&#92;theta - &#92;log&#92;left(&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;right) &#92;right]. &#92; &#92; &#92; &#92; &#92; (1)' title='&#92;displaystyle &#92;theta^* = &#92;arg&#92;max&#92;limits_{&#92;theta} &#92;sum_{i=1}^n &#92;left[&#92;phi(x_i)^T&#92;theta - &#92;log&#92;left(&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;right) &#92;right]. &#92; &#92; &#92; &#92; &#92; (1)' class='latex' /></p>
<p>We typically use gradient methods (such as <a href="http://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a>, <a href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a>, or <a href="http://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS</a>) to maximize the objective on the right-hand side of (<a href="#eqnobj">1</a>), or equivalently to minimize its negative. If we compute the gradient of the objective in (<a href="#eqnobj">1</a>), we get:</p>
<p><a name="eqngrad"></a></p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Csum_%7Bi%3D1%7D%5En+%5Cleft%28%5Cphi%28x_i%29-%5Cfrac%7B%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%5Cphi%28x%29%7D%7B%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%7D%5Cright%29.+%5C+%5C+%5C+%5C+%5C+%282%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;sum_{i=1}^n &#92;left(&#92;phi(x_i)-&#92;frac{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;phi(x)}{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)}&#92;right). &#92; &#92; &#92; &#92; &#92; (2)' title='&#92;displaystyle &#92;sum_{i=1}^n &#92;left(&#92;phi(x_i)-&#92;frac{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;phi(x)}{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)}&#92;right). &#92; &#92; &#92; &#92; &#92; (2)' class='latex' /></p>
<p>We can re-write (<a href="#eqngrad">2</a>) in the following more compact form:</p>
<p><a name="eqngrad2"></a></p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Csum_%7Bi%3D1%7D%5En+%5Cleft%28%5Cphi%28x_i%29+-+%5Cmathbb%7BE%7D%5B%5Cphi%28x%29+%5Cmid+%5Ctheta%5D%5Cright%29.+%5C+%5C+%5C+%5C+%5C+%283%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle &#92;sum_{i=1}^n &#92;left(&#92;phi(x_i) - &#92;mathbb{E}[&#92;phi(x) &#92;mid &#92;theta]&#92;right). &#92; &#92; &#92; &#92; &#92; (3)' title='&#92;displaystyle &#92;sum_{i=1}^n &#92;left(&#92;phi(x_i) - &#92;mathbb{E}[&#92;phi(x) &#92;mid &#92;theta]&#92;right). &#92; &#92; &#92; &#92; &#92; (3)' class='latex' /></p>
<p>In other words, the contribution of each training example <img src='http://s0.wp.com/latex.php?latex=%7Bx_i%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_i}' title='{x_i}' class='latex' /> to the gradient is the extent to which the feature values for <img src='http://s0.wp.com/latex.php?latex=%7Bx_i%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{x_i}' title='{x_i}' class='latex' /> exceed their expected values conditioned on <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />.</p>
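<p>As a sanity check on the gradient formula, here is a small sketch for a finite space. The space, the data, and the function names are hypothetical: points are identified with indices <code>0, 1, 2</code>, <code>features[i]</code> holds the feature vector of point <code>i</code>, and <code>data</code> lists the indices of the observed points.</p>

```python
import math

def expected_features(features, theta):
    """E[phi(x) | theta] on a finite space; features[i] is phi of point i."""
    scores = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * phi[j] for w, phi in zip(weights, features)) / total
            for j in range(len(theta))]

def log_likelihood(features, data, theta):
    """Objective (1): sum over observations of score minus log-normalizer."""
    scores = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return sum(scores[i] - log_z for i in data)

def gradient(features, data, theta):
    """Formula (3): sum over observations of phi(x_i) - E[phi(x) | theta]."""
    mean = expected_features(features, theta)
    return [sum(features[i][j] - mean[j] for i in data)
            for j in range(len(theta))]

# Hypothetical example: three points, two features, four observations.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
data = [0, 0, 2, 1]
theta = [0.3, -0.2]
grad = gradient(features, data, theta)
```

<p>Comparing <code>grad</code> against a finite-difference approximation of <code>log_likelihood</code> is a cheap way to catch sign or indexing mistakes before handing the gradient to an optimizer.</p>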
<p>One important consideration for such gradient-based numerical optimizers is <em>convexity</em>. If we are minimizing a convex function (or, equivalently, maximizing a concave one), then every local optimum is a global optimum, and gradient methods with a suitable step size converge to a global optimum whenever one exists. If the objective function is non-convex, then a gradient-based approach (or any other type of local search) may converge to a <em>local optimum</em> that is very far from the global optimum. In order to assess convexity, we compute the <em>Hessian</em> (matrix of second derivatives) and check whether it is positive semidefinite everywhere. (In this case, we actually care about concavity, so we want the Hessian to be negative semidefinite.) We can compute the Hessian by differentiating (<a href="#eqngrad">2</a>), which gives us</p>
<p><a name="eqnhess"></a></p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+n+%5Ctimes+%5Cleft%5B%5Cleft%28%5Cfrac%7B%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%5Cphi%28x%29%7D%7B%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%7D%5Cright%29%5Cleft%28%5Cfrac%7B%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%5Cphi%28x%29%7D%7B%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%7D%5Cright%29%5ET-%5Cfrac%7B%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET%5Ctheta%29%5Cphi%28x%29%5Cphi%28x%29%5ET%7D%7B%5Csum_%7Bx+%5Cin+X%7D+%5Cexp%28%5Cphi%28x%29%5ET+%5Ctheta%29%7D%5Cright%5D.+%5C+%5C+%5C+%5C+%5C+%284%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle n &#92;times &#92;left[&#92;left(&#92;frac{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;phi(x)}{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)}&#92;right)&#92;left(&#92;frac{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;phi(x)}{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)}&#92;right)^T-&#92;frac{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;phi(x)&#92;phi(x)^T}{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T &#92;theta)}&#92;right]. &#92; &#92; &#92; &#92; &#92; (4)' title='&#92;displaystyle n &#92;times &#92;left[&#92;left(&#92;frac{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;phi(x)}{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)}&#92;right)&#92;left(&#92;frac{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;phi(x)}{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)}&#92;right)^T-&#92;frac{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T&#92;theta)&#92;phi(x)&#92;phi(x)^T}{&#92;sum_{x &#92;in X} &#92;exp(&#92;phi(x)^T &#92;theta)}&#92;right]. &#92; &#92; &#92; &#92; &#92; (4)' class='latex' /></p>
<p>Again, we can re-write this more compactly as</p>
<p><a name="eqnhess2"></a></p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+n%5Ctimes+%5Cleft%28%5Cmathbb%7BE%7D%5B%5Cphi%28x%29+%5Cmid+%5Ctheta%5D%5Cmathbb%7BE%7D%5B%5Cphi%28x%29+%5Cmid+%5Ctheta%5D%5ET+-+%5Cmathbb%7BE%7D%5B%5Cphi%28x%29%5Cphi%28x%29%5ET+%5Cmid+%5Ctheta%5D%5Cright%29.+%5C+%5C+%5C+%5C+%5C+%285%29&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;displaystyle n&#92;times &#92;left(&#92;mathbb{E}[&#92;phi(x) &#92;mid &#92;theta]&#92;mathbb{E}[&#92;phi(x) &#92;mid &#92;theta]^T - &#92;mathbb{E}[&#92;phi(x)&#92;phi(x)^T &#92;mid &#92;theta]&#92;right). &#92; &#92; &#92; &#92; &#92; (5)' title='&#92;displaystyle n&#92;times &#92;left(&#92;mathbb{E}[&#92;phi(x) &#92;mid &#92;theta]&#92;mathbb{E}[&#92;phi(x) &#92;mid &#92;theta]^T - &#92;mathbb{E}[&#92;phi(x)&#92;phi(x)^T &#92;mid &#92;theta]&#92;right). &#92; &#92; &#92; &#92; &#92; (5)' class='latex' /></p>
<p>The term inside the parentheses of (<a href="#eqnhess2">5</a>) is exactly the negative of the <a href="http://en.wikipedia.org/wiki/Covariance_matrix">covariance matrix</a> of <img src='http://s0.wp.com/latex.php?latex=%7B%5Cphi%28x%29%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{&#92;phi(x)}' title='{&#92;phi(x)}' class='latex' /> given <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />. Since a covariance matrix is always positive semidefinite, the Hessian is negative semidefinite, so the objective function we are trying to maximize is indeed concave. As noted before, this implies that any optimum our gradient methods converge to is a global optimum.</p>
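<p>As a quick numerical sanity check of (<a href="#eqnhess2">5</a>), here is a minimal sketch in plain Python. The feature vectors, the value of theta, and n are made-up toy values, and the helper names <code>hessian</code> and <code>quadratic_form</code> are my own. It builds the Hessian from the two expectations and evaluates the quadratic form v^T H v in a few directions. The second example uses a constant feature, in which case the quadratic form vanishes in that direction, so the Hessian need not be strictly negative definite.</p>

```python
import math

def hessian(features, theta, n):
    """Hessian from equation (5): n * (E[phi] E[phi]^T - E[phi phi^T])."""
    d = len(theta)
    # Unnormalized weights exp(phi(x)^T theta) for each x in X.
    weights = [math.exp(sum(f * t for f, t in zip(phi, theta))) for phi in features]
    z = sum(weights)
    probs = [w / z for w in weights]
    # First and second moments of phi(x) under the model distribution.
    mean = [sum(p * phi[i] for p, phi in zip(probs, features)) for i in range(d)]
    second = [[sum(p * phi[i] * phi[j] for p, phi in zip(probs, features))
               for j in range(d)] for i in range(d)]
    return [[n * (mean[i] * mean[j] - second[i][j]) for j in range(d)]
            for i in range(d)]

def quadratic_form(h, v):
    """Compute v^T h v for a square matrix h given as nested lists."""
    return sum(v[i] * h[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))

# Hypothetical toy problem: X has four outcomes with 2-d feature vectors.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.3, -0.7]
H = hessian(features, theta, n=10)

# Every direction should give a non-positive value of v^T H v.
directions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
values = [quadratic_form(H, v) for v in directions]
print(values)

# With a constant first feature, phi(x) has zero variance in that direction,
# so the quadratic form there is zero up to floating-point error.
const_features = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
flat = quadratic_form(hessian(const_features, theta, n=10), [1.0, 0.0])
print(flat)
```

<p>Checking a handful of directions is of course not a proof; the covariance argument above is what establishes concavity in general.</p>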
<p><b> Regularization and Concavity </b></p>
<p>We may in practice wish to encode additional prior knowledge about <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> in our model, especially if the dimensionality of <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='&#92;theta' title='&#92;theta' class='latex' /> is large relative to the amount of data we have. Can we do this and still maintain concavity? The answer in many cases is yes: since the <a href="http://en.wikipedia.org/wiki/Lp_space#The_p-norm_in_finite_dimensions"><img src='http://s0.wp.com/latex.php?latex=%7BL%5Ep%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{L^p}' title='{L^p}' class='latex' />-norm</a> is convex for all <img src='http://s0.wp.com/latex.php?latex=%7Bp+%5Cgeq+1%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p &#92;geq 1}' title='{p &#92;geq 1}' class='latex' />, its negative is concave, and a sum of concave functions is concave. We can therefore subtract an <img src='http://s0.wp.com/latex.php?latex=%7BL%5Ep%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{L^p}' title='{L^p}' class='latex' /> penalty from the objective for any such <img src='http://s0.wp.com/latex.php?latex=%7Bp%7D&amp;bg=f0f0f0&amp;fg=000000&amp;s=0' alt='{p}' title='{p}' class='latex' /> and still have a concave objective function.</p>
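<p>To illustrate, here is a minimal sketch of gradient ascent on a penalized objective, in plain Python. Several things are assumptions of mine rather than anything fixed by the discussion above: the penalty is the squared L2 norm scaled by <code>lam / 2</code> (which is also convex), the gradient of the unpenalized objective is taken to be the sum of observed features minus n times the expected features (consistent with the Hessian in (<a href="#eqnhess">4</a>)), and the toy data, step size, and iteration count are made up.</p>

```python
import math

def expected_features(features, theta):
    """E[phi(x) | theta] under the log-linear model over a finite set X."""
    weights = [math.exp(sum(f * t for f, t in zip(phi, theta))) for phi in features]
    z = sum(weights)
    d = len(theta)
    return [sum(w / z * phi[i] for w, phi in zip(weights, features)) for i in range(d)]

def gradient(features, feature_sum, n, theta, lam):
    """Gradient of: log-likelihood - (lam / 2) * ||theta||^2 (assumed penalty)."""
    mean = expected_features(features, theta)
    return [s - n * m - lam * t for s, m, t in zip(feature_sum, mean, theta)]

def gradient_ascent(features, feature_sum, n, lam, step=0.1, iters=2000):
    """Fixed-step gradient ascent starting from theta = 0."""
    theta = [0.0] * len(feature_sum)
    for _ in range(iters):
        g = gradient(features, feature_sum, n, theta, lam)
        theta = [t + step * gi for t, gi in zip(theta, g)]
    return theta

# Hypothetical toy data: X = {0,1}^2 with phi(x) = x, and n = 10 observations
# whose feature vectors sum to (7, 3).
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature_sum = [7.0, 3.0]
n = 10

theta_mle = gradient_ascent(features, feature_sum, n, lam=0.0)
theta_reg = gradient_ascent(features, feature_sum, n, lam=1.0)
print(theta_mle)
print(theta_reg)
```

<p>In this toy example the penalized solution is pulled toward zero relative to the unpenalized one, which is the intended effect of the prior. The fixed step size works here only because the toy problem is small and well conditioned; a line search or an off-the-shelf optimizer would be the more robust choice in practice.</p>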
<p><b> Conclusion </b></p>
<p>Log-linear models provide a universal representation for individual probability distributions, but not for arbitrary families of probability distributions (for instance, due to the inability to capture local normalization constraints or sparsity constraints). However, for the families they do express, parameter optimization can be performed efficiently due to a likelihood function that is log-concave in its parameters. Log-linear models also have tie-ins to many other beautiful areas of statistics, such as exponential families, which will be the subject of the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://jsteinhardt.wordpress.com/2012/12/06/log-linear-models/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c0d709db669c6eb66c98ee050c45527d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jsteinhardt</media:title>
		</media:content>
	</item>
	</channel>
</rss>
