Generalizing Across Categories

Humans are very good at correctly generalizing rules across categories (at least, compared to computers). In this post I will examine mechanisms that would allow us to do this in a reasonably rigorous manner. To this end I will present a probabilistic model such that conditional inference on that model leads to generalization across a category.

There are three questions along these lines that I hope to answer:

  • How does one generalize rules across categories?
  • How does one determine which rules should generalize across which categories?
  • How does one determine when to group objects into a category in the first place?

I suspect that the mechanisms for each of these is rather complex, but I am reasonably confident that the methods I present make up at least part of the actual answer. A good exercise is to come up with examples where these methods fail.

Generalizing Across Categories

For simplicity I’m going to consider just some sort of binary rule, such as the existence of an attribute. So as an example, let’s suppose that we see a bunch of ducks, which look to varying degrees to be mallards. In addition to this, we notice that some of the ducks have three toes, and some have four toes. In this case the category is “mallards”, and the attribute is “has three toes”.

A category is going to be represented as a probabilistic relation for each potential member, of the form p(x \in C), where C is the category and x is the object in question. This essentially indicates the degree to which the object x belongs to the category C. For instance, if the category is “birds”, then pigeons, sparrows, and eagles all fit into the category very well, so we assign a high probability (close to 1) to pigeons, sparrows, and eagles belonging in the category “birds”. On the other hand, flamingos, penguins, ostriches, and pterodactyls, while still being bird-like, don’t necessarily fit into our archetypal notion of what it means to be a bird. So they get lower probabilities (I’d say around 0.8 for the first 3 and around 0.1 for pterodactyls, but that is all pretty subjective). Finally, dogs and cats get a near-zero probability of being birds, and a table, the plus operator, or Beethoven’s Ninth Symphony would get even-closer-to-zero probabilities of being a bird.

In the example with ducks, the probability of being a mallard will probably be based on how many observed characteristics the duck in question has in common with mallards.

For now we will think of a rule or an attribute as an observable binary property about various objects. Let’s let P be the set of all things that have this property. Now we assume that membership in P has some base probability of occurrence \theta, together with some probability of occurrence within C of \theta_C. We will assume that there are objects x_i, y_j, and z_k, such that the x_i were all observed to lie in P, the y_j were all observed to not lie in P, and membership in P has been unobserved for the z_k.

Furthermore, x_i has probability p_i of belonging to C, y_j has probability q_j of belonging to C, and z_k has probability r_k of belonging to C.

Again, going back to the ducks example, the x_i are the ducks observed to have three toes, the y_j are the ducks observed to have four toes, and the z_k are the ducks whose number of toes we have not yet observed. The p_i, q_j, and r_k are the aforementioned probabilities that each of these ducks is a mallard.

So to summarize, we have three classes of objects — those which certainly lie in P, those which certainly don’t lie in P, and those for which we have no information about P. For each element in each of these classes, we have a measure of the extent to which it belongs in C.

Given this setup, how do we actually go about specifying a model explicitly? Basically, we say that if something lies in C then it has property P with probability \theta_C, and otherwise it has it with probability \theta. Thus we end up with

p(\theta, \theta_C \ \mid \ p_i, q_j) \propto p(\theta)p(\theta_C) \left(\prod_{i} p_i\theta_C+(1-p_i)\theta\right)\left(\prod_{j} q_j(1-\theta_C)+(1-q_j)(1-\theta)\right)

What this ends up meaning is that \theta will shift to accommodate the objects that (most likely) don’t lie in C, and \theta_C will shift to accommodate the objects that (most likely) do lie in C. At the same time, if for instance most elements in C also lie in P, then the elements that don’t lie in P will have a relatively lower posterior probability of lying in C (compared to its prior probability). So this model not only has the advantage of generalizing across all of C based on the given observations, it also has the advantage of re-evaluating whether an object is likely to lie in a category based on whether it shares attributes with the other objects in that category.

Let’s go back to the mallards example one more time. Suppose that after we make all of our observations, we notice that most of the ducks that we think are mallards also have three toes. Then \theta_C in this case (the probability that a mallard has three toes) will be close to 1. Furthermore, any duck that we observe to have four toes, we will think much less likely to be a mallard, even if it otherwise looks similar (although it wouldn’t be impossible; it could be a genetic mutant, for instance). At the same time, if non-mallards seem to have either three or four toes with roughly equal frequency, then \theta will be close to 0.5.

There is one thing that I am dissatisfied with in the above model, though. As it stands, \theta measures the probability of an object not lying in C to have property P, rather than the probability of a generic object to have property P. This is mainly a problem because, later on, I would like to be able to talk about an object lying in multiple categories, and I don’t have a good way of doing that yet.

An important thing to realize here is that satisfying a rule or having an attribute is just another way of indicating membership in a set. So we can think of both the category and the attribute as potential sets that an object could lie in; as before, we’ll call the category C and we will call the set of objects having a given attribute A. Then \theta_C is something like p(x \in A \ \mid \ x \in C), whereas \theta (should be) p(x \in A), although as I’ve set it up now it’s more like p(x \in A \ \mid \ x \not\in C).

As a final note, this same setup can be applied to the case when there are multiple attributes under consideration. We can also apply it to the case when the objects are themselves categories, and so instead of having strict observations about each attribute, we have some sort of probability that the object will possess an attribute. In this latter case we just treat these probabilities as uncertain observations, as discussed previously.

When To Generalize

Another important question is which rules / attributes we should expect to generalize across a category, and which we should not. For instance, if we would expect “number of toes” to generalize across all animals in a given species, but not “age”.

I think that for binary attributes we will always have to generalize, although our generalization could just be “occurs at the base rate in the population”. However, for more complicated attributes (for instance, age, which has an entire continuum of possible values), we can think about whether to generalize based on whether our posterior distribution ends up very tightly concentrated or very spread out. Even if it ends up fairly spread out, there is still some degree of generalization — for instance, we expect that most insects have fairly short lifespans compared to mammals. You could say that this is actually a scientific fact (since we can measure lifespan), but even if you ran into a completely new species of insect, you would expect it to have a relatively short lifespan.

On the other hand, if we went to a college town, we might quickly infer that everyone there was in their 20s; if we went to a bridge club, we would quickly infer that everyone there was past retirement age.

What I would say is that we can always generalize across a category, but sometimes the posterior distribution we end up with is very spread out, and so our generalization isn’t particularly strong. In some cases, the posterior distribution might not even be usefully different from the prior distribution; in this case, from a computational standpoint it doesn’t make sense to keep track of it, and so it feels like an attribute shouldn’t generalize across a category, while what we really mean is that we don’t bother to keep track of that generalization.

An important thing to keep in mind on this topic is that everything that we do when we construct models is a computational heuristic. If we didn’t care about computational complexity and only wanted to arrive at an answer, we would use Solomonoff induction, or some computable approximation to it. So whenever I talk about something in probabilistic modeling, I’m partially thinking about whether the method is computationally feasible at all, and what sort of heuristics might be necessary to actually implement something in practice.

To summarize:

  • attributes always generalize across a category
  • but the generalization might be so weak as to be indistinguishable from the prior distribution
  • it would be interesting to figure out how we decide which generalizations to keep track of, and which not to

Forming Categories

My final topic of consideration is when to actually form a category. Before going into this, I’d like to share some intuition with you. This is the intuition of explanatory “cheapness” or “expensiveness”. The idea is that events that have low probability under a given hypothesis are “expensive”, and events that have relatively high probability are “cheap”. When a fixed event is expensive under a model, that is evidence against that model; when it is cheap under a model, that is evidence for the model.

The reason that forming clusters can be cheaper is that it allows us to explain many related events simultaneously, which is cheaper than explaining them all separately. Let’s take an extreme example — suppose that 70% of all ducks have long beaks, and 70% have three toes (the remaining 30% in both cases have short beaks and four toes). Then the “cheapest” explanation would be to assign both long beaks and three toes a probability of 0.7, and the probability of our observations would be (supposing there were 100 ducks) 0.7^{140}0.3^{60} \approx 10^{-53}. However, suppose that we also observe that the ducks with long beaks are exactly the ducks with three toes. Then we can instead postulate that there are two categories, “(long,3)” and “(short,4)”, and that the probability of ending up in the first category is 0.7 while the probability of ending up in the second category is 0.3. This is a much cheaper explanation, as the probability of our observations in this scenario becomes 0.7^{70}0.3^{30} \approx 10^{-27}. We have to balance this slightly with the fact that an explanation that involves creating an extra category is more expensive (because it has a lower prior probability / higher complexity) than one that doesn’t, but the extra category can’t possibly be 10^{26} times more expensive, so we should still always favor it.

This also demonstrates that it will become progressively cheaper to form categories as we get large amounts of similar evidence, which corresponds with our intuitive notion of categories as similarity clusters.

So, now that I’ve actually given you this intuition, how do we actually go about forming categories? Each time we form a category, we will have some base probability of membership in that category, together with a probability of each member of that category possessing each attribute under consideration. If we simply pick prior distributions for each of these parameters, then they will naturally adopt posterior distributions based on the observed data, although these distributions might have multiple high-probability regions if multiple categories can be formed from the data. At the same time, each object will naturally end up with some probability of being in each category, based on how well that category explains its characteristics (as well as the base probability that objects should end up in that category). Putting these together, we see that clusters will naturally form to accommodate the data.

As an example of how we would have multiple high-probability regions, suppose that, as before, there are long-billed, three-toed ducks and short-billed, four-toed ducks. But we also notice that ducks with orange beaks have white feathers and ducks with yellow beaks have brown feathers. If we form a single category, then there are two good candidates, so that category is likely to be either the (long,3) category or the (orange,white) category (or their complements). If we form two categories, each category in isolation is still likely to be either the (long,3) or (orange,white) category, although if the first category is (long,3), then the second is very likely to be (orange,white), and vice versa. In other words, it would be silly to make both categories (long,3), or both categories (orange,white).

The only thing that is left to do is to specify some prior probability of forming a category. While there are various ways to do this, the most commonly used way is the Indian Buffet Process. I won’t explain it in detail here, but I might explain it later.

Future Considerations

There are still some unresolved questions here. First of all, in reality something has the potential to be in many, many categories, and it is not entirely clear how to resolve such issues in the above framework. Secondly, keeping track of all of our observations about a given category can be quite difficult computationally (updating in the above scenarios requires performing computations on high-degree polynomials), so we need efficient algorithms for dealing with all of the data we get.

I’m not sure yet how to deal with the first issue, although I’ll be thinking about it in the coming weeks. To deal with the second issue, I intend to use an approach based on streaming algorithms, which will hopefully make up a good final project for a class I’m taking this semester (Indyk and Rubinfeld’s class on Sublinear Algorithms, if you know about MIT course offerings).

Uncertain Observations

What happens when you are uncertain about observations you made? For instance, you remember something happening, but you don’t remember who did it. Or you remember some fact you read on wikipedia, but you don’t know whether it said that hydrogen or helium was used in some chemical process.

How do we take this information into account in the context of Bayes’ rule? First, I’d like to note that there are different ways something could be uncertain. It could be that you observed X, but you don’t remember if it was in state A or state B. Or it could be that you think you observed X in state A, but you aren’t sure.

These are different because in the first case you don’t know whether to concentrate probability mass towards A or B, whereas in the second case you don’t know whether to concentrate probability mass at all.

Fortunately, both cases are pretty straightforward as long as you are careful about using Bayes’ rule. However, today I am going to focus on the latter case. In fact, I will restrict my attention to the following problem:

You have a coin that has some probability \pi of coming up heads. You also know that all flips of this coin are independent. But you don’t know what \pi is. However, you have observed this coin n times in the past. But for each observation, you aren’t completely sure that this was the coin you were observing. In particular, you only assign a probability r_i to your ith observation actually being about this coin. Given this, and the sequence of heads and tails you remember, what is your estimate of \pi?

To use Bayes’ rule, let’s first figure out what we need to condition on. In this case, we need to condition on remembering the sequence of coin flips that we remembered. So we are looking for

p(\pi = \theta | we remember the given sequence of flips),

which is proportional to

p(we remember the given sequence of flips | \pi = \theta) \cdot p(\pi = \theta).

The only thing that the uncertain nature of our observations does is cause there to be multiple ways to eventually land in the set of universes where we remember the sequence of flips; in particular, for any observation we remember, it could have actually happened, or we could have incorrectly remembered it. Thus if \pi = \theta, and we remember the ith coin flip as being heads, then this could happen with probability 1-r_i if we incorrectly remembered a coin flip of heads. In the remaining probability r_i case, it could happen with probability \theta by actually coming up heads. Therefore the probability of us remembering that the ith flip was heads is (1-r_i)+r_i \theta.

A similar computation shows that the probability of us remembering that the ith flip was tails is (1-r_i)+r_i(1-\theta) = 1-r_i\theta.

For convenience of notation, let’s actually split up our remembered flips into those that were heads and those that were tails. The probability of the ith remembered heads being real is h_i, and the probability of the jth remembered tails being real is t_i. There are m heads and n tails. Then we get

\displaystyle p(\pi = \theta | \mathrm{\ our \ memory}) \propto p(\pi = \theta) \cdot \left(\prod_{i=1}^m (1-h_i)+h_i\theta \right) \cdot \left(\prod_{i=1}^n 1-t_i\theta\right).

Note that when we consider values of \theta close to 0, the term from the remembered tails becomes close to 1-\theta raised to the power of the expected number of tails, whereas the term from the remembered heads becomes close to the probability that we incorrectly remembered each of the heads. A similar phenomenon will happen when \theta gets close to 1. This is an instance of a more general phenomenon whereby unlikely observations get “explained away” by whatever means possible.

A Caveat

Applying the above model in practice can be quite tricky. The reason is that your memories are intimately tied to all sorts of events that happen to you; in particular, your assessment of how likely you are to remember an event probably already takes into account how well that event fits into your existing model. So if you saw 100 heads, and then a tails, you would place more weight than normal on your recollection of the tails being incorrect, even though that is the job of the above model. In essence, you are conditioning on your data twice — once intuitively, and once as part of the model. This is bad because it assumes that you made each observation twice as many times as you actually did.

What is interesting, though, is that you can actually compute things like the probability that you incorrectly remembered an event, given the rest of the data, and it will be different from the prior probability. So in addition to a posterior estimate of \pi, you get posterior estimates of the likelihood of each of your recollections. Just be careful not to take these posterior estimates and use them as if they were prior estimates (which, as explained above, is what we are likely to do intuitively).

There are other issues to using this in practice, as well. For instance, if you really want the coin to be fair, or unfair, or be biased in a certain direction, it is very easy to fool yourself into assigning skewed probability estimates towards each of your recollections, thus ending up with a biased answer at the end. It’s not even difficult — if I take a fair coin, and underestimate my recollection of each tails by 20%, and overestimate my recollection of each heads by 20%, then all of a sudden I “have a coin” that is 50% more likely to come up heads than tails.

Fortunately, my intended application of this model will be in a less slippery domain (hopefully). The purpose is to finally answer the question I posed in the last post, which I’ll repeat here for convenience:

Suppose that you have never played a sport before, and you play soccer, and enjoy it. Now suppose instead that you have never played a sport before, and play soccer, and hate it. In the first case, you will think yourself more likely to enjoy other sports in the future, relative to in the second case. Why is this?

Or if you disagree with the premises of the above scenario, simply “If X and Y belong to the same category C, why is it that in certain cases we think it more likely that Y will have attribute A upon observing that X has attribute A?”

In the interest of making my posts shorter, I will leave that until next time, but hopefully I’ll get to it in the next week.

Nobody Understands Probability

The goal of this post is to give an overview of Bayesian statistics as well as to correct errors about probability that even mathematically sophisticated people commonly make. Hopefully by the end of this post I will convince you that you don’t actually understand probability theory as well as you think, and that probability itself is something worth thinking about.

I will try to make this post somewhat shorter than the previous posts. As a result, this will be only the first of several posts on probability. Even though this post will be shorter, I will summarize its organization below:

  • Bayes’ theorem: the fundamentals of conditional probability
  • modeling your sources: how not to calculate conditional probabilities; the difference between “you are given X” and “you are given that you are given X”
  • how to build models: examples using toy problems
  • probabilities are statements about your beliefs (not the world)
  • re-evaluating a standard statistical test

I bolded the section on models because I think it is very important, so I hope that bolding it will make you more likely to read it.

Also, I should note that when I say that nobody understands probability, I don’t mean it in the sense that most people are bad at combinatorics. Indeed, I expect that most of the readers of this blog are quite proficient at combinatorics, and that many of them even have sophisticated mathematical definitions of probability. Rather I would say that actually using probability theory in practice is non-trivial. This is partially because there are some subtleties (or at least, I have found myself tripped up by certain points, and did not realize this until much later). It is also because whenever you use probability theory in practice, you end up employing various heuristics, and it’s not clear which ones are the “right” ones.

If you disagree with me, and think that everything about probability is trivial, then I challenge you to come up with a probability-theoretic explanation of the following phenomenon:

Suppose that you have never played a sport before, and you play soccer, and enjoy it. Now suppose instead that you have never played a sport before, and play soccer, and hate it. In the first case, you will think yourself more likely to enjoy other sports in the future, relative to in the second case. Why is this?

Or if you disagree with the premises of the above scenario, simply “If X and Y belong to the same category C, why is it that in certain cases we think it more likely that Y will have attribute A upon observing that X has attribute A?”

Bayes’ Theorem

Bayes’ theorem is a fundamental result about conditional probability. It says the following:

p(A \mid B) = \frac{p(B \mid A)p(A)}{p(B)}

Here A and B are two events, and p(A \mid B) means the probability of A conditioned on B. In other words, if we already know that B occurred, what is the probability of A? The above theorem is quite easy to prove, using the fact that p(A \cap B) = p(A \mid B)p(B), and thus also equals p(B \mid A)p(A), so that p(A \mid B)p(B) = p(B \mid A)p(A), which implies Bayes’ theorem. So, why is it useful, and how do we use it?

One example is the following famous problem: A doctor has a test for a disease that is 99% accurate. In other words, it has a 1% chance of telling you that you have a disease even if you don’t, and it has a 1% chance of telling you that you don’t have a disease even if you do. Now suppose that the disease that this tests for is extremely rare, and only affects 1 in 1,000,000 people. If the doctor performs the test on you, and it comes up positive, how likely are you to have the disease?

The answer is close to 10^{-4}, since it is roughly 10^4 times as likely for the test to come up positive due to an error in the test relative to you actually having the disease. To actually compute this with Bayes’ rule, you can say

p(Disease | Test is positive) = p(Test is positive | Disease)p(Disease)/p(Test is positive),

which comes out to \frac{0.99 \cdot 10^{-6}}{0.01 \cdot (1-10^{-6}) + 0.99 \cdot 10^{-6}}, which is quite close to 10^{-4}.

In general, we can use Bayes’ law to test hypotheses:

p(Hypothesis | Data) = p(Data | Hypothesis) p(Hypothesis) / p(Data)

Let’s consider each of these terms separately:

  • p(Hypothesis | Data) — the weight we assign to a given hypothesis being correct under our observed data
  • p(Data | Hypothesis) — the likelihood of seeing the data we saw under our hypothesis; note that this should be quite easy to compute. If it isn’t, then we haven’t yet fully specified our hypothesis.
  • p(Hypothesis) — the prior weight we give to our hypothesis. This is subjective, but should intuitively be informed by the consideration that “simpler hypotheses are better”.
  • p(Data) — how likely we are to see the data in the first place. This is quite hard to compute, as it involves considering all possible hypotheses, how likely each of those hypotheses is to be correct, and how likely the data is to occur under each hypothesis.

So, we have an expression for p(Hypothesis | Data), one of which is easy to compute, the other of which can be chosen subjectively, and the last of which is hard to compute. How do we get around the fact that p(Data) is hard to compute? Note that p(Data) is independent of which hypothesis we are testing, so Bayes’ theorem actually gives us a very good way for comparing the relative merits of two hypotheses:

p(Hypothesis 1 | Data) / p(Hypothesis 2 | Data) = [p(Data | Hypothesis 1) / p(Data | Hypothesis 2)] * p(Hypothesis 1) / p(Hypothesis 2)

Let’s consider the following toy example. There is a stream of digits going past us, too fast for us to tell what the numbers are. But we are allowed to push a button that will stop the stream and allow us to see a single number (whichever one is currently in front of us). We push this button three times, and see the numbers 3, 5, and 3. How many different numbers would we estimate are in the stream?

For simplicity, we will make the (somewhat unnatural) assumption that each number between 0 and 9 is selected to be in the stream with probability 0.5, and that each digit in the stream is chosen uniformly from the set of selected numbers. It is worth noting now that making this assumption, rather than some other assumption, will change our final answer.

Now under this assumption, what is the probability, say, of there being exactly 2 numbers (3 and 5) in the stream? By Bayes’ theorem, we have

p(\{3,5\} \mid (3,5,3)) \propto p((3,5,3) \mid \{3,5\}) p(\{3,5\}) = \left(\frac{1}{2}\right)^3 \left(\frac{1}{2}\right)^{10}.

Here \propto means “is proportional to”.

What about the probability of there being 3 numbers (3, 5, and some other number)? For any given other number, this would be

p(\{3,5,x\} \mid (3,5,3)) \propto p((3,5,3) \mid \{3,5,x\}) p(\{3,5,x\}) = \left(\frac{1}{3}\right)^3 \left(\frac{1}{2}\right)^{10}.

However, there are 8 possibilities for x above, all of which correspond to disjoint scenarios, so the probability of there being 3 numbers is proportional to 9 \left(\frac{1}{3}\right)^3 \left(\frac{1}{2}\right)^{10}. If we compare this to the probability of there being 2 numbers, we get

p(2 numbers | (3,5,3)) / p(3 numbers | (3,5,3)) = 3/8.

Even though we have only seen two numbers in our first three samples, we still think it is more likely that there are 3 numbers than 2, just because the prior likelihood of there being 3 numbers is so much higher. However, suppose that we made 6 draws, and they were 3,5,3,3,3,5. Then we would get

p(2 numbers | (3,5,3,3,5)) / p(3 numbers | (3,5,3,3,5)) = (1/2)^6 / [9 * (1/3)^6] = 81/64.

Now we find it more likely that there are only 2 numbers. This is what tends to happen in general with Bayes’ rule — over time, more restrictive hypotheses become exponentially more likely than less restrictive hypotheses, provided that they correctly explain the data. Put another way, hypotheses that concentrate probability density towards the actual observed events will do best in the long run. This is a nice feature of Bayes’ rule because it means that, even if the prior you choose is not perfect, you can still arrive at the “correct” hypothesis through enough observations (provided that the hypothesis is among the set of hypotheses you consider).

I will use Bayes’ rule extensively through the rest of this post and the next few posts, so you should make sure that you understand it. If something is unclear, post a comment and I will try to explain in more detail.

Model Your Sources

An important distinction that I think most people don’t think about is the difference between experiments you perform, and experiments you observe. To illustrate what I mean by this, I would point to the difference between biology and particle physics — where scientists set out to test a hypothesis by creating an experiment specifically designed to do so — and astrophysics and economics, where many “experiments” come from seeking out existing phenomena that can help evaluate a hypothesis.

To illustrate why one might need to be careful in the latter case, consider empirical estimates of average long-term GDP growth rate. How would one do this? Since it would be inefficient to wait around for the next 10 years and record the data of all currently existing countries, instead we go back and look at countries that kept records allowing us to compute GDP. But in this case we are only sampling from countries that kept such records, which implies a stable government as well as a reasonable degree of economics expertise within that government. So such a study almost certainly overestimates the actual average growth rate.

Or as another example, we can argue that a scientist is more likely to try to publish a paper if it doesn’t agree with prevalent theories than if it does, so looking merely at the proportion of papers that lend support to or take support away from a theory (even if weighted by the convincingness of each paper) is probably not a good way to determine the validity of a theory.

So why are we safer in the case that we forcibly gather our own data? By gathering our own data, we understand much better (although still not perfectly) the way in which it was constructed, and so there is less room for confounding parameters. In general, we would like it to be the case that the likelihood of observing something that we want to observe does not depend on anything else that we care about — or at the very least, we would like it to depend in a well-defined way.

Let’s consider an example. Suppose that a man comes up to you and says “I have two children. At least one of them is a boy.” What is the probability that they are both boys?

The standard way to solve this is as follows: Assuming that male and female children are equally likely, there is a \frac{1}{4} chance of two girls or two boys, and a \frac{1}{2} chance of having one girl and one boy. Now by Bayes’ theorem,

p(Two boys | At least one boy) = p(At least one boy | Two boys) * p(Two boys) / p(At least one boy) = 1 * (1/4) / (1/2+1/4) = 1/3.

So the answer should be 1/3 (if you did math contests in high school, this problem should look quite familiar). However, the answer is not, in fact, 1/3. Why is this? We were given that the man had at least one boy, and we just computed the probability that the man had at two boys given that he had at least one boy using Bayes’ theorem. So what’s up? Is Bayes’ theorem wrong?

No, the answer comes from an unfortunate namespace collision in the word “given”. The man “gave” us the information that he has at least one male child. By this we mean that he asserted the statement “I have at least one male child.” Now our issue is when we confuse this with being “given” that the man has at least one male child, in the sense that we should restrict to the set of universes in which the man has at least one male child. This is a very different statement than the previous one. For instance, it rules out universes where the man has two girls, but is lying to us.

Even if we decide to ignore the possibility that the man is lying, we should note that most universes where the man has at least one son don’t even involve him informing us of this fact, and so it may be the case that proportionally more universes where the man has two boys involve him telling us “I have at least one male child”, relative to the proportion of such universes where the man has one boy and one girl. In this case the probability that he has two boys would end up being greater than 1/3.

The most accurate way to parse this scenario would be to say that we are given (restricted to the set of possible universes) that we are given (the man told us that) that the man has at least one male child. The correct way to apply Bayes’ rule in this case is

p(X has two boys | X says he has >= 1 boy) = p(X says he has >= 1 boy | X has two boys) * p(X has two boys) / p(X says he has >=1 boy)

If we further assume that the man is not lying, and that male and female children are equally likely and uncorrelated, then we get

p(X has two boys | X says he has >= 1 boy) = [p(X says he has >= 1 boy | X has two boys) * 1/4]/[p(X says he has >= 1 boy | X has two boys) * 1/4 + p(X says he has >= 1 boy | X has one boy) * 1/2]

So if we think that the man is \alpha times more likely to tell us that he has at least one boy when he has two boys, then

p(X has two boys | X says he has >= 1 boy) = \frac{\alpha}{\alpha+2}.

Now this means that if we want to claim that the probability that the man has two boys is \frac{1}{3}, what we are really claiming is that he is equally likely to inform us that he has at least one boy, in all situations where it is true, independent of the actual gender distribution of his children.

I would argue that this is quite unlikely, as if he has a boy and a girl, then he could equally well have told us that he has at least one girl, whereas he couldn’t tell us that if he has only boys. So I would personally put \alpha closer to 2, which yields an answer of \frac{1}{2}. On the other hand, situations where someone walks up to me and tells me strange facts about the gender distribution of their children are, well, strange. So I would also have to take into account the likely psychology of such a person, which would end up changing my estimate of $\alpha$.

The whole point here is that, because we were an observer receiving information, rather than an experimenter acquiring information, there are all sorts of confounding factors that are difficult to estimate, making it difficult to get a good probability estimate (more on that later). That doesn’t mean that we should give up and blindly guess \frac{1}{3}, though — it might feel like doing so gets away without making unwarranted assumptions, but it in fact implicitly makes the assumption that \alpha = 1, which as discussed above is almost certainly unwarranted.

What it does mean, though, is that, as scientists, we should try to avoid situations like the one above where there are lots of confounding factors between what we care about and our observations. In particular, we should avoid uncertainties in the source of our information by collecting the information ourselves.

I should note that, even when we construct our own experiments, we should still model the source of our information. But doing so is often much easier. In fact, if we wanted to be particularly pedantic, we really need to restrict to the set of universes in which our personal consciousness receives a particular set of stimuli, but that set of stimuli has almost perfect correlation with photons hitting our eyes, which has almost perfect correlation with the set of objects in front of us, so going to such lengths is rarely necessary — we can usually stop at the level of our personal observations, as long as we remember where they come from.

How to Build Models

Now that I’ve told you that you need to model your information sources, you perhaps care about how to do said modeling. Actually, constructing probabilistic models is an extremely important skill, so even if you ignore the rest of this post, I recommend paying attention to this section.

This section will have the following examples:

  • determining if a coin is fair
  • finding clusters

Determining if a Coin is Fair

Suppose that you have occasion to observe a coin being flipped (or better yet, you flip it yourself). You do this several times and observe a particular sequence of heads and tails. If you see all heads or all tails, you will probably think the coin is unfair. If you see roughly half heads and half tails, you will probably think the coin is fair. But how do we quantify such a calculation? And what if there are noticeably many more heads than tails, but not so many as to make the coin obviously unfair?

We’ll solve this problem by building up a model in parts. First, there is the thing we care about, namely whether the coin is fair or unfair. So we will construct a random variable X that can take the values Fair and Unfair. Then p(X = Fair) is the probability we assign to a generic coin being fair, and p(X = Unfair) is the probability we assign to a generic coin being unfair.

Now supposing the coin is fair, what do we expect? We expect each flip of the coin to be independent, and have a \frac{1}{2} probability of coming up heads. So if we let F1, F2, …, Fn be the flips of the coin, then p(Fi = Heads | X = Fair) = 0.5.

What if the coin is unfair? Let’s go ahead and blindly assume that the flips will still be independent, and furthermore that each possible weight of the coin is equally likely (this is unrealistic, as weights near 0 or 1 are much more likely than weights near 0.5). Then we have to have an extra variable \theta, the probability that the unfair coin comes up heads. So we have p(Unfair coin weight = \theta) = 1. Note that this is a probability density, not an actual probability (as opposed to p(Fi = Heads | X = Fair), which was a probability).

Continuing, if F1, F2, …, Fn are the flips of the coin, then p(Fi = Heads | X = Fair, Weight = \theta) = \theta.

Now we’ve set up a model for this problem. How do we actually calculate a posterior probability of the coin being fair for a given sequence of heads and tails? (A posterior probability is just the technical term for the conditional probability of a hypothesis given a set of data; this is to distinguish it from the prior probability of the hypothesis before seeing any data.)

Well, we’ll still just use Bayes’ rule:

p(Fair | F1, …, Fn) \propto p(F1, …, Fn | Fair) p(Fair) = \left(\frac{1}{2}\right)^n p(Fair)

p(Unfair | F1, …, Fn) \propto p(F1, …, Fn | Unfair) p(Unfair) = \int_{0}^{1} \theta^{H}(1-\theta)^{T} d\theta p(Unfair)

Here H is the number of heads and T is the number of tails. In this case we can fortunately actually compute the integral in question and see that it is equal to \frac{H!T!}{(H+T+1)!}. So we get that

p(Fair | F1, …, Fn) / p(Unfair | F1, …, Fn) = p(Fair)/p(Unfair) * \frac{(H+T+1)!}{2^n H!T!}.

It is often useful to draw a diagram of our model to help keep track of it:

Now suppose that we, being specialists in determining if coins are fair, have been called in to study a large collection of coins. We get to one of the coins in the collection, flip it several times, and observe the following sequence of heads and tails:


Since there are an equal number of heads and tails, our previous analysis will certainly conclude that the coin is fair, but its behavior does seem rather suspicious. In particular, different flips don’t look like they are really independent, so perhaps our previous model is wrong. Maybe the right model is one where the next coin value is usually the same as the previous coin value, but flips with some probability. Now we have a new value of X, which we’ll call Weird, and a parameter \phi (basically the same as \theta) that tells us how likely a weird coin is to have a given probability of switching. We’ll again give \phi a uniform distribution over [0,1], so p(Switching probability of weird coin = \phi) = 1.

To predict the actual coin flips, we get p(F1 = Heads | X = Weird, Switching probability = \phi) = 1, p(F(i+1) = Heads | Fi = Heads, X = Weird, Switching probability = \phi) = 1-\phi, and p(F(i+1) = Heads | Fi = Tails, X = Weird, Switching probability = \phi) = \phi. We can represent this all with the following graphical model:

Now we are ready to evaluate whether the coin we saw was a Weird coin or not.

p(X = Weird | HHHHTTTHHTTT) \propto p(HHHHTTTHHTTT | X = Weird) p(X = Weird) = $\int_{0}^{1} \frac{1}{2}(1-\phi)^8 \phi^3 d\phi$ p(X = Weird)

Evaluating that integral gives \frac{8!3!}{2 \cdot 12!} = \frac{1}{3960}. So p(X = Weird | Data) = p(X = Weird) / 3960, compared to p(X = Fair | Data), which is p(X = Fair) / 4096. In other words, positing a Weird coin only explains the data slightly better than positing a Fair coin, and since the vast majority of coins we encounter are fair, it is quite likely that this one is, as well.

Note: I’d like to draw your attention to a particular subtlety here. Note that I referred to, for instance, “Probability that an unfair coin weight is \theta“, as opposed to “Probability that a coin weight is \theta given that it is unfair”. This really is an important distinction, because the distribution over \theta really is the probability distribution over the weights of a generic unfair coin, and this distribution doesn’t change based on whether our current coin happens to be fair or unfair. Of course, we can still condition on our coin being fair or unfair, but that won’t change the probability distribution over \theta one bit.

Finding Clusters

Now let’s suppose that we have a bunch of points (for simplicity, we’ll say in two-dimensional Euclidean space). We would like to group the points into a collection of clusters. Let’s also go ahead and assume that we know in advance that there are k clusters. How do we actually find those clusters?

We’ll make the further heuristic assumption that clusters tend to arise from a “true” version of the cluster, and some Gaussian deviation from that true version. So in other words, if we let there be k means for our clusters, \mu_1, \mu_2, \ldots, \mu_k, and multivariate Gaussians about their means with covariance matrices \Sigma_1, \Sigma_2, \ldots, \Sigma_k, and finally assume that the probability that a point belongs to cluster i is \rho_i, then the probability of a set of points \vec{x_1}, \vec{x_2}, \ldots, \vec{x_n} is

W_{\mu,\Sigma,\rho}(\vec{x}) := \prod_{i=1}^n \sum_{j=1}^k \frac{\rho_j}{2\pi \det(\Sigma_j)} e^{-\frac{1}{2}(\vec{x_i}-\mu_j)^T \Sigma_j^{-1} (\vec{x_i}-\mu_j)}

From this, once we pick probability distributions over the \Sigma, \mu, and \rho, we can calculate the posterior probability of a given set of clusters as

p(\Sigma, \mu, \rho | \vec{x}) \propto p(\Sigma) p(\mu) p(\rho) W_{\mu,\Sigma,\rho}(\vec{x})

This corresponds to the following graphical model:

Note that once we have a set of clusters, we can also determine the probability that a given point belongs to each cluster:

p(\vec{x} belongs to cluster (\Sigma, \mu, \rho)) \propto \frac{\rho}{2\pi \det(\Sigma)} e^{-\frac{1}{2}(\vec{x}-\mu)^T \Sigma^{-1} (\vec{x}-\mu)}.

You might notice, though, that in this case it is much less straightforward to actually find clusters with high posterior probability (as opposed to in the previous case, where it was quite easy to distinguish between Fair, Unfair, and Weird, and furthermore to figure out the most likely values of \theta and \phi). One reason why is that, in the previous case, we really only needed to make one-dimensional searches over \theta and \phi to figure out what the most likely values were. In this case, we need to search over all of the \Sigma_i, \mu_i, and \rho_i simultaneously, which gives us, essentially, a 3k-1-dimensional search problem, which becomes exponentially hard quite quickly.

This brings us to an important point, which is that, even if we write down a model, searching over that model can be difficult. So in addition to the model, I will go over a good algorithm for finding the clusters from this model, known as the EM algorithm. For the version of the EM algorithm described below, I will assume that we have uniform priors over \Sigma_i, \mu_i, and \rho_i (in the last case, we have to do this by picking a set of un-normalized \rho_i uniformly over \mathbb{R}^k and then normalizing). We’ll ignore the problem that it is not clear how to define a uniform distribution over a non-compact space.

The way the EM algorithm works is that we start by initializing \Sigma_i, \mu_i, and \rho_i arbitrarily. Then, given these values, we compute the probability that each point belongs to each cluster. Once we have these probabilities, we re-compute the maximum-likelihood values of the \mu_i (as the expected mean of each cluster given how likely each point is to belong to it). Then we find the maximum-likelihood values of the \Sigma_i (as the expected covariance relative to the means we just found). Finally, we find the maximum-likelihood values of the \rho_i (as the expected portion of points that belong to each cluster). We then repeat this until converging on an answer.

For a visualization of how the EM algorithm actually works, and a more detailed description of the two steps, I recommend taking a look at Josh Tenenbaum’s lecture notes starting at slide 38.


The Mind Projection Fallacy

This is perhaps a nitpicky point, but I have found that keeping it in mind has led me to better understanding what I am doing, or at least to ask interesting questions.

The point here is that people often intuitively think of probabilities as a fact about the world, when in reality probabilities are a fact about our model of the world. For instance, one might say that the probability of a child being male versus female is 0.5. And perhaps this is a good thing to say in a generic case. But we also have a much better model of gender, and we know that it is based on X and Y chromosomes. If we could look at a newly conceived ball of cells in a mother’s womb, and read off the chromosomes, then we could say with near certainty whether the child would end up being male or female.

You could also argue that I can empirically measure the probability that a person is male or female, by counting up all the people ever, and looking at the proportion of males and females. But this runs into two issues — first of all, the portion of males will be slightly off of 0.5. So how do we justify just randomly rounding off to 0.5? Or do we not?

Second of all, you can do this all you want, but it doesn’t give me any reason why I should take this information, and use it to form a conjecture about how likely the next person I meet is to be male or female. Once we do that, we are taking into account my model of the world.


This final section seeks to look at a result from classical statistics and re-interpret it in a Bayesian framework.

In particular, I’d like to consider the following strategy for rejecting a hypothesis. In abstract terms, it says that, if we have a random variable Data’ that consists of re-drawing our data assuming that our hypothesis is correct, then

p(Hypothesis) < p(p(Data’ | Hypothesis) <= p(Data | Hypothesis))

In other words, suppose that the probability of drawing data less likely (under our hypothesis) than the data we actually saw is less than \alpha. Then the likelihood of our hypothesis is at most \alpha.

Or actually, this is not quite true. But it is true that there is an algorithm that will only reject correct hypotheses with probability \alpha, and this algorithm is to reject a hypothesis when p(p(Data’ | Hypothesis) <= p(Data | Hypothesis)) < \alpha. I will leave the proof of this to you, as it is quite easy.

To illustrate this example, let’s suppose (as in a previous section) that we have a coin and would like to determine whether it is fair. In the above method, we would flip it many times, and record the number H of heads. If there is less than an \alpha chance of coming up with a less likely number of heads than H, then we can reject the hypothesis that the coin is fair with confidence 1-\alpha. For instance, if there are 80 total flips, and H = 25, then we would calculatae

\alpha = \frac{1}{2^{80}} \left(\sum_{k=0}^{25} \binom{80}{k} + \sum_{k=55}^{80} \binom{80}{k} \right).

So this seems like a pretty good test, especially if we choose \alpha to be extremely small (e.g., 10^{-100} or so). The mere fact that we reject good hypotheses with probability less than \alpha is not helpful. What we really want is to also reject bad hypotheses with a reasonably large probability. I think you can get around this by repeating the same experiment many times, though.

Of course, Bayesian statistics also can’t ever say that a hypothesis is good, but when given two hypotheses it will always say which one is better. On the other hand, Bayesian statistics has the downside that it is extremely aggressive at making inferences. It will always output an answer, even if it really doesn’t have enough data to arrive at that answer confidently.