# Uncertain Observations

What happens when you are uncertain about observations you made? For instance, you remember something happening, but you don’t remember who did it. Or you remember some fact you read on Wikipedia, but you don’t know whether it said that hydrogen or helium was used in some chemical process.

How do we take this information into account in the context of Bayes’ rule? First, I’d like to note that there are different ways something could be uncertain. It could be that you observed X, but you don’t remember if it was in state A or state B. Or it could be that you think you observed X in state A, but you aren’t sure.

These are different because in the first case you don’t know whether to concentrate probability mass towards A or B, whereas in the second case you don’t know whether to concentrate probability mass at all.

Fortunately, both cases are pretty straightforward as long as you are careful about using Bayes’ rule. However, today I am going to focus on the latter case. In fact, I will restrict my attention to the following problem:

You have a coin that has some probability $\pi$ of coming up heads, and you know that all flips of this coin are independent, but you don’t know what $\pi$ is. You have observed this coin $n$ times in the past, but for each observation you aren’t completely sure that this was the coin you were observing. In particular, you only assign a probability $r_i$ to your $i$th observation actually being about this coin. Given this, and the sequence of heads and tails you remember, what is your estimate of $\pi$?

To use Bayes’ rule, let’s first figure out what we need to condition on. In this case, we need to condition on the fact that we remember this particular sequence of coin flips. So we are looking for

p($\pi = \theta$ | we remember the given sequence of flips),

which is proportional to

p(we remember the given sequence of flips | $\pi = \theta$) $\cdot$ p($\pi = \theta$).

The only thing the uncertain nature of our observations does is create multiple ways to land in the set of universes where we remember the sequence of flips: any observation we remember could have actually happened, or we could have incorrectly remembered it. Thus if $\pi = \theta$ and we remember the $i$th coin flip as being heads, there are two ways this could have come about. With probability $1-r_i$, we incorrectly remembered a coin flip of heads. With the remaining probability $r_i$, the memory is real, and then the flip actually came up heads with probability $\theta$. Therefore the probability of us remembering that the $i$th flip was heads is $(1-r_i)+r_i \theta$.

A similar computation shows that the probability of us remembering that the $i$th flip was tails is $(1-r_i)+r_i(1-\theta) = 1-r_i\theta$.
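The two per-flip probabilities above can be written as a pair of small functions. This is only a sketch of the formulas as stated; the function names `p_remember_heads` and `p_remember_tails` are my own labels for illustration.

```python
def p_remember_heads(theta, r):
    """Probability of remembering heads, per the formula (1 - r) + r * theta.

    theta: hypothesized probability that the coin comes up heads.
    r: probability that this memory is really about this coin.
    """
    return (1 - r) + r * theta


def p_remember_tails(theta, r):
    """Probability of remembering tails: (1 - r) + r * (1 - theta) = 1 - r * theta."""
    return (1 - r) + r * (1 - theta)


if __name__ == "__main__":
    # A fully trusted memory (r = 1) gives back the ordinary coin likelihood.
    print(p_remember_heads(0.3, 1.0))
    # A fully distrusted memory (r = 0) gives 1 for every theta,
    # so it cannot shift the posterior at all.
    print(p_remember_heads(0.3, 0.0))
```

Setting `r = 1` recovers the usual likelihood $\theta$ or $1-\theta$, and setting `r = 0` makes the factor constant in $\theta$, which is the sense in which an untrusted memory carries no information about the coin.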

For convenience of notation, let’s split up our remembered flips into those that were heads and those that were tails. The probability of the $i$th remembered heads being real is $h_i$, and the probability of the $j$th remembered tails being real is $t_j$. There are $m$ remembered heads and $n$ remembered tails (so from here on $n$ counts only the tails, not all of the observations). Then we get

$\displaystyle p(\pi = \theta \mid \mathrm{\ our \ memory}) \propto p(\pi = \theta) \cdot \left(\prod_{i=1}^m \left[(1-h_i)+h_i\theta\right] \right) \cdot \left(\prod_{j=1}^n \left(1-t_j\theta\right)\right).$
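To make the posterior concrete, here is a minimal sketch that evaluates it on a grid of $\theta$ values. The uniform prior, the midpoint grid, and the names `unnormalized_posterior`, `grid_posterior`, and `posterior_mean` are all assumptions of mine for illustration; the post itself does not fix a prior or a numerical method.

```python
def unnormalized_posterior(theta, h, t, prior=lambda th: 1.0):
    """Prior times the product of per-memory factors from the formula above.

    h: reliabilities of the remembered heads.
    t: reliabilities of the remembered tails.
    prior: prior density on theta (uniform by default, an assumption).
    """
    value = prior(theta)
    for h_i in h:
        value *= (1 - h_i) + h_i * theta
    for t_j in t:
        value *= 1 - t_j * theta
    return value


def grid_posterior(h, t, steps=1000, prior=lambda th: 1.0):
    """Return grid midpoints in (0, 1) and normalized posterior weights."""
    thetas = [(k + 0.5) / steps for k in range(steps)]
    weights = [unnormalized_posterior(th, h, t, prior) for th in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]


def posterior_mean(h, t, steps=1000):
    """Posterior mean of the heads probability under the uniform prior."""
    thetas, probs = grid_posterior(h, t, steps)
    return sum(th * p for th, p in zip(thetas, probs))


if __name__ == "__main__":
    # Three remembered heads and one remembered tails, all fairly trusted.
    print(posterior_mean([0.9, 0.9, 0.9], [0.9]))
    # The same memories, trusted much less: the estimate moves toward 0.5.
    print(posterior_mean([0.3, 0.3, 0.3], [0.3]))
```

When every reliability is $1$ this reduces to the ordinary Beta posterior for a coin, which is a handy sanity check on the grid approximation.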

Note that when we consider values of $\theta$ close to $0$, the term from the remembered tails becomes close to $1-\theta$ raised to the power of the expected number of real tails, $\sum_j t_j$, whereas the term from the remembered heads becomes close to $\prod_i (1-h_i)$, the probability that we incorrectly remembered every one of the heads. A similar phenomenon will happen when $\theta$ gets close to $1$. This is an instance of a more general phenomenon whereby unlikely observations get “explained away” by whatever means possible.
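The small-$\theta$ behavior is easy to check numerically. The sketch below compares the exact tails and heads terms against the approximations described above; the reliabilities are made-up illustrative values, and `tails_term` and `heads_term` are hypothetical helper names.

```python
import math


def tails_term(theta, t):
    """Exact product over remembered tails: prod_j (1 - t_j * theta)."""
    return math.prod(1 - t_j * theta for t_j in t)


def heads_term(theta, h):
    """Exact product over remembered heads: prod_i ((1 - h_i) + h_i * theta)."""
    return math.prod((1 - h_i) + h_i * theta for h_i in h)


if __name__ == "__main__":
    t = [0.9, 0.5, 0.7]  # made-up reliabilities for three remembered tails
    h = [0.8, 0.6]       # made-up reliabilities for two remembered heads
    theta = 0.01

    # Tails: compare against (1 - theta) ** (expected number of real tails).
    print(tails_term(theta, t), (1 - theta) ** sum(t))
    # Heads: compare against the chance that every heads memory was wrong.
    print(heads_term(theta, h), math.prod(1 - h_i for h_i in h))
```

Both pairs of numbers should agree more closely as `theta` shrinks toward $0$.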

## A Caveat

Applying the above model in practice can be quite tricky. The reason is that your memories are intimately tied to all sorts of events that happen to you; in particular, your assessment of how likely a memory is to be correct probably already takes into account how well that event fits into your existing model. So if you saw 100 heads, and then a tails, you would place more weight than normal on your recollection of the tails being incorrect, even though that is the job of the above model. In essence, you are conditioning on your data twice: once intuitively, and once as part of the model. This is bad because it counts the same evidence twice, as if you had made each observation twice as many times as you actually did.

What is interesting, though, is that you can actually compute things like the probability that you incorrectly remembered an event, given the rest of the data, and it will be different from the prior probability. So in addition to a posterior estimate of $\pi$, you get posterior estimates of how likely each of your recollections is to be correct. Just be careful not to take these posterior estimates and use them as if they were prior estimates (which, as explained above, is what we are likely to do intuitively).
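As a rough sketch of that computation: for a fixed $\theta$, a remembered tails with prior reliability $t_j$ is real with probability $t_j(1-\theta)/(1-t_j\theta)$, and averaging this over the posterior on $\theta$ gives the posterior reliability of that memory. The uniform prior, the grid approximation, and the function names below are my own assumptions for illustration.

```python
def posterior_weights(h, t, steps=2000):
    """Grid midpoints in (0, 1) and normalized posterior weights (uniform prior)."""
    thetas = [(k + 0.5) / steps for k in range(steps)]
    weights = []
    for th in thetas:
        w = 1.0
        for h_i in h:
            w *= (1 - h_i) + h_i * th
        for t_j in t:
            w *= 1 - t_j * th
        weights.append(w)
    total = sum(weights)
    return thetas, [w / total for w in weights]


def prob_tails_memory_real(j, h, t, steps=2000):
    """Posterior probability that the j-th remembered tails is a real observation.

    For each theta, P(real | remembered tails, theta) = t_j (1 - theta) / (1 - t_j theta);
    this is then averaged over the posterior on theta.
    """
    thetas, probs = posterior_weights(h, t, steps)
    t_j = t[j]
    return sum(p * t_j * (1 - th) / (1 - t_j * th) for th, p in zip(thetas, probs))


if __name__ == "__main__":
    # Twenty remembered heads and a single remembered tails, all with prior
    # reliability 0.9: the lone tails ends up trusted less than 0.9.
    print(prob_tails_memory_real(0, [0.9] * 20, [0.9]))
```

This posterior reliability is an output of the model, which is exactly why feeding it back in as a prior reliability would count the evidence twice.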

There are other issues with using this in practice, as well. For instance, if you really want the coin to be fair, or unfair, or be biased in a certain direction, it is very easy to fool yourself into assigning skewed probability estimates to each of your recollections, thus ending up with a biased answer at the end. It’s not even difficult: if I take a fair coin, and underestimate the reliability of my recollection of each tails by 20%, and overestimate the reliability of my recollection of each heads by 20%, then all of a sudden I “have a coin” that is 50% more likely to come up heads than tails.
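One way to read that arithmetic is in terms of expected counts of real observations, $\sum_i h_i$ and $\sum_j t_j$. The sketch below assumes equal numbers of remembered heads and tails with a common true reliability of $0.5$; both the baseline and the function name are illustrative assumptions, not something the argument depends on.

```python
def skewed_heads_to_tails_ratio(n_heads, n_tails, reliability, skew):
    """Ratio of expected real heads to expected real tails after skewing.

    Heads reliabilities are inflated by a factor (1 + skew) and tails
    reliabilities are deflated by a factor (1 - skew).
    """
    expected_heads = n_heads * reliability * (1 + skew)
    expected_tails = n_tails * reliability * (1 - skew)
    return expected_heads / expected_tails


if __name__ == "__main__":
    # Fair coin: 50 remembered heads, 50 remembered tails, true reliability 0.5,
    # with a 20% skew in each direction.
    print(skewed_heads_to_tails_ratio(50, 50, 0.5, 0.2))
```

The ratio is $1.2 / 0.8$ regardless of the baseline reliability or the number of flips, which is where the 50% figure comes from.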

Fortunately, my intended application of this model will be in a less slippery domain (hopefully). The purpose is to finally answer the question I posed in the last post, which I’ll repeat here for convenience:

Suppose that you have never played a sport before, and you play soccer, and enjoy it. Now suppose instead that you have never played a sport before, and play soccer, and hate it. In the first case, you will think yourself more likely to enjoy other sports in the future, relative to in the second case. Why is this?

Or if you disagree with the premises of the above scenario, simply “If X and Y belong to the same category C, why is it that in certain cases we think it more likely that Y will have attribute A upon observing that X has attribute A?”

In the interest of making my posts shorter, I will leave that until next time, but hopefully I’ll get to it in the next week.