A Fervent Defense of Frequentist Statistics

[Highlights for the busy: de-bunking standard “Bayes is optimal” arguments; frequentist Solomonoff induction; and a description of the online learning framework.]

Short summary. This essay makes many points, each of which I think is worth reading, but if you are only going to understand one point I think it should be “Myth 5″ below, which describes the online learning framework as a response to the claim that frequentist methods need to make strong modeling assumptions. Among other things, online learning allows me to perform the following remarkable feat: if I’m betting on horses, and I get to place bets after watching other people bet but before seeing which horse wins the race, then I can guarantee that after a relatively small number of races, I will do almost as well overall as the best other person, even if the number of other people is very large (say, 1 billion), and their performance is correlated in complicated ways.

If you’re only going to understand two points, then also read about the frequentist version of Solomonoff induction, which is described in “Myth 6″.

Main article. I’ve already written one essay on Bayesian vs. frequentist statistics. In that essay, I argued for a balanced, pragmatic approach in which we think of the two families of methods as a collection of tools to be used as appropriate. Since I’m currently feeling contrarian, this essay will be far less balanced and will argue explicitly against Bayesian methods and in favor of frequentist methods. I hope this will be forgiven as so much other writing goes in the opposite direction of unabashedly defending Bayes. I should note that this essay is partially inspired by some of Cosma Shalizi’s blog posts, such as this one.

This essay will start by listing a series of myths, then debunk them one-by-one. My main motivation for this is that Bayesian approaches seem to be highly popularized, to the point that one may get the impression that they are the uncontroversially superior method of doing statistics. I actually think the opposite is true: I think most statisticans would for the most part defend frequentist methods, although there are also many departments that are decidedly Bayesian (e.g. many places in England, as well as some U.S. universities like Columbia). I have a lot of respect for many of the people at these universities, such as Andrew Gelman and Philip Dawid, but I worry that many of the other proponents of Bayes (most of them non-statisticians) tend to oversell Bayesian methods or undersell alternative methodologies.

If you are like me from, say, two years ago, you are firmly convinced that Bayesian methods are superior and that you have knockdown arguments in favor of this. If this is the case, then I hope this essay will give you an experience that I myself found life-altering: the experience of having a way of thinking that seemed unquestionably true slowly dissolve into just one of many imperfect models of reality. This experience helped me gain more explicit appreciation for the skill of viewing the world from many different angles, and of distinguishing between a very successful paradigm and reality.

If you are not like me, then you may have had the experience of bringing up one of many reasonable objections to normative Bayesian epistemology, and having it shot down by one of many “standard” arguments that seem wrong but not for easy-to-articulate reasons. I hope to lend some reprieve to those of you in this camp, by providing a collection of “standard” replies to these standard arguments.

I will start with the myths (and responses) that I think will require the least technical background and be most interesting to a general audience. Toward the end, I deal with some attacks on frequentist methods that I believe amount to technical claims that are demonstrably false; doing so involves more math. Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method, even though I think there’s actually a fair amount of variation among these methods, although also a fair amount of overlap (e.g. I’m throwing in statistical learning theory with minimax estimation, which certainly have a lot of overlap in ideas but were also in some sense developed by different communities).

The Myths:

  • Bayesian methods are optimal.
  • Bayesian methods are optimal except for computational considerations.
  • We can deal with computational constraints simply by making approximations to Bayes.
  • The prior isn’t a big deal because Bayesians can always share likelihood ratios.
  • Frequentist methods need to assume their model is correct, or that the data are i.i.d.
  • Frequentist methods can only deal with simple models, and make arbitrary cutoffs in model complexity (aka: “I’m Bayesian because I want to do Solomonoff induction”).
  • Frequentist methods hide their assumptions while Bayesian methods make assumptions explicit.
  • Frequentist methods are fragile, Bayesian methods are robust.
  • Frequentist methods are responsible for bad science
  • Frequentist methods are unprincipled/hacky.
  • Frequentist methods have no promising approach to computationally bounded inference.

Myth 1: Bayesian methods are optimal. Presumably when most people say this they are thinking of either Dutch-booking or the complete class theorem. Roughly what these say are the following:

Dutch-book argument: Every coherent set of beliefs can be modeled as a subjective probability distribution. (Roughly, coherent means “unable to be Dutch-booked”.)

Complete class theorem: Every non-Bayesian method is worse than some Bayesian method (in the sense of performing deterministically at least as poorly in every possible world).

Let’s unpack both of these. My high-level argument regarding Dutch books is that I would much rather spend my time trying to correspond with reality than trying to be internally consistent. More concretely, the Dutch-book argument says that if for every bet you force me to take one side or the other, then unless I’m Bayesian there’s a collection of bets that will cause me to lose money for sure. I don’t find this very compelling. This seems analogous to the situation where there’s some quant at Jane Street, and they’re about to run code that will make thousands of dollars trading stocks, and someone comes up to them and says “Wait! You should add checks to your code to make sure that no subset of your trades will lose you money!” This just doesn’t seem worth the quant’s time, it will slow down the code substantially, and instead the quant should be writing the next program to make thousands more dollars. This is basically what dutch-booking arguments seem like to me.

Moving on, the complete class theorem says that for any decision rule, I can do better by replacing it with some Bayesian decision rule. But this injunction is not useful in practice, because it doesn’t say anything about which decision rule I should replace it with. Of course, if you hand me a decision rule and give me infinite computational resources, then I can hand you back a Bayesian method that will perform better. But it still might not perform well. All the complete class theorem says is that every local optimum is Bayesan. To be a useful theory of epistemology, I need a prescription for how, in the first place, I am to arrive at a good decision rule, not just a locally optimal one. And this is something that frequentist methods do provide, to a far greater extent than Bayesian methods (for instance by using minimax decision rules such as the maximum-entropy example given later). Note also that many frequentist methods do correspond to a Bayesian method for some appropriately chosen prior. But the crucial point is that the frequentist told me how to pick a prior I would be happy with (also, many frequentist methods don’t correspond to a Bayesian method for any choice of prior; they nevertheless often perform quite well).

Myth 2: Bayesian methods are optimal except for computational considerations. We already covered this in the previous point under the complete class theorem, but to re-iterate: Bayesian methods are locally optimal, not global optimal. Identifying all the local optima is very different from knowing which of them is the global optimum. I would much rather have someone hand me something that wasn’t a local optimum but was close to the global optimum, than something that was a local optimum but was far from the global optimum.

Myth 3: We can deal with computational constraints simply by making approximations to Bayes. I have rarely seen this born out in practice. Here’s a challenge: suppose I give you data generated in the following way. There are a collection of vectors {x_1}, {x_2}, {\ldots}, {x_{10,000}}, each with {10^6} coordinates. I generate outputs {y_1}, {y_2}, {\ldots}, {y_{10,000}} in the following way. First I globally select {100} of the {10^6} coordinates uniformly at random, then I select a fixed vector {u} such that those {100} coordinates are drawn from i.i.d. Gaussians and the rest of the coordinates are zero. Now I set {x_n = u^{\top}y_n} (i.e. {x_n} is the dot product of {u} with {y_n}). You are given {x} and {y}, and your job is to infer {u}. This is a completely well-specified problem, the only task remaining is computational. I know people who have solved this problem using Bayesan methods with approximate inference. I have respect for these people, because doing so is no easy task. I think very few of them would say that “we can just approximate Bayesian updating and be fine”. (Also, this particular problem can be solved trivially with frequentist methods.)

A particularly egregious example of this is when people talk about “computable approximations to Solomonoff induction” or “computable approximations to AIXI” as if such notions were meaningful.

Myth 4: the prior isn’t a big deal because Bayesians can always share likelihood ratios. Putting aside the practical issue that there would in general be an infinite number of likelihood ratios to share, there is the larger issue that for any hypothesis {h}, there is also the hypothesis {h'} that matches {h} exactly up to now, and then predicts the opposite of {h} at all points in the future. You have to constrain model complexity at some point, the question is about how. To put this another way, sharing my likelihood ratios without also constraining model complexity (by focusing on a subset of all logically possible hypotheses) would be equivalent to just sharing all sensory data I’ve ever accrued in my life. To the extent that such a notion is even possible, I certainly don’t need to be a Bayesian to do such a thing.

Myth 5: frequentist methods need to assume their model is correct or that the data are i.i.d. Understanding the content of this section is the most important single insight to gain from this essay. For some reason it’s assumed that frequentist methods need to make strong assumptions (such as Gaussianity), whereas Bayesian methods are somehow immune to this. In reality, the opposite is true. While there are many beautiful and deep frequentist formalisms that answer this, I will choose to focus on one of my favorite, which is online learning.

To explain the online learning framework, let us suppose that our data are {(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)}. We don’t observe {y_t} until after making a prediction {z_t} of what {y_t} will be, and then we receive a penalty {L(y_t, z_t)} based on how incorrect we were. So we can think of this as receiving prediction problems one-by-one, and in particular we make no assumptions about the relationship between the different problems; they could be i.i.d., they could be positively correlated, they could be anti-correlated, they could even be adversarially chosen.

As a running example, suppose that I’m betting on horses and before each race there are {n} other people who give me advice on which horse to bet on. I know nothing about horses, so based on this advice I’d like to devise a good betting strategy. In this case, {x_t} would be the {n} bets that each of the other people recommend, {z_t} would be the horse that I actually bet on, and {y_t} would be the horse that actually wins the race. Then, supposing that {y_t = z_t} (i.e., the horse I bet on actually wins), {L(y_t, z_t)} is the negative of the payoff from correctly betting on that horse. Otherwise, if the horse I bet on doesn’t win, {L(y_t, z_t)} is the cost I had to pay to place the bet.

If I’m in this setting, what guarantee can I hope for? I might ask for an algorithm that is guaranteed to make good bets — but this seems impossible unless the people advising me actually know something about horses. Or, at the very least, one of the people advising me knows something. Motivated by this, I define my regret to be the difference between my penalty and the penalty of the best of the {n} people (note that I only have access to the latter after all {T} rounds of betting). More formally, given a class {\mathcal{M}} of predictors {h : x \mapsto z}, I define

\displaystyle \mathrm{Regret}(T) = \frac{1}{T} \sum_{t=1}^T L(y_t, z_t) - \min_{h \in \mathcal{M}} \frac{1}{T} \sum_{t=1}^T L(y_t, h(x_t))

In this case, {\mathcal{M}} would have size {n} and the {i}th predictor would just always follow the advice of person {i}. The regret is then how much worse I do on average than the best expert. A remarkable fact is that, in this case, there is a strategy such that {\mathrm{Regret}(T)} shrinks at a rate of {\sqrt{\frac{\log(n)}{T}}}. In other words, I can have an average score within {\epsilon} of the best advisor after {\frac{\log(n)}{\epsilon^2}} rounds of betting.

One reason that this is remarkable is that it does not depend at all on how the data are distributed; the data could be i.i.d., positively correlated, negatively correlated, even adversarial, and one can still construct an (adaptive) prediction rule that does almost as well as the best predictor in the family.

To be even more concrete, if we assume that all costs and payoffs are bounded by {\$1} per round, and that there are {1,000,000,000} people in total, then an explicit upper bound is that after {28/\epsilon^2} rounds, we will be within {\epsilon} dollars on average of the best other person. Under slightly stronger assumptions, we can do even better, for instance if the best person has an average variance of {0.1} about their mean, then the {28} can be replaced with {4.5}.

It is important to note that the betting scenario is just a running example, and one can still obtain regret bounds under fairly general scenarios; {\mathcal{M}} could be continuous and {L} could have quite general structure; the only technical assumption is that {\mathcal{M}} be a convex set and that {L} be a convex function of {z}. These assumptions tend to be easy to satisfy, though I have run into a few situations where they end up being problematic, mainly for computational reasons. For an {n}-dimensional model family, typically {\mathrm{Regret}(T)} decreases at a rate of {\sqrt{\frac{n}{T}}}, although under additional assumptions this can be reduced to {\sqrt{\frac{\log(n)}{T}}}, as in the betting example above. I would consider this reduction to be one of the crowning results of modern frequentist statistics.

Yes, these guarantees sound incredibly awesome and perhaps too good to be true. They actually are that awesome, and they are actually true. The work is being done by measuring the error relative to the best model in the model family. We aren’t required to do well in an absolute sense, we just need to not do any worse than the best model. Of as long as at least one of the models in our family makes good predictions, that means we will as well. This is really what statistics is meant to be doing: you come up with everything you imagine could possibly be reasonable, and hand it to me, and then I come up with an algorithm that will figure out which of the things you handed me was most reasonable, and will do almost as well as that. As long as at least one of the things you come up with is good, then my algorithm will do well. Importantly, due to the {\log(n)} dependence on the dimension of the model family, you can actually write down extremely broad classes of models and I will still successfully sift through them.

Let me stress again: regret bounds are saying that, no matter how the {x_t} and {y_t} are related, no i.i.d. assumptions anywhere in sight, we will do almost as well as any predictor {h} in {\mathcal{M}} (in particular, almost as well as the best predictor).

Myth 6: frequentist methods can only deal with simple models and need to make arbitrary cutoffs in model complexity. A naive perusal of the literature might lead one to believe that frequentists only ever consider very simple models, because many discussions center on linear and log-linear models. To dispel this, I will first note that there are just as many discussions that focus on much more general properties such as convexity and smoothness, and that can achieve comparably good bounds in many cases. But more importantly, the reason we focus so much on linear models is because we have already reduced a large family of problems to (log-)linear regression. The key insight, and I think one of the most important insights in all of applied mathematics, is that of featurization: given a non-linear problem, we can often embed it into a higher-dimensional linear problem, via a feature map {\phi : X \rightarrow \mathbb{R}^n} ({\mathbb{R}^n} denotes {n}-dimensional space, i.e. vectors of real numbers of length {n}). For instance, if I think that {y} is a polynomial (say cubic) function of {x}, I can apply the mapping {\phi(x) = (1, x, x^2, x^3)}, and now look for a linear relationship between {y} and {\phi(x)}.

This insight extends far beyond polynomials. In combinatorial domains such as natural language, it is common to use indicator features: features that are {1} if a certain event occurs and {0} otherwise. For instance, I might have an indicator feature for whether two words appear consecutively in a sentence, whether two parts of speech are adjacent in a syntax tree, or for what part of speech a word has. Almost all state of the art systems in natural language processing work by solving a relatively simple regression task (typically either log-linear or max-margin) over a rich feature space (often involving hundreds of thousands or millions of features, i.e. an embedding into {\mathbb{R}^{10^5}} or {\mathbb{R}^{10^6}}).

A counter-argument to the previous point could be: “Sure, you could create a high-dimensional family of models, but it’s still a parameterized family. I don’t want to be stuck with a parameterized family, I want my family to include all Turing machines!” Putting aside for a second the question of whether “all Turing machines” is a well-advised model choice, this is something that a frequentist approach can handle just fine, using a tool called regularization, which after featurization is the second most important idea in statistics.

Specifically, given any sufficiently quickly growing function {\psi(h)}, one can show that, given {T} data points, there is a strategy whose average error is at most {\sqrt{\frac{\psi(h)}{T}}} worse than any estimator {h}. This can hold even if the model class {\mathcal{M}} is infinite dimensional. For instance, if {\mathcal{M}} consists of all probability distributions over Turing machines, and we let {h_i} denote the probability mass placed on the {i}th Turing machine, then a valid regularizer {\psi} would be

\displaystyle \psi(h) = \sum_i h_i \log(i^2 \cdot h_i)

If we consider this, then we see that, for any probability distribution over the first {2^k} Turing machines (i.e. all Turing machines with description length {\leq k}), the value of {\psi} is at most {\log((2^k)^2) = k\log(4)}. (Here we use the fact that {\psi(h) \geq \sum_i h_i \log(i^2)}, since {h_i \leq 1} and hence {h_i\log(h_i) \leq 0}.) This means that, if we receive roughly {\frac{k}{\epsilon^2}} data, we will achieve error within {\epsilon} of the best Turing machine that has description length {\leq k}.

Let me note several things here:

  • This strategy makes no assumptions about the data being i.i.d. It doesn’t even assume that the data are computable. It just guarantees that it will perform as well as any Turing machine (or distribution over Turing machines) given the appropriate amount of data.
  • This guarantee holds for any given sufficiently smooth measurement of prediction error (the update strategy depends on the particular error measure).
  • This guarantee holds deterministically, no randomness required (although predictions may need to consist of probability distributions rather than specific points, but this is also true of Bayesian predictions).

Interestingly, in the case that the prediction error is given by the negative log probability assigned to the truth, then the corresponding strategy that achieves the error bound is just normal Bayesian updating. But for other measurements of error, we get different update strategies. Although I haven’t worked out the math, intuitively this difference could be important if the universe is fundamentally unpredictable but our notion of error is insensitive to the unpredictable aspects.

Myth 7: frequentist methods hide their assumptions while Bayesian methods make assumptions explicit. I’m still not really sure where this came from. As we’ve seen numerous times so far, a very common flavor among frequentist methods is the following: I have a model class {\mathcal{M}}, I want to do as well as any model in {\mathcal{M}}; or put another way:

Assumption: At least one model in {\mathcal{M}} has error at most {E}.
Guarantee: My method will have error at most {E + \epsilon}.

This seems like a very explicit assumption with a very explicit guarantee. On the other hand, an argument I hear is that Bayesian methods make their assumptions explicit because they have an explicit prior. If I were to write this as an assumption and guarantee, I would write:

Assumption: The data were generated from the prior.
Guarantee: I will perform at least as well as any other method.

While I agree that this is an assumption and guarantee of Bayesian methods, there are two problems that I have with drawing the conclusion that “Bayesian methods make their assumptions explicit”. The first is that it can often be very difficult to understand how a prior behaves; so while we could say “The data were generated from the prior” is an explicit assumption, it may be unclear what exactly that assumption entails. However, a bigger issue is that “The data were generated from the prior” is an assumption that very rarely holds; indeed, in many cases the underlying process is deterministic (if you’re a subjective Bayesian then this isn’t necessarily a problem, but it does certainly mean that the assumption given above doesn’t hold). So given that that assumption doesn’t hold but Bayesian methods still often perform well in practice, I would say that Bayesian methods are making some other sort of “assumption” that is far less explicit (indeed, I would be very interested in understanding what this other, more nebulous assumption might be).

Myth 8: frequentist methods are fragile, Bayesian methods are robust. This is another one that’s straightforwardly false. First, since frequentist methods often rest on weaker assumptions they are more robust if the assumptions don’t quite hold. Secondly, there is an entire area of robust statistics, which focuses on being robust to adversarial errors in the problem data.

Myth 9: frequentist methods are responsible for bad science. I will concede that much bad science is done using frequentist statistics. But this is true only because pretty much all science is done using frequentist statistics. I’ve heard arguments that using Bayesian methods instead of frequentist methods would fix at least some of the problems with science. I don’t think this is particularly likely, as I think many of the problems come from mis-application of statistical tools or from failure to control for multiple hypotheses. If anything, Bayesian methods would exacerbate the former, because they often require more detailed modeling (although in most simple cases the difference doesn’t matter at all). I don’t think being Bayesian guards against multiple hypothesis testing. Yes, in some sense a prior “controls for multiple hypotheses”, but in general the issue is that the “multiple hypotheses” are never written down in the first place, or are written down and then discarded. One could argue that being in the habit of writing down a prior might make practitioners more likely to think about multiple hypotheses, but I’m not sure this is the first-order thing to worry about.

Myth 10: frequentist methods are unprincipled / hacky. One of the most beautiful theoretical paradigms that I can think of is what I could call the “geometric view of statistics”. One place that does a particularly good job of show-casing this is Shai Shalev-Shwartz’s PhD thesis, which was so beautiful that I cried when I read it. I’ll try (probably futilely) to convey a tiny amount of the intuition and beauty of this paradigm in the next few paragraphs, although focusing on minimax estimation, rather than online learning as in Shai’s thesis.

The geometric paradigm tends to emphasize a view of measurements (i.e. empirical expected values over observed data) as “noisy” linear constraints on a model family. We can control the noise by either taking few enough measurements that the total error from the noise is small (classical statistics), or by broadening the linear constraints to convex constraints (robust statistics), or by controlling the Lagrange multipliers on the constraints (regularization). One particularly beautiful result in this vein is the duality between maximum entropy and maximum likelihood. (I can already predict the Jaynesians trying to claim this result for their camp, but (i) Jaynes did not invent maximum entropy; (ii) maximum entropy is not particularly Bayesian (in the sense that frequentists use it as well); and (iii) the view on maximum entropy that I’m about to provide is different from the view given in Jaynes or by physicists in general [edit: EHeller thinks this last claim is questionable, see discussion here].)

To understand the duality mentioned above, suppose that we have a probability distribution {p(x)} and the only information we have about it is the expected value of a certain number of functions, i.e. the information that {\mathbb{E}[\phi(x)] = \phi^*}, where the expectation is taken with respect to {p(x)}. We are interested in constructing a probability distribution {q(x)} such that no matter what particular value {p(x)} takes, {q(x)} will still make good predictions. In other words (taking {\log p(x)} as our measurement of prediction accuracy) we want {\mathbb{E}_{p'}[\log q(x)]} to be large for all distributions {p'} such that {\mathbb{E}_{p'}[\phi(x)] = \phi^*}. Using a technique called Lagrangian duality, we can both find the optimal distribution {q} and compute its worse-case accuracy over all {p'} with {\mathbb{E}_{p'}[\phi(x)] = \phi^*}. The characterization is as follows: consider all probability distributions {q(x)} that are proportional to {\exp(\lambda^{\top}\phi(x))} for some vector {\lambda}, i.e. {q(x) = \exp(\lambda^{\top}\phi(x))/Z(\lambda)} for some {Z(\lambda)}. Of all of these, take the q(x) with the largest value of {\lambda^{\top}\phi^* - \log Z(\lambda)}. Then {q(x)} will be the optimal distribution and the accuracy for all distributions {p'} will be exactly {\lambda^{\top}\phi^* - \log Z(\lambda)}. Furthermore, if {\phi^*} is the empirical expectation given some number of samples, then one can show that {\lambda^{\top}\phi^* - \log Z(\lambda)} is propotional to the log likelihood of {q}, which is why I say that maximum entropy and maximum likelihood are dual to each other.

This is a relatively simple result but it underlies a decent chunk of models used in practice.

Myth 11: frequentist methods have no promising approach to computationally bounded inference. I would personally argue that frequentist methods are more promising than Bayesian methods at handling computational constraints, although computationally bounded inference is a very cutting edge area and I’m sure other experts would disagree. However, one point in favor of the frequentist approach here is that we already have some frameworks, such as the “tightening relaxations” framework discussed here, that provide quite elegant and rigorous ways of handling computationally intractable models.

References

(Myth 3) Sparse recovery: Sparse recovery using sparse matrices
(Myth 5) Online learning: Online learning and online convex optimization
(Myth 8) Robust statistics: see this blog post and the two linked papers
(Myth 10) Maximum entropy duality: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory