## A Fervent Defense of Frequentist Statistics

[Highlights for the busy: de-bunking standard "Bayes is optimal" arguments; frequentist Solomonoff induction; and a description of the online learning framework.]

Short summary. This essay makes many points, each of which I think is worth reading, but if you are only going to understand one point I think it should be “Myth 5″ below, which describes the online learning framework as a response to the claim that frequentist methods need to make strong modeling assumptions. Among other things, online learning allows me to perform the following remarkable feat: if I’m betting on horses, and I get to place bets after watching other people bet but before seeing which horse wins the race, then I can guarantee that after a relatively small number of races, I will do almost as well overall as the best other person, even if the number of other people is very large (say, 1 billion), and their performance is correlated in complicated ways.

If you’re only going to understand two points, then also read about the frequentist version of Solomonoff induction, which is described in “Myth 6″.

Main article. I’ve already written one essay on Bayesian vs. frequentist statistics. In that essay, I argued for a balanced, pragmatic approach in which we think of the two families of methods as a collection of tools to be used as appropriate. Since I’m currently feeling contrarian, this essay will be far less balanced and will argue explicitly against Bayesian methods and in favor of frequentist methods. I hope this will be forgiven as so much other writing goes in the opposite direction of unabashedly defending Bayes. I should note that this essay is partially inspired by some of Cosma Shalizi’s blog posts, such as this one.

This essay will start by listing a series of myths, then debunk them one-by-one. My main motivation for this is that Bayesian approaches seem to be highly popularized, to the point that one may get the impression that they are the uncontroversially superior method of doing statistics. I actually think the opposite is true: I think most statisticans would for the most part defend frequentist methods, although there are also many departments that are decidedly Bayesian (e.g. many places in England, as well as some U.S. universities like Columbia). I have a lot of respect for many of the people at these universities, such as Andrew Gelman and Philip Dawid, but I worry that many of the other proponents of Bayes (most of them non-statisticians) tend to oversell Bayesian methods or undersell alternative methodologies.

If you are like me from, say, two years ago, you are firmly convinced that Bayesian methods are superior and that you have knockdown arguments in favor of this. If this is the case, then I hope this essay will give you an experience that I myself found life-altering: the experience of having a way of thinking that seemed unquestionably true slowly dissolve into just one of many imperfect models of reality. This experience helped me gain more explicit appreciation for the skill of viewing the world from many different angles, and of distinguishing between a very successful paradigm and reality.

If you are not like me, then you may have had the experience of bringing up one of many reasonable objections to normative Bayesian epistemology, and having it shot down by one of many “standard” arguments that seem wrong but not for easy-to-articulate reasons. I hope to lend some reprieve to those of you in this camp, by providing a collection of “standard” replies to these standard arguments.

I will start with the myths (and responses) that I think will require the least technical background and be most interesting to a general audience. Toward the end, I deal with some attacks on frequentist methods that I believe amount to technical claims that are demonstrably false; doing so involves more math. Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method, even though I think there’s actually a fair amount of variation among these methods, although also a fair amount of overlap (e.g. I’m throwing in statistical learning theory with minimax estimation, which certainly have a lot of overlap in ideas but were also in some sense developed by different communities).

The Myths:

• Bayesian methods are optimal.
• Bayesian methods are optimal except for computational considerations.
• We can deal with computational constraints simply by making approximations to Bayes.
• The prior isn’t a big deal because Bayesians can always share likelihood ratios.
• Frequentist methods need to assume their model is correct, or that the data are i.i.d.
• Frequentist methods can only deal with simple models, and make arbitrary cutoffs in model complexity (aka: “I’m Bayesian because I want to do Solomonoff induction”).
• Frequentist methods hide their assumptions while Bayesian methods make assumptions explicit.
• Frequentist methods are fragile, Bayesian methods are robust.
• Frequentist methods are responsible for bad science
• Frequentist methods are unprincipled/hacky.
• Frequentist methods have no promising approach to computationally bounded inference.

Myth 1: Bayesian methods are optimal. Presumably when most people say this they are thinking of either Dutch-booking or the complete class theorem. Roughly what these say are the following:

Dutch-book argument: Every coherent set of beliefs can be modeled as a subjective probability distribution. (Roughly, coherent means “unable to be Dutch-booked”.)

Complete class theorem: Every non-Bayesian method is worse than some Bayesian method (in the sense of performing deterministically at least as poorly in every possible world).

Let’s unpack both of these. My high-level argument regarding Dutch books is that I would much rather spend my time trying to correspond with reality than trying to be internally consistent. More concretely, the Dutch-book argument says that if for every bet you force me to take one side or the other, then unless I’m Bayesian there’s a collection of bets that will cause me to lose money for sure. I don’t find this very compelling. This seems analogous to the situation where there’s some quant at Jane Street, and they’re about to run code that will make thousands of dollars trading stocks, and someone comes up to them and says “Wait! You should add checks to your code to make sure that no subset of your trades will lose you money!” This just doesn’t seem worth the quant’s time, it will slow down the code substantially, and instead the quant should be writing the next program to make thousands more dollars. This is basically what dutch-booking arguments seem like to me.

Moving on, the complete class theorem says that for any decision rule, I can do better by replacing it with some Bayesian decision rule. But this injunction is not useful in practice, because it doesn’t say anything about which decision rule I should replace it with. Of course, if you hand me a decision rule and give me infinite computational resources, then I can hand you back a Bayesian method that will perform better. But it still might not perform well. All the complete class theorem says is that every local optimum is Bayesan. To be a useful theory of epistemology, I need a prescription for how, in the first place, I am to arrive at a good decision rule, not just a locally optimal one. And this is something that frequentist methods do provide, to a far greater extent than Bayesian methods (for instance by using minimax decision rules such as the maximum-entropy example given later). Note also that many frequentist methods do correspond to a Bayesian method for some appropriately chosen prior. But the crucial point is that the frequentist told me how to pick a prior I would be happy with (also, many frequentist methods don’t correspond to a Bayesian method for any choice of prior; they nevertheless often perform quite well).

Myth 2: Bayesian methods are optimal except for computational considerations. We already covered this in the previous point under the complete class theorem, but to re-iterate: Bayesian methods are locally optimal, not global optimal. Identifying all the local optima is very different from knowing which of them is the global optimum. I would much rather have someone hand me something that wasn’t a local optimum but was close to the global optimum, than something that was a local optimum but was far from the global optimum.

Myth 3: We can deal with computational constraints simply by making approximations to Bayes. I have rarely seen this born out in practice. Here’s a challenge: suppose I give you data generated in the following way. There are a collection of vectors ${x_1}$, ${x_2}$, ${\ldots}$, ${x_{10,000}}$, each with ${10^6}$ coordinates. I generate outputs ${y_1}$, ${y_2}$, ${\ldots}$, ${y_{10,000}}$ in the following way. First I globally select ${100}$ of the ${10^6}$ coordinates uniformly at random, then I select a fixed vector ${u}$ such that those ${100}$ coordinates are drawn from i.i.d. Gaussians and the rest of the coordinates are zero. Now I set ${x_n = u^{\top}y_n}$ (i.e. ${x_n}$ is the dot product of ${u}$ with ${y_n}$). You are given ${x}$ and ${y}$, and your job is to infer ${u}$. This is a completely well-specified problem, the only task remaining is computational. I know people who have solved this problem using Bayesan methods with approximate inference. I have respect for these people, because doing so is no easy task. I think very few of them would say that “we can just approximate Bayesian updating and be fine”. (Also, this particular problem can be solved trivially with frequentist methods.)

A particularly egregious example of this is when people talk about “computable approximations to Solomonoff induction” or “computable approximations to AIXI” as if such notions were meaningful.

Myth 4: the prior isn’t a big deal because Bayesians can always share likelihood ratios. Putting aside the practical issue that there would in general be an infinite number of likelihood ratios to share, there is the larger issue that for any hypothesis ${h}$, there is also the hypothesis ${h'}$ that matches ${h}$ exactly up to now, and then predicts the opposite of ${h}$ at all points in the future. You have to constrain model complexity at some point, the question is about how. To put this another way, sharing my likelihood ratios without also constraining model complexity (by focusing on a subset of all logically possible hypotheses) would be equivalent to just sharing all sensory data I’ve ever accrued in my life. To the extent that such a notion is even possible, I certainly don’t need to be a Bayesian to do such a thing.

Myth 5: frequentist methods need to assume their model is correct or that the data are i.i.d. Understanding the content of this section is the most important single insight to gain from this essay. For some reason it’s assumed that frequentist methods need to make strong assumptions (such as Gaussianity), whereas Bayesian methods are somehow immune to this. In reality, the opposite is true. While there are many beautiful and deep frequentist formalisms that answer this, I will choose to focus on one of my favorite, which is online learning.

To explain the online learning framework, let us suppose that our data are ${(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)}$. We don’t observe ${y_t}$ until after making a prediction ${z_t}$ of what ${y_t}$ will be, and then we receive a penalty ${L(y_t, z_t)}$ based on how incorrect we were. So we can think of this as receiving prediction problems one-by-one, and in particular we make no assumptions about the relationship between the different problems; they could be i.i.d., they could be positively correlated, they could be anti-correlated, they could even be adversarially chosen.

As a running example, suppose that I’m betting on horses and before each race there are ${n}$ other people who give me advice on which horse to bet on. I know nothing about horses, so based on this advice I’d like to devise a good betting strategy. In this case, ${x_t}$ would be the ${n}$ bets that each of the other people recommend, ${z_t}$ would be the horse that I actually bet on, and ${y_t}$ would be the horse that actually wins the race. Then, supposing that ${y_t = z_t}$ (i.e., the horse I bet on actually wins), ${L(y_t, z_t)}$ is the negative of the payoff from correctly betting on that horse. Otherwise, if the horse I bet on doesn’t win, ${L(y_t, z_t)}$ is the cost I had to pay to place the bet.

If I’m in this setting, what guarantee can I hope for? I might ask for an algorithm that is guaranteed to make good bets — but this seems impossible unless the people advising me actually know something about horses. Or, at the very least, one of the people advising me knows something. Motivated by this, I define my regret to be the difference between my penalty and the penalty of the best of the ${n}$ people (note that I only have access to the latter after all ${T}$ rounds of betting). More formally, given a class ${\mathcal{M}}$ of predictors ${h : x \mapsto z}$, I define

$\displaystyle \mathrm{Regret}(T) = \frac{1}{T} \sum_{t=1}^T L(y_t, z_t) - \min_{h \in \mathcal{M}} \frac{1}{T} \sum_{t=1}^T L(y_t, h(x_t))$

In this case, ${\mathcal{M}}$ would have size ${n}$ and the ${i}$th predictor would just always follow the advice of person ${i}$. The regret is then how much worse I do on average than the best expert. A remarkable fact is that, in this case, there is a strategy such that ${\mathrm{Regret}(T)}$ shrinks at a rate of ${\sqrt{\frac{\log(n)}{T}}}$. In other words, I can have an average score within ${\epsilon}$ of the best advisor after ${\frac{\log(n)}{\epsilon^2}}$ rounds of betting.

One reason that this is remarkable is that it does not depend at all on how the data are distributed; the data could be i.i.d., positively correlated, negatively correlated, even adversarial, and one can still construct an (adaptive) prediction rule that does almost as well as the best predictor in the family.

To be even more concrete, if we assume that all costs and payoffs are bounded by ${\1}$ per round, and that there are ${1,000,000,000}$ people in total, then an explicit upper bound is that after ${28/\epsilon^2}$ rounds, we will be within ${\epsilon}$ dollars on average of the best other person. Under slightly stronger assumptions, we can do even better, for instance if the best person has an average variance of ${0.1}$ about their mean, then the ${28}$ can be replaced with ${4.5}$.

It is important to note that the betting scenario is just a running example, and one can still obtain regret bounds under fairly general scenarios; ${\mathcal{M}}$ could be continuous and ${L}$ could have quite general structure; the only technical assumption is that ${\mathcal{M}}$ be a convex set and that ${L}$ be a convex function of ${z}$. These assumptions tend to be easy to satisfy, though I have run into a few situations where they end up being problematic, mainly for computational reasons. For an ${n}$-dimensional model family, typically ${\mathrm{Regret}(T)}$ decreases at a rate of ${\sqrt{\frac{n}{T}}}$, although under additional assumptions this can be reduced to ${\sqrt{\frac{\log(n)}{T}}}$, as in the betting example above. I would consider this reduction to be one of the crowning results of modern frequentist statistics.

Yes, these guarantees sound incredibly awesome and perhaps too good to be true. They actually are that awesome, and they are actually true. The work is being done by measuring the error relative to the best model in the model family. We aren’t required to do well in an absolute sense, we just need to not do any worse than the best model. Of as long as at least one of the models in our family makes good predictions, that means we will as well. This is really what statistics is meant to be doing: you come up with everything you imagine could possibly be reasonable, and hand it to me, and then I come up with an algorithm that will figure out which of the things you handed me was most reasonable, and will do almost as well as that. As long as at least one of the things you come up with is good, then my algorithm will do well. Importantly, due to the ${\log(n)}$ dependence on the dimension of the model family, you can actually write down extremely broad classes of models and I will still successfully sift through them.

Let me stress again: regret bounds are saying that, no matter how the ${x_t}$ and ${y_t}$ are related, no i.i.d. assumptions anywhere in sight, we will do almost as well as any predictor ${h}$ in ${\mathcal{M}}$ (in particular, almost as well as the best predictor).

Myth 6: frequentist methods can only deal with simple models and need to make arbitrary cutoffs in model complexity. A naive perusal of the literature might lead one to believe that frequentists only ever consider very simple models, because many discussions center on linear and log-linear models. To dispel this, I will first note that there are just as many discussions that focus on much more general properties such as convexity and smoothness, and that can achieve comparably good bounds in many cases. But more importantly, the reason we focus so much on linear models is because we have already reduced a large family of problems to (log-)linear regression. The key insight, and I think one of the most important insights in all of applied mathematics, is that of featurization: given a non-linear problem, we can often embed it into a higher-dimensional linear problem, via a feature map ${\phi : X \rightarrow \mathbb{R}^n}$ (${\mathbb{R}^n}$ denotes ${n}$-dimensional space, i.e. vectors of real numbers of length ${n}$). For instance, if I think that ${y}$ is a polynomial (say cubic) function of ${x}$, I can apply the mapping ${\phi(x) = (1, x, x^2, x^3)}$, and now look for a linear relationship between ${y}$ and ${\phi(x)}$.

This insight extends far beyond polynomials. In combinatorial domains such as natural language, it is common to use indicator features: features that are ${1}$ if a certain event occurs and ${0}$ otherwise. For instance, I might have an indicator feature for whether two words appear consecutively in a sentence, whether two parts of speech are adjacent in a syntax tree, or for what part of speech a word has. Almost all state of the art systems in natural language processing work by solving a relatively simple regression task (typically either log-linear or max-margin) over a rich feature space (often involving hundreds of thousands or millions of features, i.e. an embedding into ${\mathbb{R}^{10^5}}$ or ${\mathbb{R}^{10^6}}$).

A counter-argument to the previous point could be: “Sure, you could create a high-dimensional family of models, but it’s still a parameterized family. I don’t want to be stuck with a parameterized family, I want my family to include all Turing machines!” Putting aside for a second the question of whether “all Turing machines” is a well-advised model choice, this is something that a frequentist approach can handle just fine, using a tool called regularization, which after featurization is the second most important idea in statistics.

Specifically, given any sufficiently quickly growing function ${\psi(h)}$, one can show that, given ${T}$ data points, there is a strategy whose average error is at most ${\sqrt{\frac{\psi(h)}{T}}}$ worse than any estimator ${h}$. This can hold even if the model class ${\mathcal{M}}$ is infinite dimensional. For instance, if ${\mathcal{M}}$ consists of all probability distributions over Turing machines, and we let ${h_i}$ denote the probability mass placed on the ${i}$th Turing machine, then a valid regularizer ${\psi}$ would be

$\displaystyle \psi(h) = \sum_i h_i \log(i^2 \cdot h_i)$

If we consider this, then we see that, for any probability distribution over the first ${2^k}$ Turing machines (i.e. all Turing machines with description length ${\leq k}$), the value of ${\psi}$ is at most ${\log((2^k)^2) = k\log(4)}$. (Here we use the fact that ${\psi(h) \geq \sum_i h_i \log(i^2)}$, since ${h_i \leq 1}$ and hence ${h_i\log(h_i) \leq 0}$.) This means that, if we receive roughly ${\frac{k}{\epsilon^2}}$ data, we will achieve error within ${\epsilon}$ of the best Turing machine that has description length ${\leq k}$.

Let me note several things here:

• This strategy makes no assumptions about the data being i.i.d. It doesn’t even assume that the data are computable. It just guarantees that it will perform as well as any Turing machine (or distribution over Turing machines) given the appropriate amount of data.
• This guarantee holds for any given sufficiently smooth measurement of prediction error (the update strategy depends on the particular error measure).
• This guarantee holds deterministically, no randomness required (although predictions may need to consist of probability distributions rather than specific points, but this is also true of Bayesian predictions).

Interestingly, in the case that the prediction error is given by the negative log probability assigned to the truth, then the corresponding strategy that achieves the error bound is just normal Bayesian updating. But for other measurements of error, we get different update strategies. Although I haven’t worked out the math, intuitively this difference could be important if the universe is fundamentally unpredictable but our notion of error is insensitive to the unpredictable aspects.

Myth 7: frequentist methods hide their assumptions while Bayesian methods make assumptions explicit. I’m still not really sure where this came from. As we’ve seen numerous times so far, a very common flavor among frequentist methods is the following: I have a model class ${\mathcal{M}}$, I want to do as well as any model in ${\mathcal{M}}$; or put another way:

Assumption: At least one model in ${\mathcal{M}}$ has error at most ${E}$.
Guarantee: My method will have error at most ${E + \epsilon}$.

This seems like a very explicit assumption with a very explicit guarantee. On the other hand, an argument I hear is that Bayesian methods make their assumptions explicit because they have an explicit prior. If I were to write this as an assumption and guarantee, I would write:

Assumption: The data were generated from the prior.
Guarantee: I will perform at least as well as any other method.

While I agree that this is an assumption and guarantee of Bayesian methods, there are two problems that I have with drawing the conclusion that “Bayesian methods make their assumptions explicit”. The first is that it can often be very difficult to understand how a prior behaves; so while we could say “The data were generated from the prior” is an explicit assumption, it may be unclear what exactly that assumption entails. However, a bigger issue is that “The data were generated from the prior” is an assumption that very rarely holds; indeed, in many cases the underlying process is deterministic (if you’re a subjective Bayesian then this isn’t necessarily a problem, but it does certainly mean that the assumption given above doesn’t hold). So given that that assumption doesn’t hold but Bayesian methods still often perform well in practice, I would say that Bayesian methods are making some other sort of “assumption” that is far less explicit (indeed, I would be very interested in understanding what this other, more nebulous assumption might be).

Myth 8: frequentist methods are fragile, Bayesian methods are robust. This is another one that’s straightforwardly false. First, since frequentist methods often rest on weaker assumptions they are more robust if the assumptions don’t quite hold. Secondly, there is an entire area of robust statistics, which focuses on being robust to adversarial errors in the problem data.

Myth 9: frequentist methods are responsible for bad science. I will concede that much bad science is done using frequentist statistics. But this is true only because pretty much all science is done using frequentist statistics. I’ve heard arguments that using Bayesian methods instead of frequentist methods would fix at least some of the problems with science. I don’t think this is particularly likely, as I think many of the problems come from mis-application of statistical tools or from failure to control for multiple hypotheses. If anything, Bayesian methods would exacerbate the former, because they often require more detailed modeling (although in most simple cases the difference doesn’t matter at all). I don’t think being Bayesian guards against multiple hypothesis testing. Yes, in some sense a prior “controls for multiple hypotheses”, but in general the issue is that the “multiple hypotheses” are never written down in the first place, or are written down and then discarded. One could argue that being in the habit of writing down a prior might make practitioners more likely to think about multiple hypotheses, but I’m not sure this is the first-order thing to worry about.

Myth 10: frequentist methods are unprincipled / hacky. One of the most beautiful theoretical paradigms that I can think of is what I could call the “geometric view of statistics”. One place that does a particularly good job of show-casing this is Shai Shalev-Shwartz’s PhD thesis, which was so beautiful that I cried when I read it. I’ll try (probably futilely) to convey a tiny amount of the intuition and beauty of this paradigm in the next few paragraphs, although focusing on minimax estimation, rather than online learning as in Shai’s thesis.

The geometric paradigm tends to emphasize a view of measurements (i.e. empirical expected values over observed data) as “noisy” linear constraints on a model family. We can control the noise by either taking few enough measurements that the total error from the noise is small (classical statistics), or by broadening the linear constraints to convex constraints (robust statistics), or by controlling the Lagrange multipliers on the constraints (regularization). One particularly beautiful result in this vein is the duality between maximum entropy and maximum likelihood. (I can already predict the Jaynesians trying to claim this result for their camp, but (i) Jaynes did not invent maximum entropy; (ii) maximum entropy is not particularly Bayesian (in the sense that frequentists use it as well); and (iii) the view on maximum entropy that I’m about to provide is different from the view given in Jaynes or by physicists in general [edit: EHeller thinks this last claim is questionable, see discussion here].)

To understand the duality mentioned above, suppose that we have a probability distribution ${p(x)}$ and the only information we have about it is the expected value of a certain number of functions, i.e. the information that ${\mathbb{E}[\phi(x)] = \phi^*}$, where the expectation is taken with respect to ${p(x)}$. We are interested in constructing a probability distribution ${q(x)}$ such that no matter what particular value ${p(x)}$ takes, ${q(x)}$ will still make good predictions. In other words (taking ${\log p(x)}$ as our measurement of prediction accuracy) we want ${\mathbb{E}_{p'}[\log q(x)]}$ to be large for all distributions ${p'}$ such that ${\mathbb{E}_{p'}[\phi(x)] = \phi^*}$. Using a technique called Lagrangian duality, we can both find the optimal distribution ${q}$ and compute its worse-case accuracy over all ${p'}$ with ${\mathbb{E}_{p'}[\phi(x)] = \phi^*}$. The characterization is as follows: consider all probability distributions ${q(x)}$ that are proportional to ${\exp(\lambda^{\top}\phi(x))}$ for some vector ${\lambda}$, i.e. ${q(x) = \exp(\lambda^{\top}\phi(x))/Z(\lambda)}$ for some ${Z(\lambda)}$. Of all of these, take the q(x) with the largest value of ${\lambda^{\top}\phi^* - \log Z(\lambda)}$. Then ${q(x)}$ will be the optimal distribution and the accuracy for all distributions ${p'}$ will be exactly ${\lambda^{\top}\phi^* - \log Z(\lambda)}$. Furthermore, if ${\phi^*}$ is the empirical expectation given some number of samples, then one can show that ${\lambda^{\top}\phi^* - \log Z(\lambda)}$ is propotional to the log likelihood of ${q}$, which is why I say that maximum entropy and maximum likelihood are dual to each other.

This is a relatively simple result but it underlies a decent chunk of models used in practice.

Myth 11: frequentist methods have no promising approach to computationally bounded inference. I would personally argue that frequentist methods are more promising than Bayesian methods at handling computational constraints, although computationally bounded inference is a very cutting edge area and I’m sure other experts would disagree. However, one point in favor of the frequentist approach here is that we already have some frameworks, such as the “tightening relaxations” framework discussed here, that provide quite elegant and rigorous ways of handling computationally intractable models.

References

(Myth 3) Sparse recovery: Sparse recovery using sparse matrices
(Myth 5) Online learning: Online learning and online convex optimization
(Myth 8) Robust statistics: see this blog post and the two linked papers
(Myth 10) Maximum entropy duality: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory

## Another Critique of Effective Altruism

I’ve decided to branch out a bit from technical discussions and engage in, as Scott Aaronson would call it, some metaphysical spouting. The topic of today is the effective altruism movement. I’m about to be relentlessly critical of it, so this is probably not the best post to read as your first introduction. Instead, read this and this. Then you can read what follows (but keep in mind that there are also many good things about the EA movement that I’m failing to mention here).

* * *

Another Critique of Effective Altruism

Recently Ben Kuhn wrote a critique of effective altruism. I’m glad to see such self-examination taking place, but I’m also concerned that the essay did not attack some of the most serious issues I see in the effective altruist movement, so I’ve decided to write my own critique. Due to time constraints, this critique is short and incomplete. I’ve tried to bring up arguments that would make people feel uncomfortable and defensive; hopefully I’ve succeeded.

Briefly, here are some of the major issues I have with the effective altruism movement as it currently stands:

• Over-focus on “tried and true” and “default” options, which may both reduce actual impact and decrease exploration of new potentially high-value opportunities.

• Over-confident claims coupled with insufficient background research.

• Over-reliance on a small set of tools for assessing opportunities, which lead many to underestimate the value of things such as “flow-through” effects.

The common theme here is a subtle underlying message that simple, shallow analyses can allow one to make high-impact career and giving choices, and divest one of the need to dig further. I doubt that anyone explicitly believes this, but I do believe that this theme comes out implicitly both in arguments people make and in actions people take.

Lest this essay give a mistaken impression to the casual reader, I should note that there are many examplary effective altruists who I feel are mostly immune to the issues above; for instance, the GiveWell blog does a very good job of warning against the first and third points above, and I would recommend anyone who isn’t already to subscribe to it (and there are other examples that I’m failing to mention). But for the purposes of this essay, I will ignore this fact except for the current caveat.

Over-focus on “tried and true” options

It seems to me that the effective altruist movement over-focuses on “tried and true” options, both in giving opportunities and in career paths. Perhaps the biggest example of this is the prevalence of “earning to give”. While this is certainly an admirable option, it should be considered as a baseline to improve upon, not a definitive answer.

The biggest issue with the “earning to give” path is that careers in finance and software (the two most common avenues for this) are incredibly straight-forward and secure. The two things that finance and software have in common is that there is a well-defined application process similar to the one for undergraduate admissions, and given reasonable job performance one will continue to be given promotions and raises (this probably entails working hard, but the end result is still rarely in doubt). One also gets a constant source of extrinsic positive reinforcement from the money they earn. Why do I call these things an “issue”? Because I think that these attributes encourage people to pursue these paths without looking for less obvious, less certain, but ultimately better paths. One in six Yale graduates go into finance and consulting, seemingly due to the simplicity of applying and the easy supply of extrinsic motivation. My intuition is that this ratio is higher than an optimal society would have, even if such people commonly gave generously (and it is certainly much higher than the number of people who enter college planning to pursue such paths).

Contrast this with, for instance, working at a start-up. Most start-ups are low-impact, but it is undeniable that at least some have been extraordinarily high-impact, so this seems like an area that effective altruists should be considering strongly. Why aren’t there more of us at 23&me, or Coursera, or Quora, or Stripe? I think it is because these opportunities are less obvious and take more work to find, once you start working it often isn’t clear whether what you’re doing will have a positive impact or not, and your future job security is massively uncertain. There are few sources of extrinsic motivation in such a career: perhaps moreso at one of the companies mentioned above, which are reasonably established and have customers, but what about the 4-person start-up teams working in a warehouse somewhere? Some of them will go on to do great things but right now their lives must be full of anxiousness and uncertainty.

I don’t mean to fetishize start-ups. They are just one well-known example of a potentially high-value career path that, to me, seems underexplored within the EA movement. I would argue (perhaps self-servingly) that academia is another example of such a path, with similar psychological obstacles: every 5 years or so you have the opportunity to get kicked out (e.g. applying for faculty jobs, and being up for tenure), you need to relocate regularly, few people will read your work and even fewer will praise it, and it won’t be clear whether it had a positive impact until many years down the road. And beyond the “obvious” alternatives of start-ups and academia, what of the paths that haven’t been created yet? GiveWell was revolutionary when it came about. Who will be the next GiveWell? And by this I don’t mean the next charity evaluator, but the next set of people who fundamentally alter how we view altruism.

Over-confident claims coupled with insufficient background research

The history of effective altruism is littered with over-confident claims, many of which have later turned out to be false. In 2009, Peter Singer claimed that you could save a life for $200 (and many others repeated his claim). While the number was already questionable at the time, by 2011 we discovered that the number was completely off. Now new numbers were thrown around: from numbers still in the hundreds of dollars (GWWC’s estimate for SCI, which was later shown to be flawed) up to$1600 (GiveWell’s estimate for AMF, which GiveWell itself expected to go up, and which indeed did go up). These numbers were often cited without caveats, as well as other claims such as that the effectiveness of charities can vary by a factor of 1,000. How many people citing these numbers understood the process that generated them, or the high degree of uncertainty surrounding them, or the inaccuracy of past estimates? How many would have pointed out that saying that charities vary by a factor of 1,000 in effectiveness is by itself not very helpful, and is more a statement about how bad the bottom end is than how good the top end is?

More problematic than the careless bandying of numbers is the tendency toward not doing strong background research. A common pattern I see is: an effective altruist makes a bold claim, then when pressed on it offers a heuristic justification together with the claim that “estimation is the best we have”. This sort of argument acts as a conversation-stopper (and can also be quite annoying, which may be part of what drives some people away from effective altruism). In many of these cases, there are relatively easy opportunities to do background reading to further educate oneself about the claim being made. It can appear to an outside observer as though people are opting for the fun, easy activity (speculation) rather than the harder and more worthwhile activity (research). Again, I’m not claiming that this is people’s explicit thought process, but it does seem to be what ends up happening.

Why haven’t more EAs signed up for a course on global security, or tried to understand how DARPA funds projects, or learned about third-world health? I’ve heard claims that this would be too time-consuming relative to the value it provides, but this seems like a poor excuse if we want to be taken seriously as a movement (or even just want to reach consistently accurate conclusions about the world).

Over-reliance on a small set of tools

Effective altruists tend to have a lot of interest in quantitative estimates. We want to know what the best thing to do is, and we want a numerical value. This causes us to rely on scientific studies, economic reports, and Fermi estimates. It can cause us to underweight things like the competence of a particular organization, the strength of the people involved, and other “intangibles” (which are often not actually intangible but simply difficult to assign a number to). It also can cause us to over-focus on money as a unit of altruism, while often-times “it isn’t about the money”: it’s about doing the groundwork that no one is doing, or finding the opportunity that no one has found yet.

Quantitative estimates often also tend to ignore flow-through effects: effects which are an indirect, rather than direct, result of an action (such as decreased disease in the third world contributing in the long run to increased global security). These effects are difficult to quantify but human and cultural intuition can do a reasonable job of taking them into account. As such, I often worry that effective altruists may actually be less effective than “normal” altruists. (One can point to all sorts of examples of farcical charities to claim that regular altruism sucks, but this misses the point that there are also amazing organizations out there, such as the Simons Foundation or HHMI, which are doing enormous amounts of good despite not subscribing to the EA philosophy.)

What’s particularly worrisome is that even if we were less effective than normal altruists, we would probably still end up looking better by our own standards, which explicitly fail to account for the ways in which normal altruists might outperform us (see above). This is a problem with any paradigm, but the fact that the effective altruist community is small and insular and relies heavily on its paradigm makes us far more susceptible to it.

## Convex Conditions for Strong Convexity

An important concept in online learning and convex optimization is that of strong convexity: a twice-differentiable function $f$ is said to be strongly convex with respect to a norm $\|\cdot\|$ if

$z^T\frac{\partial^2 f}{\partial x^2}z \geq \|z\|^2$

for all $z$ (for functions that are not twice-differentiable, there is an analogous criterion in terms of the Bregman divergence). To check strong convexity, then, we basically need to check a condition on the Hessian, namely that $z^THz \geq \|z\|^2$. So, under what conditions does this hold?

For the $l^2$ norm, the answer is easy: $z^THz \geq \|z\|_2^2$ if and only if $H \succeq I$ (i.e., $H-I$ is positive semidefinite). This can be shown in many ways, perhaps the easiest is by noting that $z^THz-\|z\|_2^2 = z^T(H-I)z$.

For the $l^{\infty}$ norm, the answer is a bit trickier but still not too complicated. Recall that we want necessary and sufficient conditions under which $z^THz \geq \|z\|_{\infty}^2$. Note that this is equivalent to asking that $z^THz \geq (z_i)^2$ for each coordinate $i$ of $z$, which in turn is equivalent to $H \succeq e_ie_i^T$ for each coordinate vector $e_i$ (these are the vectors that are 1 in the $i$th coordinate and 0 everywhere else).

More generally, for any norm $\|\cdot\|$, there exists a dual norm $\|\cdot\|_*$ which satisfies, among other properties, the relationship $\|z\| = \sup_{\|w\|_* = 1} w^Tz$. So, in general, $z^THz \geq \|z\|^2$ is equivalent to asking that $z^THz \geq (w^Tz)^2$ for all $w$ with $\|w\|_* = 1$. But this is in turn equivalent to asking that

$H \succeq ww^T$ for all $w$ such that $\|w\|_* = 1$.

In fact, it suffices to pick a subset of the $w$ such that the convex hull consists of all $w$ with $\|w\|_* \leq 1$; this is why we were able to obtain such a clean formulation in the $l^{\infty}$ case: the dual norm to $l^{\infty}$ is $l^1$, whose unit ball is the simplex, which is a polytope with only $2n$ vertices (namely, each of the signed unit vectors $\pm e_i$).

We can also derive a simple (but computationally expensive) criterion for $l^1$ strong convexity: here the dual norm is $l^{\infty}$, whose unit ball is the $n$-dimensional hypercube, with vertices given by all $2^n$ vectors of the form $[ \pm 1 \ \cdots \ \pm 1]$. Thus $z^THz \geq \|z\|_1^2$ if and only if $H \succeq ss^T$ for all $2^n$ sign vectors $s$.

Finally, we re-examine the $l^2$ case; even though the $l^2$-ball is not a polytope, we were still able to obtain a very simple expression. This was because the condition $H \succeq I$ manages to capture simultaneously all dual vectors such that $w^Tw \leq 1$. We thus have the general criterion:

Theorem. $H \succeq M_jM_j^T$ for $j = 1,\ldots,m$ if and only if $H$ is strongly convex with respect to the norm $\|\cdot\|$ whose dual unit ball is the convex hull of the transformed unit balls $M_j\mathcal{B}_j$, $j = 1, \ldots, m$, where $\mathcal{B}_j$ is the $l^2$ unit ball whose dimension matches the number of columns of $M_j$.

Proof. $H \succeq M_jM_j^T$ if and only if $z^THz \geq \max_{j=1}^m \|M_j^Tz\|_2^2$. Now note that $\|M_j^Tz\|_2 = \sup_{w \in \mathcal{B}_j} w^TM_j^Tz = \sup_{w' \in M_j\mathcal{B}_j} (w')^Tz$. If we define $\|z\| = \max_{j=1}^m \|M_j^Tz\|_2$, it is then apparent that the dual norm unit ball is the convex hull of the $M_j\mathcal{B}_j$.

## Convexity counterexample

Here’s a fun counterexample: a function $\mathbb{R}^n \to \mathbb{R}$ that is jointly convex in any $n-1$ of the variables, but not in all variables at once. The function is

$f(x_1,\ldots,x_n) = \frac{1}{2}(n-1.5)\sum_{i=1}^n x_i^2 - \sum_{i < j} x_ix_j$

To see why this is, note that the Hessian of $f$ is equal to

$\left[ \begin{array}{cccc} n-1.5 & -1 & \cdots & -1 \\ -1 & n-1.5 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & n-1.5 \end{array} \right]$

This matrix is equal to $(n-0.5)I - J$, where $I$ is the identity matrix and $J$ is the all-ones matrix, which is rank 1 and whose single non-zero eigenvalue is $n$. Therefore, this matrix has $n-1$ eigenvalues of $n-0.5$, as well as a single eigenvalue of $-0.5$, and hence is not positive definite.

On the other hand, any submatrix of size $n-1$ is of the form $(n-0.5)I-J$, but where now $J$ is only $(n-1) \times (n-1)$. This matrix now has $n-2$ eigenvalues of $n-0.5$, together with a single eigenvalue of $0.5$, and hence is positive definite. Therefore, the Hessian is positive definite when restricted to any $n-1$ variables, and hence $f$ is convex in any $n-1$ variables, but not in all $n$ variables jointly.

## Probabilistic Abstractions I

(This post represents research in progress. I may think about these concepts entirely differently a few months from now, but for my own benefit I’m trying to exposit on them in order to force myself to understand them better.)

For many inference tasks, especially ones with either non-linearities or non-convexities, it is common to use particle-based methods such as beam search, particle filters, sequential Monte Carlo, or Markov Chain Monte Carlo. In these methods, we approximate a distribution by a collection of samples from that distribution, then update the samples as new information is added. For instance, in beam search, if we are trying to build up a tree, we might build up a collection of $K$ samples for the left and right subtrees, then look at all $K^2$ ways of combining them into the entire tree, but then downsample again to the $K$ trees with the highest scores. This allows us to search through the exponentially large space of all trees efficiently (albeit at the cost of possibly missing high-scoring trees).

One major problem with such particle-based methods is diversity: the particles will tend to cluster around the highest-scoring mode, rather than exploring multiple local optima if they exist. This can be bad because it makes learning algorithms overly myopic. Another problem, especially in combinatorial domains, is difficulty of partial evaluation: if we have some training data that we are trying to fit to, and we have chosen settings of some, but not all, variables in our model, it can be difficult to know if that setting is on the right track (for instance, it can be difficult to know whether a partially-built tree is a promising candidate or not). For time-series modeling, this isn’t nearly as large of a problem, since we can evaluate against a prefix of the time series to get a good idea (this perhaps explains the success of particle filters in these domains).

I’ve been working on a method that tries to deal with both of these problems, which I call probabilistic abstractions. The idea is to improve the diversity of particle-based methods by creating “fat” particles which cover multiple states at once; the reason that such fat particles help is that they allow us to first optimize for coverage (by placing down relatively large particles that cover the entire space), then later worry about more local details (by placing down many particles near promising-looking local optima).

To be more concrete, if we have a probability distribution over a set of random variables $(X_1,\ldots,X_d)$, then our particles will be sets obtained by specifying the values of some of the $X_i$ and leaving the rest to vary arbitrarily. So, for instance, if $d=4$, then $\{(X_1,X_2,X_3,X_4) \mid X_2 = 1, x_4 = 7\}$ might be a possible “fat” particle.

By choosing some number of fat particles and assigning probabilities to them, we are implicitly specifying a polytope of possible probability distributions; for instance, if our particles are $S_1,\ldots,S_k$, and we assign probability $\pi_i$ to $S_i$, then we have the polytope of distributions $p$ that satisfy the constraints $p(S_1) = \pi_1, p(S_2) = \pi_2$, etc.

Given such a polytope, is there a way to pick a canonical representative from it? One such representative is the maximum entropy distribution in that polytope. This distribution has the property of minimizing the worst-case relative entropy to any other distribution within the polytope (and that worst-case relative entropy is just the entropy of the distribution).

Suppose that we have a polytope for two independent distributions, and we want to compute the polytope for their product. This is easy — just look at the cartesian products of each particle of the first distribution with each particle of the second distribution. If each individual distribution has $k$ particles, then the product distribution has $k^2$ particles — this could be problematic computationally, so we also want a way to narrow down to a subset of the $k$ most informative particles. These will be the $k$ particles such that the corresponding polytope minimizes the maximum entropy of that polytope. Finding this is NP-hard in general, but I’m currently working on good heuristics for computing it.

Next, suppose that we have a distribution on a space $X$ and want to apply a function $f : X \to Y$ to it. If $f$ is a complicated function, it might be difficult to propagate the fat particles (even though it would have been easy to propagate particles composed of single points). To get around this, we need what is called a valid abstraction of $f$: a function $\tilde{f} : 2^X \to 2^Y$ such that $\tilde{f}(S) \supseteq f(S)$ for all $S \in 2^X$. In this case, if we map a particle $S$ to $\tilde{f}(S)$, our equality constraint on the mass assigned to $S$ becomes a lower bound on the mass assigned to $\tilde{f}(S)$ — we thus still have a polytope of possible probability distributions. Depending on the exact structure of the particles (i.e. the exact way in which the different sets overlap), it may be necessary to add additional constraints to the polytope to get good performance — I feel like I have some understanding of this, but it’s something I’ll need to investigate empirically as well. It’s also interesting to note that $\tilde{f}$ (when combined with conditioning on data, which is discussed below) allows us to assign partial credit to promising particles, which was the other property I discussed at the beginning.

Finally, suppose that I want to condition on data. In this case the polytope approach doesn’t work as well, because conditioning on data can blow up the polytope by an arbitrarily large amount. Instead, we just take the maximum-entropy distribution in our polytope and treat that as our “true” distribution, then condition. I haven’t been able to make any formal statements about this procedure, but it seems to work at least somewhat reasonably. It is worth noting that conditioning may not be straightforward, since the likelihood function may not be constant across a given fat particle. To deal with this, we can replace the likelihood function by its average (which I think can be justified in terms of maximum entropy as well, although the details here are a bit hazier).

So, in summary, we have a notion of fat particles, which provide better coverage than point particles, and can combine them, apply functions to them, subsample them, and condition on data. This is essentially all of the operations we want to be able to apply for particle-based methods, so we in theory should now be able to implement versions of these particle-based methods that get better coverage.

## Pairwise Independence vs. Independence

For collections of independent random variables, the Chernoff bound and related bounds give us very sharp concentration inequalities — if $X_1,\ldots,X_n$ are independent, then their sum has a distribution that decays like $e^{-x^2}$. For random variables that are only pairwise independent, the strongest bound we have is Chebyshev’s inequality, which says that their sum decays like $\frac{1}{x^2}$.

The point of this post is to construct an equality case for Chebyshev: a collection of pairwise independent random variables whose sum does not satisfy the concentration bound of Chernoff, and instead decays like $\frac{1}{x^2}$.

The construction is as follows: let $X_1,\ldots,X_d$ be independent binary random variables, and for any $S \subset \{1,\ldots,d\}$, let $Y_S = \sum_{i \in S} X_i$, where the sum is taken mod 2. Then we can easily check that the $Y_S$ are pairwise independent. Now consider  the random variable $Z = \sum_{S} Y_S$. If any of the $X_i$ is equal to 1, then we can pair up the $Y_S$ by either adding or removing $i$ from $S$ to get the other element of the pair. If we do this, we see that $Z = 2^{d-1}$ in this case. On the other hand, if all of the $X_i$ are equal to 0, then $Z = 0$ as well. Thus, with probability $\frac{1}{2^d}$, $Z$ deviates from its mean by $2^{d-1}-\frac{1}{2}$, whereas the variance of $Z$ is $2^{d-2}-\frac{1}{4}$. The bound on this probability form Chebyshev is $\frac{2^{d-2}-1/4}{(2^{d-1}-1/2)^2}$, which is very close to $\frac{1}{2^d}$, so this constitutes something very close to the Chebyshev equality case.

Anyways, I just thought this was a cool example that demonstrates the difference between pairwise and full independence.

## A Fun Optimization Problem

I spent the last several hours trying to come up with an efficient algorithm to the following problem:

Problem: Suppose that we have a sequence of $l$ pairs of non-negative numbers $(a_1,b_1),\ldots,(a_l,b_l)$ such that $\sum_{i=1}^l a_i \leq A$ and $\sum_{i=1}^l b_i \leq B$. Devise an efficient algorithm to find the $k$ pairs $(a_{i_1},b_{i_1}),\ldots,(a_{i_k},b_{i_k})$ that maximize

$\left[\sum_{r=1}^k a_{i_r}\log(a_{i_r}/b_{i_r})\right] + \left[A-\sum_{r=1}^k a_{i_r}\right]\log\left(\frac{A-\sum_{r=1}^k a_{i_r}}{B-\sum_{r=1}^k b_{i_r}}\right).$

Commentary: I don’t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in $O\left(\frac{l \log(l)}{\epsilon}\right)$ time and finds $2k$ pairs that do at least $1-\epsilon$ as well as the best set of $k$ pairs. It’s possible I need to assume something like $\sum_{i=1}^l a_i \leq A/2$ instead of just $A$ (and similarly for the $b_i$), although I’m happy to make that assumption.

While attempting to solve this problem, I’ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I’d be interested to know.