Donations for 2016

The following explains where I plan to donate in 2016, with some of my thinking behind it. This year, I had $10,000 to allocate (the sum of my giving from 2015 and 2016, which I lumped together for tax reasons; although I think this was a mistake in retrospect, both due to discount rates and because I could have donated in January and December 2016 and still received the same tax benefits).

To start with the punch line: I plan to give $4000 to the EA donor lottery, $2500 to GiveWell for discretionary granting, $2000 to be held in reserve to fund promising projects, $500 to GiveDirectly, $500 to the Carnegie Endowment (earmarked for the Carnegie-Tsinghua Center), and $500 to the Blue Ribbon Study Panel.

For those interested in donating to any of these: instructions for the EA donor lottery and the Blue Ribbon Study Panel are in the corresponding links above, and you can donate to both GiveWell and GiveDirectly at this page. I am looking into whether it is possible for small donors to give to the Carnegie Endowment, and will update this page when I find out.

At a high level, I partitioned my giving into two categories, which are roughly (A) “help poor people right now” and (B) “improve the overall trajectory of civilization” (these are meant to be rough delineations rather than rigorous definitions). I decided to split my giving into 30% category A and 70% category B. This is because while I believe that category B is the more pressing and impactful category to address in some overall utilitarian sense, I still feel a particular moral obligation towards helping the existing poor in the world we currently live in, which I don’t feel can be discharged simply by giving more to category B. The 30-70 split is meant to represent the fact that while category B seems more important to me, category A still receives substantial weight in my moral calculus (which isn’t fully utilitarian or even consequentialist).

The rest of this post treats categories A and B each in turn.

Category A: The Global Poor

Out of $3000 in total, I decided to give $2500 to GiveWell for discretionary regranting (which will likely be disbursed roughly but not exactly according to GiveWell’s recommended allocation), and $500 to some other source, with the only stipulation being that it did not exactly match GiveWell’s recommendation. The reason for this was the following: while I expect GiveWell’s recommendation to outperform any conclusion that I personally reach, I think there is substantial value in the exercise of personally thinking through where to direct my giving. A few more specific reasons:

  • Most importantly, while I think that offloading giving decisions to a trusted expert is the correct decision to maximize the impact of any individual donation, collectively it leads to a bad equilibrium in which substantially less (and less diverse) brainpower is devoted to thinking about where to give. I think that giving a small but meaningful amount based on one’s own reasoning largely ameliorates this effect without losing much direct value.
  • In addition, I think it is good to build the skill of thinking through where to direct resources, even if in practice most of the work is outsourced to a dedicated organization.
  • Finally, having a large number of individual donors check GiveWell’s work and search for alternatives creates stronger incentives for GiveWell to do a thorough job (and allows donors to have more confidence that GiveWell is doing a thorough job). While I know many GiveWell staff and believe that they would do an excellent job independently of external vetting, I still think this is good practice.

Related to the last point: doing this exercise gave me a better appreciation for the overall reliability, strengths, and limitations of GiveWell’s work. In general, I found that GiveWell’s work was incredibly thorough (more so than I expected despite my high opinion of them), and moreover that they have moved substantial money beyond the publicized annual donor recommendations. An example of this is their 2016 grant to IDinsight. IDinsight ended up being one of my top candidates for where to donate, such that I thought it was plausibly even better than a GiveWell top charity. However, when I looked into it further it turned out that GiveWell had already essentially filled their entire funding gap.

I think this anecdote serves to illustrate a few things: first, as noted, GiveWell is very thorough, and does substantial work beyond what is apparent from the top charities page. Second, while GiveWell had already given to IDinsight, the grant was made in 2016. I think the same process I used would not have discovered IDinsight in 2015, but it’s possible that other processes would have. So, I think it is possible that a motivated individual could identify strong giving opportunities a year ahead of GiveWell. As a point against this, I think I am in an unusually good position to do this and still did not succeed. I also think that even if an individual identified a strong opportunity, it is unlikely that they could be confident that it was strong, and in most cases GiveWell’s top charities would still be better bets in expectation (but I think that merely identifying a plausibly strong giving opportunity should count as a huge success for the purposes of the overall exercise).

To elaborate on why my positioning might be atypically good: I already know GiveWell staff and so have some appreciation for their thinking, and I work at Stanford and have several friends in the economics department, which is one of the strongest departments in the world for Development Economics. In particular, I discussed my giving decisions extensively with a student of Pascaline Dupas, who is one of the world experts in the areas of economics most relevant to GiveWell’s recommendations.

Below are specifics on organizations I looked into and where I ultimately decided to give.

Object-level Process and Decisions (Category A)

My process for deciding where to give mostly consisted of talking to several people I trust, brainstorming and thinking things through myself, and a small amount of online research. (I think that I should likely have done substantially more online research than I ended up doing, but my thinking style tends to benefit from 1-on-1 discussions, which I also find more enjoyable.) The main types of charities that I ended up considering were:

  • GiveDirectly (direct cash transfers)
  • IPA/JPAL and similar groups (organizations that support academic research on international development)
  • IDinsight and similar groups (similar to the previous group, but explicitly tries to do the “translational work” of going from academic research to evidence-backed large-scale interventions)
  • public information campaigns (such as Development Media International)
  • animal welfare
  • start-ups or other small groups in the development space that might need seed funding
  • meta-charities such as CEA that try to increase the amount of money moved to EA causes (or evidence-backed charity more generally)

I ultimately felt unsure whether animal welfare should count in this category, and while I felt that CEA was a potentially strong candidate in terms of pure cost-effectiveness, directing funds there felt overly insular/meta to me in a way that defeated the purpose of the giving exercise. (Note: two individuals who reviewed this post encouraged me to revisit this point; as a result, next year I plan to look into CEA in more detail.)

While looking into the “translational work” category, I came across one organization other than IDinsight that did work in this area and was well-regarded by at least some economists. While I was less impressed by them than I was by IDinsight, they seemed plausibly strong, and it turned out that GiveWell had not yet evaluated them. While I ended up deciding not to give to them (based on feeling that IDinsight was likely to do substantially better work in the same area) I did send GiveWell an e-mail bringing the organization to their attention.

When looking into IPA, my impression was that while they have been responsible for some really good work in the past, this was primarily while they were a smaller organization, and they have now become large and bureaucratic enough that their future value will be substantially lower. However, I also found out about an individual who was running a small organization in the same space as IPA, and seemed to be doing very good work. While I was unable to offer them money for reasons related to conflict of interest, I do plan to try to find ways to direct funds to them if they are interested.

While public information campaigns seem like they could a priori be very effective, briefly looking over GiveWell’s page on DMI gave me the impression that GiveWell had already considered this area in a great deal of depth and prioritized other interventions for good reasons.

I ultimately decided to give my money to GiveDirectly. While in some sense this violates the spirit of the exercise, I felt satisfied about having found at least one potentially good giving opportunity (the small IPA-like organization) even if I was unable to give to it personally, and overall felt that I had done a reasonable amount of research. Moreover, I have a strong intuition that 0% is the wrong allocation for GiveDirectly, and it wasn’t clear to me that GiveWell’s reasons for recommending 0% were strong enough to override that intuition.

So, overall, $2500 of my donation will go to GiveWell for discretionary re-granting, and $500 to GiveDirectly.

Trajectory of Civilization (Category B)

First, I plan to put $2000 into escrow for the purpose of supporting any useful small projects (specifically in the field of computer science / machine learning) that I come across in the next year. For the remaining $5000, I plan to allocate $4000 of it to the donor lottery, $500 to the Carnegie Endowment, and $500 to the Blue Ribbon Study Panel on Biodefense. For the latter two donations, I wanted to donate to something that improved medium-term international security, because I believe that this is an important area that is relatively under-invested in by the effective altruist community (both in terms of money and cognitive effort). Here are all of the major possibilities that I considered:

  • Donating to the Future of Humanity Institute, with funds earmarked towards their collaboration with Allan Dafoe. I decided against this because my impression was that this particular project was not funding-constrained. (However, I am very excited by the work that Allan and his collaborators are doing, and would like to find ways to meaningfully support it.)
  • Donating to the Carnegie Endowment, restricted specifically to the Carnegie-Tsinghua Center. My understanding is that this is one of the few western organizations working to influence China’s nuclear policy (though this is based on personal conversation and not something I have looked into myself). My intuition is that influencing Chinese nuclear policy is substantially more tractable than U.S. nuclear policy, due to far fewer people trying to do so. In addition, from looking at their website, I felt that most of the areas they worked in were important areas, which I believe to be unusual for large organizations with multiple focuses (by contrast, for other organizations with a similar number of focus areas, I felt that roughly half of the areas were obviously orders of magnitude less important than the areas I was most excited about). I had some reservations about donating (due to their size: $30 million in revenue per year, and $300 million in assets), but I decided to donate $500 anyway because I am excited about this general type of work. (This organization was brought to my attention by Nick Beckstead; Nick notes that he doesn’t have strong opinions about this organization, primarily due to not knowing much about them.)
  • Donating to the Blue Ribbon Study Panel: I am basically trusting Jaime Yassif that this is a strong recommendation within the area of biodefense.
  • Donating to the ACLU: The idea here would be to decrease the probability that a President Trump seriously erodes democratic norms within the U.S. I however currently expect the ACLU to be well-funded (my understanding is that they got a flood of donations after Trump was elected).
  • Donating to the DNC or the Obama/Holder redistricting campaign: This is based on the idea that (1) Democrats are much better than Republicans for global stability / good U.S. policy, and (2) Republicans should be punished for helping Trump to become president. I basically agree with both, and could see myself donating to the redistricting campaign in particular in the future, but this intuitively feels less tractable/underfunded than non-partisan efforts like the Carnegie Endowment or Blue Ribbon Study Panel.
  • Creating a prize fund for incentivizing important research projects within computer science: I was originally planning to allocate $1000 to $2000 to this, based on the idea that computer science is a key field for multiple important areas (both AI safety and cyber security) and that as an expert in this field I would be in a unique position to identify useful projects relative to others in the EA community. However, after talking to several people and thinking about it myself, I decided that it was likely not tractable to provide meaningful incentives via prizes at such a small scale, and opted to instead set aside $2000 to support promising projects as I come across them.

(As a side note: it isn’t completely clear to me whether the Carnegie Endowment accepts small donations. I plan to contact them about this, and if they do not, allocate the money to the Blue Ribbon Study Panel instead.)

In the remainder of this post I will briefly describe the $2000 project fund, how I plan to use it, and why I decided it was a strong giving opportunity. I also plan to describe this in more detail in a separate follow-up post. Credit goes to Owen Cotton-Barratt for suggesting this idea. In addition, one of Paul Christiano’s blog posts inspired me to think about using prizes to incentivize research, and Holden Karnofsky further encouraged me to think along these lines.

The idea behind the project fund is similar to the idea behind the prize fund: I understand research in computer science better than most other EAs, and can give in a low-friction way on scales that are too small for organizations like Open Phil to think about. Moreover, it is likely good for me to develop a habit of evaluating projects I come across and thinking about whether they could benefit from additional money (either because they are funding constrained, or to incentivize an individual who is on the fence about carrying the project out). Finally, if this effort is successful, it is possible that other EAs will start to do this as well, which could magnify the overall impact. I think there is some danger that I will not be able to allocate the $2000 in the next year, in which case any leftover funds will go to next year’s donor lottery.

Thinking Outside One’s Paradigm

When I meet someone who works in a field outside of computer science, I usually ask them a lot of questions about their field that I’m curious about. (This is still relevant even if I’ve already met someone in that field before, because it gives me an idea of the range of expert consensus; for some questions this ends up being surprisingly variable.) I often find that, as an outsider, I can think of natural-seeming questions that experts in the field haven’t thought about, because their thinking is confined by their field’s paradigm while mine is not (pessimistically, it’s instead constrained by a different paradigm, i.e. computer science).

Usually my questions are pretty naive, and are basically what a computer scientist would think to ask based on their own biases. For instance:

  • Neuroscience: How much computation would it take to simulate a brain? Do our current theories of how neurons work allow us to do that even in principle?
  • Political science: How does the rise of powerful multinational corporations affect theories of international security (typical past theories assume that the only major powers are states)? How do we keep software companies (like Google, etc.) politically accountable? How will cyber attacks / cyber warfare affect international security?
  • Materials science: How much of the materials design / discovery process can be automated? What are the bottlenecks to building whatever materials we would like to? How can different research groups effectively communicate and streamline their steps for synthesizing materials?

When I do this, it’s not unusual for me to end up asking questions that the other person hasn’t really thought about before. In this case, responses range from “that’s not a question that our field studies” to “I haven’t thought about this much, but let’s try to think it through on the spot”. Of course, sometimes the other person has thought about it, and sometimes my question really is just silly or ill-formed for some reason (I suspect this is true more often than I’m explicitly made aware of, since some people are too polite to point it out to me).

I find the cases where the other person hasn’t thought about the question to be striking, because it means that I as a naive outsider can ask natural-seeming questions that haven’t been considered before by an expert in the field. I think what is going on here is that I and my interlocutor are using different paradigms (in the Kuhnian sense) for determining what questions are worth asking in a field. But while there is a sense in which the other person’s paradigm is more trustworthy — since it arose from a consensus of experts in the relevant field — that doesn’t mean that it’s absolutely reliable. Paradigms tend to blind one to evidence or problems that don’t fit into that paradigm, and paradigm shifts in science aren’t really that rare. (In addition, many fields including machine learning don’t even have a single agreed-upon paradigm.)

I think that as a scientist (or really, even as a citizen) it is important to be able to see outside one’s own paradigm. I currently think that I do a good job of this, but it seems to me that there’s a big danger of becoming more entrenched as I get older. Based on the above experiences, I plan to use the following test: When someone asks me a question about my field, how often have I not thought about it before? How tempted am I to say, “That question isn’t interesting”? If these start to become more common, then I’ll know something has gone wrong.

A few miscellaneous observations:

  • There are several people I know who routinely have answers to whatever questions I ask. Interestingly, they tend to be considered slightly “crackpot-ish” within their field; and they might also be less successful by conventional metrics, relative to how smart their colleagues consider them to be. I think this is a result of the fact that most academic fields over-reward progress within that field’s paradigm and under-reward progress outside of it.
  • Beyond “slightly crackpot-ish academics”, the other set of people who routinely have answers to my questions are philosophers and some people in program manager roles (this includes certain types of VCs as well).
  • I would guess that in general technical fields that overlap with the humanities are more likely to take a broad view and not get stuck in a single paradigm. For instance, I would expect political scientists to have thought about most of the political science questions I mentioned above; however, I haven’t talked to enough political scientists (or social scientists in general) to have much confidence in this.

Two Strange Facts

Here are two strange facts about matrices, which I can prove but not in a satisfying way.

  1. If A and B are symmetric matrices satisfying 0 \preceq A \preceq B, then A^{1/2} \preceq B^{1/2}, and B^{-1} \preceq A^{-1}, but it is NOT necessarily the case that A^2 \preceq B^2. Is there a nice way to see why the first two properties should hold but not necessarily the third? In general, do we have A^p \preceq B^p if p \in [0,1]?
  2. Given a rectangular matrix W \in \mathbb{R}^{n \times d}, and a set S \subseteq [n], let W_S be the submatrix of W with rows in S, and let \|W_S\|_* denote the nuclear norm (sum of singular values) of W_S. Then the function f(S) = \|W_S\|_* is submodular, meaning that f(S \cup T) + f(S \cap T) \leq f(S) + f(T) for all sets S, T. In fact, this is true if we take f_p(S), defined as the sum of the pth powers of the singular values of W_S, for any p \in [0,2]. The only proof I know involves trigonometric integrals and seems completely unmotivated to me. Is there any clean way of seeing why this should be true?

If anyone has insight into either of these, I’d be very interested!
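
For anyone who wants to poke at these numerically, here is a minimal sanity-check script (a sketch, not a proof; the matrix sizes, tolerances, and random-search strategy are arbitrary choices of mine). It searches random pairs 0 \preceq A \preceq B for violations of A^2 \preceq B^2 while confirming that A^{1/2} \preceq B^{1/2} always holds, and it spot-checks submodularity of S \mapsto \|W_S\|_* on a random matrix.

    import numpy as np
    from scipy.linalg import sqrtm

    rng = np.random.default_rng(0)

    def random_psd(n):
        M = rng.standard_normal((n, n))
        return M @ M.T

    def is_psd(M, tol=1e-8):
        return np.min(np.linalg.eigvalsh((M + M.T) / 2)) >= -tol

    # Fact 1: 0 <= A <= B should give A^(1/2) <= B^(1/2), but not necessarily A^2 <= B^2.
    violations = 0
    for _ in range(1000):
        A = random_psd(3)
        B = A + random_psd(3)                                   # ensures 0 <= A <= B
        assert is_psd(np.real(sqrtm(B)) - np.real(sqrtm(A)))    # square roots stay ordered
        if not is_psd(B @ B - A @ A):                           # squares need not stay ordered
            violations += 1
    print(f"A^2 <= B^2 failed in {violations} of 1000 random trials")

    # Fact 2: f(S) = ||W_S||_* should be submodular.
    def f(W, S):
        S = sorted(S)
        return np.linalg.norm(W[S, :], ord='nuc') if S else 0.0

    n, d = 8, 5
    W = rng.standard_normal((n, d))
    for _ in range(1000):
        S = set(np.flatnonzero(rng.random(n) < 0.5))
        T = set(np.flatnonzero(rng.random(n) < 0.5))
        assert f(W, S | T) + f(W, S & T) <= f(W, S) + f(W, T) + 1e-8
    print("submodularity held for all random (S, T) pairs tried")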

Difficulty of Predicting the Maximum of Gaussians

Suppose that we have a random variable X \in \mathbb{R}^d, such that \mathbb{E}[XX^{\top}] = I_{d \times d}. Now take k independent Gaussian random variables Z_1, \ldots, Z_k \sim \mathcal{N}(0, I_{d \times d}), and let J be the argmax (over j in 1, …, k) of Z_j^{\top}X.

It seems that it should be very hard to predict J well, in the following sense: for any function q(j \mid x), the expectation \mathbb{E}_{x}[q(J \mid x)] should, with high probability over the randomness in Z, be very close to \frac{1}{k}. In fact, Alex Zhai and I think that the probability of this expectation exceeding \frac{1}{k} + \epsilon should be at most \exp(-C(\epsilon/k)^2d) for some constant C. (We can already show this to be true if we replace (\epsilon/k)^2 with (\epsilon/k)^4.) I will not sketch a proof here, but the idea is pretty cool: it relies on Lipschitz concentration of Gaussian random variables.
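
To make the claim concrete, here is a small Monte Carlo sketch (an illustration of the basic concentration phenomenon only; it says nothing about the conjectured \exp(-C(\epsilon/k)^2d) rate). I take X to be standard Gaussian, which satisfies \mathbb{E}[XX^{\top}] = I_{d \times d}, and use an arbitrary softmax predictor q chosen independently of Z; both choices are mine rather than part of the problem statement.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, n_x = 1000, 10, 20000        # dimension, number of Gaussians, Monte Carlo samples over x

    # a fixed, arbitrary predictor q(j | x) = softmax(W x / sqrt(d))_j, chosen independently of Z
    W = rng.standard_normal((k, d))

    def q_probs(X):
        scores = (X @ W.T) / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        return P / P.sum(axis=1, keepdims=True)

    estimates = []
    for _ in range(20):                               # 20 independent draws of Z_1, ..., Z_k
        Z = rng.standard_normal((k, d))
        X = rng.standard_normal((n_x, d))             # satisfies E[X X^T] = I
        J = np.argmax(X @ Z.T, axis=1)                # J = argmax_j Z_j^T x for each sample x
        estimates.append(q_probs(X)[np.arange(n_x), J].mean())   # estimate of E_x[q(J | x)]

    print(f"1/k = {1/k:.3f}; estimates of E_x[q(J|x)] ranged over "
          f"[{min(estimates):.3f}, {max(estimates):.3f}]")

The estimates should land close to 1/k; a natural follow-up experiment is to see how the spread varies with d and k, or how adversarially one can choose q (still independently of Z).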

I’m mainly posting this problem because I think it’s pretty interesting, in case anyone else is inspired to work on it. It is closely related to the covering number of exponential families under the KL divergence, where we are interested in coverings at relatively large radii (\log(k) - \epsilon rather than \epsilon).

Maximal Maximum-Entropy Sets

Consider a probability distribution {p(y)} on a space {\mathcal{Y}}. Suppose we want to construct a set {\mathcal{P}} of probability distributions on {\mathcal{Y}} such that {p(y)} is the maximum-entropy distribution over {\mathcal{P}}:

\displaystyle H(p) = \max_{q \in \mathcal{P}} H(q),

where {H(p) = \mathbb{E}_{p}[-\log p(y)]} is the entropy. We call such a set a maximum-entropy set for {p}. Furthermore, we would like {\mathcal{P}} to be as large as possible, subject to the constraint that {\mathcal{P}} is convex.

Does such a maximal convex maximum-entropy set {\mathcal{P}} exist? That is, is there some convex set {\mathcal{P}} such that {p} is the maximum-entropy distribution in {\mathcal{P}}, and for any {\mathcal{Q}} satisfying the same property, {\mathcal{Q} \subseteq \mathcal{P}}? It turns out that the answer is yes, and there is even a simple characterization of {\mathcal{P}}:

Proposition 1 For any distribution {p} on {\mathcal{Y}}, the set

\displaystyle \mathcal{P} = \{q \mid \mathbb{E}_{q}[-\log p(y)] \leq H(p)\}

is the maximal convex maximum-entropy set for {p}.

To see why this is, first note that, clearly, {p \in \mathcal{P}}, and for any {q \in \mathcal{P}} we have

\displaystyle \begin{array}{rcl} H(q) &=& \mathbb{E}_{q}[-\log q(y)] \\ &\leq& \mathbb{E}_{q}[-\log p(y)] \\ &\leq& H(p), \end{array}

so {p} is indeed the maximum-entropy distribution in {\mathcal{P}}. On the other hand, let {\mathcal{Q}} be any other convex set whose maximum-entropy distribution is {p}. Then in particular, for any {q \in \mathcal{Q}}, we must have {H((1-\epsilon)p + \epsilon q) \leq H(p)}. Let us suppose for the sake of contradiction that {q \not\in \mathcal{P}}, so that {\mathbb{E}_{q}[-\log p(y)] > H(p)}. Then we have

\displaystyle \begin{array}{rcl} H((1-\epsilon)p + \epsilon q) &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}[-\log((1-\epsilon)p(y)+\epsilon q(y))] \\ &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}[-\log(p(y) + \epsilon (q(y)-p(y)))] \\ &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}\left[-\log(p(y)) - \epsilon \frac{q(y)-p(y)}{p(y)} + \mathcal{O}(\epsilon^2)\right] \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) - \epsilon \mathbb{E}_{(1-\epsilon)p+\epsilon q}\left[\frac{q(y)-p(y)}{p(y)}\right] + \mathcal{O}(\epsilon^2) \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) - \epsilon^2 \mathbb{E}_{q}\left[\frac{q(y)-p(y)}{p(y)}\right] + \mathcal{O}(\epsilon^2) \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) + \mathcal{O}(\epsilon^2). \end{array}

Since {\mathbb{E}_{q}[-\log p(y)] - H(p) > 0}, for sufficiently small {\epsilon} this will exceed {H(p)}, which is a contradiction. Therefore we must have {q \in \mathcal{P}} for all {q \in \mathcal{Q}}, and hence {\mathcal{Q} \subseteq \mathcal{P}}, so that {\mathcal{P}} is indeed the maximal convex maximum-entropy set for {p}.
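
As a quick numerical companion to the first half of the argument, the following sketch samples random discrete distributions q and checks that whenever \mathbb{E}_{q}[-\log p(y)] \leq H(p) we also have H(q) \leq H(p). The alphabet size and sampling scheme are arbitrary choices of mine, and the maximality half is exactly the contradiction argument above, so it is not re-checked here.

    import numpy as np

    rng = np.random.default_rng(0)

    def entropy(p):
        return float(-np.sum(p * np.log(p)))

    m = 6
    p = rng.dirichlet(np.ones(m))          # a random distribution on m outcomes
    Hp = entropy(p)

    for _ in range(50000):
        q = rng.dirichlet(np.ones(m))
        if np.dot(q, -np.log(p)) <= Hp:    # q lies in the proposed set P
            assert entropy(q) <= Hp + 1e-12
    print("no violations: p had maximal entropy among all sampled members of P")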

Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems

Introduction

There has been much recent discussion about AI risk, meaning specifically the potential pitfalls (both short-term and long-term) that AI with improved capabilities could create for society. Discussants include AI researchers such as Stuart Russell, Eric Horvitz, and Tom Dietterich, entrepreneurs such as Elon Musk and Bill Gates, and research institutes such as the Machine Intelligence Research Institute (MIRI) and Future of Humanity Institute (FHI); the director of the latter institute, Nick Bostrom, has even written a bestselling book on this topic. Finally, ten million dollars in funding has been earmarked towards research on ensuring that AI will be safe and beneficial. Given this, I think it would be useful for AI researchers to discuss the nature and extent of risks that might be posed by increasingly capable AI systems, both short-term and long-term. As a PhD student in machine learning and artificial intelligence, I will use this essay to describe my own views on AI risk, in the hope of encouraging other researchers to detail their thoughts as well.

For the purposes of this essay, I will define “AI” to be technology that can carry out tasks with limited or no human guidance, “advanced AI” to be technology that performs substantially more complex and domain-general tasks than are possible today, and “highly capable AI” to be technology that can outperform humans in all or almost all domains. As the primary target audience of this essay is other researchers, I have used technical terms (e.g. weakly supervised learning, inverse reinforcement learning) whenever they were useful, though I have also tried to make the essay more generally accessible when possible.

Outline

I think it is important to distinguish between two questions. First, does artificial intelligence merit the same degree of engineering safety considerations as other technologies (such as bridges)? Second, does artificial intelligence merit additional precautions, beyond those that would be considered typical? I will argue that the answer is yes to the first, even in the short term, and that current engineering methodologies in the field of machine learning do not provide even a typical level of safety or robustness. Moreover, I will argue that the answer to the second question in the long term is likely also yes — namely, that there are important ways in which highly capable artificial intelligence could pose risks which are not addressed by typical engineering concerns.

The point of this essay is not to be alarmist; indeed, I think that AI is likely to be net-positive for humanity. Rather, the point of this essay is to encourage a discussion about the potential pitfalls posed by artificial intelligence, since I believe that research done now can mitigate many of these pitfalls. Without such a discussion, we are unlikely to understand which pitfalls are most important or likely, and thus unable to design effective research programs to prevent them.

A common objection to discussing risks posed by AI is that it seems somewhat early to worry about such risks, and that the discussion is likely to be more germane if we wait until the field of AI has advanced further. I think this objection is quite reasonable in the abstract; however, as I will argue below, I think we do have a reasonable understanding of at least some of the risks that AI might pose, that some of these will be realized even in the medium term, and that there are reasonable programs of research that can address these risks, which in many cases would also have the advantage of improving the usability of existing AI systems.

Ordinary Engineering

There are many issues related to AI safety that are just a matter of good engineering methodology. For instance, we would ideally like systems that are transparent, modular, robust, and work under well-understood assumptions. Unfortunately, machine learning as a field has not developed very good methodologies for obtaining any of these things, and so this is an important issue to remedy. In other words, I think we should put at least as much thought into building an AI as we do into building a bridge.

Just to be very clear, I do not think that machine learning researchers are bad engineers; looking at any of the open source tools such as Torch, Caffe, MLlib, and others makes it clear that many machine learning researchers are also good software engineers. Rather, I think that as a field our methodologies are not mature enough to address the specific engineering desiderata of statistical models (in contrast to the algorithms that create them). In particular, the statistical models obtained from machine learning algorithms tend to be:

  1. Opaque: Many machine learning models consist of hundreds of thousands of parameters, making it difficult to understand how predictions are made. Typically, practitioners resort to error analysis examining the covariates that most strongly influence each incorrect prediction. However, this is not a very sustainable long-term solution, as it requires substantial effort even for relatively narrow-domain systems.
  2. Monolithic: In part due to their opacity, models act as a black box, with no modularity or encapsulation of behavior. Though machine learning systems are often split into pipelines of smaller models, the lack of encapsulation can make these pipelines even harder to manage than a single large model; indeed, since machine learning models are by design optimized for a particular input distribution (i.e. whatever distribution they are trained on), we end up in a situation where “Changing Anything Changes Everything” [1].
  3. Fragile: As another consequence of being optimized for a particular training distribution, machine learning models can have arbitrarily poor performance when that distribution shifts. For instance, Daumé and Marcu [2] show that a named entity classifier with 92% accuracy on one dataset drops to 58% accuracy on a superficially similar dataset (a toy illustration of this failure mode follows this list). Though such issues are partially addressed by work on transfer learning and domain adaptation [3], these areas are not very developed compared to supervised learning.
  4. Poorly understood: Beyond their fragility, understanding when a machine learning model will work is difficult. We know that a model will work if it is tested on the same distribution it is trained on, and have some extensions beyond this case (e.g. based on robust optimization [4]), but we have very little in the way of practically relevant conditions under which a model trained in one situation will work well in another situation. Although they are related, this issue differs from the opacity issue above in that it relates to making predictions about the system’s future behavior (in particular, generalization to new situations), versus understanding the internal workings of the current system.
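
Below is a toy sketch of the fragility point (item 3 above): a logistic regression is fit on one input distribution and evaluated on a shifted one. The data are synthetic and entirely of my own construction (a label-correlated spurious direction that is present at training time and absent at test time); it is meant only to illustrate the phenomenon, not to reproduce the Daumé and Marcu numbers.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d = 20
    w_true = rng.standard_normal(d)           # direction that actually determines the label
    w_spurious = rng.standard_normal(d)       # direction correlated with labels only at train time

    def make_data(n, spurious_strength):
        X = rng.standard_normal((n, d))
        y = (X @ w_true + 0.5 * rng.standard_normal(n) > 0).astype(int)
        # optionally shift inputs along w_spurious in a label-dependent way
        X = X + spurious_strength * np.outer(2 * y - 1, w_spurious)
        return X, y

    X_train, y_train = make_data(5000, spurious_strength=2.0)   # spurious correlation present
    X_shift, y_shift = make_data(5000, spurious_strength=0.0)   # spurious correlation absent

    clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    print("accuracy on the training distribution:", clf.score(X_train, y_train))
    print("accuracy after the distribution shifts:", clf.score(X_shift, y_shift))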

That these issues plague machine learning systems is likely uncontroversial among machine learning researchers. However, in comparison to research focused on extending capabilities, very little is being done to address them. Research in this area therefore seems particularly impactful, especially given the desire to deploy machine learning systems in increasingly complex and safety-critical situations.

Extraordinary Engineering

Does AI merit additional safety precautions, beyond those that are considered standard engineering practice in other fields? Here I am focusing only on the long-term impacts of advanced or highly capable AI systems.

My tentative answer is yes; there seem to be a few different ways in which AI could have bad effects, each of which seems individually unlikely but not implausible. Even if each of the risks identified so far is unlikely, (i) the total risk might be large, especially if there are additional unidentified risks, and (ii) the existence of multiple “near-misses” motivates closer investigation, as it may suggest some underlying principle that makes AI risk-laden. In the sequel I will focus on so-called “global catastrophic” risks, meaning risks that could affect a large fraction of the earth’s population in a material way. I have chosen to focus on these risks because I think there is an important difference between an AI system messing up in a way that harms a few people (which would be a legal liability but perhaps should not motivate a major effort in terms of precautions) and an AI system that could cause damage on a global scale. The latter would justify substantial precautions, and I want to make it clear that this is the bar I am setting for myself.

With that in place, below are a few ways in which advanced or highly capable AI could have specific global catastrophic risks.

Cyber-attacks. There are two trends which taken together make the prospect of AI-aided cyber-attacks seem worrisome. The first trend is simply the increasing prevalence of cyber-attacks; even this year we have seen Russia attack Ukraine, North Korea attack Sony, and China attack the U.S. Office of Personnel Management. Secondly, the “Internet of Things” means that an increasing number of physical devices will be connected to the internet. Assuming that software exists to autonomously control them, many internet-enabled devices such as cars could be hacked and then weaponized, leading to a decisive military advantage in a short span of time. Such an attack could be enacted by a small group of humans aided by AI technologies, which would make it hard to detect in advance. Unlike other weaponizable technology such as nuclear fission or synthetic biology, it would be very difficult to control the distribution of AI since it does not rely on any specific raw materials. Finally, note that even a team with relatively small computing resources could potentially “bootstrap” to much more computing power by first creating a botnet with which to do computations; to date, the largest botnet has spanned 30 million computers and several other botnets have exceeded 1 million.

Autonomous weapons. Beyond cyber-attacks, improved autonomous robotics technology combined with ubiquitous access to miniature UAVs (“drones”) could allow both terrorists and governments to wage a particularly pernicious form of remote warfare by creating weapons that are both cheap and hard to detect or defend against (due to their small size and high maneuverability). Beyond direct malicious intent, if autonomous weapons systems or other powerful autonomous systems malfunction then they could cause a large amount of damage.

Mis-optimization. A highly capable AI could acquire a large amount of power but pursue an overly narrow goal, and end up harming humans or human value while optimizing for this goal. This may seem implausible at face value, but as I will argue below, it is easier to improve AI capabilities than to improve AI values, making such a mishap possible in theory.

Unemployment. It is already the case that increased automation is decreasing the number of available jobs, to the extent that some economists and policymakers are discussing what to do if the number of jobs is systematically smaller than the number of people seeking work. If AI systems allow a large number of jobs to be automated over a relatively short time period, then we may not have time to plan or implement policy solutions, and there could then be a large unemployment spike. In addition to the direct effects on the people who are unemployed, such a spike could also have indirect consequences by decreasing social stability on a global scale.

Opaque systems. It is also already the case that increasingly many tasks are being delegated to autonomous systems, from trades in financial markets to aggregation of information feeds. The opacity of these systems has led to issues such as the 2010 Flash Crash and will likely lead to larger issues in the future. In the long term, as AI systems become increasingly complex, humans may lose the ability to meaningfully understand or intervene in such systems, which could lead to a loss of sovereignty if autonomous systems are employed in executive-level functions (e.g. government, economy).

Beyond these specific risks, it seems clear that, eventually, AI will be able to outperform humans in essentially every domain. At that point, it seems doubtful that humanity will continue to have direct causal influence over its future unless specific measures are put in place to ensure this. While I do not think this day will come soon, I think it is worth thinking now about how we might meaningfully control highly capable AI systems, and I also think that many of the risks posed above (as well as others that we haven’t thought of yet) will occur on a somewhat shorter time scale.

Let me end with some specific ways in which control of AI may be particularly difficult compared to other human-engineered systems:

  1. AI may be “agent-like”, which means that the space of possible behaviors is much larger; our intuitions about how AI will act in pursuit of a given goal may not account for this and so AI behavior could be hard to predict.
  2. Since an AI would presumably learn from experience, and will likely run at a much faster serial processing speed than humans, its capabilities may change rapidly, ruling out the usual process of trial-and-error.
  3. AI will act in a much more open-ended domain. In contrast, our existing tools for specifying the necessary properties of a system only work well in narrow domains. For instance, for a bridge, safety relates to the ability to successfully accomplish a small number of tasks (e.g. not falling over). For these, it suffices to consider well-characterized engineering properties such as tensile strength. For AI, the number of tasks we would potentially want it to perform is large, and it is unclear how to obtain a small number of well-characterized properties that would ensure safety.
  4. Existing machine learning frameworks make it very easy for AI to acquire knowledge, but hard to acquire values. For instance, while an AI’s model of reality is flexibly learned from data, its goal/utility function is hard-coded in almost all situations; an exception is some work on inverse reinforcement learning [5], but this is still a very nascent framework. Importantly, the asymmetry between knowledge (and hence capabilities) and values is fundamental, rather than simply a statement about existing technologies. This is because knowledge is something that is regularly informed by reality, whereas values are only weakly informed by reality: an AI which learns incorrect facts could notice that it makes wrong predictions, but the world might never “tell” an AI that it learned the “wrong values”. At a technical level, while many tasks in machine learning are fully supervised or at least semi-supervised, value acquisition is a weakly supervised task.

In summary: there are several concrete global catastrophic risks posed by highly capable AI, and there are also several reasons to believe that highly capable AI would be difficult to control. Together, these suggest to me that the control of highly capable AI systems is an important problem posing unique research challenges.

Long-term Goals, Near-term Research

Above I presented an argument for why AI, in the long term, may require substantial precautionary efforts. Beyond this, I also believe that there is important research that can be done right now to reduce long-term AI risks. In this section I will elaborate on some specific research projects, though my list is not meant to be exhaustive.

  1. Value learning: In general, it seems important in the long term (and also in the short term) to design algorithms for learning values / goal systems / utility functions, rather than requiring them to be hand-coded. One framework for this is inverse reinforcement learning [5], though developing additional frameworks would also be useful.
  2. Weakly supervised learning: As argued above, inferring values, in contrast to beliefs, is an at most weakly supervised problem, since humans themselves are often incorrect about what they value and so any attempt to provide fully annotated training data about values would likely contain systematic errors. It may be possible to infer values indirectly through observing human actions; however, since humans often act immorally and human values change over time, current human actions are not consistent with our ideal long-term values, and so learning from actions in a naive way could lead to problems. Therefore, a better fundamental understanding of weakly supervised learning — particularly regarding guaranteed recovery of indirectly observed parameters under well-understood assumptions — seems important.
  3. Formal specification / verification: One way to make AI safer would be to formally specify desiderata for its behavior, and then prove that these desiderata are met. A major open challenge is to figure out how to meaningfully specify formal properties for an AI system. For instance, even if a speech transcription system did a near-perfect job of transcribing speech, it is unclear what sort of specification language one might use to state this property formally. Beyond this, though there is much existing work in formal verification, it is still extremely challenging to verify large systems.
  4. Transparency: To the extent that the decision-making process of an AI is transparent, it should be relatively easy to ensure that its impact will be positive. To the extent that the decision-making process is opaque, it should be relatively difficult to do so. Unfortunately, transparency seems difficult to obtain, especially for AIs that reach decisions through complex series of serial computations. Therefore, better techniques for rendering AI reasoning transparent seem important.
  5. Strategic assessment and planning: Better understanding of the likely impacts of AI will allow a better response. To this end, it seems valuable to map out and study specific concrete risks; for instance, better understanding ways in which machine learning could be used in cyber-attacks, or forecasting the likely effects of technology-driven unemployment, and determining useful policies around these effects. It would also be clearly useful to identify additional plausible risks beyond those of which we are currently aware. Finally, thought experiments surrounding different possible behaviors of advanced AI would help inform intuitions and point to specific technical problems. Some of these tasks are most effectively carried out by AI researchers, while others should be done in collaboration with economists, policy experts, security experts, etc.

The above constitute at least five concrete directions of research on which I think important progress can be made today, which would meaningfully improve the safety of advanced AI systems and which in many cases would likely have ancillary benefits in the short term, as well.

Related Work

At a high level, while I have implicitly provided a program of research above, there are other proposed research programs as well. Perhaps the earliest proposed program is from MIRI [6], which has focused on AI alignment problems that arise even in simplified settings (e.g. with unlimited computing power or easy-to-specify goals) in hopes of later generalizing to more complex settings. The Future of Life Institute (FLI) has also published a research priorities document [7, 8] with a broader focus, including non-technical topics such as regulation of autonomous weapons and economic shifts induced by AI-based technologies. I do not necessarily endorse either document, but think that both represent a big step in the right direction. Ideally, MIRI, FLI, and others will all justify why they think their problems are worth working on and we can let the best arguments and counterarguments rise to the top. This is already happening to some extent [9, 10, 11] but I would like to see more of it, especially from academics with expertise in machine learning and AI [12, 13].

In addition, several specific arguments I have advanced are similar to those already advanced by others. The issue of AI-driven unemployment has been studied by Brynjolfsson and McAfee [14], and is also discussed in the FLI research document. The problem of AI pursuing narrow goals has been elaborated through Bostrom’s “paperclipping argument” [15] as well as the orthogonality thesis [16], which states that beliefs and values are independent of each other. While I disagree with the orthogonality thesis in its strongest form, the arguments presented above for the difficulty of value learning can in many cases reach similar conclusions.

Omohundro [17] has argued that advanced agents would pursue certain instrumentally convergent drives under almost any value system, which is one way in which agent-like systems differ from systems without agency. Good [18] was the first to argue that AI capabilities could improve rapidly. Yudkowsky has argued that it would be easy for an AI to acquire power given few initial resources [19], though his example assumes the creation of advanced biotechnology.

Christiano has argued for the value of transparent AI systems, and proposed the “advisor games” framework as a potential operationalization of transparency [20].

Conclusion

To ensure the safety of AI systems, additional research is needed, both to meet ordinary short-term engineering desiderata as well as to make the additional precautions specific to highly capable AI systems. In both cases, there are clear programs of research that can be undertaken today, which in many cases seem to be under-researched relative to their potential societal value. I therefore think that well-directed research towards improving the safety of AI systems is a worthwhile undertaking, with the additional benefit of motivating interesting new directions of research.

Acknowledgments

Thanks to Paul Christiano, Holden Karnofsky, Percy Liang, Luke Muehlhauser, Nick Beckstead, Nate Soares, and Howie Lempel for providing feedback on a draft of this essay.

References

[1] D. Sculley, et al. Machine Learning: The High-Interest Credit Card of Technical Debt. 2014.
[2] Hal Daumé III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, pages 101–126, 2006.
[3] Sinno J. Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[4] Dimitris Bertsimas, David B. Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.
[5] Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference in Machine Learning, pages 663–670, 2000.
[6] Nate Soares and Benja Fallenstein. Aligning Superintelligence with Human Interests: A Technical Research Agenda. 2014.
[7] Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. 2015.
[8] Daniel Dewey, Stuart Russell, and Max Tegmark. A survey of research questions for robust and beneficial AI. 2015.
[9] Paul Christiano. The Steering Problem. 2015.
[10] Paul Christiano. Stable self-improvement as an AI safety problem. 2015.
[11] Luke Muehlhauser. How to study superintelligence strategy. 2014.
[12] Stuart Russell. Of Myths and Moonshine. 2014.
[13] Tom Dietterich and Eric Horvitz. Benefits and Risks of Artificial Intelligence. 2015.
[14] Erik Brynjolfsson and Andrew McAfee. The second machine age: work, progress, and prosperity in a time of brilliant technologies. WW Norton & Company, 2014.
[15] Nick Bostrom. Ethical Issues in Advanced Artificial Intelligence. Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence, 2003.
[16] Nick Bostrom. The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2):71–85, 2012.
[17] Stephen M. Omohundro. The Basic AI Drives. Frontiers in Artificial Intelligence and Applications, IOS Press, 2008.
[18] Irving J. Good. Speculations concerning the first ultraintelligent machine. Advances in Computers, 6:31–83, 1965.
[19] Eliezer Yudkowsky. Artificial intelligence as a positive and negative factor in global risk. In Global Catastrophic Risks, 2008.
[20] Paul Christiano. Advisor Games. 2015.

A Fervent Defense of Frequentist Statistics

[Highlights for the busy: de-bunking standard “Bayes is optimal” arguments; frequentist Solomonoff induction; and a description of the online learning framework.]

Short summary. This essay makes many points, each of which I think is worth reading, but if you are only going to understand one point I think it should be “Myth 5” below, which describes the online learning framework as a response to the claim that frequentist methods need to make strong modeling assumptions. Among other things, online learning allows me to perform the following remarkable feat: if I’m betting on horses, and I get to place bets after watching other people bet but before seeing which horse wins the race, then I can guarantee that after a relatively small number of races, I will do almost as well overall as the best other person, even if the number of other people is very large (say, 1 billion), and their performance is correlated in complicated ways.

If you’re only going to understand two points, then also read about the frequentist version of Solomonoff induction, which is described in “Myth 6”.

Main article. I’ve already written one essay on Bayesian vs. frequentist statistics. In that essay, I argued for a balanced, pragmatic approach in which we think of the two families of methods as a collection of tools to be used as appropriate. Since I’m currently feeling contrarian, this essay will be far less balanced and will argue explicitly against Bayesian methods and in favor of frequentist methods. I hope this will be forgiven as so much other writing goes in the opposite direction of unabashedly defending Bayes. I should note that this essay is partially inspired by some of Cosma Shalizi’s blog posts, such as this one.

This essay will start by listing a series of myths, then debunk them one-by-one. My main motivation for this is that Bayesian approaches seem to be highly popularized, to the point that one may get the impression that they are the uncontroversially superior method of doing statistics. I actually think the opposite is true: I think most statisticians would for the most part defend frequentist methods, although there are also many departments that are decidedly Bayesian (e.g. many places in England, as well as some U.S. universities like Columbia). I have a lot of respect for many of the people at these universities, such as Andrew Gelman and Philip Dawid, but I worry that many of the other proponents of Bayes (most of them non-statisticians) tend to oversell Bayesian methods or undersell alternative methodologies.

If you are like me from, say, two years ago, you are firmly convinced that Bayesian methods are superior and that you have knockdown arguments in favor of this. If this is the case, then I hope this essay will give you an experience that I myself found life-altering: the experience of having a way of thinking that seemed unquestionably true slowly dissolve into just one of many imperfect models of reality. This experience helped me gain more explicit appreciation for the skill of viewing the world from many different angles, and of distinguishing between a very successful paradigm and reality.

If you are not like me, then you may have had the experience of bringing up one of many reasonable objections to normative Bayesian epistemology, and having it shot down by one of many “standard” arguments that seem wrong but not for easy-to-articulate reasons. I hope to lend some reprieve to those of you in this camp, by providing a collection of “standard” replies to these standard arguments.

I will start with the myths (and responses) that I think will require the least technical background and be most interesting to a general audience. Toward the end, I deal with some attacks on frequentist methods that I believe amount to technical claims that are demonstrably false; doing so involves more math. Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method, even though I think there’s actually a fair amount of variation among these methods, although also a fair amount of overlap (e.g. I’m throwing in statistical learning theory with minimax estimation, which certainly have a lot of overlap in ideas but were also in some sense developed by different communities).

The Myths:

  • Bayesian methods are optimal.
  • Bayesian methods are optimal except for computational considerations.
  • We can deal with computational constraints simply by making approximations to Bayes.
  • The prior isn’t a big deal because Bayesians can always share likelihood ratios.
  • Frequentist methods need to assume their model is correct, or that the data are i.i.d.
  • Frequentist methods can only deal with simple models, and make arbitrary cutoffs in model complexity (aka: “I’m Bayesian because I want to do Solomonoff induction”).
  • Frequentist methods hide their assumptions while Bayesian methods make assumptions explicit.
  • Frequentist methods are fragile, Bayesian methods are robust.
  • Frequentist methods are responsible for bad science.
  • Frequentist methods are unprincipled/hacky.
  • Frequentist methods have no promising approach to computationally bounded inference.

Myth 1: Bayesian methods are optimal. Presumably when most people say this they are thinking of either Dutch-booking or the complete class theorem. Roughly, these say the following:

Dutch-book argument: Every coherent set of beliefs can be modeled as a subjective probability distribution. (Roughly, coherent means “unable to be Dutch-booked”.)

Complete class theorem: Every non-Bayesian method is worse than some Bayesian method (in the sense of performing deterministically at least as poorly in every possible world).

Let’s unpack both of these. My high-level argument regarding Dutch books is that I would much rather spend my time trying to correspond with reality than trying to be internally consistent. More concretely, the Dutch-book argument says that if for every bet you force me to take one side or the other, then unless I’m Bayesian there’s a collection of bets that will cause me to lose money for sure. I don’t find this very compelling. This seems analogous to the situation where there’s some quant at Jane Street, and they’re about to run code that will make thousands of dollars trading stocks, and someone comes up to them and says “Wait! You should add checks to your code to make sure that no subset of your trades will lose you money!” This just doesn’t seem worth the quant’s time; it would slow down the code substantially, and the quant’s effort would be better spent writing the next program to make thousands more dollars. This is basically what Dutch-book arguments seem like to me.

Moving on, the complete class theorem says that for any decision rule, I can do better by replacing it with some Bayesian decision rule. But this injunction is not useful in practice, because it doesn’t say anything about which decision rule I should replace it with. Of course, if you hand me a decision rule and give me infinite computational resources, then I can hand you back a Bayesian method that will perform better. But it still might not perform well. All the complete class theorem says is that every local optimum is Bayesian. To be a useful theory of epistemology, I need a prescription for how, in the first place, I am to arrive at a good decision rule, not just a locally optimal one. And this is something that frequentist methods do provide, to a far greater extent than Bayesian methods (for instance by using minimax decision rules such as the maximum-entropy example given later). Note also that many frequentist methods do correspond to a Bayesian method for some appropriately chosen prior. But the crucial point is that the frequentist told me how to pick a prior I would be happy with (also, many frequentist methods don’t correspond to a Bayesian method for any choice of prior; they nevertheless often perform quite well).

Myth 2: Bayesian methods are optimal except for computational considerations. We already covered this in the previous point under the complete class theorem, but to re-iterate: Bayesian methods are locally optimal, not globally optimal. Identifying all the local optima is very different from knowing which of them is the global optimum. I would much rather have someone hand me something that wasn’t a local optimum but was close to the global optimum, than something that was a local optimum but was far from the global optimum.

Myth 3: We can deal with computational constraints simply by making approximations to Bayes. I have rarely seen this borne out in practice. Here’s a challenge: suppose I give you data generated in the following way. There are a collection of vectors {x_1}, {x_2}, {\ldots}, {x_{10,000}}, each with {10^6} coordinates. I generate outputs {y_1}, {y_2}, {\ldots}, {y_{10,000}} in the following way. First I globally select {100} of the {10^6} coordinates uniformly at random, then I select a fixed vector {u} such that those {100} coordinates are drawn from i.i.d. Gaussians and the rest of the coordinates are zero. Now I set {y_n = u^{\top}x_n} (i.e. {y_n} is the dot product of {u} with {x_n}). You are given {x} and {y}, and your job is to infer {u}. This is a completely well-specified problem; the only task remaining is computational. I know people who have solved this problem using Bayesian methods with approximate inference. I have respect for these people, because doing so is no easy task. I think very few of them would say that “we can just approximate Bayesian updating and be fine”. (Also, this particular problem can be solved trivially with frequentist methods.)

A particularly egregious example of this is when people talk about “computable approximations to Solomonoff induction” or “computable approximations to AIXI” as if such notions were meaningful.
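Returning to the sparse challenge above: here is a minimal sketch (my own illustration, not part of the original discussion) of the “trivial” frequentist solution, namely L1-regularized regression (the Lasso), on a scaled-down instance. The problem sizes and regularization strength below are placeholder choices so that the script runs in a few seconds.

```python
# Scaled-down sketch of the sparse-recovery challenge (dimensions reduced from
# 10,000 x 10^6 so it runs quickly); the Lasso recovers the sparse vector u.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features, n_nonzero = 500, 5000, 20   # stand-in sizes

# Ground-truth u: a few Gaussian coordinates chosen at random, the rest zero.
u = np.zeros(n_features)
support = rng.choice(n_features, size=n_nonzero, replace=False)
u[support] = rng.standard_normal(n_nonzero)

X = rng.standard_normal((n_samples, n_features))
y = X @ u                                          # y_n = u^T x_n

u_hat = Lasso(alpha=0.01, max_iter=10000).fit(X, y).coef_
print("nonzero coefficients found:", int(np.sum(np.abs(u_hat) > 1e-3)))
print("relative error:", np.linalg.norm(u_hat - u) / np.linalg.norm(u))
```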

Myth 4: the prior isn’t a big deal because Bayesians can always share likelihood ratios. Putting aside the practical issue that there would in general be an infinite number of likelihood ratios to share, there is the larger issue that for any hypothesis {h}, there is also the hypothesis {h'} that matches {h} exactly up to now, and then predicts the opposite of {h} at all points in the future. You have to constrain model complexity at some point, the question is about how. To put this another way, sharing my likelihood ratios without also constraining model complexity (by focusing on a subset of all logically possible hypotheses) would be equivalent to just sharing all sensory data I’ve ever accrued in my life. To the extent that such a notion is even possible, I certainly don’t need to be a Bayesian to do such a thing.

Myth 5: frequentist methods need to assume their model is correct or that the data are i.i.d. Understanding the content of this section is the most important single insight to gain from this essay. For some reason it’s assumed that frequentist methods need to make strong assumptions (such as Gaussianity), whereas Bayesian methods are somehow immune to this. In reality, the opposite is true. While there are many beautiful and deep frequentist formalisms that address this, I will choose to focus on one of my favorites, which is online learning.

To explain the online learning framework, let us suppose that our data are {(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)}. We don’t observe {y_t} until after making a prediction {z_t} of what {y_t} will be, and then we receive a penalty {L(y_t, z_t)} based on how incorrect we were. So we can think of this as receiving prediction problems one-by-one, and in particular we make no assumptions about the relationship between the different problems; they could be i.i.d., they could be positively correlated, they could be anti-correlated, they could even be adversarially chosen.

As a running example, suppose that I’m betting on horses and before each race there are {n} other people who give me advice on which horse to bet on. I know nothing about horses, so based on this advice I’d like to devise a good betting strategy. In this case, {x_t} would be the {n} bets that each of the other people recommend, {z_t} would be the horse that I actually bet on, and {y_t} would be the horse that actually wins the race. Then, supposing that {y_t = z_t} (i.e., the horse I bet on actually wins), {L(y_t, z_t)} is the negative of the payoff from correctly betting on that horse. Otherwise, if the horse I bet on doesn’t win, {L(y_t, z_t)} is the cost I had to pay to place the bet.

If I’m in this setting, what guarantee can I hope for? I might ask for an algorithm that is guaranteed to make good bets — but this seems impossible unless the people advising me actually know something about horses. Or, at the very least, one of the people advising me knows something. Motivated by this, I define my regret to be the difference between my penalty and the penalty of the best of the {n} people (note that I only have access to the latter after all {T} rounds of betting). More formally, given a class {\mathcal{M}} of predictors {h : x \mapsto z}, I define

\displaystyle \mathrm{Regret}(T) = \frac{1}{T} \sum_{t=1}^T L(y_t, z_t) - \min_{h \in \mathcal{M}} \frac{1}{T} \sum_{t=1}^T L(y_t, h(x_t))

In this case, {\mathcal{M}} would have size {n} and the {i}th predictor would just always follow the advice of person {i}. The regret is then how much worse I do on average than the best expert. A remarkable fact is that, in this case, there is a strategy such that {\mathrm{Regret}(T)} shrinks at a rate of {\sqrt{\frac{\log(n)}{T}}}. In other words, I can have an average score within {\epsilon} of the best advisor after {\frac{\log(n)}{\epsilon^2}} rounds of betting.

One reason that this is remarkable is that it does not depend at all on how the data are distributed; the data could be i.i.d., positively correlated, negatively correlated, even adversarial, and one can still construct an (adaptive) prediction rule that does almost as well as the best predictor in the family.
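For a concrete feel for where such guarantees come from, here is a minimal sketch (my own illustration, not part of the original argument) of the classic exponential-weights (“Hedge”) strategy for the experts setting; the particular losses and learning rate below are placeholder choices.

```python
# Exponential weights ("Hedge"): average regret against n experts shrinks like
# sqrt(log(n)/T) even if the per-round losses are chosen adversarially.
import numpy as np

def hedge(loss_matrix, eta):
    """loss_matrix: T x n array of per-round losses in [0, 1] for each expert."""
    T, n = loss_matrix.shape
    weights = np.ones(n)
    my_losses = []
    for t in range(T):
        probs = weights / weights.sum()           # current distribution over experts
        my_losses.append(probs @ loss_matrix[t])  # (expected) loss incurred this round
        weights *= np.exp(-eta * loss_matrix[t])  # downweight experts that did badly
        weights /= weights.sum()                  # renormalize for numerical stability
    return np.array(my_losses)

rng = np.random.default_rng(0)
T, n = 10000, 50
losses = rng.uniform(size=(T, n))                 # arbitrary losses for most experts
losses[:, 7] = rng.uniform(0.0, 0.3, size=T)      # one expert happens to be good
eta = np.sqrt(8 * np.log(n) / T)                  # standard learning-rate choice
avg_mine = hedge(losses, eta).mean()
avg_best = losses.mean(axis=0).min()
print("my average loss:", round(avg_mine, 3), " best expert:", round(avg_best, 3))
```

Nothing in the algorithm assumes anything about how the losses are generated; only the comparison to the best expert matters.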

To be even more concrete, if we assume that all costs and payoffs are bounded by {\$1} per round, and that there are {1,000,000,000} people in total, then an explicit upper bound is that after {28/\epsilon^2} rounds, we will be within {\epsilon} dollars on average of the best other person. Under slightly stronger assumptions, we can do even better, for instance if the best person has an average variance of {0.1} about their mean, then the {28} can be replaced with {4.5}.

It is important to note that the betting scenario is just a running example, and one can still obtain regret bounds under fairly general scenarios; {\mathcal{M}} could be continuous and {L} could have quite general structure; the only technical assumption is that {\mathcal{M}} be a convex set and that {L} be a convex function of {z}. These assumptions tend to be easy to satisfy, though I have run into a few situations where they end up being problematic, mainly for computational reasons. For an {n}-dimensional model family, typically {\mathrm{Regret}(T)} decreases at a rate of {\sqrt{\frac{n}{T}}}, although under additional assumptions this can be reduced to {\sqrt{\frac{\log(n)}{T}}}, as in the betting example above. I would consider this reduction to be one of the crowning results of modern frequentist statistics.

Yes, these guarantees sound incredibly awesome and perhaps too good to be true. They actually are that awesome, and they are actually true. The work is being done by measuring the error relative to the best model in the model family. We aren’t required to do well in an absolute sense; we just need to not do any worse than the best model. So as long as at least one of the models in our family makes good predictions, we will as well. This is really what statistics is meant to be doing: you come up with everything you imagine could possibly be reasonable and hand it to me, and then I come up with an algorithm that will figure out which of the things you handed me was most reasonable, and will do almost as well as that. As long as at least one of the things you come up with is good, my algorithm will do well. Importantly, due to the {\log(n)} dependence on the dimension of the model family, you can actually write down extremely broad classes of models and I will still successfully sift through them.

Let me stress again: regret bounds are saying that, no matter how the {x_t} and {y_t} are related, no i.i.d. assumptions anywhere in sight, we will do almost as well as any predictor {h} in {\mathcal{M}} (in particular, almost as well as the best predictor).

Myth 6: frequentist methods can only deal with simple models and need to make arbitrary cutoffs in model complexity. A naive perusal of the literature might lead one to believe that frequentists only ever consider very simple models, because many discussions center on linear and log-linear models. To dispel this, I will first note that there are just as many discussions that focus on much more general properties such as convexity and smoothness, and that can achieve comparably good bounds in many cases. But more importantly, the reason we focus so much on linear models is because we have already reduced a large family of problems to (log-)linear regression. The key insight, and I think one of the most important insights in all of applied mathematics, is that of featurization: given a non-linear problem, we can often embed it into a higher-dimensional linear problem, via a feature map {\phi : X \rightarrow \mathbb{R}^n} ({\mathbb{R}^n} denotes {n}-dimensional space, i.e. vectors of real numbers of length {n}). For instance, if I think that {y} is a polynomial (say cubic) function of {x}, I can apply the mapping {\phi(x) = (1, x, x^2, x^3)}, and now look for a linear relationship between {y} and {\phi(x)}.
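As a minimal sketch of featurization (my own illustration; the coefficients and noise level are made up), here is a cubic relationship fit by ordinary linear regression after applying the feature map {\phi(x) = (1, x, x^2, x^3)}:

```python
# Featurization: a non-linear (cubic) relationship becomes a linear regression
# problem once x is mapped to phi(x) = (1, x, x^2, x^3).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.standard_normal(200)  # cubic plus noise

Phi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)  # feature map phi(x)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)            # linear regression in feature space
print("estimated coefficients:", np.round(coef, 2))       # roughly (1, -2, 0, 0.5)
```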

This insight extends far beyond polynomials. In combinatorial domains such as natural language, it is common to use indicator features: features that are {1} if a certain event occurs and {0} otherwise. For instance, I might have an indicator feature for whether two words appear consecutively in a sentence, for whether two parts of speech are adjacent in a syntax tree, or for the part of speech of a given word. Almost all state-of-the-art systems in natural language processing work by solving a relatively simple regression task (typically either log-linear or max-margin) over a rich feature space (often involving hundreds of thousands or millions of features, i.e. an embedding into {\mathbb{R}^{10^5}} or {\mathbb{R}^{10^6}}).
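And a correspondingly minimal sketch of indicator features for text (my own toy example; real systems use far richer feature templates):

```python
# Indicator features: each adjacent word pair in a sentence turns on one
# coordinate of a (very sparse) feature vector, represented here as a dict.
from collections import defaultdict

def bigram_features(sentence):
    words = sentence.lower().split()
    feats = defaultdict(float)
    for a, b in zip(words, words[1:]):
        feats["bigram=" + a + "_" + b] = 1.0   # 1 if this word pair is adjacent
    return dict(feats)

print(bigram_features("the cat sat on the mat"))
```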

A counter-argument to the previous point could be: “Sure, you could create a high-dimensional family of models, but it’s still a parameterized family. I don’t want to be stuck with a parameterized family, I want my family to include all Turing machines!” Putting aside for a second the question of whether “all Turing machines” is a well-advised model choice, this is something that a frequentist approach can handle just fine, using a tool called regularization, which after featurization is the second most important idea in statistics.

Specifically, given any sufficiently quickly growing function {\psi(h)}, one can show that, given {T} data points, there is a strategy whose average error is at most {\sqrt{\frac{\psi(h)}{T}}} worse than any estimator {h}. This can hold even if the model class {\mathcal{M}} is infinite dimensional. For instance, if {\mathcal{M}} consists of all probability distributions over Turing machines, and we let {h_i} denote the probability mass placed on the {i}th Turing machine, then a valid regularizer {\psi} would be

\displaystyle \psi(h) = \sum_i h_i \log(i^2 \cdot h_i)

If we consider this, then we see that, for any probability distribution over the first {2^k} Turing machines (i.e. all Turing machines with description length {\leq k}), the value of {\psi} is at most {\log((2^k)^2) = k\log(4)}. (Here we use the fact that {\psi(h) \leq \sum_i h_i \log(i^2)}, since {h_i \leq 1} and hence {h_i\log(h_i) \leq 0}.) This means that, if we receive roughly {\frac{k}{\epsilon^2}} data points, we will achieve error within {\epsilon} of the best Turing machine that has description length {\leq k}.

Let me note several things here:

  • This strategy makes no assumptions about the data being i.i.d. It doesn’t even assume that the data are computable. It just guarantees that it will perform almost as well as any Turing machine (or distribution over Turing machines) given the appropriate amount of data.
  • This guarantee holds for any given sufficiently smooth measurement of prediction error (the update strategy depends on the particular error measure).
  • This guarantee holds deterministically, no randomness required (although predictions may need to consist of probability distributions rather than specific points; this is also true of Bayesian predictions).

Interestingly, when the prediction error is the negative log probability assigned to the truth, the strategy that achieves the error bound is just normal Bayesian updating. But for other measurements of error, we get different update strategies. Although I haven’t worked out the math, intuitively this difference could be important if the universe is fundamentally unpredictable but our notion of error is insensitive to the unpredictable aspects.
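The first observation above (that Bayesian updating falls out under log loss) can be checked directly: with log loss and a unit learning rate, the exponential-weights update over a family of models is literally Bayes’ rule. A minimal sketch (my own, with made-up numbers):

```python
# With loss = -log p_i(data) and learning rate 1, exponential weights reduces
# to Bayesian updating: w_i * exp(-(-log p_i)) = w_i * p_i.
import numpy as np

prior = np.array([0.5, 0.3, 0.2])         # weights over three candidate models
likelihoods = np.array([0.9, 0.1, 0.4])   # p_i(observed data) under each model

ew = prior * np.exp(-(-np.log(likelihoods)))   # exponential-weights update
ew /= ew.sum()

bayes = prior * likelihoods                    # Bayes' rule
bayes /= bayes.sum()

print(np.allclose(ew, bayes))                  # True
```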

Myth 7: frequentist methods hide their assumptions while Bayesian methods make assumptions explicit. I’m still not really sure where this came from. As we’ve seen numerous times so far, a very common flavor among frequentist methods is the following: I have a model class {\mathcal{M}}, I want to do as well as any model in {\mathcal{M}}; or put another way:

Assumption: At least one model in {\mathcal{M}} has error at most {E}.
Guarantee: My method will have error at most {E + \epsilon}.

This seems like a very explicit assumption with a very explicit guarantee. On the other hand, an argument I hear is that Bayesian methods make their assumptions explicit because they have an explicit prior. If I were to write this as an assumption and guarantee, I would write:

Assumption: The data were generated from the prior.
Guarantee: I will perform at least as well as any other method.

While I agree that this is an assumption and guarantee of Bayesian methods, there are two problems that I have with drawing the conclusion that “Bayesian methods make their assumptions explicit”. The first is that it can often be very difficult to understand how a prior behaves; so while we could say “The data were generated from the prior” is an explicit assumption, it may be unclear what exactly that assumption entails. However, a bigger issue is that “The data were generated from the prior” is an assumption that very rarely holds; indeed, in many cases the underlying process is deterministic (if you’re a subjective Bayesian then this isn’t necessarily a problem, but it does certainly mean that the assumption given above doesn’t hold). So given that that assumption doesn’t hold but Bayesian methods still often perform well in practice, I would say that Bayesian methods are making some other sort of “assumption” that is far less explicit (indeed, I would be very interested in understanding what this other, more nebulous assumption might be).

Myth 8: frequentist methods are fragile, Bayesian methods are robust. This is another one that’s straightforwardly false. First, since frequentist methods often rest on weaker assumptions, they are more robust when those assumptions don’t quite hold. Second, there is an entire area of robust statistics, which focuses on being robust to adversarial errors in the problem data.
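To give the flavor of the robustness claim, here is a minimal sketch (my own toy example, not taken from the robust-statistics papers cited below): the median barely moves when an adversary corrupts 20% of the data, while the mean is destroyed.

```python
# Mean vs. median under adversarial corruption of 20% of the data points.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=1000)   # clean data centered at 5
corrupted = data.copy()
corrupted[:200] = 1e6                              # adversary corrupts 200 points

print("mean:  ", round(data.mean(), 2), "->", round(corrupted.mean(), 2))
print("median:", round(np.median(data), 2), "->", round(np.median(corrupted), 2))
```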

Myth 9: frequentist methods are responsible for bad science. I will concede that much bad science is done using frequentist statistics. But this is true only because pretty much all science is done using frequentist statistics. I’ve heard arguments that using Bayesian methods instead of frequentist methods would fix at least some of the problems with science. I don’t think this is particularly likely, as I think many of the problems come from mis-application of statistical tools or from failure to control for multiple hypotheses. If anything, Bayesian methods would exacerbate the former, because they often require more detailed modeling (although in most simple cases the difference doesn’t matter at all). I don’t think being Bayesian guards against multiple hypothesis testing. Yes, in some sense a prior “controls for multiple hypotheses”, but in general the issue is that the “multiple hypotheses” are never written down in the first place, or are written down and then discarded. One could argue that being in the habit of writing down a prior might make practitioners more likely to think about multiple hypotheses, but I’m not sure this is the first-order thing to worry about.

Myth 10: frequentist methods are unprincipled / hacky. One of the most beautiful theoretical paradigms that I can think of is what I would call the “geometric view of statistics”. One place that does a particularly good job of showcasing this is Shai Shalev-Shwartz’s PhD thesis, which was so beautiful that I cried when I read it. I’ll try (probably futilely) to convey a tiny amount of the intuition and beauty of this paradigm in the next few paragraphs, though focusing on minimax estimation rather than online learning as in Shai’s thesis.

The geometric paradigm tends to emphasize a view of measurements (i.e. empirical expected values over observed data) as “noisy” linear constraints on a model family. We can control the noise by either taking few enough measurements that the total error from the noise is small (classical statistics), or by broadening the linear constraints to convex constraints (robust statistics), or by controlling the Lagrange multipliers on the constraints (regularization). One particularly beautiful result in this vein is the duality between maximum entropy and maximum likelihood. (I can already predict the Jaynesians trying to claim this result for their camp, but (i) Jaynes did not invent maximum entropy; (ii) maximum entropy is not particularly Bayesian (in the sense that frequentists use it as well); and (iii) the view on maximum entropy that I’m about to provide is different from the view given in Jaynes or by physicists in general [edit: EHeller thinks this last claim is questionable, see discussion here].)

To understand the duality mentioned above, suppose that we have a probability distribution {p(x)} and the only information we have about it is the expected value of a certain number of functions, i.e. the information that {\mathbb{E}[\phi(x)] = \phi^*}, where the expectation is taken with respect to {p(x)}. We are interested in constructing a probability distribution {q(x)} such that no matter what particular value {p(x)} takes, {q(x)} will still make good predictions. In other words (taking {\log q(x)} as our measurement of prediction accuracy) we want {\mathbb{E}_{p'}[\log q(x)]} to be large for all distributions {p'} such that {\mathbb{E}_{p'}[\phi(x)] = \phi^*}. Using a technique called Lagrangian duality, we can both find the optimal distribution {q} and compute its worst-case accuracy over all {p'} with {\mathbb{E}_{p'}[\phi(x)] = \phi^*}. The characterization is as follows: consider all probability distributions {q(x)} that are proportional to {\exp(\lambda^{\top}\phi(x))} for some vector {\lambda}, i.e. {q(x) = \exp(\lambda^{\top}\phi(x))/Z(\lambda)} for some {Z(\lambda)}. Of all of these, take the {q(x)} with the largest value of {\lambda^{\top}\phi^* - \log Z(\lambda)}. Then {q(x)} will be the optimal distribution and the accuracy for all distributions {p'} will be exactly {\lambda^{\top}\phi^* - \log Z(\lambda)}. Furthermore, if {\phi^*} is the empirical expectation given some number of samples, then one can show that {\lambda^{\top}\phi^* - \log Z(\lambda)} is proportional to the log likelihood of {q}, which is why I say that maximum entropy and maximum likelihood are dual to each other.

This is a relatively simple result but it underlies a decent chunk of models used in practice.
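To make the duality concrete, here is a minimal numerical sketch (my own illustration, with a made-up constraint): the maximum-entropy distribution on a six-sided die with mean {4.5}, found by gradient ascent on the dual objective {\lambda^{\top}\phi^* - \log Z(\lambda)}.

```python
# Maximum entropy on a die given only its mean: maximize the dual objective
# lambda * phi_star - log Z(lambda); the optimum is q(x) proportional to exp(lambda * x).
import numpy as np

xs = np.arange(1, 7, dtype=float)   # die faces 1..6
phi = xs                            # single feature phi(x) = x
phi_star = 4.5                      # observed empirical mean (made-up value)

lam = 0.0
for _ in range(2000):
    q = np.exp(lam * phi); q /= q.sum()   # q(x) = exp(lambda * phi(x)) / Z(lambda)
    lam += 0.1 * (phi_star - q @ phi)     # gradient ascent on the concave dual

q = np.exp(lam * phi); q /= q.sum()
print("lambda:", round(lam, 3))
print("q:", np.round(q, 3), " E_q[phi]:", round(float(q @ phi), 3))
```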

Myth 11: frequentist methods have no promising approach to computationally bounded inference. I would personally argue that frequentist methods are more promising than Bayesian methods at handling computational constraints, although computationally bounded inference is a very cutting edge area and I’m sure other experts would disagree. However, one point in favor of the frequentist approach here is that we already have some frameworks, such as the “tightening relaxations” framework discussed here, that provide quite elegant and rigorous ways of handling computationally intractable models.

References

(Myth 3) Sparse recovery: Sparse recovery using sparse matrices
(Myth 5) Online learning: Online learning and online convex optimization
(Myth 8) Robust statistics: see this blog post and the two linked papers
(Myth 10) Maximum entropy duality: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory