Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems

Introduction

There has been much recent discussion about AI risk, meaning specifically the potential pitfalls (both short-term and long-term) that AI with improved capabilities could create for society. Discussants include AI researchers such as Stuart Russell, Eric Horvitz, and Tom Dietterich; entrepreneurs such as Elon Musk and Bill Gates; and research institutes such as the Machine Intelligence Research Institute (MIRI) and the Future of Humanity Institute (FHI), whose director, Nick Bostrom, has even written a bestselling book on this topic. Finally, ten million dollars in funding has been earmarked for research on ensuring that AI will be safe and beneficial. Given this, I think it would be useful for AI researchers to discuss the nature and extent of risks that might be posed by increasingly capable AI systems, both short-term and long-term. As a PhD student in machine learning and artificial intelligence, I will use this essay to describe my own views on AI risk, in the hope of encouraging other researchers to detail their thoughts as well.

For the purposes of this essay, I will define “AI” to be technology that can carry out tasks with limited or no human guidance, “advanced AI” to be technology that can perform substantially more complex and domain-general tasks than are possible today, and “highly capable AI” to be technology that can outperform humans in all or almost all domains. As the primary target audience of this essay is other researchers, I have used technical terms (e.g. weakly supervised learning, inverse reinforcement learning) whenever they were useful, though I have also tried to make the essay more generally accessible when possible.

Outline

I think it is important to distinguish between two questions. First, does artificial intelligence merit the same degree of engineering safety considerations as other technologies (such as bridges)? Second, does artificial intelligence merit additional precautions, beyond those that would be considered typical? I will argue that the answer is yes to the first, even in the short term, and that current engineering methodologies in the field of machine learning do not provide even a typical level of safety or robustness. Moreover, I will argue that the answer to the second question in the long term is likely also yes — namely, that there are important ways in which highly capable artificial intelligence could pose risks which are not addressed by typical engineering concerns.

The point of this essay is not to be alarmist; indeed, I think that AI is likely to be net-positive for humanity. Rather, the point of this essay is to encourage a discussion about the potential pitfalls posed by artificial intelligence, since I believe that research done now can mitigate many of these pitfalls. Without such a discussion, we are unlikely to understand which pitfalls are most important or likely, and thus unable to design effective research programs to prevent them.

A common objection to discussing risks posed by AI is that it seems too early to worry about such risks, and that the discussion will be more germane once the field of AI has advanced further. I think this objection is quite reasonable in the abstract; however, as I will argue below, I think we do have a reasonable understanding of at least some of the risks that AI might pose, that some of these risks will be realized even in the medium term, and that there are reasonable programs of research that can address them, which in many cases would also have the advantage of improving the usability of existing AI systems.

Ordinary Engineering

There are many issues related to AI safety that are just a matter of good engineering methodology. For instance, we would ideally like systems that are transparent, modular, robust, and work under well-understood assumptions. Unfortunately, machine learning as a field has not developed very good methodologies for obtaining any of these things, and so this is an important issue to remedy. In other words, I think we should put at least as much thought into building an AI as we do into building a bridge.

Just to be very clear, I do not think that machine learning researchers are bad engineers; looking at any of the open source tools such as Torch, Caffe, MLlib, and others makes it clear that many machine learning researchers are also good software engineers. Rather, I think that as a field our methodologies are not mature enough to address the specific engineering desiderata of statistical models (in contrast to the algorithms that create them). In particular, the statistical models obtained from machine learning algorithms tend to be:

  1. Opaque: Many machine learning models consist of hundreds of thousands of parameters, making it difficult to understand how predictions are made. Typically, practitioners resort to error analysis examining the covariates that most strongly influence each incorrect prediction. However, this is not a very sustainable long-term solution, as it requires substantial effort even for relatively narrow-domain systems.
  2. Monolithic: In part due to their opacity, models act as a black box, with no modularity or encapsulation of behavior. Though machine learning systems are often split into pipelines of smaller models, the lack of encapsulation can make these pipelines even harder to manage than a single large model; indeed, since machine learning models are by design optimized for a particular input distribution (i.e. whatever distribution they are trained on), we end up in a situation where “Changing Anything Changes Everything” [1].
  3. Fragile: As another consequence of being optimized for a particular training distribution, machine learning models can have arbitrarily poor performance when that distribution shifts. For instance, Daumé and Marcu [2] show that a named entity classifier with 92% accuracy on one dataset drops to 58% accuracy on a superficially similar dataset (a toy illustration of this kind of fragility appears in the sketch after this list). Though such issues are partially addressed by work on transfer learning and domain adaptation [3], these areas are not very developed compared to supervised learning.
  4. Poorly understood: Beyond their fragility, understanding when a machine learning model will work is difficult. We know that a model will work if it is tested on the same distribution it is trained on, and have some extensions beyond this case (e.g. based on robust optimization [4]), but we have very little in the way of practically relevant conditions under which a model trained in one situation will work well in another situation.
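
To make the fragility point concrete, here is a minimal synthetic sketch (my own caricature, not the Daumé and Marcu experiment): a model that fits well on the distribution it was trained on can be essentially useless on a shifted one.

```python
# A toy illustration of fragility under distribution shift. The data, the
# choice of sin(x) as the "true" relationship, and all constants are made up.
import numpy as np

rng = np.random.RandomState(0)

def true_function(x):
    return np.sin(x)

x_train = rng.uniform(-1.0, 1.0, 500)   # training inputs near zero, where sin(x) is nearly linear
x_shift = rng.uniform(3.0, 5.0, 500)    # shifted inputs, outside the training range
y_train = true_function(x_train) + 0.05 * rng.randn(500)
y_shift = true_function(x_shift) + 0.05 * rng.randn(500)

# A linear model is a fine approximation on the training distribution...
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def mean_squared_error(x, y):
    return np.mean((slope * x + intercept - y) ** 2)

# ...but its error blows up once the input distribution shifts.
print("MSE on the training distribution:", round(mean_squared_error(x_train, y_train), 3))
print("MSE after the shift:             ", round(mean_squared_error(x_shift, y_shift), 3))
```

The model was never “wrong” on the data it saw; it simply had no reason to behave sensibly anywhere else, which is the basic mechanism behind the fragility described above.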

That these issues plague machine learning systems is likely uncontroversial among machine learning researchers. However, in comparison to research focused on extending capabilities, very little is being done to address them. Research in this area therefore seems particularly impactful, especially given the desire to deploy machine learning systems in increasingly complex and safety-critical situations.

Extraordinary Engineering

Does AI merit additional safety precautions, beyond those that are considered standard engineering practice in other fields? Here I am focusing only on the long-term impacts of advanced or highly capable AI systems.

My tentative answer is yes; there seem to be a few different ways in which AI could have bad effects, each of which seems individually unlikely but not implausible. Even if each of the risks identified so far is unlikely, (i) the sum of the probabilities might get large, especially if there are additional unidentified risks, and (ii) the existence of multiple “near-misses” motivates closer investigation, as it may suggest some underlying principle that makes AI risk-laden. In what follows I will focus on so-called “global catastrophic” risks, meaning risks that could affect a large fraction of the earth’s population in a material way. I have chosen to focus on these risks because I think there is an important difference between an AI system messing up in a way that harms a few people (which would be a legal liability but perhaps should not motivate a major effort in terms of precautions) and an AI system that could cause damage on a global scale. The latter would justify substantial precautions, and I want to make it clear that this is the bar I am setting for myself.

With that in place, below are a few ways in which advanced or highly capable AI could have specific global catastrophic risks.

Cyber-attacks. There are two trends which taken together make the prospect of AI-aided cyber-attacks seem worrisome. The first trend is simply the increasing prevalence of cyber-attacks; even this year we have seen Russia attack Ukraine, North Korea attack Sony, and China attack the U.S. Office of Personnel Management. Secondly, the “Internet of Things” means that an increasing number of physical devices will be connected to the internet. Assuming that software exists to autonomously control them, many internet-enabled devices such as cars could be hacked and then weaponized, leading to a decisive military advantage in a short span of time. Such an attack could be enacted by a small group of humans aided by AI technologies, which would make it hard to detect in advance. Unlike other weaponizable technology such as nuclear fission or synthetic biology, it would be very difficult to control the distribution of AI since it does not rely on any specific raw materials. Finally, note that even a team with relatively small computing resources could potentially “bootstrap” to much more computing power by first creating a botnet with which to do computations; to date, the largest botnet has spanned 30 million computers and several other botnets have exceeded 1 million.

Autonomous weapons. Beyond cyber-attacks, improved autonomous robotics technology combined with ubiquitous access to miniature UAVs (“drones”) could allow both terrorists and governments to wage a particularly pernicious form of remote warfare by creating weapons that are both cheap and hard to detect or defend against (due to their small size and high maneuverability). Beyond direct malicious intent, if autonomous weapons systems or other powerful autonomous systems malfunction then they could cause a large amount of damage.

Mis-optimization. A highly capable AI could acquire a large amount of power but pursue an overly narrow goal, and end up harming humans or human value while optimizing for this goal. This may seem implausible at face value, but as I will argue below, it is easier to improve AI capabilities than to improve AI values, making such a mishap possible in theory.

Unemployment. It is already the case that increased automation is decreasing the number of available jobs, to the extent that some economists and policymakers are discussing what to do if the number of jobs is systematically smaller than the number of people seeking work. If AI systems allow a large number of jobs to be automated over a relatively short time period, then we may not have time to plan or implement policy solutions, and there could then be a large unemployment spike. In addition to the direct effects on the people who are unemployed, such a spike could also have indirect consequences by decreasing social stability on a global scale.

Opaque systems. It is also already the case that increasingly many tasks are being delegated to autonomous systems, from trades in financial markets to aggregation of information feeds. The opacity of these systems has led to issues such as the 2010 Flash Crash and will likely lead to larger issues in the future. In the long term, as AI systems become increasingly complex, humans may lose the ability to meaningfully understand or intervene in such systems, which could lead to a loss of sovereignty if autonomous systems are employed in executive-level functions (e.g. government, economy).

Beyond these specific risks, it seems clear that, eventually, AI will be able to outperform humans in essentially every domain. At that point, it seems doubtful that humanity will continue to have direct causal influence over its future unless specific measures are put in place to ensure this. While I do not think this day will come soon, I think it is worth thinking now about how we might meaningfully control highly capable AI systems, and I also think that many of the risks posed above (as well as others that we haven’t thought of yet) will occur on a somewhat shorter time scale.

Let me end with some specific ways in which control of AI may be particularly difficult compared to other human-engineered systems:

  1. AI may be “agent-like”, which means that the space of possible behaviors is much larger; our intuitions about how AI will act in pursuit of a given goal may not account for this and so AI behavior could be hard to predict.
  2. Since an AI would presumably learn from experience, and would likely run at a much faster serial processing speed than humans, its capabilities may change rapidly, ruling out the usual process of trial and error.
  3. AI will act in a much more open-ended domain. In contrast, our existing tools for specifying the necessary properties of a system only work well in narrow domains. For instance, for a bridge, safety relates to the ability to successfully accomplish a small number of tasks (e.g. not falling over). For these, it suffices to consider well-characterized engineering properties such as tensile strength. For AI, the number of tasks we would potentially want it to perform is large, and it is unclear how to obtain a small number of well-characterized properties that would ensure safety.
  4. Existing machine learning frameworks make it very easy for AI to acquire knowledge, but hard to acquire values. For instance, while an AI’s model of reality is flexibly learned from data, its goal/utility function is hard-coded in almost all situations; an exception is some work on inverse reinforcement learning [5], but this is still a very nascent framework. Importantly, the asymmetry between knowledge (and hence capabilities) and values is fundamental, rather than simply a statement about existing technologies. This is because knowledge is something that is regularly informed by reality, whereas values are only weakly informed by reality: an AI which learns incorrect facts could notice that it makes wrong predictions, but the world might never “tell” an AI that it learned the “wrong values”. At a technical level, while many tasks in machine learning are fully supervised or at least semi-supervised, value acquisition is a weakly supervised task.

In summary: there are several concrete global catastrophic risks posed by highly capable AI, and there are also several reasons to believe that highly capable AI would be difficult to control. Together, these suggest to me that the control of highly capable AI systems is an important problem posing unique research challenges.

Long-term Goals, Near-term Research

Above I presented an argument for why AI, in the long term, may require substantial precautionary efforts. Beyond this, I also believe that there is important research that can be done right now to reduce long-term AI risks. In this section I will elaborate on some specific research projects, though my list is not meant to be exhaustive.

  1. Value learning: In general, it seems important in the long term (and also in the short term) to design algorithms for learning values / goal systems / utility functions, rather than requiring them to be hand-coded. One framework for this is inverse reinforcement learning [5], though developing additional frameworks would also be useful. (A toy sketch of learning a utility function from observed behavior appears after this list.)
  2. Weakly supervised learning: As argued above, inferring values, in contrast to beliefs, is an at most weakly supervised problem, since humans themselves are often incorrect about what they value and so any attempt to provide fully annotated training data about values would likely contain systematic errors. It may be possible to infer values indirectly through observing human actions; however, since humans often act immorally and human values change over time, current human actions are not consistent with our ideal long-term values, and so learning from actions in a naive way could lead to problems. Therefore, a better fundamental understanding of weakly supervised learning — particularly regarding guaranteed recovery of indirectly observed parameters under well-understood assumptions — seems important.
  3. Formal specification / verification: One way to make AI safer would be to formally specify desiderata for its behavior, and then prove that these desiderata are met. A major open challenge is to figure out how to meaningfully specify formal properties for an AI system. For instance, even if a speech transcription system did a near-perfect job of transcribing speech, it is unclear what sort of specification language one might use to state this property formally. Beyond this, though there is much existing work in formal verification, it is still extremely challenging to verify large systems.
  4. Transparency: To the extent that the decision-making process of an AI is transparent, it should be relatively easy to ensure that its impact will be positive. To the extent that the decision-making process is opaque, it should be relatively difficult to do so. Unfortunately, transparency seems difficult to obtain, especially for AIs that reach decisions through complex series of serial computations. Therefore, better techniques for rendering AI reasoning transparent seem important.
  5. Strategic assessment and planning: Better understanding of the likely impacts of AI will allow a better response. To this end, it seems valuable to map out and study specific concrete risks; for instance, better understanding ways in which machine learning could be used in cyber-attacks, or forecasting the likely effects of technology-driven unemployment, and determining useful policies around these effects. It would also be clearly useful to identify additional plausible risks beyond those of which we are currently aware. Finally, thought experiments surrounding different possible behaviors of advanced AI would help inform intuitions and point to specific technical problems. Some of these tasks are most effectively carried out by AI researchers, while others should be done in collaboration with economists, policy experts, security experts, etc.
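
To make the first direction slightly more concrete, here is a minimal sketch of inferring a hidden utility function from observed choices. It is not the inverse reinforcement learning algorithm of [5]; it assumes a much simpler, Boltzmann-rational agent that picks options with probability proportional to the exponential of their utility, and recovers the utilities by maximum likelihood. All of the quantities in it are synthetic.

```python
# A toy sketch of value learning: recover a hidden utility function from
# observed choices, assuming the chooser is Boltzmann-rational. This is a
# simplified stand-in for inverse reinforcement learning, not the algorithm
# of Ng and Russell [5]; every number here is made up.
import numpy as np

rng = np.random.RandomState(0)
true_utility = np.array([0.0, 1.0, -0.5, 2.0])   # hidden values over four options
n_options = len(true_utility)

def softmax(u):
    z = np.exp(u - u.max())
    return z / z.sum()

# Observe the agent choosing options with probability proportional to exp(utility).
choices = rng.choice(n_options, size=5000, p=softmax(true_utility))
counts = np.bincount(choices, minlength=n_options)

# Fit utilities by gradient ascent on the (concave) log-likelihood of the choices.
u_hat = np.zeros(n_options)
for _ in range(2000):
    grad = counts / counts.sum() - softmax(u_hat)   # observed minus predicted frequencies
    u_hat += 1.0 * grad
u_hat -= u_hat[0]   # utilities are only identified up to an additive constant

print("true utilities     :", true_utility)
print("recovered utilities:", np.round(u_hat, 2))
```

Even in this toy setting the utilities are only identified up to an additive constant, a small instance of the broader point that observed behavior only weakly constrains values.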

The above constitute at least five concrete directions of research on which I think important progress can be made today, which would meaningfully improve the safety of advanced AI systems and which in many cases would likely have ancillary benefits in the short term, as well.

Related Work

At a high level, while I have implicitly provided a program of research above, there are other proposed research programs as well. Perhaps the earliest proposed program is from MIRI [6], which has focused on AI alignment problems that arise even in simplified settings (e.g. with unlimited computing power or easy-to-specify goals) in hopes of later generalizing to more complex settings. The Future of Life Institute (FLI) has also published a research priorities document [7, 8] with a broader focus, including non-technical topics such as regulation of autonomous weapons and economic shifts induced by AI-based technologies. I do not necessarily endorse either document, but think that both represent a big step in the right direction. Ideally, MIRI, FLI, and others will all justify why they think their problems are worth working on and we can let the best arguments and counterarguments rise to the top. This is already happening to some extent [9, 10, 11] but I would like to see more of it, especially from academics with expertise in machine learning and AI [12, 13].

In addition, several specific arguments I have advanced are similar to those already advanced by others. The issue of AI-driven unemployment has been studied by Brynjolfsson and McAfee [14], and is also discussed in the FLI research document. The problem of AI pursuing narrow goals has been elaborated through Bostrom’s “paperclipping argument” [15] as well as the orthogonality thesis [16], which states that beliefs and values are independent of each other. While I disagree with the orthogonality thesis in its strongest form, the arguments presented above for the difficulty of value learning can in many cases reach similar conclusions.

Omohundro [17] has argued that advanced agents would pursue certain instrumentally convergent drives under almost any value system, which is one way in which agent-like systems differ from systems without agency. Good [18] was the first to argue that AI capabilities could improve rapidly. Yudkowsky has argued that it would be easy for an AI to acquire power given few initial resources [19], though his example assumes the creation of advanced biotechnology.

Christiano has argued for the value of transparent AI systems, and proposed the “advisor games” framework as a potential operationalization of transparency [20].

Conclusion

To ensure the safety of AI systems, additional research is needed, both to meet ordinary short-term engineering desiderata as well as to make the additional precautions specific to highly capable AI systems. In both cases, there are clear programs of research that can be undertaken today, which in many cases seem to be under-researched relative to their potential societal value. I therefore think that well-directed research towards improving the safety of AI systems is a worthwhile undertaking, with the additional benefit of motivating interesting new directions of research.

Acknowledgments

Thanks to Paul Christiano, Holden Karnofsky, Percy Liang, Luke Muehlhauser, Nick Beckstead, Nate Soares, and Howie Lempel for providing feedback on a draft of this essay.

References

[1] D. Sculley, et al. Machine Learning: The High-Interest Credit Card of Technical Debt. 2014.
[2] Hal Daumé III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, pages 101–126, 2006.
[3] Sinno J. Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[4] Dimitris Bertsimas, David B. Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.
[5] Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.
[6] Nate Soares and Benja Fallenstein. Aligning Superintelligence with Human Interests: A Technical Research Agenda. 2014.
[7] Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. 2015.
[8] Daniel Dewey, Stuart Russell, and Max Tegmark. A survey of research questions for robust and beneficial AI. 2015.
[9] Paul Christiano. The Steering Problem. 2015.
[10] Paul Christiano. Stable self-improvement as an AI safety problem. 2015.
[11] Luke Muehlhauser. How to study superintelligence strategy. 2014.
[12] Stuart Russell. Of Myths and Moonshine. 2014.
[13] Tom Dietterich and Eric Horvitz. Benefits and Risks of Artificial Intelligence. 2015.
[14] Erik Brynjolfsson and Andrew McAfee. The second machine age: work, progress, and prosperity in a time of brilliant technologies. WW Norton & Company, 2014.
[15] Nick Bostrom (2003). Ethical Issues in Advanced Artificial Intelligence. Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence.
[16] Nick Bostrom. “The superintelligent will: Motivation and instrumental rationality in advanced artificial agents.” Minds and Machines 22.2 (2012): 71-85.
[17] Stephen M. Omohundro (2008). The Basic AI Drives. Frontiers in Artificial Intelligence and Applications (IOS Press).
[18] Irving J. Good. “Speculations concerning the first ultraintelligent machine.” Advances in computers 6.99 (1965): 31-83.
[19] Eliezer Yudkowsky. “Artificial intelligence as a positive and negative factor in global risk.” Global catastrophic risks 1 (2008): 303.
[20] Paul Christiano. Advisor Games. 2015.

A Fervent Defense of Frequentist Statistics

[Highlights for the busy: debunking standard “Bayes is optimal” arguments; frequentist Solomonoff induction; and a description of the online learning framework.]

Short summary. This essay makes many points, each of which I think is worth reading, but if you are only going to understand one point I think it should be “Myth 5” below, which describes the online learning framework as a response to the claim that frequentist methods need to make strong modeling assumptions. Among other things, online learning allows me to perform the following remarkable feat: if I’m betting on horses, and I get to place bets after watching other people bet but before seeing which horse wins the race, then I can guarantee that after a relatively small number of races, I will do almost as well overall as the best other person, even if the number of other people is very large (say, 1 billion), and their performance is correlated in complicated ways.

If you’re only going to understand two points, then also read about the frequentist version of Solomonoff induction, which is described in “Myth 6”.

Main article. I’ve already written one essay on Bayesian vs. frequentist statistics. In that essay, I argued for a balanced, pragmatic approach in which we think of the two families of methods as a collection of tools to be used as appropriate. Since I’m currently feeling contrarian, this essay will be far less balanced and will argue explicitly against Bayesian methods and in favor of frequentist methods. I hope this will be forgiven, since so much other writing goes in the opposite direction, unabashedly defending Bayes. I should note that this essay is partially inspired by some of Cosma Shalizi’s blog posts, such as this one.

This essay will start by listing a series of myths, then debunk them one-by-one. My main motivation for this is that Bayesian approaches seem to be highly popularized, to the point that one may get the impression that they are the uncontroversially superior method of doing statistics. I actually think the opposite is true: I think most statisticians would for the most part defend frequentist methods, although there are also many departments that are decidedly Bayesian (e.g. many places in England, as well as some U.S. universities like Columbia). I have a lot of respect for many of the people at these universities, such as Andrew Gelman and Philip Dawid, but I worry that many of the other proponents of Bayes (most of them non-statisticians) tend to oversell Bayesian methods or undersell alternative methodologies.

If you are like me from, say, two years ago, you are firmly convinced that Bayesian methods are superior and that you have knockdown arguments in favor of this. If this is the case, then I hope this essay will give you an experience that I myself found life-altering: the experience of having a way of thinking that seemed unquestionably true slowly dissolve into just one of many imperfect models of reality. This experience helped me gain more explicit appreciation for the skill of viewing the world from many different angles, and of distinguishing between a very successful paradigm and reality.

If you are not like me, then you may have had the experience of bringing up one of many reasonable objections to normative Bayesian epistemology, and having it shot down by one of many “standard” arguments that seem wrong but not for easy-to-articulate reasons. I hope to lend some reprieve to those of you in this camp, by providing a collection of “standard” replies to these standard arguments.

I will start with the myths (and responses) that I think will require the least technical background and be most interesting to a general audience. Toward the end, I deal with some attacks on frequentist methods that I believe amount to technical claims that are demonstrably false; doing so involves more math. Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method, even though I think there’s actually a fair amount of variation among these methods, although also a fair amount of overlap (e.g. I’m throwing in statistical learning theory with minimax estimation, which certainly have a lot of overlap in ideas but were also in some sense developed by different communities).

The Myths:

  • Bayesian methods are optimal.
  • Bayesian methods are optimal except for computational considerations.
  • We can deal with computational constraints simply by making approximations to Bayes.
  • The prior isn’t a big deal because Bayesians can always share likelihood ratios.
  • Frequentist methods need to assume their model is correct, or that the data are i.i.d.
  • Frequentist methods can only deal with simple models, and make arbitrary cutoffs in model complexity (aka: “I’m Bayesian because I want to do Solomonoff induction”).
  • Frequentist methods hide their assumptions while Bayesian methods make assumptions explicit.
  • Frequentist methods are fragile, Bayesian methods are robust.
  • Frequentist methods are responsible for bad science.
  • Frequentist methods are unprincipled/hacky.
  • Frequentist methods have no promising approach to computationally bounded inference.

Myth 1: Bayesian methods are optimal. Presumably when most people say this they are thinking of either Dutch-booking or the complete class theorem. Roughly, these say the following:

Dutch-book argument: Every coherent set of beliefs can be modeled as a subjective probability distribution. (Roughly, coherent means “unable to be Dutch-booked”.)

Complete class theorem: Every non-Bayesian method is worse than some Bayesian method (in the sense of performing deterministically at least as poorly in every possible world).

Let’s unpack both of these. My high-level argument regarding Dutch books is that I would much rather spend my time trying to correspond with reality than trying to be internally consistent. More concretely, the Dutch-book argument says that if for every bet you force me to take one side or the other, then unless I’m Bayesian there’s a collection of bets that will cause me to lose money for sure. I don’t find this very compelling. This seems analogous to the situation where there’s some quant at Jane Street, and they’re about to run code that will make thousands of dollars trading stocks, and someone comes up to them and says “Wait! You should add checks to your code to make sure that no subset of your trades will lose you money!” This just doesn’t seem worth the quant’s time, it will slow down the code substantially, and instead the quant should be writing the next program to make thousands more dollars. This is basically what dutch-booking arguments seem like to me.

Moving on, the complete class theorem says that for any decision rule, I can do better by replacing it with some Bayesian decision rule. But this injunction is not useful in practice, because it doesn’t say anything about which decision rule I should replace it with. Of course, if you hand me a decision rule and give me infinite computational resources, then I can hand you back a Bayesian method that will perform better. But it still might not perform well. All the complete class theorem says is that every local optimum is Bayesian. To be a useful theory of epistemology, I need a prescription for how, in the first place, I am to arrive at a good decision rule, not just a locally optimal one. And this is something that frequentist methods do provide, to a far greater extent than Bayesian methods (for instance by using minimax decision rules such as the maximum-entropy example given later). Note also that many frequentist methods do correspond to a Bayesian method for some appropriately chosen prior. But the crucial point is that the frequentist told me how to pick a prior I would be happy with (also, many frequentist methods don’t correspond to a Bayesian method for any choice of prior; they nevertheless often perform quite well).

Myth 2: Bayesian methods are optimal except for computational considerations. We already covered this in the previous point under the complete class theorem, but to re-iterate: Bayesian methods are locally optimal, not globally optimal. Identifying all the local optima is very different from knowing which of them is the global optimum. I would much rather have someone hand me something that wasn’t a local optimum but was close to the global optimum, than something that was a local optimum but was far from the global optimum.

Myth 3: We can deal with computational constraints simply by making approximations to Bayes. I have rarely seen this borne out in practice. Here’s a challenge: suppose I give you data generated in the following way. There is a collection of vectors {x_1}, {x_2}, {\ldots}, {x_{10,000}}, each with {10^6} coordinates. I generate outputs {y_1}, {y_2}, {\ldots}, {y_{10,000}} in the following way. First I globally select {100} of the {10^6} coordinates uniformly at random, then I select a fixed vector {u} such that those {100} coordinates are drawn from i.i.d. Gaussians and the rest of the coordinates are zero. Now I set {y_n = u^{\top}x_n} (i.e. {y_n} is the dot product of {u} with {x_n}). You are given {x} and {y}, and your job is to infer {u}. This is a completely well-specified problem, the only task remaining is computational. I know people who have solved this problem using Bayesian methods with approximate inference. I have respect for these people, because doing so is no easy task. I think very few of them would say that “we can just approximate Bayesian updating and be fine”. (Also, this particular problem can be solved trivially with frequentist methods.)
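
For concreteness, here is a scaled-down sketch of the recovery problem above (1,000 coordinates and 10 active ones instead of {10^6} and {100}), solved with the Lasso, a standard frequentist tool for sparse recovery; the dimensions and the regularization strength are my own choices for illustration.

```python
# A scaled-down version of the sparse recovery challenge above, solved with
# the Lasso. Dimensions are shrunk so the sketch runs in seconds.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, d, k = 200, 1000, 10                          # observations, coordinates, active coordinates

X = rng.randn(n, d)                              # the observed vectors x_n
u = np.zeros(d)
support = rng.choice(d, size=k, replace=False)   # the k secretly chosen coordinates
u[support] = rng.randn(k)
y = X @ u                                        # y_n is the dot product of u with x_n

u_hat = Lasso(alpha=0.01, max_iter=100000).fit(X, y).coef_
print("largest error over all coordinates:", np.abs(u_hat - u).max())
print("estimated number of active coordinates:", int(np.sum(np.abs(u_hat) > 1e-3)))
```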

A particularly egregious example of this is when people talk about “computable approximations to Solomonoff induction” or “computable approximations to AIXI” as if such notions were meaningful.

Myth 4: the prior isn’t a big deal because Bayesians can always share likelihood ratios. Putting aside the practical issue that there would in general be an infinite number of likelihood ratios to share, there is the larger issue that for any hypothesis {h}, there is also the hypothesis {h'} that matches {h} exactly up to now, and then predicts the opposite of {h} at all points in the future. You have to constrain model complexity at some point, the question is about how. To put this another way, sharing my likelihood ratios without also constraining model complexity (by focusing on a subset of all logically possible hypotheses) would be equivalent to just sharing all sensory data I’ve ever accrued in my life. To the extent that such a notion is even possible, I certainly don’t need to be a Bayesian to do such a thing.

Myth 5: frequentist methods need to assume their model is correct or that the data are i.i.d. Understanding the content of this section is the most important single insight to gain from this essay. For some reason it’s assumed that frequentist methods need to make strong assumptions (such as Gaussianity), whereas Bayesian methods are somehow immune to this. In reality, the opposite is true. While there are many beautiful and deep frequentist formalisms that answer this, I will choose to focus on one of my favorites, which is online learning.

To explain the online learning framework, let us suppose that our data are {(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)}. We don’t observe {y_t} until after making a prediction {z_t} of what {y_t} will be, and then we receive a penalty {L(y_t, z_t)} based on how incorrect we were. So we can think of this as receiving prediction problems one-by-one, and in particular we make no assumptions about the relationship between the different problems; they could be i.i.d., they could be positively correlated, they could be anti-correlated, they could even be adversarially chosen.

As a running example, suppose that I’m betting on horses and before each race there are {n} other people who give me advice on which horse to bet on. I know nothing about horses, so based on this advice I’d like to devise a good betting strategy. In this case, {x_t} would be the {n} bets that each of the other people recommend, {z_t} would be the horse that I actually bet on, and {y_t} would be the horse that actually wins the race. Then, supposing that {y_t = z_t} (i.e., the horse I bet on actually wins), {L(y_t, z_t)} is the negative of the payoff from correctly betting on that horse. Otherwise, if the horse I bet on doesn’t win, {L(y_t, z_t)} is the cost I had to pay to place the bet.

If I’m in this setting, what guarantee can I hope for? I might ask for an algorithm that is guaranteed to make good bets — but this seems impossible unless the people advising me actually know something about horses. Or, at the very least, one of the people advising me knows something. Motivated by this, I define my regret to be the difference between my penalty and the penalty of the best of the {n} people (note that I only have access to the latter after all {T} rounds of betting). More formally, given a class {\mathcal{M}} of predictors {h : x \mapsto z}, I define

\displaystyle \mathrm{Regret}(T) = \frac{1}{T} \sum_{t=1}^T L(y_t, z_t) - \min_{h \in \mathcal{M}} \frac{1}{T} \sum_{t=1}^T L(y_t, h(x_t))

In this case, {\mathcal{M}} would have size {n} and the {i}th predictor would just always follow the advice of person {i}. The regret is then how much worse I do on average than the best expert. A remarkable fact is that, in this case, there is a strategy such that {\mathrm{Regret}(T)} shrinks at a rate of {\sqrt{\frac{\log(n)}{T}}}. In other words, I can have an average score within {\epsilon} of the best advisor after {\frac{\log(n)}{\epsilon^2}} rounds of betting.

One reason that this is remarkable is that it does not depend at all on how the data are distributed; the data could be i.i.d., positively correlated, negatively correlated, even adversarial, and one can still construct an (adaptive) prediction rule that does almost as well as the best predictor in the family.

To be even more concrete, if we assume that all costs and payoffs are bounded by {\$1} per round, and that there are {1,000,000,000} people in total, then an explicit upper bound is that after {28/\epsilon^2} rounds, we will be within {\epsilon} dollars on average of the best other person. Under slightly stronger assumptions, we can do even better, for instance if the best person has an average variance of {0.1} about their mean, then the {28} can be replaced with {4.5}.
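
For the curious, here is a minimal sketch of one strategy that achieves this kind of guarantee, the exponential-weights (“Hedge”) algorithm, run on a synthetic loss matrix of my own construction. The only assumption the algorithm itself makes is that per-round losses lie in [0, 1]; nothing in the update cares how the losses were generated.

```python
# A sketch of the exponential-weights ("Hedge") strategy. The loss matrix here
# is synthetic and slightly favors one advisor, but the algorithm would run
# unchanged on correlated or even adversarially chosen losses.
import numpy as np

rng = np.random.RandomState(0)
T, n = 10000, 1000                # rounds and number of advisors
eta = np.sqrt(8 * np.log(n) / T)  # standard step size for losses in [0, 1]

losses = rng.rand(T, n)           # losses[t, i] is advisor i's loss in round t
losses[:, 7] -= 0.1               # advisor 7 is slightly better than the rest
losses = np.clip(losses, 0.0, 1.0)

weights = np.ones(n)
my_loss = 0.0
for t in range(T):
    p = weights / weights.sum()          # follow advisor i with probability p[i]
    my_loss += p @ losses[t]             # expected loss of the randomized bet
    weights *= np.exp(-eta * losses[t])  # downweight advisors that did badly

best_loss = losses.sum(axis=0).min()     # best advisor in hindsight
print("average regret:", (my_loss - best_loss) / T)
print("theoretical bound sqrt(log(n)/(2T)):", np.sqrt(np.log(n) / (2 * T)))
```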

It is important to note that the betting scenario is just a running example, and one can still obtain regret bounds under fairly general scenarios; {\mathcal{M}} could be continuous and {L} could have quite general structure; the only technical assumption is that {\mathcal{M}} be a convex set and that {L} be a convex function of {z}. These assumptions tend to be easy to satisfy, though I have run into a few situations where they end up being problematic, mainly for computational reasons. For an {n}-dimensional model family, typically {\mathrm{Regret}(T)} decreases at a rate of {\sqrt{\frac{n}{T}}}, although under additional assumptions this can be reduced to {\sqrt{\frac{\log(n)}{T}}}, as in the betting example above. I would consider this reduction to be one of the crowning results of modern frequentist statistics.

Yes, these guarantees sound incredibly awesome and perhaps too good to be true. They actually are that awesome, and they are actually true. The work is being done by measuring the error relative to the best model in the model family. We aren’t required to do well in an absolute sense; we just need to not do any worse than the best model. So as long as at least one of the models in our family makes good predictions, we will do well too. This is really what statistics is meant to be doing: you come up with everything you imagine could possibly be reasonable, and hand it to me, and then I come up with an algorithm that will figure out which of the things you handed me was most reasonable, and will do almost as well as that. As long as at least one of the things you come up with is good, then my algorithm will do well. Importantly, due to the {\log(n)} dependence on the dimension of the model family, you can actually write down extremely broad classes of models and I will still successfully sift through them.

Let me stress again: regret bounds are saying that, no matter how the {x_t} and {y_t} are related, no i.i.d. assumptions anywhere in sight, we will do almost as well as any predictor {h} in {\mathcal{M}} (in particular, almost as well as the best predictor).

Myth 6: frequentist methods can only deal with simple models and need to make arbitrary cutoffs in model complexity. A naive perusal of the literature might lead one to believe that frequentists only ever consider very simple models, because many discussions center on linear and log-linear models. To dispel this, I will first note that there are just as many discussions that focus on much more general properties such as convexity and smoothness, and that can achieve comparably good bounds in many cases. But more importantly, the reason we focus so much on linear models is because we have already reduced a large family of problems to (log-)linear regression. The key insight, and I think one of the most important insights in all of applied mathematics, is that of featurization: given a non-linear problem, we can often embed it into a higher-dimensional linear problem, via a feature map {\phi : X \rightarrow \mathbb{R}^n} ({\mathbb{R}^n} denotes {n}-dimensional space, i.e. vectors of real numbers of length {n}). For instance, if I think that {y} is a polynomial (say cubic) function of {x}, I can apply the mapping {\phi(x) = (1, x, x^2, x^3)}, and now look for a linear relationship between {y} and {\phi(x)}.
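
As a quick illustration of featurization (with made-up coefficients), here is ordinary least squares recovering a cubic relationship after mapping {x} to {\phi(x) = (1, x, x^2, x^3)}.

```python
# Featurization in one line: a nonlinear (cubic) relationship becomes a linear
# regression problem after mapping x to (1, x, x^2, x^3). The true coefficients
# below are arbitrary choices for this sketch.
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.randn(200)   # noisy cubic

def phi(x):
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares in the feature space is still a linear problem.
w, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
print("recovered coefficients of (1, x, x^2, x^3):", np.round(w, 2))
# These should land close to the true values (1.0, -2.0, 0.0, 0.5).
```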

This insight extends far beyond polynomials. In combinatorial domains such as natural language, it is common to use indicator features: features that are {1} if a certain event occurs and {0} otherwise. For instance, I might have an indicator feature for whether two words appear consecutively in a sentence, whether two parts of speech are adjacent in a syntax tree, or for what part of speech a word has. Almost all state of the art systems in natural language processing work by solving a relatively simple regression task (typically either log-linear or max-margin) over a rich feature space (often involving hundreds of thousands or millions of features, i.e. an embedding into {\mathbb{R}^{10^5}} or {\mathbb{R}^{10^6}}).

A counter-argument to the previous point could be: “Sure, you could create a high-dimensional family of models, but it’s still a parameterized family. I don’t want to be stuck with a parameterized family, I want my family to include all Turing machines!” Putting aside for a second the question of whether “all Turing machines” is a well-advised model choice, this is something that a frequentist approach can handle just fine, using a tool called regularization, which after featurization is the second most important idea in statistics.

Specifically, given any sufficiently quickly growing function {\psi(h)}, one can show that, given {T} data points, there is a strategy whose average error is at most {\sqrt{\frac{\psi(h)}{T}}} worse than any estimator {h}. This can hold even if the model class {\mathcal{M}} is infinite dimensional. For instance, if {\mathcal{M}} consists of all probability distributions over Turing machines, and we let {h_i} denote the probability mass placed on the {i}th Turing machine, then a valid regularizer {\psi} would be

\displaystyle \psi(h) = \sum_i h_i \log(i^2 \cdot h_i)

If we consider this, then we see that, for any probability distribution over the first {2^k} Turing machines (i.e. all Turing machines with description length {\leq k}), the value of {\psi} is at most {\log((2^k)^2) = k\log(4)}. (Here we use the fact that {\psi(h) \leq \sum_i h_i \log(i^2)}, since {h_i \leq 1} and hence {h_i\log(h_i) \leq 0}.) This means that, if we receive roughly {\frac{k}{\epsilon^2}} data points, we will achieve error within {\epsilon} of the best Turing machine that has description length {\leq k}.

Let me note several things here:

  • This strategy makes no assumptions about the data being i.i.d. It doesn’t even assume that the data are computable. It just guarantees that it will perform as well as any Turing machine (or distribution over Turing machines) given the appropriate amount of data.
  • This guarantee holds for any given sufficiently smooth measurement of prediction error (the update strategy depends on the particular error measure).
  • This guarantee holds deterministically, no randomness required (although predictions may need to consist of probability distributions rather than specific points, but this is also true of Bayesian predictions).

Interestingly, in the case that the prediction error is given by the negative log probability assigned to the truth, then the corresponding strategy that achieves the error bound is just normal Bayesian updating. But for other measurements of error, we get different update strategies. Although I haven’t worked out the math, intuitively this difference could be important if the universe is fundamentally unpredictable but our notion of error is insensitive to the unpredictable aspects.

Myth 7: frequentist methods hide their assumptions while Bayesian methods make assumptions explicit. I’m still not really sure where this came from. As we’ve seen numerous times so far, a very common flavor among frequentist methods is the following: I have a model class {\mathcal{M}}, I want to do as well as any model in {\mathcal{M}}; or put another way:

Assumption: At least one model in {\mathcal{M}} has error at most {E}.
Guarantee: My method will have error at most {E + \epsilon}.

This seems like a very explicit assumption with a very explicit guarantee. On the other hand, an argument I hear is that Bayesian methods make their assumptions explicit because they have an explicit prior. If I were to write this as an assumption and guarantee, I would write:

Assumption: The data were generated from the prior.
Guarantee: I will perform at least as well as any other method.

While I agree that this is an assumption and guarantee of Bayesian methods, there are two problems that I have with drawing the conclusion that “Bayesian methods make their assumptions explicit”. The first is that it can often be very difficult to understand how a prior behaves; so while we could say “The data were generated from the prior” is an explicit assumption, it may be unclear what exactly that assumption entails. However, a bigger issue is that “The data were generated from the prior” is an assumption that very rarely holds; indeed, in many cases the underlying process is deterministic (if you’re a subjective Bayesian then this isn’t necessarily a problem, but it does certainly mean that the assumption given above doesn’t hold). So given that that assumption doesn’t hold but Bayesian methods still often perform well in practice, I would say that Bayesian methods are making some other sort of “assumption” that is far less explicit (indeed, I would be very interested in understanding what this other, more nebulous assumption might be).

Myth 8: frequentist methods are fragile, Bayesian methods are robust. This is another one that’s straightforwardly false. First, since frequentist methods often rest on weaker assumptions they are more robust if the assumptions don’t quite hold. Secondly, there is an entire area of robust statistics, which focuses on being robust to adversarial errors in the problem data.

Myth 9: frequentist methods are responsible for bad science. I will concede that much bad science is done using frequentist statistics. But this is true only because pretty much all science is done using frequentist statistics. I’ve heard arguments that using Bayesian methods instead of frequentist methods would fix at least some of the problems with science. I don’t think this is particularly likely, as I think many of the problems come from mis-application of statistical tools or from failure to control for multiple hypotheses. If anything, Bayesian methods would exacerbate the former, because they often require more detailed modeling (although in most simple cases the difference doesn’t matter at all). I don’t think being Bayesian guards against multiple hypothesis testing. Yes, in some sense a prior “controls for multiple hypotheses”, but in general the issue is that the “multiple hypotheses” are never written down in the first place, or are written down and then discarded. One could argue that being in the habit of writing down a prior might make practitioners more likely to think about multiple hypotheses, but I’m not sure this is the first-order thing to worry about.

Myth 10: frequentist methods are unprincipled / hacky. One of the most beautiful theoretical paradigms that I can think of is what I would call the “geometric view of statistics”. One place that does a particularly good job of showcasing this is Shai Shalev-Shwartz’s PhD thesis, which was so beautiful that I cried when I read it. I’ll try (probably futilely) to convey a tiny amount of the intuition and beauty of this paradigm in the next few paragraphs, although focusing on minimax estimation, rather than online learning as in Shai’s thesis.

The geometric paradigm tends to emphasize a view of measurements (i.e. empirical expected values over observed data) as “noisy” linear constraints on a model family. We can control the noise by either taking few enough measurements that the total error from the noise is small (classical statistics), or by broadening the linear constraints to convex constraints (robust statistics), or by controlling the Lagrange multipliers on the constraints (regularization). One particularly beautiful result in this vein is the duality between maximum entropy and maximum likelihood. (I can already predict the Jaynesians trying to claim this result for their camp, but (i) Jaynes did not invent maximum entropy; (ii) maximum entropy is not particularly Bayesian (in the sense that frequentists use it as well); and (iii) the view on maximum entropy that I’m about to provide is different from the view given in Jaynes or by physicists in general [edit: EHeller thinks this last claim is questionable, see discussion here].)

To understand the duality mentioned above, suppose that we have a probability distribution {p(x)} and the only information we have about it is the expected value of a certain number of functions, i.e. the information that {\mathbb{E}[\phi(x)] = \phi^*}, where the expectation is taken with respect to {p(x)}. We are interested in constructing a probability distribution {q(x)} such that no matter what particular value {p(x)} takes, {q(x)} will still make good predictions. In other words (taking log probability as our measurement of prediction accuracy) we want {\mathbb{E}_{p'}[\log q(x)]} to be large for all distributions {p'} such that {\mathbb{E}_{p'}[\phi(x)] = \phi^*}. Using a technique called Lagrangian duality, we can both find the optimal distribution {q} and compute its worst-case accuracy over all {p'} with {\mathbb{E}_{p'}[\phi(x)] = \phi^*}. The characterization is as follows: consider all probability distributions {q(x)} that are proportional to {\exp(\lambda^{\top}\phi(x))} for some vector {\lambda}, i.e. {q(x) = \exp(\lambda^{\top}\phi(x))/Z(\lambda)} for some {Z(\lambda)}. Of all of these, take the {q(x)} with the largest value of {\lambda^{\top}\phi^* - \log Z(\lambda)}. Then {q(x)} will be the optimal distribution and the accuracy for all distributions {p'} will be exactly {\lambda^{\top}\phi^* - \log Z(\lambda)}. Furthermore, if {\phi^*} is the empirical expectation given some number of samples, then one can show that {\lambda^{\top}\phi^* - \log Z(\lambda)} is proportional to the log likelihood of {q}, which is why I say that maximum entropy and maximum likelihood are dual to each other.
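
As a small numeric illustration (with a made-up support and made-up moment constraints), here is the dual maximization carried out directly: maximize {\lambda^{\top}\phi^* - \log Z(\lambda)} over a finite support and check that the resulting {q(x) \propto \exp(\lambda^{\top}\phi(x))} matches the given expectations.

```python
# A toy instance of the maximum-entropy dual: pick lambda to maximize
# lambda^T phi* - log Z(lambda); the induced q(x) then matches the constraints.
# The support (faces of a die) and the target moments are made up.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

xs = np.arange(1, 7)                                 # support: {1, ..., 6}
phi = np.column_stack([xs, xs ** 2]).astype(float)   # feature functions phi(x) = (x, x^2)
phi_star = np.array([4.5, 22.0])                     # the given expectations E[phi(x)]

def neg_dual(lam):
    # negative of lambda^T phi* - log Z(lambda), which is convex in lambda
    return logsumexp(phi @ lam) - lam @ phi_star

lam_opt = minimize(neg_dual, x0=np.zeros(2)).x
q = np.exp(phi @ lam_opt - logsumexp(phi @ lam_opt))   # q(x) proportional to exp(lambda^T phi(x))
print("maximum-entropy distribution q:", np.round(q, 3))
print("E_q[phi(x)]:", np.round(q @ phi, 3), " target:", phi_star)
```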

This is a relatively simple result but it underlies a decent chunk of models used in practice.

Myth 11: frequentist methods have no promising approach to computationally bounded inference. I would personally argue that frequentist methods are more promising than Bayesian methods at handling computational constraints, although computationally bounded inference is a very cutting edge area and I’m sure other experts would disagree. However, one point in favor of the frequentist approach here is that we already have some frameworks, such as the “tightening relaxations” framework discussed here, that provide quite elegant and rigorous ways of handling computationally intractable models.

References

(Myth 3) Sparse recovery: Sparse recovery using sparse matrices
(Myth 5) Online learning: Online learning and online convex optimization
(Myth 8) Robust statistics: see this blog post and the two linked papers
(Myth 10) Maximum entropy duality: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory

Another Critique of Effective Altruism

I’ve decided to branch out a bit from technical discussions and engage in, as Scott Aaronson would call it, some metaphysical spouting. The topic of today is the effective altruism movement. I’m about to be relentlessly critical of it, so this is probably not the best post to read as your first introduction. Instead, read this and this. Then you can read what follows (but keep in mind that there are also many good things about the EA movement that I’m failing to mention here).

* * *

Another Critique of Effective Altruism

Recently Ben Kuhn wrote a critique of effective altruism. I’m glad to see such self-examination taking place, but I’m also concerned that the essay did not attack some of the most serious issues I see in the effective altruist movement, so I’ve decided to write my own critique. Due to time constraints, this critique is short and incomplete. I’ve tried to bring up arguments that would make people feel uncomfortable and defensive; hopefully I’ve succeeded.

Briefly, here are some of the major issues I have with the effective altruism movement as it currently stands:

  • Over-focus on “tried and true” and “default” options, which may both reduce actual impact and decrease exploration of new potentially high-value opportunities.

  • Over-confident claims coupled with insufficient background research.

  • Over-reliance on a small set of tools for assessing opportunities, which leads many to underestimate the value of things such as “flow-through” effects.

The common theme here is a subtle underlying message that simple, shallow analyses can allow one to make high-impact career and giving choices, and divest one of the need to dig further. I doubt that anyone explicitly believes this, but I do believe that this theme comes out implicitly both in arguments people make and in actions people take.

Lest this essay give a mistaken impression to the casual reader, I should note that there are many exemplary effective altruists who I feel are mostly immune to the issues above; for instance, the GiveWell blog does a very good job of warning against the first and third points above, and I would recommend that anyone who isn’t already subscribed to it do so (and there are other examples that I’m failing to mention). But for the purposes of this essay, I will ignore this fact except for the current caveat.


Over-focus on “tried and true” options

It seems to me that the effective altruist movement over-focuses on “tried and true” options, both in giving opportunities and in career paths. Perhaps the biggest example of this is the prevalence of “earning to give”. While this is certainly an admirable option, it should be considered as a baseline to improve upon, not a definitive answer.

The biggest issue with the “earning to give” path is that careers in finance and software (the two most common avenues for this) are incredibly straightforward and secure. The two things that finance and software have in common are a well-defined application process similar to the one for undergraduate admissions, and the fact that, given reasonable job performance, one will continue to be given promotions and raises (this probably entails working hard, but the end result is still rarely in doubt). One also gets a constant source of extrinsic positive reinforcement from the money one earns. Why do I call these things an “issue”? Because I think that these attributes encourage people to pursue these paths without looking for less obvious, less certain, but ultimately better paths. One in six Yale graduates goes into finance and consulting, seemingly due to the simplicity of applying and the easy supply of extrinsic motivation. My intuition is that this ratio is higher than an optimal society would have, even if such people commonly gave generously (and it is certainly much higher than the number of people who enter college planning to pursue such paths).

Contrast this with, for instance, working at a start-up. Most start-ups are low-impact, but it is undeniable that at least some have been extraordinarily high-impact, so this seems like an area that effective altruists should be considering strongly. Why aren’t there more of us at 23andMe, or Coursera, or Quora, or Stripe? I think it is because these opportunities are less obvious and take more work to find; once you start working, it often isn’t clear whether what you’re doing will have a positive impact; and your future job security is massively uncertain. There are few sources of extrinsic motivation in such a career: perhaps more so at one of the companies mentioned above, which are reasonably established and have customers, but what about the 4-person start-up teams working in a warehouse somewhere? Some of them will go on to do great things, but right now their lives must be full of anxiety and uncertainty.

I don’t mean to fetishize start-ups. They are just one well-known example of a potentially high-value career path that, to me, seems underexplored within the EA movement. I would argue (perhaps self-servingly) that academia is another such path, with similar psychological obstacles: every 5 years or so you have the opportunity to get kicked out (e.g. when applying for faculty jobs, or when coming up for tenure), you need to relocate regularly, few people will read your work and even fewer will praise it, and it won’t be clear whether it had a positive impact until many years down the road. And beyond the “obvious” alternatives of start-ups and academia, what of the paths that haven’t been created yet? GiveWell was revolutionary when it came about. Who will be the next GiveWell? And by this I don’t mean the next charity evaluator, but the next set of people who fundamentally alter how we view altruism.

Over-confident claims coupled with insufficient background research

The history of effective altruism is littered with over-confident claims, many of which have later turned out to be false. In 2009, Peter Singer claimed that you could save a life for $200 (and many others repeated his claim). While the number was already questionable at the time, by 2011 it had become clear that it was completely off. New numbers were then thrown around: from figures still in the hundreds of dollars (GWWC’s estimate for SCI, which was later shown to be flawed) up to $1600 (GiveWell’s estimate for AMF, which GiveWell itself expected to go up, and which indeed did go up). These numbers were often cited without caveats, as were other claims, such as that the effectiveness of charities can vary by a factor of 1,000. How many people citing these numbers understood the process that generated them, or the high degree of uncertainty surrounding them, or the inaccuracy of past estimates? How many would have pointed out that saying charities vary in effectiveness by a factor of 1,000 is by itself not very helpful, and is more a statement about how bad the bottom end is than about how good the top end is?

More problematic than the careless bandying of numbers is the tendency not to do thorough background research. A common pattern I see is: an effective altruist makes a bold claim, then when pressed on it offers a heuristic justification together with the assertion that “estimation is the best we have”. This sort of argument acts as a conversation-stopper (and can also be quite annoying, which may be part of what drives some people away from effective altruism). In many of these cases, there are relatively easy opportunities to do background reading and educate oneself about the claim being made. To an outside observer, it can appear as though people are opting for the fun, easy activity (speculation) rather than the harder and more worthwhile one (research). Again, I’m not claiming that this is people’s explicit thought process, but it does seem to be what ends up happening.

Why haven’t more EAs signed up for a course on global security, or tried to understand how DARPA funds projects, or learned about third-world health? I’ve heard claims that this would be too time-consuming relative to the value it provides, but this seems like a poor excuse if we want to be taken seriously as a movement (or even just want to reach consistently accurate conclusions about the world).

Over-reliance on a small set of tools

Effective altruists tend to have a lot of interest in quantitative estimates. We want to know what the best thing to do is, and we want a numerical value. This causes us to rely on scientific studies, economic reports, and Fermi estimates. It can cause us to underweight things like the competence of a particular organization, the strength of the people involved, and other “intangibles” (which are often not actually intangible, but simply difficult to assign a number to). It can also cause us to over-focus on money as the unit of altruism, when oftentimes “it isn’t about the money”: it’s about doing the groundwork that no one is doing, or finding the opportunity that no one has found yet.

Quantitative estimates often also tend to ignore flow-through effects: effects which are an indirect, rather than direct, result of an action (such as decreased disease in the third world contributing in the long run to increased global security). These effects are difficult to quantify but human and cultural intuition can do a reasonable job of taking them into account. As such, I often worry that effective altruists may actually be less effective than “normal” altruists. (One can point to all sorts of examples of farcical charities to claim that regular altruism sucks, but this misses the point that there are also amazing organizations out there, such as the Simons Foundation or HHMI, which are doing enormous amounts of good despite not subscribing to the EA philosophy.)

What’s particularly worrisome is that even if we were less effective than normal altruists, we would probably still end up looking better by our own standards, which explicitly fail to account for the ways in which normal altruists might outperform us (see above). This is a problem with any paradigm, but the fact that the effective altruist community is small and insular and relies heavily on its paradigm makes us far more susceptible to it.

Convex Conditions for Strong Convexity

An important concept in online learning and convex optimization is that of strong convexity: a twice-differentiable function f is said to be strongly convex with respect to a norm \|\cdot\| if

z^T\frac{\partial^2 f}{\partial x^2}z \geq \|z\|^2

for all z (for functions that are not twice-differentiable, there is an analogous criterion in terms of the Bregman divergence). To check strong convexity, then, we basically need to check a condition on the Hessian, namely that z^THz \geq \|z\|^2. So, under what conditions does this hold?

For the l^2 norm, the answer is easy: z^THz \geq \|z\|_2^2 if and only if H \succeq I (i.e., H-I is positive semidefinite). This can be shown in many ways; perhaps the easiest is to note that z^THz-\|z\|_2^2 = z^T(H-I)z.

For the l^{\infty} norm, the answer is a bit trickier but still not too complicated. Recall that we want necessary and sufficient conditions under which z^THz \geq \|z\|_{\infty}^2. Note that this is equivalent to asking that z^THz \geq (z_i)^2 for each coordinate i of z, which in turn is equivalent to H \succeq e_ie_i^T for each coordinate vector e_i (these are the vectors that are 1 in the ith coordinate and 0 everywhere else).
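Here is a minimal numerical sketch of this criterion (my own illustration, not part of the original derivation): it checks H \succeq e_ie_i^T for each coordinate by examining the smallest eigenvalue of the difference. The test matrices below are arbitrary examples.

import numpy as np

def is_psd(A, tol=1e-9):
    # Positive semidefinite iff the smallest eigenvalue is (numerically) nonnegative.
    return np.min(np.linalg.eigvalsh(A)) >= -tol

def strongly_convex_linf(H):
    # z^T H z >= ||z||_inf^2 for all z  iff  H - e_i e_i^T is PSD for every coordinate i.
    return all(is_psd(H - np.outer(e, e)) for e in np.eye(H.shape[0]))

print(strongly_convex_linf(np.eye(3)))                 # True: I - e_i e_i^T is PSD for each i
print(strongly_convex_linf(np.diag([1.0, 0.5, 2.0])))  # False: fails in the second coordinate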

More generally, for any norm \|\cdot\|, there exists a dual norm \|\cdot\|_* which satisfies, among other properties, the relationship \|z\| = \sup_{\|w\|_* = 1} w^Tz. So, in general, z^THz \geq \|z\|^2 is equivalent to asking that z^THz \geq (w^Tz)^2 for all w with \|w\|_* = 1. But this is in turn equivalent to asking that

H \succeq ww^T for all w such that \|w\|_* = 1.

In fact, it suffices to pick a subset of the w whose convex hull is the entire unit ball \{w : \|w\|_* \leq 1\}; this is why we were able to obtain such a clean formulation in the l^{\infty} case: the dual norm to l^{\infty} is l^1, whose unit ball is the cross-polytope, a polytope with only 2n vertices (namely, the signed unit vectors \pm e_i).

We can also derive a simple (but computationally expensive) criterion for l^1 strong convexity: here the dual norm is l^{\infty}, whose unit ball is the n-dimensional hypercube, with vertices given by all 2^n vectors of the form [ \pm 1 \ \cdots \ \pm 1]. Thus z^THz \geq \|z\|_1^2 if and only if H \succeq ss^T for all 2^n sign vectors s.
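As a sanity check (again my own sketch, and feasible only for small n), the 2^n sign-vector criterion can be tested by brute force:

import itertools
import numpy as np

def strongly_convex_l1(H, tol=1e-9):
    # z^T H z >= ||z||_1^2 for all z  iff  H - s s^T is PSD for every sign vector s.
    n = H.shape[0]
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        s = np.array(signs)
        if np.min(np.linalg.eigvalsh(H - np.outer(s, s))) < -tol:
            return False
    return True

n = 4
print(strongly_convex_l1(n * np.eye(n)))  # True: n*I - s s^T is PSD since ||s||_2^2 = n
print(strongly_convex_l1(np.eye(n)))      # False: I - s s^T has eigenvalue 1 - n < 0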

Finally, we re-examine the l^2 case; even though the l^2-ball is not a polytope, we were still able to obtain a very simple expression. This was because the condition H \succeq I manages to capture simultaneously all dual vectors such that w^Tw \leq 1. We thus have the general criterion:

Theorem. H \succeq M_jM_j^T for j = 1,\ldots,m if and only if H is strongly convex with respect to the norm \|\cdot\| whose dual unit ball is the convex hull of the transformed unit balls M_j\mathcal{B}_j, j = 1, \ldots, m, where \mathcal{B}_j is the l^2 unit ball whose dimension matches the number of columns of M_j.

Proof. H \succeq M_jM_j^T if and only if z^THz \geq \max_{j=1}^m \|M_j^Tz\|_2^2. Now note that \|M_j^Tz\|_2 = \sup_{w \in \mathcal{B}_j} w^TM_j^Tz = \sup_{w' \in M_j\mathcal{B}_j} (w')^Tz. If we define \|z\| = \max_{j=1}^m \|M_j^Tz\|_2, it is then apparent that the dual norm unit ball is the convex hull of the M_j\mathcal{B}_j.
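As a quick sanity check of the theorem (my own remark, not part of the original argument): taking a single matrix M_1 = I gives M_1\mathcal{B}_1 equal to the l^2 unit ball, which is its own dual ball, so the condition H \succeq I recovers l^2 strong convexity; taking M_j = e_j for j = 1,\ldots,n gives transformed balls equal to the segments [-e_j, e_j], whose convex hull is the l^1 unit ball, so the conditions H \succeq e_je_j^T recover l^{\infty} strong convexity, matching the earlier derivations.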

Convexity counterexample

Here’s a fun counterexample: a function \mathbb{R}^n \to \mathbb{R} that is jointly convex in any n-1 of the variables, but not in all variables at once. The function is

f(x_1,\ldots,x_n) = \frac{1}{2}(n-1.5)\sum_{i=1}^n x_i^2 - \sum_{i < j} x_ix_j

To see why this is, note that the Hessian of f is equal to

\left[ \begin{array}{cccc} n-1.5 & -1 & \cdots & -1 \\ -1 & n-1.5 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & n-1.5 \end{array} \right]

This matrix is equal to (n-0.5)I - J, where I is the identity matrix and J is the all-ones matrix, which is rank 1 and whose single non-zero eigenvalue is n. Therefore, this matrix has n-1 eigenvalues equal to n-0.5, together with a single eigenvalue of -0.5, and hence is not positive semidefinite.

On the other hand, any submatrix of size n-1 is of the form (n-0.5)I-J, but where now J is only (n-1) \times (n-1). This matrix now has n-2 eigenvalues of n-0.5, together with a single eigenvalue of 0.5, and hence is positive definite. Therefore, the Hessian is positive definite when restricted to any n-1 variables, and hence f is convex in any n-1 variables, but not in all n variables jointly.
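A quick numerical check of these eigenvalue claims (my own sketch; n = 5 is an arbitrary choice):

import numpy as np

n = 5
# Hessian of f: n - 1.5 on the diagonal, -1 off the diagonal, i.e. (n - 0.5) I - J.
H = (n - 0.5) * np.eye(n) - np.ones((n, n))

print(np.round(np.linalg.eigvalsh(H), 6))          # one eigenvalue of -0.5, the rest equal to n - 0.5
print(np.round(np.linalg.eigvalsh(H[1:, 1:]), 6))  # principal submatrix: one eigenvalue of 0.5, the rest n - 0.5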

Probabilistic Abstractions I

(This post represents research in progress. I may think about these concepts entirely differently a few months from now, but for my own benefit I’m trying to exposit on them in order to force myself to understand them better.)

For many inference tasks, especially ones with either non-linearities or non-convexities, it is common to use particle-based methods such as beam search, particle filters, sequential Monte Carlo, or Markov Chain Monte Carlo. In these methods, we approximate a distribution by a collection of samples from that distribution, then update the samples as new information is added. For instance, in beam search, if we are trying to build up a tree, we might build up a collection of K samples for the left and right subtrees, then look at all K^2 ways of combining them into the entire tree, but then downsample again to the K trees with the highest scores. This allows us to search through the exponentially large space of all trees efficiently (albeit at the cost of possibly missing high-scoring trees).
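To make the beam-search example concrete, here is a minimal sketch (my own illustration; the tree representation and scoring function are hypothetical, and for simplicity the same beam is reused for both the left and right subtrees): at each level we keep the K highest-scoring subtrees, combine them in all K^2 ways, and prune back down to K.

import heapq
from itertools import product

def beam_search_trees(leaves, score, K, depth):
    # Start from the K best leaves, then repeatedly combine and prune.
    beam = heapq.nlargest(K, [(score(t), t) for t in leaves], key=lambda st: st[0])
    for _ in range(depth):
        # All K^2 ways of combining a left subtree with a right subtree...
        candidates = [(score((l, r)), (l, r)) for (_, l), (_, r) in product(beam, beam)]
        # ...then downsample to the K trees with the highest scores.
        beam = heapq.nlargest(K, candidates, key=lambda st: st[0])
    return beam

# Hypothetical usage: trees are nested tuples, scored (arbitrarily) by brevity.
print(beam_search_trees(["a", "b", "c", "d"], lambda t: -len(str(t)), K=2, depth=2))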

One major problem with such particle-based methods is diversity: the particles will tend to cluster around the highest-scoring mode, rather than exploring multiple local optima if they exist. This can be bad because it makes learning algorithms overly myopic. Another problem, especially in combinatorial domains, is difficulty of partial evaluation: if we have some training data that we are trying to fit to, and we have chosen settings of some, but not all, variables in our model, it can be difficult to know if that setting is on the right track (for instance, it can be difficult to know whether a partially-built tree is a promising candidate or not). For time-series modeling, this isn’t nearly as large of a problem, since we can evaluate against a prefix of the time series to get a good idea (this perhaps explains the success of particle filters in these domains).

I’ve been working on a method that tries to deal with both of these problems, which I call probabilistic abstractions. The idea is to improve the diversity of particle-based methods by creating “fat” particles which cover multiple states at once; the reason that such fat particles help is that they allow us to first optimize for coverage (by placing down relatively large particles that cover the entire space), then later worry about more local details (by placing down many particles near promising-looking local optima).

To be more concrete, if we have a probability distribution over a set of random variables (X_1,\ldots,X_d), then our particles will be sets obtained by specifying the values of some of the X_i and leaving the rest to vary arbitrarily. So, for instance, if d=4, then \{(X_1,X_2,X_3,X_4) \mid X_2 = 1, X_4 = 7\} might be a possible “fat” particle.

By choosing some number of fat particles and assigning probabilities to them, we are implicitly specifying a polytope of possible probability distributions; for instance, if our particles are S_1,\ldots,S_k, and we assign probability \pi_i to S_i, then we have the polytope of distributions p that satisfy the constraints p(S_1) = \pi_1, p(S_2) = \pi_2, etc.

Given such a polytope, is there a way to pick a canonical representative from it? One such representative is the maximum entropy distribution in that polytope. This distribution has the property of minimizing the worst-case relative entropy to any other distribution within the polytope (and that worst-case relative entropy is just the entropy of the distribution).

Suppose that we have polytopes for two independent distributions, and we want to compute the polytope for their product. This is easy — just take the Cartesian product of each particle of the first distribution with each particle of the second distribution. If each individual distribution has k particles, then the product distribution has k^2 particles — this could be problematic computationally, so we also want a way to narrow down to a subset of the k most informative particles. These will be the k particles whose corresponding polytope has the smallest maximum entropy. Finding this subset is NP-hard in general, but I’m currently working on good heuristics for computing it.
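Here is a minimal sketch of the product operation (my own illustration; the particle contents and probabilities are made up), representing a fat particle as a dict of fixed variables plus a probability mass:

from itertools import product

# Each fat particle fixes some variables and leaves the rest free.
particles_a = [({"X1": 0}, 0.6), ({"X1": 1}, 0.4)]
particles_b = [({"X3": 2}, 0.5), ({"X4": 7}, 0.5)]

def product_particles(ps, qs):
    # Cartesian product of the underlying sets, with probabilities multiplied.
    # Assumes the two independent distributions constrain disjoint variables.
    return [({**sa, **sb}, pa * pb) for (sa, pa), (sb, pb) in product(ps, qs)]

for constraint, prob in product_particles(particles_a, particles_b):
    print(constraint, prob)  # k^2 = 4 particles; subsampling back to k is the hard part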

Next, suppose that we have a distribution on a space X and want to apply a function f : X \to Y to it. If f is a complicated function, it might be difficult to propagate the fat particles (even though it would have been easy to propagate particles composed of single points). To get around this, we need what is called a valid abstraction of f: a function \tilde{f} : 2^X \to 2^Y such that \tilde{f}(S) \supseteq f(S) for all S \in 2^X. In this case, if we map a particle S to \tilde{f}(S), our equality constraint on the mass assigned to S becomes a lower bound on the mass assigned to \tilde{f}(S) — we thus still have a polytope of possible probability distributions. Depending on the exact structure of the particles (i.e. the exact way in which the different sets overlap), it may be necessary to add additional constraints to the polytope to get good performance — I feel like I have some understanding of this, but it’s something I’ll need to investigate empirically as well. It’s also interesting to note that \tilde{f} (when combined with conditioning on data, which is discussed below) allows us to assign partial credit to promising particles, which was the other property I discussed at the beginning.
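As a toy illustration of a valid abstraction (my own example, not from the original discussion): take X = Y = \mathbb{R}, f(x) = x^2, and particles that are intervals. The abstraction below over-approximates the image of an interval, which is all the definition requires:

def f(x):
    return x * x

def f_tilde(interval):
    # Valid abstraction of f on intervals: returns a superset of the image f([a, b]).
    # It is deliberately loose (it always includes 0), which is allowed.
    a, b = interval
    return (0.0, max(a * a, b * b))

print(f_tilde((1.0, 2.0)))  # (0.0, 4.0), a superset of the true image [1, 4]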

Finally, suppose that I want to condition on data. In this case the polytope approach doesn’t work as well, because conditioning on data can blow up the polytope by an arbitrarily large amount. Instead, we just take the maximum-entropy distribution in our polytope and treat that as our “true” distribution, then condition. I haven’t been able to make any formal statements about this procedure, but it seems to work at least somewhat reasonably. It is worth noting that conditioning may not be straightforward, since the likelihood function may not be constant across a given fat particle. To deal with this, we can replace the likelihood function by its average over the particle (which I think can be justified in terms of maximum entropy as well, although the details here are a bit hazier).

So, in summary, we have a notion of fat particles, which provide better coverage than point particles, and can combine them, apply functions to them, subsample them, and condition on data. This is essentially all of the operations we want to be able to apply for particle-based methods, so we in theory should now be able to implement versions of these particle-based methods that get better coverage.

Pairwise Independence vs. Independence

For collections of independent random variables, the Chernoff bound and related bounds give us very sharp concentration inequalities — if X_1,\ldots,X_n are independent, then the tails of their sum decay like e^{-x^2}. For random variables that are only pairwise independent, the strongest bound we have is Chebyshev’s inequality, which only gives tail decay like \frac{1}{x^2}.

The point of this post is to construct an equality case for Chebyshev: a collection of pairwise independent random variables whose sum does not satisfy the concentration bound of Chernoff, and instead decays like \frac{1}{x^2}.

The construction is as follows: let X_1,\ldots,X_d be independent unbiased binary random variables (fair coin flips), and for each subset S \subseteq \{1,\ldots,d\}, let Y_S = \sum_{i \in S} X_i, where the sum is taken mod 2. Then we can easily check that the Y_S are pairwise independent. Now consider the random variable Z = \sum_{S} Y_S. If any of the X_i is equal to 1, then we can pair up the Y_S by either adding or removing i from S to get the other element of the pair; each pair contains exactly one Y_S equal to 1, so Z = 2^{d-1} in this case. On the other hand, if all of the X_i are equal to 0, then Z = 0 as well. Thus, with probability \frac{1}{2^d}, Z deviates from its mean by 2^{d-1}-\frac{1}{2}, whereas the variance of Z is 2^{d-2}-\frac{1}{4}. The bound on this probability from Chebyshev is \frac{2^{d-2}-1/4}{(2^{d-1}-1/2)^2}, which is very close to \frac{1}{2^d}, so this constitutes something very close to the Chebyshev equality case.
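The claims above are easy to verify exhaustively for small d (my own sketch; d = 4 is an arbitrary choice):

from itertools import combinations, product

d = 4
subsets = [c for r in range(d + 1) for c in combinations(range(d), r)]

# Enumerate all 2^d equally likely outcomes of the fair coins X_1, ..., X_d.
Z_values = []
for x in product([0, 1], repeat=d):
    Z_values.append(sum(sum(x[i] for i in S) % 2 for S in subsets))

mean = sum(Z_values) / len(Z_values)                          # 2^{d-1} - 1/2
var = sum((z - mean) ** 2 for z in Z_values) / len(Z_values)  # 2^{d-2} - 1/4
dev = 2 ** (d - 1) - 0.5
print(mean, var)
print(sum(abs(z - mean) >= dev for z in Z_values) / len(Z_values))  # exactly 1 / 2^d
print(var / dev ** 2)  # Chebyshev bound: 1 / (2^d - 1), only slightly larger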

Anyways, I just thought this was a cool example that demonstrates the difference between pairwise and full independence.
