Model Mis-specification and Inverse Reinforcement Learning

In my previous post, “Latent Variables and Model Mis-specification”, I argued that while machine learning is good at optimizing accuracy on observed signals, it has less to say about correctly inferring the values for unobserved variables in a model. In this post I’d like to focus in on a specific context for this: inverse reinforcement learning (Ng et al. 2000, Abeel et al. 2004, Ziebart et al. 2008, Ho et al 2016), where one observes the actions of an agent and wants to infer the preferences and beliefs that led to those actions. For this post, I am pleased to be joined by Owain Evans, who is an active researcher in this area and has co-authored an online book about building models of agents (see here in particular for a tutorial on inverse reinforcement learning and inverse planning).

Owain and I are particularly interested in inverse reinforcement learning (IRL) because it has been proposed (most notably by Stuart Russell) as a method for learning human values in the context of AI safety; among other things, this would eventually involve learning and correctly implementing human values by artificial agents that are much more powerful, and act with much broader scope, than any humans alive today. While we think that overall IRL is a promising route to consider, we believe that there are also a number of non-obvious pitfalls related to performing IRL with a mis-specified model. The role of IRL in AI safety is to infer human values, which are represented by a reward function or utility function. But crucially, human values (or human reward functions) are never directly observed.

Below, we elaborate on these issues. We hope that by being more aware of these issues, researchers working on inverse reinforcement learning can anticipate and address the resulting failure modes. In addition, we think that considering issues caused by model mis-specification in a particular concrete context can better elucidate the general issues pointed to in the previous post on model mis-specification.

Specific Pitfalls for Inverse Reinforcement Learning

In “Latent Variables and Model Mis-specification”, Jacob talked about model mis-specification, where the “true” model does not lie in the model family being considered. We encourage readers to read that post first, though we’ve also tried to make the below readable independently.

In the context of inverse reinforcement learning, one can see some specific problems that might arise due to model mis-specification. For instance, the following are things we could misunderstand about an agent, which would cause us to make incorrect inferences about the agent’s values:

  • The actions of the agent. If we believe that an agent is capable of taking a certain action, but in reality they are not, we might make strange inferences about their values (for instance, that they highly value not taking that action). Furthermore, if our data is e.g. videos of human behavior, we have an additional inference problem of recognizing actions from the frames.
  • The information available to the agent. If an agent has access to more information than we think it does, then a plan that seems irrational to us (from the perspective of a given reward function) might actually be optimal for reasons that we fail to appreciate. In the other direction, if an agent has less information than we think, then we might incorrectly believe that they don’t value some outcome A, even though they really only failed to obtain A due to lack of information.
  • The long-term plans of the agent. An agent might take many actions that are useful in accomplishing some long-term goal, but not necessarily over the time horizon that we observe the agent. Inferring correct values thus also requires inferring such long-term goals. In addition, long time horizons can make models more brittle, thereby exacerbating model mis-specification issues.

There are likely other sources of error as well. The general point is that, given a mis-specified model of the agent, it is easy to make incorrect inferences about an agent’s values if the optimization pressure on the learning algorithm is only towards predicting actions correctly in-sample.

In the remainder of this post, we will cover each of the above aspects — actions, information, and plans — in turn, giving both quantitative models and qualitative arguments for why model mis-specification for that aspect of the agent can lead to perverse beliefs and behavior. First, though, we will briefly review the definition of inverse reinforcement learning and introduce relevant notation.

Inverse Reinforcement Learning: Definition and Notations

In inverse reinforcement learning, we want to model an agent taking actions in a given environment. We therefore suppose that we have a state space S (the set of states the agent and environment can be in), an action space A (the set of actions the agent can take), and a transition function T(s' \mid s,a), which gives the probability of moving from state s to state s' when taking action a. For instance, for an AI learning to control a car, the state space would be the possible locations and orientations of the car, the action space would be the set of control signals that the AI could send to the car, and the transition function would be the dynamics model for the car. The tuple of (S,A,T) is called an MDP\backslash R, which is a Markov Decision Process without a reward function. (The MDP\backslash R will either have a known horizon or a discount rate \gamma but we’ll leave these out for simplicity.)


Figure 1: Diagram showing how IRL and RL are related. (Credit: Pieter Abbeel’s slides on IRL)

The inference problem for IRL is to infer a reward function R given an optimal policy \pi^* : S \to A for the MDP\backslash R (see Figure 1). We learn about the policy \pi^* from samples (s,a) of states and the corresponding action according to \pi^* (which may be random). Typically, these samples come from a trajectory, which records the full history of the agent’s states and actions in a single episode:

(s_0, a_0), (s_1, a_1), \ldots, (s_n, a_n)

In the car example, this would correspond to the actions taken by an expert human driver who is demonstrating desired driving behaviour (where the actions would be recorded as the signals to the steering wheel, brake, etc.).

Given the MDP\backslash R and the observed trajectory, the goal is to infer the reward function R. In a Bayesian framework, if we specify a prior on R we have:

P(R \mid s_{0:n},a_{0:n}) \propto P( s_{0:n},a_{0:n} \mid R) P(R) = P(R) \cdot \prod_{i=0}^n P( a_i \mid s_i, R)

The likelihood P(a_i \mid s_i, R) is just \pi_R(s)[a_i], where \pi_R is the optimal policy under the reward function R. Note that computing the optimal policy given the reward is in general non-trivial; except in simple cases, we typically approximate the policy using reinforcement learning (see Figure 1). Policies are usually assumed to be noisy (e.g. using a softmax instead of deterministically taking the best action). Due to the challenges of specifying priors, computing optimal policies and integrating over reward functions, most work in IRL uses some kind of approximation to the Bayesian objective (see the references in the introduction for some examples).

Recognizing Human Actions in Data

IRL is a promising approach to learning human values in part because of the easy availability of data. For supervised learning, humans need to produce many labeled instances specialized for a task. IRL, by contrast, is an unsupervised/semi-supervised approach where any record of human behavior is a potential data source. Facebook’s logs of user behavior provide trillions of data-points. YouTube videos, history books, and literature are a trove of data on human behavior in both actual and imagined scenarios. However, while there is lots of existing data that is informative about human preferences, we argue that exploiting this data for IRL will be a difficult, complex task with current techniques.

Inferring Reward Functions from Video Frames

As we noted above, applications of IRL typically infer the reward function R from observed samples of the human policy \pi^*. Formally, the environment is a known MDP\backslash R = (S,A,T) and the observations are state-action pairs, (s,a) \sim pi^*. This assumes that (a) the environment’s dynamics T are given as part of the IRL problem, and (b) the observations are structured as “state-action” pairs. When the data comes from a human expert parking a car, these assumptions are reasonable. The states and actions of the driver can be recorded and a car simulator can be used for T. For data from YouTube videos or history books, the assumptions fail. The data is a sequence of partial observations: the transition function T is unknown and the data does not separate out state and action. Indeed, it’s a challenging ML problem to infer human actions from text or videos.


Movie still: What actions are being performed in this situation? (Source)

As a concrete example, suppose the data is a video of two co-pilots flying a plane. The successive frames provide only limited information about the state of the world at each time step and the frames often jump forward in time. So it’s more like a POMDP with a complex observation model. Moreover, the actions of each pilot need to be inferred. This is a challenging inference problem, because actions can be subtle (e.g. when a pilot nudges the controls or nods to his co-pilot).

To infer actions from observations, some model relating the true state-action (s,a) to the observed video frame must be used. But choosing any model makes substantive assumptions about how human values relate to their behavior. For example, suppose someone attacks one of the pilots and (as a reflex) he defends himself by hitting back. Is this reflexive or instinctive response (hitting the attacker) an action that is informative about the pilot’s values? Philosophers and neuroscientists might investigate this by considering the mental processes that occur before the pilot hits back. If an IRL algorithm uses an off-the-shelf action classifier, it will lock in some (contentious) assumptions about these mental processes. At the same time, an IRL algorithm cannot learn such a model because it never directly observes the mental processes that relate rewards to actions.

Inferring Policies From Video Frames

When learning a reward function via IRL, the ultimate goal is to use the reward function to guide an artificial agent’s behavior (e.g. to perform useful tasks to humans). This goal can be formalized directly, without including IRL as an intermediate step. For example, in Apprenticeship Learning, the goal is to learn a “good” policy for the MDP\backslash R from samples of the human’s policy \pi^* (where \pi^* is assumed to approximately optimize an unknown reward function). In Imitation Learning, the goal is simply to learn a policy that is similar to the human’s policy.

Like IRL, policy search techniques need to recognize an agent’s actions to infer their policy. So they have the same challenges as IRL in learning from videos or history books. Unlike IRL, policy search does not explicitly model the reward function that underlies an agent’s behavior. This leads to an additional challenge. Humans and AI systems face vastly different tasks and have different action spaces. Most actions in videos and books would never be performed by a software agent. Even when tasks are similar (e.g. humans driving in the 1930s vs. a self-driving car in 2016), it is a difficult transfer learning problem to use human policies in one task to improve AI policies in another.

IRL Needs Curated Data

We argued that records of human behaviour in books and videos are difficult for IRL algorithms to exploit. Data from Facebook seems more promising: we can store the state (e.g. the HTML or pixels displayed to the human) and each human action (clicks and scrolling). This extends beyond Facebook to any task that can be performed on a computer. While this covers a broad range of tasks, there are obvious limitations. Many people in the world have a limited ability to use a computer: we can’t learn about their values in this way. Moreover, some kinds of human preferences (e.g. preferences over physical activities) seem hard to learn about from behaviour on a computer.

Information and Biases

Human actions depend both on their preferences and their beliefs. The beliefs, like the preferences, are never directly observed. For narrow tasks (e.g. people choosing their favorite photos from a display), we can model humans as having full knowledge of the state (as in an MDP). But for most real-world tasks, humans have limited information and their information changes over time (as in a POMDP or RL problem). If IRL assumes the human has full information, then the model is mis-specified and generalizing about what the human would prefer in other scenarios can be mistaken. Here are some examples:

(1). Someone travels from their house to a cafe, which has already closed. If they are assumed to have full knowledge, then IRL would infer an alternative preference (e.g. going for a walk) rather than a preference to get a drink at the cafe.

(2). Someone takes a drug that is widely known to be ineffective. This could be because they have a false belief that the drug is effective, or because they picked up the wrong pill, or because they take the drug for its side-effects. Each possible explanation could lead to different conclusions about preferences.

(3). Suppose an IRL algorithm is inferring a person’s goals from key-presses on their laptop. The person repeatedly forgets their login passwords and has to reset them. This behavior is hard to capture with a POMDP-style model: humans forget some strings of characters and not others. IRL might infer that the person intends to repeatedly reset their passwords.

Example (3) above arises from humans forgetting information — even if the information is only a short string of characters. This is one way in which humans systematically deviate from rational Bayesian agents. The field of psychology has documented many other deviations. Below we discuss one such deviation — time-inconsistency — which has been used to explain temptation, addiction and procrastination.

Time-inconsistency and Procrastination

An IRL algorithm is inferring Alice’s preferences. In particular, the goal is to infer Alice’s preference for completing a somewhat tedious task (e.g. writing a paper) as opposed to relaxing. Alice has T days in which she could complete the task and IRL observes her working or relaxing on each successive day.


Figure 2. MDP graph for choosing whether to “work” or “wait” (relax) on a task.

Formally, let R be the preference/reward Alice assigns to completing the task. Each day, Alice can “work” (receiving cost w for doing tedious work) or “wait” (cost 0). If she works, she later receives the reward R minus a tiny, linearly increasing cost (because it’s better to submit a paper earlier). Beyond the deadline at T, Alice cannot get the reward R. For IRL, we fix \epsilon and w and infer R.

Suppose Alice chooses “wait” on Day 1. If she were fully rational, it follows that R (the preference for completing the task) is small compared to w (the psychological cost of doing the tedious work). In other words, Alice doesn’t care much about completing the task. Rational agents will do the task on Day 1 or never do it. Yet humans often care deeply about tasks yet leave them until the last minute (when finishing early would be optimal). Here we imagine that Alice has 9 days to complete the task and waits until the last possible day.


Figure 3: Graph showing IRL inferences for Optimal model (which is mis-specified) and Possibly Discounting Model (which includes hyperbolic discounting). On each day (x-axis) the model gets another observation of Alice’s choice. The y-axis shows the posterior mean for R (reward for task), where the tedious work w = -1.

Figure 3 shows results from running IRL on this problem. There is an “Optimal” model, where the agent is optimal up to an unknown level of softmax random noise (a typical assumption for IRL). There is also a “Possibly Discounting” model, where the agent is either softmax optimal or is a hyperbolic discounter (with unknown level of discounting). We do joint Bayesian inference over the completion reward R, the softmax noise and (for “Possibly Discounting”) how much the agent hyperbolically discounts. The work cost w is set to -1. Figure 3 shows that after 6 days of observing Alice procrastinate, the “Optimal” model is very confident that Alice does not care about the task (R < |w|). When Alice completes the task on the last possible day, the posterior mean on R is not much more than the prior mean. By contrast, the “Possibly Discounting” model never becomes confident that Alice doesn’t care about the task. (Note that the gap between the models would be bigger for larger T. The “Optimal” model’s posterior on R shoots back to its Day-0 prior because it explains the whole action sequence as due to high softmax noise — optimal agents without noise would either do the task immediately or not at all. Full details and code are here.)

Long-term Plans

Agents will often take long series of actions that generate negative utility for them in the moment in order to accomplish a long-term goal (for instance, studying every night in order to perform well on a test). Such long-term plans can make IRL more difficult for a few reasons. Here we focus on two: (1) IRL systems may not have access to the right type of data for learning about long-term goals, and (2) needing to predict long sequences of actions can make algorithms more fragile in the face of model mis-specification.

(1) Wrong type of data. To make inferences based on long-term plans, it would be helpful to have coherent data about a single agent’s actions over a long period of time (so that we can e.g. see the plan unfolding). But in practice we will likely have substantially more data consisting of short snapshots of a large number of different agents (e.g. because many internet services already record user interactions, but it is uncommon for a single person to be exhaustively tracked and recorded over an extended period of time even while they are offline).

The former type of data (about a single representative population measured over time) is called panel data, while the latter type of data (about different representative populations measured at each point in time) is called repeated cross-section data. The differences between these two types of data is well-studied in econometrics, and a general theme is the following: it is difficult to infer individual-level effects from cross-sectional data.

An easy and familiar example of this difference (albeit not in an IRL setting) can be given in terms of election campaigns. Most campaign polling is cross-sectional in nature: a different population of respondents is polled at each point in time. Suppose that Hillary Clinton gives a speech and her overall support according to cross-sectional polls increases by 2%; what can we conclude from this? Does it mean that 2% of people switched from Trump to Clinton? Or did 6% of people switch from Trump to Clinton while 4% switched from Clinton to Trump?

At a minimum, then, using cross-sectional data leads to a difficult disaggregation problem; for instance, different agents taking different actions at a given point in time could be due to being at different stages in the same plan, or due to having different plans, or some combination of these and other factors. Collecting demographic and other side data can help us (by allowing us to look at variation and shifts within each subpopulation), but it is unclear if this will be sufficient in general.

On the other hand, there are some services (such as Facebook or Google) that do have extensive data about individual users across a long period of time. However, this data has another issue: it is incomplete in a very systematic way (since it only tracks online behaviour). For instance, someone might go online most days to read course notes and Wikipedia for a class; this is data that would likely be recorded. However, it is less likely that one would have a record of that person taking the final exam, passing the class and then getting an internship based on their class performance. Of course, some pieces of this sequence would be inferable based on some people’s e-mail records, etc., but it would likely be under-represented in the data relative to the record of Wikipedia usage. In either case, some non-trivial degree of inference would be necessary to make sense of such data.

(2) Fragility to mis-specification. Above we discussed why observing only short sequences of actions from an agent can make it difficult to learn about their long-term plans (and hence to reason correctly about their values). Next we discuss another potential issue — fragility to model mis-specification.

Suppose someone spends 99 days doing a boring task to accomplish an important goal on day 100. A system that is only trying to correctly predict actions will be right 99% of the time if it predicts that the person inherently enjoys boring tasks. Of course, a system that understands the goal and how the tasks lead to the goal will be right 100% of the time, but even minor errors in its understanding could bring the accuracy back below 99%.

The general issue is the following: large changes in the model of the agent might only lead to small changes in the predictive accuracy of the model, and the longer the time horizon on which a goal is realized, the more this might be the case. This means that even slight mis-specifications in the model could tip the scales back in favor of a (very) incorrect reward function. A potential way of dealing with this might be to identify “important” predictions that seem closely tied to the reward function, and focus particularly on getting those predictions right (see here for a paper exploring a similar idea in the context of approximate inference).

One might object that this is only a problem in this toy setting; for instance, in the real world, one might look at the particular way in which someone is studying or performing some other boring task to see that it coherently leads towards some goal (in a way that would be less likely were the person to be doing something boring purely for enjoyment). In other words, correctly understanding the agent’s goals might allow for more fine-grained accurate predictions which would fare better under e.g. log-score than would an incorrect model.

This is a reasonable objection, but there are some historical examples of this going wrong that should give one pause. That is, there are historical instances where: (i) people expected a more complex model that seemed to get at some underlying mechanism to outperform a simpler model that ignored that mechanism, and (ii) they were wrong (the simpler model did better under log-score). The example we are most familiar with is n-gram models vs. parse trees for language modelling; the most successful language models (in terms of having the best log-score on predicting the next word given a sequence of previous words) essentially treat language as a high-order Markov chain or hidden Markov model, despite the fact that linguistic theory predicts that language should be tree-structured rather than linearly-structured. Indeed, NLP researchers have tried building language models that assume language is tree-structured, and these models perform worse, or at least do not seem to have been adopted in practice (this is true both for older discrete models and newer continuous models based on neural nets).  It’s plausible that a similar issue will occur in inverse reinforcement learning, where correctly inferring plans is not enough to win out in predictive performance. The reason for the two issues might be quite similar (in language modelling, the tree structure only wins out in statistically uncommon corner cases involving long-term and/or nested dependencies, and hence getting that part of the prediction correct doesn’t help predictive accuracy much).

The overall point is: in the case of even slight model mis-specification, the “correct” model might actually perform worse under typical metrics such as predictive accuracy. Therefore, more careful methods of constructing a model might be necessary.

Learning Values != Robustly Predicting Human Behaviour

The problems with IRL described so far will result in poor performance for predicting human choices out-of-sample. For example, if someone is observed doing boring tasks for 99 days (where they only achieve the goal on Day 100), they’ll be predicted to continue doing boring tasks even when a short-cut to the goal becomes available. So even if the goal is simply to predict human behaviour (not to infer human values), mis-specification leads to bad predictions on realistic out-of-sample scenarios.

Let’s suppose that our goal is not to predict human behaviour but to create AI systems that promote and respect human values. These goals (predicting humans and building safe AI) are distinct. Here’s an example that illustrates the difference. Consider a long-term smoker, Bob, who would continue smoking even if there were (counterfactually) a universally effective anti-smoking treatment. Maybe Bob is in denial about the health effects of smoking or Bob thinks he’ll inevitably go back to smoking whatever happens. If an AI system were assisting Bob, we might expect it to avoid promoting his smoking habit (e.g. by not offering him cigarettes at random moments). This is not paternalism, where the AI system imposes someone else’s values on Bob. The point is that even if Bob would continue smoking across many counterfactual scenarios this doesn’t mean that he places value on smoking.

How do we choose between the theory that Bob values smoking and the theory that he does not (but smokes anyway because of the powerful addiction)? Humans choose between these theories based on our experience with addictive behaviours and our insights into people’s preferences and values. This kind of insight can’t easily be captured as formal assumptions about a model, or even as a criterion about counterfactual generalization. (The theory that Bob values smoking does make accurate predictions across a wide range of counterfactuals.) Because of this, learning human values from IRL has a more profound kind of model mis-specification than the examples in Jacob’s previous post. Even in the limit of data generated from an infinite series of random counterfactual scenarios, standard IRL algorithms would not infer someone’s true values.

Predicting human actions is neither necessary nor sufficient for learning human values. In what ways, then, are the two related? One such way stems from the premise that if someone spends more resources making a decision, the resulting decision tends to be more in keeping with their true values. For instance, someone might spend lots of time thinking about the decision, they might consult experts, or they might try out the different options in a trial period before they make the real decision. Various authors have thus suggested that people’s choices under sufficient “reflection” act as a reliable indicator of their true values. Under this view, predicting a certain kind of behaviour (choices under reflection) is sufficient for learning human values. Paul Christiano has written about some proposals for doing this, though we will not discuss them here (the first link is for general AI systems while the second is for newsfeeds). In general, turning these ideas into algorithms that are tractable and learn safely remains a challenging problem.

Further reading

There is research on doing IRL for agents in POMDPs. Owain and collaborators explored the effects of limited information and cognitive biases on IRL: paper, paper, online book.

For many environments it will not be possible to identify the reward function from the observed trajectories. These identification problems are related to the mis-specification problems but are not the same thing. Active learning can help with identification (paper).

Paul Christiano raised many similar points about mis-specification in a post on his blog.

For a big-picture monograph on relations between human preferences, economic utility theory and welfare/well-being, see Hausman’s “Preference, Value, Choice and Welfare”.


Thanks to Sindy Li for reviewing a full draft of this post and providing many helpful comments. Thanks also to Michael Webb and Paul Christiano for doing the same on specific sections of the post.

Linear algebra fact

Here is interesting linear algebra fact: let A be an n \times n matrix and u be a vector such that u^{\top}A = \lambda u^{\top}. Then for any matrix B, u^{\top}((A-B)(\lambda I - B)^{-1}) = u^{\top}.

The proof is just basic algebra: u^{\top}(A-B)(\lambda I - B)^{-1} = (\lambda u^{\top} - u^{\top}B)(\lambda I - B)^{-1} = u^{\top}(\lambda I - B)(\lambda I - B)^{-1} = u^{\top}.

Why care about this? Let’s imagine that A is a (not necessarily symmetric) stochastic matrix, so 1^{\top}A = 1^{\top}. Let A-B be a low-rank approximation to A (so A-B consists of all the large singular values, and B consists of all the small singular values). Unfortunately since A is not symmetric, this low-rank approximation doesn’t preserve the eigenvalues of A and so we need not have 1^{\top}(A-B) = 1^{\top}. The (I-B)^{-1} can be thought of as a “correction” term such that the resulting matrix is still low-rank, but we’ve preserved one of the eigenvectors of A.

Prékopa–Leindler inequality

Consider the following statements:

  1. The shape with the largest volume enclosed by a given surface area is the n-dimensional sphere.
  2. A marginal or sum of log-concave distributions is log-concave.
  3. Any Lipschitz function of a standard n-dimensional Gaussian distribution concentrates around its mean.

What do these all have in common? Despite being fairly non-trivial and deep results, they all can be proved in less than half of a page using the Prékopa–Leindler inequality.

(I won’t show this here, or give formal versions of the statements above, but time permitting I will do so in a later blog post.)

Latent Variables and Model Mis-specification

Machine learning is very good at optimizing predictions to match an observed signal — for instance, given a dataset of input images and labels of the images (e.g. dog, cat, etc.), machine learning is very good at correctly predicting the label of a new image. However, performance can quickly break down as soon as we care about criteria other than predicting observables. There are several cases where we might care about such criteria:

  • In scientific investigations, we often care less about predicting a specific observable phenomenon, and more about what that phenomenon implies about an underlying scientific theory.
  • In economic analysis, we are most interested in what policies will lead to desirable outcomes. This requires predicting what would counterfactually happen if we were to enact the policy, which we (usually) don’t have any data about.
  • In machine learning, we may be interested in learning value functions which match human preferences (this is especially important in complex settings where it is hard to specify a satisfactory value function by hand). However, we are unlikely to observe information about the value function directly, and instead must infer it implicitly. For instance, one might infer a value function for autonomous driving by observing the actions of an expert driver.

In all of the above scenarios, the primary object of interest — the scientific theory, the effects of a policy, and the value function, respectively — is not part of the observed data. Instead, we can think of it as an unobserved (or “latent”) variable in the model we are using to make predictions. While we might hope that a model that makes good predictions will also place correct values on unobserved variables as well, this need not be the case in general, especially if the model is mis-specified.

I am interested in latent variable inference because I think it is a potentially important sub-problem for building AI systems that behave safely and are aligned with human values. The connection is most direct for value learning, where the value function is the latent variable of interest and the fidelity with which it is learned directly impacts the well-behavedness of the system. However, one can imagine other uses as well, such as making sure that the concepts that an AI learns sufficiently match the concepts that the human designer had in mind. It will also turn out that latent variable inference is related to counterfactual reasoning, which has a large number of tie-ins with building safe AI systems that I will elaborate on in forthcoming posts.

The goal of this post is to explain why problems show up if one cares about predicting latent variables rather than observed variables, and to point to a research direction (counterfactual reasoning) that I find promising for addressing these issues. More specifically, in the remainder of this post, I will: (1) give some formal settings where we want to infer unobserved variables and explain why we can run into problems; (2) propose a possible approach to resolving these problems, based on counterfactual reasoning.

1 Identifying Parameters in Regression Problems

Suppose that we have a regression model p_{\theta}(y \mid x), which outputs a probability distribution over y given a value for x. Also suppose we are explicitly interested in identifying the “true” value of \theta rather than simply making good predictions about y given x. For instance, we might be interested in whether smoking causes cancer, and so we care not just about predicting whether a given person will get cancer (y) given information about that person (x), but specifically whether the coefficients in \theta that correspond to a history of smoking are large and positive.

In a typical setting, we are given data points (x_1,y_1), \ldots, (x_n,y_n) on which to fit a model. Most methods of training machine learning systems optimize predictive performance, i.e. they will output a parameter \hat{\theta} that (approximately) maximizes \sum_{i=1}^n \log p_{\theta}(y_i \mid x_i). For instance, for a linear regression problem we have \log p_{\theta}(y_i \mid x_i) = -(y_i - \langle \theta, x_i \rangle)^2. Various more sophisticated methods might employ some form of regularization to reduce overfitting, but they are still fundamentally trying to maximize some measure of predictive accuracy, at least in the limit of infinite data.

Call a model well-specified if there is some parameter \theta^* for which p_{\theta^*}(y \mid x) matches the true distribution over y, and call a model mis-specified if no such \theta^* exists. One can show that for well-specified models, maximizing predictive accuracy works well (modulo a number of technical conditions). In particular, maximizing \sum_{i=1}^n \log p_{\theta}(y_i \mid x_i) will (asymptotically, as n \to \infty) lead to recovering the parameter \theta^*.

However, if a model is mis-specified, then it is not even clear what it means to correctly infer \theta. We could declare the \theta maximizing predictive accuracy to be the “correct” value of \theta, but this has issues:

  1. While \theta might do a good job of predicting y in the settings we’ve seen, it may not predict y well in very different settings.
  2. If we care about determining \theta for some scientific purpose, then good predictive accuracy may be an unsuitable metric. For instance, even though margarine consumption might correlate well with (and hence be a good predictor of) divorce rate, that doesn’t mean that there is a causal relationship between the two.

The two problems above also suggest a solution: we will say that we have done a good job of inferring a value for \theta if \theta can be used to make good predictions in a wide variety of situations, and not just the situation we happened to train the model on. (For the latter case of predicting causal relationships, the “wide variety of situations” should include the situation in which the relevant causal intervention is applied.)

Note that both of the problems above are different from the typical statistical problem of overfitting. Clasically, overfitting occurs when a model is too complex relative to the amount of data at hand, but even if we have a large amount of data the problems above could occur. This is illustrated in the following graph:


Here the blue line is the data we have (x,y), and the green line is the model we fit (with slope and intercept parametrized by \theta). We have more than enough data to fit a line to it. However, because the true relationship is quadratic, the best linear fit depends heavily on the distribution of the training data. If we had fit to a different part of the quadratic, we would have gotten a potentially very different result. Indeed, in this situation, there is no linear relationship that can do a good job of extrapolating to new situations, unless the domain of those new situations is restricted to the part of the quadratic that we’ve already seen.

I will refer to the type of error in the diagram above as mis-specification error. Again, mis-specification error is different from error due to overfitting. Overfitting occurs when there is too little data and noise is driving the estimate of the model; in contrast, mis-specification error can occur even if there is plenty of data, and instead occurs because the best-performing model is different in different scenarios.

2 Structural Equation Models

We will next consider a slightly subtler setting, which in economics is referred to as a structural equation model. In this setting we again have an output y whose distribution depends on an input x, but now this relationship is mediated by an unobserved variable z. A common example is a discrete choice model, where consumers make a choice among multiple goods (y) based on a consumer-specific utility function (z) that is influenced by demographic and other information about the consumer (x). Natural language processing provides another source of examples: in semantic parsing, we have an input utterance (x) and output denotation (y), mediated by a latent logical form z; in machine translation, we have input and output sentences (x and y) mediated by a latent alignment (z).

Symbolically, we represent a structural equation model as a parametrized probability distribution p_{\theta}(y, z \mid x), where we are trying to fit the parameters \theta. Of course, we can always turn a structural equation model into a regression model by using the identity p_{\theta}(y \mid x) = \sum_{z} p_{\theta}(y, z \mid x), which allows us to ignore z altogether. In economics this is called a reduced form model. We use structural equation models if we are specifically interested in the unobserved variable z (for instance, in the examples above we are interested in the value function for each individual, or in the logical form representing the sentence’s meaning).

In the regression setting where we cared about identifying \theta, it was obvious that there was no meaningful “true” value of \theta when the model was mis-specified. In this structural equation setting, we now care about the latent variable z, which can take on a meaningful true value (e.g. the actual utility function of a given individual) even if the overall model p_{\theta}(y,z \mid x) is mis-specified. It is therefore tempting to think that if we fit parameters \theta and use them to impute z, we will have meaningful information about the actual utility functions of individual consumers. However, this is a notational sleight of hand — just because we call z “the utility function” does not make it so. The variable z need not correspond to the actual utility function of the consumer, nor does the consumer’s preferences even need to be representable by a utility function.

We can understand what goes wrong by consider the following procedure, which formalizes the proposal above:

  1. Find \theta to maximize the predictive accuracy on the observed data, \sum_{i=1}^n \log p_{\theta}(y_i \mid x_i), where p_{\theta}(y_i \mid x_i) = \sum_z p_{\theta}(y_i, z \mid x_i)). Call the result \theta_0.
  2. Using this value \theta_0, treat z_i as being distributed according to p_{\theta_0}(z \mid x_i,y_i). On a new value x_+ for which y is not observed, treat z_+ as being distributed according to p_{\theta_0}(z \mid x_+).

As before, if the model is well-specified, one can show that such a procedure asymptotically outputs the correct probability distribution over z. However, if the model is mis-specified, things can quickly go wrong. For example, suppose that y represents what choice of drink a consumer buys, and z represents consumer utility (which might be a function of the price, attributes, and quantity of the drink). Now suppose that individuals have preferences which are influenced by unmodeled covariates: for instance, a preference for cold drinks on warm days, while the input x does not have information about the outside temperature when the drink was bought. This could cause any of several effects:

  • If there is a covariate that happens to correlate with temperature in the data, then we might conclude that that covariate is predictive of preferring cold drinks.
  • We might increase our uncertainty about z to capture the unmodeled variation in y.
  • We might implicitly increase uncertainty by moving utilities closer together (allowing noise or other factors to more easily change the consumer’s decision).

In practice we will likely have some mixture of all of these, and this will lead to systematic biases in our conclusions about the consumers’ utility functions.

The same problems as before arise: while we by design place probability mass on values of z that correctly predict the observation y, under model mis-specification this could be due to spurious correlations or other perversities of the model. Furthermore, even though predictive performance is high on the observed data (and data similar to the observed data), there is no reason for this to continue to be the case in settings very different from the observed data, which is particularly problematic if one is considering the effects of an intervention. For instance, while inferring preferences between hot and cold drinks might seem like a silly example, the design of timber auctions constitutes a much more important example with a roughly similar flavour, where it is important to correctly understand the utility functions of bidders in order to predict their behaviour under alternative auction designs (the model is also more complex, allowing even more opportunities for mis-specification to cause problems).

3 A Possible Solution: Counterfactual Reasoning

In general, under model mis-specification we have the following problems:

  • It is often no longer meaningful to talk about the “true” value of a latent variable \theta (or at the very least, not one within the specified model family).
  • Even when there is a latent variable z with a well-defined meaning, the imputed distribution over z need not match reality.

We can make sense of both of these problems by thinking in terms of counterfactual reasoning. Without defining it too formally, counterfactual reasoning is the problem of making good predictions not just in the actual world, but in a wide variety of counterfactual worlds that “could” exist. (I recommend this paper as a good overview for machine learning researchers.)

While typically machine learning models are optimized to predict well on a specific distribution, systems capable of counterfactual reasoning must make good predictions on many distributions (essentially any distribution that can be captured by a reasonable counterfactual). This stronger guarantee allows us to resolve many of the issues discussed above, while still thinking in terms of predictive performance, which historically seems to have been a successful paradigm for machine learning. In particular:

  • While we can no longer talk about the “true” value of \theta, we can say that a value of \theta is a “good” value if it makes good predictions on not just a single test distribution, but many different counterfactual test distributions. This allows us to have more confidence in the generalizability of any inferences we draw based on \theta (for instance, if \theta is the coefficient vector for a regression problem, any variable with positive sign is likely to robustly correlate with the response variable for a wide variety of settings).
  • The imputed distribution over a variable z must also lead to good predictions for a wide variety of distributions. While this does not force z to match reality, it is a much stronger condition and does at least mean that any aspect of z that can be measured in some counterfactual world must correspond to reality. (For instance, any aspect of a utility function that could at least counterfactually result in a specific action would need to match reality.)
  • We will successfully predict the effects of an intervention, as long as that intervention leads to one of the counterfactual distributions considered.

(Note that it is less clear how to actually train models to optimize counterfactual performance, since we typically won’t observe the counterfactuals! But it does at least define an end goal with good properties.)

Many people have a strong association between the concepts of “counterfactual reasoning” and “causal reasoning”. It is important to note that these are distinct ideas; causal reasoning is a type of counterfactual reasoning (where the counterfactuals are often thought of as centered around interventions), but I think of counterfactual reasoning as any type of reasoning that involves making robustly correct statistical inferences across a wide variety of distributions. On the other hand, some people take robust statistical correlation to be the definition of a causal relationship, and thus do consider causal and counterfactual reasoning to be the same thing.

I think that building machine learning systems that can do a good job of counterfactual reasoning is likely to be an important challenge, especially in cases where reliability and safety are important, and necessitates changes in how we evaluate machine learning models. In my mind, while the Turing test has many flaws, one thing it gets very right is the ability to evaluate the accuracy of counterfactual predictions (since dialogue provides the opportunity to set up counterfactual worlds via shared hypotheticals). In contrast, most existing tasks focus on repeatedly making the same type of prediction with respect to a fixed test distribution. This latter type of benchmarking is of course easier and more clear-cut, but fails to probe important aspects of our models. I think it would be very exciting to design good benchmarks that require systems to do counterfactual reasoning, and I would even be happy to incentivize such work monetarily.


Thanks to Michael Webb, Sindy Li, and Holden Karnofsky for providing feedback on drafts of this post. If any readers have additional feedback, please feel free to send it my way.

Individual Project Fund: Further Details

In my post on where I plan to donate in 2016, I said that I would set aside $2000 for funding promising projects that I come across in the next year:

The idea behind the project fund is … [to] give in a low-friction way on scales that are too small for organizations like Open Phil to think about. Moreover, it is likely good for me to develop a habit of evaluating projects I come across and thinking about whether they could benefit from additional money (either because they are funding constrained, or to incentivize an individual who is on the fence about carrying the project out). Finally, if this effort is successful, it is possible that other EAs will start to do this as well, which could magnify the overall impact. I think there is some danger that I will not be able to allocate the $2000 in the next year, in which case any leftover funds will go to next year’s donor lottery.

In this post I will give some further details about this fund. My primary goal is to give others an idea of what projects I am likely to consider funding, so that anyone who thinks they might be a good fit for this can get in contact with me. (I also expect many of the best opportunities to come from people that I meet in person but don’t necessarily read this blog, so I plan to actively look for projects throughout the year as well.)

I am looking to fund or incentivize projects that meet several of the criteria below:

  • The project is in the area of computer science, especially one of machine learning, cyber security, algorithmic game theory, or computational social choice. [Some other areas that I would be somewhat likely to consider, in order of plausibility: economics, statistics, political science (especially international security), and biology.]
  • The project either wouldn’t happen, or would seem less worthwhile / higher-effort without the funding.
  • The organizer is someone who either I or someone I trust has an exceptionally high opinion of.
  • The project addresses a topic that I personally think is highly important. High-level areas that I tend to care about include international security, existential risk, AI safety, improving political institutions, improving scientific institutions, and helping the global poor. Technical areas that I tend to care about include reliable machine learning, machine learning and security, counterfactual reasoning, and value learning. On the other hand, if you have a project that you feel has a strong case for importance but doesn’t fit into these areas, I am interested in hearing about it.
  • It is unlikely that this project or a substantially similar project would be done by someone else at a similar level of quality. (Or, whoever else is likely to do it would instead focus on a similarly high-value project, if this one were to be taken care of.)
  • The topic pertains to a technical area that I or someone I trust has a high degree of expertise in, and can evaluate more quickly and accurately than a non-specialized funder.

It isn’t necessary to meet all of the criteria above, but I would probably want most things I fund to meet at least 4 of these 6.

Here are some concrete examples of things I might fund:

  • Someone is thinking of doing a project that is undervalued (in terms of career benefits) but would be very useful. They don’t feel excited about allocating time to a non-career-relevant task but would feel more excited if getting an award of $1000 for their efforts.
  • Someone I trust is starting a new discussion group in an area that I think is important, but can’t find anyone to sponsor it, and wants money for providing food at the meetings.
  • Someone wants to do an experiment that I find valuable, but needs more compute resources than they have, and could use money for buying AWS hours.
  • Someone wants to curate a valuable dataset and needs money for hiring mechanical turkers.
  • Someone is organizing a workshop and needs money for securing a venue.
  • One project I am particularly interested in is a good survey paper at the intersection of machine learning and cyber security. If you might be interested in doing this, I would likely be willing to pay you.
  • There are likely many projects in the area of political activism that I would be interested in funding, although (due to crowdedness concerns) I have a particularly high bar for this area in terms of the criteria I laid out above.

If you think you might have a project that could use funding, please get in touch with me at Even if you are not sure if your project would be a good target for funding, I am very happy to talk to you about it. In addition, please feel free to comment either here or via e-mail if you have feedback on this general idea, or thoughts on types of small-project funding that I missed above.

Donations for 2016

The following explains where I plan to donate in 2016, with some of my thinking behind it. This year, I had $10,000 to allocate (the sum of my giving from 2015 and 2016, which I lumped together for tax reasons; although I think this was a mistake in retrospect, both due to discount rates and because I could have donated in January and December 2016 and still received the same tax benefits).

To start with the punch line: I plan to give $4000 to the EA donor lottery, $2500 to GiveWell for discretionary granting, $2000 to be held in reserve to fund promising projects, $500 to GiveDirectly, $500 to the Carnegie Endowment (earmarked for the Carnegie-Tsinghua Center), and $500 to the Blue Ribbon Study Panel.

For those interested in donating to any of these: instructions for the EA donor lottery and the Blue Ribbon Study Panel are in the corresponding links above, and you can donate to both GiveWell and GiveDirectly at this page. I am looking in to whether it is possible for small donors to give to the Carnegie Endowment, and will update this page when I find out.

At a high level, I partitioned my giving into two categories, which are roughly (A) “help poor people right now” and (B) “improve the overall trajectory of civilization” (these are meant to be rough delineations rather than rigorous definitions). I decided to split my giving into 30% category A and 70% category B. This is because while I believe that category B is the more pressing and impactful category to address in some overall utilitarian sense, I still feel a particular moral obligation towards helping the existing poor in the world we currently live in, which I don’t feel can be discharged simply by giving more to category B. The 30-70 split is meant to represent the fact that while category B seems more important to me, category A still receives substantial weight in my moral calculus (which isn’t fully utilitarian or even consequentialist).

The rest of this post treats categories A and B each in turn.

Category A: The Global Poor

Out of $3000 in total, I decided to give $2500 to GiveWell for discretionary regranting (which will likely be disbursed roughly but not exactly according to GiveWell’s recommended allocation), and $500 to some other source, with the only stipulation being that it did not exactly match GiveWell’s recommendation. The reason for this was the following: while I expect GiveWell’s recommendation to outperform any conclusion that I personally reach, I think there is substantial value in the exercise of personally thinking through where to direct my giving. A few more specific reasons:

  • Most importantly, while I think that offloading giving decisions to a trusted expert is the correct decision to maximize the impact of any individual donation, collectively it leads to a bad equilibrium where substantially fewer and less diverse brainpower is devoted to thinking about where to give. I think that giving a small but meaningful amount based on one’s own reasoning largely ameliorates this effect without losing much direct value.
  • In addition, I think it is good to build the skills to in principle think through where to direct resources, even if in practice most of the work is outsourced to a dedicated organization.
  • Finally, having a large number of individual donors check GiveWell’s work and search for alternatives creates stronger incentives for GiveWell to do a thorough job (and allows donors to have more confidence that GiveWell is doing a thorough job). While I know many GiveWell staff and believe that they would do an excellent job independently of external vetting, I still think this is good practice.

Related to the last point: doing this exercise gave me a better appreciation for the overall reliability, strengths, and limitations of GiveWell’s work. In general, I found that GiveWell’s work was incredibly thorough (more-so than I expected despite my high opinion of them), and moreover that they have moved substantial money beyond the publicized annual donor recommendations. An example of this is their 2016 grant to IDinsight. IDinsight ended up being one of my top candidates for where to donate, such that I thought it was plausibly even better than a GiveWell top charity. However, when I looked into it further it turned out that GiveWell had already essentially filled their entire funding gap.

I think this anecdote serves to illustrate a few things: first, as noted, GiveWell is very thorough, and does substantial work beyond what is apparent from the top charities page. Second, while GiveWell had already given to IDinsight, the grant was made in 2016. I think the same process I used would not have discovered IDinsight in 2015, but it’s possible that other processes would have. So, I think it is possible that a motivated individual could identify strong giving opportunities a year ahead of GiveWell. As a point against this, I think I am in an unusually good position to do this and still did not succeed. I also think that even if an individual identified a strong opportunity, it is unlikely that they could be confident that it was strong, and in most cases GiveWell’s top charities would still be better bets in expectation (but I think that merely identifying a plausibly strong giving opportunity should count as a huge success for the purposes of the overall exercise).

To elaborate on why my positioning might be atypically good: I already know GiveWell staff and so have some appreciation for their thinking, and I work at Stanford and have several friends in the economics department, which is one of the strongest departments in the world for Development Economics. In particular, I discussed my giving decisions extensively with a student of Pascaline Dupas, who is one of the world experts in the areas of economics most relevant to GiveWell’s recommendations.

Below are specifics on organizations I looked into and where I ultimately decided to give.

Object-level Process and Decisions (Category A)

My process for deciding where to give mostly consisted of talking to several people I trust, brainstorming and thinking things through myself, and a small amount of online research. (I think that I should likely have done substantially more online research than I ended up doing, but my thinking style tends to benefit from 1-on-1 discussions, which I also find more enjoyable.) The main types of charities that I ended up considering were:

  • GiveDirectly (direct cash transfers)
  • IPA/JPAL and similar groups (organizations that support academic research on  international development)
  • IDinsight and similar groups (similar to the previous group, but explicitly tries to do the “translational work” of going from academic research to evidence-backed large-scale interventions)
  • public information campaigns (such as Development Media International)
  • animal welfare
  • start-ups or other small groups in the development space that might need seed funding
  • meta-charities such as CEA that try to increase the amount of money moved to EA causes (or evidence-backed charity more generally)

I ultimately felt unsure whether animal welfare should count in this category, and while I felt that CEA was a potentially strong candidate in terms of pure cost-effectiveness, directing funds there felt overly insular/meta to me in a way that defeated the purpose of the giving exercise. (Note: two individuals who reviewed this post encouraged me to revisit this point; as a result, next year I plan to look into CEA in more detail.)

While looking into the “translational work” category, I came across one organization other than IDinsight that did work in this area and was well-regarded by at least some economists. While I was less impressed by them than I was by IDinsight, they seemed plausibly strong, and it turned out that GiveWell had not yet evaluated them. While I ended up deciding not to give to them (based on feeling that IDinsight was likely to do substantially better work in the same area) I did send GiveWell an e-mail bringing the organization to their attention.

When looking into IPA, my impression was that while they have been responsible for some really good work in the past, this was primarily while they were a smaller organization, and they have now become large and bureaucratic enough that their future value will be substantially lower. However, I also found out about an individual who was running a small organization in the same space as IPA, and seemed to be doing very good work. While I was unable to offer them money for reasons related to conflict of interest, I do plan to try to find ways to direct funds to them if they are interested.

While public information campaigns seem like they could a priori be very effective, briefly looking over GiveWell’s page on DMI gave me the impression that GiveWell had already considered this area in a great deal of depth and prioritized other interventions for good reasons.

I ultimately decided to give my money to GiveDirectly. While in some sense this violates the spirit of the exercise, I felt satisfied about having found at least one potentially good giving opportunity (the small IPA-like organization) even if I was unable to give to it personally, and overall felt that I had done a reasonable amount of research. Moreover, I have a strong intuition that 0% is the wrong allocation for GiveDirectly, and it wasn’t clear to me that GiveWell’s reasons for recommending 0% were strong enough to override that intuition.

So, overall, $2500 of my donation will go to GiveWell for discretionary re-granting, and $500 to GiveDirectly.

Trajectory of Civilization (Category B)

First, I plan to put $2000 into escrow for the purpose of supporting any useful small projects (specifically in the field of computer science / machine learning) that I come across in the next year. For the remaining $5000, I plan to allocate $4000 of it to the donor lottery, $500 to the Carnegie Endowment, and $500 to the Blue Ribbon Study Panel on Biodefense. For the latter, I wanted to donate to something that improved medium-term international security, because I believe that this is an important area that is relatively under-invested in by the effective altruist community (both in terms of money and cognitive effort). Here are all of the major possibilities that I considered:

  • Donating to the Future of Humanity Institute, with funds earmarked towards their collaboration with Allan Dafoe. I decided against this because my impression was that this particular project was not funding-constrained. (However, I am very excited by the work that Allan and his collaborators are doing, and would like to find ways to meaningfully support it.)
  • Donating to the Carnegie Endowment, restricted specifically to the Carnegie-Tsinghua Center. My understanding is that this is one of the few western organizations working to influence China’s nuclear policy (though this is based on personal conversation and not something I have looked into myself). My intuition is that influencing Chinese nuclear policy is substantially more tractable than U.S. nuclear policy, due to far fewer people trying to do so. In addition, from looking at their website, I felt that most of the areas they worked in were important areas, which I believe to be unusual for large organizations with multiple focuses (as a contrast, for other organizations with a similar number of focus areas, I felt that roughly half of the areas were obviously orders of magnitude less important than the areas I was most excited about). I had some reservations about donating (due to their size: $30 million in revenue per year, and $300 million in assets), but I decided to donate $500 anyways because I am excited about this general type of work. (This organization was brought to my attention by Nick Beckstead; Nick notes that he doesn’t have strong opinions about this organization, primarily due to not knowing much about them.)
  • Donating to the Blue Ribbon Study Panel: I am basically trusting Jaime Yassif that this is a strong recommendation within the area of biodefense.
  • Donating to the ACLU: The idea here would be to decrease the probability that a President Trump seriously erodes democratic norms within the U.S. I however currently expect the ACLU to be well-funded (my understanding is that they got a flood of donations after Trump was elected).
  • Donating to the DNC or the Obama/Holder redistricting campaign: This is based on the idea that (1) Democrats are much better than Republicans for global stability / good U.S. policy, and (2) Republicans should be punished for helping Trump to become president. I basically agree with both, and could see myself donating to the redistricting campaign in particular in the future, but this intuitively feels less tractable/underfunded than non-partisan efforts like the Carnegie Endowment or Blue Ribbon Study Panel.
  • Creating a prize fund for incentivizing important research projects within computer science: I was originally planning to allocate $1000 to $2000 to this, based on the idea that computer science is a key field for multiple important areas (both AI safety and cyber security) and that as an expert in this field I would be in a unique position to identify useful projects relative to others in the EA community. However, after talking to several people and thinking about it myself, I decided that it was likely not tractable to provide meaningful incentives via prizes at such a small scale, and opted to instead set aside $2000 to support promising projects as I come across them.

(As a side note: it isn’t completely clear to me whether the Carnegie Endowment accepts small donations. I plan to contact them about this, and if they do not, allocate the money to the Blue Ribbon Study Panel instead.)

In the remainder of this post I will briefly describe the $2000 project fund, how I plan to use it, and why I decided it was a strong giving opportunity. I also plan to describe this in more detail in a separate follow-up post. Credit goes to Owen Cotton-Barratt for suggesting this idea. In addition, one of Paul Christiano’s blog posts inspired me to think about using prizes to incentivize research, and Holden Karnofsky further encouraged me to think along these lines.

The idea behind the project fund is similar to the idea behind the prize fund: I understand research in computer science better than most other EAs, and can give in a low-friction way on scales that are too small for organizations like Open Phil to think about. Moreover, it is likely good for me to develop a habit of evaluating projects I come across and thinking about whether they could benefit from additional money (either because they are funding constrained, or to incentivize an individual who is on the fence about carrying the project out). Finally, if this effort is successful, it is possible that other EAs will start to do this as well, which could magnify the overall impact. I think there is some danger that I will not be able to allocate the $2000 in the next year, in which case any leftover funds will go to next year’s donor lottery.

Thinking Outside One’s Paradigm

When I meet someone who works in a field outside of computer science, I usually ask them a lot of questions about their field that I’m curious about. (This is still relevant even if I’ve already met someone in that field before, because it gives me an idea of the range of expert consensus; for some questions this ends up being surprisingly variable.) I often find that, as an outsider, I can think of natural-seeming questions that experts in the field haven’t thought about, because their thinking is confined by their field’s paradigm while mine is not (pessimistically, it’s instead constrained by a different paradigm, i.e. computer science).

Usually my questions are pretty naive, and are basically what a computer scientist would think to ask based on their own biases. For instance:

  • Neuroscience: How much computation would it take to simulate a brain? Do our current theories of how neurons work allow us to do that even in principle?
  • Political science: How does the rise of powerful multinational corporations affect theories of international security (typical past theories assume that the only major powers are states)? How do we keep software companies (like Google, etc.) politically accountable? How will cyber attacks / cyber warfare affect international security?
  • Materials science: How much of the materials design / discovery process can be automated? What are the bottlenecks to building whatever materials we would like to? How can different research groups effectively communicate and streamline their steps for synthesizing materials?

When I do this, it’s not unusual for me to end up asking questions that the other person hasn’t really thought about before. In this case, responses range from “that’s not a question that our field studies” to “I haven’t thought about this much, but let’s try to think it through on the spot”. Of course, sometimes the other person has thought about it, and sometimes my question really is just silly or ill-formed for some reason (I suspect this is true more often than I’m explicitly made aware of, since some people are too polite to point it out to me).

I find the cases where the other person hasn’t thought about the question to be striking, because it means that I as a naive outsider can ask natural-seeming questions that haven’t been considered before by an expert in the field. I think what is going on here is that I and my interlocutor are using different paradigms (in the Kuhnian sense) for determining what questions are worth asking in a field. But while there is a sense in which the other person’s paradigm is more trustworthy — since it arose from a consensus of experts in the relevant field — that doesn’t mean that it’s absolutely reliable. Paradigms tend to blind one to evidence or problems that don’t fit into that paradigm, and paradigm shifts in science aren’t really that rare. (In addition, many fields including machine learning don’t even have a single agreed-upon paradigm.)

I think that as a scientist (or really, even as a citizen) it is important to be able to see outside one’s own paradigm. I currently think that I do a good job of this, but it seems to me that there’s a big danger of becoming more entrenched as I get older. Based on the above experiences, I plan to use the following test: When someone asks me a question about my field, how often have I not thought about it before? How tempted am I to say, “That question isn’t interesting”? If these start to become more common, then I’ll know something has gone wrong.

A few miscellaneous observations:

  • There are several people I know who routinely have answers to whatever questions I ask. Interestingly, they tend to be considered slightly “crackpot-ish” within their field; and they might also be less successful by conventional metrics, relatively to how smart they are considered by their colleagues. I think this is a result of the fact that most academic fields over-reward progress within that field’s paradigm and under-reward progress outside of it.
  • Beyond “slightly crakpot-ish academics”, the other set of people who routinely have answers to my questions are philosophers and some people in program manager roles (this includes certain types of VCs as well).
  • I would guess that in general technical fields that overlap with the humanities are more likely to take a broad view and not get stuck in a single paradigm. For instance, I would expect political scientists to have thought about most of the political science questions I mentioned above; however, I haven’t talked to enough political scientists (or social scientists in general) to have much confidence in this.