I’ve spent much of the last few days reading various ICML papers and I find there’s a few pieces of feedback that I give consistently across several papers. I’ve collated some of these below. As a general note, many of these are about local style rather than global structure; I think that good local style probably contributes substantially more to readability than global structure and is in general under-rated. I’m in general pretty willing to break rules about global structure (such as even having a conclusion section in the first place! though this might cause reviewers to look at your paper funny), but not to break local stylistic rules without strong reasons.

• Be precise. This isn’t about being pedantic, but about maximizing information content. Choose your words carefully so that you say what you mean to say. For instance, replace “performance” with “accuracy” or “speed” depending on what you mean.
• Be concise. Most of us write in an overly wordy style, because it’s easy to and no one drilled it out of us. Not only does wordiness decrease readability, it wastes precious space if you have a page limit.
• Avoid complex sentence structure. Most research is already difficult to understand and digest; there’s no reason to make it harder by having complex run-on sentences.
• Use consistent phrasing. In general prose, we’re often told to refer to the same thing in different ways to avoid boring the reader, but in technical writing this will lead to confusion. Hopefully your actual results are interesting enough that the reader doesn’t need to be entertained by your large vocabulary.

Abstract

• There’s more than one approach to writing a good abstract, and which one you take will depend on the sort of paper you’re writing. I’ll give one approach that is good for papers presenting an unusual or unfamiliar idea to readers.
• The first sentence / phrase should be something that all readers will agree with. The second should be something that many readers would find surprising, or wouldn’t have thought about before; but it should follow from (or at least be supported by) the first sentence. The general idea is that you need to start by warming the reader up and putting them in the right context, before they can appreciate your brilliant insight.
• Here’s an example from my Reified Context Models paper: “A classic tension exists between exact inference in a simple model and approximate inference in a complex model. The latter offers expressivity and thus accuracy, but the former provides coverage of the space, an important property for confidence estimation and learning with indirect supervision.” Note how the second sentence conveys a non-obvious claim — that coverage is important for confidence estimation as well as for indirect supervision. It’s tempting to lead with this in order to make the first sentence more punchy, but this will tend to go over reader’s heads. Imagine if the abstract had started, “In the context of inference algorithms, coverage of the space is important for confidence estimation and indirect supervision.” No one is going to understand what that means.

Introduction

• The advice in this section is most applicable to the introduction section (and maybe related work and discussion), but applies on some level to other parts of the paper as well.
• Many authors (myself included) end up using phrases like “much recent interest” and “increasingly important” because these phrases show up frequently in academic papers, and they are vague enough to be defensible. Even though these phrases are common, they are bad writing! They are imprecise and rely on hedge words to avoid having to explain why something is interesting or important.
• Make sure to provide context before introducing a new concept; if you suddenly start talking about “NP-hardness” or “local transformations”, you need to first explain to the reader why this is something that should be considered in the present situation.
• Don’t beat around the bush; if the point is “A, therefore B” (where B is some good fact about your work), then say that, rather than being humble and just pointing out A.
• Don’t make the reader wait for the payoff; spell it out in the introduction. I frequently find that I have to wait until Section 4 to find out why I should care about a paper; while I might read that far, most reviewers are going to give up about halfway through Section 1. (Okay, that was a bit of an exaggeration; they’ll probably wait until the end of Section 1 before giving up.)

Conclusion / Discussion

• I generally put in the conclusion everything that I wanted to put in the introduction, but couldn’t because readers wouldn’t be able to appreciate the context without reading the rest of the paper first. This is a relatively straightforward way to write a conclusion that isn’t just a re-hash of the introduction.
• The conclusion can also be a good place to discuss open questions that you’d like other researchers to think about.
• My model is that only the ~5 people most interested in your paper are going to actually read this section, so it’s worth somewhat tailoring to that audience. Unfortunately, the paper reviewers might also read this section, so you can’t tailor it too much or the reviewers might get upset if they end up not being in the target audience.
• For theory papers, having a conclusion is completely optional (I usually skip it). In this case, open problems can go in the introduction. If you’re submitting a theory paper to NIPS or ICML, you unfortunately need a conclusion or  reviewers will get upset. In my opinion, this is an instance where peer review makes the paper worse rather than better.

LaTeX

• Proper citation style: one should write “Widgets are awesome (Smith, 2001).” or “Smith (2001) shows that widgets are awesome.” but never “(Smith, 2001) shows that widgets are awesome.” You can control this in LaTeX using \citep{} and \citet{} if you use natbib.
• Display equations can take up a lot of space if over-used, but at the same time, too many in-line equations can make your document hard to read. Think carefully about which equations are worth displaying, and whether your in-line equations are becoming too dense.
• If leave a blank line after or , you will create an extra line break in the document. This is sort of annoying because white-space isn’t supposed to matter in that way, but you can save a lot of space by remembering this.
• DON’T use the fullpage package. I’m used to using \usepackage{fullpage} in documents to get the margins that I want, but this will override options in many style files (including jmlr.sty which is used in machine learning).
• \left( and \right) can be convenient for auto-sizing parentheses, but are often overly conservative (e.g. making parentheses too big due to serifs or subscripts). It’s fine to use \left( and \right) initially, but you might want to specify explicit sizes with \big(, \Big(, \bigg(, etc. in the final pass.
• When displaying a sequence of equations (e.g. with the align environment), use \stackrel{} on any non-trivial equality or inequality statements and justify these steps immediately after the equation. See the bottom of page 6 of this paper for an example.
• Make sure that \label{} commands come after the \caption{} command in a figure (rather than before), otherwise your numbering will be wrong.

Math

• When using a variable that hasn’t appeared in a while, remind the reader what it is (i.e., “the sample space $\mathcal{X}$” rather than “$\mathcal{X}$“.
• If it’s one of the main points of your work, call it a Theorem. If it’s a non-trivial conclusion that requires a somewhat involved argument (but it’s not a main point of the work), call it a Proposition. If the proof is short or routine, call it a Lemma, unless it follows directly from a Theorem you just stated, in which case call it a Corollary.
• As a general rule there shouldn’t be more than 3 theorems in your paper (probably not more than 1). If you think this is unreasonable, consider that my COLT 2015 paper has 3 theorems across 24 pages, and my STOC 2017 paper has 2 theorems across 47 pages (not counting stating the same theorem in multiple locations).
• If you just made a mathematical argument in the text that ended up with a non-trivial conclusion, you probably want to encapsulate it in a Proposition or Theorem. (Better yet, state the theorem before the argument so that the reader knows what you’re arguing for; although this isn’t always the best ordering.)

## Model Mis-specification and Inverse Reinforcement Learning

In my previous post, “Latent Variables and Model Mis-specification”, I argued that while machine learning is good at optimizing accuracy on observed signals, it has less to say about correctly inferring the values for unobserved variables in a model. In this post I’d like to focus in on a specific context for this: inverse reinforcement learning (Ng et al. 2000, Abeel et al. 2004, Ziebart et al. 2008, Ho et al 2016), where one observes the actions of an agent and wants to infer the preferences and beliefs that led to those actions. For this post, I am pleased to be joined by Owain Evans, who is an active researcher in this area and has co-authored an online book about building models of agents (see here in particular for a tutorial on inverse reinforcement learning and inverse planning).

Owain and I are particularly interested in inverse reinforcement learning (IRL) because it has been proposed (most notably by Stuart Russell) as a method for learning human values in the context of AI safety; among other things, this would eventually involve learning and correctly implementing human values by artificial agents that are much more powerful, and act with much broader scope, than any humans alive today. While we think that overall IRL is a promising route to consider, we believe that there are also a number of non-obvious pitfalls related to performing IRL with a mis-specified model. The role of IRL in AI safety is to infer human values, which are represented by a reward function or utility function. But crucially, human values (or human reward functions) are never directly observed.

Below, we elaborate on these issues. We hope that by being more aware of these issues, researchers working on inverse reinforcement learning can anticipate and address the resulting failure modes. In addition, we think that considering issues caused by model mis-specification in a particular concrete context can better elucidate the general issues pointed to in the previous post on model mis-specification.

### Specific Pitfalls for Inverse Reinforcement Learning

In “Latent Variables and Model Mis-specification”, Jacob talked about model mis-specification, where the “true” model does not lie in the model family being considered. We encourage readers to read that post first, though we’ve also tried to make the below readable independently.

In the context of inverse reinforcement learning, one can see some specific problems that might arise due to model mis-specification. For instance, the following are things we could misunderstand about an agent, which would cause us to make incorrect inferences about the agent’s values:

• The actions of the agent. If we believe that an agent is capable of taking a certain action, but in reality they are not, we might make strange inferences about their values (for instance, that they highly value not taking that action). Furthermore, if our data is e.g. videos of human behavior, we have an additional inference problem of recognizing actions from the frames.
• The information available to the agent. If an agent has access to more information than we think it does, then a plan that seems irrational to us (from the perspective of a given reward function) might actually be optimal for reasons that we fail to appreciate. In the other direction, if an agent has less information than we think, then we might incorrectly believe that they don’t value some outcome A, even though they really only failed to obtain A due to lack of information.
• The long-term plans of the agent. An agent might take many actions that are useful in accomplishing some long-term goal, but not necessarily over the time horizon that we observe the agent. Inferring correct values thus also requires inferring such long-term goals. In addition, long time horizons can make models more brittle, thereby exacerbating model mis-specification issues.

There are likely other sources of error as well. The general point is that, given a mis-specified model of the agent, it is easy to make incorrect inferences about an agent’s values if the optimization pressure on the learning algorithm is only towards predicting actions correctly in-sample.

In the remainder of this post, we will cover each of the above aspects — actions, information, and plans — in turn, giving both quantitative models and qualitative arguments for why model mis-specification for that aspect of the agent can lead to perverse beliefs and behavior. First, though, we will briefly review the definition of inverse reinforcement learning and introduce relevant notation.

### Inverse Reinforcement Learning: Definition and Notations

In inverse reinforcement learning, we want to model an agent taking actions in a given environment. We therefore suppose that we have a state space $S$ (the set of states the agent and environment can be in), an action space $A$ (the set of actions the agent can take), and a transition function $T(s' \mid s,a)$, which gives the probability of moving from state $s$ to state $s'$ when taking action $a$. For instance, for an AI learning to control a car, the state space would be the possible locations and orientations of the car, the action space would be the set of control signals that the AI could send to the car, and the transition function would be the dynamics model for the car. The tuple of $(S,A,T)$ is called an $MDP\backslash R$, which is a Markov Decision Process without a reward function. (The $MDP\backslash R$ will either have a known horizon or a discount rate $\gamma$ but we’ll leave these out for simplicity.)

Figure 1: Diagram showing how IRL and RL are related. (Credit: Pieter Abbeel’s slides on IRL)

The inference problem for IRL is to infer a reward function $R$ given an optimal policy $\pi^* : S \to A$ for the $MDP\backslash R$ (see Figure 1). We learn about the policy $\pi^*$ from samples $(s,a)$ of states and the corresponding action according to $\pi^*$ (which may be random). Typically, these samples come from a trajectory, which records the full history of the agent’s states and actions in a single episode:

$(s_0, a_0), (s_1, a_1), \ldots, (s_n, a_n)$

In the car example, this would correspond to the actions taken by an expert human driver who is demonstrating desired driving behaviour (where the actions would be recorded as the signals to the steering wheel, brake, etc.).

Given the $MDP\backslash R$ and the observed trajectory, the goal is to infer the reward function $R$. In a Bayesian framework, if we specify a prior on $R$ we have:

$P(R \mid s_{0:n},a_{0:n}) \propto P( s_{0:n},a_{0:n} \mid R) P(R) = P(R) \cdot \prod_{i=0}^n P( a_i \mid s_i, R)$

The likelihood $P(a_i \mid s_i, R)$ is just $\pi_R(s)[a_i]$, where $\pi_R$ is the optimal policy under the reward function $R$. Note that computing the optimal policy given the reward is in general non-trivial; except in simple cases, we typically approximate the policy using reinforcement learning (see Figure 1). Policies are usually assumed to be noisy (e.g. using a softmax instead of deterministically taking the best action). Due to the challenges of specifying priors, computing optimal policies and integrating over reward functions, most work in IRL uses some kind of approximation to the Bayesian objective (see the references in the introduction for some examples).

### Recognizing Human Actions in Data

IRL is a promising approach to learning human values in part because of the easy availability of data. For supervised learning, humans need to produce many labeled instances specialized for a task. IRL, by contrast, is an unsupervised/semi-supervised approach where any record of human behavior is a potential data source. Facebook’s logs of user behavior provide trillions of data-points. YouTube videos, history books, and literature are a trove of data on human behavior in both actual and imagined scenarios. However, while there is lots of existing data that is informative about human preferences, we argue that exploiting this data for IRL will be a difficult, complex task with current techniques.

Inferring Reward Functions from Video Frames

As we noted above, applications of IRL typically infer the reward function R from observed samples of the human policy $\pi^*$. Formally, the environment is a known $MDP\backslash R = (S,A,T)$ and the observations are state-action pairs, $(s,a) \sim pi^*$. This assumes that (a) the environment’s dynamics $T$ are given as part of the IRL problem, and (b) the observations are structured as “state-action” pairs. When the data comes from a human expert parking a car, these assumptions are reasonable. The states and actions of the driver can be recorded and a car simulator can be used for $T$. For data from YouTube videos or history books, the assumptions fail. The data is a sequence of partial observations: the transition function $T$ is unknown and the data does not separate out state and action. Indeed, it’s a challenging ML problem to infer human actions from text or videos.

Movie still: What actions are being performed in this situation? (Source)

As a concrete example, suppose the data is a video of two co-pilots flying a plane. The successive frames provide only limited information about the state of the world at each time step and the frames often jump forward in time. So it’s more like a POMDP with a complex observation model. Moreover, the actions of each pilot need to be inferred. This is a challenging inference problem, because actions can be subtle (e.g. when a pilot nudges the controls or nods to his co-pilot).

To infer actions from observations, some model relating the true state-action $(s,a)$ to the observed video frame must be used. But choosing any model makes substantive assumptions about how human values relate to their behavior. For example, suppose someone attacks one of the pilots and (as a reflex) he defends himself by hitting back. Is this reflexive or instinctive response (hitting the attacker) an action that is informative about the pilot’s values? Philosophers and neuroscientists might investigate this by considering the mental processes that occur before the pilot hits back. If an IRL algorithm uses an off-the-shelf action classifier, it will lock in some (contentious) assumptions about these mental processes. At the same time, an IRL algorithm cannot learn such a model because it never directly observes the mental processes that relate rewards to actions.

Inferring Policies From Video Frames

When learning a reward function via IRL, the ultimate goal is to use the reward function to guide an artificial agent’s behavior (e.g. to perform useful tasks to humans). This goal can be formalized directly, without including IRL as an intermediate step. For example, in Apprenticeship Learning, the goal is to learn a “good” policy for the $MDP\backslash R$ from samples of the human’s policy $\pi^*$ (where $\pi^*$ is assumed to approximately optimize an unknown reward function). In Imitation Learning, the goal is simply to learn a policy that is similar to the human’s policy.

Like IRL, policy search techniques need to recognize an agent’s actions to infer their policy. So they have the same challenges as IRL in learning from videos or history books. Unlike IRL, policy search does not explicitly model the reward function that underlies an agent’s behavior. This leads to an additional challenge. Humans and AI systems face vastly different tasks and have different action spaces. Most actions in videos and books would never be performed by a software agent. Even when tasks are similar (e.g. humans driving in the 1930s vs. a self-driving car in 2016), it is a difficult transfer learning problem to use human policies in one task to improve AI policies in another.

IRL Needs Curated Data

We argued that records of human behaviour in books and videos are difficult for IRL algorithms to exploit. Data from Facebook seems more promising: we can store the state (e.g. the HTML or pixels displayed to the human) and each human action (clicks and scrolling). This extends beyond Facebook to any task that can be performed on a computer. While this covers a broad range of tasks, there are obvious limitations. Many people in the world have a limited ability to use a computer: we can’t learn about their values in this way. Moreover, some kinds of human preferences (e.g. preferences over physical activities) seem hard to learn about from behaviour on a computer.

### Information and Biases

Human actions depend both on their preferences and their beliefs. The beliefs, like the preferences, are never directly observed. For narrow tasks (e.g. people choosing their favorite photos from a display), we can model humans as having full knowledge of the state (as in an MDP). But for most real-world tasks, humans have limited information and their information changes over time (as in a POMDP or RL problem). If IRL assumes the human has full information, then the model is mis-specified and generalizing about what the human would prefer in other scenarios can be mistaken. Here are some examples:

(1). Someone travels from their house to a cafe, which has already closed. If they are assumed to have full knowledge, then IRL would infer an alternative preference (e.g. going for a walk) rather than a preference to get a drink at the cafe.

(2). Someone takes a drug that is widely known to be ineffective. This could be because they have a false belief that the drug is effective, or because they picked up the wrong pill, or because they take the drug for its side-effects. Each possible explanation could lead to different conclusions about preferences.

(3). Suppose an IRL algorithm is inferring a person’s goals from key-presses on their laptop. The person repeatedly forgets their login passwords and has to reset them. This behavior is hard to capture with a POMDP-style model: humans forget some strings of characters and not others. IRL might infer that the person intends to repeatedly reset their passwords.

Example (3) above arises from humans forgetting information — even if the information is only a short string of characters. This is one way in which humans systematically deviate from rational Bayesian agents. The field of psychology has documented many other deviations. Below we discuss one such deviation — time-inconsistency — which has been used to explain temptation, addiction and procrastination.

Time-inconsistency and Procrastination

An IRL algorithm is inferring Alice’s preferences. In particular, the goal is to infer Alice’s preference for completing a somewhat tedious task (e.g. writing a paper) as opposed to relaxing. Alice has $T$ days in which she could complete the task and IRL observes her working or relaxing on each successive day.

Figure 2. MDP graph for choosing whether to “work” or “wait” (relax) on a task.

Formally, let R be the preference/reward Alice assigns to completing the task. Each day, Alice can “work” (receiving cost $w$ for doing tedious work) or “wait” (cost $0$). If she works, she later receives the reward $R$ minus a tiny, linearly increasing cost (because it’s better to submit a paper earlier). Beyond the deadline at $T$, Alice cannot get the reward $R$. For IRL, we fix $\epsilon$ and $w$ and infer $R$.

Suppose Alice chooses “wait” on Day 1. If she were fully rational, it follows that R (the preference for completing the task) is small compared to $w$ (the psychological cost of doing the tedious work). In other words, Alice doesn’t care much about completing the task. Rational agents will do the task on Day 1 or never do it. Yet humans often care deeply about tasks yet leave them until the last minute (when finishing early would be optimal). Here we imagine that Alice has 9 days to complete the task and waits until the last possible day.

Figure 3: Graph showing IRL inferences for Optimal model (which is mis-specified) and Possibly Discounting Model (which includes hyperbolic discounting). On each day ($x$-axis) the model gets another observation of Alice’s choice. The $y$-axis shows the posterior mean for $R$ (reward for task), where the tedious work $w = -1$.

Figure 3 shows results from running IRL on this problem. There is an “Optimal” model, where the agent is optimal up to an unknown level of softmax random noise (a typical assumption for IRL). There is also a “Possibly Discounting” model, where the agent is either softmax optimal or is a hyperbolic discounter (with unknown level of discounting). We do joint Bayesian inference over the completion reward $R$, the softmax noise and (for “Possibly Discounting”) how much the agent hyperbolically discounts. The work cost $w$ is set to $-1$. Figure 3 shows that after 6 days of observing Alice procrastinate, the “Optimal” model is very confident that Alice does not care about the task $(R < |w|)$. When Alice completes the task on the last possible day, the posterior mean on R is not much more than the prior mean. By contrast, the “Possibly Discounting” model never becomes confident that Alice doesn’t care about the task. (Note that the gap between the models would be bigger for larger $T$. The “Optimal” model’s posterior on R shoots back to its Day-0 prior because it explains the whole action sequence as due to high softmax noise — optimal agents without noise would either do the task immediately or not at all. Full details and code are here.)

### Long-term Plans

Agents will often take long series of actions that generate negative utility for them in the moment in order to accomplish a long-term goal (for instance, studying every night in order to perform well on a test). Such long-term plans can make IRL more difficult for a few reasons. Here we focus on two: (1) IRL systems may not have access to the right type of data for learning about long-term goals, and (2) needing to predict long sequences of actions can make algorithms more fragile in the face of model mis-specification.

(1) Wrong type of data. To make inferences based on long-term plans, it would be helpful to have coherent data about a single agent’s actions over a long period of time (so that we can e.g. see the plan unfolding). But in practice we will likely have substantially more data consisting of short snapshots of a large number of different agents (e.g. because many internet services already record user interactions, but it is uncommon for a single person to be exhaustively tracked and recorded over an extended period of time even while they are offline).

The former type of data (about a single representative population measured over time) is called panel data, while the latter type of data (about different representative populations measured at each point in time) is called repeated cross-section data. The differences between these two types of data is well-studied in econometrics, and a general theme is the following: it is difficult to infer individual-level effects from cross-sectional data.

An easy and familiar example of this difference (albeit not in an IRL setting) can be given in terms of election campaigns. Most campaign polling is cross-sectional in nature: a different population of respondents is polled at each point in time. Suppose that Hillary Clinton gives a speech and her overall support according to cross-sectional polls increases by 2%; what can we conclude from this? Does it mean that 2% of people switched from Trump to Clinton? Or did 6% of people switch from Trump to Clinton while 4% switched from Clinton to Trump?

At a minimum, then, using cross-sectional data leads to a difficult disaggregation problem; for instance, different agents taking different actions at a given point in time could be due to being at different stages in the same plan, or due to having different plans, or some combination of these and other factors. Collecting demographic and other side data can help us (by allowing us to look at variation and shifts within each subpopulation), but it is unclear if this will be sufficient in general.

On the other hand, there are some services (such as Facebook or Google) that do have extensive data about individual users across a long period of time. However, this data has another issue: it is incomplete in a very systematic way (since it only tracks online behaviour). For instance, someone might go online most days to read course notes and Wikipedia for a class; this is data that would likely be recorded. However, it is less likely that one would have a record of that person taking the final exam, passing the class and then getting an internship based on their class performance. Of course, some pieces of this sequence would be inferable based on some people’s e-mail records, etc., but it would likely be under-represented in the data relative to the record of Wikipedia usage. In either case, some non-trivial degree of inference would be necessary to make sense of such data.

(2) Fragility to mis-specification. Above we discussed why observing only short sequences of actions from an agent can make it difficult to learn about their long-term plans (and hence to reason correctly about their values). Next we discuss another potential issue — fragility to model mis-specification.

Suppose someone spends 99 days doing a boring task to accomplish an important goal on day 100. A system that is only trying to correctly predict actions will be right 99% of the time if it predicts that the person inherently enjoys boring tasks. Of course, a system that understands the goal and how the tasks lead to the goal will be right 100% of the time, but even minor errors in its understanding could bring the accuracy back below 99%.

The general issue is the following: large changes in the model of the agent might only lead to small changes in the predictive accuracy of the model, and the longer the time horizon on which a goal is realized, the more this might be the case. This means that even slight mis-specifications in the model could tip the scales back in favor of a (very) incorrect reward function. A potential way of dealing with this might be to identify “important” predictions that seem closely tied to the reward function, and focus particularly on getting those predictions right (see here for a paper exploring a similar idea in the context of approximate inference).

One might object that this is only a problem in this toy setting; for instance, in the real world, one might look at the particular way in which someone is studying or performing some other boring task to see that it coherently leads towards some goal (in a way that would be less likely were the person to be doing something boring purely for enjoyment). In other words, correctly understanding the agent’s goals might allow for more fine-grained accurate predictions which would fare better under e.g. log-score than would an incorrect model.

This is a reasonable objection, but there are some historical examples of this going wrong that should give one pause. That is, there are historical instances where: (i) people expected a more complex model that seemed to get at some underlying mechanism to outperform a simpler model that ignored that mechanism, and (ii) they were wrong (the simpler model did better under log-score). The example we are most familiar with is n-gram models vs. parse trees for language modelling; the most successful language models (in terms of having the best log-score on predicting the next word given a sequence of previous words) essentially treat language as a high-order Markov chain or hidden Markov model, despite the fact that linguistic theory predicts that language should be tree-structured rather than linearly-structured. Indeed, NLP researchers have tried building language models that assume language is tree-structured, and these models perform worse, or at least do not seem to have been adopted in practice (this is true both for older discrete models and newer continuous models based on neural nets).  It’s plausible that a similar issue will occur in inverse reinforcement learning, where correctly inferring plans is not enough to win out in predictive performance. The reason for the two issues might be quite similar (in language modelling, the tree structure only wins out in statistically uncommon corner cases involving long-term and/or nested dependencies, and hence getting that part of the prediction correct doesn’t help predictive accuracy much).

The overall point is: in the case of even slight model mis-specification, the “correct” model might actually perform worse under typical metrics such as predictive accuracy. Therefore, more careful methods of constructing a model might be necessary.

### Learning Values != Robustly Predicting Human Behaviour

The problems with IRL described so far will result in poor performance for predicting human choices out-of-sample. For example, if someone is observed doing boring tasks for 99 days (where they only achieve the goal on Day 100), they’ll be predicted to continue doing boring tasks even when a short-cut to the goal becomes available. So even if the goal is simply to predict human behaviour (not to infer human values), mis-specification leads to bad predictions on realistic out-of-sample scenarios.

Let’s suppose that our goal is not to predict human behaviour but to create AI systems that promote and respect human values. These goals (predicting humans and building safe AI) are distinct. Here’s an example that illustrates the difference. Consider a long-term smoker, Bob, who would continue smoking even if there were (counterfactually) a universally effective anti-smoking treatment. Maybe Bob is in denial about the health effects of smoking or Bob thinks he’ll inevitably go back to smoking whatever happens. If an AI system were assisting Bob, we might expect it to avoid promoting his smoking habit (e.g. by not offering him cigarettes at random moments). This is not paternalism, where the AI system imposes someone else’s values on Bob. The point is that even if Bob would continue smoking across many counterfactual scenarios this doesn’t mean that he places value on smoking.

How do we choose between the theory that Bob values smoking and the theory that he does not (but smokes anyway because of the powerful addiction)? Humans choose between these theories based on our experience with addictive behaviours and our insights into people’s preferences and values. This kind of insight can’t easily be captured as formal assumptions about a model, or even as a criterion about counterfactual generalization. (The theory that Bob values smoking does make accurate predictions across a wide range of counterfactuals.) Because of this, learning human values from IRL has a more profound kind of model mis-specification than the examples in Jacob’s previous post. Even in the limit of data generated from an infinite series of random counterfactual scenarios, standard IRL algorithms would not infer someone’s true values.

Predicting human actions is neither necessary nor sufficient for learning human values. In what ways, then, are the two related? One such way stems from the premise that if someone spends more resources making a decision, the resulting decision tends to be more in keeping with their true values. For instance, someone might spend lots of time thinking about the decision, they might consult experts, or they might try out the different options in a trial period before they make the real decision. Various authors have thus suggested that people’s choices under sufficient “reflection” act as a reliable indicator of their true values. Under this view, predicting a certain kind of behaviour (choices under reflection) is sufficient for learning human values. Paul Christiano has written about some proposals for doing this, though we will not discuss them here (the first link is for general AI systems while the second is for newsfeeds). In general, turning these ideas into algorithms that are tractable and learn safely remains a challenging problem.

There is research on doing IRL for agents in POMDPs. Owain and collaborators explored the effects of limited information and cognitive biases on IRL: paper, paper, online book.

For many environments it will not be possible to identify the reward function from the observed trajectories. These identification problems are related to the mis-specification problems but are not the same thing. Active learning can help with identification (paper).

Paul Christiano raised many similar points about mis-specification in a post on his blog.

For a big-picture monograph on relations between human preferences, economic utility theory and welfare/well-being, see Hausman’s “Preference, Value, Choice and Welfare”.

### Acknowledgments

Thanks to Sindy Li for reviewing a full draft of this post and providing many helpful comments. Thanks also to Michael Webb and Paul Christiano for doing the same on specific sections of the post.

## Prékopa–Leindler inequality

Consider the following statements:

1. The shape with the largest volume enclosed by a given surface area is the $n$-dimensional sphere.
2. A marginal or sum of log-concave distributions is log-concave.
3. Any Lipschitz function of a standard $n$-dimensional Gaussian distribution concentrates around its mean.

What do these all have in common? Despite being fairly non-trivial and deep results, they all can be proved in less than half of a page using the Prékopa–Leindler inequality.

(I won’t show this here, or give formal versions of the statements above, but time permitting I will do so in a later blog post.)

## Two Strange Facts

Here are two strange facts about matrices, which I can prove but not in a satisfying way.

1. If $A$ and $B$ are symmetric matrices satisfying $0 \preceq A \preceq B$, then $A^{1/2} \preceq B^{1/2}$, and $B^{-1} \preceq A^{-1}$, but it is NOT necessarily the case that $A^2 \preceq B^2$. Is there a nice way to see why the first two properties should hold but not necessarily the third? In general, do we have $A^p \preceq B^p$ if $p \in [0,1]$?
2. Given a rectangular matrix $W \in \mathbb{R}^{n \times d}$, and a set $S \subseteq [n]$, let $W_S$ be the submatrix of $W$ with rows in $S$, and let $\|W_S\|_*$ denote the nuclear norm (sum of singular values) of $W_S$. Then the function $f(S) = \|W_S\|_*$ is submodular, meaning that $f(S \cup T) + f(S \cap T) \leq f(S) + f(T)$ for all sets $S, T$. In fact, this is true if we take $f_p(S)$, defined as the sum of the $p$th powers of the singular values of $W_S$, for any $p \in [0,2]$. The only proof I know involves trigonometric integrals and seems completely unmotivated to me. Is there any clean way of seeing why this should be true?

If anyone has insight into either of these, I’d be very interested!

## Difficulty of Predicting the Maximum of Gaussians

Suppose that we have a random variable $X \in \mathbb{R}^d$, such that $\mathbb{E}[XX^{\top}] = I_{d \times d}$. Now take k independent Gaussian random variables $Z_1, \ldots, Z_k \sim \mathcal{N}(0, I_{d \times d})$, and let J be the argmax (over j in 1, …, k) of $Z_j^{\top}X$.

It seems that it should be very hard to predict J well, in the following sense: for any function $q(j \mid x)$, the expectation of $\mathbb{E}_{x}[q(J \mid x)]$, should with high probability be very close to $\frac{1}{k}$ (where the second probability is taken over the randomness in $Z$). In fact, Alex Zhai and I think that the probability of the expectation exceeding $\frac{1}{k}$ should be at most $\exp(-C(\epsilon/k)^2d)$ for some constant C. (We can already show this to be true where we replace $(\epsilon/k)^2$ with $(\epsilon/k)^4$.) I will not sketch a proof here but the idea is pretty cool, it basically uses Lipschitz concentration of Gaussian random variables.

I’m mainly posting this problem because I think it’s pretty interesting, in case anyone else is inspired to work on it. It is closely related to the covering number of exponential families under the KL divergence, where we are interested in coverings at relatively large radii ($\log(k) - \epsilon$ rather than $\epsilon$).

## Verifying Stability of Stochastic Systems

I just finished presenting my recent paper on stochastic verification at RSS 2011. There is a conference version online, with a journal article to come later. In this post I want to go over the problem statement and my solution.

Problem Statement

Abstractly, the goal is to be given some sort of description of a system, and of a goal for that system, and then verify that the system will reach that goal. The difference between our work and a lot (but not all) of the previous work is that we want to work with an explicit noise model for the system. So, for instance, I tell you that the system satisfies

$dx(t) = f(x) dt + g(x) dw(t),$

where $f(x)$ represents the nominal dynamics of the system, $g(x)$ represents how noise enters the system, and $dw(t)$ is a standard Wiener process (the continuous-time version of Gaussian noise). I would like to, for instance, verify that $h(x(T)) < 0$ for some function $h$ and some final time $T$. For example, if $x$ is one-dimensional then I could ask that $x(10)^2-1 < 0$, which is asking for $x$ to be within a distance of $1$ of the origin at time $10$. For now, I will focus on time-invariant systems and stability conditions. This means that $f$ and $g$ are not functions of $t$, and the condition we want to verify is that $h(x(t)) < 0$ for all $t \in [0,T]$. However, it is not too difficult to extend these ideas to the time-varying case, as I will show in the results at the end.

The tool we will use for our task is a supermartingale, which allows us to prove bounds on the probability that a system leaves a certain region.

Supermartingales

Let us suppose that I have a non-negative function $V$ of my state $x$ such that $\mathbb{E}[\dot{V}(x(t))] \leq c$ for all $x$ and $t$. Here we define $\mathbb{E}[\dot{V}(x(t))]$ as

$\lim\limits_{\Delta t \to 0^+} \frac{\mathbb{E}[V(x(t+\Delta t)) \mid x(t)]-V(x(t))}{\Delta t}.$

Then, just by integrating, we can see that $\mathbb{E}[V(x(t)) \mid x(0)] \leq V(x(0))+ct$. By Markov’s inequality, the probability that $V(x(t)) \geq \rho$ is at most $\frac{V(x(0))+ct}{\rho}$.

We can actually prove something stronger as follows: note that if we re-define our Markov process to stop evolving as soon as $V(x(t)) = \rho$, then this only sets $\mathbb{E}[\dot{V}]$ to zero in certain places. Thus the probability that $V(x(t)) \geq \rho$ for this new process is at most $\frac{V(x(0))+\max(c,0)t}{\rho}$. Since the process stops as soon as $V(x) \geq \rho$, we obtain the stronger result that the probability that $V(x(s)) \geq \rho$ for any $s \in [0,t]$ is at most $\frac{V(x(0))+\max(c,0)t}{\rho}$. Finally, we only need the condition $\mathbb{E}[\dot{V}] \leq c$ to hold when $V(x) < \rho$. We thus obtain the following:

Theorem. Let $V(x)$ be a non-negative function such that $\mathbb{E}[\dot{V}(x(t))] \leq c$ whenever $V(x) < \rho$. Then with probability at least $1-\frac{V(x(0))+\max(c,0)T}{\rho}$, $V(x(t)) < \rho$ for all $t \in [0,T)$.

We call the condition $\mathbb{E}[\dot{V}] \leq c$ the supermartingale condition, and a function that satisfies the martingale condition is called a supermartingale. If we can construct supermartingales for our system, then we can bound the probability that trajectories of the system leave a given region.

NOTE: for most people, a supermartingale is something that satisfies the condition $\mathbb{E}[\dot{V}] \leq 0$. However, this condition is often impossible to satisfy for systems we might care about. For instance, just consider exponential decay driven by Gaussian noise:

Once the system gets close enough to the origin, the exponential decay ceases to matter much and the system is basically just getting bounced around by the Gaussian noise. In particular, if the system is ever at the origin, it will get perturbed away again, so you cannot hope to find a non-constant function of $x$ that is decreasing in expectation everywhere (just consider the global minimum of such a function: in all cases, there is a non-zero probability that the Gaussian noise will cause $V(x)$ to increase, but a zero probability that $V(x)$ will decrease because we are already at the global minimum this argument doesn’t actually work, but I am pretty sure that my claim is true at least subject to sufficient technical conditions).

Applying the Martingale Theorem

Now that we have this theorem, we need some way to actually use it. First, let us try to get a more explicit version of the Martingale condition for the systems we are considering, which you will recall are of the form $dx(t) = f(x) dt + g(x) dw(t)$. Note that $V(x+\Delta x) = V(x) + \frac{\partial V}{\partial x} \Delta x + \frac{1}{2}Trace\left(\Delta x^T \frac{\partial^2 V}{\partial x^2}\Delta x\right)+O(\Delta x^3)$.

Then $\mathbb{E}[\dot{V}(x)] = \lim_{\Delta t \to 0^+} \frac{\frac{\partial V}{\partial x} \mathbb{E}[\Delta x]+\frac{1}{2}Trace\left(\mathbb{E}[\Delta x^T \frac{\partial^2 V}{\partial x^2}\Delta x\right)+O(\Delta x^3)}{\Delta t}$. A Wiener process satisfies $\mathbb{E}[dw(t)] = 0$ and $\mathbb{E}[dw(t)^2] = dt$, so only the nominal dynamics ($f(x)$) affect the limit of the first-order term while only the noise ($g(x)$) affects the limit of the second-order term (the third-order and higher terms in $\Delta x$ all go to zero). We thus end up with the formula

$\mathbb{E}[\dot{V}(x)] = \frac{\partial V}{\partial x}f(x)+\frac{1}{2}Trace\left(g(x)^T\frac{\partial^2 V}{\partial x^2}g(x)\right).$

It is not that difficult to construct a supermartingale, but most supermartingales that you construct will yield a pretty poor bound. To illustrate this, consider the system $dx(t) = -x dt + dw(t)$. This is the example in the image from the previous section. Now consider a quadratic function $V(x) = x^2$. The preceding formula tells us that $\mathbb{E}[\dot{V}(x)] = -2x^2+1$. We thus have $\mathbb{E}[\dot{V}(x)] \leq 1$, which means that the probability of leaving the region $x^2 < \rho$ is at most $\frac{x(0)^2+T}{\rho}$. This is not particularly impressive: it says that we should expect $x$ to grow roughly as $\sqrt{T}$, which is how quickly $x$ would grow if it was a random walk with no stabilizing component at all.

One way to deal with this is to have a state-dependent bound $\mathbb{E}[\dot{V}] \leq c-kV$. This has been considered for instance by Pham, Tabareau, and Slotine (see Lemma 2 and Theorem 2), but I am not sure whether their results still work if the supermartingale condition only holds locally instead of globally; I haven’t spent much time on this, so they could generalize quite trivially.

Another way to deal with this is to pick a more quickly-growing candidate supermartingale. For instance, we could pick $V(x) = x^4$. Then $\mathbb{E}[\dot{V}] = -4x^4+6x^2$, which has a global maximum of $\frac{9}{4}$ at $x = \frac{\sqrt{3}}{2}$. This bounds then says that $x$ grows at a rate of at most $T^{\frac{1}{4}}$, which is better than before, but still much worse than reality.

We could keep improving on this bound by considering successively faster-growing polynomials. However, automating such a process becomes expensive once the degree of the polynomial gets large. Instead, let’s consider a function like $V(x) = e^{0.5x^2}$. Then $\mathbb{E}[\dot{V}] = e^{0.5x^2}(0.5-0.5x^2)$, which has a maximum of 0.5 at x=0. Now our bound says that we should expect x to grow like $\sqrt{\log(T)}$, which is a much better growth rate (and roughly the true growth rate, at least in terms of the largest value of $|x|$ over the time interval $[0,T]$).

This leads us to our overall strategy for finding good supermartingales. We will search across functions of the form $V(x) = e^{x^TSx}$ where $S \succeq 0$ is a matrix (the $\succeq$ means “positive semidefinite”, which roughly means that the graph of the function $x^TSx$ looks like a bowl rather than a saddle/hyperbola). This begs two questions: how to upper-bound the global maximum of $\mathbb{E}[\dot{V}]$ for this family, and how to search efficiently over this family. The former is done by doing some careful work with inequalities, while the latter is done with semidefinite programming. I will explain both below.

Upper-bounding $\mathbb{E}[\dot{V}]$

In general, if $V(x) = e^{x^TSx}$, then $\mathbb{E}[\dot{V}(x)] = e^{x^TSx}\left(2x^TSf(x)+Trace(g(x)^TSg(x))+2x^TSg(x)g(x)^TSx\right)$. We would like to show that such a function is upper-bounded by a constant $c$. To do this, move the exponential term to the right-hand-side to get the equivalent condition $2x^TSf(x)+Trace(g(x)^TSg(x))+2x^TSg(x)g(x)^TSx \leq ce^{-x^TSx}$. Then we can lower-bound $e^{-x^TSx}$ by $1-x^TSx$ and obtain the sufficient condition

$c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))-2x^TSg(x)g(x)^TSx \geq 0.$

It is still not immediately clear how to check such a condition, but somehow the fact that this new condition only involves polynomials (assuming that f and g are polynomials) seems like it should make computations more tractable. This is indeed the case. While checking if a polynomial is positive is NP-hard, checking whether it is a sum of squares of other polynomials can be done in polynomial time. While sum of squares is not the same as positive, it is a sufficient condition (since the square of a real number is always positive).

The way we check whether a polynomial p(x) is a sum of squares is to formulate it as the semidefinite program: $p(x) = z^TMz, M \succeq 0$, where $z$ is a vector of monomials. The condition $p(x) = z^TMz$ is a set of affine constraints on the entries of $M$, so that the above program is indeed semidefinite and can be solved efficiently.

Efficiently searching across all matrices S

We can extend on the sum-of-squares idea in the previous section to search over $S$. Note that if $p$ is a parameterized polynomial whose coefficient are affine in a set of decision variables, then the condition $p(x) = z^TMz$ is again a set of affine constraints on $M$. This almost solves our problem for us, but not quite. The issue is the form of $p(x)$ in our case:

$c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))-2x^TSg(x)g(x)^TSx$

Do you see the problem? There are two places where the constraints do not appear linearly in the decision variables: $c$ and $S$ multiply each other in the first term, and $S$ appears quadratically in the last term. While the first non-linearity is not so bad ($c$ is a scalar so it is relatively cheap to search over $c$ exhaustively), the second non-linearity is more serious. Fortunately, we can resolve the issue with Schur complements. The idea behind Schur complements is that, assuming $A \succeq 0$, the condition $B^TA^{-1}B \preceq C$ is equivalent to $\left[ \begin{array}{cc} A & B \\ B^T & C \end{array} \right] \succeq 0$. In our case, this means that our condition is equivalent to the condition that

$\left[ \begin{array}{cc} 0.5I & g^TSx \\ x^TSg & c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))\end{array} \right] \succeq 0$

where $I$ is the identity matrix. Now we have a condition that is linear in the decision variable $S$, but it is no longer a polynomial condition, it is a condition that a matrix polynomial be positive semidefinite. Fortunately, we can reduce this to a purely polynomial condition by creating a set of dummy variables $y$ and asking that

$y^T\left[ \begin{array}{cc} 0.5I & g^TSx \\ x^TSg & c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))\end{array} \right]y \geq 0$

We can then do a line search over $c$ and solve a semidefinite program to determine a feasible value of $S$. If we care about remaining within a specific region, we can maximize $\rho$ such that $x^TSx < \rho$ implies that we stay in the region. Since our bound on the probability of leaving the grows roughly as $ce^{-\rho}T$, this is a pretty reasonable thing to maximize (we would actually want to maximize $\rho-\log(c)$, but this is a bit more difficult to do).

Oftentimes, for instance if we are verifying stability around a trajectory, we would like $c$ to be time-varying. In this case an exhaustive search is no longer feasible. Instead we alternate between searching over $S$ and searching over $c$. In the step where we search over $S$, we maximize $\rho$. In the step where we search over $c$, we maximize the amount by which we could change $c$ and still satisfy the constraints (the easiest way to do this is by first maximizing $c$, then minimizing $c$, then taking the average; the fact that semidefinite constraints are convex implies that this optimizes the margin on $c$ for a fixed $S$).

A final note is that systems are often only stable locally, and so we only want to check the constraint $\mathbb{E}[\dot{V}] \leq c$ in a region where $V(x) < \rho$. We can do this by adding a Lagrange multiplier to our constraints. For instance, if we want to check that $p(x) \geq 0$ whenever $(x) \leq 0$, it suffices to find a polynomial $\lambda(x)$ such that $\lambda(x) \geq 0$ and $p(x)+\lambda(x)s(x) \geq 0$. (You should convince yourself that this is true; the easiest proof is just by casework on the sign of $s(x)$.) This again introduces a non-linearity in the constraints, but if we fix $S$ and $\rho$ then the constraints are linear in $c$ and $\lambda$, and vice-versa, so we can perform the same alternating maximization as before.

Results

Below is the most exciting result, it is for an airplane with a noisy camera trying to avoid obstacles. Using the verification methods above, we can show that with probability at least $0.99$ that the plane trajectory will not leave the gray region:

## Least Squares and Fourier Analysis

I ended my last post on a somewhat dire note, claiming that least squares can do pretty terribly when fitting data. It turns out that things aren’t quite as bad as I thought, but most likely worse than you would expect.

The theme of this post is going to be things you use all the time (or at least, would use all the time if you were an electrical engineer), but probably haven’t ever thought deeply about. I’m going to include a combination of mathematical proofs and matlab demonstrations, so there should hopefully be something here for everyone.

My first topic is going to be, as promised, least squares curve fitting. I’ll start by talking about situations when it can fail, and also about situations when it is “optimal” in some well-defined sense. To do that, I’ll have to use some Fourier analysis, which will present a good opportunity to go over when frequency-domain methods can be very useful, when they can fail, and what you can try to do when they fail.

When Least Squares Fails

To start, I’m going to do a simple matlab experiment. I encourage you to follow along if you have matlab (if you have MIT certificates you can get it for free at http://matlab.mit.edu/).

Let’s pretend we have some simple discrete-time process, y(n+1) = a*y(n) + b*u(n), where y is the variable we care about and u is some input signal. We’ll pick a = 0.8, b = 1.0 for our purposes, and u is chosen to be a discrete version of a random walk. The code below generates the y signal, then uses least squares to recover a and b. (I recommend taking advantage of cell mode if you’re typing this in yourself.)

%% generate data

a = 0.8; b = 1.0;

N = 1000;

ntape = 1:N; y = zeros(N,1); u = zeros(N-1,1);

for n=1:N-2

if rand < 0.02

u(n+1) = 1-u(n);

else

u(n+1) = u(n);

end

end

for n=1:N-1

y(n+1) = a*y(n)+b*u(n);

end

plot(ntape,y);

%% least squares fit (map y(n) and u(n) to y(n+1))

A = [y(1:end-1) u]; b = y(2:end);

params = A\b;

afit = params(1)

bfit = params(2)

The results are hardly surprising (you get afit = 0.8, bfit = 1.0). For the benefit of those without matlab, here is a plot of y against n:

Now let’s add some noise to the signal. The code below generates noise whose size is about 6% of the size of the data (in the sense of L2 norm).

%%

yn = y + 0.1*randn(N,1); % gaussian noise with standard deviation 0.2

A = [yn(1:end-1) u]; b = yn(2:end);

params = A\b;

afit = params(1)

bfit = params(2)

This time the results are much worse: afit = 0.7748, bfit = 1.1135. You might be tempted to say that this isn’t so much worse than we might expect — the accuracy of our parameters is roughly the accuracy of our data. The problem is that, if you keep running the code above (which will generate new noise each time), you will always end up with afit close to 0.77 and bfit close to 1.15. In other words, the parameters are systematically biased by the noise. Also, we should expect our accuracy to increase with more samples, but that isn’t the case here. If we change N to 100,000, we get afit = 0.7716, bfit = 1.1298. More samples will decrease the standard deviation of our answer (running the code multiple times will yield increasingly similar results), but not necessarily its correctness.

A more dire way of thinking about this is that increasing the number of samples will increase how “certain” we are of our answer, but it won’t change the fact that our answer is wrong. So we will end up being quite certain of an incorrect answer.

Why does this happen? It turns out that when we use least squares, we are making certain assumptions about the structure of our noise, and those assumptions don’t hold in the example above. In particular, in a model like the one above, least squares assumes that all noise is process noise, meaning that noise at one step gets propagated to future steps. Such noise might come from a system with unmodelled friction or some external physical disturbance. In contrast, the noise we have is output noise, meaning that the reading of our signal is slightly off. What the above example shows is that a model constructed via least squares will be systematically biased by output noise.

That’s the intuition, now let’s get into the math.When we do least squares, we are trying to solve some equation Ax=b for x, where A and b are both noisy. So we really have something like A+An and b+bn, where An and bn are the noise on A and b.

Before we continue, I think it’s best to stop and think about what we really want. So what is it that we actually want? We observe a bunch of data as input, and some more data as output. We would like a way of predicting, given the input, what the output should be. In this sense, then, the distinction between “input noise” (An) and “output noise” bn is meaningless, as we don’t get to see either and all they do is cause b to not be exactly Ax. (If we start with assumptions on what noise “looks like”, then distinguishing between different sources of noise turns out to be actually useful. More on that later.)

If the above paragraph isn’t satisfying, then we can use the more algebraic explanation that the noise An and bn induces a single random variable on the relationship between observed input and observed output. In fact, if we let A’=A+An then we end up fitting $A'x = b+(b_n-A_nx)$, so we can just define $e = b_n-A_nx$ and have a single noise term.

Now, back to least squares. Least squares tries to minimize $\|Ax-b\|_2^2$, that is, the squared error in the $l^2$ norm. If we instead have a noisy $b$, then we are trying to minimize $\|Ax-b-e\|_2^2$, which will happen when $x$ satisfies $A^TAx = A^T(b+e)$.

If there actually exists an $\hat{x}$ such that $A\hat{x} = b$ (which is what we are positing, subject to some error term), then minimizing $Ax-b$ is achieved by setting $x$ to $\hat{x}$. Note that $\hat{x}$ is what we would like to recover. So $\hat{x}$ would be the solution to $A^TAx = A^Tb$, and thus we see that an error $e$ introduces a linear error in our estimate of $x$. (To be precise, the error affects our answer for $x$ via the operator $(A^TA)^{-1}A^T$.)

Now all this can be seen relatively easily by just using the standard formula for the solution to least squares: $x = (A^TA)^{-1}A^Tb$. But I find that it is easy to get confused about what exactly the “true” answer is when you are fitting data, so I wanted to go through each step carefully.

At any rate, we have a formula for how the error $e$ affects our estimate of $x$, now I think there are two major questions to answer:

(1) In what way can $e$ systematically bias our estimate for $x$?

(2) What can we say about the variance on our estimate for $x$?

To calculate the bias on $x$, we need to calculate $\mathbb{E}((A^TA)^{-1}A^Te)$, where $\mathbb{E}$ stands for expected value. Since $(A^TA)^{-1}$ is invertible, this is the same as $(A^TA)^{-1}\mathbb{E}(A^Te)$. In particular, we will get an unbiased estimate exactly when $A$ and $e$ are uncorrelated. Most importantly, when we have noise on our inputs then $A$ and $e$ will (probably) be correlated, and we won’t get an unbiased result.

How bad is the bias? Well, if A actually has a noise component (i.e. $A=A_0+A_n$), and e is $b_n-A_nx$, and we assume that our noise is uncorrelated with the constant matrix $A$, then we get a correlation matrix equal to $A_n^T(b_n-A_nx)$, which, assuming that $A_n$ and $b_n$ are uncorrelated, gives us $-A_n^TA_nx$. The overall bias then comes out to $-(A^TA)^{-1}\mathbb{E}(A_n^TA_n)x$.

I unfortunately don’t have as nice of an expression for the variance, although you can of course calculate it in terms of $A, b, x, A_n$, and $b_n$.

At any rate, if noise doesn’t show up in the input, and the noise that does show up is uncorrelated with the input, then we should end up with no bias. But if either of those things is true, we will end up with bias. When modelling a dynamical system, input noise corresponds to measurement noise (your sensors are imperfect), while output noise corresponds to process noise (the system doesn’t behave exactly as expected).

One way we can see how noise being correlated with $A$ can lead to bias is if our “noise” is actually an unmodelled quadratic term. Imagine trying to fit a line to a parabola. You won’t actually fit the tangent line to the parabola, instead you’ll probably end up fitting something that looks like a secant. However, the exact slope of the line you pick will depend pretty strongly on the distribution of points you sample along the parabola. Depending on what you want the linear model for, this could either be fine (as long as you sample a distribution of points that matches the distribution of situations that you think you’ll end up using the model for), or very annoying (if you really wanted the tangent).

If you’re actually just dealing with a parabola, then you can still get the tangent by sampling symmetrically about the point you care about, but once you get to a cubic this is no longer the case.

As a final note, one reasonable way (although I’m not convinced it’s the best, or even a particularly robust way) of determining if a linear fit of your data is likely to return something meaningful is to look at the condition number of your matrix, which can be computed in matlab using the cond function and can also be realized as the square root of the ratio of the largest to the smallest eigenvalue of $A^TA$. Note that the condition number says nothing about whether your data has a reasonable linear fit (it can’t, since it doesn’t take $b$ into account). Rather, it is a measure of how well-defined the coefficients of such a fit would be. In particular, it will be large if your data is close to lying on a lower-dimensional subspace (which can end up really screwing up your fit). In this case, you either need to collect better data or figure out why your data lies on a lower-dimensional subspace (it could be that there is some additional structure to your system that you didn’t think about; see point (3) below about a system that is heavily damped).

I originally wanted to write down a lot more about specific ways that noise can come into the picture, but I haven’t worked it all out myself, and it’s probably too ambitious a project for a single blog post anyways. So instead I’m going to leave you with a bunch of things to think about. I know the answers to some of these, for others I have ideas, and for others I’m still trying to work out a good answer.

1) Can anything be done to deal with measurement noise? In particular, can anything be done to deal with the sort of noise that comes from encoders (i.e., a discretization of the signal)?

2) Is there a good way of measuring when noise will be problematic to our fit?

3) How can we fit models to systems that evolve on multiple time-scales? For example, an extremely damped system such as

$\dot{x}_1 = x_2$

$\dot{x}_2 = -cx_1-Mx_2$

where M >> c. You could take, for example, $M = 20$, $c = 1$, in which case the system  behaves almost identically to the system

$\dot{x}_1 = \frac{-M + \sqrt{M^2-4c}}{2} x_1$

with $x_2$ set to the derivative of $x_1$. Then the data will all lie almost on a line, which can end up screwing up your fit. So in what exact ways can your fit get screwed up, and what can be done to deal with it? (This is essentially the problem that I’m working on right now.)

4) Is there a way to defend against non-linearities in a system messing up our fit? Can we figure out when these non-linearities occur, and to what extent?

5) What problems might arise when we try to fit a system that is unstable or only slightly stable, and what is a good strategy for modelling such a system?

When Least Squares Works

Now that I’ve convinced you that least squares can run into problems, let’s talk about when it can do well.

As Paul Christiano pointed out to me, when you have some system where you can actually give it inputs and measure the outputs, least squares is likely to do a fairly good job. This is because you can (in principle) draw the data you use to fit your model from the same distribution as you expect to encounter when the model is used in practice. However, you will still run into the problem that failure to measure the input accurately introduces biases. And no, these biases can’t be eradicated completely by averaging the result across many samples, because the bias is always a negative definite matrix applied to $x$ (the parameters we are trying to find), and any convex combination of negative definite matrices will remain negative definite.

Intuitively, what this says is that if you can’t trust your input, then you shouldn’t rely on it strongly as a predictor. Unfortunately, the only way that a linear model knows how to trust something less is by making the coefficient on that quantity “smaller” in some sense (in the negative definite sense here). So really the issue is that least squares is too “dumb” to deal with the issue of measurement error on the input.

But I said that I’d give examples of when least squares works, and here I am telling you more about why it fails. One powerful and unexpected aspect of least squares is that it can fit a wide variety of non-linear models. For example, if we have a system $y = c_1+c_2x+c_3x^2+c_4\cos(x)$, then we just form a matrix $A = \left[ \begin{array}{cccc} 1 & x & x^2 & \cos(x) \end{array} \right]$ and $b = y$, where for example $\cos(x)$ is actually a column vector where the $i$th row is the cosine of the $i$th piece of input data. This will often be the case in physical systems, and I think is always the case for systems solved via Newton’s laws (although you might have to consolide parameters, for example fitting both $mgl$ and $ml^2$ in the case of a pendulum). This isn’t necessarily the case for reduced models of complicated systems, for example the sort of models used for fluid dynamics. However, I think that the fact that linear fitting techniques can be applied to such a rich class of systems is quite amazing.

There is also a place where least squares not only works but is in some sense optimal: detecting the frequency response of a system. Actually, it is only optimal in certain situations, but even outside of those situations it has many advantages over a standard discrete Fourier transform. To get into the applications of least squares here, I’m going to have to take a detour into Fourier analysis.

Fourier Analysis

If you already know Fourier analysis, you can probably skip most of this section (although I recommend reading the last two paragraphs).

Suppose that we have a sequence of $N$ signals at equally spaced points in time. Call this sequence $x_1$, $x_2$, $\ldots$, $x_n$. We can think of this as a function $f : \{0,1,\ldots,N-1\} \to \mathbb{R}$, or, more accurately, $f : \{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\} \to \mathbb{R}$. For reasons that will become apparent later, we will actually think of this as a function $f : \{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\} \to \mathbb{C}$.

This function is part of the vector space of all functions from $\{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\}$ to $\mathbb{C}$. One can show that the functions on $\{0,\Delta t,\ldots,(N-1)\Delta t\}$ defined by

$f_k(x) = e^{\frac{2\pi i k x}{N \Delta t}},$

with $k$ ranging from $0$ to $N-1$, are all orthogonal to each other, and thus form a basis for the space of all functions from $\{0,\Delta t,2\Delta t,\ldots,(N-1)\Delta y\}$ to $\mathbb{C}$ (now it is important to use $\mathbb{C}$ since the $f_k$ take on complex values). It follows that our function $f$ can be written uniquely in the form $f(x) = \sum_{k=0}^{N-1} c_kf_k(x)$, where the $c_k$ are constants. Now because of this we can associate with each $f$ a function $\hat{f} : \{0,\frac{2 \pi}{N \Delta t},\frac{4\pi}{N\Delta t},\ldots,\frac{(N-1)\pi}{N\delta t}\} \to \mathbb{C}$ given by $\hat{f}(\frac{2\pi k}{N \Delta t}) := c_k$.

An intuitive way of thinking about this is that any function can be uniquely decomposed as a superposition of complex exponential functions at different frequencies. The function $\hat{f}$ is a measure of the component of $f$ at each of these frequencies. We refer to $\hat{f}$ as the Fourier transform of $f$.

While there’s a lot more that could be said on this, and I’m tempted to re-summarize all of the major results in Fourier analysis, I’m going to refrain from doing so because there are plenty of texts on it and you can probably get the relevant information (such as how to compute the Fourier coefficients, the inverse Fourier transform, etc.) from those. In fact, you could start by checking out Wikipedia’s article. It is also worth noting that the Fourier transform can be computed in $O(N\log N)$ time using any one of many “fast Fourier transform” algorithms (fft in matlab).

I will, however, draw your attention to the fact that if we start with information about $f$ at times $\{0,\Delta t,\ldots, (N-1)\Delta t\}$, then we end up with frequency information at the frequencies $\{0,\frac{2\pi}{N\Delta t},\ldots,\frac{2\pi(N-1)}{N\Delta t}\}$. Also, you should really think of the frequencies as wrapping around cyclically (frequencies that differ from each other by a multiple of $\frac{2\pi}{\Delta t}$ are indistinguishable on the interval we sampled over), and also if $f$ is real-valued then $\hat(f)(-\omega) = \overline{\hat{f}(\omega)}$, where the bar means complex conjugate and $-\omega$ is, as just noted, the same as $\frac{2\pi}{\Delta t}-\omega$.

A final note before continuing is that we could have decomposed $f$ into a set of almost any $N$ frequencies (as long as they were still linearly independent), although we can’t necessarily do so in $O(N\log N)$ time. We will focus on the set of frequencies obtained by a Fourier transform for now.

When Fourier Analysis Fails

The goal of taking a Fourier transform is generally to decompose a signal into component frequencies, under the assumption that the signal itself was generated by some “true” superposition of frequencies. This “true” superposition would best be defined as the frequency spectrum we would get if we had an infinitely long continuous tape of noise-free measurements and then took the continuous Fourier transform.

I’ve already indicated one case in which Fourier analysis can fail, and this is given by the fact that the Fourier transform can’t distinguish between frequencies that are separated from each other by multiples of $\frac{2\pi}{\Delta t}$. In fact, what happens in general is that you run into problems when your signal contains frequencies that move faster than your sampling rate. The rule of thumb is that your signal should contain no significant frequency content above the Nyquist rate, which is half the sampling frequency. One way to think of this is that the “larger” half of our frequencies (i.e. $\frac{\pi}{\Delta t}$ up through $\frac{2\pi}{\Delta t}$) are really just the negatives of the smaller half of our frequencies, and so we can measure frequencies up to roughly $\frac{\pi}{\Delta t}$ before different frequencies start to run into each other.

The general phenomenon that goes on here is known as aliasing, and is the same sort of effect as what happens when you spin a bicycle wheel really fast and it appears to be moving backwards instead of forwards. The issue is that your eye only samples at a given rate and so rotations at speeds faster than that appear the same to you as backwards motion. See also this image from Wikipedia and the section in the aliasing article about sampling sinusoidal functions.

The take-away message here is that you need to sample fast enough to capture all of the actual motion in your data, and the way you solve aliasing issues is by increasing the sample rate.

A trickier problem is the “windowing” problem, also known as spectral leakage. [Note: I really recommend reading the linked wikipedia article at some point, as it is a very well-written and insightful treatment of this issue.] The problem can be summarized intuitively as follows: nearby frequencies will “bleed into” each other, and the easiest way to reduce this phenomenon is to increase your sample time. Another intuitive statement to this effect is that the extent to which you can distinguish between two nearby frequencies is roughly proportional to the number of full periods that you observe of their difference frequency. I will make both of these statements precise below. First, though, let me convince you that spectral leakage is relevant by showing you what the Fourier transform of a periodic signal looks like when the period doesn’t fit into the sampling window. The first image below is a plot of y=cos(t), and the second is a snapshot of part of the Fourier transform (blue is real part, green is imaginary part). Note that the plot linearly interpolates between sample points. Also note that the sampling frequency was 100Hz, although that is almost completely irrelevant.

The actual frequency content should be a single spike at $\omega = 1$, so windowing can in fact cause non-trivial issues with your data.

Now let’s get down to the actual analytical reason for the windowing / spectral leakage issue. Recall the formula for the Fourier transform: $\hat{f}(\omega) = \frac{1}{N} \sum_{t} f(t)e^{-i\omega t}$. Now suppose that $f$ is a complex exponential with some frequency $\omega'$, i.e. $f(t) = e^{i\omega' t}$. Then some algebra will yield the formula

$\hat{f}(\omega) = \frac{1}{N} \frac{e^{i(\omega'-\omega)N\Delta t}-1}{e^{i(\omega'-\omega)\Delta t}-1}$,

which tells us the extent to which a signal at a frequency of $\omega'$ will incorrectly contribute to the estimate of the frequency content at $\omega$. The main thing to note here is that larger values of $N$ will cause this function to become more concentrated horizontally, which means that, in general (although not necessarily at a given point), it will become smaller. At the same time, if you change the sampling rate without changing the total sampling time then you won’t significantly affect the function. This means that the easiest way to decrease windowing is to increase the amount of time that you sample your signal, but that sampling more often will not help you at all.

Another point is that spectral leakage is generically roughly proportional to the inverse of the distance between the two frequencies (although it goes to zero when the difference in frequencies is close to a multiple of $\frac{2\pi}{N\Delta t}$), which quantifies the earlier statement about the extent to which two frequencies can be separated from each other.

Some other issues to keep in mind: the Fourier transform won’t do a good job with quasi-periodic data (data that is roughly periodic with a slowly-moving phase shift), and there is also no guarantee that your data will have good structure in the frequency domain. It just happens that this is in theory the case for analytic systems with a periodic excitation (see note (1) in the last section of this post — “Answers to Selected Exercises” — for a more detailed explanation).

When Fourier Analysis Succeeds

Despite issues with aliasing and spectral leakage, there are some strong points to the Fourier transform. The first is that, since the Fourier transform is an orthogonal map, it does not amplify noise. More precisely, $\|\hat{f}-\hat{f_0}\|_2 = \frac{1}{\sqrt{N}}\|f-f_0\|_2$, so two signals that are close together have Fourier transforms that are also close together. This may be somewhat surprising since normally when one fits $N$ parameters to a signal of length $N$, there are significant issues with overfitting that can cause noise to be amplified substantially.

However, while the Fourier transform does not amplify noise, it can concentrate noise. In particular, if the noise has some sort of quasi-periodic structure then it will be concentrated over a fairly small range of frequencies.

Note, though, that the L2 norm of the noise in the frequency domain will be roughly constant relative to the number of samples. This is because, if $f_0$ is the “true” signal and $f$ is the measured signal, then $\|f-f_0\|_2 = \Theta(\sqrt(N))$, so that $\|\hat{f}-\hat{f_0}\|_2 = \Theta(1)$. Now also note that the number of frequency measurements we get out of the Fourier transform within a fixed band is proportional to the sampling time, that is, it is $\Theta(N\Delta t)$. If we put these assumptions together, and also assume that the noise is quasi-periodic such that it will be concentrated over a fixed set of frequencies, then we get $\Theta(1)$ noise distributed in the L2 sense over $\Theta(N\Delta t)$ frequencies, which implies that the level of noise at a given frequency should be $\Theta(\frac{1}{\sqrt{N\Delta t}})$. In other words, sampling for a longer time will increase our resolution on frequency measurements, which means that the noise at a given frequency will decrease as the square-root of the sampling time, which is nice.

My second point is merely that there is no spectral leakage between frequencies that differ by multiples of $\frac{2\pi}{N\Delta t}$, so in the special case when all significant frequency content of the signal occurs at frequencies that are multiples of $\frac{2\pi}{N\Delta t}$ and that are less than $\frac{\pi}{\Delta t},$ all problems with windowing and aliasing go away and we do actually get a perfect measure of the frequency content of the original signal.

Least Squares as a Substitute

The Fourier transform gives us information about the frequency content at $0, \frac{2\pi}{N\Delta t}, \frac{4\pi}{N\Delta t}, \ldots, \frac{2(N-1)\pi}{N\Delta t}$. However, this set of frequencies is somewhat arbitrary and might not match up well to the “important” frequencies in the data. If we have extra information about the specific set of frequencies we should be caring about, then a good substitute for Fourier analysis is to do least squares fitting to the signal as a superposition of the frequencies you care about.

In the special case that the frequencies you care about are a subset of the frequencies provided by the Fourier transform, you will get identical results (this has to do with the fact that complex exponentials at these frequencies are all orthogonal to each other).

In the special case that you exactly identify which frequencies occur in the signal, you eliminate the spectral leakage problem entirely (it still occurs in theory, but not between any of the frequencies that actually occur). A good way to do this in the case of a dynamical system is to excite the system at a fixed frequency so that you know to look for that frequency plus small harmonics of that frequency in the output.

In typical cases least squares will be fairly resistant to noise unless that noise has non-trivial spectral content at frequencies near those being fit. This is almost tautologically true, as it just says that spectral leakage is small between frequencies that aren’t close together. However, this isn’t exactly true, as fitting non-orthogonal frequencies changes the sort of spectral leakage that you get, and picking a “bad” set of frequencies (usually meaning large condition number) can cause lots of spectral leakage even between far apart frequencies, or else drastically exacerbate the effects of noise.

This leads to one reason not to use least squares and to use the Fourier transform instead (other than the fact that the Fourier transform is more efficient in an algorithmic sense at getting data about large sets of frequencies — $\Theta(N\log(N))$ instead of $\Theta(N^2)$). The Fourier transform always has a condition number of $1$, whereas least squares will in general have a condition number greater than $1$, and poor choices of frequencies can lead to very large condition numbers. I typically run into this problem when I attempt to gain lots of resolution on a fixed range of frequencies.

This makes sense, because there are information-theoretic limits on the amount of frequency data I can get out of a given amount of time-domain data, and if I could zoom in on a given frequency individually, then I could just do that for all frequencies one-by-one and break the information theory bounds. To beat these bounds you will have to at least implicitly make additional assumptions about the structure of the data. However, I think you can probably get pretty good results without making too strong of assumptions, but I unfortunately don’t personally know how to do that yet.

So to summarize, the Fourier transform is nice because it is orthogonal and can be computed quickly. Least squares is nice because it allows you to pick which frequencies you want and so gives you a way to encode additional information you might have about the structure of the signal.

(1) What does spectral leakage look like for non-orthogonal sets of frequencies? What do the “bad” cases look like?

(2) What is a good set of assumptions to make that helps us get better frequency information? (The weaker the assumption and the more leverage you get out of it, the better it is.)

(3) Perhaps we could try something like: “pick the smallest set of frequencies that gives us a good fit to the data”. How could we actually implement this in practice, and would it have any shortcomings? How good would it be at pulling weak signals out of noisy data?

(4) What in general is a good strategy for pulling a weak signal out of noisy data?

(5) What is a good way of dealing with quasi-periodic noise?

(6) Is there a way to deal with windowing issues, perhaps by making statistical assumptions about the data that allows us to “sample” from possible hypothetical continuations of the signal to later points in time?

Take-away lessons

To summarize, I would say the following:

Least squares

• good when you get to sample from a distribution of inputs that matches the actual distribution that you’re going to deal with in practice
• bad due to systematic biases when noise is correlated with signal (usually occurs with “output noise” in the case of a dynamical system)

Fourier transform

• good for getting a large set of frequency data
• good because of small condition number
• can fail due to aliasing
• also can be bad due to spectral leakage, which can be dealt with by using least squares if you have good information about which frequencies are important