New new blog location

I’ve switched the host of my blog to Ghost, so my blog is now located at This is different from the previous re-hosting on my personal website, which is no longer maintained.

Feedly users can subscribe via RSS here.
Or click the bottom-right subscribe button on this page to get e-mail notifications of new posts.

New blog location

[Update: This post is out of date. My blog has now moved again, to]

I’ve decided to move this blog to my personal website. It’s now located here:, including all the old posts and comments, plus some new and upcoming posts :).

Advice for Authors

I’ve spent much of the last few days reading various ICML papers and I find there’s a few pieces of feedback that I give consistently across several papers. I’ve collated some of these below. As a general note, many of these are about local style rather than global structure; I think that good local style probably contributes substantially more to readability than global structure and is in general under-rated. I’m in general pretty willing to break rules about global structure (such as even having a conclusion section in the first place! though this might cause reviewers to look at your paper funny), but not to break local stylistic rules without strong reasons.

General Writing Advice

  • Be precise. This isn’t about being pedantic, but about maximizing information content. Choose your words carefully so that you say what you mean to say. For instance, replace “performance” with “accuracy” or “speed” depending on what you mean.
  • Be concise. Most of us write in an overly wordy style, because it’s easy to and no one drilled it out of us. Not only does wordiness decrease readability, it wastes precious space if you have a page limit.
  • Avoid complex sentence structure. Most research is already difficult to understand and digest; there’s no reason to make it harder by having complex run-on sentences.
  • Use consistent phrasing. In general prose, we’re often told to refer to the same thing in different ways to avoid boring the reader, but in technical writing this will lead to confusion. Hopefully your actual results are interesting enough that the reader doesn’t need to be entertained by your large vocabulary.


  • There’s more than one approach to writing a good abstract, and which one you take will depend on the sort of paper you’re writing. I’ll give one approach that is good for papers presenting an unusual or unfamiliar idea to readers.
  • The first sentence / phrase should be something that all readers will agree with. The second should be something that many readers would find surprising, or wouldn’t have thought about before; but it should follow from (or at least be supported by) the first sentence. The general idea is that you need to start by warming the reader up and putting them in the right context, before they can appreciate your brilliant insight.
  • Here’s an example from my Reified Context Models paper: “A classic tension exists between exact inference in a simple model and approximate inference in a complex model. The latter offers expressivity and thus accuracy, but the former provides coverage of the space, an important property for confidence estimation and learning with indirect supervision.” Note how the second sentence conveys a non-obvious claim — that coverage is important for confidence estimation as well as for indirect supervision. It’s tempting to lead with this in order to make the first sentence more punchy, but this will tend to go over reader’s heads. Imagine if the abstract had started, “In the context of inference algorithms, coverage of the space is important for confidence estimation and indirect supervision.” No one is going to understand what that means.


  • The advice in this section is most applicable to the introduction section (and maybe related work and discussion), but applies on some level to other parts of the paper as well.
  • Many authors (myself included) end up using phrases like “much recent interest” and “increasingly important” because these phrases show up frequently in academic papers, and they are vague enough to be defensible. Even though these phrases are common, they are bad writing! They are imprecise and rely on hedge words to avoid having to explain why something is interesting or important.
  • Make sure to provide context before introducing a new concept; if you suddenly start talking about “NP-hardness” or “local transformations”, you need to first explain to the reader why this is something that should be considered in the present situation.
  • Don’t beat around the bush; if the point is “A, therefore B” (where B is some good fact about your work), then say that, rather than being humble and just pointing out A.
  • Don’t make the reader wait for the payoff; spell it out in the introduction. I frequently find that I have to wait until Section 4 to find out why I should care about a paper; while I might read that far, most reviewers are going to give up about halfway through Section 1. (Okay, that was a bit of an exaggeration; they’ll probably wait until the end of Section 1 before giving up.)

Conclusion / Discussion

  • I generally put in the conclusion everything that I wanted to put in the introduction, but couldn’t because readers wouldn’t be able to appreciate the context without reading the rest of the paper first. This is a relatively straightforward way to write a conclusion that isn’t just a re-hash of the introduction.
  • The conclusion can also be a good place to discuss open questions that you’d like other researchers to think about.
  • My model is that only the ~5 people most interested in your paper are going to actually read this section, so it’s worth somewhat tailoring to that audience. Unfortunately, the paper reviewers might also read this section, so you can’t tailor it too much or the reviewers might get upset if they end up not being in the target audience.
  • For theory papers, having a conclusion is completely optional (I usually skip it). In this case, open problems can go in the introduction. If you’re submitting a theory paper to NIPS or ICML, you unfortunately need a conclusion or  reviewers will get upset. In my opinion, this is an instance where peer review makes the paper worse rather than better.


  • Proper citation style: one should write “Widgets are awesome (Smith, 2001).” or “Smith (2001) shows that widgets are awesome.” but never “(Smith, 2001) shows that widgets are awesome.” You can control this in LaTeX using \citep{} and \citet{} if you use natbib.
  • Display equations can take up a lot of space if over-used, but at the same time, too many in-line equations can make your document hard to read. Think carefully about which equations are worth displaying, and whether your in-line equations are becoming too dense.
  • If leave a blank line after \end{equation} or $$, you will create an extra line break in the document. This is sort of annoying because white-space isn’t supposed to matter in that way, but you can save a lot of space by remembering this.
  • DON’T use the fullpage package. I’m used to using \usepackage{fullpage} in documents to get the margins that I want, but this will override options in many style files (including jmlr.sty which is used in machine learning).
  • \left( and \right) can be convenient for auto-sizing parentheses, but are often overly conservative (e.g. making parentheses too big due to serifs or subscripts). It’s fine to use \left( and \right) initially, but you might want to specify explicit sizes with \big(, \Big(, \bigg(, etc. in the final pass.
  • When displaying a sequence of equations (e.g. with the align environment), use \stackrel{} on any non-trivial equality or inequality statements and justify these steps immediately after the equation. See the bottom of page 6 of this paper for an example.
  • Make sure that \label{} commands come after the \caption{} command in a figure (rather than before), otherwise your numbering will be wrong.


  • When using a variable that hasn’t appeared in a while, remind the reader what it is (i.e., “the sample space \mathcal{X}” rather than “\mathcal{X}“.
  • If it’s one of the main points of your work, call it a Theorem. If it’s a non-trivial conclusion that requires a somewhat involved argument (but it’s not a main point of the work), call it a Proposition. If the proof is short or routine, call it a Lemma, unless it follows directly from a Theorem you just stated, in which case call it a Corollary.
  • As a general rule there shouldn’t be more than 3 theorems in your paper (probably not more than 1). If you think this is unreasonable, consider that my COLT 2015 paper has 3 theorems across 24 pages, and my STOC 2017 paper has 2 theorems across 47 pages (not counting stating the same theorem in multiple locations).
  • If you just made a mathematical argument in the text that ended up with a non-trivial conclusion, you probably want to encapsulate it in a Proposition or Theorem. (Better yet, state the theorem before the argument so that the reader knows what you’re arguing for; although this isn’t always the best ordering.)

Model Mis-specification and Inverse Reinforcement Learning

In my previous post, “Latent Variables and Model Mis-specification”, I argued that while machine learning is good at optimizing accuracy on observed signals, it has less to say about correctly inferring the values for unobserved variables in a model. In this post I’d like to focus in on a specific context for this: inverse reinforcement learning (Ng et al. 2000, Abeel et al. 2004, Ziebart et al. 2008, Ho et al 2016), where one observes the actions of an agent and wants to infer the preferences and beliefs that led to those actions. For this post, I am pleased to be joined by Owain Evans, who is an active researcher in this area and has co-authored an online book about building models of agents (see here in particular for a tutorial on inverse reinforcement learning and inverse planning).

Owain and I are particularly interested in inverse reinforcement learning (IRL) because it has been proposed (most notably by Stuart Russell) as a method for learning human values in the context of AI safety; among other things, this would eventually involve learning and correctly implementing human values by artificial agents that are much more powerful, and act with much broader scope, than any humans alive today. While we think that overall IRL is a promising route to consider, we believe that there are also a number of non-obvious pitfalls related to performing IRL with a mis-specified model. The role of IRL in AI safety is to infer human values, which are represented by a reward function or utility function. But crucially, human values (or human reward functions) are never directly observed.

Below, we elaborate on these issues. We hope that by being more aware of these issues, researchers working on inverse reinforcement learning can anticipate and address the resulting failure modes. In addition, we think that considering issues caused by model mis-specification in a particular concrete context can better elucidate the general issues pointed to in the previous post on model mis-specification.

Specific Pitfalls for Inverse Reinforcement Learning

In “Latent Variables and Model Mis-specification”, Jacob talked about model mis-specification, where the “true” model does not lie in the model family being considered. We encourage readers to read that post first, though we’ve also tried to make the below readable independently.

In the context of inverse reinforcement learning, one can see some specific problems that might arise due to model mis-specification. For instance, the following are things we could misunderstand about an agent, which would cause us to make incorrect inferences about the agent’s values:

  • The actions of the agent. If we believe that an agent is capable of taking a certain action, but in reality they are not, we might make strange inferences about their values (for instance, that they highly value not taking that action). Furthermore, if our data is e.g. videos of human behavior, we have an additional inference problem of recognizing actions from the frames.
  • The information available to the agent. If an agent has access to more information than we think it does, then a plan that seems irrational to us (from the perspective of a given reward function) might actually be optimal for reasons that we fail to appreciate. In the other direction, if an agent has less information than we think, then we might incorrectly believe that they don’t value some outcome A, even though they really only failed to obtain A due to lack of information.
  • The long-term plans of the agent. An agent might take many actions that are useful in accomplishing some long-term goal, but not necessarily over the time horizon that we observe the agent. Inferring correct values thus also requires inferring such long-term goals. In addition, long time horizons can make models more brittle, thereby exacerbating model mis-specification issues.

There are likely other sources of error as well. The general point is that, given a mis-specified model of the agent, it is easy to make incorrect inferences about an agent’s values if the optimization pressure on the learning algorithm is only towards predicting actions correctly in-sample.

In the remainder of this post, we will cover each of the above aspects — actions, information, and plans — in turn, giving both quantitative models and qualitative arguments for why model mis-specification for that aspect of the agent can lead to perverse beliefs and behavior. First, though, we will briefly review the definition of inverse reinforcement learning and introduce relevant notation.

Inverse Reinforcement Learning: Definition and Notations

In inverse reinforcement learning, we want to model an agent taking actions in a given environment. We therefore suppose that we have a state space S (the set of states the agent and environment can be in), an action space A (the set of actions the agent can take), and a transition function T(s' \mid s,a), which gives the probability of moving from state s to state s' when taking action a. For instance, for an AI learning to control a car, the state space would be the possible locations and orientations of the car, the action space would be the set of control signals that the AI could send to the car, and the transition function would be the dynamics model for the car. The tuple of (S,A,T) is called an MDP\backslash R, which is a Markov Decision Process without a reward function. (The MDP\backslash R will either have a known horizon or a discount rate \gamma but we’ll leave these out for simplicity.)


Figure 1: Diagram showing how IRL and RL are related. (Credit: Pieter Abbeel’s slides on IRL)

The inference problem for IRL is to infer a reward function R given an optimal policy \pi^* : S \to A for the MDP\backslash R (see Figure 1). We learn about the policy \pi^* from samples (s,a) of states and the corresponding action according to \pi^* (which may be random). Typically, these samples come from a trajectory, which records the full history of the agent’s states and actions in a single episode:

(s_0, a_0), (s_1, a_1), \ldots, (s_n, a_n)

In the car example, this would correspond to the actions taken by an expert human driver who is demonstrating desired driving behaviour (where the actions would be recorded as the signals to the steering wheel, brake, etc.).

Given the MDP\backslash R and the observed trajectory, the goal is to infer the reward function R. In a Bayesian framework, if we specify a prior on R we have:

P(R \mid s_{0:n},a_{0:n}) \propto P( s_{0:n},a_{0:n} \mid R) P(R) = P(R) \cdot \prod_{i=0}^n P( a_i \mid s_i, R)

The likelihood P(a_i \mid s_i, R) is just \pi_R(s)[a_i], where \pi_R is the optimal policy under the reward function R. Note that computing the optimal policy given the reward is in general non-trivial; except in simple cases, we typically approximate the policy using reinforcement learning (see Figure 1). Policies are usually assumed to be noisy (e.g. using a softmax instead of deterministically taking the best action). Due to the challenges of specifying priors, computing optimal policies and integrating over reward functions, most work in IRL uses some kind of approximation to the Bayesian objective (see the references in the introduction for some examples).

Recognizing Human Actions in Data

IRL is a promising approach to learning human values in part because of the easy availability of data. For supervised learning, humans need to produce many labeled instances specialized for a task. IRL, by contrast, is an unsupervised/semi-supervised approach where any record of human behavior is a potential data source. Facebook’s logs of user behavior provide trillions of data-points. YouTube videos, history books, and literature are a trove of data on human behavior in both actual and imagined scenarios. However, while there is lots of existing data that is informative about human preferences, we argue that exploiting this data for IRL will be a difficult, complex task with current techniques.

Inferring Reward Functions from Video Frames

As we noted above, applications of IRL typically infer the reward function R from observed samples of the human policy \pi^*. Formally, the environment is a known MDP\backslash R = (S,A,T) and the observations are state-action pairs, (s,a) \sim pi^*. This assumes that (a) the environment’s dynamics T are given as part of the IRL problem, and (b) the observations are structured as “state-action” pairs. When the data comes from a human expert parking a car, these assumptions are reasonable. The states and actions of the driver can be recorded and a car simulator can be used for T. For data from YouTube videos or history books, the assumptions fail. The data is a sequence of partial observations: the transition function T is unknown and the data does not separate out state and action. Indeed, it’s a challenging ML problem to infer human actions from text or videos.


Movie still: What actions are being performed in this situation? (Source)

As a concrete example, suppose the data is a video of two co-pilots flying a plane. The successive frames provide only limited information about the state of the world at each time step and the frames often jump forward in time. So it’s more like a POMDP with a complex observation model. Moreover, the actions of each pilot need to be inferred. This is a challenging inference problem, because actions can be subtle (e.g. when a pilot nudges the controls or nods to his co-pilot).

To infer actions from observations, some model relating the true state-action (s,a) to the observed video frame must be used. But choosing any model makes substantive assumptions about how human values relate to their behavior. For example, suppose someone attacks one of the pilots and (as a reflex) he defends himself by hitting back. Is this reflexive or instinctive response (hitting the attacker) an action that is informative about the pilot’s values? Philosophers and neuroscientists might investigate this by considering the mental processes that occur before the pilot hits back. If an IRL algorithm uses an off-the-shelf action classifier, it will lock in some (contentious) assumptions about these mental processes. At the same time, an IRL algorithm cannot learn such a model because it never directly observes the mental processes that relate rewards to actions.

Inferring Policies From Video Frames

When learning a reward function via IRL, the ultimate goal is to use the reward function to guide an artificial agent’s behavior (e.g. to perform useful tasks to humans). This goal can be formalized directly, without including IRL as an intermediate step. For example, in Apprenticeship Learning, the goal is to learn a “good” policy for the MDP\backslash R from samples of the human’s policy \pi^* (where \pi^* is assumed to approximately optimize an unknown reward function). In Imitation Learning, the goal is simply to learn a policy that is similar to the human’s policy.

Like IRL, policy search techniques need to recognize an agent’s actions to infer their policy. So they have the same challenges as IRL in learning from videos or history books. Unlike IRL, policy search does not explicitly model the reward function that underlies an agent’s behavior. This leads to an additional challenge. Humans and AI systems face vastly different tasks and have different action spaces. Most actions in videos and books would never be performed by a software agent. Even when tasks are similar (e.g. humans driving in the 1930s vs. a self-driving car in 2016), it is a difficult transfer learning problem to use human policies in one task to improve AI policies in another.

IRL Needs Curated Data

We argued that records of human behaviour in books and videos are difficult for IRL algorithms to exploit. Data from Facebook seems more promising: we can store the state (e.g. the HTML or pixels displayed to the human) and each human action (clicks and scrolling). This extends beyond Facebook to any task that can be performed on a computer. While this covers a broad range of tasks, there are obvious limitations. Many people in the world have a limited ability to use a computer: we can’t learn about their values in this way. Moreover, some kinds of human preferences (e.g. preferences over physical activities) seem hard to learn about from behaviour on a computer.

Information and Biases

Human actions depend both on their preferences and their beliefs. The beliefs, like the preferences, are never directly observed. For narrow tasks (e.g. people choosing their favorite photos from a display), we can model humans as having full knowledge of the state (as in an MDP). But for most real-world tasks, humans have limited information and their information changes over time (as in a POMDP or RL problem). If IRL assumes the human has full information, then the model is mis-specified and generalizing about what the human would prefer in other scenarios can be mistaken. Here are some examples:

(1). Someone travels from their house to a cafe, which has already closed. If they are assumed to have full knowledge, then IRL would infer an alternative preference (e.g. going for a walk) rather than a preference to get a drink at the cafe.

(2). Someone takes a drug that is widely known to be ineffective. This could be because they have a false belief that the drug is effective, or because they picked up the wrong pill, or because they take the drug for its side-effects. Each possible explanation could lead to different conclusions about preferences.

(3). Suppose an IRL algorithm is inferring a person’s goals from key-presses on their laptop. The person repeatedly forgets their login passwords and has to reset them. This behavior is hard to capture with a POMDP-style model: humans forget some strings of characters and not others. IRL might infer that the person intends to repeatedly reset their passwords.

Example (3) above arises from humans forgetting information — even if the information is only a short string of characters. This is one way in which humans systematically deviate from rational Bayesian agents. The field of psychology has documented many other deviations. Below we discuss one such deviation — time-inconsistency — which has been used to explain temptation, addiction and procrastination.

Time-inconsistency and Procrastination

An IRL algorithm is inferring Alice’s preferences. In particular, the goal is to infer Alice’s preference for completing a somewhat tedious task (e.g. writing a paper) as opposed to relaxing. Alice has T days in which she could complete the task and IRL observes her working or relaxing on each successive day.


Figure 2. MDP graph for choosing whether to “work” or “wait” (relax) on a task.

Formally, let R be the preference/reward Alice assigns to completing the task. Each day, Alice can “work” (receiving cost w for doing tedious work) or “wait” (cost 0). If she works, she later receives the reward R minus a tiny, linearly increasing cost (because it’s better to submit a paper earlier). Beyond the deadline at T, Alice cannot get the reward R. For IRL, we fix \epsilon and w and infer R.

Suppose Alice chooses “wait” on Day 1. If she were fully rational, it follows that R (the preference for completing the task) is small compared to w (the psychological cost of doing the tedious work). In other words, Alice doesn’t care much about completing the task. Rational agents will do the task on Day 1 or never do it. Yet humans often care deeply about tasks yet leave them until the last minute (when finishing early would be optimal). Here we imagine that Alice has 9 days to complete the task and waits until the last possible day.


Figure 3: Graph showing IRL inferences for Optimal model (which is mis-specified) and Possibly Discounting Model (which includes hyperbolic discounting). On each day (x-axis) the model gets another observation of Alice’s choice. The y-axis shows the posterior mean for R (reward for task), where the tedious work w = -1.

Figure 3 shows results from running IRL on this problem. There is an “Optimal” model, where the agent is optimal up to an unknown level of softmax random noise (a typical assumption for IRL). There is also a “Possibly Discounting” model, where the agent is either softmax optimal or is a hyperbolic discounter (with unknown level of discounting). We do joint Bayesian inference over the completion reward R, the softmax noise and (for “Possibly Discounting”) how much the agent hyperbolically discounts. The work cost w is set to -1. Figure 3 shows that after 6 days of observing Alice procrastinate, the “Optimal” model is very confident that Alice does not care about the task (R < |w|). When Alice completes the task on the last possible day, the posterior mean on R is not much more than the prior mean. By contrast, the “Possibly Discounting” model never becomes confident that Alice doesn’t care about the task. (Note that the gap between the models would be bigger for larger T. The “Optimal” model’s posterior on R shoots back to its Day-0 prior because it explains the whole action sequence as due to high softmax noise — optimal agents without noise would either do the task immediately or not at all. Full details and code are here.)

Long-term Plans

Agents will often take long series of actions that generate negative utility for them in the moment in order to accomplish a long-term goal (for instance, studying every night in order to perform well on a test). Such long-term plans can make IRL more difficult for a few reasons. Here we focus on two: (1) IRL systems may not have access to the right type of data for learning about long-term goals, and (2) needing to predict long sequences of actions can make algorithms more fragile in the face of model mis-specification.

(1) Wrong type of data. To make inferences based on long-term plans, it would be helpful to have coherent data about a single agent’s actions over a long period of time (so that we can e.g. see the plan unfolding). But in practice we will likely have substantially more data consisting of short snapshots of a large number of different agents (e.g. because many internet services already record user interactions, but it is uncommon for a single person to be exhaustively tracked and recorded over an extended period of time even while they are offline).

The former type of data (about a single representative population measured over time) is called panel data, while the latter type of data (about different representative populations measured at each point in time) is called repeated cross-section data. The differences between these two types of data is well-studied in econometrics, and a general theme is the following: it is difficult to infer individual-level effects from cross-sectional data.

An easy and familiar example of this difference (albeit not in an IRL setting) can be given in terms of election campaigns. Most campaign polling is cross-sectional in nature: a different population of respondents is polled at each point in time. Suppose that Hillary Clinton gives a speech and her overall support according to cross-sectional polls increases by 2%; what can we conclude from this? Does it mean that 2% of people switched from Trump to Clinton? Or did 6% of people switch from Trump to Clinton while 4% switched from Clinton to Trump?

At a minimum, then, using cross-sectional data leads to a difficult disaggregation problem; for instance, different agents taking different actions at a given point in time could be due to being at different stages in the same plan, or due to having different plans, or some combination of these and other factors. Collecting demographic and other side data can help us (by allowing us to look at variation and shifts within each subpopulation), but it is unclear if this will be sufficient in general.

On the other hand, there are some services (such as Facebook or Google) that do have extensive data about individual users across a long period of time. However, this data has another issue: it is incomplete in a very systematic way (since it only tracks online behaviour). For instance, someone might go online most days to read course notes and Wikipedia for a class; this is data that would likely be recorded. However, it is less likely that one would have a record of that person taking the final exam, passing the class and then getting an internship based on their class performance. Of course, some pieces of this sequence would be inferable based on some people’s e-mail records, etc., but it would likely be under-represented in the data relative to the record of Wikipedia usage. In either case, some non-trivial degree of inference would be necessary to make sense of such data.

(2) Fragility to mis-specification. Above we discussed why observing only short sequences of actions from an agent can make it difficult to learn about their long-term plans (and hence to reason correctly about their values). Next we discuss another potential issue — fragility to model mis-specification.

Suppose someone spends 99 days doing a boring task to accomplish an important goal on day 100. A system that is only trying to correctly predict actions will be right 99% of the time if it predicts that the person inherently enjoys boring tasks. Of course, a system that understands the goal and how the tasks lead to the goal will be right 100% of the time, but even minor errors in its understanding could bring the accuracy back below 99%.

The general issue is the following: large changes in the model of the agent might only lead to small changes in the predictive accuracy of the model, and the longer the time horizon on which a goal is realized, the more this might be the case. This means that even slight mis-specifications in the model could tip the scales back in favor of a (very) incorrect reward function. A potential way of dealing with this might be to identify “important” predictions that seem closely tied to the reward function, and focus particularly on getting those predictions right (see here for a paper exploring a similar idea in the context of approximate inference).

One might object that this is only a problem in this toy setting; for instance, in the real world, one might look at the particular way in which someone is studying or performing some other boring task to see that it coherently leads towards some goal (in a way that would be less likely were the person to be doing something boring purely for enjoyment). In other words, correctly understanding the agent’s goals might allow for more fine-grained accurate predictions which would fare better under e.g. log-score than would an incorrect model.

This is a reasonable objection, but there are some historical examples of this going wrong that should give one pause. That is, there are historical instances where: (i) people expected a more complex model that seemed to get at some underlying mechanism to outperform a simpler model that ignored that mechanism, and (ii) they were wrong (the simpler model did better under log-score). The example we are most familiar with is n-gram models vs. parse trees for language modelling; the most successful language models (in terms of having the best log-score on predicting the next word given a sequence of previous words) essentially treat language as a high-order Markov chain or hidden Markov model, despite the fact that linguistic theory predicts that language should be tree-structured rather than linearly-structured. Indeed, NLP researchers have tried building language models that assume language is tree-structured, and these models perform worse, or at least do not seem to have been adopted in practice (this is true both for older discrete models and newer continuous models based on neural nets).  It’s plausible that a similar issue will occur in inverse reinforcement learning, where correctly inferring plans is not enough to win out in predictive performance. The reason for the two issues might be quite similar (in language modelling, the tree structure only wins out in statistically uncommon corner cases involving long-term and/or nested dependencies, and hence getting that part of the prediction correct doesn’t help predictive accuracy much).

The overall point is: in the case of even slight model mis-specification, the “correct” model might actually perform worse under typical metrics such as predictive accuracy. Therefore, more careful methods of constructing a model might be necessary.

Learning Values != Robustly Predicting Human Behaviour

The problems with IRL described so far will result in poor performance for predicting human choices out-of-sample. For example, if someone is observed doing boring tasks for 99 days (where they only achieve the goal on Day 100), they’ll be predicted to continue doing boring tasks even when a short-cut to the goal becomes available. So even if the goal is simply to predict human behaviour (not to infer human values), mis-specification leads to bad predictions on realistic out-of-sample scenarios.

Let’s suppose that our goal is not to predict human behaviour but to create AI systems that promote and respect human values. These goals (predicting humans and building safe AI) are distinct. Here’s an example that illustrates the difference. Consider a long-term smoker, Bob, who would continue smoking even if there were (counterfactually) a universally effective anti-smoking treatment. Maybe Bob is in denial about the health effects of smoking or Bob thinks he’ll inevitably go back to smoking whatever happens. If an AI system were assisting Bob, we might expect it to avoid promoting his smoking habit (e.g. by not offering him cigarettes at random moments). This is not paternalism, where the AI system imposes someone else’s values on Bob. The point is that even if Bob would continue smoking across many counterfactual scenarios this doesn’t mean that he places value on smoking.

How do we choose between the theory that Bob values smoking and the theory that he does not (but smokes anyway because of the powerful addiction)? Humans choose between these theories based on our experience with addictive behaviours and our insights into people’s preferences and values. This kind of insight can’t easily be captured as formal assumptions about a model, or even as a criterion about counterfactual generalization. (The theory that Bob values smoking does make accurate predictions across a wide range of counterfactuals.) Because of this, learning human values from IRL has a more profound kind of model mis-specification than the examples in Jacob’s previous post. Even in the limit of data generated from an infinite series of random counterfactual scenarios, standard IRL algorithms would not infer someone’s true values.

Predicting human actions is neither necessary nor sufficient for learning human values. In what ways, then, are the two related? One such way stems from the premise that if someone spends more resources making a decision, the resulting decision tends to be more in keeping with their true values. For instance, someone might spend lots of time thinking about the decision, they might consult experts, or they might try out the different options in a trial period before they make the real decision. Various authors have thus suggested that people’s choices under sufficient “reflection” act as a reliable indicator of their true values. Under this view, predicting a certain kind of behaviour (choices under reflection) is sufficient for learning human values. Paul Christiano has written about some proposals for doing this, though we will not discuss them here (the first link is for general AI systems while the second is for newsfeeds). In general, turning these ideas into algorithms that are tractable and learn safely remains a challenging problem.

Further reading

There is research on doing IRL for agents in POMDPs. Owain and collaborators explored the effects of limited information and cognitive biases on IRL: paper, paper, online book.

For many environments it will not be possible to identify the reward function from the observed trajectories. These identification problems are related to the mis-specification problems but are not the same thing. Active learning can help with identification (paper).

Paul Christiano raised many similar points about mis-specification in a post on his blog.

For a big-picture monograph on relations between human preferences, economic utility theory and welfare/well-being, see Hausman’s “Preference, Value, Choice and Welfare”.


Thanks to Sindy Li for reviewing a full draft of this post and providing many helpful comments. Thanks also to Michael Webb and Paul Christiano for doing the same on specific sections of the post.

Prékopa–Leindler inequality

Consider the following statements:

  1. The shape with the largest volume enclosed by a given surface area is the n-dimensional sphere.
  2. A marginal or sum of log-concave distributions is log-concave.
  3. Any Lipschitz function of a standard n-dimensional Gaussian distribution concentrates around its mean.

What do these all have in common? Despite being fairly non-trivial and deep results, they all can be proved in less than half of a page using the Prékopa–Leindler inequality.

(I won’t show this here, or give formal versions of the statements above, but time permitting I will do so in a later blog post.)

Two Strange Facts

Here are two strange facts about matrices, which I can prove but not in a satisfying way.

  1. If A and B are symmetric matrices satisfying 0 \preceq A \preceq B, then A^{1/2} \preceq B^{1/2}, and B^{-1} \preceq A^{-1}, but it is NOT necessarily the case that A^2 \preceq B^2. Is there a nice way to see why the first two properties should hold but not necessarily the third? In general, do we have A^p \preceq B^p if p \in [0,1]?
  2. Given a rectangular matrix W \in \mathbb{R}^{n \times d}, and a set S \subseteq [n], let W_S be the submatrix of W with rows in S, and let \|W_S\|_* denote the nuclear norm (sum of singular values) of W_S. Then the function f(S) = \|W_S\|_* is submodular, meaning that f(S \cup T) + f(S \cap T) \leq f(S) + f(T) for all sets S, T. In fact, this is true if we take f_p(S), defined as the sum of the pth powers of the singular values of W_S, for any p \in [0,2]. The only proof I know involves trigonometric integrals and seems completely unmotivated to me. Is there any clean way of seeing why this should be true?

If anyone has insight into either of these, I’d be very interested!

Difficulty of Predicting the Maximum of Gaussians

Suppose that we have a random variable X \in \mathbb{R}^d, such that \mathbb{E}[XX^{\top}] = I_{d \times d}. Now take k independent Gaussian random variables Z_1, \ldots, Z_k \sim \mathcal{N}(0, I_{d \times d}), and let J be the argmax (over j in 1, …, k) of Z_j^{\top}X.

It seems that it should be very hard to predict J well, in the following sense: for any function q(j \mid x), the expectation of \mathbb{E}_{x}[q(J \mid x)], should with high probability be very close to \frac{1}{k} (where the second probability is taken over the randomness in Z). In fact, Alex Zhai and I think that the probability of the expectation exceeding \frac{1}{k} should be at most \exp(-C(\epsilon/k)^2d) for some constant C. (We can already show this to be true where we replace (\epsilon/k)^2 with (\epsilon/k)^4.) I will not sketch a proof here but the idea is pretty cool, it basically uses Lipschitz concentration of Gaussian random variables.

I’m mainly posting this problem because I think it’s pretty interesting, in case anyone else is inspired to work on it. It is closely related to the covering number of exponential families under the KL divergence, where we are interested in coverings at relatively large radii (\log(k) - \epsilon rather than \epsilon).

Verifying Stability of Stochastic Systems

I just finished presenting my recent paper on stochastic verification at RSS 2011. There is a conference version online, with a journal article to come later. In this post I want to go over the problem statement and my solution.

Problem Statement

Abstractly, the goal is to be given some sort of description of a system, and of a goal for that system, and then verify that the system will reach that goal. The difference between our work and a lot (but not all) of the previous work is that we want to work with an explicit noise model for the system. So, for instance, I tell you that the system satisfies

dx(t) = f(x) dt + g(x) dw(t),

where f(x) represents the nominal dynamics of the system, g(x) represents how noise enters the system, and $dw(t)$ is a standard Wiener process (the continuous-time version of Gaussian noise). I would like to, for instance, verify that h(x(T)) < 0 for some function h and some final time T. For example, if x is one-dimensional then I could ask that x(10)^2-1 < 0, which is asking for x to be within a distance of 1 of the origin at time 10. For now, I will focus on time-invariant systems and stability conditions. This means that f and g are not functions of t, and the condition we want to verify is that h(x(t)) < 0 for all t \in [0,T]. However, it is not too difficult to extend these ideas to the time-varying case, as I will show in the results at the end.

The tool we will use for our task is a supermartingale, which allows us to prove bounds on the probability that a system leaves a certain region.


Let us suppose that I have a non-negative function V of my state x such that \mathbb{E}[\dot{V}(x(t))] \leq c for all x and t. Here we define \mathbb{E}[\dot{V}(x(t))] as

\lim\limits_{\Delta t \to 0^+} \frac{\mathbb{E}[V(x(t+\Delta t)) \mid x(t)]-V(x(t))}{\Delta t}.

Then, just by integrating, we can see that \mathbb{E}[V(x(t)) \mid x(0)] \leq V(x(0))+ct. By Markov’s inequality, the probability that V(x(t)) \geq \rho is at most \frac{V(x(0))+ct}{\rho}.

We can actually prove something stronger as follows: note that if we re-define our Markov process to stop evolving as soon as V(x(t)) = \rho, then this only sets \mathbb{E}[\dot{V}] to zero in certain places. Thus the probability that V(x(t)) \geq \rho for this new process is at most \frac{V(x(0))+\max(c,0)t}{\rho}. Since the process stops as soon as V(x) \geq \rho, we obtain the stronger result that the probability that V(x(s)) \geq \rho for any s \in [0,t] is at most \frac{V(x(0))+\max(c,0)t}{\rho}. Finally, we only need the condition \mathbb{E}[\dot{V}] \leq c to hold when V(x) < \rho. We thus obtain the following:

Theorem. Let V(x) be a non-negative function such that \mathbb{E}[\dot{V}(x(t))] \leq c whenever V(x) < \rho. Then with probability at least 1-\frac{V(x(0))+\max(c,0)T}{\rho}, V(x(t)) < \rho for all t \in [0,T).

We call the condition \mathbb{E}[\dot{V}] \leq c the supermartingale condition, and a function that satisfies the martingale condition is called a supermartingale. If we can construct supermartingales for our system, then we can bound the probability that trajectories of the system leave a given region.

NOTE: for most people, a supermartingale is something that satisfies the condition \mathbb{E}[\dot{V}] \leq 0. However, this condition is often impossible to satisfy for systems we might care about. For instance, just consider exponential decay driven by Gaussian noise:


Once the system gets close enough to the origin, the exponential decay ceases to matter much and the system is basically just getting bounced around by the Gaussian noise. In particular, if the system is ever at the origin, it will get perturbed away again, so you cannot hope to find a non-constant function of x that is decreasing in expectation everywhere (just consider the global minimum of such a function: in all cases, there is a non-zero probability that the Gaussian noise will cause V(x) to increase, but a zero probability that V(x) will decrease because we are already at the global minimum this argument doesn’t actually work, but I am pretty sure that my claim is true at least subject to sufficient technical conditions).

Applying the Martingale Theorem

Now that we have this theorem, we need some way to actually use it. First, let us try to get a more explicit version of the Martingale condition for the systems we are considering, which you will recall are of the form dx(t) = f(x) dt + g(x) dw(t). Note that V(x+\Delta x) = V(x) + \frac{\partial V}{\partial x} \Delta x + \frac{1}{2}Trace\left(\Delta x^T \frac{\partial^2 V}{\partial x^2}\Delta x\right)+O(\Delta x^3).

Then \mathbb{E}[\dot{V}(x)] = \lim_{\Delta t \to 0^+} \frac{\frac{\partial V}{\partial x} \mathbb{E}[\Delta x]+\frac{1}{2}Trace\left(\mathbb{E}[\Delta x^T \frac{\partial^2 V}{\partial x^2}\Delta x\right)+O(\Delta x^3)}{\Delta t}. A Wiener process satisfies \mathbb{E}[dw(t)] = 0 and \mathbb{E}[dw(t)^2] = dt, so only the nominal dynamics (f(x)) affect the limit of the first-order term while only the noise (g(x)) affects the limit of the second-order term (the third-order and higher terms in \Delta x all go to zero). We thus end up with the formula

\mathbb{E}[\dot{V}(x)] = \frac{\partial V}{\partial x}f(x)+\frac{1}{2}Trace\left(g(x)^T\frac{\partial^2 V}{\partial x^2}g(x)\right).

It is not that difficult to construct a supermartingale, but most supermartingales that you construct will yield a pretty poor bound. To illustrate this, consider the system dx(t) = -x dt + dw(t). This is the example in the image from the previous section. Now consider a quadratic function V(x) = x^2. The preceding formula tells us that \mathbb{E}[\dot{V}(x)] = -2x^2+1. We thus have \mathbb{E}[\dot{V}(x)] \leq 1, which means that the probability of leaving the region x^2 < \rho is at most \frac{x(0)^2+T}{\rho}. This is not particularly impressive: it says that we should expect x to grow roughly as \sqrt{T}, which is how quickly x would grow if it was a random walk with no stabilizing component at all.

One way to deal with this is to have a state-dependent bound $\mathbb{E}[\dot{V}] \leq c-kV$. This has been considered for instance by Pham, Tabareau, and Slotine (see Lemma 2 and Theorem 2), but I am not sure whether their results still work if the supermartingale condition only holds locally instead of globally; I haven’t spent much time on this, so they could generalize quite trivially.

Another way to deal with this is to pick a more quickly-growing candidate supermartingale. For instance, we could pick V(x) = x^4. Then \mathbb{E}[\dot{V}] = -4x^4+6x^2, which has a global maximum of \frac{9}{4} at $x = \frac{\sqrt{3}}{2}$. This bounds then says that x grows at a rate of at most T^{\frac{1}{4}}, which is better than before, but still much worse than reality.

We could keep improving on this bound by considering successively faster-growing polynomials. However, automating such a process becomes expensive once the degree of the polynomial gets large. Instead, let’s consider a function like V(x) = e^{0.5x^2}. Then \mathbb{E}[\dot{V}] = e^{0.5x^2}(0.5-0.5x^2), which has a maximum of 0.5 at x=0. Now our bound says that we should expect x to grow like \sqrt{\log(T)}, which is a much better growth rate (and roughly the true growth rate, at least in terms of the largest value of |x| over the time interval [0,T]).

This leads us to our overall strategy for finding good supermartingales. We will search across functions of the form V(x) = e^{x^TSx} where S \succeq 0 is a matrix (the \succeq means “positive semidefinite”, which roughly means that the graph of the function x^TSx looks like a bowl rather than a saddle/hyperbola). This begs two questions: how to upper-bound the global maximum of \mathbb{E}[\dot{V}] for this family, and how to search efficiently over this family. The former is done by doing some careful work with inequalities, while the latter is done with semidefinite programming. I will explain both below.

Upper-bounding \mathbb{E}[\dot{V}]

In general, if V(x) = e^{x^TSx}, then \mathbb{E}[\dot{V}(x)] = e^{x^TSx}\left(2x^TSf(x)+Trace(g(x)^TSg(x))+2x^TSg(x)g(x)^TSx\right). We would like to show that such a function is upper-bounded by a constant c. To do this, move the exponential term to the right-hand-side to get the equivalent condition 2x^TSf(x)+Trace(g(x)^TSg(x))+2x^TSg(x)g(x)^TSx \leq ce^{-x^TSx}. Then we can lower-bound e^{-x^TSx} by 1-x^TSx and obtain the sufficient condition

c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))-2x^TSg(x)g(x)^TSx \geq 0.

It is still not immediately clear how to check such a condition, but somehow the fact that this new condition only involves polynomials (assuming that f and g are polynomials) seems like it should make computations more tractable. This is indeed the case. While checking if a polynomial is positive is NP-hard, checking whether it is a sum of squares of other polynomials can be done in polynomial time. While sum of squares is not the same as positive, it is a sufficient condition (since the square of a real number is always positive).

The way we check whether a polynomial p(x) is a sum of squares is to formulate it as the semidefinite program: p(x) = z^TMz, M \succeq 0, where z is a vector of monomials. The condition p(x) = z^TMz is a set of affine constraints on the entries of M, so that the above program is indeed semidefinite and can be solved efficiently.

Efficiently searching across all matrices S

We can extend on the sum-of-squares idea in the previous section to search over S. Note that if p is a parameterized polynomial whose coefficient are affine in a set of decision variables, then the condition p(x) = z^TMz is again a set of affine constraints on M. This almost solves our problem for us, but not quite. The issue is the form of p(x) in our case:


Do you see the problem? There are two places where the constraints do not appear linearly in the decision variables: c and S multiply each other in the first term, and S appears quadratically in the last term. While the first non-linearity is not so bad (c is a scalar so it is relatively cheap to search over c exhaustively), the second non-linearity is more serious. Fortunately, we can resolve the issue with Schur complements. The idea behind Schur complements is that, assuming A \succeq 0, the condition B^TA^{-1}B \preceq C is equivalent to \left[ \begin{array}{cc} A & B \\ B^T & C \end{array} \right] \succeq 0. In our case, this means that our condition is equivalent to the condition that

\left[ \begin{array}{cc} 0.5I & g^TSx \\ x^TSg & c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))\end{array} \right] \succeq 0

where I is the identity matrix. Now we have a condition that is linear in the decision variable S, but it is no longer a polynomial condition, it is a condition that a matrix polynomial be positive semidefinite. Fortunately, we can reduce this to a purely polynomial condition by creating a set of dummy variables y and asking that

y^T\left[ \begin{array}{cc} 0.5I & g^TSx \\ x^TSg & c(1-x^TSx)-2x^TSf(x)-Trace(g(x)^TSg(x))\end{array} \right]y \geq 0

We can then do a line search over c and solve a semidefinite program to determine a feasible value of S. If we care about remaining within a specific region, we can maximize \rho such that x^TSx < \rho implies that we stay in the region. Since our bound on the probability of leaving the grows roughly as ce^{-\rho}T, this is a pretty reasonable thing to maximize (we would actually want to maximize \rho-\log(c), but this is a bit more difficult to do).

Oftentimes, for instance if we are verifying stability around a trajectory, we would like c to be time-varying. In this case an exhaustive search is no longer feasible. Instead we alternate between searching over S and searching over c. In the step where we search over S, we maximize \rho. In the step where we search over c, we maximize the amount by which we could change c and still satisfy the constraints (the easiest way to do this is by first maximizing c, then minimizing c, then taking the average; the fact that semidefinite constraints are convex implies that this optimizes the margin on c for a fixed S).

A final note is that systems are often only stable locally, and so we only want to check the constraint \mathbb{E}[\dot{V}] \leq c in a region where V(x) < \rho. We can do this by adding a Lagrange multiplier to our constraints. For instance, if we want to check that p(x) \geq 0 whenever (x) \leq 0, it suffices to find a polynomial \lambda(x) such that \lambda(x) \geq 0 and p(x)+\lambda(x)s(x) \geq 0. (You should convince yourself that this is true; the easiest proof is just by casework on the sign of s(x).) This again introduces a non-linearity in the constraints, but if we fix S and \rho then the constraints are linear in c and \lambda, and vice-versa, so we can perform the same alternating maximization as before.


Below is the most exciting result, it is for an airplane with a noisy camera trying to avoid obstacles. Using the verification methods above, we can show that with probability at least 0.99 that the plane trajectory will not leave the gray region:


Least Squares and Fourier Analysis

I ended my last post on a somewhat dire note, claiming that least squares can do pretty terribly when fitting data. It turns out that things aren’t quite as bad as I thought, but most likely worse than you would expect.

The theme of this post is going to be things you use all the time (or at least, would use all the time if you were an electrical engineer), but probably haven’t ever thought deeply about. I’m going to include a combination of mathematical proofs and matlab demonstrations, so there should hopefully be something here for everyone.

My first topic is going to be, as promised, least squares curve fitting. I’ll start by talking about situations when it can fail, and also about situations when it is “optimal” in some well-defined sense. To do that, I’ll have to use some Fourier analysis, which will present a good opportunity to go over when frequency-domain methods can be very useful, when they can fail, and what you can try to do when they fail.

When Least Squares Fails

To start, I’m going to do a simple matlab experiment. I encourage you to follow along if you have matlab (if you have MIT certificates you can get it for free at

Let’s pretend we have some simple discrete-time process, y(n+1) = a*y(n) + b*u(n), where y is the variable we care about and u is some input signal. We’ll pick a = 0.8, b = 1.0 for our purposes, and u is chosen to be a discrete version of a random walk. The code below generates the y signal, then uses least squares to recover a and b. (I recommend taking advantage of cell mode if you’re typing this in yourself.)

%% generate data

a = 0.8; b = 1.0;

N = 1000;

ntape = 1:N; y = zeros(N,1); u = zeros(N-1,1);

for n=1:N-2

if rand < 0.02

u(n+1) = 1-u(n);


u(n+1) = u(n);



for n=1:N-1

y(n+1) = a*y(n)+b*u(n);



%% least squares fit (map y(n) and u(n) to y(n+1))

A = [y(1:end-1) u]; b = y(2:end);

params = A\b;

afit = params(1)

bfit = params(2)

The results are hardly surprising (you get afit = 0.8, bfit = 1.0). For the benefit of those without matlab, here is a plot of y against n:

Now let’s add some noise to the signal. The code below generates noise whose size is about 6% of the size of the data (in the sense of L2 norm).


yn = y + 0.1*randn(N,1); % gaussian noise with standard deviation 0.2

A = [yn(1:end-1) u]; b = yn(2:end);

params = A\b;

afit = params(1)

bfit = params(2)

This time the results are much worse: afit = 0.7748, bfit = 1.1135. You might be tempted to say that this isn’t so much worse than we might expect — the accuracy of our parameters is roughly the accuracy of our data. The problem is that, if you keep running the code above (which will generate new noise each time), you will always end up with afit close to 0.77 and bfit close to 1.15. In other words, the parameters are systematically biased by the noise. Also, we should expect our accuracy to increase with more samples, but that isn’t the case here. If we change N to 100,000, we get afit = 0.7716, bfit = 1.1298. More samples will decrease the standard deviation of our answer (running the code multiple times will yield increasingly similar results), but not necessarily its correctness.

A more dire way of thinking about this is that increasing the number of samples will increase how “certain” we are of our answer, but it won’t change the fact that our answer is wrong. So we will end up being quite certain of an incorrect answer.

Why does this happen? It turns out that when we use least squares, we are making certain assumptions about the structure of our noise, and those assumptions don’t hold in the example above. In particular, in a model like the one above, least squares assumes that all noise is process noise, meaning that noise at one step gets propagated to future steps. Such noise might come from a system with unmodelled friction or some external physical disturbance. In contrast, the noise we have is output noise, meaning that the reading of our signal is slightly off. What the above example shows is that a model constructed via least squares will be systematically biased by output noise.

That’s the intuition, now let’s get into the math.When we do least squares, we are trying to solve some equation Ax=b for x, where A and b are both noisy. So we really have something like A+An and b+bn, where An and bn are the noise on A and b.

Before we continue, I think it’s best to stop and think about what we really want. So what is it that we actually want? We observe a bunch of data as input, and some more data as output. We would like a way of predicting, given the input, what the output should be. In this sense, then, the distinction between “input noise” (An) and “output noise” bn is meaningless, as we don’t get to see either and all they do is cause b to not be exactly Ax. (If we start with assumptions on what noise “looks like”, then distinguishing between different sources of noise turns out to be actually useful. More on that later.)

If the above paragraph isn’t satisfying, then we can use the more algebraic explanation that the noise An and bn induces a single random variable on the relationship between observed input and observed output. In fact, if we let A’=A+An then we end up fitting A'x = b+(b_n-A_nx), so we can just define e = b_n-A_nx and have a single noise term.

Now, back to least squares. Least squares tries to minimize \|Ax-b\|_2^2, that is, the squared error in the l^2 norm. If we instead have a noisy b, then we are trying to minimize \|Ax-b-e\|_2^2, which will happen when x satisfies A^TAx = A^T(b+e).

If there actually exists an \hat{x} such that A\hat{x} = b (which is what we are positing, subject to some error term), then minimizing Ax-b is achieved by setting x to \hat{x}. Note that \hat{x} is what we would like to recover. So \hat{x} would be the solution to A^TAx = A^Tb, and thus we see that an error e introduces a linear error in our estimate of x. (To be precise, the error affects our answer for x via the operator (A^TA)^{-1}A^T.)

Now all this can be seen relatively easily by just using the standard formula for the solution to least squares: x = (A^TA)^{-1}A^Tb. But I find that it is easy to get confused about what exactly the “true” answer is when you are fitting data, so I wanted to go through each step carefully.

At any rate, we have a formula for how the error e affects our estimate of x, now I think there are two major questions to answer:

(1) In what way can e systematically bias our estimate for x?

(2) What can we say about the variance on our estimate for x?

To calculate the bias on x, we need to calculate \mathbb{E}((A^TA)^{-1}A^Te), where \mathbb{E} stands for expected value. Since (A^TA)^{-1} is invertible, this is the same as (A^TA)^{-1}\mathbb{E}(A^Te). In particular, we will get an unbiased estimate exactly when A and e are uncorrelated. Most importantly, when we have noise on our inputs then A and e will (probably) be correlated, and we won’t get an unbiased result.

How bad is the bias? Well, if A actually has a noise component (i.e. A=A_0+A_n), and e is b_n-A_nx, and we assume that our noise is uncorrelated with the constant matrix A, then we get a correlation matrix equal to A_n^T(b_n-A_nx), which, assuming that A_n and b_n are uncorrelated, gives us -A_n^TA_nx. The overall bias then comes out to -(A^TA)^{-1}\mathbb{E}(A_n^TA_n)x.

I unfortunately don’t have as nice of an expression for the variance, although you can of course calculate it in terms of A, b, x, A_n, and b_n.

At any rate, if noise doesn’t show up in the input, and the noise that does show up is uncorrelated with the input, then we should end up with no bias. But if either of those things is true, we will end up with bias. When modelling a dynamical system, input noise corresponds to measurement noise (your sensors are imperfect), while output noise corresponds to process noise (the system doesn’t behave exactly as expected).

One way we can see how noise being correlated with A can lead to bias is if our “noise” is actually an unmodelled quadratic term. Imagine trying to fit a line to a parabola. You won’t actually fit the tangent line to the parabola, instead you’ll probably end up fitting something that looks like a secant. However, the exact slope of the line you pick will depend pretty strongly on the distribution of points you sample along the parabola. Depending on what you want the linear model for, this could either be fine (as long as you sample a distribution of points that matches the distribution of situations that you think you’ll end up using the model for), or very annoying (if you really wanted the tangent).

If you’re actually just dealing with a parabola, then you can still get the tangent by sampling symmetrically about the point you care about, but once you get to a cubic this is no longer the case.

As a final note, one reasonable way (although I’m not convinced it’s the best, or even a particularly robust way) of determining if a linear fit of your data is likely to return something meaningful is to look at the condition number of your matrix, which can be computed in matlab using the cond function and can also be realized as the square root of the ratio of the largest to the smallest eigenvalue of A^TA. Note that the condition number says nothing about whether your data has a reasonable linear fit (it can’t, since it doesn’t take b into account). Rather, it is a measure of how well-defined the coefficients of such a fit would be. In particular, it will be large if your data is close to lying on a lower-dimensional subspace (which can end up really screwing up your fit). In this case, you either need to collect better data or figure out why your data lies on a lower-dimensional subspace (it could be that there is some additional structure to your system that you didn’t think about; see point (3) below about a system that is heavily damped).

I originally wanted to write down a lot more about specific ways that noise can come into the picture, but I haven’t worked it all out myself, and it’s probably too ambitious a project for a single blog post anyways. So instead I’m going to leave you with a bunch of things to think about. I know the answers to some of these, for others I have ideas, and for others I’m still trying to work out a good answer.

1) Can anything be done to deal with measurement noise? In particular, can anything be done to deal with the sort of noise that comes from encoders (i.e., a discretization of the signal)?

2) Is there a good way of measuring when noise will be problematic to our fit?

3) How can we fit models to systems that evolve on multiple time-scales? For example, an extremely damped system such as

\dot{x}_1 = x_2

\dot{x}_2 = -cx_1-Mx_2

where M >> c. You could take, for example, M = 20, c = 1, in which case the system  behaves almost identically to the system

\dot{x}_1 = \frac{-M + \sqrt{M^2-4c}}{2} x_1

with x_2 set to the derivative of x_1. Then the data will all lie almost on a line, which can end up screwing up your fit. So in what exact ways can your fit get screwed up, and what can be done to deal with it? (This is essentially the problem that I’m working on right now.)

4) Is there a way to defend against non-linearities in a system messing up our fit? Can we figure out when these non-linearities occur, and to what extent?

5) What problems might arise when we try to fit a system that is unstable or only slightly stable, and what is a good strategy for modelling such a system?

When Least Squares Works

Now that I’ve convinced you that least squares can run into problems, let’s talk about when it can do well.

As Paul Christiano pointed out to me, when you have some system where you can actually give it inputs and measure the outputs, least squares is likely to do a fairly good job. This is because you can (in principle) draw the data you use to fit your model from the same distribution as you expect to encounter when the model is used in practice. However, you will still run into the problem that failure to measure the input accurately introduces biases. And no, these biases can’t be eradicated completely by averaging the result across many samples, because the bias is always a negative definite matrix applied to x (the parameters we are trying to find), and any convex combination of negative definite matrices will remain negative definite.

Intuitively, what this says is that if you can’t trust your input, then you shouldn’t rely on it strongly as a predictor. Unfortunately, the only way that a linear model knows how to trust something less is by making the coefficient on that quantity “smaller” in some sense (in the negative definite sense here). So really the issue is that least squares is too “dumb” to deal with the issue of measurement error on the input.

But I said that I’d give examples of when least squares works, and here I am telling you more about why it fails. One powerful and unexpected aspect of least squares is that it can fit a wide variety of non-linear models. For example, if we have a system y = c_1+c_2x+c_3x^2+c_4\cos(x), then we just form a matrix A = \left[ \begin{array}{cccc} 1 & x & x^2 & \cos(x) \end{array} \right] and b = y, where for example \cos(x) is actually a column vector where the ith row is the cosine of the ith piece of input data. This will often be the case in physical systems, and I think is always the case for systems solved via Newton’s laws (although you might have to consolide parameters, for example fitting both mgl and ml^2 in the case of a pendulum). This isn’t necessarily the case for reduced models of complicated systems, for example the sort of models used for fluid dynamics. However, I think that the fact that linear fitting techniques can be applied to such a rich class of systems is quite amazing.

There is also a place where least squares not only works but is in some sense optimal: detecting the frequency response of a system. Actually, it is only optimal in certain situations, but even outside of those situations it has many advantages over a standard discrete Fourier transform. To get into the applications of least squares here, I’m going to have to take a detour into Fourier analysis.

Fourier Analysis

If you already know Fourier analysis, you can probably skip most of this section (although I recommend reading the last two paragraphs).

Suppose that we have a sequence of N signals at equally spaced points in time. Call this sequence x_1, x_2, \ldots, x_n. We can think of this as a function f : \{0,1,\ldots,N-1\} \to \mathbb{R}, or, more accurately, f : \{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\} \to \mathbb{R}. For reasons that will become apparent later, we will actually think of this as a function f : \{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\} \to \mathbb{C}.

This function is part of the vector space of all functions from \{0,\Delta t, 2\Delta t, \ldots, (N-1)\Delta t\} to \mathbb{C}. One can show that the functions on \{0,\Delta t,\ldots,(N-1)\Delta t\} defined by

f_k(x) = e^{\frac{2\pi i k x}{N \Delta t}},

with k ranging from 0 to N-1, are all orthogonal to each other, and thus form a basis for the space of all functions from \{0,\Delta t,2\Delta t,\ldots,(N-1)\Delta y\} to \mathbb{C} (now it is important to use \mathbb{C} since the f_k take on complex values). It follows that our function f can be written uniquely in the form f(x) = \sum_{k=0}^{N-1} c_kf_k(x), where the c_k are constants. Now because of this we can associate with each f a function \hat{f} : \{0,\frac{2 \pi}{N \Delta t},\frac{4\pi}{N\Delta t},\ldots,\frac{(N-1)\pi}{N\delta t}\} \to \mathbb{C} given by \hat{f}(\frac{2\pi k}{N \Delta t}) := c_k.

An intuitive way of thinking about this is that any function can be uniquely decomposed as a superposition of complex exponential functions at different frequencies. The function \hat{f} is a measure of the component of f at each of these frequencies. We refer to \hat{f} as the Fourier transform of f.

While there’s a lot more that could be said on this, and I’m tempted to re-summarize all of the major results in Fourier analysis, I’m going to refrain from doing so because there are plenty of texts on it and you can probably get the relevant information (such as how to compute the Fourier coefficients, the inverse Fourier transform, etc.) from those. In fact, you could start by checking out Wikipedia’s article. It is also worth noting that the Fourier transform can be computed in O(N\log N) time using any one of many “fast Fourier transform” algorithms (fft in matlab).

I will, however, draw your attention to the fact that if we start with information about f at times \{0,\Delta t,\ldots, (N-1)\Delta t\}, then we end up with frequency information at the frequencies \{0,\frac{2\pi}{N\Delta t},\ldots,\frac{2\pi(N-1)}{N\Delta t}\}. Also, you should really think of the frequencies as wrapping around cyclically (frequencies that differ from each other by a multiple of \frac{2\pi}{\Delta t} are indistinguishable on the interval we sampled over), and also if f is real-valued then \hat(f)(-\omega) = \overline{\hat{f}(\omega)}, where the bar means complex conjugate and -\omega is, as just noted, the same as \frac{2\pi}{\Delta t}-\omega.

A final note before continuing is that we could have decomposed f into a set of almost any N frequencies (as long as they were still linearly independent), although we can’t necessarily do so in O(N\log N) time. We will focus on the set of frequencies obtained by a Fourier transform for now.

When Fourier Analysis Fails

The goal of taking a Fourier transform is generally to decompose a signal into component frequencies, under the assumption that the signal itself was generated by some “true” superposition of frequencies. This “true” superposition would best be defined as the frequency spectrum we would get if we had an infinitely long continuous tape of noise-free measurements and then took the continuous Fourier transform.

I’ve already indicated one case in which Fourier analysis can fail, and this is given by the fact that the Fourier transform can’t distinguish between frequencies that are separated from each other by multiples of \frac{2\pi}{\Delta t}. In fact, what happens in general is that you run into problems when your signal contains frequencies that move faster than your sampling rate. The rule of thumb is that your signal should contain no significant frequency content above the Nyquist rate, which is half the sampling frequency. One way to think of this is that the “larger” half of our frequencies (i.e. \frac{\pi}{\Delta t} up through \frac{2\pi}{\Delta t}) are really just the negatives of the smaller half of our frequencies, and so we can measure frequencies up to roughly \frac{\pi}{\Delta t} before different frequencies start to run into each other.

The general phenomenon that goes on here is known as aliasing, and is the same sort of effect as what happens when you spin a bicycle wheel really fast and it appears to be moving backwards instead of forwards. The issue is that your eye only samples at a given rate and so rotations at speeds faster than that appear the same to you as backwards motion. See also this image from Wikipedia and the section in the aliasing article about sampling sinusoidal functions.

The take-away message here is that you need to sample fast enough to capture all of the actual motion in your data, and the way you solve aliasing issues is by increasing the sample rate.

A trickier problem is the “windowing” problem, also known as spectral leakage. [Note: I really recommend reading the linked wikipedia article at some point, as it is a very well-written and insightful treatment of this issue.] The problem can be summarized intuitively as follows: nearby frequencies will “bleed into” each other, and the easiest way to reduce this phenomenon is to increase your sample time. Another intuitive statement to this effect is that the extent to which you can distinguish between two nearby frequencies is roughly proportional to the number of full periods that you observe of their difference frequency. I will make both of these statements precise below. First, though, let me convince you that spectral leakage is relevant by showing you what the Fourier transform of a periodic signal looks like when the period doesn’t fit into the sampling window. The first image below is a plot of y=cos(t), and the second is a snapshot of part of the Fourier transform (blue is real part, green is imaginary part). Note that the plot linearly interpolates between sample points. Also note that the sampling frequency was 100Hz, although that is almost completely irrelevant.

The actual frequency content should be a single spike at \omega = 1, so windowing can in fact cause non-trivial issues with your data.

Now let’s get down to the actual analytical reason for the windowing / spectral leakage issue. Recall the formula for the Fourier transform: \hat{f}(\omega) = \frac{1}{N} \sum_{t} f(t)e^{-i\omega t}. Now suppose that f is a complex exponential with some frequency \omega', i.e. f(t) = e^{i\omega' t}. Then some algebra will yield the formula

\hat{f}(\omega) = \frac{1}{N} \frac{e^{i(\omega'-\omega)N\Delta t}-1}{e^{i(\omega'-\omega)\Delta t}-1},

which tells us the extent to which a signal at a frequency of \omega' will incorrectly contribute to the estimate of the frequency content at \omega. The main thing to note here is that larger values of N will cause this function to become more concentrated horizontally, which means that, in general (although not necessarily at a given point), it will become smaller. At the same time, if you change the sampling rate without changing the total sampling time then you won’t significantly affect the function. This means that the easiest way to decrease windowing is to increase the amount of time that you sample your signal, but that sampling more often will not help you at all.

Another point is that spectral leakage is generically roughly proportional to the inverse of the distance between the two frequencies (although it goes to zero when the difference in frequencies is close to a multiple of \frac{2\pi}{N\Delta t}), which quantifies the earlier statement about the extent to which two frequencies can be separated from each other.

Some other issues to keep in mind: the Fourier transform won’t do a good job with quasi-periodic data (data that is roughly periodic with a slowly-moving phase shift), and there is also no guarantee that your data will have good structure in the frequency domain. It just happens that this is in theory the case for analytic systems with a periodic excitation (see note (1) in the last section of this post — “Answers to Selected Exercises” — for a more detailed explanation).

When Fourier Analysis Succeeds

Despite issues with aliasing and spectral leakage, there are some strong points to the Fourier transform. The first is that, since the Fourier transform is an orthogonal map, it does not amplify noise. More precisely, \|\hat{f}-\hat{f_0}\|_2 = \frac{1}{\sqrt{N}}\|f-f_0\|_2, so two signals that are close together have Fourier transforms that are also close together. This may be somewhat surprising since normally when one fits N parameters to a signal of length N, there are significant issues with overfitting that can cause noise to be amplified substantially.

However, while the Fourier transform does not amplify noise, it can concentrate noise. In particular, if the noise has some sort of quasi-periodic structure then it will be concentrated over a fairly small range of frequencies.

Note, though, that the L2 norm of the noise in the frequency domain will be roughly constant relative to the number of samples. This is because, if f_0 is the “true” signal and f is the measured signal, then \|f-f_0\|_2 = \Theta(\sqrt(N)), so that \|\hat{f}-\hat{f_0}\|_2 = \Theta(1). Now also note that the number of frequency measurements we get out of the Fourier transform within a fixed band is proportional to the sampling time, that is, it is \Theta(N\Delta t). If we put these assumptions together, and also assume that the noise is quasi-periodic such that it will be concentrated over a fixed set of frequencies, then we get \Theta(1) noise distributed in the L2 sense over \Theta(N\Delta t) frequencies, which implies that the level of noise at a given frequency should be \Theta(\frac{1}{\sqrt{N\Delta t}}). In other words, sampling for a longer time will increase our resolution on frequency measurements, which means that the noise at a given frequency will decrease as the square-root of the sampling time, which is nice.

My second point is merely that there is no spectral leakage between frequencies that differ by multiples of \frac{2\pi}{N\Delta t}, so in the special case when all significant frequency content of the signal occurs at frequencies that are multiples of \frac{2\pi}{N\Delta t} and that are less than \frac{\pi}{\Delta t}, all problems with windowing and aliasing go away and we do actually get a perfect measure of the frequency content of the original signal.

Least Squares as a Substitute

The Fourier transform gives us information about the frequency content at 0, \frac{2\pi}{N\Delta t}, \frac{4\pi}{N\Delta t}, \ldots, \frac{2(N-1)\pi}{N\Delta t}. However, this set of frequencies is somewhat arbitrary and might not match up well to the “important” frequencies in the data. If we have extra information about the specific set of frequencies we should be caring about, then a good substitute for Fourier analysis is to do least squares fitting to the signal as a superposition of the frequencies you care about.

In the special case that the frequencies you care about are a subset of the frequencies provided by the Fourier transform, you will get identical results (this has to do with the fact that complex exponentials at these frequencies are all orthogonal to each other).

In the special case that you exactly identify which frequencies occur in the signal, you eliminate the spectral leakage problem entirely (it still occurs in theory, but not between any of the frequencies that actually occur). A good way to do this in the case of a dynamical system is to excite the system at a fixed frequency so that you know to look for that frequency plus small harmonics of that frequency in the output.

In typical cases least squares will be fairly resistant to noise unless that noise has non-trivial spectral content at frequencies near those being fit. This is almost tautologically true, as it just says that spectral leakage is small between frequencies that aren’t close together. However, this isn’t exactly true, as fitting non-orthogonal frequencies changes the sort of spectral leakage that you get, and picking a “bad” set of frequencies (usually meaning large condition number) can cause lots of spectral leakage even between far apart frequencies, or else drastically exacerbate the effects of noise.

This leads to one reason not to use least squares and to use the Fourier transform instead (other than the fact that the Fourier transform is more efficient in an algorithmic sense at getting data about large sets of frequencies — \Theta(N\log(N)) instead of \Theta(N^2)). The Fourier transform always has a condition number of 1, whereas least squares will in general have a condition number greater than 1, and poor choices of frequencies can lead to very large condition numbers. I typically run into this problem when I attempt to gain lots of resolution on a fixed range of frequencies.

This makes sense, because there are information-theoretic limits on the amount of frequency data I can get out of a given amount of time-domain data, and if I could zoom in on a given frequency individually, then I could just do that for all frequencies one-by-one and break the information theory bounds. To beat these bounds you will have to at least implicitly make additional assumptions about the structure of the data. However, I think you can probably get pretty good results without making too strong of assumptions, but I unfortunately don’t personally know how to do that yet.

So to summarize, the Fourier transform is nice because it is orthogonal and can be computed quickly. Least squares is nice because it allows you to pick which frequencies you want and so gives you a way to encode additional information you might have about the structure of the signal.

Some interesting questions to ask:

(1) What does spectral leakage look like for non-orthogonal sets of frequencies? What do the “bad” cases look like?

(2) What is a good set of assumptions to make that helps us get better frequency information? (The weaker the assumption and the more leverage you get out of it, the better it is.)

(3) Perhaps we could try something like: “pick the smallest set of frequencies that gives us a good fit to the data”. How could we actually implement this in practice, and would it have any shortcomings? How good would it be at pulling weak signals out of noisy data?

(4) What in general is a good strategy for pulling a weak signal out of noisy data?

(5) What is a good way of dealing with quasi-periodic noise?

(6) Is there a way to deal with windowing issues, perhaps by making statistical assumptions about the data that allows us to “sample” from possible hypothetical continuations of the signal to later points in time?

Take-away lessons

To summarize, I would say the following:

Least squares

  • good when you get to sample from a distribution of inputs that matches the actual distribution that you’re going to deal with in practice
  • bad due to systematic biases when noise is correlated with signal (usually occurs with “output noise” in the case of a dynamical system)

Fourier transform

  • good for getting a large set of frequency data
  • good because of small condition number
  • can fail due to aliasing
  • also can be bad due to spectral leakage, which can be dealt with by using least squares if you have good information about which frequencies are important

Answers to selected exercises

Okay well mainly I just feel like some of the questions that I gave as exercises are important enough that you should know the answer. There isn’t necessarily a single answer, but I’ll at least give you a good way of doing something if I know of one. I’ve added a fold so you can avoid spoilers. It turns out that for this post I only have one good answer, which is about dealing with non-linear dynamical systems.

Read more of this post

Linear Control Theory: Part I

Last time I talked about linear control, I presented a Linear Quadratic Regulator as a general purpose hammer for solving linear control problems. In this post I’m going to explain why LQR by itself is not enough (even for nominally linear systems). (Author’s note: I got to the end of the post and realized I didn’t fulfill my promise in the previous sentence. So it’s redacted, but will hopefully be dealt with in a later post.) Then I’m going to do my best to introduce a lot of the standard ideas in linear control theory.

My motivation for this is that, even though these ideas have a reasonably nice theory from a mathematical standpoint, they are generally presented from an engineering standpoint. And although all of the math is right there, and I’m sure that professional control theorists understand it much better than I do, I found that I had to go to a lot of effort to synthesize a good mathematical explanation of the underlying theory.

However, this effort was not due to any inherent difficulties in the theory itself, but rather, like I said, a disconnect in the intuition of, and issues relevant to, an engineer versus a mathematician. I’m not going to claim that one way of thinking is better than the other, but my way of thinking, and I assume that of most of my audience, falls more in line with the mathematical viewpoint. What’s even better is that many of the techniques built up for control theory have interesting ramifications when considered as statements about vector spaces. I hope that you’ll find the exposition illuminating.

As before, we will consider a linear system

\dot{x} = Ax+Bu,

where A and B are matrices and u is a vector of control inputs (x is the state of the system). However, in addition to a control input u, we will have an output y, such that y is a function of x and u:

y =  Cx+Du.

In some cases, y will be a set of observed states of a system, but in principal y can be any quantity we care about, provided that it is a linear function of state and control. We further assume that A, B, C, and D are constant with respect to time. We call a system that follows this assumption a linear time-invariant system, or just LTI system.

Since the system is linear, we have superposition and therefore can break up any function (for example, the function from u(t) to y(t)) into a function from each coordinate of u(t) to each coordinate of y(t). For each of these functions, we can take their Laplace transform. So, we start with

\dot{x} = Ax+Bu

y =   Cx+Du

and end up with (after taking the Laplace transform)

sX   = AX+BU

Y =   CX+DU.

Solving these two equations for Y as a function of U gives Y = (C(sI-A)^{-1}B+D)U. We call this mapping from U to Y the transfer function of the system. Cramer’s Rule implies that the transfer function of any linear time-invariant system will be a matrix where each entry is a ratio of two polynomials. We refer to such transfer functions as rational. I will show later that the converse is also true: any rational matrix is the transfer function of some LTI system. We call such an LTI system the state-space representation of the transfer function. (I apologize for throwing all this terminology at you, but it is used pretty unapologetically in control systems literature so I’d feel bad leaving it out.)

As an example, consider a damped harmonic oscillator with an external force u as a control input, and suppose that the outputs we care about are position and velocity. We will let q denote the position of the oscillator. This has the following state-space representation:

\left[ \begin{array}{c} \dot{q} \\ \ddot{q} \end{array} \right] = \left[   \begin{array}{cc} 0 & 1 \\ -k & -b \end{array} \right] \left[   \begin{array}{c} q \\ \dot{q} \end{array} \right] + \left[   \begin{array}{c} 0 \\ 1 \end{array} \right] u

\left[ \begin{array}{c} y_1 \\ y_2 \end{array} \right] = \left[   \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right] \left[   \begin{array}{c} q \\ \dot{q} \end{array} \right] + 0 \cdot u

Here k is the spring constant of the oscillator and b is the damping factor. For convenience we will write x instead of \left[ \begin{array}{c} q \\ \dot{q} \end{array} \right] and y instead of \left[ \begin{array}{c} y_1 \\ y_2   \end{array} \right]. Also, we will let I denote the 2   \times 2 identity matrix. Then, after taking the Laplace transform, we get

sX   = \left[ \begin{array}{cc} 0 & 1 \\ -k & -b \end{array}   \right]X + \left[ \begin{array}{c} 0 \\ 1 \end{array} \right]U

Y =   X.

Solving the first equation gives

\left[ \begin{array}{cc} s & -1 \\ k & s+b \end{array} \right] X   = \left[ \begin{array}{c} 0 \\ 1 \end{array} \right]U,


X =   \frac{1}{s^2+bs+k}\left[ \begin{array}{cc} s+b & 1 \\ -k & s   \end{array} \right]\left[ \begin{array}{c} 0 \\ 1 \end{array}\right]U =   \frac{1}{s^2+bs+k} \left[ \begin{array}{c} 1 \\ s \end{array} \right]U

Therefore, the transfer function from U to Y is \frac{1}{s^2+bs+k} \left[ \begin{array}{c} 1 \\ s \end{array}   \right].

We can think of the transfer function as a multiplier on the frequency spectrum of u (note that s is allowed to be an arbitrary complex number; if s is non-real then we have oscillation at a frequency equal to the imaginary part of s; if \Re(s) < 0 then we have damped oscillation, whereas if \Re(s) > 0 then the magnitude of the oscillation increases exponentially. Note that \Re(s) denotes the real part of s.

Exercise: What does a pole of a transfer function correspond to? What about a zero? Answers below the fold.

If a transfer function has a pole, then it means that even if a given frequency doesn’t show up in the input u, it can still show up in the output y. Thus it is some self-sustaining, natural mode of the system. For LTI systems, this corresponds to an eigenvector of the matrix A, and the location of the pole is the corresponding eigenvalue.

A zero, on the other hand, means that a mode will not show up in the output even if it is present in the input. So for instance, the damped oscillator has poles at \frac{-b \pm \sqrt{b^2-4k}}{2}. Let us assume that b and k are both positive for the damped oscillator. Then, for b \geq 2\sqrt{k}, both of the poles are real and negative, meaning that the system is critically damped. For b < 2\sqrt{k}, the poles have negative real part and imaginary part equal to \sqrt{k-\frac{b^2}{4}}, meaning that the system will exhibit damped oscillation. Finally, there is a zero in the second coordinate of the transfer matrix at s = 0. This corresponds to the fact that a harmonic oscillator can be held at a fixed distance from its natural fixed point by a fixed external force. Since the distance is fixed, the contribution to velocity is zero.

There is more to be said on transfer functions, but before I go into that I would like to give you a working picture of how u and y should be viewed mathematically. This is a view that I only recently acquired. For this I owe thanks to Stefano Stramigioli, who gave a very interesting talk on port-Hamiltonian methods at Dynamic Walking 2010. (Update: Stefano recommends this book as a resource for learning more.)


Here is how I think you should think about linear control mathematically. First, you have a state-space V. You also have a space of controls U and a space of outputs Y. Finally, you have a space TV, the tangent space to V.

Ignoring U and Y for a moment, let’s just focus on V and TV. We can think of elements of TV as generalized forces, and the elements of V as generalized velocities. I realize that state-space also takes position into account, but you will note that no external forces show up in the equations for position, so I think this view still makes sense.

If we have a set of forces and velocities, then we can compute power (if our system is in regular Cartesian coordinates, then this is just \vec{F} \cdot \vec{v}). In this way, we can think of V and TV as dual to each other. I think that generalized velocities are actually somehow supposed to live in the cotangent space T^*V, rather than V, but I don’t know enough analysis to see why this is true. If someone else does, I would love to hear your explanation.

At any rate, we have these two spaces, V and TV, that are in duality with each other. The operator A : V \to TV then induces a map \tilde{A} from \mathcal{L}^{1}(\mathbb{R},TV) to \mathcal{L}^{1}(\mathbb{R},V), where \mathcal{L}^{1}(X,Y) is the space of Lesbegue-integrable functions from X to Y (although in practice all of our inputs and outputs will be real-valued, not complex-valued, since the systems we care about are all physical). Since V and TV are in duality with each other, we can also think of this as assigning a power history to any force history (the power history being [\tilde{A}(f)](f), where f is the force history).

What’s more remarkable is that the transfer function from force histories to state histories is (sI-A)^{-1} in the Laplace domain (as discussed above — just set B = C = I for the state-space representation). Therefore it is invertible except on a set of measure zero (the poles of A) and so as far as \mathcal{L}^{1} spaces are concerned it is an isomorphism; this is a bit of a technical point here, but I’m using the fact that \mathcal{L}^1 spaces are composed of equivalence classes of functions that differ on sets of measure zero, and also probably implicitly using some theorems from Fourier analysis about how the Fourier (Laplace) transform is an isomorphism from \mathcal{L}^{1}(\mathbb{R},V) to itself. I’m still glossing over some technical details here; in particular, I think you might need to consider the intersection of \mathcal{L}^1 and \mathcal{L}^2 instead of just \mathcal{L}^1, and also the target space of the Fourier transform is really \mathcal{L}^{1}(\widehat{\mathbb{R}},V), not \mathcal{L}^1(\mathbb{R},V), but these details aren’t really important to the exposition.

Getting back on track, we’ve just shown that the dynamics matrix A of a linear system induces an isomorphism between force histories and state histories. My guess is that you can also show this for reasonably nice non-linear systems, but I don’t have a proof off the top of my head. So, letting U denote the space of control signals and Y the space of outputs, what we have is something like this:

\mathcal{L}^{1}(\mathbb{R},U) \xrightarrow{B}  \mathcal{L}^{1}(\mathbb{R},TV) \xrightarrow{\overset{\tilde{A}}{\sim}}  \mathcal{L}^{1}(\mathbb{R},V) \xrightarrow{C}   \mathcal{L}^{1}(\mathbb{R},Y)

Incidentally, that middle map (the isomorphism with \tilde{A}) is hideous-looking, and if someone has a good way to typeset such a thing I would like to know about it.

In any case, in this context it is pretty easy to see how the inputs and outputs play dual roles to each other, and in fact if we replaced A, B, and C each with their adjoints A^{\dagger}, B^{\dagger}, and C^{\dagger}, then we get a new dynamical system where the inputs and outputs actually switch places (as well as the matrices governing the inputs and outputs). Note that I’ve left D out of this for now. I’m not really sure yet of a good way to fit it into this picture; it’s possible that D is just unnatural mathematically but sometimes necessary physically (although usually we can assume that D = 0).

Now that we have this nice framework for thinking about linear control systems, I’m going to introduce controllers and observers, and it will be easy to see that they are dual to each other in the sense just described.

Controllability and Observability

Go back to the non-linear case for a moment and suppose that we have a system \dot{x} = f(x,u), or, in the notation I’ve been using, \dot{x} = f(x) + Bu. We say that such a system is controllable if for any two states x_1 and x_2, there exists a time t_0 > 0 and a control signal u(t) such that if x(0) =  x_1 then x(t_0) = x_2 when the system is driven by the control signal u(t). What this says intuitively is that we can get from any state to any other state in a finite amount of time.

For linear systems, controllability implies something stronger — we can actually get from any state to any other state arbitrarily quickly, and this is often times the definition given in the linear case. For non-linear systems, this is not the case, as a trivial example we could have

\dot{x_1} = u

\dot{x_2} = max(x_1,1)

There are a few important properties of linear systems that are equivalent to controllability:

(1) There is no proper subspace W of the state space such that A(W)  \subset W and B(U) \subset W, where U is the space of possible instantaneous control signals. The intuition is that there is no subspace that the passive dynamics (without control) can get stuck in such that the control input can’t move the dynamics out of that space.

(2) There is no left eigenvector of A that is in the left null space of B. In other words, it actually suffices to check the criterion (1) above just for one-dimensional subspaces.

(3) The matrix [B \ AB \ A^2B \ \ldots \ A^{n-1}B], where n is the dimension of the state space of the system, has full row rank.

(4) For any choice of n eigenvalues \lambda_1, \ldots, \lambda_n, there exists a matrix F such that A+BF has generalized eigenvalues \lambda_1, \ldots, \lambda_n. We can think of this as saying that an appropriate linear feedback law u = Fx can be used to give the closed-loop (i.e. after control is applied) dynamics arbitrary eigenvalues.

I will leave (1) and (2) to you as exercises. Note that this is because I actually think you can solve them, not because I’m being lazy. (3) I will prove shortly (it is a very useful computational criterion for testing controllability). (4) I will prove later in this post. I should also note that these criteria also hold for a discrete-time system

x_{n+1} = Ax_n + Bu_n

y_n = Cx_n + Du_n

Proof of (3): In the case of a discrete-time system, if we have control inputs u_1, \ldots, u_k, then x_{k+1} will be

A^k  x_1 + (Bu_k + ABu_{k-1} + A^2Bu_{k-2} + \ldots + A^{k-1}Bu_1)

In particular, after k time steps we can affect x_{k+1} by an arbitrary linear combination of elements from the row spaces of A^{i}B, where i ranges from 0 to k-1. In other words, we can drive x_{k+1} to an arbitrary state if and only if the row space of [A^{i}B]_{i=0}^{k-1} is the entire state space, i.e. [A^{i}B]_{i=0}^{k-1} has full row rank. So a discrete-time system is controllable if and only if [A^{i}B]_{i=0}^{k-1} has full row rank for some sufficiently large k.

To finish the discrete-time case, we use the Cayley-Hamilton theorem, which shows that any n \times n matrix satisfies a degree n polynomial, and so in particular it suffices to pick k = n above, since A^nB can be written as a linear combination of A^{i}B for i < n, and similarly for any larger powers of A.

Now we need to deal with the continuous time case. In this case, we can use the theory of linear differential equations to show that

x(t) = x(0)e^{At} + \int_{0}^{t} e^{A\tau}Bu(t-\tau) d\tau,

where e^{A\tau} is the matrix exponential of A\tau. But if we use the Cayley-Hamilton theorem a second time, we see that $e^{A\tau}$ can be expressed as an (n-1)st degree polynomial in A\tau, so that there exists some c_0(\tau), \ldots, c_{n-1}(\tau) such that

x(t) =e^{At}x(0) + \sum_{k=0}^{n-1} A^kB \int_{0}^{t} c_k(\tau)u(t-\tau)   d\tau.

From here it is clear that, in order for a continuous time system to be controllable, the controllability matrix must have full row rank (since x(t) is equal to e^{At}x(0) plus something in the row space of the controllability matrix). The converse is less obvious. If the c_k(\tau) were linearly independent functions, then we would be done, because the last term in the sum can be thought of as the inner product of c_k(\tau) and u(t-\tau), and we can just use Gram-Schmidt orthogonalization to show that those inner products can be chosen arbitrarily (if you don’t see this then figuring it out is a good linear algebra exercise).

The problem is that the c_k(\tau) are not necessarily linearly independent. If A has all distinct eigenvalues, then they will be. This is because we have the relations e^{At}v = e^{\lambda t} v and A^k v = \lambda^k v for any \lambda-eigenvector v of A, so we can write n distinct exponential functions as a linear combination of the c_k(\tau), and any relation among the c_k would imply a relation among the e_{\lambda t}, which is impossible (it is a basic result from Fourier analysis that exponential functions are linearly independent).

However, this result actually needs A to have distinct eigenvalues. In particular, if one takes A = I, the n \times n identity matrix, then you can show that all but one of the c_k can be chosen arbitrarily. This is because I, I^2, \ldots are all equal to each other, and thus linearly dependent.

What we need to do instead is let m be the degree of the minimal polynomial p such that p(A) = 0. Then we can actually write e^{At} as \sum_{k=0}^{m-1} d_k(t) for some functions d:

\sum_{k=0}^{m-1} d_k(t)A^k = e^{At}

By the way in which the d_k were constructed (by applying polynomial relations to an absolutely convergent Taylor series), we know that they are all infinitely differentiable, hence we can differentiate both sides l times and write

\sum_{k=0}^{m-1} d_k^{(l)}(t) A^k = A^l e^{At}

Now look at these derivatives from l = 0 to l = m-1. If the d_k(t) were linearly dependent, their derivatives would satisfy the same relation, and therefore (by evaluating everything at t = 0, the matrices A^0, A^1, \ldots, A^{m-1} would satisfy a linear relation, which is impossible, since then A would satisfy a polynomial relation of degree less than m.

So, the d_k(t) are linearly independent, and thus by the argument with Gram-Schmidt above we can write anything in the row space of B, AB, \ldots, A^{m-1}B as

e^{At}x(0) + \sum_{k=0}^{m-1} A^kB \int_{0}^{t} d_k(\tau)u(t-\tau)   d\tau

for any t > 0. So are we done? Almost. The last step we need to finish is to note that if A satisfies a polynomial of degree m then the row space of [B \ AB \ \ldots \ A^{m-1}B] is the same as the row space of [B \ AB \ \ldots \ A^{n-1}B], for n > m.

So, that proves the result (3) about the controllability matrix. It was a lot of work in the continuous time case, although it matches our intuition for why it should be true (taking an exponential and taking a derivative are somewhat complementary to each other, so it made sense to do so; and I think there are probably results in analysis that make this connection precise and explain why we should get the controllability result in the continuous case more or less for free).

As I said before, (4) will have to wait until later.

In addition to controllability, we have a notion of stabilizability, which means that we can influence all unstable modes of A. In other words, we can make sure that the system eventually converges to the origin (although not necessarily in finite time). Versions of criteria (2) and (4) exist for stabilizable systems. Criterion (2) becomes a requirement that no left eigenvector of A whose eigenvalue has non-negative real part is in the left null space of B. Criterion (4) becomes a requirement that there exist F such that A+BF has only eigenvalues with negative real part.


We say that a system is observable if, for any initial state x(0) and any control tape u(t), it is possible in finite time to infer x(0) given only u(t) and the output y(t). In particular, we are not given any information about the internal states x(t) of the system (except through y(t)), although it is assumed that A, B, C, and D are known. If we have a non-linear system

\dot{x} = f(x,u)

y = g(x,u)

then it is assumed that f and g are known.

It turns out that observability for a system is exactly the same as controllability for the dual system, so all the criteria from the previous section hold in a suitably dual form. One thing worth thinking about is why these results still hold for any control tape u(t).

(1) There is no non-zero subspace W of V such that A(W) \subset W and C(W) = 0. In other words, there is no space that doesn’t show up in the output and such that the natural dynamics of the system stay in that space.

(2) There is no right eigenvector of A that is in the right null space of C.

(3) The matrix \left[ \begin{array}{c} C \\ CA \\ CA^2 \\ \vdots \\ CA^{n-1} \end{array} \right] has full column rank.

(4) The eigenvalues of A+LC can be assigned arbitrarily by an appropriate choice of L.

Just as the matrix F from the previous section can be thought of as a linear feedback law that gives the system arbitrary eigenvalues, the matrix L is part of a feedback law for something called a Luenburger observer.

Also, just as there is stabilizability for a system, meaning that we can control all of the unstable modes, there is also detectability, which means that we can detect all of the unstable modes.

Luenburger Observers

An observer is a process that estimates the state of an observable system given information about its outputs. If a system is detectable, and L is such that A+LC has only eigenvalues with negative real part, then consider the system

\dot{q} = Aq+Bu+L(Cq+Du-y)

Using the fact that Du-y = -Cx, we see that

\dot{(q-x)} = (A+LC)(q-x), so that q-x decays exponentially to zero (by the assumption on the eigenvalues of A+LC. Thus the dynamical system above, which is called a Luenburger observer, will asymptotically approach the true state of a system given arbitrary initial conditions.

If a system is both controllable and observable, can we design an observer and a controller that working together successfully control the system? (This question is non-trivial because the controller has to use the estimated state from the controller, rather than the actual state of the system, for feedback.) The answer is no in general, but it is yes for linear systems.

Let $F$ be such that A+BF is stable and let L be such that A+LC is stable. (A matrix is stable if all of its eigenvalues have negative real part.) Now we will consider the system obtained by using L as a Luenburger observer and F as a linear feedback law. Let e := q-x. Then we have

\dot{e} = (A+LC)e

\dot{x} = Ax+BFq = (A+BF)x + BFe

In matrix form, this gives

\left[ \begin{array}{c} \dot{e} \\ \dot{x} \end{array} \right] = \left[ \begin{array}{cc} A+LC & 0 \\ BF & A+BF \end{array} \right] \left[ \begin{array}{c} e \\ x \end{array} \right].

Because of the block triangular form of the matrix, we can see that its eigenvalues are given by the eigenvalues of A+LC and A+BF. Since A+LC and A+BF are both stable, so is the matrix given above, so we can successfully stabilize the above system to the origin. Of course, this is weaker than full controllability. However, if we have full controllability and observability, then we can set the eigenvalues of the above matrix arbitrarily, which should imply full controllability (I haven’t sat down and proved this rigorously, though).

So, now we know how to stabilize a linear system if it is detectable and stabilizable. The main thing to take away from this is the fact that the poles of the coupled dynamics of state and observation error are exactly the eigenvalues of A+BF and A+LC considered individually.

State-space representations

The final topic I’d like to talk about in this post is state-space representations of transfer functions. It is here that I will prove all of the results that I promised to take care of later. There are plenty more topics in linear control theory, but I’ve been writing this post for a few days now and it’s at a good stopping point, so I’ll leave the rest of the topics for a later post.

A state-space representation of a transfer function is exactly what it sounds like. Given a transfer function P(s) from U to Y, find a state-space model

\dot{x} = f(x,u)

y = g(x,u)

that has P as a transfer function. We’ll be concerned with linear state-space representations only.

The first thing to note is that a linear state-space representation of P(s) can always be reduced to a smaller representation unless the representation is both controllable and observable (by just restricting to the controllable and observable subspace).

The next thing to note is that, since the transfer function of a state-space representation is C(sI-A)^{-1}B+D, a transfer function P(s) has an irreducible (in the sense of the preceding paragraph) linear state-space representation of degree n if and only if P(s) = \frac{q(s)}{r(s)}, where q and r are polynomials with \deg(q) \leq \deg(r) = n. Thus all controllable and observable linear state-space representations of P(s) have the same dimension, and therefore there exists some non-canonical vector space isomorphism such that we can think of any two such representations as living in the same state space (though possibly with different matrices A, B, C, and D).

Finally, if two state-space representations over the same vector space have the same transfer function, then one can be obtained from the other by a chance of coordinates. I will now make this more precise and also prove it.

Claim: Suppose that R_1 and R_2 are two (not necessarily linear) state-space representations with the same input-output mapping. If R_1 is controllable and R_2 is observable, then there is a canonical map from the state space of R_1 to the state space of R_2. If R_1 is observable, then this map is injective. If R_2 is controllable, then this map is surjective. If R_1 and R_2 are both linear representations, then the map is linear.

Proof: Let the two representations be \dot{x_1} = f_1(x_1,u), y_1 = g_1(x_1,u) and \dot{x_2} = f_2(x_2,u), y_2 = g_2(x_2,u).

Since R_1 is controllable, we can take an input tape that sends x_1 to an arbitrary state x at some time t_0. Then by looking at y_2 evolve under the same input tape, by the observability of R_2 we will eventually be able to determine x_2(t_0) uniquely. The canonical map sends the x we chose to x_2(t_0).The fact that y_1(t) = y_2(t) for all t guarantees that x_2(t_0) is well-defined (i.e., it doesn’t matter what u we choose to get there).

If R_2 is controllable, then we can choose a u that causes us to end up with whatever x_2(t_0) we choose, which implies that the map is surjective. Now for the purposes of actually computing the map, we can always assume that the control input becomes 0 once we get to the desired x_1(t_0). Then there is a one-to-one correspondence between possible output tapes after time t_0 and possible values of x_2(t_0). If R_1 is observable, this is also true for x_1(t_0), which implies injectivity. I will leave it to you to verify that the map is linear if both representations are linear.

Finally, I would like to introduce a special case of controllable canonical form and use it to prove criterion (4) about controllability. It will also show, at least in a special case, that any transfer function that is a quotient of two polynomials (where the denominator has at least as high degree as the numerator) has a linear state-space representation.

The special case is when U is one-dimensional. Then our transfer matrix can be written in the form

p(s) = \frac{\vec{c_1}s^{n-1}+\vec{c_2}s^{n-2}+\ldots+\vec{c_n}}{s^n+a_1s^{n-1}+\ldots+a_n}+\vec{d}

It turns out that this transfer function can be represented by the following transfer matrix:

A = \left[ \begin{array}{ccccc} -a_1 & -a_2 & \ldots   & -a_{n-1} & -a_n \\ 1 & 0 & \ldots & 0  & 0 \\ 0   & 1 & \ldots & 0 & 0 \\ \vdots & \vdots &   \ldots & \vdots & \vdots \\ 0 & 0 & \ldots & 1 &   0 \end{array} \right], B = \left[ \begin{array}{c} 1 \\ 0 \\ 0 \\   \vdots \\ 0 \end{array} \right]

C = \left[ \begin{array}{ccccc} \vec{c_1} & \vec{c_2} & \cdots & \vec{c_{n-1}} & \vec{c_n} \end{array} \right], D = \vec{d}

This might seem a bit contrived, but the construction for A is a nice trick for constructing a matrix with a given characteristic polynomial. Also note that $A$ will have a single Jordan block for each distinct eigenvalue (whose size is the number of times that eigenvalue appears in the list \lambda_1, \ldots, \lambda_n). One can show directly that this is a necessary and sufficient condition for being controllable by a single input.

I will leave it to you to check the details that the above state-space model actually has P(s) as a transfer function. (Bonus question: what is the equivalent observable canonical form for observable single-output systems?) I will wrap up this post by proving criterion (4) about controllability, as promised. I have reproduced it below for convenience:

(4) An LTI system is controllable if and only if we can assign the eigenvalues of A+BF arbitrarily by a suitable choice of F.

Proof: I will prove the “only if” direction, since that is the difficult direction. First consider the case when we have a single-input system. Then take the transfer function from u to x (this is the same as assuming that C = I, D = 0). By the result above and the assumption of controllability, there exists a system with the same transfer function in controllable canonical form, and thus there is a change of coordinates that puts our system in controllable canonical form. Once we are in canonical form, it is easy to see that by choosing F = \left[ \begin{array}{ccccc} -b_1 & -b_2 & \ldots & -b_{n-1} & -b_n \end{array} \right], we end up with a system whose characteristic polynomial is \lambda^n + (a_1+b_1)\lambda^{n-1} + \ldots + (a_{n-1}+b_{n-1})\lambda + (a_n+b_n). We can therefore give A+BF an arbitrary characteristic polynomial, and thus choose its eigenvalues arbitrarily.

This proves the desired result in the case when we have a single input to our system. When we have multiple inputs, we have to consider them one-by-one, and use the fact that linear feedback can’t affect the eigenvalues of the parts of the system that are outside the controllable subspace. I haven’t checked this approach very carefully, so it might not work, but I am pretty sure it can be made to work. If you want more details, feel free to ask me and I will provide them. At this point, though, I’m writing more of a treatise than a blog post, so I really think I should cut myself off here. I hope the exposition hasn’t suffered at all from this, but if it has, feel free to call me on it and I will clarify myself.

My next post will take a break from linear control and tell you why using least squares is one of the worst ideas ever (because you think it will work when it actually won’t; if you don’t believe me I’ll show you how negligible sampling errors can easily cause you to be off by 10 percent in your model parameters).