I ended my last post on a somewhat dire note, claiming that least squares can do pretty terribly when fitting data. It turns out that things aren’t quite as bad as I thought, but most likely worse than you would expect.
The theme of this post is going to be things you use all the time (or at least, would use all the time if you were an electrical engineer), but probably haven’t ever thought deeply about. I’m going to include a combination of mathematical proofs and matlab demonstrations, so there should hopefully be something here for everyone.
My first topic is going to be, as promised, least squares curve fitting. I’ll start by talking about situations when it can fail, and also about situations when it is “optimal” in some well-defined sense. To do that, I’ll have to use some Fourier analysis, which will present a good opportunity to go over when frequency-domain methods can be very useful, when they can fail, and what you can try to do when they fail.
When Least Squares Fails
To start, I’m going to do a simple matlab experiment. I encourage you to follow along if you have matlab (if you have MIT certificates you can get it for free at http://matlab.mit.edu/).
Let’s pretend we have some simple discrete-time process, y(n+1) = a*y(n) + b*u(n), where y is the variable we care about and u is some input signal. We’ll pick a = 0.8, b = 1.0 for our purposes, and u is chosen to be a discrete version of a random walk. The code below generates the y signal, then uses least squares to recover a and b. (I recommend taking advantage of cell mode if you’re typing this in yourself.)
%% generate data
a = 0.8; b = 1.0;
N = 1000;
ntape = 1:N; y = zeros(N,1); u = zeros(N-1,1);
if rand < 0.02
u(n+1) = 1-u(n);
u(n+1) = u(n);
y(n+1) = a*y(n)+b*u(n);
%% least squares fit (map y(n) and u(n) to y(n+1))
A = [y(1:end-1) u]; b = y(2:end);
params = A\b;
afit = params(1)
bfit = params(2)
The results are hardly surprising (you get afit = 0.8, bfit = 1.0). For the benefit of those without matlab, here is a plot of y against n:
Now let’s add some noise to the signal. The code below generates noise whose size is about 6% of the size of the data (in the sense of L2 norm).
yn = y + 0.1*randn(N,1); % gaussian noise with standard deviation 0.2
A = [yn(1:end-1) u]; b = yn(2:end);
params = A\b;
afit = params(1)
bfit = params(2)
This time the results are much worse: afit = 0.7748, bfit = 1.1135. You might be tempted to say that this isn’t so much worse than we might expect — the accuracy of our parameters is roughly the accuracy of our data. The problem is that, if you keep running the code above (which will generate new noise each time), you will always end up with afit close to 0.77 and bfit close to 1.15. In other words, the parameters are systematically biased by the noise. Also, we should expect our accuracy to increase with more samples, but that isn’t the case here. If we change N to 100,000, we get afit = 0.7716, bfit = 1.1298. More samples will decrease the standard deviation of our answer (running the code multiple times will yield increasingly similar results), but not necessarily its correctness.
A more dire way of thinking about this is that increasing the number of samples will increase how “certain” we are of our answer, but it won’t change the fact that our answer is wrong. So we will end up being quite certain of an incorrect answer.
Why does this happen? It turns out that when we use least squares, we are making certain assumptions about the structure of our noise, and those assumptions don’t hold in the example above. In particular, in a model like the one above, least squares assumes that all noise is process noise, meaning that noise at one step gets propagated to future steps. Such noise might come from a system with unmodelled friction or some external physical disturbance. In contrast, the noise we have is output noise, meaning that the reading of our signal is slightly off. What the above example shows is that a model constructed via least squares will be systematically biased by output noise.
That’s the intuition, now let’s get into the math.When we do least squares, we are trying to solve some equation Ax=b for x, where A and b are both noisy. So we really have something like A+An and b+bn, where An and bn are the noise on A and b.
Before we continue, I think it’s best to stop and think about what we really want. So what is it that we actually want? We observe a bunch of data as input, and some more data as output. We would like a way of predicting, given the input, what the output should be. In this sense, then, the distinction between “input noise” (An) and “output noise” bn is meaningless, as we don’t get to see either and all they do is cause b to not be exactly Ax. (If we start with assumptions on what noise “looks like”, then distinguishing between different sources of noise turns out to be actually useful. More on that later.)
If the above paragraph isn’t satisfying, then we can use the more algebraic explanation that the noise An and bn induces a single random variable on the relationship between observed input and observed output. In fact, if we let A’=A+An then we end up fitting , so we can just define and have a single noise term.
Now, back to least squares. Least squares tries to minimize , that is, the squared error in the norm. If we instead have a noisy , then we are trying to minimize , which will happen when satisfies .
If there actually exists an such that (which is what we are positing, subject to some error term), then minimizing is achieved by setting to . Note that is what we would like to recover. So would be the solution to , and thus we see that an error introduces a linear error in our estimate of . (To be precise, the error affects our answer for via the operator .)
Now all this can be seen relatively easily by just using the standard formula for the solution to least squares: . But I find that it is easy to get confused about what exactly the “true” answer is when you are fitting data, so I wanted to go through each step carefully.
At any rate, we have a formula for how the error affects our estimate of , now I think there are two major questions to answer:
(1) In what way can systematically bias our estimate for ?
(2) What can we say about the variance on our estimate for ?
To calculate the bias on , we need to calculate , where stands for expected value. Since is invertible, this is the same as . In particular, we will get an unbiased estimate exactly when and are uncorrelated. Most importantly, when we have noise on our inputs then and will (probably) be correlated, and we won’t get an unbiased result.
How bad is the bias? Well, if A actually has a noise component (i.e. ), and e is , and we assume that our noise is uncorrelated with the constant matrix , then we get a correlation matrix equal to , which, assuming that and are uncorrelated, gives us . The overall bias then comes out to .
I unfortunately don’t have as nice of an expression for the variance, although you can of course calculate it in terms of , and .
At any rate, if noise doesn’t show up in the input, and the noise that does show up is uncorrelated with the input, then we should end up with no bias. But if either of those things is true, we will end up with bias. When modelling a dynamical system, input noise corresponds to measurement noise (your sensors are imperfect), while output noise corresponds to process noise (the system doesn’t behave exactly as expected).
One way we can see how noise being correlated with can lead to bias is if our “noise” is actually an unmodelled quadratic term. Imagine trying to fit a line to a parabola. You won’t actually fit the tangent line to the parabola, instead you’ll probably end up fitting something that looks like a secant. However, the exact slope of the line you pick will depend pretty strongly on the distribution of points you sample along the parabola. Depending on what you want the linear model for, this could either be fine (as long as you sample a distribution of points that matches the distribution of situations that you think you’ll end up using the model for), or very annoying (if you really wanted the tangent).
If you’re actually just dealing with a parabola, then you can still get the tangent by sampling symmetrically about the point you care about, but once you get to a cubic this is no longer the case.
As a final note, one reasonable way (although I’m not convinced it’s the best, or even a particularly robust way) of determining if a linear fit of your data is likely to return something meaningful is to look at the condition number of your matrix, which can be computed in matlab using the cond function and can also be realized as the square root of the ratio of the largest to the smallest eigenvalue of . Note that the condition number says nothing about whether your data has a reasonable linear fit (it can’t, since it doesn’t take into account). Rather, it is a measure of how well-defined the coefficients of such a fit would be. In particular, it will be large if your data is close to lying on a lower-dimensional subspace (which can end up really screwing up your fit). In this case, you either need to collect better data or figure out why your data lies on a lower-dimensional subspace (it could be that there is some additional structure to your system that you didn’t think about; see point (3) below about a system that is heavily damped).
I originally wanted to write down a lot more about specific ways that noise can come into the picture, but I haven’t worked it all out myself, and it’s probably too ambitious a project for a single blog post anyways. So instead I’m going to leave you with a bunch of things to think about. I know the answers to some of these, for others I have ideas, and for others I’m still trying to work out a good answer.
1) Can anything be done to deal with measurement noise? In particular, can anything be done to deal with the sort of noise that comes from encoders (i.e., a discretization of the signal)?
2) Is there a good way of measuring when noise will be problematic to our fit?
3) How can we fit models to systems that evolve on multiple time-scales? For example, an extremely damped system such as
where M >> c. You could take, for example, , , in which case the system behaves almost identically to the system
with set to the derivative of . Then the data will all lie almost on a line, which can end up screwing up your fit. So in what exact ways can your fit get screwed up, and what can be done to deal with it? (This is essentially the problem that I’m working on right now.)
4) Is there a way to defend against non-linearities in a system messing up our fit? Can we figure out when these non-linearities occur, and to what extent?
5) What problems might arise when we try to fit a system that is unstable or only slightly stable, and what is a good strategy for modelling such a system?
When Least Squares Works
Now that I’ve convinced you that least squares can run into problems, let’s talk about when it can do well.
As Paul Christiano pointed out to me, when you have some system where you can actually give it inputs and measure the outputs, least squares is likely to do a fairly good job. This is because you can (in principle) draw the data you use to fit your model from the same distribution as you expect to encounter when the model is used in practice. However, you will still run into the problem that failure to measure the input accurately introduces biases. And no, these biases can’t be eradicated completely by averaging the result across many samples, because the bias is always a negative definite matrix applied to (the parameters we are trying to find), and any convex combination of negative definite matrices will remain negative definite.
Intuitively, what this says is that if you can’t trust your input, then you shouldn’t rely on it strongly as a predictor. Unfortunately, the only way that a linear model knows how to trust something less is by making the coefficient on that quantity “smaller” in some sense (in the negative definite sense here). So really the issue is that least squares is too “dumb” to deal with the issue of measurement error on the input.
But I said that I’d give examples of when least squares works, and here I am telling you more about why it fails. One powerful and unexpected aspect of least squares is that it can fit a wide variety of non-linear models. For example, if we have a system , then we just form a matrix and , where for example is actually a column vector where the th row is the cosine of the th piece of input data. This will often be the case in physical systems, and I think is always the case for systems solved via Newton’s laws (although you might have to consolide parameters, for example fitting both and in the case of a pendulum). This isn’t necessarily the case for reduced models of complicated systems, for example the sort of models used for fluid dynamics. However, I think that the fact that linear fitting techniques can be applied to such a rich class of systems is quite amazing.
There is also a place where least squares not only works but is in some sense optimal: detecting the frequency response of a system. Actually, it is only optimal in certain situations, but even outside of those situations it has many advantages over a standard discrete Fourier transform. To get into the applications of least squares here, I’m going to have to take a detour into Fourier analysis.
If you already know Fourier analysis, you can probably skip most of this section (although I recommend reading the last two paragraphs).
Suppose that we have a sequence of signals at equally spaced points in time. Call this sequence , , , . We can think of this as a function , or, more accurately, . For reasons that will become apparent later, we will actually think of this as a function .
This function is part of the vector space of all functions from to . One can show that the functions on defined by
with ranging from to , are all orthogonal to each other, and thus form a basis for the space of all functions from to (now it is important to use since the take on complex values). It follows that our function can be written uniquely in the form , where the are constants. Now because of this we can associate with each a function given by .
An intuitive way of thinking about this is that any function can be uniquely decomposed as a superposition of complex exponential functions at different frequencies. The function is a measure of the component of at each of these frequencies. We refer to as the Fourier transform of .
While there’s a lot more that could be said on this, and I’m tempted to re-summarize all of the major results in Fourier analysis, I’m going to refrain from doing so because there are plenty of texts on it and you can probably get the relevant information (such as how to compute the Fourier coefficients, the inverse Fourier transform, etc.) from those. In fact, you could start by checking out Wikipedia’s article. It is also worth noting that the Fourier transform can be computed in time using any one of many “fast Fourier transform” algorithms (fft in matlab).
I will, however, draw your attention to the fact that if we start with information about at times , then we end up with frequency information at the frequencies . Also, you should really think of the frequencies as wrapping around cyclically (frequencies that differ from each other by a multiple of are indistinguishable on the interval we sampled over), and also if is real-valued then , where the bar means complex conjugate and is, as just noted, the same as .
A final note before continuing is that we could have decomposed into a set of almost any frequencies (as long as they were still linearly independent), although we can’t necessarily do so in time. We will focus on the set of frequencies obtained by a Fourier transform for now.
When Fourier Analysis Fails
The goal of taking a Fourier transform is generally to decompose a signal into component frequencies, under the assumption that the signal itself was generated by some “true” superposition of frequencies. This “true” superposition would best be defined as the frequency spectrum we would get if we had an infinitely long continuous tape of noise-free measurements and then took the continuous Fourier transform.
I’ve already indicated one case in which Fourier analysis can fail, and this is given by the fact that the Fourier transform can’t distinguish between frequencies that are separated from each other by multiples of . In fact, what happens in general is that you run into problems when your signal contains frequencies that move faster than your sampling rate. The rule of thumb is that your signal should contain no significant frequency content above the Nyquist rate, which is half the sampling frequency. One way to think of this is that the “larger” half of our frequencies (i.e. up through ) are really just the negatives of the smaller half of our frequencies, and so we can measure frequencies up to roughly before different frequencies start to run into each other.
The general phenomenon that goes on here is known as aliasing, and is the same sort of effect as what happens when you spin a bicycle wheel really fast and it appears to be moving backwards instead of forwards. The issue is that your eye only samples at a given rate and so rotations at speeds faster than that appear the same to you as backwards motion. See also this image from Wikipedia and the section in the aliasing article about sampling sinusoidal functions.
The take-away message here is that you need to sample fast enough to capture all of the actual motion in your data, and the way you solve aliasing issues is by increasing the sample rate.
A trickier problem is the “windowing” problem, also known as spectral leakage. [Note: I really recommend reading the linked wikipedia article at some point, as it is a very well-written and insightful treatment of this issue.] The problem can be summarized intuitively as follows: nearby frequencies will “bleed into” each other, and the easiest way to reduce this phenomenon is to increase your sample time. Another intuitive statement to this effect is that the extent to which you can distinguish between two nearby frequencies is roughly proportional to the number of full periods that you observe of their difference frequency. I will make both of these statements precise below. First, though, let me convince you that spectral leakage is relevant by showing you what the Fourier transform of a periodic signal looks like when the period doesn’t fit into the sampling window. The first image below is a plot of y=cos(t), and the second is a snapshot of part of the Fourier transform (blue is real part, green is imaginary part). Note that the plot linearly interpolates between sample points. Also note that the sampling frequency was 100Hz, although that is almost completely irrelevant.
The actual frequency content should be a single spike at , so windowing can in fact cause non-trivial issues with your data.
Now let’s get down to the actual analytical reason for the windowing / spectral leakage issue. Recall the formula for the Fourier transform: . Now suppose that is a complex exponential with some frequency , i.e. . Then some algebra will yield the formula
which tells us the extent to which a signal at a frequency of will incorrectly contribute to the estimate of the frequency content at . The main thing to note here is that larger values of will cause this function to become more concentrated horizontally, which means that, in general (although not necessarily at a given point), it will become smaller. At the same time, if you change the sampling rate without changing the total sampling time then you won’t significantly affect the function. This means that the easiest way to decrease windowing is to increase the amount of time that you sample your signal, but that sampling more often will not help you at all.
Another point is that spectral leakage is generically roughly proportional to the inverse of the distance between the two frequencies (although it goes to zero when the difference in frequencies is close to a multiple of ), which quantifies the earlier statement about the extent to which two frequencies can be separated from each other.
Some other issues to keep in mind: the Fourier transform won’t do a good job with quasi-periodic data (data that is roughly periodic with a slowly-moving phase shift), and there is also no guarantee that your data will have good structure in the frequency domain. It just happens that this is in theory the case for analytic systems with a periodic excitation (see note (1) in the last section of this post — “Answers to Selected Exercises” — for a more detailed explanation).
When Fourier Analysis Succeeds
Despite issues with aliasing and spectral leakage, there are some strong points to the Fourier transform. The first is that, since the Fourier transform is an orthogonal map, it does not amplify noise. More precisely, , so two signals that are close together have Fourier transforms that are also close together. This may be somewhat surprising since normally when one fits parameters to a signal of length , there are significant issues with overfitting that can cause noise to be amplified substantially.
However, while the Fourier transform does not amplify noise, it can concentrate noise. In particular, if the noise has some sort of quasi-periodic structure then it will be concentrated over a fairly small range of frequencies.
Note, though, that the L2 norm of the noise in the frequency domain will be roughly constant relative to the number of samples. This is because, if is the “true” signal and is the measured signal, then , so that . Now also note that the number of frequency measurements we get out of the Fourier transform within a fixed band is proportional to the sampling time, that is, it is . If we put these assumptions together, and also assume that the noise is quasi-periodic such that it will be concentrated over a fixed set of frequencies, then we get noise distributed in the L2 sense over frequencies, which implies that the level of noise at a given frequency should be . In other words, sampling for a longer time will increase our resolution on frequency measurements, which means that the noise at a given frequency will decrease as the square-root of the sampling time, which is nice.
My second point is merely that there is no spectral leakage between frequencies that differ by multiples of , so in the special case when all significant frequency content of the signal occurs at frequencies that are multiples of and that are less than all problems with windowing and aliasing go away and we do actually get a perfect measure of the frequency content of the original signal.
Least Squares as a Substitute
The Fourier transform gives us information about the frequency content at . However, this set of frequencies is somewhat arbitrary and might not match up well to the “important” frequencies in the data. If we have extra information about the specific set of frequencies we should be caring about, then a good substitute for Fourier analysis is to do least squares fitting to the signal as a superposition of the frequencies you care about.
In the special case that the frequencies you care about are a subset of the frequencies provided by the Fourier transform, you will get identical results (this has to do with the fact that complex exponentials at these frequencies are all orthogonal to each other).
In the special case that you exactly identify which frequencies occur in the signal, you eliminate the spectral leakage problem entirely (it still occurs in theory, but not between any of the frequencies that actually occur). A good way to do this in the case of a dynamical system is to excite the system at a fixed frequency so that you know to look for that frequency plus small harmonics of that frequency in the output.
In typical cases least squares will be fairly resistant to noise unless that noise has non-trivial spectral content at frequencies near those being fit. This is almost tautologically true, as it just says that spectral leakage is small between frequencies that aren’t close together. However, this isn’t exactly true, as fitting non-orthogonal frequencies changes the sort of spectral leakage that you get, and picking a “bad” set of frequencies (usually meaning large condition number) can cause lots of spectral leakage even between far apart frequencies, or else drastically exacerbate the effects of noise.
This leads to one reason not to use least squares and to use the Fourier transform instead (other than the fact that the Fourier transform is more efficient in an algorithmic sense at getting data about large sets of frequencies — instead of ). The Fourier transform always has a condition number of , whereas least squares will in general have a condition number greater than , and poor choices of frequencies can lead to very large condition numbers. I typically run into this problem when I attempt to gain lots of resolution on a fixed range of frequencies.
This makes sense, because there are information-theoretic limits on the amount of frequency data I can get out of a given amount of time-domain data, and if I could zoom in on a given frequency individually, then I could just do that for all frequencies one-by-one and break the information theory bounds. To beat these bounds you will have to at least implicitly make additional assumptions about the structure of the data. However, I think you can probably get pretty good results without making too strong of assumptions, but I unfortunately don’t personally know how to do that yet.
So to summarize, the Fourier transform is nice because it is orthogonal and can be computed quickly. Least squares is nice because it allows you to pick which frequencies you want and so gives you a way to encode additional information you might have about the structure of the signal.
Some interesting questions to ask:
(1) What does spectral leakage look like for non-orthogonal sets of frequencies? What do the “bad” cases look like?
(2) What is a good set of assumptions to make that helps us get better frequency information? (The weaker the assumption and the more leverage you get out of it, the better it is.)
(3) Perhaps we could try something like: “pick the smallest set of frequencies that gives us a good fit to the data”. How could we actually implement this in practice, and would it have any shortcomings? How good would it be at pulling weak signals out of noisy data?
(4) What in general is a good strategy for pulling a weak signal out of noisy data?
(5) What is a good way of dealing with quasi-periodic noise?
(6) Is there a way to deal with windowing issues, perhaps by making statistical assumptions about the data that allows us to “sample” from possible hypothetical continuations of the signal to later points in time?
To summarize, I would say the following:
- good when you get to sample from a distribution of inputs that matches the actual distribution that you’re going to deal with in practice
- bad due to systematic biases when noise is correlated with signal (usually occurs with “output noise” in the case of a dynamical system)
- good for getting a large set of frequency data
- good because of small condition number
- can fail due to aliasing
- also can be bad due to spectral leakage, which can be dealt with by using least squares if you have good information about which frequencies are important
Answers to selected exercises
Okay well mainly I just feel like some of the questions that I gave as exercises are important enough that you should know the answer. There isn’t necessarily a single answer, but I’ll at least give you a good way of doing something if I know of one. I’ve added a fold so you can avoid spoilers. It turns out that for this post I only have one good answer, which is about dealing with non-linear dynamical systems.
Read more of this post