by BenoitEssiambre 5345 days ago
I think what the author is describing is simple overfitting.

http://en.wikipedia.org/wiki/Overfitting

It is quite a newbie mistake for a scientist to be surprised by it. It affects every kind of modelling.

I thought maybe this article would talk about why economic models are worse than other kinds of models. Applying scientific models to the economy raises a particular issue: even when good models are used to predict markets, trading on those models distorts the markets. When multiple parties use good models to compete in markets, they distort the markets in a way that destroys the predictive power of the models.

There is a great explanation by Glen Whitman of Agoraphilia that uses grocery line wait time predictions as a metaphor for this:

http://agoraphilia.blogspot.com/2005/03/doing-lines.html

See also:

http://lesswrong.com/lw/yv/markets_are_antiinductive/

http://en.wikipedia.org/wiki/Efficient-market_hypothesis


Alternatively, it may be simple information theory: a model that takes in 100 bits of specification simply cannot correctly describe a process that has 10,000 bits' worth of degrees of freedom. And that's before we talk about iteration over time, and before we get to the final killer you mention, which is when the models are ruined by their own application to the domain.

I think radical underspecification is much more likely than overspecification, really.

(Since I encounter this a lot, let me pre-answer one question in advance, which is "What if only 300 bits really matter and the rest don't matter as much?" and the answer is that the term bit in information theory encompasses that idea already. If you have ten "bits", but they tend to be highly correlated together such that they are usually all 0 or all 1, you in fact don't have ten bits in information theory. Ten bits are, by definition, ten fully-independent true or false values. Bits-in-memory are not the same as information-theory-bits. A real system with 10,000 bits can not, pretty much by definition, be modeled by 100 bits. If it could, it would be a system with only 100 bits in the first place. Information theory cares about the true degrees of freedom available, not about your particular representation of the system.)
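
To make the "ten correlated bits are not ten bits" point concrete, here is a minimal sketch that measures Shannon entropy over two toy sets of 10-bit patterns. The helper name `entropy_bits` and the two example distributions are my own illustration, not anything from the article.

```python
import math
from collections import Counter
from itertools import product

def entropy_bits(samples):
    """Shannon entropy (in bits) of the empirical distribution over samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Ten independent fair bits: all 2**10 patterns occur equally often.
independent = list(product((0, 1), repeat=10))

# Ten perfectly correlated bits: only all-zeros and all-ones ever occur.
correlated = [(0,) * 10, (1,) * 10]

independent_entropy = entropy_bits(independent)
correlated_entropy = entropy_bits(correlated)
print(independent_entropy, correlated_entropy)
```

Both cases occupy ten bits of memory, but only the first carries ten bits of information; the second carries one.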

Here's the thing: you're both right. It's both radically underspecified and overfitted. The information-theoretic argument demonstrates that a model cannot exactly match the reality unless it's as complex as the reality.

This article speaks of the separate problem that economic models are not evaluated in any sort of experiments, and thus are prone to overfitting. This makes them unlikely to even approximate well.

Consider a basic multilayer perceptron-style neural network. Overfitting is a well-understood problem in training an MLP. We work around it by training on a part of the data, and then measuring its accuracy on another part -- much as Carter did in his analysis. If the accuracy is poor, something is adjusted: the size of the hidden layer can be increased, the training set expanded, the duration of the training increased or decreased, or the MLP model discarded entirely.

If enlarging the training set or shortening the training improves accuracy against the test set, that tells us we had an overfitting problem.
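
The hold-out check can be sketched without an MLP at all. Below I swap in two much simpler models as stand-ins: a least-squares line and a 1-nearest-neighbour "memorizer" that reproduces its training data exactly. The data-generating process (y = 2x + 1 plus Gaussian noise), the seed, and all function names are assumptions made up for the illustration.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(points):
    """1-nearest-neighbour lookup: returns the y of the closest training x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, points):
    """Mean squared error of model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Hypothetical noisy process, split into a training half and a test half.
rng = random.Random(0)
data = []
for _ in range(400):
    x = rng.uniform(0, 10)
    data.append((x, 2 * x + 1 + rng.gauss(0, 1)))
train, test = data[:200], data[200:]

line = fit_line(train)
memorizer = fit_memorizer(train)

for name, model in (("line", line), ("memorizer", memorizer)):
    print(name, "train MSE:", mse(model, train), "test MSE:", mse(model, test))
```

The memorizer scores zero error on the data it was fitted to, so whatever error it shows on the held-out half is the overfitting signal. The line cannot reach zero on the training half, but it will usually transfer better.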

"It's both radically underspecified and overfitted."

He used a perfect model (of a hypothetical world) which had exactly the right parameters, and then he calibrated it using exactly correct data.

So I don't see how this could be underspecified or overfitted. Can you please explain?

"The information-theoretic argument demonstrates that a model cannot exactly match the reality unless it's as complex as the reality."

In this case he defined his model to be reality.

Those particular statements referred to some representative economic model, not the experiment in question. In the experiment in question, the model is fully specified by definition.

As far as overfitting goes, that applies when you have a parameterized general model and need to discover the correct parameters. You probably won't get the exact correct parameters; instead, you'll (hopefully) get parameters that approximate reality well.

More closely matching the training data can actually make it a worse approximation in the general case.

"The information-theoretic argument demonstrates that a model cannot exactly match the reality unless it's as complex as the reality."

What if reality is self-similar at certain scales? You could generate something that resembles the whole from one part of it.

Then the reality is, information-theoretically, less complex and you can use a less complex model to represent it.

Re. your eloquent bracket: the "true number of bits", as in "the most compact description of any given state of a system", is in general uncomputable (see Kolmogorov complexity), for any sufficiently powerful language of description.

If you require notions such as "the true minimum number of bits" to be practical, you have to put additional restrictions on the language by which you describe the system -- such as your probability model. The representation does matter.
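
A rough way to see why the representation matters: any practical "number of bits" measurement is relative to a chosen description language. The sketch below uses zlib as that language, purely as my own stand-in, so what it reports is only an upper bound on description length, not the Kolmogorov complexity.

```python
import random
import zlib

def compressed_bits(data: bytes) -> int:
    """Size in bits of data after zlib compression at the highest level."""
    return 8 * len(zlib.compress(data, 9))

# 10,000 bytes with an obvious repeating pattern.
patterned = b"01" * 5000

# 10,000 bytes from a seeded pseudo-random generator. A short program
# (generator + seed) describes this stream completely, but zlib's
# description language has no way to express that.
rng = random.Random(0)
pseudo_random = bytes(rng.getrandbits(8) for _ in range(10000))

raw_bits = 8 * 10000
patterned_bits = compressed_bits(patterned)
pseudo_random_bits = compressed_bits(pseudo_random)
print(raw_bits, patterned_bits, pseudo_random_bits)
```

The pseudo-random stream looks incompressible to zlib even though "seed 0 plus the generator" is a very short description of it. The measured bit count depends on which descriptions your language can express.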

You're right, and overfitting cannot be an explanation for this phenomenon -- when there are many equally valid alternative outcomes to a problem, which is what's being described, the solution is underdetermined by definition.

In the (ML) terms I'm used to, it is an error surface with many local minima. That is, if you start out with a guess for the parameters and try to progressively optimize the cost function to reach a point where the error is lowest (i.e. the gradient of the error is 0), where you end up is extremely dependent on where you start out. When you find a local minimum, you have found a point where there is no nearby point that is better, but there may be some other point (or many) somewhere else in the model that is better. The very best one is the global minimum.

This is a well known problem in ML for non-convex error functions, and there are various methods for trying to avoid local minima and reach a global minimum.
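
Here is a minimal sketch of that start-point dependence, using plain gradient descent on a one-dimensional non-convex function. The function f(x) = x^4 - 3x^2 + x, the learning rate, and the step count are arbitrary choices of mine for illustration.

```python
def f(x):
    """A non-convex 'error surface' with one local and one global minimum."""
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Follow the negative gradient from start for a fixed number of steps."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

from_left = gradient_descent(-2.0)
from_right = gradient_descent(2.0)

print("start -2.0 ends at", from_left, "with error", f(from_left))
print("start +2.0 ends at", from_right, "with error", f(from_right))
```

Both runs stop where the gradient is zero, but the run started on the right settles in the shallower basin and never finds the lower error available on the left.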

But this case is actually worse than that -- it is an error surface with many global minima. Each is effectively a perfect fit for the data to date, but gives different predictions about future data. Since each function is a perfect fit, it is literally impossible to predict the proper parameters. Which is what underspecification is.
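
A toy version of "many global minima", as a sketch under my own assumptions rather than the article's actual model: a two-parameter model where one parameter only matters in a regime the observations never reach. The model form, the "true" parameters (a=2, b=3), and the candidate list are all made up for the example.

```python
def model(x, a, b):
    """Hypothetical model: b only has an effect once x exceeds 1."""
    return a * x + b * max(0.0, x - 1.0)

# Observations generated by the "true" parameters a=2, b=3, but only
# covering x in [0, 1], where b has no effect.
observed = [(i / 10, model(i / 10, 2.0, 3.0)) for i in range(11)]

def sum_squared_error(a, b):
    """Total squared error of model(a, b) against the observations."""
    return sum((model(x, a, b) - y) ** 2 for x, y in observed)

candidates = [(2.0, -5.0), (2.0, 0.0), (2.0, 3.0), (2.0, 40.0)]
errors = [sum_squared_error(a, b) for a, b in candidates]
predictions = [model(2.0, a, b) for a, b in candidates]

for (a, b), err, pred in zip(candidates, errors, predictions):
    print(f"a={a}, b={b}: error={err}, prediction at x=2 is {pred}")
```

Every candidate fits the observations exactly, so no amount of optimizing against this data can pick out b=3, yet the predictions at x=2 range from -1 to 44.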

"A model that takes in 100 bits of specification simply cannot correctly describe a process that has 10,000 bits' worth of degrees of freedom."

If I'm correct, though, the OP is talking about creating a model with 100 bits of specification, and then creating a model of that model and trying to train those 100 bits, which seems like it should be a more tractable problem.

To me it sounds more like he's just rediscovered the fact that when you try to set a model's parameters based on a limited set of observations (he generated 3 years' worth of data from his model, then trained parameters based on that data), there's a lot of uncertainty left over, and you won't necessarily get the right model.

This is quite obvious - if your observations only cover a limited portion of phase space, then you shouldn't be surprised that in a complex enough model multiple parameterizations will fit the observations equally well. You just didn't have enough freaking data to distinguish between the models! In all branches of science, we deal with this problem, and the solution is that you try to find the simplest possible model that accurately explains your data (or, as is happening in physics right now, you try to enumerate the next level of theories that reproduce current data so that you can figure out which experiments you'll need to run to distinguish between them).

So this doesn't hint at any sort of fundamental flaw with modeling in general (and yeegads, it has even less to do with finance...) - it's just that he didn't have enough data to infer a proper parameterization. Don't build complex models and expect to train them on small datasets...

"I think what the author is describing is simple overfitting."

It doesn't look like overfitting to me. The input data is perfect, and the model is perfect, so it doesn't look like overfitting can occur.