Hacker News
by jschwartzi 2056 days ago
Polynomial fitting can be used to generate curves that fit any dataset perfectly, given a high enough degree. If you give me a data set with 150 points that are apparently randomly distributed throughout the sample space, I can give you a nice high-order polynomial that passes exactly through all 150 data points (degree 149 is always enough, as long as no two points share an x value).
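A minimal sketch of that claim, using Lagrange interpolation in exact rational arithmetic so that "perfectly" is literal. The names (`lagrange_eval`, `xs`, `ys`) and the choice of 20 integer-spaced points with random integer values are assumptions made for illustration; the same construction works for 150 points, it is just slower.

```python
import random
from fractions import Fraction

def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial of degree <= len(xs) - 1 through
    the points (xs[i], ys[i]) at an integer or Fraction x."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

random.seed(0)
n = 20
xs = list(range(n))                                  # distinct x values
ys = [random.randint(-100, 100) for _ in range(n)]   # arbitrary "data"

# Largest miss at any of the data points.
max_residual = max(abs(lagrange_eval(xs, ys, x) - y) for x, y in zip(xs, ys))
print(max_residual)  # prints 0
```

The residual is zero no matter what values go into `ys`, which is the point: a perfect fit here says nothing about whether the data follow any trend.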

Making any claims based on that kind of curve fitting is a huge red flag, especially if you don't discuss that.

One of the reasons we prefer lower-order fits to higher-order fits is that there is a very real risk of overfitting: the curve matches the data set at the sampled points but gives completely inaccurate results anywhere else. Seeing that kind of overfitting in a scientific paper without any justification suggests that the author of the paper is making statements without any basis in science.
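To make the low-order versus high-order trade-off concrete, here is a rough sketch. Everything in it is assumed for illustration: the "true" trend is the line y = 2x + 1, the ten samples sit at x = 0..9, and the noise is a deliberately simple alternating +0.5 / -0.5 so the result is reproducible. It compares a least-squares line against the degree-9 polynomial that hits every sample exactly.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def truth(x):
    return 2.0 * x + 1.0  # assumed underlying trend

xs = [float(i) for i in range(10)]
ys = [truth(x) + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

slope, intercept = fit_line(xs, ys)

# At the samples the high-order curve is perfect...
poly_max_residual = max(abs(lagrange_eval(xs, ys, x) - y) for x, y in zip(xs, ys))

# ...but between two samples it is far from the underlying trend.
x_new = 0.5
err_line = abs(slope * x_new + intercept - truth(x_new))
err_poly = abs(lagrange_eval(xs, ys, x_new) - truth(x_new))
print(poly_max_residual, err_line, err_poly)
```

With this setup the line misses the trend at x = 0.5 by roughly a tenth, while the exact-fit polynomial misses by several units, even though the noise itself was never larger than 0.5.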


Would saying it's like getting too many returns with a greedy regex be a close enough example non-mathy coders could grok?
It's more like someone is asking you to create a regex that matches phone numbers and they give you an example number for you to work with. You have the bright idea of writing the example number verbatim as your regex and voilà. It matches the sample perfectly, job done. In reality that regex can't be applied to anything else.
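The regex analogy can be sketched directly. The sample number and the NNN-NNN-NNNN format are hypothetical, picked only to illustrate the idea:

```python
import re

sample = "555-867-5309"  # hypothetical example number

# The "bright idea": the sample itself, escaped, used as the pattern.
overfit = re.compile(re.escape(sample))

# A more general rule, assuming NNN-NNN-NNNN is the intended format.
general = re.compile(r"\d{3}-\d{3}-\d{4}")

others = ["555-123-4567", "212-555-0199"]

overfit_on_sample = bool(overfit.fullmatch(sample))
overfit_on_others = [bool(overfit.fullmatch(n)) for n in others]
general_on_all = [bool(general.fullmatch(n)) for n in [sample] + others]

print(overfit_on_sample)   # True
print(overfit_on_others)   # [False, False]
print(general_on_all)      # [True, True, True]
```

The verbatim pattern scores perfectly on the one example it was built from and fails on every new number, which is the same failure mode as the exact-fit polynomial.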
No.

It's like saying that a rule with as many special cases as data points is not actually a general rule. It is just a collection of special cases that explains nothing.

More precisely, if you give a model enough parameters, you can always have it describe your data well. The question is whether it is likely to predict future data. And for a model with that many parameters, the answer is usually no.

It's more like, high-order polynomials are fundamentally wild animals. You can use a math trick to make them go through specified points, and to early statisticians & data analysts, this seemed like a good way to model a nonlinear trend based on a set of sample points. But that trick doesn't make polynomials tame. And it gets worse the more points you need to interpolate: you have to add a degree to the polynomial every time you want to go through another point. Each new degree makes the polynomial wilder and wilder outside of the points you're interpolating.
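A small sketch of the "wilder with every added point" behavior, under assumed conditions: the true trend is y = 2x + 1, samples sit at x = 0..n-1, and each sample carries an alternating error of +1/2 or -1/2. The polynomial through all n samples is then evaluated one step past the last sample, at x = n. Exact fractions are used so the numbers are not blurred by rounding.

```python
from fractions import Fraction

def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at integer x."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

def truth(x):
    return 2 * x + 1  # assumed underlying trend

def extrapolation_error(n):
    """Error of the degree n-1 interpolant one step beyond the samples."""
    xs = list(range(n))
    ys = [truth(x) + Fraction((-1) ** x, 2) for x in xs]  # alternating +-1/2
    x_out = n
    return abs(lagrange_eval(xs, ys, x_out) - truth(x_out))

for n in (4, 8, 12):
    print(n, extrapolation_error(n))
# prints:
# 4 15/2
# 8 255/2
# 12 4095/2
```

For this particular noise pattern on equally spaced points, the error just outside the data roughly doubles with every extra point the polynomial is forced through. Other noise patterns will give different numbers, but the sample errors never exceed 1/2 here and the extrapolated error is already in the thousands at 12 points.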

We later discovered that the tame functions that usually work well for interpolating between sample points are things called splines.

I don't think so. You can easily get too many returns without any overfitting at all.

If you want to make it simple, think of using a 12th order curve as being similar to picking whatever 12 points you want to match and then drawing lines from point to point.