by vitaminj 5723 days ago
In statistics, you're supposed to specify a statistical model before running regressions on the data. But quite a few papers I've read (especially in finance) seem to go the other way around:

They run regressions on a data set, adding and subtracting independent variables until the t values and standard errors start looking good.

Then they construct the linear model, invoke the Gauss-Markov assumptions and sometimes (though not always) try to explain the causal relationship between the variables.

This is obviously very wrong: once the variables have been chosen by looking at the fit, nobody has any clue what the distribution of the least squares estimators for these models is, so the reported t values and standard errors no longer mean what they claim to. I've seen plenty of examples of this, and it is enough to void the results of a paper (even if the model they come up with is somewhat plausible).
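
A quick way to see the problem is a simulation on pure noise. The sketch below is only an illustration under made-up assumptions (the function names, sample sizes, and the choice of 20 candidate variables are all hypothetical, not taken from any paper): y is unrelated to every candidate regressor, each candidate is tried in a simple one-variable regression, and only the best-looking t value is kept. The cutoff of 2.01 is roughly the two-sided 5% critical value for 48 degrees of freedom.

```python
import math
import random


def slope_t_stat(x, y):
    """t statistic for the slope in a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    # Residual variance with n - 2 degrees of freedom (intercept + slope).
    resid_var = (syy - slope * sxy) / (n - 2)
    return slope / math.sqrt(resid_var / sxx)


def false_positive_rate(n_candidates, n_obs=50, n_trials=500,
                        t_crit=2.01, seed=0):
    """Share of pure-noise data sets where the best candidate looks significant.

    All numbers here are illustrative assumptions, not values from the post.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # y is independent noise: no candidate has any real effect on it.
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        best_abs_t = 0.0
        for _ in range(n_candidates):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]
            best_abs_t = max(best_abs_t, abs(slope_t_stat(x, y)))
        if best_abs_t > t_crit:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print("one pre-specified regressor:", false_positive_rate(1))
    print("best of 20 candidates:     ", false_positive_rate(20))
```

With one regressor chosen in advance, the rejection rate should sit near the nominal 5%. With the best of 20 independent candidates it should be closer to 1 - 0.95^20, about 64%, even though nothing in the data is real. The t value reported for the selected variable is being compared against the wrong distribution.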

2 comments

In practice that's fairly common in all areas of science. You look for patterns in data and infer a relationship/equation/etc. Of course, you are supposed to confirm that it actually holds in new data / subsequent experiments.

Widespread use of data-mining software does make it much easier to do dodgy things on a wide scale.

There's nothing wrong with looking at some of the data first per se, provided you do not use the same data to draw conclusions, i.e., have a training and a test data set.
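
The training/test idea can be sketched on pure noise. This is a toy setup with hypothetical names and sizes, not anyone's actual procedure: the variable is picked on the first half of the data, and the t test is then run only on the second half, which played no part in the selection.

```python
import math
import random


def slope_t_stat(x, y):
    """t statistic for the slope in a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    resid_var = (syy - slope * sxy) / (n - 2)
    return slope / math.sqrt(resid_var / sxx)


def holdout_false_positive_rate(n_candidates=20, n_obs=100, n_trials=500,
                                t_crit=2.01, seed=0):
    """Select a regressor on the training half, test it on the held-out half.

    Sizes and the 2.01 cutoff (about 5% two-sided, 48 d.f.) are assumptions
    chosen for illustration.
    """
    rng = random.Random(seed)
    half = n_obs // 2
    hits = 0
    for _ in range(n_trials):
        # Pure noise: y is unrelated to every candidate regressor.
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        candidates = [[rng.gauss(0, 1) for _ in range(n_obs)]
                      for _ in range(n_candidates)]
        # Exploration: pick the best-looking variable using training rows only.
        chosen = max(candidates,
                     key=lambda x: abs(slope_t_stat(x[:half], y[:half])))
        # Confirmation: test that single variable on rows not used above.
        if abs(slope_t_stat(chosen[half:], y[half:])) > t_crit:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print("false positive rate on held-out data:",
          holdout_false_positive_rate())
```

Because the held-out rows are independent of the selection step, the rejection rate on them should come back to roughly the nominal 5%, at the cost of using only half the observations for the final test.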