by bunderbunder 2905 days ago

But they use it in different ways. For example, an ARMA model specifically looks for dependencies among the data points, so assuming the points themselves are iid would be absurd. In time series analysis, it's the model's residuals, not the source data, that you want to be independent and identically distributed.
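To make the residuals-versus-data point concrete, here is a minimal sketch using an AR(1) process, the simplest special case of ARMA. Everything in it is an illustrative assumption: the coefficient 0.8, the sample size, the seed, and the helper names `simulate_ar1`, `fit_ar1`, and `lag1_autocorr` are all made up for the example. Lag-1 autocorrelation is only a rough check on dependence (uncorrelated is weaker than iid), but it shows the contrast.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Simulate x[t] = phi * x[t-1] + e[t] with standard normal noise e[t]."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(n - 1):
        xs.append(phi * xs[-1] + rng.gauss(0.0, 1.0))
    return xs


def fit_ar1(xs):
    """Least-squares estimate of phi from regressing x[t] on x[t-1] (no intercept)."""
    num = sum(xs[t] * xs[t - 1] for t in range(1, len(xs)))
    den = sum(xs[t - 1] ** 2 for t in range(1, len(xs)))
    return num / den


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


# Hypothetical parameters chosen purely for illustration.
series = simulate_ar1(phi=0.8, n=20000, seed=42)
phi_hat = fit_ar1(series)

# Residuals: what is left after the model explains the dependence.
residuals = [series[t] - phi_hat * series[t - 1] for t in range(1, len(series))]

print("estimated phi:", round(phi_hat, 3))
print("lag-1 autocorrelation of source data:", round(lag1_autocorr(series), 3))
print("lag-1 autocorrelation of residuals:", round(lag1_autocorr(residuals), 3))
```

The source series should come out strongly autocorrelated while the residuals come out close to uncorrelated, which is the sense in which the independence assumption applies to the residuals rather than to the data.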

Also, in real-world statistical modeling, there's nuance. As with any assumption of a parametric model, the data not being iid doesn't mean that the model is 100% crap; it means that you can't draw specific conclusions about the quality of the model.

Which is fine, because maybe you don't care to draw those conclusions anyway. One of the key differences between machine learning and traditional statistical analysis is that in machine learning you aren't so worried about developing parsimonious models with well-defined parameters. You're typically just empirically interested in the model's predictive or descriptive utility. This difference isn't a result of one school being more principled and the other being more lackadaisical. It reflects differing goals: one approach was developed for use in scientific hypothesis testing, where your primary deliverable is (in the case of something like regression, anyway) the model's parameters, and its estimates are a means to evaluate those parameters. The other approach is used for modeling processes, where the primary deliverable is the estimates, and the parameters are a means to get those estimates.