by md2be 2853 days ago
I agree with your sentiments, but there is a contribution the CS departments made that the statistics, math, and econ (as in econometrics) departments seem to have overlooked. I remember going to each of these departments in 2002 and asking them, why don’t we split the data sets to train and update the coefficients and automate the process? The answer was always the same: “that’s trivial and adds nothing to the field”.
2 comments

> why don’t we split the data sets to train and update the coefficients and automate the process.

What you just stated is just a pipeline. You can split the data, train on it, and automate the whole thing with tree ensembles that aren't boosted, if you're talking about doing it in parallel.

If you just mean splitting the data and running it as a batch process at different time intervals, you can do that with nonparametric Bayesian methods.

The CS contribution of creating deep learning, and having it be the most accurate algorithm for certain data domains, is pretty nice. But again, stats cares about a lot more than prediction.

I think that ML is very useful, but remember that forecasting is really not the main objective of econometric models.

Basically, forecasting implies you have a good handle on all properties of the relevant distributions, which in my opinion is a lost cause in social sciences (think external validity).

Instead, econometrics is nowadays mainly concerned with identifying causal effects using non-parametric or semi-parametric approaches. Basically, you can believably estimate the direction of some mechanism, but you probably never have the data or the model to make a good out-of-sample prediction. You can try, but it's basically implied that approaches that consistently estimate some marginal of a conditional expectation will NOT be that useful for predicting a whole stochastic process.

Also, using training and test sets kind of presupposes that your process is very stable. Otherwise the "test" set is not really a good test, is it? Again, in the social sciences these things are hard to argue. You usually want to generalize some mechanism from this industry to that industry, not find a good predictor within the same industry. Test sets are still drawn from the same data!
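
To make the stability point concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the two "industries" A and B, the slopes 2.0 and 0.5, and the helper names `simulate`, `fit_line`, and `mse`. It fits a line on industry A, checks it on a holdout from A, then applies it to B, where the mechanism has the same sign but a different size.

```python
import random


def simulate(rng, n, slope):
    """Draw n points from y = 1 + slope * x + noise (a hypothetical process)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1.0 + slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mse(a, b, xs, ys):
    """Mean squared prediction error of the fitted line on (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(1)
x_a, y_a = simulate(rng, 300, slope=2.0)  # industry A (assumed)
x_b, y_b = simulate(rng, 300, slope=0.5)  # industry B (assumed, weaker effect)

# Train on the first 200 points of A, hold out the last 100 points of A.
a, b = fit_line(x_a[:200], y_a[:200])
holdout_mse = mse(a, b, x_a[200:], y_a[200:])

# Apply the same fitted line to industry B, where the process differs.
transfer_mse = mse(a, b, x_b, y_b)

print(f"holdout MSE (same industry):   {holdout_mse:.2f}")
print(f"transfer MSE (other industry): {transfer_mse:.2f}")
```

The holdout error looks fine because the holdout comes from the same process as the training data. It says nothing about B, even though the direction of the effect carries over.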

ML is successful because in practice we DO care about prediction. This allows us to do all the cool things. Because econometrics/stats is so conservative and comes from a causal standpoint, people are just really shy about developing a model for prediction (not true everywhere, but that's the gist). For ML, the primary question is basically how well the thing predicts. When I first tried scikit-learn way back, I was so confused that it didn't offer standard errors or some other statistical measure. But then I saw how ingrained the in-sample, out-of-sample process is and I thought well - that's really useful.
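
A rough sketch of the two habits side by side, in plain Python rather than scikit-learn. The data, the seed, and the helper names `fit_ols` and `mse` are all assumptions for illustration. The stats habit is to report the coefficient with its standard error. The ML habit is to hold out data and report the out-of-sample error.

```python
import math
import random


def fit_ols(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b, standard error of b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)  # unbiased estimate of the residual variance
    return a, b, math.sqrt(sigma2 / sxx)


def mse(a, b, xs, ys):
    """Mean squared prediction error of the fitted line on (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data from an assumed process y = 1 + 2x + noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# Random 150/50 split into training and test indices.
idx = list(range(len(xs)))
rng.shuffle(idx)
train, test = idx[:150], idx[150:]
x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
x_te, y_te = [xs[i] for i in test], [ys[i] for i in test]

a, b, se_b = fit_ols(x_tr, y_tr)
in_sample_mse = mse(a, b, x_tr, y_tr)
out_of_sample_mse = mse(a, b, x_te, y_te)

print(f"stats view: slope = {b:.3f} (s.e. {se_b:.3f})")
print(f"ML view:    in-sample MSE = {in_sample_mse:.3f}, "
      f"out-of-sample MSE = {out_of_sample_mse:.3f}")
```

Both numbers come from the same fit. They just answer different questions: how precisely the slope is pinned down, versus how well the line predicts points it has not seen.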

tl;dr: Stats and ML have different objectives, but there is a lot to learn in stats for ML