You can fit a linear regression with just a few points. Technically, once you have one more data point than regressors it works, because you've assumed a linear relationship with normally distributed errors. You can also interpret the output, because the values of the regression coefficients tell you something: "a one-unit increase in X raises the expected Y by b units", or, in a logistic regression, "having a high score on X doubles the odds of outcome Y".
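To make the "few points" claim concrete, here is a minimal sketch of a one-regressor least-squares fit in plain Python. The data values and the helper names `fit_line` and `residual_dof` are made up for illustration. Note that with exactly one more point than regressors the line fits perfectly and nothing is left over to estimate the error, so in practice you want at least a few more points than that.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one regressor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_dof(n_points, n_regressors):
    """Degrees of freedom left for estimating the error variance."""
    return n_points - (n_regressors + 1)  # +1 for the intercept


# Made-up illustrative data, not from any real study.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

intercept, slope = fit_line(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print("residual dof with 3 points:", residual_dof(len(xs), 1))
print("residual dof with 2 points:", residual_dof(2, 1))
```

The slope is the interpretable part: it is the expected change in y per unit change in x, under the assumed linear structure.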
Also, because you've assumed a structure for the data, you can more easily test whether the data has deviated from that structure. The deviation can be data drift or single outliers. For example, GARCH models (regression-like time-series models) let the variance of the normally distributed error change over time, so you can detect different variance regimes.
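A crude way to see the "variance regime" idea is a rolling variance over the residuals of a fitted model. To be clear, this is not GARCH, just a much simpler stand-in. The residual values, the window size, and the 10x threshold are all arbitrary assumptions for the sketch.

```python
from statistics import pvariance


def rolling_variance(values, window):
    """Population variance of each consecutive window of the given length."""
    return [
        pvariance(values[end - window:end])
        for end in range(window, len(values) + 1)
    ]


# Made-up residuals: a calm stretch followed by a volatile one.
residuals = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2, 1.5, -1.8, 2.1, -1.4, 1.9, -2.2]

variances = rolling_variance(residuals, window=4)
baseline = variances[0]  # variance of the first (calm) window

# Flag windows whose variance is far above the baseline (threshold is arbitrary).
flagged = [i for i, v in enumerate(variances) if v > 10 * baseline]
print("rolling variances:", [round(v, 3) for v in variances])
print("windows flagged as a different regime:", flagged)
```

The check only works because a structure was assumed first: the residuals are supposed to have a stable variance, so a jump stands out.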
In short, they help a human understand and interpret data.
From what little I know, ML is not so good at that. But it has other advantages, and you don't always need or want the understanding. If you want to, e.g., detect ground cover in satellite images, then all you care about is valid outputs, not necessarily the relative importance of the near-infrared vs. red band.
And ML can beat regression models, for example by interpolating better or by better handling regions of the data space which violate the assumptions of the regression model.
So it is a tradeoff. Both approaches are highly performant, just at different tasks.
An example I have used a few times to highlight the difference between "understanding how a model gives predictions" and "model uses predictors/features in a way that makes sense" is a linear regression model from Bondarchuk, a famous throwing coach from the former USSR.
The linear regression was used to predict the distance expected in one of the throws, given the performances in other athletic tasks, say max squat, max power clean, max long jump and a few others.
Some of the regression coefficients were negative, which means that increasing performance in, say, long jump, leads to shorter shot-put (I don't remember which throw the model was for) distances.
The model and approach looked understandable and "weird" at the same time.
From a purely statistical perspective it makes sense, since those were the results coming from, I assume, maximum likelihood estimation. From a predictive-performance perspective, it surely worked in retrospect because it gave good predictions on past data, assuming there were separate training and test data (most likely there were not, but let's assume).
But from a future prediction perspective, i.e. the forecasting and thus the manipulation of training to obtain a certain performance, did it make sense? I am very confident it did not, because, among other things, the performances of auxiliary lifts/feats were not independent (you cannot work on a heavier one rep max in the power clean and hope or work toward a shorter long jump performance).
The model by itself might have been accurate, but treating it as interpretable and using it to guide changes in the training program would have been a quite naive mistake. This kinda mistake is quite common among people who think too much about the machinery of the model and way too little about the domain.
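Here is a small synthetic sketch of how a negative coefficient can show up on a predictor that is positively related to the outcome. Everything in it is invented: `x1` and `x2` stand in for two strongly correlated auxiliary performances, and `y` is constructed by hand. It says nothing about the actual Bondarchuk data.

```python
def centered_sums(a, b):
    """Sum of products of deviations from the means (n times the covariance)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))


def fit_two_regressors(x1, x2, y):
    """OLS slopes for y = b0 + b1*x1 + b2*x2, solved with Cramer's rule."""
    s11, s22 = centered_sums(x1, x1), centered_sums(x2, x2)
    s12 = centered_sums(x1, x2)
    s1y, s2y = centered_sums(x1, y), centered_sums(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


# Synthetic data: x2 tracks x1 closely, so the two cannot move independently.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]
y = [2 * a - b for a, b in zip(x1, x2)]  # constructed by hand, no noise

b1, b2 = fit_two_regressors(x1, x2, y)
slope_alone = centered_sums(x2, y) / centered_sums(x2, x2)

print(f"coefficient on x2 with x1 in the model: {b2:.3f}")
print(f"slope of y on x2 alone:                 {slope_alone:.3f}")
```

The coefficient on `x2` is negative only in the "holding `x1` fixed" sense, which is a situation that cannot be produced in practice when the two move together. On its own, `x2` goes up with `y`.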
> But from a future prediction perspective, i.e. the forecasting and thus the manipulation of training to obtain a certain performance, did it make sense?
Those are two different questions. For forecasting without manipulation of training it would still make sense. But it wouldn’t make sense for causal analysis.
That is a true difference and I should have been clearer. I should have specified that I don't believe they used any test data: the "study" had probably been done with no test data set, simply using all the data to estimate the regression parameters.
I believe that (1) the model made little "mechanistic" sense, (2) the forecasting accuracy of the model would have been low, (3) the model had good hind-casting accuracy only through overfitting, i.e. by modeling the "noise" in the data with too many predictors, and (4) the model could not have guided any training.
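The hind-casting vs. forecasting gap is easy to reproduce on simulated data. The sketch below is not the model discussed above. It assumes one real predictor plus nine pure-noise predictors, fits on only 14 points, and compares the error on the fitting data with the error on fresh data. All names, sizes, and the seed are arbitrary choices.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, ys):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def mse(rows, ys, coefs):
    """Mean squared prediction error of the fitted coefficients."""
    preds = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def make_data(rng, n, n_noise):
    """y depends on one real predictor; the other columns are pure noise."""
    rows, ys = [], []
    for _ in range(n):
        signal = rng.gauss(0, 1)
        noise_cols = [rng.gauss(0, 1) for _ in range(n_noise)]
        rows.append([1.0, signal] + noise_cols)  # leading 1.0 is the intercept
        ys.append(2.0 * signal + rng.gauss(0, 1))
    return rows, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
train_rows, train_y = make_data(rng, n=14, n_noise=9)
test_rows, test_y = make_data(rng, n=200, n_noise=9)

coefs = ols(train_rows, train_y)
train_mse = mse(train_rows, train_y, coefs)
test_mse = mse(test_rows, test_y, coefs)
print(f"hindcast (fitting data) MSE: {train_mse:.3f}")
print(f"forecast (fresh data) MSE:   {test_mse:.3f}")
```

With 11 coefficients and 14 points, the fit can absorb most of the noise in the fitting data, so the hindcast error looks much better than the error on data the model has not seen.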
> Because you've assumed a linear relationship with normally distributed errors.
Normality will give you stronger results, but generally linear regression doesn’t require normality. For OLS to be the best linear unbiased estimator (the Gauss-Markov theorem), you only need the errors to have zero mean and constant variance and to be uncorrelated.
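For reference, the usual textbook form of the Gauss-Markov conditions can be written as below. The notation ($X$ for the design matrix, $\varepsilon$ for the error vector) is my choice and does not come from the thread.

```latex
% Linear model
y = X\beta + \varepsilon

% Gauss-Markov conditions: zero-mean, homoscedastic, uncorrelated errors
\mathbb{E}[\varepsilon \mid X] = 0,
\qquad
\operatorname{Var}(\varepsilon \mid X) = \sigma^{2} I

% Under these conditions (and full column rank of X) the OLS estimator
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y
% is the best linear unbiased estimator of \beta.

% Adding normality is what buys exact finite-sample t and F tests:
\varepsilon \mid X \sim \mathcal{N}(0, \sigma^{2} I)
```

So normality is an extra assumption for inference in small samples, not a requirement for the fit itself.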
> how is it different from statistical forecasting
In statistical learning/forecasting, the researcher typically specifies the statistical model.
In machine learning, the statistical model is approximated by the algorithm.
Since an ML model needs to learn both the model form and the model parameters, it takes more data, and it also does not allow for understanding (since it does not output the form of the model it learned).