| For the deep learning vertex, the OP states: "At the right vertex, we have Breiman's know-nothing approach—high-capacity models like neural nets, decision forests, and nonparametrics that will fit anything given enough data. This is engineering with less science (see these remarks). Deep learning people cluster here." The phrase "will fit anything given enough data" is misleading, and it is not an accurate description of how modern machine learning methods are used. "Fit" on its own is too vague to be useful; practitioners talk instead about "bias" and "variance". For any supervised method (one where the intended outputs are known), prediction error has three sources: bias, variance, and irreducible noise. Irreducible noise is the unpredictability in the data that no model can capture. Bias comes from wrong assumptions built into the model itself (e.g. the model is too simple). Variance is the model's sensitivity to small changes in the data it is trained on. High bias means the model is too simple to capture the structure in the data set (underfitting). High variance means the model is overfit to the data at hand and does not generalize to unseen data. In real-world problems there is typically a tradeoff between the two: reducing one tends to increase the other. Nevertheless, the goal of any supervised learning model is to keep both as low as possible. By splitting off a sizeable chunk (~10-20%) of the available data as a "test" set, training the model on the remaining "train" set, and then evaluating it on the "test" set, one can estimate how well the model will generalize to future "unseen" data, by whatever metric you want. By additionally plotting learning curves, one can crudely diagnose whether the model suffers from high bias or high variance. Hence the insinuation that machine learning blindly "fits" data as much as possible is false. 
Sophisticated (yet not difficult) methods both improve and estimate the model's ability to generalize to future unseen data (reducing variance), usually at the cost of some accuracy on the training data (increasing bias). I think the OP's objection is that such ML methods "know nothing". That is a trivial statement to make. I would rather turn the objection on its head and ask: "If our methods achieve acceptable estimated generalization on unseen data, do we need to know anything?" This reminds me of Alan Turing's arguments about machines passing the Turing Test versus the question "are they really human?". |
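The hold-out and learning-curve procedure described in the comment above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not anything from the OP: the data set (`y = sin(x)` plus noise), the two models, and all function names are made up for the example. A constant "predict the mean" model stands in for high bias and a 1-nearest-neighbour model stands in for high variance.

```python
import math
import random

def make_data(n, rng, noise=0.3):
    """Draw n points from y = sin(x) + Gaussian noise (a made-up toy problem)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, noise)) for x in xs]

def fit_mean(train):
    """High-bias model: ignore x and always predict the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_1nn(train):
    """High-variance model: copy the label of the nearest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of `model` on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def learning_curve(fit, train, test, sizes):
    """Train on growing prefixes of `train`; return (size, train MSE, test MSE)."""
    rows = []
    for n in sizes:
        model = fit(train[:n])
        rows.append((n, mse(model, train[:n]), mse(model, test)))
    return rows

rng = random.Random(0)
data = make_data(500, rng)
rng.shuffle(data)
n_test = len(data) // 5                      # hold out 20% as the "test" set
test, train = data[:n_test], data[n_test:]
sizes = [25, 50, 100, 200, 400]

curves = {name: learning_curve(fit, train, test, sizes)
          for name, fit in [("mean", fit_mean), ("1-NN", fit_1nn)]}

for name, rows in curves.items():
    print(name)
    for n, train_err, test_err in rows:
        print(f"  n={n:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Read crudely: when train and test error are both high and close together, the model is probably too simple (high bias); when train error is near zero but test error is much larger, the model is probably overfitting (high variance). In practice one would use a library routine for the split and the curves, but the logic is the same.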
These high-capacity models (neural nets, decision trees, boosting) do overfit like crazy and tend to be used as black boxes, without any domain knowledge. The key phrase in the OP's statement is "given enough data", because having tons of data is one of the best ways to combat overfitting (with enough data, variance becomes negligible). The fact that we can measure how much these models overfit, and take steps to regularize them, doesn't change the fact that deep learning, for example, is far more an engineering discipline than a mathematical or statistical one. None of this is a criticism of those areas: they are exciting areas of research precisely because there are so many unsolved problems and so many places where we are working without a solid understanding!
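To make the "given enough data" point concrete, here is a rough simulation, again on an invented toy problem with hypothetical names. It refits a fixed-capacity model (a 20-bin piecewise-constant regressor, chosen only for simplicity) on many independently drawn training sets and measures how much its prediction at a single point `x0` moves around. That spread is the variance term, and it should shrink as the training set grows.

```python
import math
import random
import statistics

TWO_PI = 2.0 * math.pi

def bin_index(x, n_bins):
    """Map x in [0, 2*pi] to one of n_bins equal-width bins."""
    return min(int(x / TWO_PI * n_bins), n_bins - 1)

def fit_binned(train, n_bins=20):
    """Piecewise-constant regressor: predict the mean label of x's bin."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in train:
        b = bin_index(x, n_bins)
        sums[b] += y
        counts[b] += 1
    overall = sum(sums) / len(train)         # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_index(x, n_bins)]

def sample_point(rng, noise):
    """One (x, y) pair from y = sin(x) + Gaussian noise (toy assumption)."""
    x = rng.uniform(0.0, TWO_PI)
    return x, math.sin(x) + rng.gauss(0.0, noise)

def prediction_variance(n_train, x0, rng, repeats=200, noise=0.3):
    """Variance of the prediction at x0 across independent training sets."""
    preds = []
    for _ in range(repeats):
        train = [sample_point(rng, noise) for _ in range(n_train)]
        preds.append(fit_binned(train)(x0))
    return statistics.pvariance(preds)

rng = random.Random(0)
x0 = 1.0
variances = {n: prediction_variance(n, x0, rng) for n in (40, 400, 4000)}
for n, v in variances.items():
    print(f"n={n:5d}  variance of prediction at x0: {v:.5f}")
```

Note the caveat built into this sketch: the model's capacity is held fixed while the data grows. If capacity is scaled up along with the data, as is common with neural nets, the variance need not vanish in the same way, and the bias of the binned model does not go away no matter how much data it sees.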