by g42gregory 1283 days ago
I think this bitter lesson needs to be taken with several grains of salt.

Number one, progress in a particular AI field tends to go, at first, from custom to more general algorithms, exactly as Professor Richard Sutton described. However, there is a second part to this progress: once we have "understood" (which we never really do) the new level of general algorithms (say, Transformers in NLP), we begin to put back in all the things we learned before (say, from linguistics, we put the bias towards compositionality and the corresponding tree structures back into the Transformers).

Number two, computationally scalable algorithms always win in environments where you have unlimited access to computation and data, i.e. if you are working for Google, Facebook, Alibaba, etc. In other companies, you have a limited computational budget and limited data. You could end up putting a lot of sophisticated inductive biases back into your DL algorithms.

2 comments

I don’t follow your critique for #2. SVMs, random forests, etc., aren’t the counterexample to Rich’s post (for anyone who knows him, Rich doesn’t even particularly _like_ neural networks). The counterexample is hand-crafted features.

A counterexample would be a number of successful cases, in computer vision say, where hand-crafted features do better than learned features. This is largely not the case. In both NLP and computer vision, learned features dominate, even at companies with less compute (they use pretrained models).

(Disclaimer: I work with Rich.)

Thank you for the good point! I edited the comment.
I like your edited version!
I think there's also a challenging line to draw about where defining a search space stops and where encoding knowledge begins. When you define attention modules for an LLM, encode a search heuristic into A*, or define a feature space for a random forest, you are encoding domain knowledge, i.e. adding bias in exchange for faster learning relative to an even more general model. At any given time, the best-performing computation-heavy techniques have embedded more structural knowledge than zero, while embedding much less than some experts believed necessary.
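
To make the A* case concrete, here is a rough sketch, not anyone's reference implementation. The grid, the `search` function, and both heuristics are made up for illustration. The same search loop runs twice on an open 20x20 grid: once with a heuristic that knows nothing (which reduces to uniform-cost search), and once with Manhattan distance, which bakes in the knowledge that moves are axis-aligned with unit cost.

```python
import heapq

def search(grid_size, start, goal, heuristic):
    """Best-first search on an open 4-connected grid.

    Returns (path_cost, number_of_expanded_nodes). The grid layout and
    this function are illustrative assumptions, not from the thread.
    """
    frontier = [(heuristic(start, goal), 0, start)]  # (f = g + h, g, node)
    best_cost = {start: 0}
    expanded = set()
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node in expanded:
            continue  # stale queue entry
        expanded.add(node)
        if node == goal:
            return cost, len(expanded)
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < grid_size and 0 <= nxt[1] < grid_size):
                continue  # off the grid
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                f = new_cost + heuristic(nxt, goal)
                heapq.heappush(frontier, (f, new_cost, nxt))
    return None, len(expanded)

def no_knowledge(node, goal):
    # Zero heuristic: no domain knowledge, search fans out in all directions.
    return 0

def manhattan(node, goal):
    # Encodes knowledge of the move set: axis-aligned steps of cost 1.
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

start, goal = (10, 10), (19, 10)
uninformed_cost, uninformed_expanded = search(20, start, goal, no_knowledge)
informed_cost, informed_expanded = search(20, start, goal, manhattan)
print(uninformed_cost, uninformed_expanded)
print(informed_cost, informed_expanded)
```

Both runs find the same shortest path cost, but the informed run expands far fewer nodes. The heuristic is only a few lines, yet it is exactly the kind of structural knowledge the general search loop does not have on its own.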