The main problem with the “Bitter Lesson” is that there’s something even more bitter behind it: the “Harsh Reality” that, while we may scale models on compute and data, simply pouring in tons of data without any curation yields essentially garbage models. The “Harsh Reality” is that while you may only need data, the current best models and the companies behind them spend enormously on gathering high-quality labeled data with extensive oversight and curation. This curation is of course being partially automated as well, but ultimately there are billions or even tens of billions of dollars flowing into gathering, reviewing, and processing subjectively high-quality data. Interestingly, at the time this paper was published, the harsh reality was not so harsh. For example, in things like face detection, (actual) next-word prediction, and other purely self-supervised models that were not instruction-tuned or “Chat” style, data was truly all you needed. You didn’t need “good” faces. As long as it was indeed a face, the data itself was enough. Now, it’s not. To make these machines useful and not just function approximators, we need extremely large dataset curation industries. If you learned the bitter lesson, you’d better accept the harsh reality, too.
I think "harsh reality" is one way to look at it, but you can also take an optimistic perspective: you really can achieve great, magical experiences by putting in (what could be considered) unreasonable effort.