Hacker News
by aabhay 340 days ago
The main problem with the “Bitter Lesson” is that there’s something even bitter-er behind it: the “Harsh Reality” that while we may scale models on compute and data, simply inserting tons of data without any sort of curation yields essentially garbage models.

The “Harsh Reality” is that while you may only need data, the current best models and the companies behind them spend enormously on gathering high-quality labeled data with extensive oversight and curation. This curation is of course being partially automated as well, but ultimately there are billions or even tens of billions of dollars flowing into gathering, reviewing, and processing subjectively high-quality data.

Interestingly, at the time this paper was published, the harsh reality was not so harsh. For example, in tasks like face detection, (actual) next-word prediction, and other purely self-supervised models that were not instruction-tuned or “Chat” style, data was truly all you needed. You didn’t need “good” faces. As long as it was indeed a face, the data itself was enough. Now, it’s not. In order to make these machines useful and not just function approximators, we need extremely large dataset curation industries.

If you learned the bitter lesson, you better accept the harsh reality, too.

4 comments

So true. I recently wrote about how Merlin achieved magical bird identification not through better algorithms, but better expertise in creating great datasets: https://digitalseams.com/blog/what-birdsong-and-backends-can...

I think "harsh reality" is one way to look at it, but you can also take an optimistic perspective: you really can achieve great, magical experiences by putting in (what could be considered) unreasonable effort.

Thanks for the intro to Merlin! I just went outside my house and used it on 5 different types of birds, and it identified all of them. Relevant (possibly out of date) xkcd comic

[0] https://xkcd.com/1425/

Relevant - and old enough that those five years have been successfully granted!

I think your comment has some threads in common with Rodney Brooks' response: https://rodneybrooks.com/a-better-lesson/

While I agree with you, it’s worth noting that current LLM training uses a significant percentage of all available written data. The transition from GPT-2 era models to now (GPT-3+) saw the transition from novel models that can kinda imitate speech to models that can converse, write code, and use tools. It was only after the readily available data was exhausted that further gains came from curation and large amounts of synthetic data.

Transfer learning isn’t about “exhausting” all available un-curated data; it’s simply that the systems are large enough to support it. There’s not much reason to train on all available data. And it’s not all: there’s still very significant filtration happening. For example, they don’t train on petabytes of log files; that would just be terribly uninteresting data.

> The transition from GPT-2 era models to now (GPT-3+) saw the transition from novel models that can kinda imitate speech to models that can converse, write code, and use tools.

Which is fundamentally about data. OpenAI invested an absurd amount of money to get the human annotations to drive RLHF.

RLHF itself is a very vanilla reinforcement learning algo + some branding/marketing.

Another name for gathering and curating high-quality datasets is "science". One would hope that the "AI pioneer" USA would embrace this harsh reality and invest massively in basic science education and infrastructure. But we are seeing the opposite, and basically no awareness of this "harsh reality" amid the AI hype...