Hacker News new | ask | show | jobs
by jroesch 700 days ago
I think this is roughly correct. My 2c is that folks used the initial web data to cold start and bootstrap the first few models, but so much of the performance increase we have seen at smaller sizes is a shift towards more conscientious data creation/purchase/curation/preparation and more refined evaluation datasets. I think the idea of scraping random text except maybe for the initial language understanding pre-training phase will be diminished over time.

This is understood in the academic literature as well, as people months/years ago were writing papers that a smaller amount of high quality data, is worth more than a large amount of low quality data (which tracks with what you can pick up from an ML 101 education/training).