by vineyardmike 340 days ago
While I agree with you, it’s worth noting that current LLM training uses a significant percentage of all available written data. The transition from GPT-2 era models to now (GPT-3+) saw the transition from novel models that can kinda imitate speech to models that can converse, write code, and use tools. It was only after the readily available data was exhausted that further gains came from curation and large amounts of synthetic data.

Transfer learning isn’t about “exhausting” all available uncurated data; it’s simply that the systems are large enough to support it. There’s not much reason to train on all available data. And it isn’t all of it: there’s still very significant filtering happening. For example, they don’t train on petabytes of log files, because that would just be terribly uninteresting data.
> The transition from GPT-2 era models to now (GPT-3+) saw the transition from novel models that can kinda imitate speech to models that can converse, write code, and use tools.

Which is fundamentally about data. OpenAI invested an absurd amount of money to get the human annotations that drive RLHF.

RLHF itself is a very vanilla reinforcement learning algo + some branding/marketing.