by gwern 1535 days ago
Humans produce an astonishing amount of text once you count all the source code, research data, social media, emails, etc., and project out a decade or two. There are also multimodal and RL data to consider as sources of 'tokens', such as visual tokens, which are in ~infinite supply. Text is great, but there is no reason to train only on text. It's just a good starting point.

But the real question you should be asking is: where would you get the compute to train a model that needs 216T tokens?
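
To put a rough number on that, here is a back-of-the-envelope sketch using the common rule of thumb that training compute is about 6 * parameters * tokens. Only the 216T token count comes from the comment above. The ~20 tokens-per-parameter ratio (a Chinchilla-style assumption) and the 1e15 FLOP/s sustained per accelerator are illustrative guesses, so treat the result as an order-of-magnitude estimate and nothing more.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ~ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def device_years(total_flops: float, flops_per_device: float) -> float:
    """Years a single device would need at a given sustained FLOP/s."""
    return total_flops / (flops_per_device * SECONDS_PER_YEAR)


n_tokens = 216e12                      # 216T tokens, from the comment
tokens_per_param = 20.0                # ASSUMPTION: Chinchilla-style ratio
n_params = n_tokens / tokens_per_param

flops = training_flops(n_params, n_tokens)

flops_per_device = 1e15                # ASSUMPTION: hypothetical sustained FLOP/s
years = device_years(flops, flops_per_device)

print(f"params: {n_params:.3g}")
print(f"training FLOPs: {flops:.3g}")
print(f"device-years at {flops_per_device:.0e} FLOP/s: {years:.3g}")
```

Under those assumptions the total lands on the order of 1e28 FLOPs, which is why compute, not data, looks like the binding constraint. Change the ratio or the per-device throughput and the answer moves accordingly.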