by timpetri 1536 days ago
Question related to the Chinchilla paper[0], which says that the optimal amounts of training data for ~500B, 1T, and 10T parameter models are 11T, 21.2T, and 216.2T tokens, respectively. The PaLM paper[1] says it was trained on 780B tokens.

How many tokens of training data have humans produced across the entire internet, all our written works, etc? Is there such a thing as a 216 trillion token set?

[0] https://arxiv.org/abs/2203.15556 [1] https://arxiv.org/abs/2204.02311

1 comment

Humans produce an astonishing amount of text if you consider all the source code, research data, social media, emails, etc., and project out a decade or two. There are also multimodal and RL sources of 'tokens' to consider, such as visual tokens, which have ~infinite data. Text is great, but there is no reason to train only on text. It's just a good starting point.

But the real question you should be asking is: where would you get the compute to train a model that needs 216T tokens?
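
To put a rough number on that, here is a back-of-the-envelope sketch using the common rule of thumb that training compute is about C ≈ 6·N·D FLOPs (N parameters, D tokens). The function names, the cluster size, and the per-device throughput are all assumptions made up for illustration, not figures from either paper, and the ~500B entry is rounded as in the question.

```python
# Rough training-compute estimate, assuming C ≈ 6 * N * D
# (FLOPs ≈ 6 x parameters x training tokens). An approximation only.

SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs via the 6*N*D rule of thumb."""
    return 6.0 * params * tokens


def training_days(flops: float, devices: int, flops_per_device: float) -> float:
    """Wall-clock days if every device sustained flops_per_device FLOP/s."""
    return flops / (devices * flops_per_device) / SECONDS_PER_DAY


# (parameters, tokens) pairs as quoted in the question; ~500B is rounded.
MODELS = {
    "~500B": (500e9, 11.0e12),
    "1T": (1e12, 21.2e12),
    "10T": (10e12, 216.2e12),
}

# Hypothetical cluster: 10,000 devices at a sustained 1e15 FLOP/s each.
# Both numbers are illustrative assumptions, not real hardware specs.
DEVICES = 10_000
FLOPS_PER_DEVICE = 1e15

if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        c = training_flops(n, d)
        days = training_days(c, DEVICES, FLOPS_PER_DEVICE)
        print(f"{name:>6}: {d / n:5.1f} tokens/param, "
              f"{c:.2e} FLOPs, ~{days:,.0f} days on the assumed cluster")
```

Under these assumptions the 10T model with 216.2T tokens comes out to roughly 1.3e28 FLOPs, which is decades of wall-clock time even on the generous hypothetical cluster above. The exact figures shift with hardware and utilization, but the order of magnitude is the point.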