by oersted
519 days ago
Well, LLMs are also remarkably good at generalizing. Look at the datasets: they don't literally train on every conceivable type of question the user might ask; the LLM can adapt just as you can. The actual challenge on the way to general intelligence is that LLMs struggle with certain types of questions even if you *do* train them on millions of examples of that type of question. These are mostly questions that require complex logical reasoning, although consistent progress is being made in this direction.
|
I'm serious. We don't have the datasets. But we do know the size of the datasets. And the sizes suggest incredible amounts of information.
Take an estimate of 100 tokens ~= 75 words[0]. What is a trillion tokens? Well, that's 750bn words. There are approximately 450 words on a page[1]. So that's ~1.67bn pages! If we put that in 500-page books, that would come out to ~3.33 million books!
Llama 3 has a pretraining size of 15T tokens[2] (this does not include post-training data, so more info is added later). So that comes to ~50m books. Then, keep in mind that this data is filtered and deduplicated. Even considering a high failure rate in deduplication, this is an unimaginable amount of information.
[0] https://help.openai.com/en/articles/4936856-what-are-tokens-...
[1] https://wordcounter.net/words-per-page
[2] https://ai.meta.com/blog/meta-llama-3/
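For anyone who wants to check the back-of-the-envelope math above, here is a minimal sketch. The ratios are just the rough estimates from [0] and [1] plus the 500-page book length assumed in the comment, and the function name `tokens_to_books` is made up for illustration.

```python
# Rough conversion from token counts to "books".
# Every ratio here is an estimate quoted in the comment, not a measurement.
WORDS_PER_TOKEN = 75 / 100   # ~75 words per 100 tokens [0]
WORDS_PER_PAGE = 450         # approximate words on a page [1]
PAGES_PER_BOOK = 500         # book length assumed in the comment


def tokens_to_books(tokens: float) -> float:
    """Convert a token count into an approximate number of 500-page books."""
    words = tokens * WORDS_PER_TOKEN
    pages = words / WORDS_PER_PAGE
    return pages / PAGES_PER_BOOK


if __name__ == "__main__":
    examples = [
        ("1T tokens", 1e12),
        ("Llama 3 pretraining (15T tokens) [2]", 15e12),
    ]
    for label, tokens in examples:
        millions = tokens_to_books(tokens) / 1e6
        print(f"{label}: ~{millions:.2f} million books")
```

Changing any of the three constants shifts the result proportionally, so the order of magnitude (tens of millions of books) holds up even if the per-page or per-token estimates are off by a fair margin.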