by tempusalaria
815 days ago
The training data is pretty much anything you can read on the internet, plus books. It is then cleaned to remove nonsense, some technical files, and duplicate files. From there, some sources get weighted more heavily, e.g. Wikipedia gets a pretty high weighting in the data mix. Overall these data mixes run to multiple trillions of tokens. GPT-4 apparently trained on multiple epochs of the same data mix, so I would assume this one did too, since it has a similar token count.
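
To make the dedup and weighting steps concrete, here is a rough Python sketch. It is not any lab's actual pipeline: the function names (`dedupe`, `sample_mix`), the toy corpus, and the weights are all made up for illustration, and real pipelines use fuzzy dedup and far more elaborate filtering.

```python
import hashlib
import random


def dedupe(docs):
    """Drop exact duplicates by hashing lightly normalized text (hypothetical helper)."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def sample_mix(sources, weights, n, seed=0):
    """Draw n (source, document) pairs, picking each source in proportion to its weight."""
    rng = random.Random(seed)
    names = list(sources)
    picks = rng.choices(names, weights=[weights[name] for name in names], k=n)
    # Sampling with replacement: a small, heavily weighted source gets repeated.
    return [(name, rng.choice(sources[name])) for name in picks]


# Toy corpus; the weights are illustrative assumptions, not real values.
sources = {
    "wikipedia": dedupe(["Paris is in France.", "paris is in france. ", "Water is H2O."]),
    "web": dedupe(["buy now!!!", "some blog post", "another blog post"]),
}
weights = {"wikipedia": 3.0, "web": 1.0}

batch = sample_mix(sources, weights, n=8)
```

Because the draw is with replacement, upweighting a small source like Wikipedia means its documents are seen several times, which is effectively the same thing as training on that source for multiple epochs.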