by GaggiX 795 days ago
The dataset is 7 times bigger than the dataset used for Llama 2 as reported by Meta.
1 comment

Has Meta disclosed how much of the dataset was repeated? I've only seen the "number of tokens trained" figure.