by hewlett 1210 days ago
> https://research.facebook.com/file/1574548786327032/LLaMA--O...

> Dataset        Sampling prop.  Epochs  Disk size
> CommonCrawl    67.0%           1.10    3.3 TB
> C4             15.0%           1.06    783 GB
> Github         4.5%            0.64    328 GB
> Wikipedia      4.5%            2.45    83 GB
> Books          4.5%            2.23    85 GB
> ArXiv          2.5%            1.06    92 GB
> StackExchange  2.0%            1.03    78 GB

At no point do they use information from Facebook customers.