by hewlett
1210 days ago
> https://research.facebook.com/file/1574548786327032/LLaMA--O...

> Dataset         Sampling prop.   Epochs   Disk size
> CommonCrawl     67.0%            1.10     3.3 TB
> C4              15.0%            1.06     783 GB
> Github          4.5%             0.64     328 GB
> Wikipedia       4.5%             2.45     83 GB
> Books           4.5%             2.23     85 GB
> ArXiv           2.5%             1.06     92 GB
> StackExchange   2.0%             1.03     78 GB

At no point do they use information from Facebook customers.