| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by arugulum 2001 days ago
	570GB of Common Crawl post-filtering, but only 40% of CC data was seen even once during training, though CC is only 60% of the training data. You could work through the math to find the rough size of GPT-3's training data, but it sounds like The Pile is of comparable size.

1 comments

leogao 2001 days ago

Yeah, the Pile is approximately the size of the GPT-3 training data, which is not a coincidence--one major reason we created the Pile (though certainly not the only one) was for our GPT-3 replication project.

link