| HN Mirror

	by dirheist 1277 days ago
	It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets. OpenAI claims that commoncrawl is roughly 60% of its total training corpus, and it also claims to use the other datasets listed. They probably also have some sort of proprietary Q&A/search query corpus via Microsoft.

1 comment

> It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets.

I'm having trouble finding a source for the libgen claim. Is that confirmed or just rumor?

> Informed 'best guess' only.
> Sources: https://lifearchitect.ai/papers/

Doesn't seem too convincing to me