HN Mirror



	by mcculley 769 days ago
	They are not “trained on all publicly available human knowledge”. Go look at the training data sets used. Most human knowledge that has been digitized is not publicly available (e.g., Google Books). These models are not able to get to data sets behind paywalls (e.g., scientific journals). It will be a huge step forward for humanity when we can run algorithms across all human knowledge. We are far from that.


neurostimulant 768 days ago

There is a rumor that OpenAI might've used libgen in their training data.


mcculley 768 days ago

Someone will. The potential gains are too high to ignore it.


nojvek 768 days ago

We are talking about trillions of tokens.

I’m sure the big players like Google, Meta, OpenAI have used anything and everything they can get their hands on.

Libgen is a wonder of the internet. I’m glad it exists.


mcculley 768 days ago

I am also glad that libgen exists. Liberating human knowledge from copyright will improve humanity overall.

But I don’t understand how you can be sure that the big players are using it as a training corpus. Such an effort of questionable legality would be a significant investment of resources. Certainly, as the computronium gets cheaper and techniques evolve, bringing it within reach of entities that don’t answer to shareholders and investors, it will happen. What makes you sure that publicly traded companies or OpenAI are training on libgen?
