| HN Mirror



	by benxh 1119 days ago
	To be honest, I've been asking myself the same thing. Technically, the amount of "good quality" data in libgen is huge, far larger than the books3 dataset. However, it would probably run afoul of copyright. Then again, a huge amount of the data that LLMs are trained on is copyrighted.

1 comment

napier 1118 days ago

Training on copyrighted data is arguably fair use in quite a few jurisdictions, to varying extents and with varying levels of precedent, and is entirely legal for entities based in Japan.


benxh 1118 days ago

Yes, but the acquisition of that data is itself illegal in almost all jurisdictions, since libgen is treated as a piracy website. Now, if there were a pipeline to access books from Amazon or the Google Books project for training, it would be a different story.

Still, for certain languages, only libgen and public piracy websites contain any scientific or fiction material in digital formats. E.g. my native language doesn't have easily accessible e-books at all, unless you go through illegal means.

I hope somebody undertakes the steps necessary to train on the entirety of libgen. The number of high-quality tokens in libgen should be substantial.


fragmede 1118 days ago

Google has the resources to train on Google Books, Google Scholar, and their crawled copy of the whole Internet. No clue what Bard is/isn't trained on tho.
