| HN Mirror



	by napier 1119 days ago
	I’d like to see a model with the effluent of the internet intelligently filtered from the pretraining data by LLM and human curation, and much more effort to include digitised archival sources and the entirety of books and high quality media transcripts. I imagine it would yield far better baseline output quality, with much less of the current “requirement” for (over)correction through ultimately disastrous RLHF masking.

1 comments

jiggawatts 1119 days ago

I'd love to play with a version of GPT-4 fine-tuned on every science textbook written in the last few decades, every published science paper (not just preprints from arXiv), and everything generated by every large research institute. Think NASA, CERN, etc.

Or one tuned with every fiction novel ever written, along with every screenplay.


benxh 1118 days ago

So a model fine-tuned on libgen?


anticensor 1118 days ago

Why not?


benxh 1118 days ago

To be honest, I've been asking myself the same thing. Technically, the amount of "good quality" data in libgen is huge, way larger than the books3 dataset. However, it would probably run afoul of copyright. Then again, a huge amount of the data that LLMs are trained on is copyrighted.


napier 1117 days ago

Training on copyrighted data is arguably fair use in quite a few jurisdictions, to varying extents and with varying levels of precedent, and is entirely legal for entities based in Japan.


benxh 1116 days ago

Yes, but the acquisition of that data itself is illegal in almost all jurisdictions, since libgen is treated as a piracy website. Now, if there were a pipeline to access books from Amazon or the Google Books project for training, it would be a different story.

Still, for certain languages, only libgen and public piracy websites contain any scientific or fiction material in digital formats. E.g. my native language doesn't have easily accessible e-books at all, unless you go through illegal means.

I hope somebody undertakes the steps necessary to train on the entirety of libgen. The amount of high quality tokens in libgen should be substantial.


napier 1119 days ago

I would gladly pay triple digits a month for exactly that.
