How about staying anonymous and just violating all the copyright laws? There's already pirate bay, libgen, sci-hub, zlibrary, etc., so surely it's possible for there to be an open-source, pirate LLM.
If it were practical to mirror sci-hub and libgen, that would be one thing, but despite a lot of talk online I have yet to see a practical way to get my hands on such a thing.
I'm not sure what you're referring to with 'such a thing', but mirroring libgen and zlib is really not hard. Libgen offers torrent links, as does Anna's Archive. The libgen domains are fragile, but here's a link to the Anna's Archive torrents: https://annas-archive.org/torrents. They even have a page talking about training LLMs on this data: https://annas-archive.org/llm
Exactly, it's easy to say "Just torrent it," but that requires a lot of people to stick their necks out, including the user who just wants a copy of the data.
We need the ability to circulate HDDs physically in a semi-organized fashion, samizdat-style.
Mirroring libgen is definitely within reach; it's "just" 50 or so terabytes, with torrents freely available for bulk downloading.
Realistically only maybe 10% of that is actually useful, but reaching that 10% is gonna be very labour-intensive. You would have to do a lot of cleanup: different formats, duplicate uploads, different editions of the same book, scanned PDFs, and whatnot. Meanwhile, the big players with their own ebook stores (Amazon, Google, Apple, any ebook store) already have all of the proper metadata, a common format to work with, and far fewer duplicates.
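To give a feel for the "duplicate uploads" part of that cleanup, here is a minimal sketch of metadata-based deduplication. It assumes each record is a dict with hypothetical `title`, `author`, and `format` keys, and the format ranking is an assumption too. It groups records by a normalized (title, author) key and keeps the copy in the preferred format. Note that it would also collapse different editions into one, which may or may not be what you want.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed preference order: text-based formats before scan-heavy ones.
FORMAT_RANK = {"epub": 0, "mobi": 1, "pdf": 2, "djvu": 3}

def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def dedupe(records):
    """Keep one record per normalized (title, author), preferring better formats."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["author"]))
        groups[key].append(rec)
    # Unknown formats sort last.
    return [
        min(group, key=lambda r: FORMAT_RANK.get(r["format"], 99))
        for group in groups.values()
    ]

records = [
    {"title": "Moby-Dick", "author": "Herman Melville", "format": "pdf"},
    {"title": "moby dick", "author": "HERMAN MELVILLE", "format": "epub"},
    {"title": "Walden", "author": "Henry David Thoreau", "format": "djvu"},
]
unique = dedupe(records)
```

Real catalogs are messier than this (subtitles, transliterated names, missing authors), so exact matching on a normalized key is only a first pass.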
Isn't there some kind of standard for publication metadata? One that would uniquely identify a publication and also track its different editions as children of the "original" publication? If not, maybe we should create one and make it freely available?
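As a rough illustration of the parent/child idea, here is a minimal sketch of what such a record could look like. The `Work` and `Edition` classes and all of their field names are hypothetical, not taken from any existing standard, and the identifiers are placeholders.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Edition:
    edition_id: str   # placeholder identifier for one specific edition
    year: int
    file_format: str

@dataclass
class Work:
    work_id: str      # placeholder identifier for the "original" publication
    title: str
    author: str
    editions: List[Edition] = field(default_factory=list)

    def add_edition(self, edition):
        """Attach an edition as a child of this work."""
        self.editions.append(edition)

    def latest(self):
        """Return the most recent edition by year."""
        return max(self.editions, key=lambda e: e.year)

work = Work(work_id="work-1", title="Example Book", author="A. Author")
work.add_edition(Edition(edition_id="ed-1", year=1999, file_format="pdf"))
work.add_edition(Edition(edition_id="ed-2", year=2012, file_format="epub"))
newest = work.latest()
```

The point of the split is that duplicates and editions then become different problems: duplicates share an `edition_id`, while editions share only a `work_id`.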
How would one anonymously train an LLM of sufficient size to produce the performance needed? Doesn't it require hundreds or thousands of expensive Nvidia GPUs?
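For a back-of-envelope answer, here is a small sketch using the common rule of thumb that training a dense transformer takes roughly 6 x parameters x tokens floating-point operations. Every input below is an assumption chosen for illustration: a 7B-parameter model, 1T training tokens, a peak of 312 TFLOP/s per GPU, and 40% sustained utilization.

```python
def training_gpu_days(params, tokens, peak_flops_per_gpu, utilization):
    """Estimate GPU-days using the approximate 6 * N * D training-FLOPs rule."""
    total_flops = 6 * params * tokens
    seconds_on_one_gpu = total_flops / (peak_flops_per_gpu * utilization)
    return seconds_on_one_gpu / 86400  # seconds per day

# All four values are illustrative assumptions, not measurements.
gpu_days = training_gpu_days(
    params=7e9,                 # 7B parameters
    tokens=1e12,                # 1T training tokens
    peak_flops_per_gpu=312e12,  # assumed peak throughput per GPU
    utilization=0.4,            # assumed fraction of peak actually sustained
)
days_on_1000_gpus = gpu_days / 1000
```

Under these assumptions the estimate comes out to a few thousand GPU-days, so a handful of days on a thousand GPUs or around a year on ten. It ignores data preparation, failed runs, and memory limits, so treat it as a lower bound on effort rather than a plan.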