by lithiumii 1000 days ago
How about staying anonymous and just violating all the copyright laws? There's already pirate bay, libgen, sci-hub, zlibrary, etc., so surely it's possible for there to be an open-source, pirate LLM.
2 comments

If it were practical to mirror sci-hub and libgen, that would be one thing, but despite a lot of talk online I have yet to see a practical way to get my hands on such a thing.
I'm not sure what you are referring to with 'such a thing', but mirroring libgen and zlib is really not hard. Libgen offers Torrent links as does Anna's Archive. The libgen domains are fragile, but here's a link to the Anna's Archive torrents: https://annas-archive.org/torrents. They even have a page talking about training LLMs on this data: https://annas-archive.org/llm
Do any of these methods actually work though? Last time I looked (admittedly, 6 months or so ago), there were 0 seeders on the torrents.
Exactly, it's easy to say "Just torrent it," but that requires a lot of people to stick their necks out, including the user who just wants a copy of the data.

We need the ability to circulate HDDs physically in a semi-organized fashion, samizdat-style.

Mirroring libgen is definitely within reach: it's "just" 50 or so terabytes, with torrents freely available for bulk downloading.
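To put "within reach" in numbers, here is a rough back-of-the-envelope sketch. The 50 TB figure is the estimate above; the link speeds are my own assumptions, and real torrent throughput depends on seeders, so treat the results as a lower bound at best.

```python
def download_days(size_tb: float, mbit_per_s: float) -> float:
    """Days needed to move size_tb terabytes (decimal, 10**12 bytes)
    over a link that sustains mbit_per_s megabits per second."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (mbit_per_s * 1e6)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Assumed sustained rates; actual swarm speeds may be far lower.
    for rate in (100, 1000):
        print(f"{rate} Mbit/s: {download_days(50, rate):.1f} days")
```

So on a typical home connection this is a matter of weeks, not hours, assuming the link is saturated the whole time.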

Realistically only maybe 10% of that is actually useful, but reaching that 10% is gonna be very labour-intensive. You would have to do a lot of cleanup: different formats, duplicate uploads, different editions of the same book, scanned PDFs, and whatnot. Meanwhile the big players with their own ebook stores (Amazon, Google, Apple, any ebook store) already have all of the proper metadata, a common format to work with, and far fewer duplicates.
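A minimal sketch of the first cleanup pass, grouping likely duplicate uploads by a normalized title and author. The record fields (`title`, `author`, `format`) and the sample data are hypothetical, not the actual libgen metadata schema, and this does nothing about distinguishing editions or handling scanned PDFs.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def author_key(author: str) -> str:
    """Sort name tokens so 'Hugo, Victor' and 'Victor Hugo' match."""
    return " ".join(sorted(normalize(author).split()))


def group_duplicates(records):
    """Group records that share a normalized (title, author) key."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["title"]), author_key(rec["author"]))
        groups[key].append(rec)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        {"title": "Les Misérables", "author": "Victor Hugo", "format": "epub"},
        {"title": "les miserables ", "author": "HUGO, Victor", "format": "pdf"},
        {"title": "Notre-Dame de Paris", "author": "Victor Hugo", "format": "epub"},
    ]
    for key, recs in group_duplicates(sample).items():
        print(key, [r["format"] for r in recs])
```

Exact-key matching like this only catches the easy cases; subtitles, translations, and typos would need fuzzier matching, which is where the labour goes.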

Isn't there some kind of standard for publication metadata? One that would allow us to uniquely identify a publication and then track different editions as children of the "original" publication? If not, maybe we should create one and make it freely available?
How would one anonymously train an LLM of sufficient size to produce the performance needed? Does it not require hundreds or thousands of expensive Nvidia GPUs?
Hardware gets better, the masses have amassed quite a lot of it already, and it depends how soon you need your AI.
Hostile foreign nation trains it and releases it in the persona of an anonymous hackerman.
Basically the plot of Ghost in the Shell.