by dmingod 3166 days ago
Is there a way to download the whole archive? People can do cool stuff like visualization etc. on it.
1 comments

Yes, but it's huge. At one time you could torrent it piece by piece but now the link appears broken...

http://libgen.io/scimag/repository_torrent_notforall/

Anyone have the current link?

Good luck extracting much of anything useful out of older PDFs though.

This? http://libgen.io/scimag/repository_torrent/

I'm not sure what the numbers mean, but the last-modified dates on those torrents range from 3 years ago to this month.

Yes, that's the one! Those numbers refer to the number of papers in each torrent, so each one contains 100,000 papers, giving a current total of 66+ million.

Each 100,000-paper torrent is broken into 1,000-paper zip archives that can be downloaded individually, so it's pretty manageable if you just want to check out a random sampling of the papers.
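
To put numbers on that layout, here is a quick back-of-the-envelope sketch. It only uses the figures quoted in this thread (100,000 papers per torrent, 1,000 per zip, roughly 66 million in total). The variable names and the idea of sampling by (torrent, zip) position are my own illustration, not anything the repository itself defines.

```python
import random

# Figures quoted in the thread; the total is approximate ("66+ million").
PAPERS_PER_TORRENT = 100_000
PAPERS_PER_ZIP = 1_000
TOTAL_PAPERS = 66_000_000

zips_per_torrent = PAPERS_PER_TORRENT // PAPERS_PER_ZIP  # zips inside one torrent
torrent_count = TOTAL_PAPERS // PAPERS_PER_TORRENT       # full torrents overall
zip_count = torrent_count * zips_per_torrent             # zips overall


def sample_zip_positions(k, seed=None):
    """Pick k distinct (torrent_index, zip_index) positions uniformly at random.

    Indices are plain 0-based positions, a hypothetical convention for this
    sketch; they are not the repository's actual file names.
    """
    rng = random.Random(seed)
    flat = rng.sample(range(zip_count), k)
    return [divmod(i, zips_per_torrent) for i in flat]


if __name__ == "__main__":
    print(zips_per_torrent, torrent_count, zip_count)
    print(sample_zip_positions(5, seed=0))
```

Under those assumptions, five sampled zips give a 5,000-paper random sample without downloading anything else.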

I would love to see somebody do some kind of massive-scale analysis of the papers, but just extracting plain text from all those PDFs is a pretty herculean task, considering that many would need to be OCRed and others end up pretty garbled / misformatted with pdftotext and the like.
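
For anyone who wants to try, a first pass could be a triage step that sorts PDFs into "fine", "garbled", and "probably needs OCR". The sketch below is only that: it assumes pdftotext from poppler-utils is on your PATH, and the `classify` heuristic and its thresholds (`min_chars`, `min_plausible`) are made-up starting points, not tested values.

```python
import string
import subprocess

_PUNCT = set(string.punctuation)


def extract_text(pdf_path):
    """Run pdftotext and return the text it writes to stdout."""
    result = subprocess.run(
        # "-" as the output file sends the extracted text to stdout.
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def classify(text, min_chars=100, min_plausible=0.85):
    """Rough triage of extracted text: 'needs_ocr', 'garbled', or 'ok'.

    Both thresholds are arbitrary assumptions for illustration.
    """
    stripped = text.strip()
    # Almost no text usually means a scanned, image-only PDF.
    if len(stripped) < min_chars:
        return "needs_ocr"
    # Share of characters that look like ordinary prose.
    plausible = sum(
        c.isalnum() or c.isspace() or c in _PUNCT for c in stripped
    )
    if plausible / len(stripped) < min_plausible:
        return "garbled"
    return "ok"
```

Usage would be something like `classify(extract_text("paper.pdf"))`, with only the `needs_ocr` pile sent on to an OCR tool. Expect to tune the thresholds on a sample first, since math-heavy papers will likely score lower than plain prose.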

I could have sworn they were uploaded to Usenet too, but I can't find it for the life of me.
I thought about mirroring it. The repository db is 200MB and simple in structure, but then you need quite a lot of disk space on your side (20 TB, 200 TB, maybe more; I can't recall).
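
As a sanity check on those half-remembered sizes, this is what each would imply per paper, assuming the "66+ million" count mentioned upthread and decimal terabytes. The function name is just for illustration.

```python
TOTAL_PAPERS = 66_000_000  # "66+ million", as quoted in the thread
TB = 10**12                # decimal terabyte, in bytes
MB = 10**6                 # decimal megabyte, in bytes


def implied_avg_mb(total_tb, papers=TOTAL_PAPERS):
    """Average size per paper in MB if the whole archive were total_tb TB."""
    return total_tb * TB / papers / MB


if __name__ == "__main__":
    for tb in (20, 200):
        print(f"{tb} TB -> {implied_avg_mb(tb):.2f} MB per paper")
```

That works out to roughly 0.3 MB per paper at 20 TB and roughly 3 MB per paper at 200 TB.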