HN Mirror



	by dmingod 3166 days ago
	Is there a way to download the whole archive? People can do cool stuff like visualization etc. on it.

1 comments

3131s 3166 days ago

Yes, but it's huge. At one time you could torrent it piece by piece but now the link appears broken...

http://libgen.io/scimag/repository_torrent_notforall/

Anyone have the current link?

Good luck extracting much of anything useful out of older PDFs though.


userbinator 3166 days ago

This? http://libgen.io/scimag/repository_torrent/

I'm not sure what the numbers mean, but the last-modified dates on those torrents span a range of 3 years ago to this month.


3131s 3166 days ago

Yes, that's the one! Those numbers refer to the number of papers in each torrent, so each one contains 100,000 papers giving a current total of 66+ million.

The torrents of 100,000 are broken into 1000-paper zip archives that can be downloaded individually, so it's pretty manageable if you want to just check out a random sampling of the papers.

I would love to see somebody do some kind of massive-scale analysis of the papers, but just extracting plain text from all those PDFs is a pretty herculean task, considering that many would need to be OCRed and others end up pretty garbled or misformatted with pdftotext and the like.
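
To make the extraction problem concrete, here is a minimal sketch of how one might triage PDFs into "text layer is usable" versus "probably needs OCR". It assumes the `pdftotext` CLI from poppler-utils is installed; the `needs_ocr` heuristic and its thresholds are hypothetical illustrations, not tuned values from anyone in this thread.

```python
import shutil
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run the pdftotext CLI (poppler-utils) and return the text it writes to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends output to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def needs_ocr(text: str, min_chars: int = 200, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: flag output that is nearly empty or mostly non-letters.

    Both thresholds are assumptions chosen for illustration only.
    """
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return True  # scanned PDFs often yield little or no text
    alpha = sum(ch.isalpha() for ch in stripped)
    return alpha / len(stripped) < min_alpha_ratio  # garbled output is symbol-heavy
```

Something like `needs_ocr(extract_text("paper.pdf"))` could then route each file either to an OCR queue or straight to analysis. A real pipeline would almost certainly need a better check than a letter ratio, especially for math-heavy or non-Latin papers.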


lsh 3166 days ago

https://elifesciences.org/labs/5b56aff6/sciencebeam-using-co...


snowpanda 3166 days ago

I could have sworn they were uploaded to Usenet too, but I can't find them for the life of me.


agumonkey 3166 days ago

I thought about mirroring it. The repository DB is 200MB and simple in structure, but then you need quite a lot of disk space on your side (20TB, 200TB, maybe more, I can't recall).
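
For a rough sense of scale, a back-of-envelope sketch using the figures quoted upthread (66+ million papers, 100,000 per torrent, 1000 per zip) might look like the following. The average paper size is not given anywhere in this thread, so `avg_paper_mb` is a pure assumption that you would have to measure from a sample.

```python
PAPERS_TOTAL = 66_000_000      # "66+ million" as quoted in the thread
PAPERS_PER_TORRENT = 100_000   # per the thread
PAPERS_PER_ZIP = 1_000         # per the thread


def estimate(avg_paper_mb: float) -> dict:
    """Estimate torrent count, zips per torrent, and total size in TB.

    avg_paper_mb is a hypothetical input; the thread does not state it.
    """
    torrents = PAPERS_TOTAL // PAPERS_PER_TORRENT
    zips_per_torrent = PAPERS_PER_TORRENT // PAPERS_PER_ZIP
    total_tb = PAPERS_TOTAL * avg_paper_mb / 1_000_000  # decimal MB -> TB
    return {
        "torrents": torrents,
        "zips_per_torrent": zips_per_torrent,
        "total_tb": total_tb,
    }
```

Calling `estimate(1.0)` treats every paper as 1 MB on average; plugging in a measured average from a few of the 1000-paper zips would narrow the 20TB-versus-200TB uncertainty.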
