Hacker News
by userbinator 3166 days ago
This? http://libgen.io/scimag/repository_torrent/

I'm not sure what the numbers mean, but the last-modified dates on those torrents span a range of 3 years ago to this month.

2 comments

Yes, that's the one! Those numbers refer to the number of papers in each torrent: each one contains 100,000 papers, for a current total of 66+ million.

The torrents of 100,000 are broken into 1000-paper zip archives that can be downloaded individually, so it's pretty manageable if you want to just check out a random sampling of the papers.
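For anyone who wants to try that kind of random sampling, here is a minimal sketch. It assumes you have already downloaded one of the 1000-paper zip archives to a local path; the function names and the `seed` parameter are made up for illustration, and nothing here depends on how the archives or their members are actually named.

```python
import random
import zipfile

# Figures taken from the comment above; the constant names are illustrative.
PAPERS_PER_TORRENT = 100_000
PAPERS_PER_ZIP = 1_000


def zips_per_torrent() -> int:
    """Number of zip archives that make up one torrent."""
    return PAPERS_PER_TORRENT // PAPERS_PER_ZIP


def sample_papers(zip_path, k, seed=None):
    """Return up to k randomly chosen member names from one zip archive."""
    rng = random.Random(seed)
    with zipfile.ZipFile(zip_path) as zf:
        # Skip directory entries; keep only actual files.
        names = [n for n in zf.namelist() if not n.endswith("/")]
    return rng.sample(names, min(k, len(names)))
```

Calling `sample_papers("some_archive.zip", 10)` would give ten member names to pull out with `ZipFile.extract`, and `zips_per_torrent()` works out to 100 archives per torrent from the numbers quoted above.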

I would love to see somebody do some kind of massive-scale analysis of the papers, but just extracting plain text from all those PDFs is a pretty herculean task, considering that many would need to be OCRed and others come out garbled or misformatted with pdftotext and the like.
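A rough sketch of how a first triage pass might look, assuming poppler's `pdftotext` is installed and on the PATH. The `classify` helper and its thresholds (`min_chars`, `min_alpha_ratio`) are guesses for illustration, not tuned values: the idea is only to separate PDFs with almost no extractable text (probably scans that need OCR) from ones whose output is mostly non-letters (probably garbled).

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the text it writes to stdout ("-")."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def classify(text: str, min_chars: int = 200, min_alpha_ratio: float = 0.6) -> str:
    """Crude triage of extracted text; thresholds are assumed, not measured."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return "needs_ocr"  # little or no text layer, likely a scan
    alpha = sum(ch.isalpha() for ch in stripped)
    if alpha / len(stripped) < min_alpha_ratio:
        return "garbled"  # text came out, but mostly symbols or digits
    return "ok"
```

Something like `classify(extract_text("paper.pdf"))` on a random sample would at least give an estimate of what fraction of the collection needs OCR before anyone commits to processing all of it. Papers that are legitimately heavy on numbers or formulas could be misflagged as garbled, so the ratio would need checking against real samples.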

I could have sworn they were uploaded to Usenet too, but I can't find them for the life of me.