by 3131s 3166 days ago
Yes, that's the one! Those numbers refer to the number of papers in each torrent: each one contains 100,000 papers, giving a current total of 66+ million.

The torrents of 100,000 are broken into 1000-paper zip archives that can be downloaded individually, so it's pretty manageable if you want to just check out a random sampling of the papers.
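
For anyone who wants to try the random-sampling idea, here is a rough sketch of the arithmetic. It assumes the layout described above (100,000 papers per torrent, 1000 papers per zip, so 100 zips per torrent) and simply picks random (torrent, zip) index pairs. The function name, the zero-based indexing, and the figure of 660 torrents (66 million / 100,000) are my own assumptions for illustration; mapping the indices to actual file names is left out because that depends on how the torrents are named.

```python
import random

# Layout as described in the comment above.
PAPERS_PER_TORRENT = 100_000
PAPERS_PER_ZIP = 1_000
ZIPS_PER_TORRENT = PAPERS_PER_TORRENT // PAPERS_PER_ZIP  # 100 zips per torrent


def sample_archives(num_torrents, k, seed=None):
    """Pick k distinct (torrent_index, zip_index) pairs uniformly at random.

    Indices are zero-based and hypothetical; they are not real file names.
    """
    rng = random.Random(seed)
    total_zips = num_torrents * ZIPS_PER_TORRENT
    picks = rng.sample(range(total_zips), k)
    # divmod turns a flat zip number into (torrent, zip-within-torrent).
    return sorted(divmod(i, ZIPS_PER_TORRENT) for i in picks)


if __name__ == "__main__":
    # 660 torrents is an estimate derived from "66+ million" papers.
    for torrent, archive in sample_archives(660, 5, seed=0):
        print(f"torrent {torrent}, zip {archive}")
```

Five zips sampled this way come to about 5,000 papers, which is a lot easier to handle than a full 100,000-paper torrent.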

I would love to see somebody do some kind of massive-scale analysis of the papers, but just extracting plain text from all those PDFs is a pretty herculean task, considering that many would need to be OCRed and others end up pretty garbled or misformatted with pdftotext and the like.
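
To make the triage problem concrete, here is a minimal sketch of a first pass: run pdftotext on each PDF and flag the ones whose output looks empty or garbled as OCR candidates. It assumes the Poppler `pdftotext` binary is on the PATH (passing `-` as the output file sends the text to stdout). The helper names and the thresholds (200 characters, 60% letters) are guesses of mine, not tested values, and the heuristic would certainly misclassify some papers, such as math-heavy ones.

```python
import subprocess


def extract_text(pdf_path, timeout=60):
    """Return the text pdftotext produces for pdf_path, or '' on failure."""
    try:
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],  # "-" writes the text to stdout
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return ""
    if result.returncode != 0:
        return ""
    return result.stdout.decode("utf-8", errors="replace")


def needs_ocr(text, min_chars=200, min_alpha_ratio=0.6):
    """Crude check: too little text, or too few letters, suggests a scan
    or garbled extraction. Thresholds are illustrative assumptions."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return True
    letters = sum(ch.isalpha() for ch in stripped)
    return letters / len(stripped) < min_alpha_ratio
```

Usage would be something like `needs_ocr(extract_text("paper.pdf"))` for each file, sending the flagged ones to an OCR queue. Even a filter this crude would at least give an estimate of how much of the collection needs OCR before committing to the full job.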
