Hacker News
by ivansavz 2243 days ago
If you're going to be doing ML and need the PDFs, I would recommend getting the bulk data from S3 instead of downloading them individually (see https://arxiv.org/help/bulk_data_s3). It's a little more complicated to use, but you get it ALL ;)

In addition to TF-IDF, topic modelling is a very good fit for browsing and finding similar papers. Here is a demo of LDA applied to 10% of the quant-ph arXiv papers that I worked on back in the day: https://www.cs.mcgill.ca/~isavov/arxiv_demo/readme.html
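
For anyone who wants to see what the TF-IDF side of "finding similar papers" looks like, here is a rough standard-library sketch. It is not the code behind the demo above and it does not do LDA. The function names, the smoothed IDF formula, and the three toy abstracts are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        vectors.append({
            # Term frequency times a smoothed inverse document frequency
            # (the smoothing is an assumption, chosen to keep weights positive).
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, docs):
    """Rank every other document by similarity to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, reverse=True)


# Hypothetical abstracts, made up for this example.
abstracts = [
    "quantum error correction codes for noisy qubits",
    "error correction with stabilizer codes on qubits",
    "deep neural networks for image classification",
]

for score, index in most_similar(0, abstracts):
    print(f"{score:.3f}  {abstracts[index]}")
```

The ranking puts the second abstract ahead of the third because it shares several rare terms with the query. A topic model such as LDA would instead compare papers by their inferred topic mixtures, which can catch related papers that share few exact words.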

1 comment

This is very cool, thank you :). I was trying to keep the script lightweight, so I only wanted the articles I'd already read to be used for the NLP. In hindsight that may not have been necessary.