| HN Mirror



	by ivansavz 2243 days ago
	If you're going to be doing ML and need to download PDFs, I would recommend getting the bulk data from S3 instead of downloading papers one at a time: https://arxiv.org/help/bulk_data_s3 It's a little more complicated to use, but you get it ALL ;) In addition to TF-IDF, topic modelling is a very good fit for browsing and finding similar papers. Here is a demo of LDA applied to 10% of the quant-ph arXiv papers that I worked on back in the day: https://www.cs.mcgill.ca/~isavov/arxiv_demo/readme.html
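	For anyone curious what the TF-IDF side of "finding similar papers" might look like, here is a minimal sketch in plain Python. It is not the script discussed in this thread. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `most_similar`) and the sample abstracts are made up for illustration, and the smoothed IDF formula is just one common choice.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency (assumed variant, always positive).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, docs):
    # Rank every other document by similarity to docs[query_index],
    # returning (score, index) pairs with the best match first.
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Hypothetical abstracts, not real arXiv data.
    abstracts = [
        "quantum entanglement of qubits",
        "entanglement measures for qubits in quantum channels",
        "deep learning for image classification",
    ]
    for score, idx in most_similar(0, abstracts):
        print(idx, round(score, 3))
```

	A topic model such as LDA would replace the per-term vectors with per-topic ones, but the ranking step would stay roughly the same.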

1 comment

191101 2243 days ago

This is very cool, thank you :). I was trying to keep the script lightweight, so I only wanted articles that I'd already read to be used for the NLP. In hindsight, that may not have been necessary.
