Hacker News
by throwawaygog6 1362 days ago
>>> If someone wants to host the raw files to allow others to download them, let me know. It is an 83 GB tar.gz file which uncompressed is just over 1 TB in size.

Does anyone know if it is possible to download a similar dataset for YouTube and Reddit? I have ideas for a search engine based on it, but I don't want to write and maintain scraper scripts.

2 comments

There's a very large dataset of Reddit posts and comments at https://files.pushshift.io/reddit/
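
If it helps, here is a rough sketch of how one might stream records out of a single dump file without decompressing it to disk first. It assumes the file is newline-delimited JSON compressed with bz2, which is not stated anywhere in this thread, so check the actual format before relying on it. The function name `iter_records` and the file name in the usage note are made up for illustration.

```python
import bz2
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one JSON object per line from a bz2-compressed dump.

    Assumption: the dump is newline-delimited JSON (one object per line).
    """
    # Text mode decompresses lazily, so the whole file is never held in memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Usage would look something like `for record in iter_records("comments_dump.bz2"): ...`, where the path is a placeholder. Reading line by line matters at this scale, since a dump of that size will not fit in RAM once decompressed.
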
I have most of the historical Reddit data (all but roughly a year's worth), and I use it to train ML models. Let me see if I can find a public link for you...