by throwitawayfam 1630 days ago
The problem with using Reddit specifically is that you can't filter by date anymore. Reddit has poisoned their results to show old posts with new dates on Google.

Huh, I wonder, can I download Reddit? Like, all the text posts, ignoring images. I wonder how big a DB that is and how hard it would be to crawl it myself. It can't be more than a few GB of data. I mean, at this point there is a lot of information there that is just begging to be leveraged.
Pushshift has monthly comment[1] and submission data dumps that you can download. The June 2021 comment dump alone was 20+ GB, compressed as ZST (Zstandard).

[1]- https://files.pushshift.io/reddit/comments/
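
If you go the dump route, a rough sketch of reading one and filtering by date might look like the following. It assumes the dump is newline-delimited JSON (one object per line) and that each record carries `created_utc` (epoch seconds) and `body` fields; check a few lines of the actual file before relying on that. The helper names (`iter_records`, `in_date_range`, `open_dump`) are made up for illustration, and `open_dump` needs the third-party `zstandard` package.

```python
import io
import json
from datetime import datetime, timezone


def iter_records(lines):
    """Yield one dict per non-empty line, skipping lines that are not valid JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def in_date_range(record, start, end):
    """Return True if the record's created_utc falls in [start, end)."""
    raw = record.get("created_utc")  # assumed field name, epoch seconds
    if raw is None:
        return False
    created = datetime.fromtimestamp(int(raw), tz=timezone.utc)
    return start <= created < end


def open_dump(path):
    """Open a .zst dump as a text stream (hypothetical helper)."""
    import zstandard  # third-party: pip install zstandard

    fh = open(path, "rb")
    # A large window size is assumed here; some big archives need it to decompress.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    return io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
```

Something like `for rec in iter_records(open_dump(path)): ...` streams the file one line at a time, so the whole dump never has to fit in memory. The date check runs on your own copy of the data, which sidesteps the search-engine date problem mentioned at the top of the thread.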