Hacker News
by alexghr 1104 days ago
I think what the article is trying to say is that OpenAI have already scraped Reddit for training data, and that with the recent API changes and subreddits going dark, new competitors in the AI space won't be able to get the same training set as easily.
1 comment

Honestly this sounds like a shower-thought post. Even basic research would show that Internet Archive and The Eye have historical Reddit data freely available. My desktop PC has all comments and posts from 2007 to early 2023 as zstd-compressed JSONL (jsonl zst). It's only 3TB.