Hacker News
by toomuchtodo 233 days ago
Without gating AI scraper access, Reddit’s enterprise value, based only on ad revenue, is greatly diminished. If the AI folks impair Reddit’s economics through their maneuvers, that might not be so bad (Reddit’s behavior of late has been “all this user generated content belongs to us to monetize as we see fit”).

The AI companies could just pull the content from Reddit mirrors like https://arctic-shift.photon-reddit.com/search/ and https://search.pullpush.io/. Reddit is not difficult to scrape, and archives of all Reddit posts and comments are not difficult to acquire.
They would most likely use the browsers they offer users to scrape the content as users browse Reddit and stream it back to an endpoint for ingest and processing. Think of the RECAP extension for PACER (which scrapes PACER while a user browses it and ships the data to the Internet Archive) or ArchiveTeam’s Warrior VM. You can’t defend against scraping when every user’s browser, which looks like a human because it is a human, is a crawler node.

At least, this is how I would engineer a public browser operating as an adversarial distributed crawler network.