Could IPFS, torrents, or large local databases hosted in a decentralised way by volunteers be a solution to this? I personally have the resources to share and host TBs of data but haven't found a good use for them.
For that to work, a website has to push a mirror into that alternate system, and the scraper has to know the associated mirror exists.
That's two big "ifs" for something I'm not aware of a standardized way of announcing. And the entire thing crumbles as soon as someone who wants every drop of data possible says "crawl their sites anyway to make sure they didn't forget to publish anything into the 2nd system."
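To make the two "ifs" concrete, here is a rough sketch of the decision a scraper would face. Everything in it is hypothetical: `MirrorAnnouncement`, `plan_fetch`, and the placeholder mirror address are made up for illustration, since there is no standardized announcement format that I know of.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MirrorAnnouncement:
    # Hypothetical record: a site saying "my content is also available here".
    site: str
    mirror_uri: str  # placeholder for an IPFS or torrent address


def plan_fetch(site: str,
               announcement: Optional[MirrorAnnouncement],
               trust_mirror_is_complete: bool) -> List[str]:
    """Return the sources a scraper would hit for `site`, in order."""
    sources: List[str] = []

    # "If" number one and two: the site published a mirror AND the scraper knows about it.
    mirror_found = announcement is not None and announcement.site == site
    if mirror_found:
        sources.append(announcement.mirror_uri)

    # No known mirror, or a scraper that wants every drop of data:
    # the origin site gets crawled anyway.
    if not mirror_found or not trust_mirror_is_complete:
        sources.append(f"https://{site}/")

    return sources
```

With a known mirror and a scraper that trusts it, only the mirror is fetched. With a greedy scraper (`trust_mirror_is_complete=False`), the origin is still crawled on top of the mirror, which is exactly the case where the site gains nothing.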
I doubt it, as the article mentions scraping the same resource again after just 6 hours. AI companies want to make sure they have fresh data, while it would be hard to keep such a database updated.