The rules are that a large corporate AI company is able to scrape literally everything, and will use the full force of the law and any technology they can come up with to prevent you as an individual or a startup from doing so. Because having the audacity to try to exploit your betters would be "Theft".
Small mitigation (by no way absolving them): isolated developers, different teams. Another way: they see "stealing" of their compute directly in their devop tools every day, but are several abstractions away from doing the same thing to other people.
I think opt-outs are a bit backwards, ethically speaking. Instead of asking for permission, they take unless you tell them to no longer do it from now on.
I can imagine their models have been trained on a lot of websites before opt outs became a thing, and the models will probably incorporate that for forever.
But at least for websites there's an opt-out, even if only for the big AI companies. Open source code never even got that option ;).
It was a dataset of the entirety of the public internet from the very beginning that bypassed paywalls etc, there’s virtually nothing they haven’t scraped.
> the big AI companies do have opt out mechanisms for scraping and search.
PRESS RELEASE: UNITED BURGLARS SOCIETY
The United Burglars Society understands that being burgled may be inconvenient for some. In response, UBS has introduced the Opt-Out system for those who wish not to be burgled.
Please understand that each burglar is an independent contractor, so those wishing not to burgled should go to the website for each burglar in their area and opt-out there. UBS is not responsible for unwanted burglaries due to failing to opt-out.
Question: if I disallow all of OpenAI's crawlers, do they detect this and retroactively filter out all of my data from other corpuses, such as CommonCrawl?
The fact is my data exists in corpuses used by OpenAI before I was even aware anyone was scraping it. I'm wondering what can be done about that, if anything.
Performing an automated action on a website that has not consented is the problem. OpenAI showing you how to opt-opt is backwards. Consent comes first.
Bit concerning that some professional engineers don't understand this given the sensitive systems they interact with.
Just respect the bloody robots.txt and hold your horses. Ask your precious product built on the relentless, hostile scraping to devise a strategy that doesn't look like a cancer growth.
It seems likely that they buy data from companies who don't obey the same constraints however, making it easy to launder the unethical part through a third party.