Hacker News
by littlestymaar 78 days ago
Yeah, they know it's bad, they just don't think the rules apply to them.
7 comments

The rules are that a large corporate AI company is able to scrape literally everything, and will use the full force of the law and any technology they can come up with to prevent you as an individual or a startup from doing so. Because having the audacity to try to exploit your betters would be "Theft".
They know that the rules apply to them. They hope that they can avoid being caught.
It’s only bad if you’re a closed, for-profit entity

</sarcasm>

Was that sarcasm? Speaking of it, what parts of OpenAI are still open?
I know, always hard to tell on HN. Added the relevant declarative tag
The front door…
Small mitigation (in no way absolving them): isolated developers, different teams. Another way to put it: they see "stealing" of their compute directly in their devops tools every day, but are several abstractions away from doing the same thing to other people.
They never have and feel they are above reproach. Anytime Altman opens his mouth that's apparent. It's for the good of humanity dontcha know. LOL
You nailed it.
For what it's worth, the big AI companies do have opt out mechanisms for scraping and search.

OpenAI documents how to opt out of scraping here: https://developers.openai.com/api/docs/bots

Anthropic documents how to opt out of scraping here: https://privacy.claude.com/en/articles/8896518-does-anthropi...

I'm not sure if Gemini lets you opt out without also delisting you from Google search rankings.
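
For anyone who wants to sanity-check their own opt-out, here is a minimal sketch using Python's standard `urllib.robotparser`. The user-agent tokens `GPTBot` and `ClaudeBot` are assumptions on my part; check the pages linked above for the current names. The `is_allowed` helper, the sample rules, and the example.com URL are made up for illustration. It only shows what a well-behaved crawler would conclude from your robots.txt, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two AI crawlers, allow everyone else.
# The user-agent tokens are assumptions; confirm them in the vendors' docs.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://example.com/post/1")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that this only covers future crawling by bots that choose to honor the file.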

I think opt-outs are a bit backwards, ethically speaking. Instead of asking for permission, they take unless you tell them to no longer do it from now on.

I can imagine their models were trained on a lot of websites before opt-outs became a thing, and the models will probably carry that data forever.

But at least for websites there's an opt-out, even if only for the big AI companies. Open source code never even got that option ;).

> a lot of websites

From the very beginning, it was a dataset of the entire public internet, paywalled content included. There’s virtually nothing they haven’t scraped.

> the big AI companies do have opt out mechanisms for scraping and search.

PRESS RELEASE: UNITED BURGLARS SOCIETY

The United Burglars Society understands that being burgled may be inconvenient for some. In response, UBS has introduced the Opt-Out system for those who wish not to be burgled.

Please understand that each burglar is an independent contractor, so those wishing not to be burgled should go to the website of each burglar in their area and opt out there. UBS is not responsible for unwanted burglaries due to failure to opt out.

Question: if I disallow all of OpenAI's crawlers, do they detect this and retroactively filter out all of my data from other corpuses, such as CommonCrawl?

The fact is my data exists in corpuses used by OpenAI before I was even aware anyone was scraping it. I'm wondering what can be done about that, if anything.

Performing an automated action on a website that has not consented is the problem. OpenAI showing you how to opt out is backwards. Consent comes first.

Bit concerning that some professional engineers don't understand this given the sensitive systems they interact with.

Just respect the bloody robots.txt and hold your horses. Ask your precious product, built on relentless, hostile scraping, to devise a strategy that doesn't look like a cancerous growth.
Death by a thousand opt-outs.
It seems likely, however, that they buy data from companies that don't obey the same constraints, which makes it easy to launder the unethical part through a third party.