


	by littlestymaar 78 days ago
	Yeah, they know it's bad, they just don't think the rules apply to them.

7 comments

mapt 78 days ago

The rules are that a large corporate AI company is able to scrape literally everything, and will use the full force of the law and any technology they can come up with to prevent you as an individual or a startup from doing so. Because having the audacity to try to exploit your betters would be "Theft".

vbezhenar 78 days ago

They know that the rules apply to them. They hope that they can avoid being caught.

catoc 78 days ago

It’s only bad if you’re a closed, for-profit entity

</sarcasm>

lukan 78 days ago

Was that sarcasm? Speaking of it, what parts of OpenAI are still open?

catoc 78 days ago

I know, always hard to tell on HN. Added the relevant declarative tag

reactordev 78 days ago

The front door…

skeeter2020 78 days ago

Small mitigation (in no way absolving them): isolated developers, different teams. Another way: they see "stealing" of their compute directly in their devops tools every day, but are several abstractions away from doing the same thing to other people.

splatter9859 78 days ago

They never have and feel they are above reproach. Anytime Altman opens his mouth that's apparent. It's for the good of humanity dontcha know. LOL

kamban 78 days ago

You nailed it.

tedsanders 78 days ago

For what it's worth, the big AI companies do have opt out mechanisms for scraping and search.

OpenAI documents how to opt out of scraping here: https://developers.openai.com/api/docs/bots

Anthropic documents how to opt out of scraping here: https://privacy.claude.com/en/articles/8896518-does-anthropi...

I'm not sure if Gemini lets you opt out without also delisting you from Google search rankings.
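
For readers wondering what these opt-outs look like in practice, they are generally expressed as robots.txt rules. Below is a minimal sketch, not an official recipe: it assumes the crawler user-agent tokens are `GPTBot` and `ClaudeBot` (check the linked docs for the current names), and `is_allowed` and `example.com` are hypothetical names used only for illustration. It uses Python's standard `urllib.robotparser` to check how a rule-abiding crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body. The user-agent tokens are assumptions;
# verify them against each vendor's documentation before relying on them.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/post")
        print(agent, "allowed" if verdict else "blocked")
```

This only shows what a crawler that honors robots.txt would conclude. It does nothing against scrapers that ignore the file, and it says nothing about data collected before the rule was added.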

foresterre 78 days ago

I think opt-outs are a bit backwards, ethically speaking. Instead of asking for permission, they take unless you tell them to no longer do it from now on.

I can imagine their models were trained on a lot of websites before opt-outs became a thing, and the models will probably incorporate that forever.

But at least for websites there's an opt-out, even if only for the big AI companies. Open source code never even got that option ;).

kneel25 78 days ago

> a lot of websites

It was a dataset of the entirety of the public internet from the very beginning that bypassed paywalls etc, there’s virtually nothing they haven’t scraped.

qaadika 78 days ago

> the big AI companies do have opt out mechanisms for scraping and search.

PRESS RELEASE: UNITED BURGLARS SOCIETY

The United Burglars Society understands that being burgled may be inconvenient for some. In response, UBS has introduced the Opt-Out system for those who wish not to be burgled.

Please understand that each burglar is an independent contractor, so those wishing not to be burgled should go to the website for each burglar in their area and opt out there. UBS is not responsible for unwanted burglaries due to failing to opt out.

maplethorpe 66 days ago

Question: if I disallow all of OpenAI's crawlers, do they detect this and retroactively filter out all of my data from other corpuses, such as CommonCrawl?

The fact is my data exists in corpuses used by OpenAI before I was even aware anyone was scraping it. I'm wondering what can be done about that, if anything.

netdevphoenix 78 days ago

Performing an automated action on a website that has not consented is the problem. OpenAI showing you how to opt out is backwards. Consent comes first.

Bit concerning that some professional engineers don't understand this given the sensitive systems they interact with.

subscribed 78 days ago

Just respect the bloody robots.txt and hold your horses. Ask your precious product built on the relentless, hostile scraping to devise a strategy that doesn't look like a cancer growth.

keybored 78 days ago

Death by a thousand opt-outs.

Tarq0n 77 days ago

It seems likely that they buy data from companies who don't obey the same constraints however, making it easy to launder the unethical part through a third party.