by maitland 1025 days ago
I'd be amazed if AI companies like OpenAI respect robots.txt or any other good-faith mechanism.
4 comments

Seems to me that any organisation that doesn't honor robots.txt would have a tougher time justifying to people that they are acting in good faith. Just my two cents of course. I understand OpenAI does honor it, which is to be expected from responsible orgs.
Yeah, I don't trust that they will all respect it. Just like you can't trust any other scrapers to respect these mechanisms, and just like you couldn't trust websites to honor the DNT signal.

If it's voluntary and companies will make more money by not doing it, then at least some are not going to do it.

So, from my point of view, this effort is mostly meaningless. It's not enough to get me to open my websites again, anyway.

It's true that some dishonest folk might not honor it, but it does communicate the wishes of the stakeholders of a website in machine readable terms. It means owners of bots cannot claim permission to use content is granted, as clear intent in robots.txt would show it is not.
> as clear intent in robots.txt would show it is not.

Would it?

The primary purpose of robots.txt is not actually to lock out bots (that's why respecting it is not mandatory). It's to give the bots guidance as to which parts of your site are appropriate for them and which parts are not.

This may make the "clear intent" argument weak in court.

> This may make the "clear intent" argument weak in court.

The standard has the keyword "Disallow", not "Avoid". I can't speak for anyone else of course, but that seems a pretty clear indicator of intent to me. By that I mean a site's stakeholders want to indicate that certain bots are disallowed from crawling a portion of their website.
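For what it's worth, here is a minimal sketch of how a well-behaved crawler would read that "Disallow" signal, using Python's standard `urllib.robotparser`. The robots.txt content, the example.com URLs, and the `is_allowed` helper are made up for illustration; the "GPTBot" user-agent token is assumed from OpenAI's crawler docs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut GPTBot out entirely, and keep every
# other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/blog/post", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing here enforces anything, of course. The check only matters if the crawler chooses to run it, which is the whole point of the disagreement above.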

But you and I aren't judges in a court. They go by different rules, such as the official intent and meaning of the robots.txt system itself.

I'm not saying a court wouldn't find intent signaled, I don't know, only that it's not clear-cut that it would.

Aside from whether intent is signalled, I would imagine courts may want to determine whether it is reasonable for a particular intent not to be honored in a particular set of circumstances. As you say, honoring robots.txt isn't mandatory by itself. Perhaps other things might make it so? I don't know.
I mean, they did write their own crawler and have huge financial incentives to respect it.

What isn't known is whether those sites will still have their content included in training corpora such as Common Crawl or The Pile, etc.

I mean, they literally wrote their own crawler, and it has docs. I'm sure they'll respect it. https://platform.openai.com/docs/gptbot

What isn't known is whether those same sites will also be included in corpora such as Common Crawl or The Pile, leading to their inclusion in training as is.

All these "AI" garbage bots are filtered at the IP level on my end.
IP level filtering is purely a "better than nothing" approach, though. It's very fragile and porous. You have to constantly monitor your logs to find the new IP addresses to filter out as they are adopted, but the only way you'll see them in your logs is after they've already done the scraping.
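To make the fragility concrete, here is a rough sketch of what an IP-level filter boils down to, using Python's standard `ipaddress` module. The blocklist is hypothetical: the ranges are reserved documentation networks, not real crawler addresses, and `is_blocked` is an illustrative helper name.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges,
# standing in for crawler networks you found in your logs.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        # Compare only like with like: IPv4 against IPv4, IPv6 against IPv6.
        addr.version == net.version and addr in net
        for net in BLOCKED_NETWORKS
    )


if __name__ == "__main__":
    for ip in ("192.0.2.55", "203.0.113.9", "2001:db8::1"):
        print(ip, is_blocked(ip))
```

The list only ever contains ranges you have already seen, so a crawler arriving from a fresh address passes straight through until you notice it and add it.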
There are multiple behaviour-based tools for spotting and terminating this sort of activity. I do agree, however, that monitoring logs is a very tedious, semi-automated task, which many businesses wouldn't bother with.
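As a toy illustration of the behaviour-based idea, and not how any particular tool works, a sliding-window request counter flags clients that fetch pages faster than a person plausibly would. The `RateDetector` class name and the thresholds are assumptions picked for the example.

```python
from collections import defaultdict, deque


class RateDetector:
    """Flag clients that exceed max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-client queue of recent request timestamps.
        self._hits = defaultdict(deque)

    def record(self, client: str, timestamp: float) -> bool:
        """Record one request; return True if the client is over the limit."""
        hits = self._hits[client]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    detector = RateDetector(max_requests=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0, 20.0):
        print(t, detector.record("198.51.100.7", t))
```

Timestamps are passed in explicitly so the logic can be replayed over existing access logs. A real setup would key on more than the address (user agent, request pattern, and so on), and a patient scraper that stays under the threshold still gets through.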