Seems to me that any organisation that doesn't honor robots.txt would have a tougher time convincing people that it is acting in good faith. Just my two cents of course. I understand OpenAI does honor it, which is to be expected from responsible orgs.
Yeah, I don't trust that they will all respect it. Just like you can't trust any other scrapers to respect these mechanisms, and just like you couldn't trust websites to honor the DNT signal.
If it's voluntary and companies will make more money by not doing it, then at least some are not going to do it.
So, from my point of view, this effort is mostly meaningless. It's not enough to get me to open my websites again, anyway.
It's true that some dishonest folk might not honor it, but it does communicate the wishes of a website's stakeholders in machine-readable terms. It means owners of bots cannot claim that permission to use the content was granted, as clear intent in robots.txt would show it is not.
> as clear intent in robots.txt would show it is not.
Would it?
The primary purpose of robots.txt is not actually to lock out bots (that's why respecting it is not mandatory). It's to give the bots guidance as to which parts of your site are appropriate for them and which parts are not.
This may make the "clear intent" argument weak in court.
> This may make the "clear intent" argument weak in court.
The standard has the keyword "Disallow", not "Avoid". I can't speak for anyone else of course, but that seems a pretty clear indicator of intent to me. By that I mean a site's stakeholders want to indicate that certain bots are disallowed from crawling a portion of their website.
Aside from whether intent is signalled or not, I would imagine courts may want to determine whether it is reasonable for a particular intent not to be honored in a particular set of circumstances. As you say, honoring robots.txt isn't mandatory in itself. Perhaps other things might make it so? I don't know.
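To make the "Disallow" point concrete, here is a minimal sketch of how a well-behaved crawler would read that signal, using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the paths, and the `is_allowed` helper are made up for illustration; nothing here forces a crawler to run this check, which is the whole debate.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is shut out of /private/,
# everyone else may crawl everything.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for path in ("/private/page", "/blog/post"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(agent, url))
```

The site owner's wish is unambiguous to any parser, but the check only happens if the bot's operator chooses to call it.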
IP-level filtering is purely a "better than nothing" approach, though. It's very fragile and porous. You have to constantly monitor your logs to find new IP addresses to filter out as scrapers adopt them, but you'll only see them in your logs after they've already done the scraping.
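A rough sketch of why this is porous, using Python's standard `ipaddress` module. The blocklist ranges are reserved documentation ranges standing in for real scraper addresses, and `is_blocked` is a hypothetical helper, not any particular firewall's API.

```python
import ipaddress

# Assumed blocklist: documentation-only ranges used as placeholders for
# networks you have already seen scraping in your logs.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(address: str, networks=BLOCKED_NETWORKS) -> bool:
    """Return True if address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


if __name__ == "__main__":
    print(is_blocked("192.0.2.17"))    # inside a listed range
    print(is_blocked("203.0.113.5"))   # a range not yet on the list
```

The second address gets through simply because nobody has added its range yet, and it only gets added after it shows up in the logs.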
There are multiple behaviour-based tools for spotting and terminating this sort of activity. I do agree, however, that it's a very tedious, semi-automated task of monitoring logs, which many businesses wouldn't bother with.
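As one simple illustration of the behaviour-based idea (not how any specific tool works), here is a sketch that flags clients whose request rate exceeds a threshold within a sliding time window. The function name, the `(timestamp, client)` log format, and the default thresholds are all assumptions for the example.

```python
from collections import defaultdict, deque


def flag_heavy_clients(requests, window_seconds=60.0, max_requests=100):
    """Return the set of clients that made more than max_requests
    requests within any window_seconds-long span.

    requests is an iterable of (timestamp_in_seconds, client_id) pairs.
    """
    recent = defaultdict(deque)  # client_id -> timestamps still in window
    flagged = set()
    for timestamp, client in sorted(requests):
        window = recent[client]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(client)
    return flagged


if __name__ == "__main__":
    log = [(i * 0.1, "fast-client") for i in range(50)]
    log += [(i * 30.0, "slow-client") for i in range(5)]
    print(flag_heavy_clients(log, window_seconds=10.0, max_requests=20))
```

Keying on behaviour rather than address means a scraper that switches IPs still has to slow down to stay under the threshold, though a patient one can do exactly that.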