by JohnFen 1027 days ago
Yeah, I don't trust that they will all respect it. Just like you can't trust any other scrapers to respect these mechanisms, and just like you couldn't trust websites to honor the DNT signal.

If it's voluntary and companies will make more money by not doing it, then at least some are not going to do it.

So, from my point of view, this effort is mostly meaningless. It's not enough to get me to open my websites again, anyway.


It's true that some dishonest folk might not honor it, but it does communicate the wishes of a website's stakeholders in machine-readable terms. It means owners of bots cannot claim that permission to use content was granted, as clear intent in robots.txt would show it is not.
> as clear intent in robots.txt would show it is not.

Would it?

The primary purpose of robots.txt is not actually to lock out bots (that's why respecting it is not mandatory). It's to give the bots guidance as to which parts of your site are appropriate for them and which parts are not.

This may make the "clear intent" argument weak in court.

> This may make the "clear intent" argument weak in court.

The standard has the keyword "Disallow", not "Avoid". I can't speak for anyone else of course, but that seems a pretty clear indicator of intent to me. By that I mean a site's stakeholders want to indicate that certain bots are disallowed from crawling a portion of their website.
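To make the "Disallow" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the paths, and the `example.com` URLs are made up for illustration and do not come from any real crawler or site. It shows how a well-behaved crawler could read a site's stated wishes, and nothing more: the check is something the crawler chooses to run.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is kept out of /articles/,
# every other bot is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /articles/

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks before fetching; a non-compliant one simply doesn't.
for agent, url in [
    ("ExampleAIBot", "https://example.com/articles/post-1"),
    ("ExampleAIBot", "https://example.com/about"),
    ("OtherBot", "https://example.com/articles/post-1"),
    ("OtherBot", "https://example.com/private/notes"),
]:
    print(agent, url, parser.can_fetch(agent, url))
```

The file states the intent in machine-readable form, but nothing in it stops a request from being made, which is the gap the rest of this thread is arguing about.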

But you and I aren't judges in a court. They go by different rules, such as the official intent and meaning of the robots.txt system itself.

I'm not saying a court wouldn't find intent signaled, I don't know, only that it's not clear-cut that it would.

Aside from whether intent is signaled or not, I would imagine courts may want to determine whether it is reasonable for a particular intent not to be honored in a particular set of circumstances. As you say, honoring robots.txt isn't mandatory in itself. Perhaps other things might make it so? I don't know.
I mean, they did write their own crawler and have huge financial incentives to respect it.

What isn't known is whether those sites will still have their content included in training corpora from Common Crawl, The Pile, etc.