HN Mirror


by JohnFen 1027 days ago

Yeah, I don't trust that they will all respect it. Just like you can't trust any other scrapers to respect these mechanisms, and just like you couldn't trust websites to honor the DNT signal.

If it's voluntary and companies will make more money by not doing it, then at least some are not going to do it.

So, from my point of view, this effort is mostly meaningless. It's not enough to get me to open my websites again, anyway.

2 comments

nuc1e0n 1027 days ago

It's true that some dishonest folk might not honor it, but it does communicate the wishes of the stakeholders of a website in machine readable terms. It means owners of bots cannot claim permission to use content is granted, as clear intent in robots.txt would show it is not.

JohnFen 1027 days ago

> as clear intent in robots.txt would show it is not.

Would it?

The primary purpose of robots.txt is not actually to lock out bots (that's why respecting it is not mandatory). It's to give the bots guidance as to which parts of your site are appropriate for them and which parts are not.

This may make the "clear intent" argument weak in court.

nuc1e0n 1027 days ago

> This may make the "clear intent" argument weak in court.

The standard has the keyword "Disallow", not "Avoid". I can't speak for anyone else of course, but that seems a pretty clear indicator of intent to me. By that I mean a site's stakeholders want to indicate that certain bots are disallowed from crawling a portion of their website.
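For illustration, here is a minimal sketch of how a well-behaved crawler might read a `Disallow` rule, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical, made up for this example. It only shows the machine-readable side of the signal; whether a bot actually honors it remains voluntary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleBot" is a made-up user agent,
# not the name of any real crawler.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleBot is told to stay out of /private/ but may crawl elsewhere.
    print(is_allowed("ExampleBot", "https://example.com/private/page"))
    print(is_allowed("ExampleBot", "https://example.com/blog/post"))
    # Other bots fall under the wildcard group, which allows everything.
    print(is_allowed("OtherBot", "https://example.com/private/page"))
```

Nothing in this check stops a crawler that simply never runs it, which is the point made upthread.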

JohnFen 1027 days ago

But you and I aren't judges in a court. They go by different rules, such as the official intent and meaning of the robots.txt system itself.

I'm not saying a court wouldn't find intent signaled (I don't know), only that it's not clear-cut that it would.

nuc1e0n 1027 days ago

Aside from whether intent is signalled or not, I would imagine courts may want to identify whether it is reasonable for a particular intent not to be honored in a particular set of circumstances. As you say, honoring robots.txt isn't mandatory in itself. Perhaps other things might make it so? I don't know.

brianjking 1027 days ago

I mean, they did write their own crawler and have huge financial incentives to respect it.

What isn't known is whether those sites' content may still end up in training corpora via CommonCrawl, ThePile, etc.
