by satyrnein 2533 days ago
Imagine if search engines had to "reach out to each [site owner] and figure out a business arrangement with them." The world decided that opt out via robots.txt was a better approach.

If the municipality wants to get the information out, this could be a win-win, just like search engines were. Do check for robots.txt, though!
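For anyone who wants to do that check in code, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `example-bot` user agent, and the example.org URLs are made up for illustration; a real scraper would fetch the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one outside the disallowed prefix, one inside it.
print(allowed(ROBOTS_TXT, "example-bot", "https://example.org/permits/2019.html"))
print(allowed(ROBOTS_TXT, "example-bot", "https://example.org/private/staff.html"))
```

Calling `allowed()` before each request is the whole opt-out contract from the scraper's side: anything not disallowed is fair game.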

3 comments

We found at one job that approximately one quarter of well-known search engines blatantly use robots.txt noindex declarations as a list of URLs to index, and one openly mocked us for asking them to stop.

Voluntary honor systems don’t work, because there’s no way to compel non-compliers to stop other than standard “anti-attacker arms race” approaches, such as the obstacle described at the head of this thread.

It sounds like scraping is a big problem for you guys. What kind of outfit is it, if you don't mind me asking?
Drop me an email and I’m happy to describe further.
Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.
> Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.

Are you really arguing that the internet would be _more_ accessible if search engines had to reach out to every site they wanted to crawl?

How many companies out there complain about being scraped by Google? How many companies benefit from search-driven traffic?

The alternative would have been opt-in instead of opt-out. Everything excluded by default, except what robots.txt allows you to index.

Naturally, Google didn't want that.

I would assume that any site that was implementing JS-level blocks also has the appropriate robots.txt file in place.
That's not true on the actual web, however.

The best example is the large number of unimportant sites that send a 429 error for /robots.txt when they think the requester is a scraper. Most crawlers treat a 4xx result for robots.txt as meaning there is no robots.txt at all, and therefore no restrictions. So the website is getting the reverse of what it thought it was getting.
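To make the 429 case concrete, here is a rough sketch of the status handling a crawler might apply. It loosely follows what Python's `urllib.robotparser` does when fetching the file (401 and 403 mean disallow everything, any other 4xx means allow everything). The `Policy` enum and `policy_for_status` are hypothetical names, and the 5xx branch is my own conservative assumption, not something every crawler does.

```python
from enum import Enum


class Policy(Enum):
    PARSE_BODY = "parse the returned robots.txt"
    ALLOW_ALL = "crawl everything"
    DISALLOW_ALL = "crawl nothing"


def policy_for_status(status: int) -> Policy:
    """Map the HTTP status of a /robots.txt fetch to a crawl policy (sketch)."""
    if 200 <= status < 300:
        return Policy.PARSE_BODY
    if status in (401, 403):
        # Access explicitly refused: treat the whole site as off limits.
        return Policy.DISALLOW_ALL
    if 400 <= status < 500:
        # Any other 4xx, including 429, is read as "no robots.txt exists".
        return Policy.ALLOW_ALL
    # Assumption: anything else (e.g. 5xx) is treated as "do not crawl for now".
    return Policy.DISALLOW_ALL


print(policy_for_status(429))  # Policy.ALLOW_ALL
```

Under these rules, a site that answers a suspected scraper with 429 on /robots.txt has effectively told it that nothing is disallowed.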