HN Mirror



	by satyrnein 2533 days ago
	Imagine if search engines had to "reach out to each [site owner] and figure out a business arrangement with them." The world decided that opt-out via robots.txt was a better approach. If the municipality wants to get the information out, this could be a win-win, just like search engines were. Do check for robots.txt, though!
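
	As a rough illustration of "check for robots.txt", here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the "example-bot" user agent, and the example.org URLs are made up for the example; a real scraper would fetch the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/agenda", "https://example.org/private/data"):
        print(url, is_allowed(ROBOTS_TXT, "example-bot", url))
```

	Against a live site, RobotFileParser also offers set_url() and read() to download the file before calling can_fetch().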

3 comments

floatingatoll 2532 days ago

We found at one job that approximately one quarter of well-known search engines blatantly use robots.txt noindex declarations as a list of URLs to index, and one openly mocked us for asking them to stop.

Voluntary honor systems don’t work, because there’s no way to compel non-compliers to stop other than standard “anti-attacker arms race” approaches, such as the obstacle described at the head of this thread.


jdc 2532 days ago

It sounds like scraping is a big problem for you guys. What kind of outfit is it, if you don't mind me asking?


floatingatoll 2532 days ago

Drop me an email and I’m happy to describe further.


samcal 2533 days ago

Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.


vageli 2532 days ago

> Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.

Are you really arguing that the internet would be _more_ accessible if search engines had to reach out to every site they wanted to crawl?

How many companies out there complain about being scraped by Google? How many companies benefit from search-driven traffic?


gnud 2532 days ago

The alternative would have been opt-in instead of opt-out. Everything excluded by default, except what robots.txt allows you to index.

Naturally, Google didn't want that.
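
To make the difference concrete, here is a toy sketch of the two defaults. The function names and the prefix-list representation are invented for illustration; opt-in is not how robots.txt is actually specified, and real robots.txt matching has more rules than a plain prefix check.

```python
from typing import Iterable


def opt_out_allows(path: str, disallowed_prefixes: Iterable[str]) -> bool:
    """Opt-out (today's convention): everything is crawlable unless excluded."""
    return not any(path.startswith(prefix) for prefix in disallowed_prefixes)


def opt_in_allows(path: str, allowed_prefixes: Iterable[str]) -> bool:
    """Hypothetical opt-in: nothing is crawlable unless explicitly listed."""
    return any(path.startswith(prefix) for prefix in allowed_prefixes)


if __name__ == "__main__":
    # A site that publishes no rules at all:
    print(opt_out_allows("/minutes/2019.html", []))  # crawlable by default
    print(opt_in_allows("/minutes/2019.html", []))   # excluded by default
```

The only thing that changes is what a site with no rules gets: full crawling under opt-out, none under opt-in.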


klenwell 2533 days ago

I would assume that any site that implements JS-level blocks also has the appropriate robots.txt file in place.


greglindahl 2532 days ago

That's not true in the actual web, however.

The best example is a large number of unimportant sites that send 429 errors for /robots.txt when they think the request comes from a scraper. Most crawlers treat a 4xx result for robots.txt as meaning there is no robots.txt, and therefore no restrictions. So the website is getting the reverse of what it thought it was getting.
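
A sketch of that convention, as I understand it: the function name and the three policy labels are made up for illustration, and real crawlers differ in the details (some treat 401/403 as "disallow everything", for instance). Treating anything outside 2xx and 4xx as "disallow" is a conservative assumption of this sketch.

```python
def robots_policy_for_status(status: int) -> str:
    """Map the HTTP status of a /robots.txt fetch to a crawl policy.

    Returns one of three hypothetical labels:
      "parse"        - a body was returned; read the rules from it
      "allow_all"    - treated as if the site has no robots.txt
      "disallow_all" - back off and crawl nothing for now
    """
    if 200 <= status < 300:
        return "parse"
    if 400 <= status < 500:
        # Includes 404 and also 429: "no robots.txt" means no restrictions.
        return "allow_all"
    # Server errors and anything unexpected: assume the site is off limits.
    return "disallow_all"


if __name__ == "__main__":
    for status in (200, 404, 429, 503):
        print(status, robots_policy_for_status(status))
```

Under this logic, a site answering 429 to block scrapers ends up in the same bucket as a site with no robots.txt at all.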
