Hacker News
by suicas 1912 days ago
A company I worked for ~7 years ago ran its own focused web crawler (fetching ~10-100m pages per month, targeting certain sections of the web).

There were a surprising number of sites out there that explicitly blocked access to anyone but Google/Bing at the time.

We'd also get a dozen or so complaints a month from sites we'd crawled. They were mostly upset about us using up their bandwidth, and told us that only Google was allowed to crawl them (though they had no robots.txt configured to say so).

Isn't that the website owner's right though? I'm not sure I understand the problem here.

If Google is taking traffic and reducing revenue, a company can deny it in robots.txt. Google will actually follow those rules, unlike most others that are supposedly in this second class.

Yup, no problem here, was just making an observation about how common such blocking was (and about the fact that some people were upset at being crawled by someone other than Google, despite not blocking them).

The company did respect robots.txt, though it was initially a bit of a struggle to convince certain project managers to do so.

> Isn't that the website owner's right though?

No. The internet is public. Publishers shouldn't get any say in who accesses their content or how they do it. As far as I'm concerned, the fact that they do is a bug.

No, it's not. I can set up a login page and keep you out if I want. And I can do it however I want.
But your login page will be public and subject to being crawled.
My server, my rules.
I usually recommend setting only Google/Bing/Yandex/Baidu etc to Allow and everything else to Disallow.

Yes, the bad bots don't give a fuck, but even the non-malicious bots (ahrefs, moz, some university's search engine, etc.) don't bring any value to me as a site owner; they take up bandwidth and resources and fill up logs. If you can remove them with three lines in your robots.txt, that's less noise. Universities in particular, in my opinion, often behave badly and are uncooperative when you point out that their throttling does not work and they're hammering your server. Giving them a "Go Away, You Are Not Wanted Here" in robots.txt works for most, and the rest just get blocked.
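
For anyone wondering what such an allowlist could look like, here is a rough sketch rather than the commenter's actual file. The user-agent tokens, the example.com URL and the is_allowed helper are all assumptions made for illustration. It uses Python's standard urllib.robotparser to check how a well-behaved crawler would read the policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist policy (not the commenter's real file):
# the named search engines may crawl everything, every other
# user agent is disallowed from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
User-agent: YandexBot
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/page") -> bool:
    """Return True if a robots.txt-respecting crawler may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Bingbot", "AhrefsBot", "SomeUniversityCrawler"):
        print(agent, is_allowed(agent))
```

This only affects crawlers that choose to honor robots.txt. Anything that ignores the file still has to be blocked some other way.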

> they're hammering your server

Why can't you just rate-limit IPs that are "too active" for your server to handle?
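
To make the suggestion concrete, here is a minimal sketch of per-IP rate limiting using a token bucket. It is one common technique, not something anyone in the thread says they run. The class name, the parameters and the in-memory dict are all assumptions; a real deployment would more likely do this in the web server or a proxy.

```python
import time


class PerIPRateLimiter:
    """Token bucket per client IP: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.buckets = {}   # ip -> (tokens remaining, time of last update)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        # A previously unseen IP starts with a full bucket.
        tokens, last = self.buckets.get(ip, (float(self.burst), now))
        # Refill for the time elapsed since the last request, capped at `burst`.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = PerIPRateLimiter(rate=1.0, burst=2)
    # Three back-to-back requests from one address: the third exceeds the burst.
    print([limiter.allow("203.0.113.7") for _ in range(3)])
```

Because the bucket is keyed by IP address, every client that shares one address, for example behind a carrier-grade NAT, draws from the same bucket.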

From some I could, but why would I? If they're not adding value and they don't want to behave, I don't see a reason to spend money to adapt my systems to be "inclusive" towards their usage patterns.
In context, you're justifying blocking all automated traffic, even that which does behave, by pointing out that some of it doesn't. That attitude seems lazy at best, malicious at worst.
CGNAT
Now that's a really good point. I wonder why there isn't a standard protocol for signalling upstream that a particular connection is abusive and to please rate limit the path at the source on your behalf? It would certainly add complexity, but the current situation is hardly better.