Hacker News
by fffobar 1398 days ago
> Luckily, the Dotbot clearly identifies itself in the User-Agent header, and they have a working support email, so after a five month communication in a ticket I discovered the reason.

I laughed at the "five month", then realized it is actually impressive that OP got any response at all. What a time to be alive.

Also, why not just ban the offender?

4 comments

If by offender you mean the bot, then I'm confused. The bot asked for robots.txt in a plaintext format. The bot was delivered some binary garbage that it couldn't parse. The bot continued to ask for an updated robots.txt, and continued to be told that the binary garbage was the intended content. What more was it supposed to do, exactly? The offending party here is the broken hosting platform.
I don't think the point of banning the offender is to punish them for doing something wrong so much as to manually ensure they aren't crawling your site.
> Also, why not just ban the offender?

I was thinking about that, but didn't find an easy way to ban a crawler. Google App Engine has a firewall, but it works based on IP addresses. Banning based on User-Agent would need to be done in the app code, and that essentially means handling a request, even if in a cheaper way. I didn't want to touch the application at all, hoping to resolve this on the side of the crawler, which I suspected of being an unintentional "offender".
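
For illustration only, here is a rough sketch of what a User-Agent ban in app code could look like, assuming a Python WSGI application. The names `block_user_agents` and `BLOCKED_AGENT_SUBSTRINGS` are hypothetical and not from the original setup. Note that the request still reaches the app, which is exactly the cost described above; the wrapper only makes the response cheap.

```python
# Hypothetical blocklist: lowercase substrings to look for in the User-Agent.
BLOCKED_AGENT_SUBSTRINGS = ("dotbot",)


def block_user_agents(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app so that blocked User-Agents get a short 403."""

    def middleware(environ, start_response):
        # WSGI exposes the User-Agent request header as HTTP_USER_AGENT.
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everything else goes to the real application untouched.
        return app(environ, start_response)

    return middleware
```

Usage would be along the lines of `app = block_user_agents(app)` wherever the WSGI callable is defined.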

As for the five months, that's fine. We were not communicating every day, of course. And it is indeed impressive that my case was handled at all.

I had known for years that unwanted crawling by various crawlers happens, and was reminded of it by the metrics from time to time. One day I was in the mood to dig deeper, found two crawlers in the access logs, studied their websites and emailed them.

One didn't respond at all. moz.com created a ticket; four days later a support engineer replied, and a week later I replied. We had some back and forth. I guessed that they didn't recognize `User-agent: *` and needed `User-agent: Dotbot`. David - the support engineer - offered several other hypotheses. There was a period of silence, then I raised my issue again, David had it reviewed by some other people at moz.com, and they pointed to the gzipped response.

BTW, what I learned is that "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." (https://www.rfc-editor.org/rfc/rfc9110.html#name-accept-enco...).

So if we make an HTTP request, unless we explicitly specify `Accept-Encoding: identity` we'd better be prepared to inspect the Content-Encoding in the response and decompress data if necessary.
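
A minimal sketch of that client-side check, assuming Python and a response already read into a headers mapping and a bytes body. The helper name `decode_body` is made up for illustration. It handles only a single `gzip` or `deflate` coding, not a comma-separated chain of codings, and is not meant to describe how any particular crawler works.

```python
import gzip
import zlib


def decode_body(headers, body):
    """Return the response body with a gzip/deflate Content-Encoding undone."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    encoding = normalized.get("content-encoding", "identity").strip().lower()

    if encoding in ("", "identity"):
        return body  # nothing to undo
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        # Assumes the zlib-wrapped form of "deflate".
        return zlib.decompress(body)
    raise ValueError(f"unsupported Content-Encoding: {encoding}")


# Example: a gzipped robots.txt comes back as plain text.
robots = b"User-agent: *\nDisallow: /\n"
assert decode_body({"Content-Encoding": "gzip"}, gzip.compress(robots)) == robots
```

With something like this in place, a client gets readable text whether or not the server honors `Accept-Encoding: identity`.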

But since Google App Engine returns gzipped content even for requests with `Accept-Encoding: identity`, I accepted that the failure is on my side and went on with the config changes. Still, I left a recommendation for moz.com to support gzip on their end.

> Also, why not just ban the offender?

Surely the offender here is Google AppEngine.

Because what if other people use the same software?