| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by avodonosov 1446 days ago

> Also, why not just ban the offender?

I was thinking about that, but didn't find an easy way ban a crawler. Google App Engine has a firewall, but it works based on IP addresses. Banning based on User-Agent would need to be done in the app code, and that essentially handling a request, even if in a cheaper way. I didn't want to touch the application at all, hoping to resolve this on the crawler side, whom I suspected being an unintentional "offender".

Speaking about the five months - that's fine. We were not communicating every day of course. And indeed impressive that I had my case handled at all.

I knew for years that unwanted crawling happens by various crawlers, and was reminded of that in metrics from time to time. One day I was in the mood to study deeper, found two crawlers in the access logs, studied their web sites and emailed them.

One didn't respond at all. The moz.com created a ticket, four days later a support engineer replied, a week later I replied. We had some back and forth. I supposed they don't recognize `User-agent: *` and need `User-agent: Dotbot`. David - the support engineer - expressed several other hypotheses. There was a period of silence, then I raised my issue again, David had it reviewed by some other people at moz.com and they pointed to the gzipped response.

BTW, what I learned, is that "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." (https://www.rfc-editor.org/rfc/rfc9110.html#name-accept-enco...).

So if we make an HTTP request, unless we explicitly specify `Accept-Encoding: identity` we'd better be prepared to inspect the Content-Encoding in the response and decompress data if necessary.

But since Google App Engine returns gzipped content even for requests with `Accept-Encoding: identity`, I accepted that the the failure is on my side and went on with the config changes. Still, left a recommendation for moz.com to support gzip on their end.